SoilGen: A Comprehensive Tool for Generating Synthetic Soil Profiles for Geotechnical
and Seismic Analysis
Mersad Fathizadeh¹, Hosna Kianfar²
¹University of Arkansas, Graduate Research Assistant, Dept. of Civil Eng., 4190 Bell Engineering
Center, Fayetteville, AR 72701, USA, mersadf@uark.edu
²University of Arkansas, Graduate Research Assistant, Dept. of Civil Eng., 4190 Bell Engineering
Center, Fayetteville, AR 72701, USA, hkianfar@uark.edu
ABSTRACT
Geotechnical and seismic applications, ranging from site response analysis and HVSR simulations
to dispersion curve modeling, increasingly depend on large, well-labeled datasets for robust model
development. However, the scarcity of publicly available borehole datasets—coupled with the
proprietary nature of high-quality field records—creates a significant bottleneck for data-driven
research, particularly in machine learning. To address this limitation, this study introduces SoilGen, an
open-source framework that procedurally generates physically consistent, multilayered soil columns as
synthetic soil profiles. Unlike simple randomization, SoilGen computes a complete suite of geotechnical
properties, including layer thickness, shear-wave velocity (Vs), P-wave velocity (Vp), density, and Poisson’s ratio, while enforcing
physical constraints to ensure realism. The algorithmic foundations of the framework and its
implementation are outlined, and its utility is demonstrated through representative near-surface
geological scenarios relevant to site characterization and near-surface geophysics. By facilitating the
rapid generation of large-scale model libraries (N > 10^5), SoilGen enables comprehensive parametric
studies and the training of deep learning inversion networks that require extensive, labeled datasets for
shear-wave velocity (Vs) profiling and other site characterization tasks.
Keywords: synthetic soil profiles; near-surface geophysics; machine learning; site
characterization; shear-wave velocity (Vs)
1 INTRODUCTION
Accurate characterization of the near-surface velocity structure is fundamental to seismic site
response evaluation, dispersion curve analysis, and a broad range of geotechnical studies. Techniques
such as Horizontal-to-Vertical Spectral Ratio (HVSR), Multichannel Analysis of Surface Waves
(MASW), and numerical site response modeling all rely on robust subsurface models to yield reliable
predictions. However, traditional inversion methods are often computationally intensive and suffer
from non-uniqueness, while publicly available borehole datasets containing complete geotechnical
properties remain scarce. This data deficit is particularly critical for data-hungry machine learning
approaches, which demand hundreds of thousands of labeled models to learn robust mappings from
geophysical observations to subsurface properties.
SoilGen addresses this need by programmatically generating one-dimensional layered soil
profiles that exhibit realistic thicknesses and velocities, subject to rigorous geophysical constraints.
Crucially, the package computes a complete suite of geotechnical parameters, including layer
thickness, shear-wave velocity (Vs), P-wave velocity (Vp), density, and Poisson’s ratio, ensuring that
each generated model is immediately applicable to dispersion curve forward modeling, HVSR
simulation, site response analysis, or machine learning pipelines. Integrated validation routines strictly
enforce physical laws, such as ensuring that Vp exceeds Vs and that material properties remain within
plausible limits.
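The generate-then-validate idea described above can be illustrated with a minimal Python sketch. It is not SoilGen's implementation or API: the function names, the uniform sampling, and all numeric ranges (thickness, Vs, Poisson's ratio, density) are assumptions chosen only for illustration. The one relation taken as given is the standard isotropic linear-elastic link between Vp, Vs, and Poisson's ratio ν, namely ν = (Vp² − 2Vs²) / (2(Vp² − Vs²)).

```python
import math
import random


def vp_from_vs(vs, nu):
    """P-wave velocity from Vs and Poisson's ratio (isotropic linear elasticity)."""
    return vs * math.sqrt((2.0 - 2.0 * nu) / (1.0 - 2.0 * nu))


def poisson_from_velocities(vp, vs):
    """Poisson's ratio recovered from Vp and Vs (requires vp != vs)."""
    return (vp**2 - 2.0 * vs**2) / (2.0 * (vp**2 - vs**2))


def generate_profile(rng, n_layers,
                     thickness_range=(1.0, 10.0),   # m, assumed
                     vs_range=(100.0, 800.0),       # m/s, assumed
                     nu_range=(0.25, 0.45),         # dimensionless, assumed
                     rho_range=(1600.0, 2200.0)):   # kg/m^3, assumed
    """Sample one hypothetical layered profile as a list of layer dicts."""
    layers = []
    for _ in range(n_layers):
        vs = rng.uniform(*vs_range)
        nu = rng.uniform(*nu_range)
        layers.append({
            "thickness": rng.uniform(*thickness_range),
            "vs": vs,
            "vp": vp_from_vs(vs, nu),   # derived, not sampled independently
            "rho": rng.uniform(*rho_range),
            "nu": nu,
        })
    return layers


def is_physical(layers, nu_limits=(0.0, 0.5)):
    """Reject profiles that violate basic physical constraints."""
    for layer in layers:
        if layer["thickness"] <= 0 or layer["vs"] <= 0 or layer["rho"] <= 0:
            return False
        if layer["vp"] <= layer["vs"]:   # Vp must exceed Vs
            return False
        nu = poisson_from_velocities(layer["vp"], layer["vs"])
        if not (nu_limits[0] < nu < nu_limits[1]):
            return False
    return True


rng = random.Random(0)                 # fixed seed for reproducibility
profile = generate_profile(rng, n_layers=5)
print(is_physical(profile))
```

Deriving Vp from Vs and a sampled Poisson's ratio, rather than sampling Vp independently, guarantees Vp > Vs by construction; the separate `is_physical` check is still useful for profiles that are edited or loaded from elsewhere.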
The framework facilitates the rapid generation of extensive model libraries (N > 10^5), allowing
users to assign profiles to predefined geological scenarios, export them in multiple formats, and
visualize them via a modern graphical user interface. The remainder of this paper is organized as
follows: Section 2 outlines the SoilGen methodology, detailing the scenario definitions, empirical
relationships, and implementation specifics. Section 3 presents representative results, illustrating the
tool’s output through multi-panel figures for various geological settings. Finally, Section 4 concludes
with a discussion of the package’s broader applications in geotechnical modeling, including its
integration with complementary tools such as hvstrip-progressive (Fathizadeh et al., 2025) for
advanced layer-stripping analyses.
2 METHODOLOGY AND DATA PROCESSING
2.1 Profile Generation Algorithm
SoilGen generates randomized 1D soil profiles, typically comprising 3 to 8 layers, by
stochastically sampling layer thicknesses and shear-wave velocities (Vs), then computing
derived elastic properties and validating the physical consistency of each model. The overall
generation workflow is illustrated in Figure 1, which depicts the primary interface for parameter
definition. To create a synthetic dataset, the user selects a target geological scenario and specifies
boundary conditions, including the tot