On the use of LLMs to generate a dataset of Neural Networks
Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring, and migration. These tools play a crucial role in guaranteeing both the correctness and maintainability of neural network architectures, helping to prevent implementation errors, simplify model updates, and ensure that complex networks can be reliably extended and reused. Yet, assessing their effectiveness remains challenging due to the lack of publicly available, diverse datasets of neural networks that would allow systematic evaluation. To address this gap, we leverage large language models (LLMs) to automatically generate a dataset of neural networks that can serve as a benchmark for validation. The dataset is designed to cover diverse architectural components and to handle multiple input data types and tasks. In total, 608 samples are generated, each conforming to a set of precise design choices. To further ensure their consistency, we validate the correctness of the generated networks using static analysis and symbolic tracing. We make the dataset publicly available to support the community in advancing research on neural network reliability and adaptability.
💡 Research Summary
The paper addresses a critical gap in the evaluation of neural‑network (NN) tooling—namely, the lack of a publicly available, diverse collection of NN source code that can serve as a benchmark for verification, refactoring, and migration tools. Existing datasets focus on model weights or architecture performance and do not capture the structural variety needed to stress‑test code‑level analysis tools. To fill this void, the authors propose an end‑to‑end pipeline that leverages a large language model (LLM), specifically GPT‑5, to automatically generate a dataset of 608 distinct PyTorch NN implementations, and then validates each model through static abstract‑syntax‑tree (AST) analysis and symbolic tracing.
Design of Requirements
Four orthogonal dimensions are defined: (1) Architecture (MLP, CNN‑1D, CNN‑2D, CNN‑3D, Simple RNN, LSTM, GRU), (2) Learning Task (binary classification, multiclass classification, regression, representation learning), (3) Input Type & Scale (Tabular, Time‑Series, Text, Image, each with “small” and “large” scale variants), and (4) Complexity (Simple, Wide, Deep, Wide‑Deep). For each architecture a “characterising layer” (CL) is identified, and width/depth thresholds are set (Table 1). This systematic taxonomy guarantees that generated models cover a broad design space while still adhering to realistic constraints.
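The design space spanned by these four dimensions can be enumerated mechanically. The sketch below uses the dimension values listed above; the Cartesian product yields 896 raw combinations, from which nonsensical pairings are later filtered (the exact filter used by the authors, per Table 2, is not reproduced here):

```python
from itertools import product

# Requirement dimensions as described in the taxonomy (values from the summary)
ARCHITECTURES = ["mlp", "cnn-1d", "cnn-2d", "cnn-3d", "rnn", "lstm", "gru"]
TASKS = ["classification-binary", "classification-multiclass",
         "regression", "representation-learning"]
INPUTS = [f"{kind}-{scale}"
          for kind in ("tabular", "time-series", "text", "image")
          for scale in ("small", "large")]
COMPLEXITIES = ["simple", "wide", "deep", "wide-deep"]

def all_combinations():
    """Cartesian product of the four requirement dimensions."""
    return list(product(ARCHITECTURES, TASKS, INPUTS, COMPLEXITIES))

combos = all_combinations()
print(len(combos))  # 7 * 4 * 8 * 4 = 896 before filtering nonsensical pairs
```

Filtering invalid pairings (e.g., an MLP on raw image input) brings this raw count down toward the 608 models actually generated.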
Prompt Engineering
A template prompt is constructed that embeds the four requirement values and adds strict generation rules: output only the model code (no comments), fix all layer parameters inside the class, avoid custom helper functions, and stay within reasonable memory limits. By varying the placeholders, the same prompt can drive GPT‑5 to produce models across all permissible combinations. The authors also filter out nonsensical combos (e.g., MLP for image data) as shown in Table 2.
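This templating step amounts to placeholder substitution. The wording below is illustrative only, not the authors' exact prompt text, but it captures the rules the summary lists (code only, fixed parameters, no helper functions):

```python
# Illustrative template; the authors' actual prompt wording differs.
PROMPT_TEMPLATE = (
    "Generate a PyTorch model implementing a {architecture} for "
    "{task} on {input_type} input at {scale} scale, with {complexity} "
    "complexity. Output only the model code, with no comments. "
    "Fix all layer parameters inside the class, avoid custom helper "
    "functions, and stay within reasonable memory limits."
)

def build_prompt(architecture, task, input_type, scale, complexity):
    """Fill the placeholders to obtain one concrete generation prompt."""
    return PROMPT_TEMPLATE.format(
        architecture=architecture, task=task,
        input_type=input_type, scale=scale, complexity=complexity)

prompt = build_prompt("LSTM", "binary classification",
                      "time-series", "large", "deep")
```

Iterating `build_prompt` over the valid requirement combinations drives the LLM across the whole permissible design space.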
Generation Process
Using the prompt, GPT‑5 produces 608 Python files. Each file begins with the exact prompt used, ensuring reproducibility. Filenames encode architecture, task, input type/scale, and complexity (e.g., `mlp_classification-binary_tabular-large_simple.py`). The generated code follows a uniform style, making downstream parsing straightforward.
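Because the filenames use fixed separators, the four requirement values can be decoded mechanically. A minimal sketch, with the field layout inferred from the example filename above (assumed, not confirmed against the full dataset):

```python
def parse_filename(name: str) -> dict:
    """Decode architecture, task, input type/scale, and complexity from
    a dataset filename such as
    mlp_classification-binary_tabular-large_simple.py
    Underscores separate the four fields; the input field splits on its
    last hyphen into type and scale."""
    stem = name.removesuffix(".py")
    arch, task, input_spec, complexity = stem.split("_")
    input_type, scale = input_spec.rsplit("-", 1)
    return {"architecture": arch, "task": task,
            "input_type": input_type, "scale": scale,
            "complexity": complexity}

info = parse_filename("mlp_classification-binary_tabular-large_simple.py")
```

Using `rsplit("-", 1)` keeps hyphenated input types such as `time-series` intact while still peeling off the scale suffix.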
Diversity Analysis
Across the dataset, 6,842 layer and tensor‑operation instances are observed, spanning 38 unique PyTorch primitives (Linear, Conv1d, Conv2d, Conv3d, LSTM, GRU, etc.). Model depth ranges from 2 to 35 layers. Complexity categories exhibit the expected depth distribution: Simple models are shallow, Wide‑Deep models are the deepest. This confirms that the LLM does not collapse to a narrow set of patterns but respects the prescribed design space.
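An inventory like this can be reproduced with a stdlib AST walk over the generated files. A sketch, assuming the models instantiate layers as `nn.<Name>(...)` constructor calls (consistent with the uniform code style described above):

```python
import ast
from collections import Counter

def count_layer_primitives(source: str) -> Counter:
    """Count nn.<Name>(...) constructor calls in one model file."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "nn"):
            counts[node.func.attr] += 1
    return counts

sample = """
import torch.nn as nn
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 1)
        self.act = nn.Sigmoid()
"""
print(count_layer_primitives(sample))  # Counter({'Linear': 2, 'Sigmoid': 1})
```

Summing these counters across all 608 files yields the per-primitive totals and the set of unique primitives reported.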
Technical Validation
Two complementary validation tools are provided:
- **Static AST Analysis** – parses each file, extracts layer definitions, checks for the presence of at least one CL, verifies that the output layer matches the declared task (e.g., one sigmoid neuron for binary classification), and confirms that the first CL's parameters align with the specified input type and scale.
- **Symbolic Tracing** – employs `torch.fx` to trace a dummy forward pass, ensuring that tensor shapes propagate correctly through the network given the expected input dimensions.
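The output-layer check can be sketched with the stdlib `ast` module alone. The rule encoded below (binary classification ends in a one-unit `Linear` plus a `Sigmoid`) is one plausible instance of the checks described, not the authors' exact validator:

```python
import ast

def output_layer_ok_for_binary(source: str) -> bool:
    """Statically check that a model declares a 1-unit nn.Linear layer
    and an nn.Sigmoid activation, as expected for binary classification."""
    has_unit_linear = has_sigmoid = False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr == "Linear" and len(node.args) == 2:
                # Second positional argument of nn.Linear is the output width.
                out = node.args[1]
                if isinstance(out, ast.Constant) and out.value == 1:
                    has_unit_linear = True
            elif node.func.attr == "Sigmoid":
                has_sigmoid = True
    return has_unit_linear and has_sigmoid

good = (
    "import torch.nn as nn\n"
    "class M(nn.Module):\n"
    "    def __init__(self):\n"
    "        super().__init__()\n"
    "        self.out = nn.Linear(32, 1)\n"
    "        self.act = nn.Sigmoid()\n"
)
print(output_layer_ok_for_binary(good))  # True
```

The symbolic-tracing stage complements this: `torch.fx.symbolic_trace` builds a graph of the forward pass, so shape mismatches surface without training the model.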
During validation, eight models (all RNN‑LSTM or RNN‑GRU with large time‑series inputs) were flagged because a linear projection layer preceded the recurrent block, breaking the temporal ordering. These models were regenerated, after which the entire set achieved a 99.7 % pass rate.
Functional Verification
For each input modality, a representative model was trained on standard benchmarks (e.g., UCI tabular datasets, MNIST for images, PTB for text). Training curves and final accuracies were comparable to hand‑crafted baselines, confirming that the generated code is not only syntactically correct but also functionally viable.
Release and Impact
The dataset, along with the validation scripts, is released on GitHub (https://github.com/BESSER-PEARL/LLM-Generated-NN-Dataset). The authors argue that this resource enables systematic, reproducible benchmarking of NN verification, refactoring, and migration tools, much like ImageNet did for computer‑vision algorithms. Moreover, the pipeline demonstrates that LLMs can dramatically reduce human effort in curating large, diverse code corpora while still meeting strict specification constraints.
Future Directions
The paper suggests extending the approach to more sophisticated architectures (Transformers, Graph Neural Networks), multi‑task settings, and possibly incorporating performance metrics (e.g., FLOPs, memory footprint) into the generation constraints. Another promising avenue is using the dataset to evaluate LLMs themselves on code generation quality, creating a feedback loop between model creation and model assessment.
In summary, this work presents a novel, reproducible method for generating a high‑quality, diverse NN code dataset using LLMs, validates it with rigorous static and dynamic analyses, and makes it publicly available to foster fair and comprehensive evaluation of NN tooling.