An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Accurately modelling the dynamics of complex systems and discovering their governing differential equations are critical tasks for accelerating scientific discovery. Using noisy, synthetic data from two damped oscillatory systems, we explore the extrapolation capabilities of Neural Ordinary Differential Equations (NODEs) and the ability of Symbolic Regression (SR) to recover the underlying equations. Our study yields three key insights. First, we demonstrate that NODEs can extrapolate effectively to new boundary conditions, provided the resulting trajectories share dynamic similarity with the training data. Second, SR successfully recovers the equations from noisy ground-truth data, though its performance is contingent on the correct selection of input variables. Finally, we find that SR recovers two out of the three governing equations, along with a good approximation for the third, when using data generated by a NODE trained on just 10% of the full simulation. While this last finding highlights an area for future work, our results suggest that using NODEs to enrich limited data and enable symbolic regression to infer physical laws represents a promising new approach for scientific discovery.


💡 Research Summary

This paper investigates how Neural Ordinary Differential Equations (NODEs) and Symbolic Regression (SR) can be combined to model dynamical systems and recover their governing differential equations when only noisy, sparse data are available. The authors focus on two benchmark damped‑oscillatory systems: a classic cart‑pole apparatus from control theory and a biological “Bio‑model” describing bacterial adaptation to sudden changes in nutrient quality. Both systems are simulated with added uniform noise ranging from 0.5 % to 5 % of the signal amplitude, thereby mimicking realistic experimental conditions.
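As a concrete illustration of this setup, the sketch below generates a noisy synthetic trajectory for a generic damped oscillator and adds uniform noise scaled to a percentage of each variable's peak-to-peak amplitude. The dynamics and constants are placeholders: the summary does not reproduce the actual cart-pole or Bio-model equations.

```python
import numpy as np

def simulate_damped_oscillator(theta0, omega0, t, k=4.0, gamma=0.5):
    """RK4 integration of theta'' = -k*theta - gamma*theta', a generic
    damped oscillator standing in for the paper's systems."""
    def f(state):
        theta, omega = state
        return np.array([omega, -k * theta - gamma * omega])
    dt = t[1] - t[0]
    states = [np.array([theta0, omega0], dtype=float)]
    for _ in t[1:]:
        s = states[-1]
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        states.append(s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(states)

def add_uniform_noise(signal, level, rng):
    """Add uniform noise scaled to `level` (e.g. 0.05 = 5 %) of each
    variable's peak-to-peak amplitude."""
    amplitude = signal.max(axis=0) - signal.min(axis=0)
    noise = rng.uniform(-1.0, 1.0, size=signal.shape) * level * amplitude
    return signal + noise

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 26)                 # 1 s at 25 Hz -> 26 points
clean = simulate_damped_oscillator(0.3, 0.0, t)
noisy = add_uniform_noise(clean, 0.05, rng)   # 5 % noise level
```

The same recipe covers the 0.5 % end of the paper's noise range by changing `level`.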

Neural ODE Experiments
The authors implement continuous‑time neural networks using the JAX‑based Diffrax library. For the cart‑pole, two training regimes are explored. Model A is trained on 35 initial‑condition combinations (angles and angular velocities) but only on the first second of each trajectory sampled at 25 Hz (26 points per trajectory). Model B uses a smaller subset of initial conditions confined to a boxed region of the state space (highlighted in red in the paper's figures). For the Bio‑model, twelve nutrient‑shift scenarios (six up‑shifts, six down‑shifts) are simulated for eight hours with a 0.01‑hour step; NODEs are trained on the first two hours using only ten points per hour (extremely sparse data).
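A NODE's forward pass, rolling a small neural vector field through an ODE solver, can be sketched without the JAX/Diffrax stack the paper uses. The NumPy snippet below uses an untrained MLP with illustrative layer sizes and an RK4 integrator; in the actual experiments the weights would be fitted by backpropagating through the solver.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny MLP vector field f_theta: R^2 -> R^2. Weights here are random;
# in the paper they are trained with JAX/Diffrax. Layer sizes are
# illustrative, not taken from the paper.
W1, b1 = rng.normal(scale=0.3, size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(scale=0.3, size=(2, 16)), np.zeros(2)

def vector_field(state):
    hidden = np.tanh(W1 @ state + b1)
    return W2 @ hidden + b2

def integrate(state0, t):
    """RK4 rollout of the learned vector field from one initial
    condition -- the NODE's forward pass."""
    dt = t[1] - t[0]
    out = [np.asarray(state0, dtype=float)]
    for _ in t[1:]:
        s = out[-1]
        k1 = vector_field(s)
        k2 = vector_field(s + 0.5 * dt * k1)
        k3 = vector_field(s + 0.5 * dt * k2)
        k4 = vector_field(s + dt * k3)
        out.append(s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(out)

t = np.linspace(0.0, 1.0, 26)             # the first second at 25 Hz
trajectory = integrate([0.3, 0.0], t)     # shape (26, 2): theta, omega
```

Because the vector field is defined everywhere in state space, the same trained model can be rolled out from initial conditions never seen during training, which is exactly what the extrapolation experiments probe.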

Results show that NODEs do not simply memorize the training trajectories; instead, they learn the underlying flow field. A heat‑map of mean‑squared error (MSE) across the full grid of initial conditions reveals low‑error “islands” that extend beyond the training region whenever the test point lies on the same phase‑space trajectory as any training point. In other words, NODEs extrapolate successfully to new boundary conditions as long as the new trajectories are dynamically similar to those seen during training. Moreover, the authors demonstrate that long‑term predictions (up to eight hours) remain accurate even when the training data are extremely sparse (as few as five to six points per variable per shift). Short‑term error (within the first hour) does increase for the lowest sampling rates, indicating that noise dominates when data are too scarce, but the long‑term MSE is remarkably stable across all sampling frequencies. This suggests that NODEs can act as effective continuous‑time interpolators and that diverse dynamical coverage in the training set is more important than sheer quantity of initial conditions.
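The heat‑map construction can be sketched by sweeping a grid of initial conditions and comparing a reference system against a surrogate model. Here a slightly mis‑specified copy of a damped oscillator stands in for the trained NODE; the dynamics, constants, and grid ranges are all illustrative.

```python
import numpy as np

def rollout(state0, t, k, gamma):
    """RK4 rollout of theta'' = -k*theta - gamma*theta'."""
    def f(s):
        return np.array([s[1], -k * s[0] - gamma * s[1]])
    dt = t[1] - t[0]
    out = [np.asarray(state0, dtype=float)]
    for _ in t[1:]:
        s = out[-1]
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        out.append(s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(out)

t = np.linspace(0.0, 1.0, 26)
thetas = np.linspace(-0.5, 0.5, 11)   # grid of initial angles
omegas = np.linspace(-1.0, 1.0, 11)   # grid of initial velocities

# MSE between the reference model and a slightly mis-specified
# surrogate (standing in for the trained NODE) over the full grid.
mse = np.empty((len(thetas), len(omegas)))
for i, th in enumerate(thetas):
    for j, om in enumerate(omegas):
        truth = rollout([th, om], t, k=4.0, gamma=0.5)
        approx = rollout([th, om], t, k=4.1, gamma=0.48)
        mse[i, j] = np.mean((truth - approx) ** 2)
```

Plotting `mse` as an image (e.g. with `matplotlib.pyplot.imshow`) reproduces the kind of heat‑map the paper uses to locate low‑error islands outside the training region.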

Symbolic Regression Experiments
For SR, the authors employ PySR, a genetic‑programming‑based symbolic regression framework. They test two data sources: (i) the ground‑truth simulated data (both noise‑free and with 5 % noise) and (ii) data generated by the trained NODEs (again both noise‑free and noisy). Input variables include the state variables of each system (θ and ω for the cart‑pole; ψ_A, φ_R, and χ_R for the Bio‑model) and, optionally, an auxiliary term λ that appears in the original Bio‑model equations.

When λ is supplied as an input, SR recovers all three target equations from the noise‑free ground‑truth data. However, with 5 % noise, SR fails to identify the rational term λ·ψ_A in Equation 2; the algorithm collapses the expression to a much simpler form (essentially a constant minus λ). This failure is attributed to the low signal‑to‑noise ratio of the λ·ψ_A term (its magnitude is an order of magnitude smaller than the dominant constant). When λ is omitted, SR can only recover Equation 4; the other two equations are lost because the missing term masks the true structure.
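The signal‑to‑noise argument can be made concrete with placeholder magnitudes; the summary gives only the order‑of‑magnitude relationship, so the values below are assumptions. When a term's variation is comparable to the noise floor, a symbolic search gains little by modelling it and tends to absorb it into a constant.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder magnitudes (not from the paper): a dominant constant of
# order 1 and a lam*psi_A-style term an order of magnitude smaller.
n = 500
small_term = 0.1 * np.abs(np.sin(np.linspace(0.0, 4.0 * np.pi, n)))
signal = 1.0 - small_term

# 5 % uniform noise, scaled here to the signal's typical magnitude
# (an assumption; the summary does not state the exact convention).
noise = rng.uniform(-1.0, 1.0, n) * 0.05 * np.abs(signal).mean()

# The small term's variation is on the order of the noise itself, so
# symbolic regression can hardly distinguish it from noise.
ratio = small_term.std() / noise.std()
```

With these assumed magnitudes, `ratio` lands near 1: the rational term is effectively buried, matching the paper's explanation for the collapsed expression.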

When SR is applied to NODE‑generated data, the picture improves. The NODE acts as a denoising filter: even when trained on noisy data, the generated trajectories are smoother, allowing SR to recover Equations 3 and 4 exactly and to obtain a close approximation of Equation 2 (the constant term is slightly shifted, but the overall functional form is recognizable). This demonstrates that a modestly trained NODE can enrich a limited, noisy dataset and make the subsequent symbolic search more tractable.

NODE‑Augmented SR Pipeline
The central contribution of the paper is a three‑step pipeline: (1) train a NODE on only 10 % of the full simulation, sampled at very low rates; (2) use the trained NODE to generate a dense, long‑duration synthetic dataset; (3) run SR on this synthetic data. Using this pipeline, the authors recover two of the three governing equations exactly and obtain a good approximation for the third. While the recovered approximation for Equation 2 is not perfect, the result is encouraging because it shows that even in a severely data‑starved regime, the combination of NODE and SR can extract substantial physical insight.
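Under strong simplifying assumptions, the pipeline can be sketched end to end with a one‑parameter linear ODE standing in for both the system and the NODE (the true dynamics, constants, and fitting procedure are all placeholders): fit on a sparse, noisy 10 % slice, then generate a dense trajectory over the full duration for SR to consume.

```python
import numpy as np

rng = np.random.default_rng(3)

# Step 1: "train" on 10 % of the simulation. The trainable model here
# is dx/dt = -a*x, a stand-in for the NODE; a_true and the time grid
# are illustrative, loosely echoing the 8 h / 0.01 h Bio-model setup.
a_true = 0.7
t_full = np.linspace(0.0, 8.0, 801)        # 8 h at 0.01 h steps
t_sparse = t_full[t_full <= 0.8][::8]      # first 10 %, sub-sampled
x_sparse = np.exp(-a_true * t_sparse)
span = x_sparse.max() - x_sparse.min()
x_sparse = x_sparse + rng.uniform(-1.0, 1.0, x_sparse.size) * 0.05 * span

# Fit the decay rate by least squares on log(x) (valid while x > 0).
slope, _ = np.polyfit(t_sparse, np.log(np.clip(x_sparse, 1e-6, None)), 1)
a_hat = -slope

# Step 2: generate a dense, long-duration synthetic dataset from the
# fitted model -- the input that symbolic regression would then see.
x_dense = np.exp(-a_hat * t_full)
```

Step 3 would hand `t_full` and `x_dense` (plus any auxiliary variables) to PySR; the dense, smoothed trajectory is what makes the symbolic search tractable despite the data‑starved training regime.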

Limitations and Future Directions
The authors acknowledge several avenues for improvement. First, SR was only tested on single‑shift simulations; extending the search to multi‑condition, multi‑shift datasets could improve robustness. Second, the NODE architectures used are relatively simple; more expressive models such as Neural Controlled Differential Equations or hybrid approaches with SINDy could enhance extrapolation. Third, incorporating physical priors (e.g., unit consistency, conservation laws) into the SR search could help recover small‑magnitude terms that are currently masked by noise. Finally, systematic optimization of the training set—ensuring coverage of diverse dynamical regimes rather than merely increasing the number of initial conditions—remains a key design principle.

Conclusions
The study demonstrates that (i) NODEs can learn the underlying flow of damped‑oscillatory systems and extrapolate to unseen boundary conditions provided the new trajectories are dynamically similar to those seen during training, and (ii) SR can recover governing equations from noisy data if the appropriate input variables are supplied. By using a NODE as a data‑augmentation tool, the authors show that a sparse, noisy experimental dataset can be transformed into a rich synthetic corpus that enables successful symbolic discovery of physical laws. This NODE‑augmented SR pipeline offers a promising route for scientific discovery in domains where data are scarce or expensive to acquire, such as experimental biology, climate modeling, or astrophysics.

