Interpretability and Generalization Bounds for Learning Spatial Physics
Many applications of ML to scientific problems look promising at first glance, but visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify the function space of the data as critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models, whereby Green’s function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.
💡 Research Summary
This paper presents a rigorous numerical analysis of the interpretability and generalization bounds of machine learning models applied to linear differential equations, using the 1D Poisson equation as a canonical testbed. The central thesis is that beyond the quantity and discretization of data, the function space from which training data is sampled is a critical, often overlooked determinant of a model’s ability to generalize.
The authors make two key theoretical contributions. First, for parameter discovery (e.g., learning a material coefficient), they prove that training on polynomial data of degree p using a finite-difference scheme of order q leads to accurate parameter estimation only if p < q. Counterintuitively, using higher-degree polynomial data (p ≥ q) introduces a theoretical error that grows with p, demonstrating that “richer” data can be detrimental for learning fundamental physical laws. Second, for learning a full solution operator (e.g., a Green’s function matrix), they prove that gradient descent converges to the projection of the true operator onto the subspace spanned by the training data’s function space. This is a powerful no-go theorem: a model cannot learn components of the operator for which it has seen no corresponding data, regardless of the amount of data within its subspace.
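The first result can be illustrated with a minimal sketch (the setup and names below are ours, not the paper's code): estimate the coefficient `a` in `-a u'' = f` by least squares, using a second-order (q = 2) centered finite difference for `u''`, on polynomial data `u = x^p`. The estimate is essentially exact for low `p`, while for `p ≥ 4` (where the centered stencil is no longer exact) the truncation-error bias appears and grows with `p`:

```python
import numpy as np

# Hypothetical illustration of the p < q effect: estimate a in
# -a u'' = f from (u, f) data, with u a polynomial of degree p and
# u'' discretized by a second-order centered finite difference.

def estimate_coefficient(p, a_true=2.0, n=101):
    x = np.linspace(0.0, 1.0, n)
    h = x[1] - x[0]
    u = x**p                                 # solution data of degree p
    f = -a_true * p * (p - 1) * x**(p - 2)   # exact forcing -a u''
    # centered second difference on interior nodes (truncation error O(h^2))
    d2u = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
    r = -d2u                                 # model predicts f ≈ a * (-u'')
    fi = f[1:-1]
    return (r @ fi) / (r @ r)                # least-squares estimate of a

for p in (2, 4, 8):
    print(p, abs(estimate_coefficient(p) - 2.0))
```

The bias is not a numerical accident: the centered stencil's truncation term involves higher derivatives of `u`, which vanish for low-degree polynomials but grow with `p`.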
These theories are empirically validated and extended through comprehensive experiments on eight model classes: white-box (finite-difference, PINNs for inverse problems), black-box (linear, deep linear, MLP), SciML-specific (DeepONet, Fourier Neural Operator), and physics-informed hybrids (Physics-Informed DeepONet). The models are trained and tested across 25 distinct datasets generated from different function classes (polynomials, sines, cosines, piecewise linear) and varying basis dimensions (p). The results reveal starkly divergent generalization behaviors. Models like linear and deep linear networks generalize well only when the test function space is a subset of the training function space. In contrast, overparameterized nonlinear models (MLP, DeepONet) can fail to generalize even under this subset condition, highlighting that inductive bias, not just capacity, governs out-of-distribution performance.
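The subset condition for linear models follows directly from the projection theorem, and can be checked numerically in a few lines (a sketch under our own setup, not the authors' experiments): fit a linear solution operator for the 1D Poisson problem `-u'' = f`, `u(0) = u(1) = 0`, using forcings drawn only from the first `k` sine modes. The minimum-norm least-squares fit, which is what gradient descent from zero initialization converges to, matches the discrete Green's matrix `G` on that subspace and predicts zero on its orthogonal complement:

```python
import numpy as np

# Projection "no-go" sketch: a linear operator trained on a k-dimensional
# function space learns G only on that subspace.

n = 49                                    # interior grid points
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
G = np.linalg.inv(A)                      # discrete Green's matrix

k = 5                                     # training forcings: sine modes 1..k
F_train = np.stack([np.sin(m * np.pi * x) for m in range(1, k + 1)])
U_train = F_train @ G.T                   # exact solutions for each forcing

# minimum-norm least-squares solution of F_train @ M ≈ U_train
M = np.linalg.pinv(F_train) @ U_train
W = M.T                                   # learned solution operator u ≈ W f

f_in = np.sin(2 * np.pi * x)              # inside the training span
f_out = np.sin((k + 3) * np.pi * x)       # orthogonal to the span
print(np.linalg.norm(W @ f_in - G @ f_in))   # ~0: generalizes
print(np.linalg.norm(W @ f_out))             # ~0: model predicts nothing
print(np.linalg.norm(G @ f_out))             # nonzero true solution
```

No amount of additional data from within the span changes `W f_out`: the learned operator is exactly `G` composed with the projector onto the training subspace.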
Building on this analysis, the paper introduces a novel mechanistic interpretability lens for SciML. It demonstrates that the Green’s function kernel—the fundamental solution operator of the PDE—can be directly extracted from the weights of a trained black-box linear model, offering a concrete method to peer inside the “black box” and verify if it has learned the underlying physics.
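A minimal version of this extraction can be sketched as follows (our own illustrative setup; the paper's models and data differ). Fit a linear map `W` from forcings to solutions of `-u'' = f`, `u(0) = u(1) = 0`, on data spanning the full grid. The learned weights then reproduce the analytic Green's function `G(x, s) = min(x, s)(1 - max(x, s))`, up to the quadrature weight `h` of the discrete sum:

```python
import numpy as np

# Read the Green's function off the weights of a trained linear model.

n = 49
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

rng = np.random.default_rng(0)
F = rng.standard_normal((200, n))         # full-rank set of random forcings
U = np.linalg.solve(A, F.T).T             # exact discrete solutions

M, *_ = np.linalg.lstsq(F, U, rcond=None)
W = M.T                                   # learned weights: u ≈ W f

# analytic Green's function of -u'' on [0, 1] with Dirichlet BCs
X, S = np.meshgrid(x, x, indexing="ij")
G_analytic = np.minimum(X, S) * (1.0 - np.maximum(X, S))
print(np.abs(W / h - G_analytic).max())   # small: physics read off the weights
```

Here interpretability is literal: dividing the weight matrix by the grid spacing recovers the PDE's fundamental solution entry by entry, so one can verify whether the model has learned the physics rather than a surface fit.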
Ultimately, the work sounds a cautionary note about the blind application of ML to scientific problems, where visual fits can be deceiving. It provides a new, rigorous framework for understanding failure modes. The authors propose a practical outcome: a new cross-validation benchmark where models are systematically evaluated on datasets drawn from different function spaces, not just different resolutions or random seeds. This provides a more stringent and physically meaningful test of generalization for scientific machine learning models, moving the field toward more reliable and interpretable applications.
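The proposed benchmark can be sketched as a train/test matrix over function families (family names and generators below are illustrative, not the paper's exact datasets). Each model is trained on data from one family and evaluated on all families; off-diagonal entries expose the function-space generalization gap that resolution sweeps and random seeds would miss:

```python
import numpy as np

# Cross-validation across function spaces for the 1D Poisson operator.

n = 49
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
G = np.linalg.inv(A)                      # discrete Green's matrix

families = {
    "poly": np.stack([x**m for m in range(4)]),
    "sine": np.stack([np.sin(m * np.pi * x) for m in range(1, 5)]),
    "cosine": np.stack([np.cos(m * np.pi * x) for m in range(1, 5)]),
}

def fit(F):                               # min-norm linear operator fit
    return (np.linalg.pinv(F) @ (F @ G.T)).T

errors = {}
for train, F_tr in families.items():
    W = fit(F_tr)
    for test, F_te in families.items():
        U_true, U_pred = F_te @ G.T, F_te @ W.T
        errors[train, test] = (np.linalg.norm(U_pred - U_true)
                               / np.linalg.norm(U_true))

for (tr, te), e in sorted(errors.items()):
    print(f"train={tr:6s} test={te:6s} rel_err={e:.2e}")
```

In-family (diagonal) errors sit at machine precision, while cross-family errors are orders of magnitude larger, which is exactly the signal this style of cross-validation is designed to surface.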