Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model
Recent developments in 3D vision have enabled significant progress in inferring neural fluid fields and realistic rendering of fluid dynamics. However, these methods require dense captures of real-world flows, which demand specialized laboratory setups, making the process costly and challenging. Scientific machine learning (SciML) foundation models, pretrained on extensive simulations of partial differential equations (PDEs), encode rich multiphysics knowledge and thus provide promising sources of domain priors for fluid field inference. Nevertheless, the transferability of these foundation models to real-world vision problems remains largely underexplored. In this work, we demonstrate that SciML foundation models can significantly reduce the data requirements for inferring real-world 3D fluid dynamics while improving generalization. Our method leverages the strong forecasting capabilities and meaningful representations learned by SciML foundation models. We introduce a novel collaborative training strategy that equips neural fluid fields with augmented frames and fluid features extracted from the foundation model. Extensive experiments show substantial improvements in both quantitative metrics and visual quality over prior approaches. In particular, our method achieves a 9-36% improvement in peak signal-to-noise ratio (PSNR) for future prediction while reducing the number of required training frames by 25-50%. These results highlight the practical applicability of SciML foundation models for real-world fluid dynamics reconstruction. Our code is available at: https://github.com/delta-lab-ai/SciML-HY.
💡 Research Summary
The paper tackles the problem of reconstructing 3‑D fluid fields (density and velocity) from sparse video observations, a task that traditionally requires dozens to hundreds of densely captured frames and elaborate multi‑camera setups. To alleviate this data bottleneck, the authors propose leveraging a scientific machine learning (SciML) foundation model that has been pre‑trained on a large collection of partial differential equation (PDE) simulations spanning multiple physics domains (compressible and incompressible Navier‑Stokes, shallow water, reaction‑diffusion).
The foundation model is a 3‑D Swin‑Transformer with about 6.5 M parameters. It tokenizes a short temporal window of 2‑D frames using a 3‑D convolution, processes them with windowed self‑attention, and predicts the next time step. Pre‑training is performed on the PDEBench dataset using a normalized root‑mean‑square error (nRMSE) loss, which forces the network to learn generic advection, diffusion, and conservation properties shared across the diverse PDEs.
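The nRMSE objective described above normalizes each sample's RMSE by the magnitude of the target field, so losses are comparable across PDEs with very different value ranges. A minimal sketch (the exact reduction and normalization used in the paper's pre-training are not specified, so treat the details as assumptions):

```python
import numpy as np

def nrmse(pred, target, eps=1e-8):
    """Normalized root-mean-square error, averaged over the batch.

    pred, target: arrays of shape (B, ...) -- per-sample predicted and
    ground-truth next-step fields. Each sample's error norm is divided by
    the norm of its target field, making the loss scale-invariant.
    """
    b = pred.shape[0]
    diff = (pred - target).reshape(b, -1)
    tgt = target.reshape(b, -1)
    per_sample = np.linalg.norm(diff, axis=1) / (np.linalg.norm(tgt, axis=1) + eps)
    return per_sample.mean()
```

Because the error is normalized per sample, a smoke-density field with values near 1 and a pressure field with values near 1000 contribute on the same scale.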
After pre‑training, the model is fine‑tuned on the real‑world ScalarFlow smoke dataset. The fine‑tuned model can then forecast future frames from a very limited set of inputs (as few as 20–30 frames). Forecasts with a PSNR above 25 dB are deemed reliable and are added to the training pool of the neural fluid field model (HyFluid). The authors introduce a collaborative training loop:
- Initial stage (v0) – HyFluid and the foundation model are trained separately on the sparse original frames.
- Co‑training stage (v1) – The reliable forecasts from the foundation model are concatenated to HyFluid’s training set, and HyFluid’s own predictions are fed back to the foundation model. Both networks are fine‑tuned alternately, effectively distilling knowledge in the output space.
- Final stage (v2) – HyFluid is trained on the expanded set of augmented frames, achieving much stronger future predictions despite using fewer original frames.
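One round of the v1 co-training stage can be sketched as follows. The `fm` and `hyfluid` objects, their `forecast`/`finetune`/`render` methods, and the use of reference frames to score forecast reliability are all stand-ins not specified in the source; only the PSNR-thresholded frame selection and the alternating fine-tuning come from the paper.

```python
import numpy as np

PSNR_THRESHOLD = 25.0  # per the paper, forecasts above this are treated as reliable

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def collaborative_round(fm, hyfluid, frames, reference):
    """One illustrative v1 co-training round (fm/hyfluid are stand-in objects)."""
    # 1. Foundation model forecasts future frames from the sparse inputs.
    forecasts = fm.forecast(frames)
    # 2. Keep only forecasts deemed reliable by the PSNR criterion.
    reliable = [f for f, ref in zip(forecasts, reference)
                if psnr(f, ref) > PSNR_THRESHOLD]
    # 3. Concatenate reliable forecasts to HyFluid's training set and fine-tune.
    hyfluid.finetune(frames + reliable)
    # 4. Feed HyFluid's own predictions back to fine-tune the foundation model,
    #    distilling knowledge in the output space.
    fm.finetune(hyfluid.render(frames))
    return reliable
```

Alternating steps 3 and 4 over several rounds yields the expanded frame pool used in the final v2 stage.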
In parallel, the authors extract intermediate feature embeddings from the foundation model and inject them into HyFluid’s density field via its 4‑D hash encoding (iNGP‑style) and an MLP that also receives camera‑ray information. This feature aggregation supplies semantic fluid cues that improve generalization across different flow patterns.
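The aggregation described above amounts to concatenating the hash-encoded spatio-temporal coordinate, the foundation-model embedding, and the ray information before the density head. The dimensions, the two-layer MLP, and all variable names below are hypothetical; this is only a shape-level sketch of the concatenation, not HyFluid's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_mlp(x, w1, w2):
    """Two-layer ReLU MLP, standing in for HyFluid's density head."""
    return np.maximum(x @ w1, 0.0) @ w2

# Hypothetical feature dimensions for one query point:
hash_enc = rng.standard_normal(32)  # 4-D (x, y, z, t) hash encoding, iNGP-style
fm_feat  = rng.standard_normal(16)  # intermediate SciML foundation-model embedding
ray_info = rng.standard_normal(3)   # camera-ray information

# Feature aggregation: concatenate all cues before the density MLP.
inp = np.concatenate([hash_enc, fm_feat, ray_info])
w1 = rng.standard_normal((inp.size, 64)) * 0.1
w2 = rng.standard_normal((64, 1)) * 0.1
density = tiny_mlp(inp, w1, w2)  # scalar density prediction for this point
```

In a real pipeline the foundation-model embedding would be sampled at the query point's projected 2-D location and time step, which is why it can carry semantic cues about the local flow.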
Experimental results on ScalarFlow demonstrate that the proposed method reduces the required number of training frames by 25–50 % while improving future‑frame PSNR by 9–36 % relative to strong baselines (HyFluid, PINF). Ablation studies confirm that (i) multi‑physics pre‑training is beneficial, (ii) performance improves as the number of autoregressive training steps is gradually increased from 3 to 8, and (iii) the inclusion of learned fluid features yields a consistent PSNR boost.
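The gradual increase of autoregressive steps from 3 to 8 mentioned in the ablation is a rollout curriculum: early in training the model forecasts short horizons, and the horizon grows over time. The paper does not state the schedule's shape or granularity, so the linear, per-epoch version below is an assumption.

```python
def rollout_schedule(num_epochs, start=3, end=8):
    """Linearly grow the autoregressive rollout length over training.

    Returns one rollout length per epoch; the epoch granularity and the
    linear shape are illustrative assumptions, only the 3-to-8 range is
    taken from the paper's ablation.
    """
    if num_epochs == 1:
        return [start]
    span = end - start
    return [start + round(span * e / (num_epochs - 1)) for e in range(num_epochs)]
```

Longer rollouts force the model to remain stable under its own compounding prediction errors, which is what makes the forecasts usable for frame augmentation.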
The paper’s contributions are threefold: (1) showing that a SciML foundation model can serve as a powerful prior for real‑world fluid reconstruction, (2) devising a collaborative training scheme that leverages forecasted frames to dramatically improve data efficiency, and (3) integrating learned fluid representations into the neural rendering pipeline to enhance visual fidelity.
Limitations include reliance on a relatively small transformer (larger models may yield further gains), evaluation only on smoke flows (generalization to other fluids remains open), and the use of a fixed PSNR threshold for frame selection, which may be sensitive to noise levels. Future work could explore scaling the foundation model, applying the framework to multi‑phase or high‑Reynolds‑number flows, and developing adaptive criteria for frame augmentation.
Overall, the study provides compelling evidence that large‑scale physics‑aware pre‑training can bridge the gap between synthetic PDE simulations and real‑world vision tasks, opening a promising pathway toward cost‑effective, high‑quality fluid field reconstruction in practical settings.