On the Importance of Pretraining Data Alignment for Atomic Property Prediction
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the upstream dataset with the smallest CSI distance to the downstream task, we show that models pretrained on a smaller, focused dataset consistently achieve better downstream performance than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
💡 Research Summary
The paper challenges the prevailing belief in atomic property prediction that larger pre‑training datasets and massive compute automatically yield better models. By introducing the Chemical Similarity Index (CSI), an adaptation of the Fréchet Inception Distance for 3D molecular graphs, the authors provide a quantitative measure of how well an upstream (pre‑training) dataset aligns with a downstream target task. CSI is computed from the means and covariances of node‑level embeddings extracted from each dataset; a lower CSI indicates higher chemical and structural similarity between the two distributions.
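The Fréchet-style computation described above can be sketched as follows. This is a minimal illustration, not the authors' released implementation: it assumes each dataset is reduced to a matrix of node-level embeddings, and applies the standard closed-form Fréchet distance between two Gaussians fitted to those embeddings.

```python
import numpy as np
from scipy.linalg import sqrtm

def csi(emb_upstream, emb_downstream):
    """Fréchet-style distance between two embedding distributions.

    emb_*: (n_nodes, d) arrays of node-level embeddings extracted
    from the upstream and downstream datasets. Lower = more aligned.
    """
    mu1, mu2 = emb_upstream.mean(axis=0), emb_downstream.mean(axis=0)
    sigma1 = np.cov(emb_upstream, rowvar=False)
    sigma2 = np.cov(emb_downstream, rowvar=False)

    # Matrix square root of the covariance product; small imaginary
    # components can appear from numerical error and are discarded.
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Two identical embedding sets yield a distance near zero, while a shifted copy of the same embeddings yields a strictly positive score, matching the intended "lower CSI = higher similarity" reading.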
Using CSI, the authors evaluate several large‑scale molecular and materials datasets (e.g., ANI‑1x, OC20, OC22) and select, for each downstream task, the single upstream dataset with the smallest CSI distance. They then pre‑train a multi‑task graph neural network (predicting energies and forces) on this selected dataset under a fixed computational budget C = E × N (epochs × samples). The budget is deliberately set to be 1/24 of that used by the Joint Multi‑domain Pre‑training (JMP) approach, which aggregates all upstream data (≈240 M samples). Despite using only ≈10 M samples, the CSI‑guided models achieve mean absolute errors (MAE) that match or surpass JMP across a suite of downstream benchmarks, demonstrating that a carefully aligned, smaller dataset can be far more efficient.
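The selection-plus-budget procedure above amounts to picking the minimum-CSI upstream dataset and then sizing training to a fixed budget C = E × N. A hypothetical sketch (dataset names follow the paper; the CSI scores and sizes are illustrative placeholders, not reported values):

```python
def pick_upstream(csi_scores):
    """Return the upstream dataset most aligned with the downstream
    task, i.e. the one with the smallest precomputed CSI distance."""
    return min(csi_scores, key=csi_scores.get)

def epochs_for_budget(budget_c, n_samples):
    """Convert a fixed compute budget C = E * N (total sample passes)
    into an affordable epoch count for a dataset of n_samples."""
    return max(1, budget_c // n_samples)

# Hypothetical CSI distances from each upstream dataset to one
# downstream task (values are illustrative only).
scores = {"ANI-1x": 0.8, "Transition-1x": 1.2, "OC22": 2.7, "OC20": 3.1}
best = pick_upstream(scores)            # -> "ANI-1x"

# Illustrative sizes: ~10M aligned samples vs the ~240M-sample JMP mix,
# giving the paper's 1/24 budget ratio at equal epoch counts.
aligned_budget = 1 * 10_000_000
joint_budget = 1 * 240_000_000
ratio = joint_budget // aligned_budget  # -> 24
```

The point of the `epochs_for_budget` helper is that once N shrinks 24-fold, the same epoch count costs 1/24 of the joint-pretraining budget, which is the comparison the paper makes.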
A second set of experiments adds poorly aligned data to the selected dataset. The results show a consistent degradation in downstream performance, confirming that indiscriminate scaling of data can be detrimental when the added samples are not task‑relevant. This finding overturns the “bigger is better” mantra and highlights the importance of data quality and relevance.
The paper also discusses broader implications: CSI can serve as a universal metric for dataset selection in any graph‑based molecular or materials learning pipeline, guiding researchers to allocate compute resources wisely. The authors release code and datasets, ensuring reproducibility and encouraging the community to explore CSI‑driven pre‑training across different architectures and domains.
In summary, the work demonstrates that (1) strategic data selection based on a principled similarity metric can achieve state‑of‑the‑art performance with dramatically reduced compute, and (2) adding misaligned data can harm model accuracy. This shifts the focus of future research from sheer data volume toward intelligent, chemistry‑aware dataset curation for efficient and effective atomic property prediction.