A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a “scale-at-all-costs” paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and cross-modal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.


💡 Research Summary

The paper tackles two fundamental inefficiencies that have become entrenched in the development of medical vision‑language foundation models: (1) massive redundancy and severe class imbalance in publicly available chest X‑ray (CXR) datasets, and (2) the indiscriminate consumption of computational resources when training on every available image‑report pair. Traditional “scale‑at‑all‑costs” approaches simply aggregate as many CXR images and associated free‑text radiology reports as possible, assuming that larger data volumes inevitably yield better generalization. In practice, however, normal findings and common pathologies dominate the data distribution, while rare but clinically critical conditions are under‑represented. Training on the full set therefore biases the learned representations toward over‑represented patterns and wastes compute on redundant examples.

To address these challenges, the authors propose CheXficient, a contrastive vision‑language foundation model that incorporates an online prototype‑driven data curator during pretraining. The overall architecture follows the CLIP paradigm: a DINOv2 vision encoder and a BioClinicalBERT text encoder are jointly trained with an InfoNCE contrastive loss. The novelty lies in the dynamic selection of training samples. A set of learnable prototypes (centroids) is maintained throughout training to approximate the underlying data manifold. For each image‑report pair, the distance between its joint embedding and the nearest prototype is computed. Samples that lie far from any prototype—indicative of low‑density, under‑represented regions of the data distribution—receive a higher sampling probability, whereas samples close to a prototype (i.e., redundant or highly typical examples) are down‑sampled. This selection is performed online, allowing the prototype set to evolve as the model’s representation space changes.
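The summary does not specify the curator's exact weighting function, so the selection rule described above can only be sketched. The following minimal NumPy illustration assumes Euclidean distance in the joint embedding space and sampling weights proportional to the distance to the nearest prototype; the function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def curation_weights(embeddings, prototypes):
    """Sampling weights for a prototype-driven data curator (illustrative sketch).

    For each joint image-report embedding, the distance to the nearest
    prototype serves as a density proxy: a large distance suggests a
    low-density, under-represented region and yields a higher sampling
    probability, while samples close to a prototype are down-weighted.
    """
    # Pairwise Euclidean distances: shape (n_samples, n_prototypes)
    dists = np.linalg.norm(
        embeddings[:, None, :] - prototypes[None, :, :], axis=-1
    )
    nearest = dists.min(axis=1)          # distance to the closest prototype
    weights = nearest / nearest.sum()    # normalize into a sampling distribution
    return weights

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))                        # toy joint embeddings
protos = emb[:2] + 0.01 * rng.normal(size=(2, 4))    # prototypes near samples 0, 1
w = curation_weights(emb, protos)
# Samples 0 and 1 sit next to a prototype (redundant) and get the lowest weights
```

In the actual method this computation runs online, with the prototype set updated as the representation space evolves, so the weights above would be recomputed throughout training rather than fixed once.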

The authors assembled a massive pretraining corpus of 1,235,004 paired CXR images and reports drawn from 13 public datasets (e.g., CheXpert-Plus, MIMIC-CXR, PadChest, and BIMCV COVID-19). CheXficient was trained on 280,000 pairs, i.e., 22.7% of the full corpus, and consumed only 27.3% of the total compute budget required to train on the entire set. For a fair comparison, two baselines were defined: CheXfull, which uses the same architecture and training recipe but is trained on the complete dataset, and CheXrandom, which is trained on a randomly sampled subset of the same size as CheXficient's curated subset.

Evaluation was conducted across 20 benchmarks spanning five task categories: (i) zero‑shot findings classification (47 thoracic findings) on eight public datasets, (ii) zero‑shot cross‑modal retrieval (image‑to‑report, image‑to‑findings, image‑to‑impression, and the reverse), (iii) fine‑tuned multi‑label disease prediction, (iv) fine‑tuned semantic segmentation of anatomical structures and abnormalities, and (v) fine‑tuned radiology report generation. Performance metrics included AUROC for classification, Recall@1 for retrieval, Dice score for segmentation, and RadGraph‑based metrics for report generation.
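Of the metrics listed, Recall@1 for cross-modal retrieval is compact enough to illustrate directly. The sketch below assumes row-aligned, L2-normalized image and text embeddings (so the dot product is cosine similarity); the helper name is invented for this example.

```python
import numpy as np

def recall_at_1(image_emb, text_emb):
    """Recall@1 for image-to-report retrieval (illustrative helper).

    Assumes row i of image_emb and text_emb form a matched pair and that
    both matrices are L2-normalized, so a dot product gives cosine similarity.
    """
    sims = image_emb @ text_emb.T              # (n_images, n_reports)
    top1 = sims.argmax(axis=1)                 # best-matching report per image
    return float((top1 == np.arange(len(sims))).mean())

# Toy sanity check: identical embeddings retrieve their pair perfectly
x = np.eye(3)
assert recall_at_1(x, x) == 1.0
```

The reverse directions (report-to-image, etc.) follow by swapping the arguments; findings- and impression-level retrieval differ only in which text section is embedded.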

Key results:

  • In zero‑shot classification, CheXficient matched or exceeded CheXfull on seven of eight datasets, achieving statistically significant improvements on three. It consistently outperformed CheXrandom, confirming that the curated subset is more informative than a random one of the same size.
  • For cross‑modal retrieval, CheXficient attained higher Recall@1 than CheXfull across all retrieval directions, demonstrating that the learned joint embedding space is more discriminative despite using fewer training pairs.
  • In downstream fine‑tuning, CheXficient’s disease prediction AUROC, segmentation Dice, and report generation RadGraph F1 scores were on par with or slightly better than CheXfull, while requiring far less training time. Notably, performance gains were most pronounced on rare disease cohorts, reflecting the curator’s emphasis on low‑density regions.
  • Analyses of the selected data revealed that the curated subset occupies sparsely populated zones in a PCA projection of the joint embeddings, and its average k‑nearest‑neighbor distance is significantly larger than that of the full set (p < 0.001). Over 32% of curated samples fall within the lowest‑density quartile of the full distribution, whereas a random subset shows no such enrichment.
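The k-nearest-neighbor distance used in the last analysis above is a standard density proxy and is easy to reproduce in spirit. This brute-force sketch (assuming Euclidean distance; the paper's choice of k and distance metric is not given in the summary) shows why points in sparse regions score higher:

```python
import numpy as np

def mean_knn_distance(embeddings, k=5):
    """Average distance to the k nearest neighbors, a simple density proxy.

    Larger values indicate sparser, under-represented regions of the
    embedding space. Brute-force O(n^2) version for illustration only;
    large corpora would use an approximate nearest-neighbor index.
    """
    d = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1
    )
    np.fill_diagonal(d, np.inf)              # exclude each point's self-distance
    knn = np.sort(d, axis=1)[:, :k]          # k smallest distances per point
    return knn.mean(axis=1)

rng = np.random.default_rng(1)
dense = rng.normal(0, 0.1, size=(50, 8))     # tight cluster (high density)
sparse = rng.normal(0, 5.0, size=(10, 8))    # scattered points (low density)
scores = mean_knn_distance(np.vstack([dense, sparse]), k=3)
# The scattered points score far higher than the clustered ones
```

Under this proxy, the paper's finding that curated samples have significantly larger k-NN distances than the full set is exactly what one would expect if the curator preferentially keeps low-density examples.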

From a computational standpoint, CheXficient converged in roughly a quarter of the GPU/TPU hours required for CheXfull, confirming that eliminating redundant samples not only reduces data volume but also accelerates optimization.

The study’s broader implication is a paradigm shift: rather than pursuing ever‑larger datasets, strategic data curation can achieve comparable or superior performance with dramatically lower resource demands. The prototype‑driven curator offers a scalable, model‑agnostic mechanism that could be extended to other imaging modalities (CT, MRI) and even non‑image clinical data (EHR notes, lab results). Future work may explore multi‑prototype ensembles, curriculum‑style prototype updates, or domain‑adaptive prototypes to further refine sample selection.

In summary, CheXficient demonstrates that data quality and intelligent sampling outweigh sheer quantity for medical vision‑language foundation models. By integrating a lightweight online curator, the authors achieve a 77 % reduction in data and a 73 % reduction in compute while maintaining or improving performance across a wide spectrum of clinically relevant tasks, thereby paving the way for more accessible, sustainable, and equitable AI development in radiology.

