FMIR: A Foundation Model-Based Image Registration Framework for Robust Image Registration
Deep learning has revolutionized medical image registration by achieving unprecedented speeds, yet its clinical application is hindered by a limited ability to generalize beyond the training domain, a critical weakness given the typically small scale of medical datasets. In this paper, we introduce FMIR, a foundation model-based registration framework that overcomes this limitation. Combining a foundation model-based feature encoder for extracting anatomical structures with a general registration head, and trained with a channel regularization strategy on just a single dataset, FMIR achieves state-of-the-art (SOTA) in-domain performance while maintaining robust registration on out-of-domain images. Our approach demonstrates a viable path toward building generalizable medical imaging foundation models with limited resources. The code is available at https://github.com/Monday0328/FMIR.git.
💡 Research Summary
The paper introduces FMIR (Foundation Model‑based Image Registration), a novel framework that leverages large‑scale 2‑D foundation models to achieve robust, generalizable medical image registration. The authors identify the prevailing problem that deep‑learning‑based registration methods, while fast, typically over‑fit to the training domain and perform poorly on unseen data—a critical limitation given the small size of most medical imaging datasets.
FMIR consists of two main components: (1) a foundation‑model‑based feature encoder and (2) a multi‑scale pyramid registration head. The encoder adapts a pre‑trained 2‑D model (e.g., DINO ViT‑B, SAM) to 3‑D medical volumes by processing each axial slice independently. Slices are padded to a square size, passed through the frozen foundation model, and produce feature maps of size c × K/16 × K/16 (c = 768 for DINO, 256 for SAM). A channel regularization (CR) step reduces the dimensionality to a unified c′ = 256, after which the slice‑wise features are re‑assembled into a 3‑D volume and refined with a three‑layer 3‑D convolutional block that restores local volumetric context and compresses channels to n ≪ c′. This design allows FMIR to inherit the semantic richness and domain‑invariance of foundation models while remaining lightweight enough for practical use.
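The slice-wise encoding pipeline described above can be sketched as follows. This is a minimal NumPy mock-up, not the authors' implementation: the frozen backbone is replaced by a random projection, the channel-reduction and 3-D convolution steps are stood in for by simple channel-mixing matrices, and all sizes (D, K, n, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D axial slices padded to K x K, a frozen 2-D backbone
# producing c-channel features at K/16 resolution, reduced to c' = 256
# channels and finally compressed to n << c'.
D, K, c, c_prime, n = 8, 64, 768, 256, 32
grid = K // 16  # spatial size of the backbone's feature map

def frozen_backbone(slice_2d):
    """Stand-in for a frozen DINO/SAM forward pass (random features here)."""
    return rng.standard_normal((c, grid, grid))

# 1) Run every axial slice through the frozen 2-D backbone independently.
slice_feats = np.stack([frozen_backbone(None) for _ in range(D)])  # (D, c, g, g)

# 2) Channel reduction: project c -> c' with a 1x1 channel-mixing matrix.
W = rng.standard_normal((c_prime, c)) / np.sqrt(c)
reduced = np.einsum('oc,dchw->dohw', W, slice_feats)               # (D, c', g, g)

# 3) Re-assemble slice features into a 3-D feature volume: (c', D, g, g).
volume = reduced.transpose(1, 0, 2, 3)

# 4) A three-layer 3-D conv block would now restore inter-slice context and
#    compress channels to n; a single channel-mixing matrix stands in for it.
W3d = rng.standard_normal((n, c_prime)) / np.sqrt(c_prime)
refined = np.einsum('oc,cdhw->odhw', W3d, volume)                  # (n, D, g, g)
print(refined.shape)
```

The key structural point the sketch captures is that the expensive foundation model runs purely in 2-D, and volumetric context is only reintroduced afterwards by the lightweight 3-D refinement stage.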
The registration head receives the moving and fixed feature volumes (F_m, F_f) and builds a five‑level feature pyramid via trilinear down‑sampling. At each pyramid level i, a three‑layer convolution predicts a residual deformation field u_i. The final deformation field is assembled coarse‑to‑fine: the up‑sampled field from level i − 1 is composed with u_i, effectively breaking a large displacement into a sequence of manageable refinements. This hierarchical strategy improves stability and accuracy, especially for large deformations.
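The coarse-to-fine assembly can be illustrated with a small NumPy sketch. This is an assumption-laden simplification: the residual fields are random stand-ins for the per-level convolution outputs, upsampling is nearest-neighbour rather than trilinear, and composition is approximated additively (true warp composition would resample the running field).

```python
import numpy as np

def upsample2x(u):
    """Nearest-neighbour 2x upsampling of a (3, D, H, W) displacement field;
    displacements are doubled because the voxel spacing halves."""
    return 2.0 * u.repeat(2, axis=1).repeat(2, axis=2).repeat(2, axis=3)

rng = np.random.default_rng(1)
levels, base = 5, 2  # five pyramid levels; coarsest grid is 2^3 voxels

# Hypothetical residual fields u_i, coarse (i = 0) to fine (i = levels - 1),
# standing in for the per-level three-layer convolution predictions.
residuals = [0.1 * rng.standard_normal((3, base * 2**i, base * 2**i, base * 2**i))
             for i in range(levels)]

# Coarse-to-fine assembly: upsample the running field, then fold in the
# next residual, so each level only corrects a small remaining displacement.
phi = residuals[0]
for u in residuals[1:]:
    phi = upsample2x(phi) + u

print(phi.shape)  # full-resolution deformation field
```

The design rationale mirrors the text: a single large displacement is never predicted at once, but broken into a sequence of small residual refinements, which stabilizes training for large deformations.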
A key contribution is the channel regularization strategy. During training, a random subset of the c′ channels is selected for each forward pass, acting as a form of channel dropout. This forces the network to avoid reliance on any fixed channel set and to learn structural correlations between moving and fixed features. At inference, deterministic dimensionality reduction via PCA projects the features onto the most informative c′‑dimensional subspace. Ablation experiments show that removing CR (replacing it with PCA during training) yields comparable in‑domain performance but dramatically degrades out‑of‑domain results, confirming the regularizer’s role in discarding dataset‑specific priors.
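The train/inference asymmetry of channel regularization can be sketched in a few lines of NumPy. This is an illustrative mock-up under assumed sizes, not the paper's code: features are random, and PCA is computed directly via SVD on a flattened (channels × voxels) matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
c, c_sub = 768, 256                      # backbone channels, regularized subspace
feats = rng.standard_normal((c, 1000))   # (channels, voxels) feature matrix

# Training: draw a fresh random channel subset each forward pass (channel
# dropout), so the head cannot latch onto any fixed channel combination.
idx = rng.choice(c, size=c_sub, replace=False)
train_feats = feats[idx]                 # (c_sub, voxels)

# Inference: deterministic reduction instead — project the centered features
# onto the top c_sub principal directions (PCA via SVD).
centered = feats - feats.mean(axis=1, keepdims=True)
U, S, _ = np.linalg.svd(centered, full_matrices=False)
test_feats = U[:, :c_sub].T @ centered   # (c_sub, voxels)

print(train_feats.shape, test_feats.shape)
```

Both paths hand the registration head a c_sub-channel input, but only the random-subset path is used during training, which is exactly what the ablation isolates: swapping it for PCA at training time preserves in-domain accuracy while eroding out-of-domain robustness.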
Experiments were conducted on two public benchmarks: the ACDC cardiac MR dataset (intra‑subject ED↔ES registration) and the Learn2Reg abdomen CT dataset (inter‑subject multi‑organ registration). FMIR was trained on a single dataset (either ACDC, abdomen, or a hybrid of both) using either unsupervised (NCC + smoothness loss) or weakly‑supervised (adding Dice loss) objectives. Results demonstrate that FMIR achieves state‑of‑the‑art in‑domain Dice scores (~80 % on ACDC, ~73 % on abdomen) and competitive Hausdorff distances, while maintaining strong out‑of‑domain performance (e.g., training on abdomen and testing on ACDC still yields Dice ≈ 73 %). Compared to recent learning‑based methods such as VoxelMorph, TransMorph, LKU‑Net, CorrMLP, MemWarp, RDP, and the foundation‑model‑based uniGradICON, FMIR matches or exceeds accuracy while requiring far less inference time (≈ 0.6 s per pair versus ≈ 5 s for uniGradICON).
The framework is also shown to be backbone‑agnostic. Although FMIR was trained with DINO features, it can directly operate on SAM features at test time without any retraining, achieving comparable or slightly better performance. This plug‑and‑play capability underscores that the registration head learns a general correspondence mapping rather than overfitting to a specific feature distribution.
In summary, FMIR demonstrates that a carefully designed combination of foundation‑model encoders, multi‑scale registration heads, and channel regularization can deliver fast, accurate, and domain‑robust medical image registration using only a single training dataset. The approach offers a practical pathway toward generalizable medical imaging foundation models, addressing the data scarcity and computational constraints that have limited the clinical translation of deep‑learning registration methods.