DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction
Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

Cone-beam computed tomography (CBCT) is a critical 3D imaging technology in medicine, but the high radiation exposure required for high-quality imaging raises significant concerns, particularly for vulnerable populations. Sparse-view reconstruction reduces radiation by using fewer X-ray projections while maintaining image quality, yet existing methods face challenges such as high computational demands and poor generalizability across datasets. To overcome these limitations, we propose DeepSparse, the first foundation model for sparse-view CBCT reconstruction, featuring DiCE (Dual-Dimensional Cross-Scale Embedding), a novel network that integrates multi-view 2D features and multi-scale 3D features. Additionally, we introduce the HyViP (Hybrid View Sampling Pretraining) framework, which pretrains the model on large datasets with both sparse-view and dense-view projections, and a two-step finetuning strategy to adapt and refine the model for new datasets. Extensive experiments and ablation studies demonstrate that our proposed DeepSparse achieves superior reconstruction quality compared to state-of-the-art methods, paving the way for safer and more efficient CBCT imaging.


💡 Research Summary

The paper introduces DeepSparse, the first foundation‑model‑style approach for sparse‑view cone‑beam computed tomography (CBCT) reconstruction. The authors identify the critical clinical need to reduce radiation dose, especially for vulnerable populations, by decreasing the number of X‑ray projections while preserving high‑quality three‑dimensional images. Existing sparse‑view methods rely on heavy 3‑D CNNs with prohibitive computational costs, or lack the ability to generalize across different anatomies and view counts.

To address these gaps, the authors propose a two‑part solution: (1) DiCE (Dual‑Dimensional Cross‑Scale Embedding), a novel network architecture built on the earlier C2RV model, and (2) HyViP (Hybrid View Sampling Pretraining), a large‑scale pre‑training regime that mixes sparse and dense projection data.

DiCE removes the costly 2‑D decoder of C2RV and introduces a multi‑scale projection encoder that extracts hierarchical 2‑D features from each view. These features are back‑projected into a low‑resolution 3‑D voxel grid at several scales, producing a set of scale‑specific 3‑D feature volumes. A cross‑scale 3‑D embedding module aggregates these volumes via 3‑D convolutions and down‑sampling, yielding an enhanced 3‑D representation. For any sampled point in continuous space, DiCE concatenates pixel‑aligned 2‑D features (interpolated from the multi‑view 2‑D maps) with voxel‑aligned 3‑D features (interpolated from the aggregated volume) and feeds them to a lightweight point decoder (MLP) that predicts the attenuation coefficient. This design keeps memory and FLOPs roughly linear with the number of views, enabling efficient processing of both sparse (≤ 10) and dense (≥ 100) view scenarios.
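The point‑decoding step above can be sketched in a few lines. This is a minimal illustrative mock‑up, not the paper's implementation: shapes, the nearest‑neighbour lookup (standing in for bilinear/trilinear interpolation), and the fixed 2‑D lookup (standing in for the actual cone‑beam projection geometry) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes (not from the paper): V views, C2 2-D channels,
# C3 3-D channels, grid size G for the aggregated voxel volume.
V, C2, C3, G, H, W = 6, 8, 16, 32, 64, 64

feat2d = rng.standard_normal((V, C2, H, W))   # multi-view 2-D feature maps
feat3d = rng.standard_normal((C3, G, G, G))   # aggregated 3-D feature volume

def sample_point_features(p):
    """Gather pixel-aligned 2-D and voxel-aligned 3-D features for a point
    p in normalized [0, 1)^3 space. Nearest-neighbour lookup is used here
    as a stand-in for the interpolation described in the summary."""
    # Voxel-aligned 3-D features at the nearest grid cell.
    i, j, k = np.clip(np.floor(np.array(p) * G).astype(int), 0, G - 1)
    f3 = feat3d[:, i, j, k]                      # (C3,)
    # Pixel-aligned 2-D features: a fixed (u, v) lookup per view; a real
    # model would project p through each view's cone-beam geometry.
    u = int(np.clip(p[0] * H, 0, H - 1))
    v = int(np.clip(p[1] * W, 0, W - 1))
    f2 = feat2d[:, :, u, v].reshape(-1)          # (V * C2,)
    return np.concatenate([f2, f3])              # (V * C2 + C3,)

# Lightweight point decoder: a 2-layer MLP predicting one attenuation value.
W1 = rng.standard_normal((64, V * C2 + C3)) * 0.1
W2 = rng.standard_normal((1, 64)) * 0.1

def point_decoder(p):
    h = np.maximum(W1 @ sample_point_features(p), 0.0)   # ReLU hidden layer
    return (W2 @ h).item()                               # attenuation coefficient

mu = point_decoder((0.5, 0.5, 0.5))
```

Because the decoder is queried per point, the heavy per‑view work stays in the encoders, which is consistent with the roughly view‑linear cost described above.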

HyViP tackles the data‑scarcity problem. Using a curated collection of ~8,000 CT scans covering abdomen, knee, pelvis, spine, and brain, the authors generate paired sparse (6–10 views) and dense (200–400 views) projection sets for each volume. The model is jointly trained on both regimes, allowing the 2‑D encoder to learn robust, view‑invariant features while the 3‑D decoder benefits from dense‑view supervision that yields high‑quality volumetric embeddings.
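The hybrid sampling regime can be sketched as follows. This is an assumption‑laden illustration: the view‑count ranges come from the summary above, but the uniform angular layout and per‑volume pairing are guesses at how such a sampler might look, not the paper's procedure.

```python
import random

def hybrid_view_sample(rng, sparse_range=(6, 10), dense_range=(200, 400)):
    """For one CT volume, draw a paired sparse and dense set of projection
    angles (degrees). Uniform angular spacing is an illustrative choice."""
    n_sparse = rng.randint(*sparse_range)   # 6-10 sparse views
    n_dense = rng.randint(*dense_range)     # 200-400 dense views
    sparse_angles = [i * 360.0 / n_sparse for i in range(n_sparse)]
    dense_angles = [i * 360.0 / n_dense for i in range(n_dense)]
    return sparse_angles, dense_angles

rng = random.Random(0)
sparse, dense = hybrid_view_sample(rng)   # one paired draw per volume
```

Training on both draws jointly is what lets the 2‑D encoder see sparse inputs while the 3‑D side still receives dense‑view supervision.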

After pre‑training, a two‑step fine‑tuning pipeline adapts DeepSparse to a target dataset: (i) the whole network is fine‑tuned on the new anatomy to align both encoder and decoder to the specific geometry; (ii) a residual denoising layer is trained on sparse‑view inputs only, refining the 3‑D features and suppressing artifacts caused by extreme undersampling.
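The two‑step schedule amounts to a change in which parameter groups are trainable. The sketch below uses hypothetical module names ("encoder2d", "embed3d", "denoiser") purely for illustration; the paper's actual module boundaries may differ.

```python
# Hypothetical parameter groups of the pretrained network.
params = {
    "encoder2d": {"trainable": True},   # multi-scale projection encoder
    "embed3d":   {"trainable": True},   # cross-scale 3-D embedding + decoder
    "denoiser":  {"trainable": True},   # residual denoising layer
}

def step1_full_finetune(params):
    # Step (i): adapt the whole network to the target anatomy/geometry.
    for group in params.values():
        group["trainable"] = True

def step2_denoiser_only(params):
    # Step (ii): freeze the adapted backbone; train only the residual
    # denoising layer on sparse-view inputs to suppress undersampling artifacts.
    for name, group in params.items():
        group["trainable"] = (name == "denoiser")

step1_full_finetune(params)
step2_denoiser_only(params)
```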

Experimental evaluation spans five public datasets with view counts ranging from 6 to 100. Quantitative metrics (PSNR, SSIM, RMSE) show that DeepSparse consistently outperforms state‑of‑the‑art methods such as C2RV, DIF‑Net, and R2‑Gaussian, achieving an average PSNR gain of ~1.8 dB and SSIM improvement of 0.03–0.05 in the 6‑view regime. Visual inspection confirms markedly reduced streaking and more faithful anatomical detail. Importantly, inference memory consumption is reduced by ~30 % compared with full 3‑D CNN pipelines, and runtime remains comparable (≈0.8–1.2× of baseline).
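For reference, PSNR and RMSE as conventionally computed for reconstruction benchmarks are shown below (intensities assumed normalized to [0, 1]); the paper presumably uses equivalent definitions.

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between reconstruction and ground truth."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    e = rmse(pred, gt)
    return float("inf") if e == 0 else 20.0 * np.log10(max_val / e)

gt = np.zeros((4, 4, 4))
pred = np.full((4, 4, 4), 0.1)   # uniform error of 0.1 everywhere
# rmse(pred, gt) == 0.1, psnr(pred, gt) == 20.0 dB
```

Note that a ~1.8 dB PSNR gain corresponds to roughly a 19% reduction in RMSE, since PSNR is logarithmic in the error.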

Ablation studies isolate each component: removing multi‑scale back‑projection drops PSNR by ~1.2 dB; omitting cross‑scale embedding reduces SSIM by 0.02; training without HyViP pre‑training degrades all metrics by 5–10 %. These results validate that both architectural innovations and hybrid pre‑training are essential for the observed performance gains.

The authors acknowledge limitations: the current implementation supports up to 128³ voxel resolution due to GPU memory constraints, and the pre‑training corpus is dominated by adult anatomy, leaving pediatric or pathological cases under‑explored. Future work is outlined to incorporate memory‑efficient tokenization, multimodal pre‑training (e.g., MRI‑CT joint embeddings), and hardware‑aware optimizations for real‑time clinical deployment.

In summary, DeepSparse establishes a new paradigm for low‑dose CBCT imaging by marrying a computationally efficient dual‑dimensional network with a large‑scale hybrid view pre‑training strategy. The model delivers superior reconstruction quality across a wide range of view counts, demonstrates strong cross‑anatomy generalization, and opens the door for foundation‑model‑driven advances in other medical imaging reconstruction tasks.
