Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression
Automated radiology report generation from 3D computed tomography (CT) volumes is challenging due to extreme sequence lengths, severe class imbalance, and the tendency of large language models (LLMs) to ignore visual tokens in favor of linguistic priors. We present Ker-VLJEPA-3B, a four-phase curriculum learning framework for free-text report generation from thoracic CT volumes. A phased training curriculum progressively adapts a Llama 3.2 3B decoder to ground its output in visual features from a frozen, self-supervised encoder. Our visual backbone (LeJEPA ViT-Large) is trained via self-supervised joint-embedding prediction on unlabeled CTs, without text supervision. Unlike contrastive models (CLIP, BiomedCLIP), this language-free backbone yields modality-pure representations. Vision-language alignment is deferred to the curriculum’s bridge and generation phases. This modality-agnostic design can integrate any self-supervised encoder into an LLM without paired text during foundation training. Methodological innovations include: (1) zone-constrained cross-attention compressing slice embeddings into 32 spatially-grounded visual tokens; (2) PCA whitening of anisotropic LLM embeddings; (3) a positive-findings-only strategy eliminating posterior collapse; (4) warm bridge initialization transferring projection weights; and (5) selective cross-attention freezing with elastic weight consolidation to prevent catastrophic forgetting. Evaluated on the CT-RATE benchmark (2,984 validation volumes, 18 classes), Ker-VLJEPA-3B achieves a macro F1 of 0.429, surpassing the state-of-the-art (U-VLM, macro F1 = 0.414) by 3.6%, and reaching 0.448 (+8.2%) with threshold optimization. Ablation studies confirm 56.6% of generation quality derives from patient-specific visual content. Code and weights are available.
💡 Research Summary
The paper introduces Ker‑VLJEPA‑3B, a four‑phase curriculum learning framework that tackles the long‑standing challenges of automated radiology report generation from three‑dimensional thoracic CT scans. The core idea is to completely decouple visual representation learning from any language supervision and then progressively graft the resulting modality‑pure visual embeddings into a large language model (LLM) decoder.
Visual backbone – A frozen LeJEPA ViT‑Large (1024‑dimensional) is trained on unlabeled CT volumes using a self‑supervised joint‑embedding predictive objective. Because no image‑text pairs or segmentation masks are used, the encoder learns purely visual features without linguistic bias.
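The core of a joint-embedding predictive objective can be sketched in a few lines: predict the embeddings of masked patches from visible-context embeddings, with a regression loss computed in embedding space rather than on pixels. This is a deliberately minimal illustration; `predictor_W` stands in for a learned predictor network, and LeJEPA's actual objective and regularizers are richer than what is shown here.

```python
import numpy as np

def jepa_step(patch_emb, mask, predictor_W):
    """One joint-embedding prediction step (illustrative sketch):
    predict embeddings of masked patches from the pooled visible
    context — no pixel targets, no text supervision."""
    context = patch_emb[~mask].mean(0)                 # pooled visible-context embedding
    pred = np.outer(np.ones(mask.sum()), context @ predictor_W)
    target = patch_emb[mask]                           # (stop-gradient) target embeddings
    return np.mean((pred - target) ** 2)               # embedding-space regression loss

rng = np.random.default_rng(0)
emb = rng.standard_normal((196, 64))                   # 14x14 patch embeddings (toy size)
mask = np.arange(196) % 5 < 2                          # mask ~40% of patches
W = np.eye(64)                                         # stand-in for the learned predictor
loss = jepa_step(emb, mask, W)
```

Because the loss lives entirely in embedding space, nothing in this setup ever requires paired text, which is what keeps the backbone "language-free".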
Zone‑constrained cross‑attention – CT volumes contain 300–600 axial slices. To keep the number of visual tokens manageable for the LLM, the authors partition the slice sequence into 32 anatomical zones along the cranio‑caudal axis. A multi‑head cross‑attention module attends only to slices within each zone, compressing the variable‑length slice embeddings into exactly 32 spatially grounded visual tokens (each 1024‑d). A global self‑attention layer then allows limited inter‑zone communication, preserving both locality and overall context.
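The compression step can be sketched as follows, under simplifying assumptions: contiguous cranio-caudal zones, a single attention head per zone (the paper uses multi-head cross-attention plus a global self-attention layer not shown here), and random stand-ins for the learned zone queries.

```python
import numpy as np

def zone_compress(slice_emb, n_zones=32, rng=None):
    """Compress variable-length slice embeddings into n_zones tokens.

    slice_emb: (S, D) per-slice embeddings, S varying (e.g. 300-600).
    Each zone query attends ONLY to slices inside its cranio-caudal
    span, so every output token stays spatially grounded."""
    rng = rng or np.random.default_rng(0)
    S, D = slice_emb.shape
    queries = rng.standard_normal((n_zones, D)) / np.sqrt(D)  # stand-in for learned queries
    bounds = np.linspace(0, S, n_zones + 1).astype(int)       # contiguous zone spans
    tokens = np.empty((n_zones, D))
    for z in range(n_zones):
        keys = slice_emb[bounds[z]:bounds[z + 1]]             # zone-local slices only
        logits = keys @ queries[z] / np.sqrt(D)               # scaled dot-product scores
        w = np.exp(logits - logits.max())
        w /= w.sum()
        tokens[z] = w @ keys                                  # attention-weighted pooling
    return tokens

emb = np.random.default_rng(1).standard_normal((417, 1024))   # a 417-slice volume
tok = zone_compress(emb)                                      # -> (32, 1024) visual tokens
```

The key property is that the output size is fixed at 32 tokens regardless of slice count, which is what makes a 300-600-slice volume fit the LLM's context budget.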
JEP‑A bridge – The 32 visual tokens are linearly projected into the LLM’s embedding space (3072‑d) by a JEP‑A predictor. Because LLM token embeddings are highly anisotropic (average cosine similarity ≈ 0.95), the authors apply PCA whitening to obtain an isotropic 256‑d space, then scale the projection with a norm‑calibrator α that matches the magnitude of visual and textual embeddings. This whitening dramatically improves contrastive alignment (InfoNCE, MMD) and stabilizes training.
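The whitening operation itself is standard PCA whitening: center, project onto the top-k principal components, and rescale each component to unit variance. The sketch below uses toy dimensions (512-d embeddings whitened into 64-d) rather than the summary's 3072-d → 256-d setting, and measures anisotropy by average pairwise cosine similarity.

```python
import numpy as np

def mean_cos(X):
    """Average pairwise cosine similarity — a simple anisotropy probe."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = Xn @ Xn.T
    n = len(X)
    return (G.sum() - np.trace(G)) / (n * (n - 1))

def pca_whiten(X, k=256, eps=1e-6):
    """Center, project onto the top-k principal components, and divide
    each component by its standard deviation (unit variance output)."""
    Xc = X - X.mean(0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    std = s[:k] / np.sqrt(len(X) - 1)            # per-component std from singular values
    return Xc @ (Vt[:k].T / (std + eps))         # whitened k-dim embeddings

rng = np.random.default_rng(0)
shared = rng.standard_normal(512)                     # dominant shared direction
X = shared + 0.05 * rng.standard_normal((1000, 512))  # anisotropic toy embeddings
Z = pca_whiten(X, k=64)
# mean_cos(X) is near 1 (anisotropic); mean_cos(Z) is near 0 (isotropic)
```

After whitening, cosine-based contrastive objectives (InfoNCE) again discriminate between samples instead of collapsing onto the shared mean direction; the norm-calibrator α mentioned in the summary would then rescale these whitened vectors to match textual-embedding magnitudes.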
Positive‑findings‑only loss – In radiology reports, normal sentences dominate (≈ 90 % of tokens), causing gradient imbalance that leads to posterior collapse (the model ignores visual tokens). The authors therefore mask out loss contributions from normal tokens and train only on tokens that describe pathological findings. This strategy prevents collapse and enables stable generation over 15+ epochs, whereas prior methods diverge after 1–4 epochs.
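The masking is mechanically simple: compute token-level cross-entropy as usual, then zero out every token that belongs to a normal-findings sentence before averaging. A minimal sketch (the mask construction from sentence labels is assumed, not shown):

```python
import numpy as np

def masked_ce(logits, targets, finding_mask):
    """Cross-entropy over positive-finding tokens only.

    logits: (T, V) token logits; targets: (T,) gold token ids;
    finding_mask: (T,) bool, True for tokens in sentences describing
    pathological findings. Normal-text tokens contribute zero loss,
    so the ~90% normal majority cannot dominate the gradient."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))   # log-softmax
    nll = -logp[np.arange(len(targets)), targets]                   # per-token NLL
    m = finding_mask.astype(float)
    return (nll * m).sum() / max(m.sum(), 1.0)                      # mean over findings only

rng = np.random.default_rng(0)
logits = rng.standard_normal((10, 50))
targets = rng.integers(0, 50, 10)
mask = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], bool)  # only 3 finding tokens
loss = masked_ce(logits, targets, mask)
```

Normalizing by the number of *finding* tokens (not the sequence length) is what keeps the effective learning signal per pathological sentence constant regardless of how much normal boilerplate surrounds it.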
Warm bridge initialization – Weights learned in the first curriculum phase (visual‑to‑LLM projection and cross‑attention adapters) are copied forward to the subsequent classification‑fine‑tuning and report‑generation phases. Empirically, this yields an immediate macro F1 of 0.425 at epoch 1 versus 0.360 with a cold start, accelerating convergence and improving final performance.
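Operationally, warm bridge initialization is a selective, shape-checked state-dict copy between phases. The key prefixes below are illustrative, not the paper's actual parameter names:

```python
import numpy as np

def warm_bridge_init(prev_state, next_state, prefixes=("bridge.", "xattn.")):
    """Carry bridge / cross-attention weights from one curriculum phase
    into the next. Only parameters whose name matches a prefix AND
    whose shape matches the target are copied; everything else (e.g.
    a phase-specific head) keeps its fresh initialization."""
    copied = []
    for name, w in prev_state.items():
        tgt = next_state.get(name)
        if name.startswith(prefixes) and tgt is not None and tgt.shape == w.shape:
            next_state[name] = w.copy()
            copied.append(name)
    return copied

prev = {"bridge.proj": np.ones((4, 8)), "head.cls": np.ones((4, 2))}
nxt = {"bridge.proj": np.zeros((4, 8)), "head.report": np.zeros((4, 9))}
done = warm_bridge_init(prev, nxt)   # copies only "bridge.proj"
```

The shape check matters in practice: phase-specific heads (classification vs. generation) differ in shape and should start cold, while the alignment machinery carries over warm.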
Selective cross‑attention freezing with Elastic Weight Consolidation (EWC) – During the final narrative fine‑tuning, only LoRA adapters (r = 16) are updated to adapt the writing style, while the gated cross‑attention modules are frozen but penalized with an EWC quadratic term that protects parameters identified as important in earlier phases. This prevents catastrophic forgetting of pathology detection while allowing stylistic refinement.
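The EWC quadratic term has a standard closed form: 0.5·λ·Σᵢ Fᵢ(θᵢ − θ*ᵢ)², where Fᵢ is the diagonal Fisher information estimated in earlier phases and θ* are the parameter values to protect. A minimal sketch (parameter names are hypothetical):

```python
import numpy as np

def ewc_penalty(params, anchor, fisher, lam=1.0):
    """Elastic Weight Consolidation penalty: drifting away from the
    anchor is cheap for low-Fisher parameters and expensive for
    parameters the earlier phases identified as important."""
    total = 0.0
    for k in params:
        total += (fisher[k] * (params[k] - anchor[k]) ** 2).sum()
    return 0.5 * lam * total

anchor = {"xattn.gate": np.array([1.0, 2.0])}       # values learned in earlier phases
fisher = {"xattn.gate": np.array([10.0, 0.1])}      # first parameter is "important"
p_imp = ewc_penalty({"xattn.gate": np.array([1.5, 2.0])}, anchor, fisher)
p_min = ewc_penalty({"xattn.gate": np.array([1.0, 2.5])}, anchor, fisher)
# equal drift, but the high-Fisher parameter is penalized 100x harder
```

In the setup described above, this penalty would be applied to the (otherwise frozen) gated cross-attention parameters while the LoRA adapters (r = 16) absorb the stylistic adaptation.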
Experiments – Evaluation is performed on the CT‑RATE benchmark (2,984 validation volumes, 18 binary abnormality classes) using the official RadBERT label extraction pipeline. Ker‑VLJEPA‑3B achieves a macro F1 of 0.429, surpassing the previous state‑of‑the‑art U‑VLM (0.414) by 3.6 % in relative terms (1.5 points absolute); with per‑class threshold optimization the score rises to 0.448 (+8.2 % relative). Ablation studies show that visual tokens contribute 56.6 % of the overall generation quality and that their contribution to pathology‑specific words is roughly twice that to generic language. Additional ablations confirm the necessity of each component: removing zone‑constrained attention, PCA whitening, the positive‑findings‑only loss, warm bridge initialization, or EWC each degrades performance substantially.
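The summary does not specify how the threshold optimization is done; one simple realization, sketched here as an assumption, is an independent per-class grid search maximizing F1 on validation predictions:

```python
import numpy as np

def best_thresholds(probs, labels, grid=None):
    """Per-class decision thresholds maximizing F1 independently.

    probs: (N, C) predicted abnormality probabilities;
    labels: (N, C) binary ground truth. Returns one threshold per class."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    th = np.empty(probs.shape[1])
    for c in range(probs.shape[1]):
        best_f1, best_t = -1.0, 0.5
        for t in grid:
            pred = probs[:, c] >= t
            tp = (pred & (labels[:, c] == 1)).sum()
            fp = (pred & (labels[:, c] == 0)).sum()
            fn = (~pred & (labels[:, c] == 1)).sum()
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)   # F1 = 2TP / (2TP + FP + FN)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        th[c] = best_t
    return th

probs = np.array([[0.9, 0.2], [0.8, 0.9], [0.2, 0.8], [0.1, 0.3]])
labels = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
th = best_thresholds(probs, labels)   # per-class thresholds, toy example
```

Because macro F1 averages per-class scores, optimizing each class's threshold separately directly optimizes the macro metric; this is especially effective for rare classes whose optimal operating point sits far from 0.5.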
Discussion and impact – By demonstrating that a language‑free visual encoder can be successfully integrated into a powerful LLM through a carefully staged curriculum, the work resolves three major obstacles: (1) extreme sequence length, (2) severe class imbalance, and (3) posterior collapse. The zone‑constrained compression preserves anatomical localization while respecting the LLM’s context window. PCA whitening solves the anisotropy problem that has hampered prior contrastive alignment attempts. The positive‑findings‑only loss directly addresses the gradient domination of normal text. Warm bridge initialization and EWC‑guided freezing enable knowledge transfer across phases without forgetting.
Limitations and future work – The current system is specialized to thoracic CT; extending to other body regions, modalities (e.g., MRI, PET), or multi‑modal inputs (genomics, lab results) will test the claimed modality‑agnostic nature. Dynamic token allocation instead of a fixed 32‑token budget could further improve representation of large or diffuse lesions. End‑to‑end training that jointly learns the label extraction step rather than relying on a separate RadBERT classifier is another promising direction. Finally, inference speed and memory footprint need optimization before clinical deployment.
Conclusion – Ker‑VLJEPA‑3B sets a new benchmark for 3D CT report generation, establishing a robust, modular pipeline that first learns pure visual features, then aligns them to language through a series of principled curriculum stages. The approach not only advances automated radiology reporting but also offers a general blueprint for integrating any self‑supervised visual (or non‑visual) foundation model with large language models in a controlled, data‑efficient manner.