CT-AGRG: Automated Abnormality-Guided Report Generation from 3D Chest CT Volumes


The rapid growth in the number of computed tomography (CT) scans, combined with their time-consuming manual analysis, has created an urgent need for robust automated analysis techniques in clinical settings. These techniques aim to assist radiologists in managing their growing workload. Existing methods typically generate entire reports directly from 3D CT images without explicitly focusing on observed abnormalities. This unguided approach often results in repetitive content or incomplete reports, failing to prioritize anomaly-specific descriptions. We propose a new anomaly-guided report generation model, which first predicts abnormalities and then generates a targeted description for each. Evaluation on a public dataset demonstrates significant improvements in report quality and clinical relevance. We extend our work with an ablation study demonstrating the effectiveness of each component.


💡 Research Summary

The paper introduces CT‑AGRG, a novel two‑stage framework for automated report generation from three‑dimensional chest CT volumes that explicitly guides the textual output with predicted abnormalities. In the first stage, a visual feature extractor—either the convolution‑based CT‑Net or the transformer‑based CT‑ViT—is used to encode the entire CT volume into a 2048‑dimensional global embedding. This embedding is then fed into 18 parallel projection heads, each followed by a binary classification head dedicated to one of the 18 predefined abnormality labels. By training these heads in a multi‑task fashion, the model learns label‑specific 1024‑dimensional embeddings (h_i) that capture the visual characteristics of each potential finding.
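The stage-1 head structure described above can be sketched as follows. The dimensions (a 2048-dimensional global embedding, 18 labels, 1024-dimensional per-label embeddings h_i) come from the summary; the exact layer composition of each projection and classification head is an assumption, shown here as single linear layers.

```python
import torch
import torch.nn as nn

class AbnormalityHeads(nn.Module):
    """18 parallel projection + binary classification heads (hypothetical layout)."""

    def __init__(self, global_dim=2048, label_dim=1024, num_labels=18):
        super().__init__()
        # One projection head per abnormality label, producing h_i
        self.projections = nn.ModuleList(
            nn.Linear(global_dim, label_dim) for _ in range(num_labels)
        )
        # One binary classification head per label
        self.classifiers = nn.ModuleList(
            nn.Linear(label_dim, 1) for _ in range(num_labels)
        )

    def forward(self, global_emb):
        # global_emb: (batch, 2048) from the CT-Net or CT-ViT encoder
        h = [proj(global_emb) for proj in self.projections]  # 18 x (batch, 1024)
        logits = torch.cat(
            [clf(h_i) for clf, h_i in zip(self.classifiers, h)], dim=-1
        )  # (batch, 18)
        return torch.stack(h, dim=1), logits  # (batch, 18, 1024), (batch, 18)
```

Training all 18 heads jointly against the binary labels is what specializes each h_i to its abnormality, even though they share one global embedding.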

During inference, each classification head produces a probability score; a label is considered abnormal if its score exceeds a per-label threshold selected to maximize F1 on the validation set. For every abnormal label, the corresponding h_i is placed in its slot of an 18 × 1024 “multi‑abnormality” vector, with all other slots zero‑padded. A lightweight MLP (Φ_T) projects this vector into the textual latent space, yielding a 1024‑dimensional token e_i that serves as the conditioning signal for the language model.
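The thresholding and zero-padded projection step might look like the sketch below. The 18 × 1024 slot layout and the 1024-dimensional output token follow the summary; the hidden width of Φ_T (2048 here) is an assumption, since the text only calls it a lightweight MLP.

```python
import torch
import torch.nn as nn

NUM_LABELS, LABEL_DIM, TOKEN_DIM = 18, 1024, 1024

# Phi_T: maps the flattened 18x1024 multi-abnormality vector to a token e_i.
# Hidden size 2048 is a placeholder, not taken from the paper.
phi_t = nn.Sequential(
    nn.Linear(NUM_LABELS * LABEL_DIM, 2048),
    nn.ReLU(),
    nn.Linear(2048, TOKEN_DIM),
)

def conditioning_tokens(h, probs, thresholds):
    # h: (18, 1024) label embeddings; probs: (18,) sigmoid scores;
    # thresholds: (18,) per-label F1-maximizing cutoffs from validation.
    tokens = []
    for i in range(NUM_LABELS):
        if probs[i] > thresholds[i]:
            # Keep h_i in its own slot; zero-pad every other label's slot.
            padded = torch.zeros(NUM_LABELS, LABEL_DIM)
            padded[i] = h[i]
            tokens.append(phi_t(padded.flatten()))  # e_i: (1024,)
    return tokens
```

One token e_i is produced per detected abnormality, so the language model is conditioned on each finding separately rather than on the whole volume at once.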

The second stage employs a GPT‑2 Medium model pre‑trained on PubMed abstracts. Instead of standard self‑attention, the authors use pseudo self‑attention, which injects e_i as additional key and value vectors while the queries come only from the generated token sequence. This mechanism directly couples the visual abnormality representation with the language generation process, allowing the model to produce a concise, abnormality‑specific sentence for each detected finding. The language model is fine‑tuned with a next‑token prediction loss while the visual encoder remains frozen. At test time, the model generates one sentence per abnormality and concatenates them to form the final report.
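The key/value injection can be illustrated with a single-head, mask-free sketch: the conditioning token is projected through its own key/value matrices and prepended, so text queries can attend to the visual representation. This is an illustrative simplification (real GPT-2 attention is multi-head and causally masked), and the separate projection matrices `w_ek`/`w_ev` are assumptions in the spirit of pseudo self-attention.

```python
import torch
import torch.nn.functional as F

def pseudo_self_attention(x, e, w_q, w_k, w_v, w_ek, w_ev):
    # x: (seq, d) hidden states of generated tokens; e: (1, d) conditioning token e_i.
    q = x @ w_q                                   # queries from text only
    k = torch.cat([e @ w_ek, x @ w_k], dim=0)     # keys: [visual token, text]
    v = torch.cat([e @ w_ev, x @ w_v], dim=0)     # values: [visual token, text]
    attn = F.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)  # (seq, seq + 1)
    return attn @ v                               # (seq, d)
```

Because only k and v are extended, the output keeps the text sequence length, which lets the conditioning be dropped into a pretrained self-attention layer with minimal new parameters.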

Experiments are conducted on the public CT‑RATE dataset, which contains 34,781 training volumes, 3,075 validation volumes, and 3,039 test volumes, each annotated with 18 abnormality types extracted via the RadBERT labeler. Volumes are standardized to 240 × 480 × 480 voxels, clipped to Hounsfield units between –1000 and +200, and normalized to a fixed intensity range before being fed to the encoder.
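A minimal sketch of this intensity preprocessing is given below. The [–1000, +200] HU clipping window comes from the summary; the normalization target is not stated there, so scaling the clipped range to [0, 1] is an assumption.

```python
import numpy as np

def preprocess(volume):
    # volume: raw CT volume in Hounsfield units (HU).
    # Clip to the [-1000, +200] HU window stated in the summary.
    clipped = np.clip(volume, -1000.0, 200.0)
    # Scale the clipped range to [0, 1]; this target range is an assumption,
    # since the summary does not specify the normalization interval.
    return (clipped + 1000.0) / 1200.0
```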

