Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks
The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing generative methods produce images with limited diversity and controllability, suffering from monotonous texture features and distorted anatomical structures. We therefore propose a two-stage generative adversarial network (TSGAN) that enhances the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN generates semantic segmentation masks that encode lung nodules and tissue backgrounds, controlling the anatomical structure of the synthesized images. In the second stage, the DL-Pix2Pix model translates the mask into a CT image, employing local importance attention to capture local features and dynamically weighted multi-head window attention to strengthen the modeling of nodule texture and background. Compared with training on the original dataset alone, detection accuracy improves by 4.6 percentage points and mAP by 4 percentage points on the LUNA16 dataset. Experimental results demonstrate that TSGAN improves both the quality of synthetic images and the performance of detection models.
💡 Research Summary
Lung cancer remains one of the most lethal malignancies worldwide, and early detection via CT screening dramatically improves survival rates. However, deep‑learning based computer‑aided diagnosis (CAD) systems for nodule detection suffer from a chronic shortage of annotated CT data. Conventional data augmentation (rotation, scaling, flipping) only marginally expands the dataset and cannot address the fundamental lack of diversity in nodule size, shape, location, and surrounding tissue context. Recent attempts to use generative adversarial networks (GANs) for synthetic nodule generation have shown promise but still exhibit two critical shortcomings: (1) limited variability in nodule placement and background anatomy, and (2) poor controllability over morphological features, often leading to unrealistic textures or distorted anatomical structures.
To overcome these issues, the authors propose a Two‑Stage Generative Adversarial Network (TSGAN) that explicitly decouples the generation of anatomical structure from the synthesis of texture. In Stage 1, a StyleGAN model is trained to produce semantic segmentation masks of size 512 × 512. Each pixel in the mask is assigned one of six labels (background, lung parenchyma, left lung, right lung, airway, and nodule), thereby encoding precise spatial priors. By sampling the latent vector z ∼ N(0, I) and mapping it through the StyleGAN mapping network to the intermediate space w, the system can generate a wide variety of realistic nodule shapes and positions while preserving the overall lung anatomy. The StyleGAN loss incorporates the standard adversarial term, a gradient‑penalty regularizer, and a drift penalty to stabilize training.
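The latent sampling and mapping step described above can be sketched in a few lines. This is a minimal NumPy illustration of the StyleGAN idea of mapping z ∼ N(0, I) through an 8-layer MLP into the intermediate space w; the layer count follows the original StyleGAN design, while the dimensions and weight initialization here are illustrative assumptions, not values reported by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration; the paper does not state them here.
Z_DIM, W_DIM, N_LAYERS = 512, 512, 8

# Mapping network f: z -> w, an 8-layer MLP as in StyleGAN.
weights = [rng.normal(0.0, 0.02, (Z_DIM, W_DIM)) for _ in range(N_LAYERS)]

def mapping(z):
    # Pixel-norm on z before the MLP, as in StyleGAN.
    h = z / np.sqrt((z ** 2).mean() + 1e-8)
    for W in weights:
        pre = h @ W
        h = np.maximum(0.2 * pre, pre)  # leaky ReLU, slope 0.2
    return h

z = rng.normal(size=Z_DIM)   # z ~ N(0, I)
w = mapping(z)               # intermediate latent controlling mask style
```

Sampling different z vectors and decoding the resulting w through the synthesis network is what yields varied nodule shapes and positions while the mask labels keep the overall lung anatomy consistent.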
Stage 2 translates the generated mask into a realistic CT image using a modified DL‑Pix2Pix model. The base Pix2Pix architecture is enhanced with two attention mechanisms: (i) Local Importance‑based Attention (LIA) inserted into the UNet skip connections, and (ii) Dynamically Weighted Multi‑Head Window Attention (DWMH) placed in the bottleneck. LIA employs a dynamic soft‑pooling layer, a heat‑map generation branch, and a channel‑gating path to emphasize salient local features (e.g., nodule edges) while keeping computational cost low. DWMH partitions the feature map into non‑overlapping windows, computes multi‑head self‑attention within each window, and learns per‑head scaling factors γ that are initialized to zero to prevent early training instability. This combination allows the generator to capture both fine‑grained texture details and global contextual consistency.
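The zero-initialized per-head scaling described for DWMH can be made concrete with a small NumPy sketch of windowed multi-head self-attention. The window size, head count, and identity projections below are simplifying assumptions for illustration; the key property shown is that zero-initialized γ makes the module an identity mapping at the start of training, which is what prevents early instability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_mhsa(x, win=4, heads=2, gamma=None):
    """Windowed multi-head self-attention with per-head residual scaling.
    x: (H, W, C) feature map; gamma: (heads,) factors, zero-init as in DWMH."""
    H, W, C = x.shape
    d = C // heads
    if gamma is None:
        gamma = np.zeros(heads)  # zero-init -> module starts as an identity
    out = np.zeros_like(x)
    for i in range(0, H, win):           # non-overlapping windows
        for j in range(0, W, win):
            patch = x[i:i+win, j:j+win].reshape(-1, C)   # (win*win, C)
            for h in range(heads):
                # Identity q/k/v projections to keep the sketch short.
                q = k = v = patch[:, h*d:(h+1)*d]
                att = softmax(q @ k.T / np.sqrt(d))       # attention within window
                out[i:i+win, j:j+win, h*d:(h+1)*d] = (
                    gamma[h] * (att @ v)).reshape(win, win, d)
    return x + out  # residual: with gamma = 0, output equals input

x = np.random.default_rng(1).normal(size=(8, 8, 4))
y = window_mhsa(x)
```

As γ is learned away from zero during training, each head gradually contributes attention-weighted context, letting the network decide per head how much global modeling to mix into the local features.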
The overall loss for the DL‑Pix2Pix generator is a weighted sum of (1) an adversarial loss, (2) an L1 pixel‑wise reconstruction loss, and (3) a perceptual loss computed from a pre‑trained VGG‑19 network. The L1 term preserves high‑frequency details, while the perceptual term enforces semantic similarity between the synthesized CT and the ground‑truth image.
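The weighted sum above can be sketched directly. In this NumPy illustration the adversarial term is the non-saturating logistic loss on discriminator logits, and the perceptual term is an L1 distance on stand-in feature vectors (a real implementation would extract them from a pre-trained VGG-19); the λ weights are common Pix2Pix-style defaults, not values stated by the authors.

```python
import numpy as np

def generator_loss(fake_logits, fake_img, real_img, feat_fake, feat_real,
                   lam_l1=100.0, lam_perc=10.0):
    """Weighted DL-Pix2Pix generator objective (sketch; lambdas assumed).
    adversarial: non-saturating loss on D's logits for generated images
    l1: pixel-wise reconstruction; perceptual: L1 on VGG-style features."""
    adv = np.mean(np.logaddexp(0.0, -fake_logits))  # softplus(-logit) = -log sigmoid
    l1 = np.mean(np.abs(fake_img - real_img))       # preserves high-frequency detail
    perc = np.mean(np.abs(feat_fake - feat_real))   # enforces semantic similarity
    return adv + lam_l1 * l1 + lam_perc * perc
```

With a perfect reconstruction (identical images and features) the loss reduces to the adversarial term alone, which is the behavior the weighting is meant to produce.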
Experiments are conducted on the LUNA16 dataset, which contains 888 CT scans (1,186 slices after lung parenchyma extraction). The authors preprocess the data by normalizing intensities, extracting lung fields, and converting mask‑image pairs to COCO format for downstream detection. Training uses a 4:1 train‑test split, with StyleGAN implemented in TensorFlow and DL‑Pix2Pix in PyTorch, running on an RTX 4070 Super GPU. Quantitative evaluation shows that augmenting the training set with TSGAN‑generated images improves nodule detection accuracy by 4.6 percentage points and mean Average Precision (mAP) by 4 percentage points compared with training on the original dataset alone. Visual inspection confirms that the synthetic nodules exhibit realistic shapes, appropriate placement within the lung anatomy, and high‑fidelity texture that blends seamlessly with surrounding tissue.
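The preprocessing and split described above can be sketched as follows. The Hounsfield-unit window bounds here are common lung-window defaults, not values stated in the paper, and the random seed is arbitrary; the split sizes follow from the 4:1 ratio over the 1,186 extracted slices.

```python
import numpy as np

def normalize_ct(hu, lo=-1000.0, hi=400.0):
    """Clip CT Hounsfield units to a lung window and rescale to [0, 1].
    The window bounds are common defaults, not values from the paper."""
    hu = np.clip(np.asarray(hu, dtype=float), lo, hi)
    return (hu - lo) / (hi - lo)

# 4:1 train-test split over the 1,186 slices obtained after lung extraction.
rng = np.random.default_rng(42)
idx = rng.permutation(1186)
k = int(0.8 * len(idx))                 # 4:1 -> 948 train / 238 test
train_idx, test_idx = idx[:k], idx[k:]
```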
Key contributions of the work are: (1) a mask‑driven approach that provides explicit spatial control over nodule morphology and location, (2) the integration of LIA and DWMH attention modules that jointly enhance local detail preservation and global context modeling, and (3) a two‑stage pipeline that achieves higher diversity and controllability than previous single‑stage 3D GAN methods while maintaining reasonable computational efficiency. Limitations include the reliance on 2‑D slice generation, which does not guarantee inter‑slice consistency, and the computational cost of training StyleGAN for high‑resolution mask synthesis. Future directions suggested by the authors involve extending the framework to full 3‑D mask‑to‑volume generation and exploring conditional latent space manipulation to tailor synthetic nodules to specific clinical attributes (e.g., malignancy risk, texture patterns). Overall, TSGAN represents a significant step toward alleviating data scarcity in lung nodule detection and demonstrates how careful architectural design can yield both diverse and controllable medical image synthesis.