MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering


Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce counterfactual training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch where the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal against 20% (PnP) and 10% (h-Edit). On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is available at https://github.com/phamtrongthang123/medsteer


💡 Research Summary

MedSteer addresses a critical gap in medical image augmentation: existing diffusion‑based methods either re‑prompt, which completely rewrites the generation trajectory and changes anatomy, texture, and background, or rely on DDIM inversion, which introduces reconstruction errors and structural drift. Both approaches fail to produce true counterfactual pairs where only a specific pathology changes while all other visual factors remain identical.

The proposed framework is training‑free and operates entirely on a frozen diffusion transformer (DiT). First, a pathology vector is estimated offline from a set of contrastive text prompts (e.g., “dyed lifted polyp” vs. “polyp”). For each prompt pair, multiple forward passes are performed using a set of random seeds shared between the positive and negative prompts; the cross‑attention (CA) activations hₗ,ₜ are averaged across seeds and spatial tokens for each prompt. The L2‑normalized difference of the two averages yields a unit vector vₗ,ₜ for each transformer layer l and denoising timestep t. This vector captures the semantic shift associated with the target pathology and is reused for all subsequent inferences at no extra cost.
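The averaging-and-normalization step above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, tensor layout `(num_seeds, num_tokens, dim)`, and variable names are assumptions.

```python
import torch

def estimate_pathology_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Estimate a unit pathology vector v_{l,t} for one (layer, timestep) slot.

    acts_pos / acts_neg: cross-attention activations h_{l,t} of shape
    (num_seeds, num_tokens, dim), collected under the positive and
    negative prompts with shared random seeds.
    """
    # Average across seeds (dim 0) and spatial tokens (dim 1).
    mu_pos = acts_pos.mean(dim=(0, 1))   # (dim,)
    mu_neg = acts_neg.mean(dim=(0, 1))   # (dim,)
    # L2-normalized difference gives the unit direction v_{l,t}.
    diff = mu_pos - mu_neg
    return diff / diff.norm()
```

One such vector would be estimated per layer and timestep, which keeps the offline cost modest since no gradients or fine-tuning are involved.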

During inference, two image generation branches share the same noise seed and the same positive prompt. One branch runs the frozen DiT unchanged (unsteered). The other branch applies Spatially Selective Pathology Steering (SSPS) at a predefined layer window (empirically layers 8‑16) and at every denoising step. For each token, a cosine‑similarity score σₗ,ₜ = max(⟨hₗ,ₜ, vₗ,ₜ⟩, 0) is computed; only tokens positively aligned with the pathology vector are modified. The activation update h′ₗ,ₜ = hₗ,ₜ − α σₗ,ₜ vₗ,ₜ subtracts the pathology‑aligned component scaled by a strength parameter α, leaving orthogonal components (anatomy, texture, viewpoint) untouched. Because both branches start from the identical seed, any difference between the final images is guaranteed to stem solely from the steering operation, yielding a true minimal‑edit counterfactual.
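The SSPS update described above can be written as a short tensor operation. The sketch below assumes a `(num_tokens, dim)` activation layout and an already-normalized pathology vector; the function name is illustrative, not the paper's API.

```python
import torch

def ssps_update(h: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spatially Selective Pathology Steering at one layer/timestep.

    h: (num_tokens, dim) cross-attention activations h_{l,t}
    v: (dim,) unit pathology vector v_{l,t}
    alpha: steering strength (the summary reports a sweet spot near 2.5)
    """
    # Per-token alignment score sigma_{l,t} = max(<h, v>, 0):
    # only tokens positively aligned with the pathology vector move.
    sigma = torch.clamp(h @ v, min=0.0)              # (num_tokens,)
    # Subtract the pathology-aligned component; components orthogonal
    # to v (anatomy, texture, viewpoint) are left untouched.
    return h - alpha * sigma.unsqueeze(-1) * v
```

Because the score is clamped at zero, tokens that are unaligned or anti-aligned with the pathology direction pass through unchanged, which is what confines the edit to lesion-relevant regions.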

Experiments were conducted on Kvasir v3 (8k endoscopic images, 8 classes) for training, vector construction, and evaluation, and on the held‑out HyperKvasir set for downstream testing. Three clinical concept pairs were examined: (A) Polyp ↔ Normal Cecum, (B) Ulcerative Colitis ↔ Normal Cecum, and (C) Esophagitis ↔ Normal Z‑line. MedSteer achieved flip rates of 0.800, 0.925, and 0.950 respectively, with background preservation metrics (Bg‑LPIPS, Bg‑SSIM, Bg‑PSNR) markedly better than the inversion‑based baselines (Plug‑and‑Play and h‑Edit). The method also excelled at dye disentanglement: when steering “dyed lifted polyp” → “polyp”, the dye detection rate dropped to 0.250 (75% dye removal), far surpassing PnP (0.800) and h‑Edit (0.900).

A downstream polyp detection study demonstrated practical impact. Synthetic counterfactual pairs generated by MedSteer (1 k images per condition) were used to augment training of ConvNeXt and Vision‑Transformer classifiers. The ViT model trained with MedSteer‑augmented data reached an AUC of 0.9755 on the out‑of‑distribution HyperKvasir test set, compared to 0.9083 for quantity‑matched re‑prompting and ~0.95 for the best inversion‑based method. This confirms that preserving anatomical consistency while altering only the pathology provides highly informative training signals.

Ablation analyses revealed that steering layers 8‑16 correspond to the semantic formation stage of the diffusion process; steering outside this window yields near‑zero flips. The steering strength α exhibits a sweet spot around 2.5 (flip ≈ 80 %); larger α causes over‑steering and degrades background fidelity. The pathology vector stabilizes after roughly 50 random seeds, indicating modest offline computation. Importantly, the per‑token σ scores can be visualized as spatial maps, offering built‑in interpretability: early timesteps show broad activation footprints, while later steps focus on compact lesion regions.
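The token-level interpretability mentioned above amounts to reshaping the per-token σ scores back onto the latent token grid. A minimal sketch, assuming a square token grid and the same activation layout as before (names are illustrative):

```python
import torch

def sigma_map(h: torch.Tensor, v: torch.Tensor, grid_hw: tuple[int, int]) -> torch.Tensor:
    """Turn per-token steering scores into a spatial map for inspection.

    h: (num_tokens, dim) activations; v: (dim,) unit pathology vector;
    grid_hw: (H, W) token-grid shape with H * W == num_tokens.
    """
    # Same score used by the steering update: max(<h, v>, 0).
    sigma = torch.clamp(h @ v, min=0.0)
    # Reshape the flat token scores onto the (H, W) latent grid; this
    # map can be upsampled and overlaid on the image as a heatmap.
    return sigma.reshape(grid_hw)
```

Plotting these maps per timestep would reproduce the qualitative behavior described above: diffuse footprints early in denoising, compact lesion-focused regions later.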

In summary, MedSteer provides a simple yet powerful recipe for generating medically plausible counterfactual images without any model fine‑tuning, source images, or pixel‑level masks. By leveraging the cross‑attention space of a frozen diffusion transformer, it isolates and removes the semantic component of a target pathology while leaving all other visual factors untouched. The resulting paired data improve downstream diagnostic models and furnish transparent, token‑level explanations of where and when the model intervenes. This training‑free, architecture‑agnostic approach holds promise for a wide range of medical imaging modalities where labeled data are scarce and structural fidelity is paramount.

