Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method supports causal interventions on target attributes and consistently propagates their effects to causal dependents while preserving the core identity of the image. Unlike prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling with two attribute-regularization strategies: (i) prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and (ii) a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, including up to a 91% reduction in MAE on Pendulum for accurate attribute control and up to an 87% reduction in FID on ADNI for high-fidelity MRI generation. These results demonstrate robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation. Code and models will be released at: https://leitong02.github.io/causaladapter/.


💡 Research Summary

Causal‑Adapter introduces a modular framework that equips frozen text‑to‑image diffusion models with explicit structural causal modeling to enable faithful counterfactual image generation. The authors start by assuming a known causal graph over a set of semantic attributes (e.g., age, gender, beard, baldness) and represent this graph with a binary adjacency matrix A. For each attribute y_i they learn a nonlinear additive‑noise mechanism f_i such that ŷ_i = f_i(A_i ⊙ Y; ω_i) + u_i, where u_i ∼ N(0,σ_i²). This formulation allows direct do‑interventions on any attribute by simply replacing the corresponding token embedding, without needing to infer latent exogenous variables from images.
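The additive-noise mechanism above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: each `f_i` is assumed to be an arbitrary callable, `A[i]` is the i-th row of the adjacency matrix (selecting the parents of y_i), and the function names are hypothetical.

```python
import numpy as np

def causal_mechanism(Y, A, f_list, sigma, rng):
    """Additive-noise SCM: y_hat_i = f_i(A_i ⊙ Y; ω_i) + u_i, with u_i ~ N(0, σ_i²).

    Y      : (n,) observed attribute values
    A      : (n, n) binary adjacency matrix; A[i, j] = 1 if y_j is a parent of y_i
    f_list : list of n callables, one nonlinear mechanism per attribute
    sigma  : (n,) exogenous noise scales
    rng    : numpy Generator for sampling the exogenous noise
    """
    n = len(Y)
    y_hat = np.zeros(n)
    for i in range(n):
        parents = A[i] * Y                            # mask out non-parent attributes
        y_hat[i] = f_list[i](parents) + rng.normal(0.0, sigma[i])
    return y_hat
```

A do-intervention in this formulation amounts to overwriting the chosen entry of `Y` and re-evaluating only its descendants, which is why no per-image inference of exogenous variables is needed.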

To inject these causal signals into a pretrained diffusion backbone (e.g., Stable‑Diffusion) without altering its parameters, the authors design a lightweight “Causal‑Adapter” (ε_ψ) that runs in parallel with the frozen UNet denoiser (ε_θ). The adapter receives the image latent z_t and the text embeddings V, and modifies the cross‑attention queries/keys (QK) through two regularization strategies:

  1. Prompt‑Aligned Injection (PAI) – causal attribute embeddings ŷ_i are concatenated with the textual token sequence, aligning semantic attributes with spatial features in the diffusion latents. This ensures that an intervention such as “do(age=young)” propagates consistently to downstream attributes dictated by the causal graph (e.g., beard disappears for a young female).

  2. Conditioned Token Contrastive loss (CTC) – a contrastive objective applied at the token level forces embeddings of intervened tokens to diverge across different interventions while keeping non‑intervened token embeddings stable. This reduces spurious correlations and improves disentanglement.
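The two regularizers can be caricatured as follows. This is a hedged sketch using a cosine-similarity surrogate, not the paper's exact objective: the function names, the similarity measure, and the loss weighting are illustrative assumptions.

```python
import numpy as np

def prompt_aligned_injection(text_tokens, causal_tokens):
    """PAI (illustrative): append causal attribute embeddings to the text
    token sequence, so cross-attention sees them alongside the prompt."""
    return np.concatenate([text_tokens, causal_tokens], axis=0)

def ctc_loss(tokens_a, tokens_b, intervened):
    """CTC surrogate (illustrative): given token embeddings from two different
    interventions, push intervened tokens apart and keep the rest stable.

    tokens_a, tokens_b : (T, d) token embeddings under intervention A / B
    intervened         : (T,) boolean mask of intervened token positions
    """
    sim = np.sum(tokens_a * tokens_b, axis=1) / (
        np.linalg.norm(tokens_a, axis=1) * np.linalg.norm(tokens_b, axis=1) + 1e-8)
    push_apart = sim[intervened].mean()        # minimized: intervened tokens diverge
    keep_close = (1.0 - sim[~intervened]).mean()  # minimized: others stay put
    return push_apart + keep_close
```

The key design point is that the contrastive pressure is applied per token rather than per image, which is what lets the loss target spurious attribute correlations directly.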

Training optimizes three losses jointly: (i) the standard diffusion denoising loss L_DM, (ii) a negative log‑likelihood L_NLL for the causal mechanisms, and (iii) the contrastive loss L_CTC, weighted by hyper‑parameters. The overall loss is L = λ_DM L_DM + λ_NLL L_NLL + λ_CTC L_CTC. An optional Attention Guidance (AG) module can further sharpen the cross‑attention maps of intervened tokens to achieve localized edits while preserving identity‑critical tokens.
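The combined objective is a straightforward weighted sum; the weight values below are placeholders, not the paper's hyper-parameters:

```python
def total_loss(l_dm, l_nll, l_ctc, lam_dm=1.0, lam_nll=0.1, lam_ctc=0.1):
    """L = λ_DM·L_DM + λ_NLL·L_NLL + λ_CTC·L_CTC (weights are illustrative)."""
    return lam_dm * l_dm + lam_nll * l_nll + lam_ctc * l_ctc
```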

During inference, a user specifies a do‑intervention on any attribute: the corresponding token embedding is replaced, and the adapted diffusion process is re‑run from the abducted noise z* obtained from the original image via DDIM inversion, generating a counterfactual image x̄. The method thus produces images that reflect the causal effect of the intervention while keeping all non‑intervened aspects (including fine‑grained identity features) unchanged.
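The inference procedure mirrors the classical abduction/action/prediction recipe for counterfactuals. A hedged orchestration sketch, with `ddim_invert`, `intervene_fn`, and `ddim_sample` as stand-ins for the actual inversion, token-replacement, and adapted sampling routines (all names are hypothetical):

```python
def counterfactual_edit(x, tokens, intervene_fn, ddim_invert, ddim_sample):
    """Counterfactual generation in three steps:
    1) abduction : recover the noise z* of the factual image via DDIM inversion,
    2) action    : replace the intervened attribute's token embedding,
    3) prediction: re-run the adapted diffusion process from z*.
    """
    z_star = ddim_invert(x, tokens)         # abduction under the factual condition
    tokens_cf = intervene_fn(tokens)        # do(attribute = value) on the tokens
    return ddim_sample(z_star, tokens_cf)   # generate the counterfactual image
```

Because the same z* seeds both the factual reconstruction and the counterfactual, everything not touched by the intervention (or its causal descendants) is preserved by construction.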

Empirical evaluation spans the synthetic Pendulum dataset and two real‑world datasets: CelebA (human faces) and ADNI (brain MRI). On Pendulum, attribute‑control error (MAE) drops by up to 91% relative to prompt‑only baselines. On CelebA, the model reduces LPIPS by up to 86% with only a 4% increase in CLD (identity drift), demonstrating high fidelity and minimal unintended change. On ADNI, the approach cuts FID by up to 87% for high‑fidelity MRI generation and MAE by 50% for volumetric edits. Qualitative results show that interventions respecting the causal graph (e.g., changing gender also updates beard and baldness appropriately) are faithfully rendered, whereas prior methods either fail to propagate changes or introduce unrelated artifacts.

The paper’s contributions are threefold: (1) a plug‑and‑play adapter that enables causal conditioning of large diffusion models without retraining the backbone, (2) two novel regularization mechanisms (PAI and CTC) that align causal semantics with textual embeddings and suppress spurious correlations, and (3) extensive quantitative and qualitative validation demonstrating state‑of‑the‑art performance on both synthetic and high‑stakes medical imaging tasks. The authors discuss future directions such as learning the causal graph automatically, extending to multi‑step interventions, and applying the framework to text‑to‑video diffusion models.

