CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation


Mahmoud Ibrahim (1,2,3,*), Bart Elen (3), Chang Sun (1,2), Gökhan Ertaylan (3), and Michel Dumontier (1,2)

(1) Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
(2) Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
(3) VITO, Belgium
(*) Corresponding author: mahmoud.ibrahim@vito.be

Abstract. Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that the architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

Keywords: Compositional Demographic Conditioning · Fair Medical Image Synthesis · Intersectional Bias and Zero-Shot Generalization.

1 Introduction

Diffusion models have emerged as powerful tools for medical image synthesis, offering promising solutions to data scarcity [1-3]. Text-to-image models enable generation of synthetic datasets conditioned on clinical findings, with growing applications in training and augmenting diagnostic AI systems [4, 5]. A compelling use case is addressing demographic imbalance: synthetic data could supplement underrepresented populations to train fairer classifiers [6, 7].

However, a fundamental question is often overlooked: do generative models themselves produce equally high-quality images across demographic groups? When trained on imbalanced data, diffusion models can achieve strong average fidelity while producing degraded samples for rare subgroups; for some demographic intersections, training examples may be absent altogether. A dataset may contain elderly patients, Asian patients, and female patients, yet have zero examples at the intersection of all three with a specific pathology. No amount of oversampling, reweighting, or balanced mini-batching can address groups that do not exist in the training data. We refer to this as the imbalanced generator problem.
FairDiffusion [8] is among the first methods to explicitly address fair synthetic data generation. It takes a significant step by introducing Fair Bayesian Perturbation, which adaptively reweights the training loss to equalize learning across subgroups. However, this optimization-level approach does not address how demographics are represented: it relies on implicit encoding within text prompts, where demographic tokens compete for CLIP's [9] limited 77-token budget, and rare intersectional combinations lack sufficient training signal regardless of loss weighting. Critically, reweighting, like all data-level strategies, cannot generate learning signal for combinations the model has never observed.

We propose CompDiff, which addresses the imbalanced generator problem at the representation level. Our key insight is that demographic identity is compositional: a rare intersection such as "80+ Asian female" can be composed from well-learned single-attribute embeddings and moderately learned pairwise interactions, enabling generalization even to combinations entirely absent from training, analogous to how language models compose known words into novel sentences. CompDiff introduces a Hierarchical Conditioner Network (HCN) that explicitly models demographic attribute interactions, producing a dedicated demographic token concatenated with clinical text embeddings. Where FairDiffusion asks "how much to weight each sample," CompDiff asks "how to represent demographics so that unseen combinations can be composed from seen ones." This compositional structure facilitates better zero-shot generalization to unseen demographic intersections: a capability that data-level and optimization-level methods are unlikely to provide without structural inductive bias.
Through extensive experiments on chest X-rays (MIMIC-CXR [20]) and fundus images (FairGenMed [8]), we demonstrate that CompDiff outperforms both standard baselines and FairDiffusion across image quality, demographic fairness, and downstream utility.

2 Our Proposed Method

2.1 Overview

Standard diffusion models encode demographics within the text prompt, forcing demographic tokens to compete with clinical tokens in a shared embedding space. We introduce CompDiff, which processes demographic attributes separately through a dedicated Hierarchical Conditioner Network (HCN), producing a demographic token concatenated to CLIP embeddings as cross-attention context. Formally, clinical findings are encoded using CLIP, producing E_text ∈ R^(B×77×d_ctx), while demographic attributes (age, sex, race) are processed separately through the HCN, outputting c ∈ R^(B×1×d_ctx); we concatenate E_combined = [E_text, c] ∈ R^(B×78×d_ctx) as cross-attention context for the diffusion UNet.

2.2 Hierarchical Conditioner Network

The HCN introduces structured inductive bias by decomposing demographic conditioning into hierarchical components: single-attribute embeddings, pairwise interactions, and full composition.

Single-attribute embeddings ("grandparents"). Each demographic attribute x_v is embedded into a shared latent space e_v = Embed_v(x_v) of dimension d_node. For age a, sex s, and race r: e_a, e_s, e_r ∈ R^(d_node).

Pairwise interactions ("parents"). To capture non-additive relationships between attributes, we model all pairwise interactions using dedicated MLPs:

    h_{a,s} = f_{a,s}([e_a, e_s]),  h_{a,r} = f_{a,r}([e_a, e_r]),  h_{s,r} = f_{s,r}([e_s, e_r]).  (1)

We restrict the hierarchy to pairwise interactions to balance expressivity against overfitting on rare subgroups.
Full composition ("child"). The final demographic representation is obtained by combining the pairwise interactions through an MLP g(·):

    h_demo = g([h_{a,s}, h_{a,r}, h_{s,r}]).  (2)

This structured factorization encourages parameter sharing across subgroups and improves data efficiency for rare intersections. h_demo is then mapped to a diagonal Gaussian, (μ, log σ) = Linear(h_demo), after which z is sampled via reparameterization at training and set to μ at inference. The latent z is then projected to the cross-attention dimension:

    c = proj_ctx(z) ∈ R^(d_ctx).  (3)

2.3 Training Objective

The model is trained end-to-end with total loss

    L = L_diff + λ_comp L_comp + λ_aux L_aux + λ_KL L_KL.  (4)

The diffusion loss is L_diff = E_{x_0, ε, t} ‖ε − ε_θ(x_t, t, E_combined)‖²₂. We regularize the variational demographic latent toward a standard normal via the KL term L_KL = E[KL(N(μ, σ²I) ‖ N(0, I))]. We add a compositional consistency term L_comp = 1 − cos(h_demo, e_age + e_sex + e_race) as a soft anchor that stabilizes training toward a simple additive baseline while still allowing non-additive interactions. Ablations (§3.4) show it improves FID. To ensure demographic information survives projection into the cross-attention space, we apply auxiliary classification directly on the final token c:

    L_aux = CE(ŷ_age, y_age) + CE(ŷ_sex, y_sex) + CE(ŷ_race, y_race).  (5)

We deliberately apply L_aux on the projected token c (not on μ), so that the representation actually seen by the UNet remains demographically informative (see ablations in §3.4).

Implementation. We set d_node = 256 and d_ctx = 1024 to match the Stable Diffusion 2.1 cross-attention dimension.
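The hierarchy of Eqs. (1)-(3) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class and function names, MLP depth/activation, and the subgroup counts (four age bins, two sexes, four race groups, as in the result tables) are our own choices; only the layer sizes (d_node = 256, d_ctx = 1024) and the grandparent/parent/child structure follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_NODE, D_CTX = 256, 1024          # sizes from the paper (SD 2.1 context dim)
N_AGE, N_SEX, N_RACE = 4, 2, 4     # subgroup counts assumed from the tables

def mlp(d_in, d_out):
    # small MLP used for both pairwise ("parent") and composition ("child") stages
    return nn.Sequential(nn.Linear(d_in, d_out), nn.SiLU(), nn.Linear(d_out, d_out))

class HCN(nn.Module):
    def __init__(self):
        super().__init__()
        # "grandparents": single-attribute embeddings in a shared latent space
        self.emb_age = nn.Embedding(N_AGE, D_NODE)
        self.emb_sex = nn.Embedding(N_SEX, D_NODE)
        self.emb_race = nn.Embedding(N_RACE, D_NODE)
        # "parents": dedicated MLPs for all pairwise interactions (Eq. 1)
        self.f_as = mlp(2 * D_NODE, D_NODE)
        self.f_ar = mlp(2 * D_NODE, D_NODE)
        self.f_sr = mlp(2 * D_NODE, D_NODE)
        # "child": full composition g(.) (Eq. 2)
        self.g = mlp(3 * D_NODE, D_NODE)
        # variational head: h_demo -> (mu, log sigma)
        self.to_gauss = nn.Linear(D_NODE, 2 * D_NODE)
        # projection into the UNet cross-attention dimension (Eq. 3)
        self.proj_ctx = nn.Linear(D_NODE, D_CTX)

    def forward(self, age, sex, race, sample=True):
        e_a, e_s, e_r = self.emb_age(age), self.emb_sex(sex), self.emb_race(race)
        h_as = self.f_as(torch.cat([e_a, e_s], dim=-1))
        h_ar = self.f_ar(torch.cat([e_a, e_r], dim=-1))
        h_sr = self.f_sr(torch.cat([e_s, e_r], dim=-1))
        h_demo = self.g(torch.cat([h_as, h_ar, h_sr], dim=-1))
        mu, log_sigma = self.to_gauss(h_demo).chunk(2, dim=-1)
        # reparameterize at training; use the mean at inference
        z = mu + log_sigma.exp() * torch.randn_like(mu) if sample else mu
        c = self.proj_ctx(z).unsqueeze(1)                      # (B, 1, d_ctx)
        # compositional consistency anchor, L_comp in Eq. (4)
        l_comp = 1 - F.cosine_similarity(h_demo, e_a + e_s + e_r, dim=-1).mean()
        return c, mu, log_sigma, l_comp

hcn = HCN()
age, sex, race = torch.tensor([3]), torch.tensor([1]), torch.tensor([2])
e_text = torch.randn(1, 77, D_CTX)          # stand-in for the CLIP text context
c, mu, log_sigma, l_comp = hcn(age, sex, race)
e_combined = torch.cat([e_text, c], dim=1)  # (B, 78, d_ctx), fed to the UNet
print(e_combined.shape)
```

The concatenation in the last line yields the 78-token cross-attention context of §2.1; the returned `l_comp` would enter the total loss of Eq. (4) alongside the diffusion, auxiliary, and KL terms.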
The HCN introduces minimal computational overhead relative to the UNet backbone and does not require modification of diffusion timesteps or sampling procedures, contributing only a 0.19% increase in trainable parameters over the baseline.

3 Experiments

3.1 Datasets

We evaluate on two medical imaging modalities. For chest X-rays, we use MIMIC-CXR [20] postero-anterior views with demographic metadata, split into 62,094/1,300/7,039 training/validation/test images with no patient overlap. Text prompts follow: " year old . ". For fundus imaging, we use FairGenMed [8], containing 6,000/1,000/3,000 SLO fundus images with prompts encoding race, sex, ethnicity, and clinical attributes (glaucoma, cup-disc ratio, RNFL thickness, near vision status).

3.2 Evaluation Metrics

We assess generated images along four dimensions, computed on held-out test sets.

Image quality. We report Fréchet Inception Distance (FID) [10] and FID-RadImageNet (using radiology-specific embeddings [11]), BioViL [12] cosine similarity for semantic alignment, and MS-SSIM [13] for structural similarity.

Text-prompt alignment. We evaluate whether generated images reflect the conditioned attributes using pretrained classifiers: a TorchXRayVision [14] DenseNet-121 for chest X-ray disease AUROC, sex/race accuracy [15], and age RMSE; pretrained EfficientNet models for fundus glaucoma classification and cup-disc ratio prediction.

Fairness. Following the fairness evaluation in [8], we compute equity-scaled FID (ES-FID) [16, 17], which penalizes quality disparities across demographic subgroups:

    ES-FID_{A_i} = FID · ( 1 + (1 / (|A_i| · FID)) · Σ_{j=1}^{|A_i|} | FID − FID_{A_i^j} | )  (6)

where A_i denotes the subgroups of protected attribute i. ES-FID equals FID when all subgroups have identical quality, and increases with disparity.

Downstream utility.
We train disease classifiers on synthetic data and evaluate on real data (TSTR), reporting AUROC, equity-scaled AUROC (ES-AUC), Difference in Equalized Odds (DEOdds), and the underdiagnosis rate: the false positive rate for 'No Finding' predictions at the subgroup level [19].

3.3 Results

We trained CompDiff and the baseline models (fine-tuned Stable Diffusion 2.1 and FairDiffusion) using identical training budgets (up to 30,000 steps) on the same training sets with complete demographic labels. For each model, we select the best checkpoint based on validation performance across the four dimensions described in §3.2. All results are computed on held-out generated test sets using three different generation seeds; we report mean and standard deviation.

Overall generation quality. Table 1 compares CompDiff against the baselines across chest X-ray and fundus modalities. CompDiff achieves the best FID on both modalities (64.3 chest, 54.6 fundus). While FairDiffusion achieves slightly lower FID-RadImageNet on chest (6.2 vs. 6.8), CompDiff outperforms it on disease classification AUROC (0.82 vs. 0.74), indicating better clinical feature alignment. MS-SSIM values remain in the acceptable range (0.25-0.75) for all models, confirming neither overfitting nor generation failure. Sex accuracy remains near-perfect across all methods. Slightly reduced race accuracy and increased age RMSE are expected trade-offs from HCN conditioning, offset by the improved subgroup-level fairness reported below.

Fairness in image generation quality. CompDiff achieves the lowest ES-FID across sex, race, and age on both modalities (Table 1). FairDiffusion improves over the baseline but consistently underperforms CompDiff on single-attribute fairness. Table 2 presents selected intersectional subgroups spanning common to rare demographics (full results omitted for compactness).
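For concreteness, the ES-FID of Eq. (6) reduces to the overall FID plus the mean absolute deviation of subgroup FIDs from it; a minimal sketch (the function name and inputs are illustrative, not from the released code):

```python
def es_fid(overall_fid, subgroup_fids):
    """Equity-scaled FID (Eq. 6): overall FID inflated by subgroup disparity."""
    # mean absolute deviation of subgroup FIDs from the overall FID
    mad = sum(abs(overall_fid - f) for f in subgroup_fids) / len(subgroup_fids)
    return overall_fid * (1 + mad / overall_fid)  # algebraically: overall_fid + mad

# Equal subgroup quality leaves FID unchanged; disparity inflates it.
print(es_fid(64.3, [64.3, 64.3]))  # → 64.3
print(es_fid(64.3, [50.0, 90.0]))  # overall 64.3 plus mean deviation 20.0
```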
CompDiff improves FID (lower is better) for rare subgroups (e.g., 40-60 F/A: 204.0 → 167.9 on chest) while maintaining gains on common subgroups (e.g., 60-80 M/W: 115.2 → 97.6), demonstrating that fairness improvements do not come at the cost of majority-group performance. FairDiffusion improves over the baseline on most subgroups but shows limited gains for the rarest intersections, where training signal is scarce.

Table 1. Overall generation quality and fairness metrics across chest X-ray and fundus modalities. Values reported as mean (std) across three runs. ↑ indicates higher is better, ↓ indicates lower is better. Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff, FID-RAD=FID-RadImageNet.

Modality    | Method | FID ↓      | FID-RAD ↓ | Disease AUROC ↑ | ES-FID Sex ↓ | ES-FID Race ↓ | ES-FID Age ↓
Chest X-ray | B      | 82.8 (2.2) | 8.7 (0.1) | 0.80 (0.00)     | 98.3 (1.3)   | 122.9 (1.2)   | 111.8 (0.7)
Chest X-ray | FD     | 75.1 (0.1) | 6.2 (0.0) | 0.74 (0.03)     | 88.6 (0.3)   | 115.7 (0.5)   | 102.5 (0.8)
Chest X-ray | CD     | 64.3 (0.3) | 6.8 (0.1) | 0.82 (0.01)     | 78.4 (0.1)   | 106.2 (0.4)   | 98.3 (0.6)
Fundus      | B      | 72.2 (0.2) | 6.4 (0.0) | 0.94            | 82.4 (1.4)   | 105.7 (0.1)   | 97.3 (0.8)
Fundus      | FD     | 64.3 (0.5) | 5.0 (0.1) | 0.93            | 76.7 (0.9)   | 106.6 (2.2)   | 98.1 (1.3)
Fundus      | CD     | 54.6 (0.4) | 4.9 (0.1) | 0.96            | 65.2 (0.6)   | 97.7 (1.4)    | 85.1 (1.0)

Table 2. FID on selected intersectional subgroups for chest (top) and fundus (bottom), ranging from common to rare; lower is better. % (train) indicates each subgroup's proportion of the training set; values are reported as mean (std). Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff. A1=18-40, A2=40-60, A3=60-80, A4=80+; M/F=Male/Female; W/B/H/A=White/Black/Hispanic/Asian.

Model     | A3/M/W      | A2/M/W      | A1/M/W      | A2/F/H      | A2/M/H      | A2/M/A       | A2/F/A
% (train) | 16%         | 14%         | 4%          | 2%          | 1.5%        | 0.5%         | 0.5%
B         | 115.2 (1.2) | 114.1 (1.6) | 130.9 (1.0) | 168.5 (1.9) | 170.4 (9.4) | 192.6 (4.9)  | 204.0 (5.4)
FD        | 102.9 (2.4) | 104.6 (0.8) | 125.4 (3.2) | 162.0 (2.4) | 165.6 (3.6) | 201.7 (15.4) | 209.4 (5.2)
CD        | 97.6 (2.0)  | 89.3 (1.6)  | 116.9 (1.6) | 149.0 (2.7) | 135.0 (7.0) | 184.5 (3.0)  | 167.9 (1.7)

Model     | A3/F/W      | A3/M/W     | A4/F/W      | A1/F/W      | A4/M/W      | A2/M/A      | A4/M/B
% (train) | 24.5%       | 17%        | 4%          | 5%          | 3%          | 1.5%        | 0.5%
B         | 103.3 (5.7) | 91.6 (2.6) | 176.0 (5.4) | 132.3 (6.5) | 132.5 (0.8) | 163.0 (9.6) | 229.4 (2.7)
FD        | 101.4 (0.1) | 85.5 (0.4) | 180.2 (1.4) | 128.5 (7.2) | 132.9 (2.0) | 156.8 (3.9) | 300.2 (29.2)
CD        | 83.9 (2.1)  | 78.1 (1.1) | 155.5 (1.4) | 117.1 (2.5) | 118.6 (4.0) | 146.4 (4.9) | 217.8 (2.1)

Zero-shot compositional generalization. To directly test whether CompDiff can generalize to unseen demographic combinations, we remove five intersectional subgroups entirely from training (selected based on rarity) and evaluate generation quality on these held-out groups (results and removed subgroups are in Table 3). CompDiff outperforms both the baseline and FairDiffusion on all held-out subgroups, achieving up to a 21% FID improvement. Notably, FairDiffusion performs worse than the baseline on several subgroups (e.g., 80+ Female Asian: 247.2 vs. 210.7), confirming that loss reweighting cannot easily help when training samples are absent. In contrast, CompDiff composes representations for unseen intersections from learned single-attribute and pairwise embeddings, validating our core hypothesis that hierarchical composition enables generalization beyond the training distribution.

Table 3. Zero-shot generalization to held-out demographic subgroups (FID ↓). These intersections were removed entirely from training. Legend: F/A=Female Asian, M/A=Male Asian, M/H=Male Hispanic. Lower FID is better.

Method | 18-40 F/A | 18-40 M/A | 80+ F/A | 80+ M/A | 80+ M/H
B      | 183.3     | 161.3     | 210.7   | 208.1   | 231.7
FD     | 181.7     | 152.1     | 247.2   | 265.5   | 229.9
CD     | 159.8     | 127.6     | 195.4   | 206.6   | 212.2

Downstream classification impact. To assess practical impact, we train disease classifiers on synthetic data and evaluate on real test sets. Table 4 presents results across both modalities. On chest X-rays (lung lesion and opacity detection), our model achieves higher mean AUC (0.72 vs. 0.69) and lower underdiagnosis rates (0.40 vs. 0.46). On fundus (glaucoma detection), our model improves AUC (0.78 vs. 0.75) while reducing the overall equalized odds difference (0.12 vs. 0.15), demonstrating that generation quality directly impacts downstream fairness.

Table 4. Downstream classifier performance when trained on synthetic data and evaluated on real test sets. AUC and ES-AUC measure classification performance and demographic equity; higher AUC/ES-AUC is better. Underdiagnosis rate (chest) and equalized odds difference (fundus) measure fairness in model predictions; lower values indicate reduced diagnostic bias across demographic groups. Values are reported as mean (std) across runs. Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff.

Metric     | Subgroup | Chest B     | Chest FD    | Chest CD    | Fundus B    | Fundus FD   | Fundus CD
AUC ↑      | Overall  | 0.69 (0.01) | 0.68 (0.01) | 0.72 (0.01) | 0.75 (0.01) | 0.76 (0.01) | 0.78 (0.01)
AUC ↑      | Sex      | 0.68 (0.01) | 0.67 (0.01) | 0.71 (0.01) | 0.75 (0.01) | 0.76 (0.01) | 0.77 (0.01)
AUC ↑      | Race     | 0.67 (0.01) | 0.67 (0.01) | 0.69 (0.01) | 0.70 (0.02) | 0.71 (0.02) | 0.72 (0.02)
AUC ↑      | Age      | 0.67 (0.01) | 0.66 (0.01) | 0.71 (0.01) | 0.66 (0.05) | 0.56 (0.07) | 0.61 (0.04)
Fairness ↓ | Overall  | 0.46 (0.01) | 0.44 (0.01) | 0.40 (0.01) | 0.15 (0.05) | 0.13 (0.05) | 0.12 (0.04)
Fairness ↓ | Sex      | 0.45 (0.01) | 0.42 (0.01) | 0.39 (0.01) | 0.02 (0.02) | 0.01 (0.02) | 0.01 (0.02)
Fairness ↓ | Race     | 0.45 (0.01) | 0.42 (0.01) | 0.39 (0.01) | 0.15 (0.05) | 0.13 (0.05) | 0.12 (0.04)
Fairness ↓ | Age      | 0.43 (0.01) | 0.40 (0.01) | 0.37 (0.01) | 0.39 (0.16) | 0.47 (0.12) | 0.28 (0.14)

3.4 Ablations

Table 5 first validates key architectural decisions.
The baseline encodes demographics in text, achieving excellent demographic accuracy (sex and race: 0.99) but poor FID (94.5); removing demographics improves FID to 70.9 but destroys controllability (sex: 0.52). We evaluate three architectures to recover this trade-off: (1) a Dual Text Conditioner extends CLIP with a separate demographic branch, but extending beyond 77 tokens breaks pre-trained representations (sex: 0.75, race: 0.50, FID: 140.0); (2) a flat MLP encoder fuses concatenated embeddings but fails to recover control (sex: 0.50); (3) our CompDiff model with hierarchical composition (HCN) succeeds (sex: 0.99, race: 0.96, FID: 75.5). The stark contrast between flat (sex: 0.50) and hierarchical (sex: 0.99) conditioning under identical supervision demonstrates that architectural inductive bias is critical. However, hierarchy alone is insufficient: without the auxiliary loss, HCN fails (sex: 0.51).

Table 5. Ablation study results. CompDiff achieves the best trade-off between image quality and demographic control. Results are reported on the holdout validation set rather than the test set, which explains the discrepancy with the results reported earlier.

Variant         | Key Change               | FID ↓ | Sex ↑ | Race ↑ | Age ↓ | AUROC ↑
Baseline        | Demo in text             | 94.5  | 0.99  | 0.99   | 5.79  | 0.75
Stripped        | No demo                  | 70.9  | 0.52  | 0.68   | 17.4  | 0.78
Dual Text       | Separate CLIP branch     | 140.0 | 0.75  | 0.50   | 20.7  | 0.74
DemoEnc         | Flat MLP                 | 80.6  | 0.50  | 0.70   | 17.2  | 0.70
HCN (no aux)    | No supervision           | 80.3  | 0.51  | 0.69   | 18.1  | 0.70
HCN + CFG       | CFG training             | 80.1  | 0.51  | 0.71   | 16.6  | 0.72
HCN (aux on μ)  | Supervision before proj. | 79.8  | 0.54  | 0.68   | 17.9  | 0.71
CompDiff        | Demo in HCN, aux on c    | 75.5  | 0.99  | 0.96   | 8.75  | 0.76
No uncertainty  | λ_KL = 0                 | 77.6  | 1.00  | 0.96   | 10.1  | 0.74
No L_comp       | λ_comp = 0.0             | 88.0  | 0.97  | 0.94   | 8.0   | 0.75
Strong L_comp   | λ_comp = 0.5             | 97.1  | 0.99  | 0.94   | 10.8  | 0.72
We find that auxiliary supervision must be applied on the output token c (after projection), not on μ (before projection; sex: 0.54) or via classifier-free guidance (sex: 0.51), confirming that supervision must be applied where the UNet receives the signal.

We next ablate the loss terms used for the HCN while keeping the architecture fixed (Eq. 4). Removing the uncertainty modeling slightly worsens FID and degrades age controllability (age RMSE worsens from 8 to 10.1), indicating that the variational latent provides a modest but consistent benefit. Removing the compositional consistency term substantially worsens FID (75.5 → 88.0) without clear gains in demographic accuracy, confirming that L_comp acts as a useful regularizer. Increasing its weight further degrades FID (104.2) with no additional fairness benefit, suggesting that our default setting λ_comp = 0.1 strikes a good balance.

4 Conclusion

We present CompDiff, a simple hierarchical intersectional conditioning framework for fair medical image diffusion inspired by the intuition of compositionality. By modifying representation structure rather than optimization weights, our method enables compositional generalization to rare demographic intersections and improves both generative fidelity and downstream fairness. Future work will explore more sophisticated conditioning mechanisms, such as graph-based interaction modeling or other structured representations.

While CompDiff improves subgroup equity, several limitations remain. First, fairness evaluation relies on quantitative metrics rather than clinical expert assessment. Second, hierarchical composition assumes structured demographic attributes and does not extend to continuous or unstructured attributes.
Finally, although zero-shot intersectional generalization improves, performance still degrades relative to well-represented groups, indicating that representation-level solutions do not fully eliminate data imbalance effects.

References

1. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inf. Process. Syst. 33, 6840-6851 (2020)
2. Song, Y., et al.: Solving inverse problems in medical imaging with score-based generative models. In: International Conference on Learning Representations (ICLR) (2021)
3. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684-10695 (2022)
4. Chambon, P., Bluethgen, C., Langlotz, C.P., Chaudhari, A.: RoentGen: Vision-language foundation model for chest X-ray generation. arXiv:2211.12737 (2022)
5. Bluethgen, C., et al.: A vision-language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. (2024)
6. Moroianu, S.L., Bluethgen, C., Chambon, P., Cherti, M., Delbrouck, J.-B., Paschali, M., Price, B., Gichoya, J., Jitsev, J., Langlotz, C.P., Chaudhari, A.S.: Improving performance, robustness, and fairness of radiographic AI models with finely-controllable synthetic data. arXiv:2508.16783 (2025)
7. Ktena, I., et al.: Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med. 30, 1166-1173 (2024)
8. Luo, Y., et al.: FairDiffusion: Enhancing equity in latent diffusion models via fair Bayesian perturbation. Sci. Adv. 11, eads4593 (2025). https://doi.org/10.1126/sciadv.ads4593
9. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)
10. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Adv. Neural Inf. Process. Syst. 30, 6629-6640 (2017)
11. Mei, S., et al.: RadImageNet: An open radiologic deep learning research dataset. Radiol. Artif. Intell. 4, e210315 (2022)
12. Boecking, B., et al.: Making the most of text semantics to improve biomedical vision-language processing. In: European Conference on Computer Vision (ECCV) (2022)
13. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multi-scale structural similarity for image quality assessment. In: Proc. Asilomar Conf. Signals, Syst. Comput., vol. 2, pp. 1398-1402 (2003)
14. Cohen, J., et al.: TorchXRayVision: A library of chest X-ray datasets and models. arXiv:2111.00595 (2021)
15. Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in chest X-ray disease detection models. eBioMedicine 89, 104467 (2023)
16. Luo, Y., Tian, Y., Shi, M., Elze, T., Wang, M.: FairVision: Equitable deep learning for eye disease screening via fair identity scaling. arXiv:2310.02492 (2024)
17. Tian, Y., Shi, M., Luo, Y., Kouhana, A., Elze, T., Wang, M.: FairSeg: A large-scale medical image segmentation dataset for fairness learning with fair error-bound scaling. In: International Conference on Learning Representations (ICLR) (2024)
18. Cohen, J., et al.: Age prediction from chest radiographs using deep learning. In: Proc. Mach. Learn. Res. 149, 39-53 (2021)
19. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I.Y., Ghassemi, M.: CheXclusion: Fairness gaps in deep chest X-ray classifiers. In: Proc. Mach. Learn. Res. 149, 232-243 (2021)
20. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.-Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1) (2019)
