CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation


Mahmoud Ibrahim (1,2,3,*), Bart Elen (3), Chang Sun (1,2), Gökhan Ertaylan (3), and Michel Dumontier (1,2)

(1) Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
(2) Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
(3) VITO, Belgium
(*) Corresponding author: mahmoud.ibrahim@vito.be

Abstract. Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that the architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

Keywords: Compositional Demographic Conditioning · Fair Medical Image Synthesis · Intersectional Bias and Zero-Shot Generalization.

1 Introduction

Diffusion models have emerged as powerful tools for medical image synthesis, offering promising solutions to data scarcity [1-3]. Text-to-image models enable generation of synthetic datasets conditioned on clinical findings, with growing applications in training and augmenting diagnostic AI systems [4, 5]. A compelling use case is addressing demographic imbalance: synthetic data could supplement underrepresented populations to train fairer classifiers [6, 7].

However, a fundamental question is often overlooked: do generative models themselves produce equally high-quality images across demographic groups? When trained on imbalanced data, diffusion models can achieve strong average fidelity while producing degraded samples for rare subgroups; for some demographic intersections, training examples may be absent altogether. A dataset may contain elderly patients, Asian patients, and female patients, yet have zero examples at the intersection of all three with a specific pathology. No amount of oversampling, reweighting, or balanced mini-batching can address groups that do not exist in the training data. We refer to this as the imbalanced generator problem.
FairDiffusion [8] is among the first methods to explicitly address fair synthetic data generation. It takes a significant step by introducing Fair Bayesian Perturbation, which adaptively reweights the training loss to equalize learning across subgroups. However, this optimization-level approach does not address how demographics are represented: it relies on implicit encoding within text prompts, where demographic tokens compete for CLIP's [9] limited 77-token budget, and rare intersectional combinations lack sufficient training signal regardless of loss weighting. Critically, reweighting, like all data-level strategies, cannot generate learning signal for combinations the model has never observed.

We propose CompDiff, which addresses the imbalanced generator problem at the representation level. Our key insight is that demographic identity is compositional: a rare intersection such as "80+ Asian female" can be composed from well-learned single-attribute embeddings and moderately learned pairwise interactions, enabling generalization even to combinations entirely absent from training, analogous to how language models compose known words into novel sentences. CompDiff introduces a Hierarchical Conditioner Network (HCN) that explicitly models demographic attribute interactions, producing a dedicated demographic token concatenated with clinical text embeddings. Where FairDiffusion asks "how much to weight each sample," CompDiff asks "how to represent demographics so that unseen combinations can be composed from seen ones." This compositional structure facilitates better zero-shot generalization to unseen demographic intersections: a capability that data-level and optimization-level methods are unlikely to provide without structural inductive bias.
Through extensive experiments on chest X-rays (MIMIC-CXR [20]) and fundus images (FairGenMed [8]), we demonstrate that CompDiff outperforms both standard baselines and FairDiffusion across image quality, demographic fairness, and downstream utility.

2 Our Proposed Method

2.1 Overview

Standard diffusion models encode demographics within the text prompt, forcing demographic tokens to compete with clinical tokens in a shared embedding space. We introduce CompDiff, which processes demographic attributes separately through a dedicated Hierarchical Conditioner Network (HCN), producing a demographic token concatenated to CLIP embeddings as cross-attention context. Formally, clinical findings are encoded using CLIP, producing E_text ∈ R^(B×77×d_ctx), while demographic attributes (age, sex, race) are processed separately through the HCN, outputting c ∈ R^(B×1×d_ctx); we concatenate E_combined = [E_text, c] ∈ R^(B×78×d_ctx) as cross-attention context for the diffusion UNet.

2.2 Hierarchical Conditioner Network

The HCN introduces structured inductive bias by decomposing demographic conditioning into hierarchical components: single-attribute embeddings, pairwise interactions, and full composition.

Single-attribute embeddings ("grandparents"). Each demographic attribute x_v is embedded into a shared latent space e_v = Embed_v(x_v) of dimension d_node. For age a, sex s, and race r: e_a, e_s, e_r ∈ R^(d_node).

Pairwise interactions ("parents"). To capture non-additive relationships between attributes, we model all pairwise interactions using dedicated MLPs:

    h_{a,s} = f_{a,s}([e_a, e_s]),  h_{a,r} = f_{a,r}([e_a, e_r]),  h_{s,r} = f_{s,r}([e_s, e_r]).  (1)

We restrict the hierarchy to pairwise interactions to balance expressivity against overfitting on rare subgroups.
Full composition ("child"). The final demographic representation is obtained by combining the pairwise interactions through an MLP g(·):

    h_demo = g([h_{a,s}, h_{a,r}, h_{s,r}]).  (2)

This structured factorization encourages parameter sharing across subgroups and improves data efficiency for rare intersections. h_demo is then mapped to a diagonal Gaussian, (μ, log σ) = Linear(h_demo), after which z is sampled via reparameterization at training and set to μ at inference. The latent z is then projected to the cross-attention dimension:

    c = proj_ctx(z) ∈ R^(d_ctx).  (3)

2.3 Training Objective

The model is trained end-to-end with total loss

    L = L_diff + λ_comp L_comp + λ_aux L_aux + λ_KL L_KL.  (4)

The diffusion loss is L_diff = E_{x_0, ε, t} ‖ε − ε_θ(x_t, t, E_combined)‖²₂. We regularize the variational demographic latent toward a standard normal via the KL term L_KL = E[KL(N(μ, σ²I) ‖ N(0, I))]. We add a compositional consistency term L_comp = 1 − cos(h_demo, e_age + e_sex + e_race) as a soft anchor that stabilizes training toward a simple additive baseline while still allowing non-additive interactions. Ablations (§3.4) show it improves FID. To ensure demographic information survives projection into the cross-attention space, we apply auxiliary classification directly on the final token c:

    L_aux = CE(ŷ_age, y_age) + CE(ŷ_sex, y_sex) + CE(ŷ_race, y_race).  (5)

We deliberately apply L_aux on the projected token c (not on μ), so that the representation actually seen by the UNet remains demographically informative (see ablations in §3.4).

Implementation. We set d_node = 256 and d_ctx = 1024 to match the Stable Diffusion 2.1 cross-attention dimension.
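The hierarchy of Eqs. (1)-(3) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class and function names, MLP depth/activation, and the subgroup counts (four age bins, two sexes, four race groups, as in the result tables) are our own choices; only the layer sizes (d_node = 256, d_ctx = 1024) and the grandparent/parent/child structure follow the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_NODE, D_CTX = 256, 1024          # sizes from the paper (SD 2.1 context dim)
N_AGE, N_SEX, N_RACE = 4, 2, 4     # subgroup counts assumed from the tables

def mlp(d_in, d_out):
    # small MLP used for both pairwise ("parent") and composition ("child") stages
    return nn.Sequential(nn.Linear(d_in, d_out), nn.SiLU(), nn.Linear(d_out, d_out))

class HCN(nn.Module):
    def __init__(self):
        super().__init__()
        # "grandparents": single-attribute embeddings in a shared latent space
        self.emb_age = nn.Embedding(N_AGE, D_NODE)
        self.emb_sex = nn.Embedding(N_SEX, D_NODE)
        self.emb_race = nn.Embedding(N_RACE, D_NODE)
        # "parents": dedicated MLPs for all pairwise interactions (Eq. 1)
        self.f_as = mlp(2 * D_NODE, D_NODE)
        self.f_ar = mlp(2 * D_NODE, D_NODE)
        self.f_sr = mlp(2 * D_NODE, D_NODE)
        # "child": full composition g(.) (Eq. 2)
        self.g = mlp(3 * D_NODE, D_NODE)
        # variational head: h_demo -> (mu, log sigma)
        self.to_gauss = nn.Linear(D_NODE, 2 * D_NODE)
        # projection into the UNet cross-attention dimension (Eq. 3)
        self.proj_ctx = nn.Linear(D_NODE, D_CTX)

    def forward(self, age, sex, race, sample=True):
        e_a, e_s, e_r = self.emb_age(age), self.emb_sex(sex), self.emb_race(race)
        h_as = self.f_as(torch.cat([e_a, e_s], dim=-1))
        h_ar = self.f_ar(torch.cat([e_a, e_r], dim=-1))
        h_sr = self.f_sr(torch.cat([e_s, e_r], dim=-1))
        h_demo = self.g(torch.cat([h_as, h_ar, h_sr], dim=-1))
        mu, log_sigma = self.to_gauss(h_demo).chunk(2, dim=-1)
        # reparameterize at training; use the mean at inference
        z = mu + log_sigma.exp() * torch.randn_like(mu) if sample else mu
        c = self.proj_ctx(z).unsqueeze(1)                      # (B, 1, d_ctx)
        # compositional consistency anchor, L_comp in Eq. (4)
        l_comp = 1 - F.cosine_similarity(h_demo, e_a + e_s + e_r, dim=-1).mean()
        return c, mu, log_sigma, l_comp

hcn = HCN()
age, sex, race = torch.tensor([3]), torch.tensor([1]), torch.tensor([2])
e_text = torch.randn(1, 77, D_CTX)          # stand-in for the CLIP text context
c, mu, log_sigma, l_comp = hcn(age, sex, race)
e_combined = torch.cat([e_text, c], dim=1)  # (B, 78, d_ctx), fed to the UNet
print(e_combined.shape)
```

The concatenation in the last line yields the 78-token cross-attention context of §2.1; the returned `l_comp` would enter the total loss of Eq. (4) alongside the diffusion, auxiliary, and KL terms.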
The HCN introduces minimal computational overhead relative to the UNet backbone and does not require modification of diffusion timesteps or sampling procedures, contributing only a 0.19% increase in trainable parameters over the baseline.

3 Experiments

3.1 Datasets

We evaluate on two medical imaging modalities. For chest X-rays, we use MIMIC-CXR [20] postero-anterior views with demographic metadata, split into 62,094/1,300/7,039 training/validation/test images with no patient overlap. Text prompts follow: " year old . ". For fundus imaging, we use FairGenMed [8], containing 6,000/1,000/3,000 SLO fundus images with prompts encoding race, sex, ethnicity, and clinical attributes (glaucoma, cup-disc ratio, RNFL thickness, near vision status).

3.2 Evaluation Metrics

We assess generated images along four dimensions, computed on held-out test sets.

Image quality. We report Fréchet Inception Distance (FID) [10] and FID-RadImageNet (using radiology-specific embeddings [11]), BioViL [12] cosine similarity for semantic alignment, and MS-SSIM [13] for structural similarity.

Text-prompt alignment. We evaluate whether generated images reflect the conditioned attributes using pretrained classifiers: a TorchXRayVision [14] DenseNet-121 for chest X-ray disease AUROC, sex/race accuracy [15], and age RMSE; pretrained EfficientNet models for fundus glaucoma classification and cup-disc ratio prediction.

Fairness. Following the fairness evaluation in [8], we compute equity-scaled FID (ES-FID) [16, 17], which penalizes quality disparities across demographic subgroups:

    ES-FID_{A_i} = FID · ( 1 + (1 / (|A_i| · FID)) · Σ_{j=1}^{|A_i|} | FID − FID_{A_i^j} | )  (6)

where A_i denotes the subgroups of protected attribute i. ES-FID equals FID when all subgroups have identical quality, and increases with disparity.

Downstream utility.
We train disease classifiers on synthetic data and evaluate on real data (TSTR), reporting AUROC, equity-scaled AUROC (ES-AUC), Difference in Equalized Odds (DEOdds), and the underdiagnosis rate: the false positive rate for 'No Finding' predictions at the subgroup level [19].

3.3 Results

We trained CompDiff and the baseline models (fine-tuned Stable Diffusion 2.1 and FairDiffusion) using identical training budgets (up to 30,000 steps) on the same training sets with complete demographic labels. For each model, we select the best checkpoint based on validation performance across the four dimensions described in §3.2. All results are computed on held-out generated test sets using three different generation seeds; we report mean and standard deviation.

Overall generation quality. Table 1 compares CompDiff against the baselines across chest X-ray and fundus modalities. CompDiff achieves the best FID on both modalities (64.3 chest, 54.6 fundus). While FairDiffusion achieves slightly lower FID-RadImageNet on chest (6.2 vs. 6.8), CompDiff outperforms it on disease classification AUROC (0.82 vs. 0.74), indicating better clinical feature alignment. MS-SSIM values remain in the acceptable range (0.25-0.75) for all models, confirming neither overfitting nor generation failure. Sex accuracy remains near-perfect across all methods. Slightly reduced race accuracy and increased age RMSE are expected trade-offs from HCN conditioning, offset by the improved subgroup-level fairness reported below.

Fairness in image generation quality. CompDiff achieves the lowest ES-FID across sex, race, and age on both modalities (Table 1). FairDiffusion improves over the baseline but consistently underperforms CompDiff on single-attribute fairness. Table 2 presents selected intersectional subgroups spanning common to rare demographics (full results omitted for compactness).
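For concreteness, the ES-FID of Eq. (6) reduces to the overall FID plus the mean absolute deviation of subgroup FIDs from it; a minimal sketch (the function name and inputs are illustrative, not from the released code):

```python
def es_fid(overall_fid, subgroup_fids):
    """Equity-scaled FID (Eq. 6): overall FID inflated by subgroup disparity."""
    # mean absolute deviation of subgroup FIDs from the overall FID
    mad = sum(abs(overall_fid - f) for f in subgroup_fids) / len(subgroup_fids)
    return overall_fid * (1 + mad / overall_fid)  # algebraically: overall_fid + mad

# Equal subgroup quality leaves FID unchanged; disparity inflates it.
print(es_fid(64.3, [64.3, 64.3]))  # → 64.3
print(es_fid(64.3, [50.0, 90.0]))  # overall 64.3 plus mean deviation 20.0
```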
CompDiff improves FID (lower is better) for rare subgroups (e.g., 40-60 F/A: 204.0 → 167.9 on chest) while maintaining gains on common subgroups (e.g., 60-80 M/W: 115.2 → 97.6), demonstrating that fairness improvements do not come at the cost of majority-group performance. FairDiffusion improves over the baseline on most subgroups but shows limited gains for the rarest intersections, where training signal is scarce.

Table 1. Overall generation quality and fairness metrics across chest X-ray and fundus modalities. Values reported as mean (std) across three runs. ↑ indicates higher is better, ↓ indicates lower is better. Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff, FID-RAD=FID-RadImageNet.

Modality    | Method | FID ↓      | FID-RAD ↓ | Disease AUROC ↑ | ES-FID Sex ↓ | ES-FID Race ↓ | ES-FID Age ↓
Chest X-ray | B      | 82.8 (2.2) | 8.7 (0.1) | 0.80 (0.00)     | 98.3 (1.3)   | 122.9 (1.2)   | 111.8 (0.7)
Chest X-ray | FD     | 75.1 (0.1) | 6.2 (0.0) | 0.74 (0.03)     | 88.6 (0.3)   | 115.7 (0.5)   | 102.5 (0.8)
Chest X-ray | CD     | 64.3 (0.3) | 6.8 (0.1) | 0.82 (0.01)     | 78.4 (0.1)   | 106.2 (0.4)   | 98.3 (0.6)
Fundus      | B      | 72.2 (0.2) | 6.4 (0.0) | 0.94            | 82.4 (1.4)   | 105.7 (0.1)   | 97.3 (0.8)
Fundus      | FD     | 64.3 (0.5) | 5.0 (0.1) | 0.93            | 76.7 (0.9)   | 106.6 (2.2)   | 98.1 (1.3)
Fundus      | CD     | 54.6 (0.4) | 4.9 (0.1) | 0.96            | 65.2 (0.6)   | 97.7 (1.4)    | 85.1 (1.0)

Table 2. FID on selected intersectional subgroups for chest (top) and fundus (bottom), ranging from common to rare; lower is better. % (train) indicates each subgroup's proportion of the training set; values are reported as mean (std). Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff. A1=18-40, A2=40-60, A3=60-80, A4=80+; M/F=Male/Female; W/B/H/A=White/Black/Hispanic/Asian.

Model     | A3/M/W      | A2/M/W      | A1/M/W      | A2/F/H      | A2/M/H      | A2/M/A       | A2/F/A
% (train) | 16%         | 14%         | 4%          | 2%          | 1.5%        | 0.5%         | 0.5%
B         | 115.2 (1.2) | 114.1 (1.6) | 130.9 (1.0) | 168.5 (1.9) | 170.4 (9.4) | 192.6 (4.9)  | 204.0 (5.4)
FD        | 102.9 (2.4) | 104.6 (0.8) | 125.4 (3.2) | 162.0 (2.4) | 165.6 (3.6) | 201.7 (15.4) | 209.4 (5.2)
CD        | 97.6 (2.0)  | 89.3 (1.6)  | 116.9 (1.6) | 149.0 (2.7) | 135.0 (7.0) | 184.5 (3.0)  | 167.9 (1.7)

Model     | A3/F/W      | A3/M/W     | A4/F/W      | A1/F/W      | A4/M/W      | A2/M/A      | A4/M/B
% (train) | 24.5%       | 17%        | 4%          | 5%          | 3%          | 1.5%        | 0.5%
B         | 103.3 (5.7) | 91.6 (2.6) | 176.0 (5.4) | 132.3 (6.5) | 132.5 (0.8) | 163.0 (9.6) | 229.4 (2.7)
FD        | 101.4 (0.1) | 85.5 (0.4) | 180.2 (1.4) | 128.5 (7.2) | 132.9 (2.0) | 156.8 (3.9) | 300.2 (29.2)
CD        | 83.9 (2.1)  | 78.1 (1.1) | 155.5 (1.4) | 117.1 (2.5) | 118.6 (4.0) | 146.4 (4.9) | 217.8 (2.1)

Zero-shot compositional generalization. To directly test whether CompDiff can generalize to unseen demographic combinations, we remove five intersectional subgroups entirely from training (selected based on rarity) and evaluate generation quality on these held-out groups (results and removed subgroups are in Table 3). CompDiff outperforms both the baseline and FairDiffusion on all held-out subgroups, achieving up to a 21% FID improvement. Notably, FairDiffusion performs worse than the baseline on several subgroups (e.g., 80+ Female Asian: 247.2 vs. 210.7), confirming that loss reweighting cannot easily help when training samples are absent. In contrast, CompDiff composes representations for unseen intersections from learned single-attribute and pairwise embeddings, validating our core hypothesis that hierarchical composition enables generalization beyond the training distribution.

Table 3. Zero-shot generalization to held-out demographic subgroups (FID ↓). These intersections were removed entirely from training. Legend: F/A=Female Asian, M/A=Male Asian, M/H=Male Hispanic. Lower FID is better.

Method | 18-40 F/A | 18-40 M/A | 80+ F/A | 80+ M/A | 80+ M/H
B      | 183.3     | 161.3     | 210.7   | 208.1   | 231.7
FD     | 181.7     | 152.1     | 247.2   | 265.5   | 229.9
CD     | 159.8     | 127.6     | 195.4   | 206.6   | 212.2

Downstream classification impact. To assess practical impact, we train disease classifiers on synthetic data and evaluate on real test sets. Table 4 presents results across both modalities. On chest X-rays (lung lesion and opacity detection), our model achieves higher mean AUC (0.72 vs. 0.69) and lower underdiagnosis rates (0.40 vs. 0.46). On fundus (glaucoma detection), our model improves AUC (0.78 vs. 0.75) while reducing the overall equalized odds difference (0.12 vs. 0.15), demonstrating that generation quality directly impacts downstream fairness.

Table 4. Downstream classifier performance when trained on synthetic data and evaluated on real test sets. AUC and ES-AUC measure classification performance and demographic equity; higher AUC/ES-AUC is better. Underdiagnosis rate (chest) and equalized odds difference (fundus) measure fairness in model predictions; lower values indicate reduced diagnostic bias across demographic groups. Values are reported as mean (std) across runs. Legend: B=Baseline, FD=FairDiffusion, CD=CompDiff.

Metric     | Subgroup | Chest B     | Chest FD    | Chest CD    | Fundus B    | Fundus FD   | Fundus CD
AUC ↑      | Overall  | 0.69 (0.01) | 0.68 (0.01) | 0.72 (0.01) | 0.75 (0.01) | 0.76 (0.01) | 0.78 (0.01)
AUC ↑      | Sex      | 0.68 (0.01) | 0.67 (0.01) | 0.71 (0.01) | 0.75 (0.01) | 0.76 (0.01) | 0.77 (0.01)
AUC ↑      | Race     | 0.67 (0.01) | 0.67 (0.01) | 0.69 (0.01) | 0.70 (0.02) | 0.71 (0.02) | 0.72 (0.02)
AUC ↑      | Age      | 0.67 (0.01) | 0.66 (0.01) | 0.71 (0.01) | 0.66 (0.05) | 0.56 (0.07) | 0.61 (0.04)
Fairness ↓ | Overall  | 0.46 (0.01) | 0.44 (0.01) | 0.40 (0.01) | 0.15 (0.05) | 0.13 (0.05) | 0.12 (0.04)
Fairness ↓ | Sex      | 0.45 (0.01) | 0.42 (0.01) | 0.39 (0.01) | 0.02 (0.02) | 0.01 (0.02) | 0.01 (0.02)
Fairness ↓ | Race     | 0.45 (0.01) | 0.42 (0.01) | 0.39 (0.01) | 0.15 (0.05) | 0.13 (0.05) | 0.12 (0.04)
Fairness ↓ | Age      | 0.43 (0.01) | 0.40 (0.01) | 0.37 (0.01) | 0.39 (0.16) | 0.47 (0.12) | 0.28 (0.14)

3.4 Ablations

Table 5 first validates key architectural decisions.
The baseline encodes demographics in text, achieving excellent demographic accuracy (sex and race: 0.99) but poor FID (94.5); removing demographics improves FID to 70.9 but destroys controllability (sex: 0.52). We evaluate three architectures to recover this trade-off: (1) a Dual Text Conditioner extends CLIP with a separate demographic branch, but extending beyond 77 tokens breaks pre-trained representations (sex: 0.75, race: 0.50, FID: 140.0); (2) a flat MLP encoder fuses concatenated embeddings but fails to recover control (sex: 0.50); (3) our CompDiff model with hierarchical composition (HCN) succeeds (sex: 0.99, race: 0.96, FID: 75.5). The stark contrast between flat (sex: 0.50) and hierarchical (sex: 0.99) conditioning under identical supervision demonstrates that architectural inductive bias is critical. However, hierarchy alone is insufficient: without the auxiliary loss, HCN fails (sex: 0.51).

Table 5. Ablation study results. CompDiff achieves the best trade-off between image quality and demographic control. Results are reported on the holdout validation set rather than the test set, which explains the discrepancy with the results reported earlier.

Variant         | Key Change               | FID ↓ | Sex ↑ | Race ↑ | Age ↓ | AUROC ↑
Baseline        | Demo in text             | 94.5  | 0.99  | 0.99   | 5.79  | 0.75
Stripped        | No demo                  | 70.9  | 0.52  | 0.68   | 17.4  | 0.78
Dual Text       | Separate CLIP branch     | 140.0 | 0.75  | 0.50   | 20.7  | 0.74
DemoEnc         | Flat MLP                 | 80.6  | 0.50  | 0.70   | 17.2  | 0.70
HCN (no aux)    | No supervision           | 80.3  | 0.51  | 0.69   | 18.1  | 0.70
HCN + CFG       | CFG training             | 80.1  | 0.51  | 0.71   | 16.6  | 0.72
HCN (aux on μ)  | Supervision before proj. | 79.8  | 0.54  | 0.68   | 17.9  | 0.71
CompDiff        | Demo in HCN, aux on c    | 75.5  | 0.99  | 0.96   | 8.75  | 0.76
No uncertainty  | λ_KL = 0                 | 77.6  | 1.00  | 0.96   | 10.1  | 0.74
No L_comp       | λ_comp = 0.0             | 88.0  | 0.97  | 0.94   | 8.0   | 0.75
Strong L_comp   | λ_comp = 0.5             | 97.1  | 0.99  | 0.94   | 10.8  | 0.72
We find that auxiliary supervision must be applied on the output token c (after projection), not on μ (before projection; sex: 0.54) or via classifier-free guidance (sex: 0.51), confirming that supervision must be applied where the UNet receives the signal.

We next ablate the loss terms used for the HCN while keeping the architecture fixed (Eq. 4). Removing the uncertainty modeling slightly worsens FID and degrades age controllability (age RMSE worsens from 8 to 10.1), indicating that the variational latent provides a modest but consistent benefit. Removing the compositional consistency term substantially worsens FID (75.5 → 88.0) without clear gains in demographic accuracy, confirming that L_comp acts as a useful regularizer. Increasing its weight further degrades FID (104.2) with no additional fairness benefit, suggesting that our default setting λ_comp = 0.1 strikes a good balance.

4 Conclusion

We present CompDiff, a simple hierarchical intersectional conditioning framework for fair medical image diffusion inspired by the intuition of compositionality. By modifying representation structure rather than optimization weights, our method enables compositional generalization to rare demographic intersections and improves both generative fidelity and downstream fairness. Future work will explore more sophisticated conditioning mechanisms, such as graph-based interaction modeling or other structured representations.

While CompDiff improves subgroup equity, several limitations remain. First, fairness evaluation relies on quantitative metrics rather than clinical expert assessment. Second, hierarchical composition assumes structured demographic attributes and does not extend to continuous or unstructured attributes.
Finally, although zero-shot intersectional generalization improves, performance still degrades relative to well-represented groups, indicating that representation-level solutions do not fully eliminate data imbalance effects.

References

1. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inf. Process. Syst. 33, 6840-6851 (2020)
2. Song, Y., et al.: Solving inverse problems in medical imaging with score-based generative models. In: International Conference on Learning Representations (ICLR) (2021)
3. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684-10695 (2022)
4. Chambon, P., Bluethgen, C., Langlotz, C.P., Chaudhari, A.: RoentGen: Vision-language foundation model for chest X-ray generation. arXiv:2211.12737 (2022)
5. Bluethgen, C., et al.: A vision-language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. (2024)
6. Moroianu, S.L., Bluethgen, C., Chambon, P., Cherti, M., Delbrouck, J.-B., Paschali, M., Price, B., Gichoya, J., Jitsev, J., Langlotz, C.P., Chaudhari, A.S.: Improving performance, robustness, and fairness of radiographic AI models with finely-controllable synthetic data. arXiv:2508.16783 (2025)
7. Ktena, I., et al.: Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med. 30, 1166-1173 (2024)
8. Luo, Y., et al.: FairDiffusion: Enhancing equity in latent diffusion models via fair Bayesian perturbation. Sci. Adv. 11, eads4593 (2025). https://doi.org/10.1126/sciadv.ads4593
9. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)
10. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Adv. Neural Inf. Process. Syst. 30, 6629-6640 (2017)
11. Mei, S., et al.: RadImageNet: An open radiologic deep learning research dataset. Radiol. Artif. Intell. 4, e210315 (2022)
12. Boecking, B., et al.: Making the most of text semantics to improve biomedical vision-language processing. In: European Conference on Computer Vision (ECCV) (2022)
13. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multi-scale structural similarity for image quality assessment. In: Proc. Asilomar Conf. Signals, Syst. Comput., vol. 2, pp. 1398-1402 (2003)
14. Cohen, J., et al.: TorchXRayVision: A library of chest X-ray datasets and models. arXiv:2111.00595 (2021)
15. Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in chest X-ray disease detection models. eBioMedicine 89, 104467 (2023)
16. Luo, Y., Tian, Y., Shi, M., Elze, T., Wang, M.: FairVision: Equitable deep learning for eye disease screening via fair identity scaling. arXiv:2310.02492 (2024)
17. Tian, Y., Shi, M., Luo, Y., Kouhana, A., Elze, T., Wang, M.: FairSeg: A large-scale medical image segmentation dataset for fairness learning with fair error-bound scaling. In: International Conference on Learning Representations (ICLR) (2024)
18. Cohen, J., et al.: Age prediction from chest radiographs using deep learning. In: Proc. Mach. Learn. Res. 149, 39-53 (2021)
19. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I.Y., Ghassemi, M.: CheXclusion: Fairness gaps in deep chest X-ray classifiers. In: Proc. Mach. Learn. Res. 149, 232-243 (2021)
20. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.-Y., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1) (2019)
