On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a ch…

Authors: Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or

Omer Dahary*, Tel Aviv University, Israel and Snap Research, Israel
Benaya Koren*, Tel Aviv University, Israel
Daniel Garibi, Tel Aviv University, Israel and Snap Research, Israel
Daniel Cohen-Or, Tel Aviv University, Israel and Snap Research, Israel

[Figure 1: image grid for the prompt “People with 3D holograms”; top row: Flux-dev, bottom row: Ours.]
Fig. 1. Example results of our Contextual Space repulsion framework using Flux-dev. The base model (top) typically converges on a narrow set of visual solutions. By applying semantic intervention within the internal multi-modal attention channels, our approach (bottom) produces a diverse set of images with minimal computational overhead.

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer’s forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed.
Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern “Turbo” and distilled models where traditional trajectory-based interventions typically fail. Project page: https://contextual-repulsion.github.io/.

* Denotes equal contribution.

1 Introduction

The rapid evolution of Text-to-Image (T2I) generative models has ushered in a new era of high-fidelity visual synthesis, where models now exhibit unprecedented alignment with complex textual prompts [Esser et al. 2024; Podell et al. 2023; Rombach et al. 2022]. However, this progress has come at a significant cost: the reduction of generative diversity. As advanced generative models are increasingly optimized for precision and human preference, they tend to converge on a narrow set of “typical” visual solutions, a phenomenon often described as typicality bias [Teotia et al. 2025]. Diversity is no longer a secondary metric; it has become a central research problem addressed by a growing body of work [Jalali et al. 2025; Morshed and Boddeti 2025; Um and Ye 2025]. This is because the utility of generative AI depends on its ability to act as a creative partner that explores the vast manifold of human imagination. It should function as a generative engine rather than merely a sophisticated retrieval mechanism.

The diversity problem is fundamentally difficult due to the structural tension between quality and variety. High-quality generation currently relies on strong conditioning signals, most notably Classifier-Free Guidance (CFG) [Ho and Salimans 2022], which effectively sharpens the probability distribution around a single mode by suppressing nearby semantically valid alternatives.
Consequently, restoring diversity requires an efficient mechanism to overcome this bias without degrading the structural integrity of the image or losing semantic adherence.

[Figure 2: three schematic panels: (a) Upstream, (b) Downstream, (c) Ours.]
Fig. 2. Conceptual comparison of diversity strategies in dual-stream DiT architectures. Here $p^{(i)}$ denotes the prompt embedding for sample $i$, $z_t^{(i)}$ denotes the latent at timestep $t$ for sample $i$, and the red double-arrow icon indicates the point of diversity manipulation. (a) Upstream: Interventions on noise or prompt embeddings lack structural feedback from the emerging image. (b) Downstream: Repulsion in image latents acts on a fixed visual mode and can push samples off the data manifold, causing artifacts. (c) Ours: By applying on-the-fly repulsion within the Contextual Space (text-attention channels), we steer the model’s generative intent. This allows for a semantically driven intervention synchronized with the emergent visual structure.

Previous attempts to bridge the diversity-alignment gap can be categorized by their point of intervention within the denoising trajectory, as illustrated in Figure 2. Upstream methods (Figure 2a) attempt to solve the problem by altering initial conditions, such as noise seeds or prompt embeddings. However, these approaches are often decoupled from the actual generation process [Sadat et al. 2023]; to achieve semantic grounding, they must either rely on noisy intermediate estimates [Kim et al. 2025] or employ optimization that incurs significant computational overhead [Parmar et al. 2025; Um and Ye 2025]. Conversely, downstream methods (Figure 2b) enforce repulsion in the image latent space during denoising [Corso et al. 2023; Jalali et al.
2025]. While these can force variance, they often push samples outside the learned data manifold, resulting in catastrophic drops in visual fidelity and unnatural visual artifacts.

The core difficulty lies in an interventional trade-off: early interventions lack structural feedback, while late interventions face a committed visual mode. This is particularly acute in few-step “Turbo” models, where the generative path is decided almost instantly. Upstream methods require slow optimization to search for diversity-inducing initial conditions, while downstream repulsion arrives too late to steer the composition.

In this work, we present a novel approach that bypasses this trade-off by identifying and leveraging the Contextual Space (Figure 2c), which emerges inside the multimodal attention blocks of Diffusion Transformer (DiT) architectures [Esser et al. 2024; Labs 2024]. Unlike previous U-Net models where text conditioning remains a static external signal, these blocks facilitate a dynamic bidirectional exchange between text and image tokens, continuously updating the text representations in response to the evolving image. This interaction creates an “enriched” semantic representation that is both aware of the prompt and synchronized with emergent visual details [Helbling et al. 2025].

By leveraging these enriched textual representations, our approach steers the model’s generative intent to overcome the CFG mode collapse. By targeting these representations rather than raw pixels, we preserve samples within the learned data manifold, avoiding the artifacts common in downstream interventions. To achieve this, we apply repulsion to the tokens as they pass between multimodal attention blocks. This intervention is performed on-the-fly during the transformer’s forward pass, at a stage where the emergent representation is already structurally informed but the final composition is not yet fixed.
Intervening while the representation is still flexible allows for steering that remains semantically driven yet image-aware. This enables the model to explore diverse paths while maintaining natural, high-quality results.

To demonstrate the efficacy of our approach, we conduct extensive experiments across multiple DiT-based architectures. We evaluate our results on the COCO benchmark using metrics for both visual quality and distributional variety. Our results show that repulsion in the Contextual Space consistently produces richer diversity without the mode collapse or semantic misalignment characteristic of prior work. Furthermore, we demonstrate that our method is uniquely efficient, requiring only a small computational overhead and no additional memory, making it compatible with the rapid inference requirements of modern distilled models.

2 Related Work

Diffusion transformers. While foundational diffusion models predominantly utilized UNet-based architectures [Podell et al. 2023; Ramesh et al. 2022; Razzhigaev et al. 2023; Rombach et al. 2022; Saharia et al. 2022], contemporary state-of-the-art text-to-image systems have largely shifted toward Diffusion Transformers (DiTs) as their backbone [Esser et al. 2024; Kong et al. 2025; Labs 2024; Labs et al. 2025]. A key distinction lies in the conditioning mechanism: whereas UNets typically incorporate text via cross-attention layers, DiTs process text and image tokens concurrently within the transformer. This architecture employs multimodal attention blocks to facilitate bidirectional interaction, ensuring a unified integration of visual and textual information throughout the generation process. A growing body of research has successfully employed this architecture across diverse downstream tasks [Avrahami et al. 2025; Dalva et al. 2024; Garibi et al. 2025; Kamenetsky et al. 2025; Labs et al. 2025; Tan et al. 2025; Zarei et al.
2025].

Research addressing the diversity-alignment gap in Text-to-Image (T2I) models generally falls into two categories based on the stage and level of intervention: upstream methods, which modify conditions prior to or in the earliest stages of the generative process, and downstream methods, which manipulate the image latents throughout the denoising trajectory.

Upstream Interventions. Upstream methods attempt to induce diversity by optimizing input conditions, namely the initial noise or text conditioning, before a stable image structure emerges. Purely decoupled interventions like CADS [Sadat et al. 2023] inject prompt-agnostic noise into text embeddings, which often leads to semantic drifting due to a lack of structural feedback. To bridge this, methods like CNO [Kim et al. 2025] utilize the very first timestep’s $\hat{x}_0$ prediction to force divergence, yet these estimates are frequently structurally unformed at high noise levels, providing an unstable signal for conceptual variety. Similarly, optimization-based regimes such as MinorityPrompt [Um and Ye 2025] and Scalable Group Inference (SGI) [Parmar et al. 2025] seek diversity-inducing initial conditions through iterative search; however, their heavy computational overhead makes them increasingly impractical for real-time applications or integration with fast-inference distilled models.

Downstream Interventions. Downstream methods manipulate the latent trajectory throughout the denoising process, either through interacting particle systems or modified guidance schedules. The former, pioneered by Particle Guidance (PG) [Corso et al. 2023], uses kernel-based repulsion in the image latent space to force variance between samples, with subsequent works focusing on improving repulsion loss objectives [Askari Hemmat et al. 2024; Jalali et al. 2025; Morshed and Boddeti 2025].
Despite these refinements, these methods operate on non-semantic representations, repelling low-level pixel-space features rather than semantic content. Importantly, semantic concepts in the image latent space are spatially entangled and not aligned across samples, so the same high-level attribute may correspond to different spatial locations and configurations in different generations. As a result, repulsion in this space often pushes samples outside the learned manifold, leading to unnatural artifacts. In addition, such approaches lack sufficient trajectory depth to remain effective in modern distilled “Turbo” models; since the generative path is decided almost instantly, the remaining denoising trajectory is insufficient for late-stage repulsion to steer the model toward diverse modes.

Alternatively, scheduling-based approaches like Interval Guidance [Kynkäänniemi et al. 2024] preserve variety by modulating the CFG scale during denoising. However, because these rescaling schedules are fixed and independent of the model’s internal state, they often reduce the prompt’s influence before the model has sufficiently established semantic alignment to the prompt.

A recurring limitation of these approaches is that their steering signals, whether derived from raw latents or external encoders, lack the semantic coherence necessary for meaningful control during the critical early stages of denoising. This forces an unfavorable trade-off: upstream intervention must incur significant computational overhead to find valid diversity-inducing paths, while downstream interventions occur on a committed visual mode where the composition is already fixed, often producing noise-level variance that pushes samples outside the learned manifold and results in unnatural artifacts. Our work departs from these by identifying a Contextual Space within Diffusion Transformers that is both semantically flexible and structurally informed.
This allows us to redirect the guidance trajectory once the bidirectional exchange between text and image tokens has established a stable semantic signal, but before the model has fully converged on a specific generative outcome.

3 Method: Repulsion in the Contextual Space

In this section, we formalize our approach to generative diversity by shifting the intervention focus to the Contextual Space. As identified in Section 2, the core difficulty of existing methods lies in the timing and location of the repulsion: upstream methods act on unformed noise, while downstream methods act on a rigid latent manifold. Our central insight is that the Contextual Space, inherent to multimodal transformer architectures such as DiTs, provides an effective environment for diversity interventions because it is structurally informed yet conceptually flexible.

3.1 Defining the Contextual Space

The Contextual Space is the high-dimensional manifold formed within the Multimodal Attention (MM-Attention) blocks of a DiT. Unlike the static text embeddings used in U-Net architectures, the DiT processing flow facilitates a bidirectional exchange between text features $f_T$ and image features $f_I$. In each transformer block $l$, the resulting tokens undergo a structural transformation:

$$\hat{f}_T^{(l)},\ \hat{f}_I^{(l)} = \text{MM-Attn}\left(f_T^{(l-1)},\ f_I^{(l-1)}\right). \tag{1}$$

In this interaction, the text features $f_T$ guide the image tokens toward the prompt’s semantic requirements. Simultaneously, the image features $f_I$ provide feedback regarding the spatial composition and emerging visual details, which the text features absorb to become uniquely tied to the specific image being formed. We therefore identify the resulting enriched text tokens $\hat{f}_T^{(l)}$ as the primary elements of the Contextual Space.

A key advantage of this space is its inherent token ordering.
Unlike the image latent space, where specific semantic content can shift spatially across different samples, the Contextual Space maintains a fixed semantic alignment across the sequence index. This facilitates a consistent representation where each token index generally represents the same conceptual component across the entire batch, largely independent of its realized placement in the emergent image structure.

3.2 The Mechanism of Contextual Repulsion

We illustrate the positioning of our intervention in Figure 2c. Our key insight is that applying repulsion within the Contextual Space allows for the manipulation of generative intent. By enforcing distance between batch samples in this space, we steer the model’s high-level planning before it commits to a specific visual mode. To achieve this, we adopt the particle guidance framework [Corso et al. 2023], which treats a batch of $B$ samples as interacting particles. However, unlike prior work that applies guidance to the image latents $z_t$ (Figure 2b), we apply the repulsive forces directly to the Contextual Space tokens $\hat{f}_T$ (Figure 2c).

Since the conditioning for each sample is initialized from the same unmodified prompt encoding at every timestep, the intervention mitigates the risk of permanent semantic drift. This common starting point promotes a state where contextual features remain closely aligned to the original prompt and directly comparable across the batch throughout the trajectory, allowing the repulsion to act as a force that differentiates how the same prompt is visually realized.

A critical advantage of our approach is that these forces are computed on-the-fly. Because we intervene directly on the internal activations, the method does not require backpropagating through the model layers, making it significantly more computationally efficient than optimization-based methods. Within each transformer block, we apply $M$ inner-block iterations to iteratively refine the token positions.
Following the gradient-based guidance formulation [Corso et al. 2023], the updated state of the contextual tokens for a sample $i \in \{1, \ldots, B\}$ after each iteration is given by:

$$\hat{f}_{T,i}^{(l)\prime} = \hat{f}_{T,i}^{(l)} + \frac{\eta}{M}\, \nabla_{\hat{f}_{T,i}^{(l)}}\, \mathcal{L}_{div}\left(\{\hat{f}_{T,j}^{(l)}\}_{j=1}^{B}\right), \tag{2}$$

where $\eta$ is the overall repulsion scale and $\mathcal{L}_{div}$ is a diversity loss defined over the batch of $B$ samples. To maintain diversity throughout the trajectory, we apply this repulsion across all transformer MM-blocks. However, since the initial stages of the denoising trajectory are the most crucial for the eventual semantic meaning and global composition [Balaji et al. 2023; Cao et al. 2025; Dahary et al. 2025, 2024; Huberman et al. 2025; Patashnik et al. 2023; Yehezkel et al. 2025], and are also where strong guidance signals such as CFG most strongly bias the generative path, we restrict the intervention to a chosen interval of the first few timesteps.

3.3 Diversity Objective

The Contextual Space encodes global semantic intent shared across the batch, making diversity objectives based on batch-level similarity more appropriate than token-wise or local measures. While our framework is flexible and can adopt various diversity losses defined in prior work [Jalali et al. 2025; Morshed and Boddeti 2025], we specifically utilize the Vendi Score [Askari Hemmat et al. 2024; Friedman and Dieng 2022] as our primary objective. The Vendi Score provides a principled way to measure the effective number of distinct samples in a batch by considering the eigenvalues of a similarity matrix. Formally, it is defined as the exponential of the von Neumann entropy of that matrix.

For simplicity, we represent each sample $i$ at block $l$ as a single vector $\mathbf{c}_i^{(l)} \in \mathbb{R}^{ND}$ by flattening the sequence of $N$ contextual tokens, each of dimension $D$.
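As a rough illustration of the Vendi Score described above, the following NumPy sketch flattens each sample's tokens into one vector, builds a similarity matrix, and exponentiates the entropy of its eigenvalues. The function name `vendi_score`, the array layout `(B, N, D)`, the `eps` threshold, and the choice of a cosine-similarity kernel are assumptions made for this example; it is not the authors' implementation.

```python
import numpy as np


def vendi_score(tokens: np.ndarray, eps: float = 1e-12) -> float:
    """Effective number of distinct samples in a batch.

    tokens: array of shape (B, N, D), i.e. B samples, N tokens, D channels
            (hypothetical layout chosen for this sketch).
    """
    batch_size = tokens.shape[0]
    # Flatten each sample's N tokens into a single vector of length N * D.
    flat = tokens.reshape(batch_size, -1)
    # Cosine-similarity kernel (assumed here): unit-normalize, then inner products.
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    kernel = unit @ unit.T
    # Dividing by B makes the eigenvalues sum to one (the diagonal is all ones).
    eigvals = np.linalg.eigvalsh(kernel / batch_size)
    # Drop numerically zero eigenvalues; they contribute nothing to the entropy.
    eigvals = eigvals[eigvals > eps]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Four identical samples behave like a single sample ...
same = np.tile(np.arange(1.0, 7.0).reshape(1, 2, 3), (4, 1, 1))
# ... while four mutually orthogonal samples count as four.
orthogonal = np.eye(4).reshape(4, 2, 2)
print(vendi_score(same), vendi_score(orthogonal))
```

The two toy batches bracket the range of the score: it approaches 1 when all samples coincide and approaches $B$ when they are mutually orthogonal.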
For a batch of size $B$ represented by these flattened contextual vectors $\{\mathbf{c}_i^{(l)}\}_{i=1}^{B}$, we first define a kernel matrix $K \in \mathbb{R}^{B \times B}$, where each entry $K_{ij}$ represents the similarity between samples $i$ and $j$. In our work, we use the cosine similarity as our kernel:

$$K_{ij} = \frac{\langle \mathbf{c}_i^{(l)}, \mathbf{c}_j^{(l)} \rangle}{\|\mathbf{c}_i^{(l)}\|\, \|\mathbf{c}_j^{(l)}\|}. \tag{3}$$

To maximize diversity, we compute the eigenvalues $\{\lambda_k\}$ of the normalized kernel $\tilde{K} = \frac{1}{B} K$ and define our diversity objective $\mathcal{L}_{div}$ as the von Neumann entropy of $\tilde{K}$, which the ascent step in Eq. (2) increases:

$$\mathcal{L}_{div} = -\sum_{k=1}^{B} \lambda_k \log \lambda_k. \tag{4}$$

This objective effectively pushes the tokens in the Contextual Space to span a higher-dimensional manifold, preventing the semantic collapse typically induced by CFG.

4 The Contextual Space

In this section, we empirically examine the properties of the Contextual Space by analyzing how internal representations behave under controlled interpolation and extrapolation. We focus on how semantic structure is preserved or degraded when steering representations in two internal spaces of the DiT: the VAE latent space and the contextual (enriched text) token space. The goal is to characterize how each of these spaces reflects semantic variation when multiple samples are generated from the same prompt, and to assess their suitability for diversity control without introducing visual artifacts.

[Figure 3: two panels, for the prompts “A mythical creature” and “A person with their pet”; columns: Target, Interpolation, Source, Extrapolation; rows: Contextual Space, Latent Space.]
Fig. 3. Comparison of interpolation and extrapolation between the internal representations of two images. Intermediate frames are generated by denoising the source image while linearly blending its internal features with those of the target; extrapolation extends this vector beyond the endpoints. While Latent Space interpolation leads to structural blurring and artifacts due to spatial misalignment, the Contextual Space maintains high visual fidelity.
This demonstrates that the Contextual Space enables smooth semantic transitions by decoupling generative intent from fixed spatial structures.

To examine this, we conduct an interpolation and extrapolation experiment across these two internal representation spaces. We consider two prompts, “a person with their pet” and “a mythical creature”. For each prompt, we generate two samples using different initial noise seeds, which we designate as a source image and a target image. Maintaining the initial noise of the source image, we intervene during the denoising process by replacing its internal representation with a linear combination of the source and target representations

$$\mathbf{h}_{interp} = \mathbf{h}_{source} + \alpha\, (\mathbf{h}_{target} - \mathbf{h}_{source}), \tag{5}$$

where $\mathbf{h}$ represents the representation in a given space, and $\alpha$ is the steering coefficient. We compare this behavior across two distinct spaces: the VAE Latent Space ($z_t$) and our proposed Contextual Space (enriched text tokens $\hat{f}_T$).

As illustrated in Figure 3, the results highlight a fundamental difference in how these spaces handle semantic information. In the VAE Latent Space, representations are tied to the specific spatial grid and pixel-level layout of the sample. Since the source and target images are spatially unaligned (exhibiting different poses and compositions), interpolating between them results in a structural blur. The model attempts to resolve two conflicting geometries simultaneously, leading to incoherent overlays and ghostly artifacts. More critically, extrapolating in the VAE Latent Space quickly pushes the latents outside the learned data manifold, resulting in severe artifacts.

In contrast, performing the same operation within the Contextual Space yields a smooth semantic transition.
Rather than blending pixels or geometries, the model reallocates visual elements in a coherent manner, gradually modifying appearance and composition while maintaining a sharp, high-fidelity structure. For instance, as we move from the source image toward the target, we observe a meaningful evolution in high-level appearance attributes of the subject, such as facial features and overall visual style, which shift naturally from the source toward the target. In the bottom example, this transition applies coherently to each subject independently, with both the woman and the accompanying pet undergoing meaningful semantic changes (e.g., the pet gradually shifting from a dog-like to a cat-like appearance). Throughout this interpolation, the pretrained weights retain the generated images on-manifold, preserving structural integrity and visual plausibility.

Furthermore, the Contextual Space maintains its integrity during extrapolation, where the shifts remain semantically consistent with the direction of the steering vector ($\mathbf{h}_{target} - \mathbf{h}_{source}$). As shown in the right-most columns of Figure 3, applying extrapolation ($\alpha < 0$) relative to the target does not lead to manifold collapse. Instead, it generates a semantically meaningful extrapolation: in the top example, extrapolation progressively removes the creature’s horns and beast-like features, producing a plausible semantic evolution rather than noise or collapse. In the bottom example, the woman’s features evolve toward a darker tone, effectively moving away from the characteristics of the reference. Simultaneously, the pet’s appearance is modified in a logically consistent manner, such as deepening the coat color and shifting the ears to a more drooping shape.

These observations suggest that the Contextual Space encodes global semantic features independently of a fixed spatial grid.
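The blending rule of Eq. (5), including the extrapolation regime with $\alpha < 0$ discussed above, can be sketched in a few lines of NumPy. The function name `steer`, the array shapes, and the particular values of $\alpha$ are illustrative assumptions; how the blended representation is written back into a real DiT forward pass is not shown here.

```python
import numpy as np


def steer(h_source: np.ndarray, h_target: np.ndarray, alpha: float) -> np.ndarray:
    """Eq. (5): move the source representation along (h_target - h_source)."""
    return h_source + alpha * (h_target - h_source)


# Toy stand-ins for two representations of identical shape (hypothetical sizes).
rng = np.random.default_rng(0)
h_src = rng.normal(size=(8, 16))
h_tgt = rng.normal(size=(8, 16))

# alpha = 0 keeps the source, alpha = 1 reaches the target,
# values in between interpolate, and alpha < 0 extrapolates away from the target.
for alpha in (-0.5, 0.0, 0.5, 1.0):
    h = steer(h_src, h_tgt, alpha)
    print(alpha, float(np.linalg.norm(h - h_tgt)))
```

The same function applies to either space compared in this section; only the tensor being blended changes ($z_t$ for the VAE Latent Space, $\hat{f}_T$ for the Contextual Space).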
Intervening in this space enables the modification of high-level attributes while the transformer’s attention mechanisms maintain the structural coherence of the output.

5 Experiments

To evaluate the generality of our approach, we conduct experiments across three state-of-the-art Diffusion Transformer (DiT) architectures that span distinct design choices and sampling regimes: Flux-dev [Labs 2024], a guidance-distilled model; SD3.5-Turbo, distilled for high-speed, few-step inference; and SD3.5-Large [Esser et al. 2024], a standard non-distilled model. Together, these models cover a broad spectrum of modern DiT variants, allowing us to demonstrate that Contextual Space repulsion is broadly applicable and not tied to a specific architecture, training regime, or sampling budget.

We compare our Contextual Space repulsion against representative diversity-enhancing baselines, including upstream methods that modify initial conditions such as CADS [Sadat et al. 2023] and SGI [Parmar et al. 2025], as well as downstream methods that intervene in the latent space, including Particle Guidance [Corso et al. 2023] and SPARKE [Jalali et al. 2025]. Full implementation details and hyperparameter settings are provided in Appendix A.

[Figure 4: image grids for the prompts “Kids with paper airplanes” and “A ballet dancer on stage”; top rows: Flux, bottom rows: Ours.]
Fig. 4. Qualitative results. For each prompt, we compare the base model results (top) to our results (bottom).

5.1 Qualitative Results

Flux-dev results. We compare our results with the base Flux-dev model in Figures 4 and 11; additional comparisons with Flux-dev, SD3.5-Large and SD3.5-Turbo are provided in Appendix B. Even when sampled with different random initial noises, the base model typically produces a very narrow and repetitive range of outputs for many prompts. As shown in Figure 11, our method alleviates typicality biases, such as the barely visible or harsh lighting seen in the “musician” and “scientist” examples.
Furthermore, it generates a diverse array of compositions, arrangements, and camera angles for the “painter” and “stadium” prompts.

Baseline comparisons. We present qualitative comparisons against the baselines in Figure 12. As illustrated, downstream methods like PG and SPARKE often introduce visual artifacts because they intervene directly in the VAE latent space. For instance, in the “red bus” example, PG fails to modify the image structure, while SPARKE succeeds in moving objects but leaves patterned “holes” in their original locations.

In contrast, upstream methods maintain higher image quality, though they face different trade-offs. CADS frequently leads to semantic drift, where diversity is achieved through weak prompt alignment (e.g., replacing “photographs” with people, or a “phoenix” with a bonfire). SGI, which filters a large set of initial noise candidates through optimization, achieves both high quality and prompt adherence by minimizing intervention. However, SGI often struggles to produce high variation for prompts where the base model lacks inherent diversity, resulting in repetitive subject appearances and compositions (e.g., the “red bus”).

Our method achieves richer diversity even with challenging prompts, without sacrificing alignment or quality. Interestingly, the axes of variation adapt to each prompt: for the “phoenix,” the model alternates between artistic styles; for the “bus,” it varies weather and pose; and for the “camera with old photographs” and “wolf pack,” it generates unique compositions and object arrangements.

[Figure 5: input image and edited results for the prompt “a person running a marathon”; rows: Flux Kontext, Ours.]
Fig. 5. Integration with image editing models. We demonstrate that our method can be successfully integrated into Flux-Kontext to generate high-quality diverse results.

Example result on Flux-Kontext.
In Figure 5, we demonstrate that our method generalizes beyond text-to-image generation and can be applied out of the box to image editing models, specifically Flux Kontext [Labs et al. 2025]. Perhaps surprisingly, this requires no modification to the model or to our intervention strategy: we apply the exact same Contextual Space repulsion within the editing instruction stream. While the base editing model produces nearly identical edits across different random seeds, our approach yields diverse yet coherent edit realizations, all while preserving the intended edit semantics and maintaining the visual integrity of the original image. This result highlights that contextual repulsion operates at a level of abstraction that is compatible with both generation and editing paradigms, despite being developed specifically for text-to-image models.

5.2 Quantitative Results

Diversity-Quality trade-off. We evaluated our method using 1,000 prompts sampled from the MS-COCO 2017 validation set, generating four images per prompt for a total of 4,000 images per configuration. To provide a holistic view of the diversity-quality trade-off, we utilize the Vendi Inception Score [Friedman and Dieng 2022; Szegedy et al. 2017] to measure high-level semantic diversity alongside three primary quality and alignment axes: ImageReward [Xu et al. 2023] for human preference, VQAScore [Lin et al. 2024] for fine-grained prompt adherence, and Kernel Inception Distance (KID) [Bińkowski et al. 2018] for distributional fidelity. By plotting the Pareto frontier of the diversity score versus each of these metrics, we can analyze how effectively each method navigates the tension between generative variety and visual fidelity.

[Figure 6: three Pareto-frontier plots.]
Fig. 6. Quantitative evaluation. Pareto frontiers comparing our method against baseline methods using Flux-dev.
We evaluate the trade-off between semantic diversity (Vendi Score) and three performance axes: (Left) Human Preference [ImageReward ↑], (Middle) Prompt Alignment [VQAScore ↑], and (Right) Distributional Fidelity [KID ↓]. Our method (red) achieves a superior frontier across all metrics.

Table 1. Runtime comparison for generating a group of four images. Our method provides a significant speedup over optimization-based diversity methods like SGI while maintaining a low overhead relative to the base model.

Method               SD3.5-Large   SD3.5-Turbo   Flux-dev
Base Model           13.83s        4.18s         10.34s
Ours (Contextual)    18.12s        5.52s         12.80s
SGI, 8 Candidates    66.79s        13.15s        47.47s
SGI, 16 Candidates   76.79s        23.73s        56.32s
SGI, 32 Candidates   101.44s       46.15s        75.39s
SGI, 64 Candidates   145.14s       91.30s        113.99s

To map the Pareto frontiers, we systematically vary the control hyperparameters for each baseline: the guidance scale for PG and SPARKE, the noise intensity for CADS, and the number of initial noise candidates for SGI. Specific hyperparameter configurations are provided in Appendix A. As shown in Figure 6, our method achieves a superior trade-off on Flux-dev. Notably, while our method exceeds the performance of SGI, the strongest baseline, it does so with drastically lower computational overhead (see the Runtime paragraph in Section 5.2). Results for additional models, including SD3.5-Turbo and SD3.5-Large, are provided in Appendix C.

Runtime. Many existing diversity methods rely on costly downstream signals, either through gradient-based optimization or by selecting from large pools of candidate latents. Both strategies impose substantial time overhead. By avoiding these mechanisms entirely, our approach provides a markedly more efficient solution, increasing runtime by only 20%–30% relative to the base model (Table 1).

User study. Standard quantitative metrics often fail to capture the nuances of generative diversity.
These evaluators are typically trained on datasets dominated by common visual patterns, leading them to favor “typical” or average cases as more aesthetically pleasing or prompt-adherent. Consequently, methods that successfully push for greater diversity and creative interpretation may be unfairly penalized by these metrics, even when the resulting variations are highly desirable to human users. To address this limitation and provide a more meaningful assessment of our method, we conducted a user study.

Fig. 7. Overall user preference comparison. Distribution of user choices comparing our method with five competing approaches. Bars indicate the percentage of cases in which users preferred our results (green), preferred competing methods (red), or rated both equally (gray).

We utilized ChatGPT to generate 40 diverse prompts across various categories. For each prompt, participants were presented with two batches of 8 images (16 images total): one batch generated by our method and the other by a competing method or the base model (Flux-dev). Participants were tasked with performing a side-by-side comparison to determine which batch: (i) exhibited greater visual and semantic diversity; (ii) maintained higher image quality; (iii) demonstrated better prompt adherence; and (iv) was preferred overall.

We collected 450 responses from 45 participants. Figure 7 reports the overall user preference results of this study, with the full preference table provided in Appendix C. Overall, our method achieves higher user preference than all competing approaches. The only exception is SGI, where preferences are closely matched, with a slight advantage for our method. Importantly, these gains are achieved with minimal runtime overhead, as demonstrated in Table 1.
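The overhead figures discussed in the Runtime paragraph can be recomputed directly from Table 1. The short sketch below does so for our method and for the cheapest SGI setting; the dictionary layout and the function names are illustrative choices, not part of the paper. On the numbers in Table 1, our method's overhead works out to roughly 24% to 32% across the three models, close to the range quoted in the Runtime paragraph, whereas SGI with 8 candidates costs several times the base runtime.

```python
# Runtimes in seconds for a group of four images, copied from Table 1.
RUNTIME_S = {
    "SD3.5-Large": {"base": 13.83, "ours": 18.12, "sgi_8": 66.79},
    "SD3.5-Turbo": {"base": 4.18, "ours": 5.52, "sgi_8": 13.15},
    "Flux-dev": {"base": 10.34, "ours": 12.80, "sgi_8": 47.47},
}


def overhead_pct(method_s: float, base_s: float) -> float:
    """Extra runtime of a method relative to the base model, in percent."""
    return 100.0 * (method_s - base_s) / base_s


def overhead_table(runtimes: dict) -> dict:
    """Map each model to {method: overhead in percent}, skipping the base row."""
    return {
        model: {
            name: overhead_pct(seconds, row["base"])
            for name, seconds in row.items()
            if name != "base"
        }
        for model, row in runtimes.items()
    }


if __name__ == "__main__":
    for model, row in overhead_table(RUNTIME_S).items():
        print(model, {name: f"{pct:.1f}%" for name, pct in row.items()})
```

The same helper can be pointed at the 16-, 32-, and 64-candidate SGI rows to see how the gap widens with the candidate pool.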
5.3 Ablation Studies

We evaluate the impact of the repulsion scale and the specific representation space used for intervention below, with further hyperparameter analyses provided in Appendix D.

Repulsion scale ablation. In Figure 8, we ablate the effect of the repulsion scale $\eta$. The top row ($\eta = 0$) represents the base Flux-dev generations, which exhibit a narrow interpretation of the prompt; each image displays a similar-looking house in nearly identical environments. In each subsequent row, we show the results of our method with an increasing repulsion scale. As can be seen, higher values of $\eta$ generally yield greater diversity, introducing structural changes like adding a tower to the house, altering the landscape with a lake, or shifting the scene's season.

Fig. 8. Ablation of the repulsion scale $\eta$. Rows, top to bottom: $\eta = 0$, $5e10$, $1e11$, $2.5e11$, $4e11$. Prompt: “A breathtaking view of a distant house in beautiful scenery”. We visualize the impact of the repulsion scale on our results. At $\eta = 0$ (top row), the base model exhibits low diversity, producing similar architectural styles and environments across multiple seeds. As $\eta$ increases, our Contextual Space repulsion introduces progressively larger variations, while maintaining high image quality and prompt alignment.

Repulsion space ablation. To isolate the efficacy of intervening in the Contextual Space ($\hat{f}_T$), we compare our framework against an identical repulsion mechanism applied instead to the image attention tokens ($\hat{f}_I$) within the multimodal blocks (i.e., the dual-stream blocks in Flux). As illustrated in Figure 9, repulsion in the Contextual Space produces a significantly more robust Pareto frontier, yielding superior human preference (ImageReward), distributional fidelity (KID), and prompt alignment (VQAScore). Notably, while the image-token baseline exhibits sharp performance degradation as diversity increases, our method maintains a shallower decline across all metrics.
This suggests that the Contextual Space is better suited for navigating semantic diversity while strictly preserving the integrity of samples within the learned conditional manifold.

Figure 10 provides qualitative examples. As can be seen, applying repulsion in the image token space ($\hat{f}_I$) often results in stagnant layouts due to its spatial rigidity; this forces the repulsion to artificially promote diversity by modifying local textures, leading to artifacts such as the sea blending unnaturally into the road in the “street” example. In contrast, intervening in the contextual space ($\hat{f}_T$) tends to promote varied compositions while maintaining alignment and quality.

Fig. 9. Ablation of Repulsion Space. Pareto frontiers comparing repulsion applied to text attention tokens (Contextual Space, $\hat{f}_T$) versus image attention tokens ($\hat{f}_I$) within the Flux-dev architecture. We evaluate the trade-off between semantic diversity (Vendi Score) and three performance axes: (Left) Human Preference [ImageReward ↑], (Middle) Prompt Alignment [VQAScore ↑], and (Right) Distributional Fidelity [KID ↓]. Our method (red) achieves a superior frontier across all metrics.

Fig. 10. Qualitative Ablation of Repulsion Space. Prompts: “Two pieces of bread with a leafy green on top of it” and “A city street scene with a green bus coming up a street, with ocean”. For each prompt, we compare repulsion applied in the image attention space (Image) versus our Contextual Space (Contextual). While image-space repulsion is limited by spatial rigidity, our method achieves more varied compositions.

6 Conclusions

At a high level, this work highlights the Contextual Space in Diffusion Transformers as a particularly effective place to intervene when aiming for diversity. The Contextual Space sits between text and image: the representations already encode rich semantic intent shaped by the emerging image, yet they are not spatially locked in. Unlike image latents, this space is not tied to a spatial grid, so samples can be pushed apart semantically without tearing geometry or introducing visual artifacts. At the same time, unlike early text embeddings, it is structurally informed, meaning that interventions meaningfully influence what the model actually generates. Applying on-the-fly repulsion in this space allows diversity to be increased in a controlled way, without sacrificing visual quality or relying on heavy optimization with significant computational cost. More broadly, this points to the importance of intervening at the right representational level, where decisions are still flexible, but already grounded in the image being formed.

Limitations. Contextual repulsion increases diversity but does not provide direct control over which attributes will vary, and may sometimes favor coarse semantic changes over fine, user-specified ones. In addition, the intervention is focused on early to mid stages of generation; how to best coordinate it with later stages, or combine it with other control mechanisms, remains an open question.

Future directions. An interesting direction for future work is to investigate whether a user-provided textual cue, such as “color” or “size”, can be used to guide the repulsion along a specific semantic direction in the Contextual Space. Instead of encouraging diversity in an unconstrained manner, the idea would be to bias the repulsive forces so that samples spread primarily along attributes associated with the given word. This could enable a more controlled and interpretable form of diversity, where variation is focused on selected semantic aspects while other parts of the generation remain stable.
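To make the phrase "pushing samples apart" in a feature space concrete, the following is a minimal, generic sketch of one repulsion step on a batch of feature vectors. It is not the paper's objective: the RBF kernel, the parameters `eta` and `sigma`, and the function name are assumptions chosen for illustration, and the vectors merely stand in for per-sample token features. The step performs gradient descent on the sum of pairwise kernel similarities, so each sample moves away from its neighbours in proportion to how similar they currently are.

```python
import math


def rbf_repulsion_step(features, eta=0.1, sigma=1.0):
    """One gradient-descent step on L = sum_{i<j} exp(-|x_i - x_j|^2 / (2 sigma^2)).

    features: list of equal-length lists of floats, one vector per sample.
    Returns new vectors; the inputs are left untouched.
    (Hypothetical helper for illustration, not the authors' implementation.)
    """
    updated = []
    for i, x_i in enumerate(features):
        push = [0.0] * len(x_i)
        for j, x_j in enumerate(features):
            if i == j:
                continue
            diff = [a - b for a, b in zip(x_i, x_j)]
            k_ij = math.exp(-sum(d * d for d in diff) / (2.0 * sigma ** 2))
            # -dk_ij/dx_i = k_ij * (x_i - x_j) / sigma^2, which points away from x_j.
            for axis, d in enumerate(diff):
                push[axis] += k_ij * d / sigma ** 2
        updated.append([a + eta * p for a, p in zip(x_i, push)])
    return updated


if __name__ == "__main__":
    batch = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(rbf_repulsion_step(batch))
```

In a setting like the one described above, such a step would be applied to the intermediate text-token features of each sample in the batch rather than to toy 2-D points, and only during part of the sampling trajectory.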
Acknowledgments

We would like to thank Or Patashnik, Yuval Alaluf, Nir Goren, Maya Vishnevsky, Sara Dorfman, Shelly Golan, Saar Huberman, and Jackson Wang for their early feedback and insightful discussions. We also thank the anonymous reviewers for their thorough and constructive comments, which helped improve this work.

References

Reyhane Askari Hemmat, Melissa Hall, Alicia Sun, Candace Ross, Michal Drozdzal, and Adriana Romero-Soriano. 2024. Improving geo-diversity of generated images with contextualized vendi score guidance. In European Conference on Computer Vision. Springer, 213–229.
Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. 2025. Stable Flow: Vital Layers for Training-Free Image Editing. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7877–7888. doi:10.1109/cvpr52734.2025.00738
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. 2023. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324 [cs.CV] https://arxiv.org/abs/2211.01324
Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401 (2018).
Yu Cao, Zengqun Zhao, Ioannis Patras, and Shaogang Gong. 2025. Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts. arXiv:2503.16218 [cs.CV] https://arxiv.org/abs/2503.16218
Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi Jaakkola. 2023. Particle guidance: non-iid diverse sampling with diffusion models. arXiv preprint arXiv:2310.13102 (2023).
Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. 2025. Be Decisive: Noise-Induced Layouts for Multi-Subject Generation. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–12.
Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. 2024. Be yourself: Bounded attention for multi-subject text-to-image generation. In European Conference on Computer Vision. Springer, 432–448.
Yusuf Dalva, Kavana Venkatesh, and Pinar Yanardag. 2024. FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers. arXiv:2412.09611 [cs.CV] https://arxiv.org/abs/2412.09611
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.
Dan Friedman and Adji Bousso Dieng. 2022. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410 (2022).
Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. 2025. TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space. arXiv:2501.12224 [cs.CV] https://arxiv.org/abs/2501.12224
Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, and Duen Horng Chau. 2025. Conceptattention: Diffusion transformers learn highly interpretable features. arXiv preprint arXiv:2502.04320 (2025).
Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022).
Saar Huberman, Or Patashnik, Omer Dahary, Ron Mokady, and Daniel Cohen-Or. 2025. Image Generation from Contextually-Contradictory Prompts. arXiv preprint arXiv:2506.01929 (2025).
Mohammad Jalali, LEI Haoyu, Amin Gohari, and Farzan Farnia. 2025.
SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Ronen Kamenetsky, Sara Dorfman, Daniel Garibi, Roni Paiss, Or Patashnik, and Daniel Cohen-Or. 2025. SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder. arXiv:2510.05081 [cs.GR] https://arxiv.org/abs/2510.05081
Byungjun Kim, Soobin Um, and Jong Chul Ye. 2025. Diverse Text-to-Image Generation via Contrastive Noise Optimization. arXiv preprint arXiv:2510.03813 (2025).
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, and Caesar Zhong. 2025. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv:2412.03603 [cs.CV]
Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. 2024. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems 37 (2024), 122458–122483.
Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux.
Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. 2025. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv preprint arXiv:2506.15742 (2025).
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. 2024. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision. Springer, 366–384.
Mashrur M Morshed and Vishnu Boddeti. 2025. DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows. In Proceedings of the Computer Vision and Pattern Recognition Conference. 23303–23312.
Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, and Jun-Yan Zhu. 2025. Scaling Group Inference for Diverse and High-Quality Generation. arXiv preprint arXiv:2508.15773 (2025).
Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, and Daniel Cohen-Or. 2023. Localizing Object-level Shape Variations with Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023).
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125 [cs.CV]
Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, and Denis Dimitrov. 2023. Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion. arXiv:2310.03502 [cs.CV] https://arxiv.org/abs/2310.03502
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, and Romann M Weber. 2023. CADS: Unleashing the diversity of diffusion models through condition-annealed sampling. arXiv preprint arXiv:2310.17347 (2023).
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487 [cs.CV]
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31.
Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. 2025. OminiControl: Minimal and Universal Control for Diffusion Transformer. arXiv:2411.15098 [cs.CV]
Revant Teotia, Candace Ross, Karen Ullrich, Sumit Chopra, Adriana Romero-Soriano, Melissa Hall, and Matthew Muckley. 2025. DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16431–16440.
Soobin Um and Jong Chul Ye. 2025. Minority-Focused Text-to-Image Generation via Prompt Optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference. 20926–20936.
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. 2023. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36 (2023), 15903–15935.
Shai Yehezkel, Omer Dahary, Andrey Voynov, and Daniel Cohen-Or. 2025. Navigating with Annealing Guidance Scale in Diffusion Space. arXiv preprint (2025).
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. arXiv:2206.10789 [cs.CV]
Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, and Soheil Feizi. 2025. SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control. arXiv:2511.09715 [cs.CV]

Fig. 11. Qualitative results. For each prompt, we compare the base model results (top, Flux) to our results (bottom, Ours). Each batch of images was generated using the same random seed to ensure a fair comparison. Additional results are provided in Appendix B. Prompts: “A jazz musician playing saxophone in a dimly lit club”; “An artist painting a landscape in an outdoor studio”; “A scientist in a modern laboratory”; “A crowd cheering at a sports stadium”.

Fig. 12. Qualitative comparison of our Contextual Repulsion approach against baseline methods (rows: Ours, SGI, CADS, SPARKE, PG). Each quadrant displays four generated samples per method for a given prompt. Prompts: “A wolf pack howling at the moon”; “A phoenix rising from ashes”; “A camera with old photographs”; “A red London double-decker bus”.

Appendix

A Implementation Details

All experiments were conducted on an NVIDIA A100 GPU. Quantitative metrics and runtime evaluations were performed by generating groups of 4 images. Diversity metrics were calculated within each 4-image group and subsequently averaged across all groups.
The number of denoising steps was chosen based on the model architecture: 4 steps for SD3.5-Turbo [Esser et al. 2024], 20 steps for Flux-dev [Labs et al. 2025], and 28 steps for SD3.5-Large [Esser et al. 2024]. The guidance scale was set to 3.5 for both Flux-dev and SD3.5-Large, and 0.0 for SD3.5-Turbo.

For our proposed method, we employed $M = 100$ gradient steps for the Stable Diffusion models and $M = 50$ for Flux-dev. For all models, we apply repulsion to the text tokens in the multimodal attention blocks (dual-stream in Flux). For SD3.5-Large, which is not distilled for classifier-free guidance, the repulsion is applied to both the conditional and unconditional branches. For Flux-dev and Flux-Kontext, we additionally apply it to all tokens in the later single-stream blocks, which are specific to these architectures.

The repulsion scale $\eta$ was used to balance the trade-off between diversity and fidelity, with the intervention disabled after a fixed number of timesteps, denoted by $\tau$. The range of $\eta$ was tuned per model: $\eta \in [2.5 \cdot 10^{7}, 5 \cdot 10^{8}]$ with $\tau = 4$ for SD3.5-Large; $\eta \in [5 \cdot 10^{6}, 1 \cdot 10^{8}]$ with $\tau = 1$ for SD3.5-Turbo; and $\eta \in [2.5 \cdot 10^{8}, 5 \cdot 10^{10}]$ with $\tau = 1$ for Flux-dev. For simplicity, $\eta$ remained constant throughout the intervention window.

We utilized official implementations for all baseline methods, where available. For baselines without compatible official implementations, we re-implemented them and tuned their hyperparameters to ensure competitive diversity levels. In addition to the shared guidance and step configurations, the following hyperparameters were used for the baselines:

• PG [Corso et al. 2023]: Repulsion scales were varied between 10 and 100.
• CADS [Sadat et al. 2023]: Scales were varied between 0.1 and 0.7, with $\tau_1 = 0.3$, $\tau_2 = 0.8$, and $\psi = 1$.
• SPARKE [Jalali et al. 2025]: Scales were selected between 0.02 and 0.14, depending on the model.
• SGI [Parmar et al. 2025]: Evaluated with initial candidate groups of $N \in \{8, 16, 32, 64\}$, utilizing default hyperparameters from the official implementation. All qualitative comparisons and the user study results reported here were conducted with $N = 64$.

B Additional Qualitative Results

We present additional qualitative results of our method on SD3.5-Large (Figure 15), SD3.5-Turbo (Figure 16) and Flux-dev (Figures 17, 18, and 19).

C Additional Quantitative Results

Additional comparisons. We present additional quantitative comparisons on SD3.5-Large (Figure 13) and SD3.5-Turbo (Figure 14). Our method achieves competitive quality-diversity trade-offs at a fraction of the computational cost required by SGI. Detailed metrics across all evaluated models are provided in Tables 2, 3, and 4.

Fig. 13. Quantitative evaluation on SD3.5-Large.

Fig. 14. Quantitative evaluation on SD3.5-Turbo.

Table 2. Detailed metrics for the Flux-dev Pareto frontiers in Figure 6.

| Method | Setting | Vendi (↑) | IR (↑) | VQA (↑) | KID × 10⁻⁴ (↓) |
|---|---|---|---|---|---|
| Base Model | | 1.780 | 1.075 | 0.883 | - |
| Ours | $\eta = 2.5 \cdot 10^{8}$ | 1.810 | 1.102 | 0.884 | 0.066 |
| Ours | $\eta = 5 \cdot 10^{8}$ | 1.831 | 1.092 | 0.883 | 0.103 |
| Ours | $\eta = 5 \cdot 10^{9}$ | 1.869 | 1.075 | 0.883 | 0.157 |
| Ours | $\eta = 2.5 \cdot 10^{10}$ | 1.898 | 1.070 | 0.880 | 0.172 |
| CADS | $s = 10^{-20}$ | 1.908 | 0.377 | 0.719 | 0.558 |
| CADS | $s = 10^{-18}$ | 1.908 | 0.377 | 0.719 | 0.558 |
| CADS | $s = 10^{-12}$ | 1.910 | 0.303 | 0.699 | 0.530 |
| CADS | $s = 10^{-11}$ | 1.923 | 0.208 | 0.674 | 0.588 |
| PG | $s = 1$ | 1.753 | 0.991 | 0.871 | 0.555 |
| PG | $s = 80$ | 1.759 | 1.018 | 0.864 | 0.675 |
| PG | $s = 150$ | 1.787 | 0.846 | 0.848 | 2.650 |
| SGI | 8 Candidates | 1.778 | 1.152 | 0.875 | 0.440 |
| SGI | 16 Candidates | 1.829 | 1.085 | 0.873 | 0.461 |
| SGI | 32 Candidates | 1.860 | 1.063 | 0.872 | 0.289 |
| SGI | 64 Candidates | 1.916 | 1.042 | 0.872 | 0.297 |
| SPARKE | $s = 0.01$ | 1.790 | 1.094 | 0.884 | 0.057 |
| SPARKE | $s = 0.02$ | 1.850 | 1.067 | 0.873 | 1.079 |

User study table. We provide the full results of our user study in Table 5.
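As a small worked example of the Pareto analysis, the sketch below extracts the non-dominated points from a subset of the Flux-dev rows in Table 2, using Vendi as the diversity axis and ImageReward as the quality axis (both higher is better). The function name and the choice to pool the base model, our method, and SGI into one point set are illustrative assumptions; the figures in the paper may instead draw one frontier per method.

```python
def pareto_front(points):
    """Return the non-dominated (label, diversity, quality) triples.

    Both axes are 'higher is better'. A point is dominated when some other
    point is at least as good on both axes and strictly better on one.
    The result is sorted by increasing diversity.
    """
    front = []
    for label, div, qual in points:
        dominated = any(
            d >= div and q >= qual and (d > div or q > qual)
            for _, d, q in points
        )
        if not dominated:
            front.append((label, div, qual))
    return sorted(front, key=lambda p: p[1])


# (label, Vendi, ImageReward) for Flux-dev, copied from Table 2.
FLUX_POINTS = [
    ("Base", 1.780, 1.075),
    ("Ours eta=2.5e8", 1.810, 1.102),
    ("Ours eta=5e8", 1.831, 1.092),
    ("Ours eta=5e9", 1.869, 1.075),
    ("Ours eta=2.5e10", 1.898, 1.070),
    ("SGI-8", 1.778, 1.152),
    ("SGI-16", 1.829, 1.085),
    ("SGI-32", 1.860, 1.063),
    ("SGI-64", 1.916, 1.042),
]

if __name__ == "__main__":
    for label, vendi, reward in pareto_front(FLUX_POINTS):
        print(f"{label:16s} Vendi={vendi:.3f} IR={reward:.3f}")
```

On this subset, all four of our configurations survive, together with the two extreme SGI settings (8 and 64 candidates); the base model and the intermediate SGI settings are dominated. For KID, where lower is better, the same function can be reused by negating that axis.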
While diversity is typically easier to achieve when prompts leav e signicant room for interpretation, we evaluate our method on the 100 longest prompts from the “Com- plex” and “Fine-Grained Detail” categories of PartiPrompts [Y u et al . 2022] using Flux-dev . Even under these highly constrained condi- tions, our method increases diversity and human preference scores with a negligible impact on prompt alignment. Specically , we ob- serve an increase in V endi score ( + 0 . 08 ) and ImageReward ( + 0 . 05 ), On-the-f ly Repulsion in the Contextual Space for Rich Diversity in Diusion Transformers • 13 SD3.5-Large Ours “ An abandone d carnival” SD3.5-Large Ours “ A couple stargazing” SD3.5-Large Ours “Elephants at a waterhole” SD3.5-Large Ours “ A climb er on a cli ” Fig. 15. alitative results on SD3.5-Large. SD3.5- T urbo Ours “ A dragon guarding its treasure” SD3.5- T urbo Ours “ A picnic under cherr y blossoms” SD3.5- T urbo Ours “ A french baker y at dawn” SD3.5- T urbo Ours “ A snow y village at night” Fig. 16. alitative results on SD3.5- Turbo. 14 • Omer Dahary ∗ , Benaya Koren ∗ , Daniel Garibi, and Daniel Cohen-Or Flux Ours “ A family enjoying a traditional meal together at home” Flux Ours “ A b eautiful Japanese garden with a koi pond and cherry blossoms” Flux Ours “ A jazz singer p erforming on stage with a vintage microphone” Flux Ours “ A bustling street market in Morocco with colorful spices” Fig. 17. Additional qualitative results on Flux-dev . Each batch of images was generated using the same random seed to ensure a fair comparison. On-the-f ly Repulsion in the Contextual Space for Rich Diversity in Diusion Transformers • 15 Flux Ours “ A group of students studying together in a university library” Flux Ours “ A futuristic warrior standing on the e dge of a neon-lit cli ” Flux Ours “ A wedding couple sharing a romantic moment” Flux Ours “ A chef preparing a gourmet meal in a professional kitchen” Fig. 18. 
Additional qualitative results on Flux-dev . Each batch of images was generated using the same random seed to ensure a fair comparison. 16 • Omer Dahary ∗ , Benaya Koren ∗ , Daniel Garibi, and Daniel Cohen-Or Flux Ours “ An astronaut exploring the terrain of an alien planet” Flux Ours “ An astronaut oating in space with Earth in the background” Flux Ours “ A classic bicycle leaned against an old brick wall” Flux Ours “ A delicious breakfast spread served on a woo den table” Fig. 19. Additional qualitative results on Flux-dev . Each batch of images was generated using the same random seed to ensure a fair comparison. On-the-f ly Repulsion in the Contextual Space for Rich Diversity in Diusion Transformers • 17 T able 3. Detailed metrics for the SD3.5-Large Pareto frontiers in Fig- ure 13. Method V endi ( ↑ ) IR ( ↑ ) V QA ( ↑ ) KID × 10 − 4 ( ↓ ) Base Model 1.819 1.051 0.905 Ours 𝜂 = 2 . 5 · 10 4 1.851 1.018 0.904 0.619 𝜂 = 2 . 5 · 10 6 1.878 1.012 0.904 0.627 𝜂 = 2 . 5 · 10 7 1.941 0.988 0.900 0.625 𝜂 = 2 . 5 · 10 8 1.980 0.940 0.890 0.445 CADS 𝑠 = 10 − 12 2.004 0.131 0.717 0.941 𝑠 = 10 − 10 2.025 0.051 0.692 0.953 𝑠 = 10 − 08 2.018 0.066 0.692 0.953 PG 𝑠 = 1 1.900 0.783 0.878 1.521 𝑠 = 60 1.913 0.707 0.868 4.053 𝑠 = 80 1.924 0.632 0.861 5.930 SGI 8 Candidates 1.828 1.050 0.903 0.465 16 Candidates 1.862 1.025 0.902 0.455 32 Candidates 1.883 1.030 0.902 0.429 64 Candidates 1.915 1.004 0.901 0.421 SP ARKE 𝑠 = 0 . 01 1.860 1.027 0.902 0.362 𝑠 = 0 . 02 1.887 0.999 0.901 0.770 𝑠 = 0 . 03 1.912 0.925 0.899 1.393 𝑠 = 0 . 04 1.989 0.735 0.882 2.918 T able 4. Detailed metrics for the SD3.5- T urbo Pareto frontiers in Fig- ure 14. Method V endi ( ↑ ) IR ( ↑ ) V QA ( ↑ ) KID × 10 − 4 ( ↓ ) Base Model 1.724 0.978 0.891 Ours 𝜂 = 5 · 10 6 1.819 0.914 0.887 1.796 𝜂 = 2 . 5 · 10 7 1.879 0.899 0.884 1.786 𝜂 = 5 · 10 7 1.914 0.864 0.876 1.897 𝜂 = 5 · 10 8 2.079 0.562 0.822 1.914 CADS 𝑠 = 0 . 1 1.808 0.551 0.772 0.158 𝑠 = 0 . 5 1.853 0.383 0.731 0.526 𝑠 = 0 . 
8 1.911 0.180 0.683 1.319 𝑠 = 0 . 9 1.958 0.127 0.673 1.348 PG 𝑠 = 2 1.765 0.915 0.884 0.881 𝑠 = 10 1.857 0.638 0.859 2.285 𝑠 = 40 1.926 0.221 0.821 14.128 SGI 4 Candidates 1.707 0.962 0.888 0.078 8 Candidates 1.775 0.944 0.889 0.079 16 Candidates 1.829 0.933 0.883 0.005 32 Candidates 1.853 0.923 0.884 0.028 64 Candidates 1.879 0.913 0.886 0.120 SP ARKE 𝑠 = 0 . 04 1.728 1.011 0.890 0.206 𝑠 = 0 . 08 1.763 0.928 0.885 0.744 𝑠 = 0 . 1 1.812 0.837 0.871 1.219 𝑠 = 0 . 12 1.869 0.629 0.850 2.742 𝑠 = 0 . 14 1.970 0.231 0.803 7.037 while V Q AScore remains nearly constant ( − 0 . 01 ). These results demonstrate that intervening in the Contextual Space eectively identies and navigates remaining semantic degrees of freedom, even in the presence of e xtensive conditioning. T able 5. User study results comparing our method against fiv e com- peting approaches across four evaluation metrics. V alues show the percentage of times users preferred our method (Ours), the competitor (Comp.), or rated both equally (Tie). Results are aggregated from 450 pair- wise comparisons per metric. Metric Choice Base Model CADS SGI PG SP ARKE A verage Diversity Ours 71.6 52.2 56.7 80.0 34.4 61.1 Comp. 12.9 30.0 11.1 14.4 53.1 22.0 Tie 15.5 17.8 32.2 5.6 12.5 16.9 Quality Ours 49.1 67.8 15.6 82.2 85.9 58.0 Comp. 6.9 11.1 31.1 12.2 3.1 13.1 Tie 44.0 21.1 53.3 5.6 10.9 28.9 Adherence Ours 25.0 74.4 13.3 67.8 79.7 48.9 Comp. 15.5 11.1 22.2 13.3 4.7 14.0 Tie 59.5 14.4 64.4 18.9 15.6 37.1 Overall Ours 57.8 74.4 31.1 83.3 87.5 65.1 Comp. 13.8 15.6 27.8 10.0 9.4 15.6 Tie 28.4 10.0 41.1 6.7 3.1 19.3 All Metrics Ours 50.9 67.2 29.2 78.3 71.9 58.3 Comp. 12.3 16.9 23.1 12.5 17.6 16.2 Tie 36.9 15.8 47.8 9.2 10.5 25.6 T able 6. Scalability across batch sizes. antitative results on SD3.5- T urbo for varying batch sizes. W e report the average V endi score per pair to normalize for batch size constraints. Batch size V endi V endi (avg. 
pair ) ImageReward 4 1.819 1.393 0.914 8 2.295 1.401 0.923 16 2.768 1.404 0.928 D Additional Ablation Studies Batch size ablation. W e examine the scalability of our method by evaluating performance across varying batch sizes on SD3.5- Turbo. T o ensure a fair comparison across dierent sample counts, we report the average V endi score per pair , as the raw V endi score is in- herently bounded by the batch size. As shown in T able 6, our method exhibits a consistent positive tr end across all evaluated metrics as the batch size increases. This suggests that the repulsion mechanism scales eectively and benets from the denser representation of the conditional manifold provided by larger batches. Timestep ablation. W e analyze the impact of the repulsion window across the diusion trajectory by applying the intervention within specic timestep inter vals while keeping all other hyperparameters constant. T able 7 summarizes these results. For both SD3.5-Large and SD3.5- T urbo, applying repulsion later in the trajectory typically improves ImageReward at the expense of diversity . Conversely , maintaining the intervention throughout the entire trajectory yields the highest diversity but results in a more pronounced decline in delity and alignment scores. Transformer block ablation. W e further investigate how the selec- tion of transformer blocks inuences performance by restricting the intervention to the rst, middle, or last third of the architecture’s blocks. As reported in Table 8, applying repulsion to the middle 18 • Omer Dahary ∗ , Benaya Koren ∗ , Daniel Garibi, and Daniel Cohen-Or T able 7. Eect of the timestep interval on diversity and human pref- erence. W e evaluate dierent intervention windows during the diusion trajectory for SD3.5-Large and SD3.5-T urbo. 
Model Timestep interval V endi ImageReward SD3.5- Turbo [0,1/4] 1.764 0.829 [1/4,2/4] 1.776 0.811 [2/4,3/4] 1.809 0.745 [3/4,1] 1.988 0.660 [0,1] 2.064 0.501 SD3.5-Large [0,1/7] 1.849 0.942 [1/7,2/7] 1.854 0.942 [2/7,3/7] 1.849 0.946 [3/7,4/7] 1.847 0.932 [4/7,5/7] 1.848 0.954 [5/7,6/7] 1.900 0.919 [6/7,1] 1.960 0.852 [0,1] 2.135 0.535 T able 8. Performance across dierent transformer block groups. Re- sults are reported for interventions applied to the first, middle, or last third of the blocks for SD3.5-Large and SD3.5- T urb o. SD3.5- Turbo SD3.5-Large Block group V endi ImageReward V endi ImageReward First third 1.878 0.774 1.887 0.895 Middle third 1.947 0.844 1.947 0.902 Last third 1.765 0.913 1.835 0.985 All blocks 1.764 0.829 1.960 0.852 blocks yields the strongest diversity among the partitioned groups, while preserving high preference scores for both SD3.5-Large and SD3.5- Turbo.
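The batch size ablation reports an average Vendi score per pair. One plausible reading, sketched below, is to compute the Vendi Score of every two-sample subset and average the results. The Vendi Score is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n similarity matrix with unit diagonal [Friedman and Dieng 2022]; for n = 2 with off-diagonal similarity s, those eigenvalues are (1 + s)/2 and (1 − s)/2, so no eigen-solver is needed. The use of cosine similarity on generic embedding vectors and the function names are assumptions for illustration; the paper's exact feature extractor and kernel are not restated here.

```python
import math
from itertools import combinations


def vendi_pair(similarity: float) -> float:
    """Vendi Score of two items with kernel similarity `similarity` in [-1, 1]."""
    s = max(-1.0, min(1.0, similarity))  # guard against floating-point drift
    eigenvalues = [(1.0 + s) / 2.0, (1.0 - s) / 2.0]
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 0.0)
    return math.exp(entropy)


def cosine(u, v) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean_pairwise_vendi(embeddings) -> float:
    """Average vendi_pair over all unordered pairs of embedding vectors."""
    scores = [vendi_pair(cosine(u, v)) for u, v in combinations(embeddings, 2)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(mean_pairwise_vendi(toy_batch))
```

Under this reading the per-pair score always lies between 1 (identical items) and 2 (orthogonal items) regardless of batch size, which is consistent with the values around 1.4 in Table 6 and explains why it is comparable across batch sizes while the raw Vendi score is not.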
