Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Reading time: 5 minutes
...

📝 Original Info

  • Title: Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
  • ArXiv ID: 2512.12675
  • Date: 2025-12-14
  • Authors: Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

📝 Abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
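The abstract's "attention-based masking" for enhancing distinction is not detailed in this excerpt; the sketch below is only a plausible illustration of the general idea: restricting the generation expert's cross-attention to reference-image tokens that belong to the target subject, so non-target candidates cannot interfere. Function names, tensor shapes, and the way the mask is obtained are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not Scone's actual code): attention-based
# masking that lets generation queries attend only to reference-image
# tokens marked as the target subject.
import torch

def masked_cross_attention(q, k, v, target_token_mask):
    """q: [B, Nq, D] generation-expert queries.
    k, v: [B, Nr, D] reference-image tokens.
    target_token_mask: [B, Nr] bool, True where a token lies on the target
    subject (e.g., derived from an attention or segmentation map).
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bqd,brd->bqr", q, k) * scale            # [B, Nq, Nr]
    # Block non-target reference tokens so they cannot leak into generation.
    scores = scores.masked_fill(~target_token_mask[:, None, :], float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bqr,brd->bqd", attn, v)

# Toy usage: pretend the first 10 of 32 reference tokens cover the target.
B, Nq, Nr, D = 1, 16, 32, 64
q, k, v = torch.randn(B, Nq, D), torch.randn(B, Nr, D), torch.randn(B, Nr, D)
mask = torch.zeros(B, Nr, dtype=torch.bool)
mask[:, :10] = True
print(masked_cross_attention(q, k, v, mask).shape)  # torch.Size([1, 16, 64])
```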

💡 Deep Analysis

Figure 1

📄 Full Content

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang1,2*, Bohan Zeng1,2*, Chengzhuo Tong1,2, Wenxuan Liu1, Yang Shi1,2, Xiaochen Ma1, Hao Liang1, Yuanxing Zhang2, Wentao Zhang1†
1Peking University 2Kling Team, Kuaishou Technology
*Equal contribution †Corresponding author: wentao.zhang@pku.edu.cn

Abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.

1. Introduction

Image generation methods [3, 8, 36] have demonstrated exceptional capabilities, enabling the generation of desired images across diverse scenarios [35]. Subject-driven image generation has recently gained significant attention, with the focus evolving from single-subject to multi-subject generation, incorporating more input images. Existing methods [36, 37, 39, 40] can process two or more input images and combine subjects based on instructions. Moreover, methods such as [8, 44] extend this capability by accepting more than four images, showcasing potential for more complex composition tasks.

However, existing works primarily focus on expanding subject combinations while neglecting the ability to distinguish target subjects in complex contexts. As shown in Fig. 1(a), although current models can combine multiple subjects, they may fail to distinguish and generate the correct target subject when a reference image contains multiple candidates, leading to problems such as subject omission (none of the candidate subjects appears) or subject error (misidentification of the target subject). Real-world images often involve interference and intricate details [19, 32], further limiting practical performance. Thus, we emphasize examining the input subjects themselves, focusing on the model's ability to distinguish the target subject within complex contexts and to leverage this information for generation.

Figure 1. The distinction problem and challenges (panels compare GPT-4o, Gemini-2.5-Flash-Image, USO, and Scone (Ours)). (a) Problem: state-of-the-art methods have limitations in distinguishing target subjects specified by the instruction. (b) Challenge 1: semantic deficiency in generation. Reference-image information from the understanding and generation experts in the unified model is used to compute semantic similarity with the instruction. (c) Challenge 2: biased understanding and misaligned generation. "Und." and "Und.+Gen." indicate whether texture information from the generation expert in the unified model is included to collaborate with the understanding expert. The unified model is BAGEL [7].

A core challenge is extracting useful information from complex references, which remains difficult for generation models. Subject distinction relies on semantic understanding of how the instruction refers to the references, where understanding models are more proficient [1, 17, 47]. As shown in Fig. 1(b), in a unified understanding-generation model consisting of an understanding expert and a generation expert, the information encoded by the understanding expert is more similar to the instruction, i.e., more aligned with it, than that encoded by the generation expert, revealing the generation model's deficiency and the understanding model's advantage in interacting with instructions and semantically understanding reference information. However, this semantic advantage of understanding models is not entirely reliable: understanding models often exhibit biases [14, 18, 31, 46], which become problematic when directly used to assist generation.
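Figure 1(b)'s "semantic deficiency in generation" claim rests on a simple measurement: embed the instruction and each expert's encoding of the reference image, then compare them. The exact embedding space, pooling, and similarity metric used in the paper are not given in this excerpt, so the following probe is only a hedged sketch assuming mean pooling and cosine similarity; the variable names are hypothetical.

```python
# Illustrative probe (not the paper's exact protocol): compare the
# instruction embedding with pooled reference-image features from the
# understanding expert vs. the generation expert.
import torch
import torch.nn.functional as F

def instruction_similarity(instruction_emb, reference_tokens):
    """instruction_emb: [D] pooled text embedding of the instruction.
    reference_tokens: [N, D] reference-image features from one expert.
    Returns cosine similarity between the instruction and mean-pooled features.
    """
    pooled = reference_tokens.mean(dim=0)                        # [D]
    return F.cosine_similarity(instruction_emb, pooled, dim=0)   # scalar

D = 512
instr = torch.randn(D)                 # hypothetical instruction embedding
und_tokens = torch.randn(256, D)       # understanding-expert encoding of the reference
gen_tokens = torch.randn(256, D)       # generation-expert encoding of the reference

sim_und = instruction_similarity(instr, und_tokens)
sim_gen = instruction_similarity(instr, gen_tokens)
# Per Fig. 1(b), the understanding expert's features are expected to score higher.
print(f"understanding: {sim_und:.3f}  generation: {sim_gen:.3f}")
```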

📸 Image Gallery

scone_wo_bg_cropped.png


This content is AI-processed based on open access ArXiv data.
