Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Yuran Wang1,2*, Bohan Zeng1,2*, Chengzhuo Tong1,2, Wenxuan Liu1, Yang Shi1,2, Xiaochen Ma1, Hao Liang1, Yuanxing Zhang2, Wentao Zhang1†
1Peking University
2Kling Team, Kuaishou Technology
*Equal contribution
†Corresponding author: wentao.zhang@pku.edu.cn
Abstract
Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
1. Introduction
Image generation methods [3, 8, 36] have demonstrated exceptional capabilities, enabling the generation of desired images across diverse scenarios [35]. Subject-driven image generation has recently gained significant attention, with the focus evolving from single-subject to multi-subject generation and incorporating more input images. Existing methods [36, 37, 39, 40] can process two or more input images and combine subjects based on instructions. Moreover, methods such as [8, 44] extend this capability by accepting more than four images, showcasing the potential for more complex composition tasks.
Figure 1. The distinction problem and challenges. (a) Problem: state-of-the-art methods have limitations in distinguishing target subjects specified by the instruction. (b) Challenge 1: semantic deficiency in generation. Reference image information from the understanding and generation experts in the unified model is used to compute semantic similarity with the instruction. (c) Challenge 2: biased understanding and misaligned generation. "Und." and "Und.+Gen." indicate whether texture information from the generation expert in the unified model is included to collaborate with the understanding expert. The unified model is BAGEL [7].
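To make the semantic-similarity measurement in Fig. 1(b) concrete, the sketch below shows one plausible way to compare each expert's encoding of a reference image against the instruction. It is a minimal illustration with random stand-in features, not the actual BAGEL interface; the tensor shapes, hidden size, and mean-pooling choice are all assumptions.

```python
import torch
import torch.nn.functional as F

D = 1024  # hidden size; illustrative, not the model's actual width

# Stand-ins for features a unified model would produce. In practice these
# would be the reference-image tokens from the understanding expert, the
# reference-image tokens from the generation expert, and the instruction
# tokens from the text encoder.
und_tokens = torch.randn(256, D)
gen_tokens = torch.randn(256, D)
ins_tokens = torch.randn(32, D)

def semantic_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two mean-pooled token sets of shape [T, D]."""
    a = F.normalize(a.mean(dim=0), dim=-1)
    b = F.normalize(b.mean(dim=0), dim=-1)
    return (a @ b).item()

sim_und = semantic_similarity(und_tokens, ins_tokens)
sim_gen = semantic_similarity(gen_tokens, ins_tokens)
# Fig. 1(b)'s observation corresponds to sim_und > sim_gen: the understanding
# expert's encoding of the reference is more aligned with the instruction.
```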
However, existing works primarily focus on expanding subject combinations while neglecting the ability to distinguish target subjects in complex contexts. As shown in Fig. 1(a), although current models can combine multiple subjects, they may fail to distinguish and generate the correct target subject when a reference image contains multiple candidates, leading to problems such as subject omission (none of the candidate subjects appears) or subject error (misidentification of the target subject). Real-world images often involve interference and intricate details [19, 32], further limiting practical performance. Thus, we emphasize examining the input subjects themselves, focusing on the model's ability to distinguish the target subject within complex contexts and to leverage this information for generation.
A core challenge is extracting useful information from complex references, which remains difficult for generation models. Subject distinction relies on semantically understanding how the instruction refers to the references, a task at which understanding models are more proficient [1, 17, 47].
As shown in Fig. 1(b), in a unified understanding-generation model consisting of an understanding expert and a generation expert, the information encoded by the understanding expert is more similar to, i.e., more aligned with, the instruction than that encoded by the generation expert. This reveals the generation model's deficiency, and the understanding model's advantage, in interacting with instructions and semantically understanding reference information. However, this semantic advantage of understanding models is not entirely reliable: understanding models often exhibit biases [14, 18, 31, 46], which become problematic when directly used to assist generation. A