Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

Reading time: 5 minutes
...

📝 Original Info

  • Title: Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
  • ArXiv ID: 2601.01224
  • Date: 2026-01-03
  • Authors: Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon, Yuki Mitsufuji

📝 Abstract

Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate the potential of CODA as an effective framework for robust OCL in complex, real-world scenes. Code is available at https://github.com/sony/coda.

📄 Full Content

Object-centric learning (OCL) aims to decompose complex scenes into structured, interpretable object representations, enabling downstream tasks such as visual reasoning (Assouel et al., 2022; D'Amario et al., 2021), causal inference (Schölkopf et al., 2021; Zholus et al., 2022), world modeling (Ke et al., 2021), robotic control (Haramati et al., 2024), and compositional generation (Singh et al., 2022a). Yet, learning such compositional representations directly from images remains a core challenge. Unlike text, where words naturally form composable units, images lack explicit boundaries for objects and concepts. For example, in a street scene with pedestrians, cars, and traffic lights, a model must disentangle these entities without labels and also capture their spatial relations (e.g., a person crossing in front of a car). Multi-object scenes add further complexity: models must not only detect individual objects but also capture their interactions. As datasets grow more cluttered and textured, this becomes even harder. Manual annotation of object boundaries or compositional structures is costly, motivating the need for fully unsupervised approaches such as Slot Attention (SA) (Locatello et al., 2020). While effective in simple synthetic settings, SA struggles with large variations in real-world images, limiting its applicability to visual tasks such as image or video editing.

Combining SA with diffusion models has recently pushed progress in OCL forward (Jiang et al., 2023; Wu et al., 2023; Akan & Yemez, 2025). In particular, Stable-LSD (Jiang et al., 2023) and SlotAdapt (Akan & Yemez, 2025) achieve strong object discovery and high-quality generation by leveraging pretrained diffusion backbones such as Stable Diffusion (SD) (Rombach et al., 2022). Nevertheless, these approaches still face two key challenges. First, as illustrated in Fig. 1 (left), they often suffer from slot entanglement, where a slot encodes features from multiple objects or fragments of them, leading to unfaithful single-slot generations. This entanglement degrades segmentation quality and prevents composable generation of novel scenes and object configurations. Second, they exhibit weak alignment, where slots fail to consistently correspond to distinct image regions, especially on real-world images. As shown in our experiments, slots often suffer from over-segmentation (splitting one object into multiple slots), under-segmentation (merging multiple objects into one slot), or inaccurate object boundaries. Together, these two issues reduce both the accuracy of object-centric representations and their utility for compositional scene generation.

Figure 1 (caption): Both methods can reconstruct the full scene when conditioned on all slots (last column). However, Stable-LSD (without register slots) fails to generate images from individual slots. Our method yields faithful single-concept generations, demonstrating disentangled and well-aligned slots.

In response, we propose Contrastive Object-centric Diffusion Alignment (CODA), a slot-attention model that uses a pretrained diffusion decoder to reconstruct the input image. CODA augments the model with register slots, which absorb residual attention and reduce interference between object slots, and a contrastive objective, which explicitly encourages slot-image alignment. As illustrated in Fig. 1 (right), CODA faithfully generates images from both individual slots as well as their compositions. In summary, the contributions of this paper can be outlined as follows.

(i) Register-augmented slot diffusion. We introduce register slots, which are independent of the input image, into slot diffusion. Although these register slots carry no semantic information, they act as attention sinks, absorbing residual attention mass so that semantic slots remain focused on meaningful object-concept associations. This reduces interference between object slots and mitigates slot entanglement (Section 4.1).

(ii) Mitigating text-conditioning bias. To reduce the influence of text-conditioning biases inherited from pretrained diffusion models, we finetune the key, value, and output projections in cross-attention layers. This adaptation further improves alignment between slots and visual content, ensuring more faithful object-centric decomposition (Section 4.2).

(iii) Contrastive alignment objective. We propose a contrastive loss that ensures slots capture concepts present in the image (Section 4.3). Together with the denoising loss, our training objective can be viewed as a tractable surrogate for maximizing the mutual information (MI) between inputs and slots, improving slot representation quality (Section 4.4). A hedged sketch of (i) and (iii) follows this list.

(iv) Comprehensive evaluation. We demonstrate that CODA outperforms existing unsupervised diffusion-based approaches across synthetic and real-world benchmarks in object discovery (Section 5.1), property prediction (Section 5.2), and compositional generation (Section 5.3). On the VOC dataset, CODA improves instance-level object discovery ...
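As a reading aid, here is a minimal sketch of how contributions (i) and (iii) could be realized. It is not the authors' implementation: the names (`RegisterAugmentedSlotAttention`, `slot_image_alignment_loss`), the choice of PyTorch, the mean-pooling of slots and image features, and the exact way register slots join the attention iterations are all assumptions made for illustration; CODA's actual design may differ (see Sections 4.1 and 4.3 of the paper and https://github.com/sony/coda).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegisterAugmentedSlotAttention(nn.Module):
    """Slot Attention variant with extra input-independent register slots.

    Register slots compete for input features alongside object slots, so they
    can soak up residual attention mass (acting as attention sinks), but they
    are discarded before decoding and carry no semantic content.
    """

    def __init__(self, num_slots, num_registers, dim, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots = num_slots
        self.iters = iters
        self.eps = eps
        self.scale = dim ** -0.5
        # Learned Gaussian statistics for object-slot initialization.
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, num_slots, dim))
        # Register slots: learned vectors independent of the input image (assumption).
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs):  # inputs: (B, N, dim) image features
        B, _, D = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=inputs.device)
        # Append register slots so they take part in the attention competition.
        slots = torch.cat([slots, self.registers.expand(B, -1, -1)], dim=1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            logits = torch.einsum('bsd,bnd->bsn', q, k) * self.scale
            attn = logits.softmax(dim=1)  # slots (incl. registers) compete per location
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)  # per-slot weighted mean
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).reshape(B, -1, D)
        # Only the object slots would condition the diffusion decoder; registers are dropped.
        return slots[:, :self.num_slots], slots[:, self.num_slots:]


def slot_image_alignment_loss(slots, image_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss pairing each image's pooled slots with its
    own pooled features, using the other images in the batch as negatives."""
    s = F.normalize(slots.mean(dim=1), dim=-1)        # (B, dim) pooled slots
    f = F.normalize(image_feats.mean(dim=1), dim=-1)  # (B, dim) pooled image features
    logits = s @ f.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(s.shape[0], device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The InfoNCE form also connects to the MI interpretation in (iii): for a batch of size B, log B minus the InfoNCE loss is a standard lower bound on the mutual information between the paired views (here, slots and image features), so minimizing such a loss alongside the denoising loss can be read as maximizing a tractable surrogate of the MI between inputs and slots.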

Reference

This content is AI-processed based on open access ArXiv data.
