DiCo: Disentangled Concept Representation for Text-to-image Person Re-identification

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (*e.g.*, color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.


💡 Research Summary

The paper introduces DiCo (Disentangled Concept Representation), a novel framework for text‑to‑image person re‑identification (TIReID) that tackles two core challenges: the large modality gap between visual and linguistic data, and the difficulty of modeling fine‑grained correspondences such as color, texture, or shape. DiCo’s central idea is to employ a shared set of learnable “slots” that act as modality‑agnostic anchors for body parts, and to further decompose each slot into multiple “concept blocks” that capture complementary attributes.
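The slot-and-block idea described above can be sketched in code. The following is an illustrative PyTorch module, not the paper's released implementation: the module name, slot/block counts, and the use of standard multi-head cross-attention are all assumptions made for clarity. A shared set of learnable slot queries attends over token features from either modality, and each resulting slot embedding is split into equal-sized concept blocks.

```python
import torch
import torch.nn as nn

class SlotConceptPooling(nn.Module):
    """Hypothetical sketch of DiCo's shared slots and concept blocks.

    The same learnable slots serve as queries for both image patches and
    word tokens, acting as modality-agnostic part-level anchors. Each slot
    embedding is then reshaped into `num_blocks` concept blocks intended to
    capture complementary attributes (e.g., color, texture, shape).
    """

    def __init__(self, num_slots: int = 6, dim: int = 256, num_blocks: int = 4):
        super().__init__()
        assert dim % num_blocks == 0, "dim must split evenly into blocks"
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.num_blocks = num_blocks

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, dim) patch or word features from one modality
        B = tokens.size(0)
        queries = self.slots.unsqueeze(0).expand(B, -1, -1)  # shared anchors
        slot_feats, _ = self.attn(queries, tokens, tokens)   # (B, S, dim)
        # Split each slot into concept blocks: (B, S, num_blocks, dim/num_blocks)
        blocks = slot_feats.reshape(B, slot_feats.size(1), self.num_blocks, -1)
        return slot_feats, blocks
```

Because the slots are shared parameters rather than modality-specific heads, the k-th slot of an image and the k-th slot of a caption are directly comparable, which is what enables consistent part-level correspondence.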

In the architecture, an image is processed by a Vision-Transformer-style patch encoder, while a textual description is encoded by a transformer-based language model. Both modalities produce a global embedding.
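As a generic illustration of this dual-encoder setup (the function and temperature value below are assumptions, not details from the paper), the two global embeddings are typically L2-normalized and compared by scaled cosine similarity:

```python
import torch
import torch.nn.functional as F

def global_similarity(img_emb: torch.Tensor,
                      txt_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Cosine-similarity logits between image and text global embeddings.

    img_emb: (B_img, dim) embeddings from the patch encoder
    txt_emb: (B_txt, dim) embeddings from the language model
    Returns a (B_img, B_txt) matrix of temperature-scaled similarities,
    the standard retrieval score in dual-encoder TIReID pipelines.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return img @ txt.t() / temperature
```

At retrieval time, each query caption is scored against every gallery image with this matrix and the gallery is ranked by similarity.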

