PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false-positive avoidance, robustness, and multi-image reasoning. We present PinPoint, a comprehensive real-world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query), (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Our analysis of 20+ methods across four major paradigms uncovers three significant shortcomings: the best methods, while achieving an mAP@10 of 28.5%, still retrieve irrelevant results (hard negatives) 9% of the time; the best models exhibit 25.1% performance variation across paraphrases, indicating substantial headroom for current CIR techniques; and multi-image queries perform 40–70% worse across methods. To address these newly uncovered issues, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.


💡 Research Summary

The paper introduces PinPoint, a large‑scale benchmark designed to expose and measure shortcomings of current Composed Image Retrieval (CIR) systems that are invisible to existing datasets such as CIRR, FashionIQ, and CIRCO. PinPoint contains 7,635 real‑world queries and 329 K human‑verified relevance judgments across 23 diverse domains. Each query is annotated with an average of 9.1 “very relevant” positive images, 32.8 explicit hard negatives (visually similar distractors that are semantically wrong), and six paraphrased textual instructions that capture different linguistic styles. Moreover, 13.4 % of the queries involve multiple reference images, testing a model’s ability to reason across visual inputs. Demographic metadata (Monk Skin Tone) is also provided for fairness analysis.

The dataset construction pipeline uses three multimodal large language models (GPT‑5, Claude‑4 Sonnet, Gemini 2.5) to generate candidate modification instructions, paraphrases, and candidate target descriptors. Human annotators then filter for specificity, visual grounding, and language quality, and finally verify all positives and negatives. This three‑layer safeguard mitigates LLM bias while enabling the scale required for exhaustive annotation.

Evaluation metrics go beyond traditional Recall@K. The authors compute mean average precision (mAP) on the full corpus, ΔmAP@10 (the drop in mAP when hard negatives are included), Negative Recall@10 (frequency of hard negatives in the top‑10), and a linguistic sensitivity range (max‑min mAP across the six paraphrases for each query). These metrics quantify false‑positive avoidance, robustness to phrasing, and multi‑answer ranking quality.
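The per-query forms of these metrics can be sketched as follows. This is a minimal illustration, not the authors' released benchmarking code; function names and the convention of averaging the per-query Negative Recall indicator over all queries are assumptions.

```python
def average_precision_at_k(ranked_ids, positive_ids, k=10):
    """AP@k for one query with multiple correct answers.

    mAP@10 is the mean of this value over all queries.
    """
    if not positive_ids:
        return 0.0
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked_ids[:k]):
        if doc in positive_ids:
            hits += 1
            score += hits / (i + 1)  # precision at each hit position
    return score / min(len(positive_ids), k)

def negative_hit_at_k(ranked_ids, hard_negative_ids, k=10):
    """Per-query indicator: does any hard negative appear in the top-k?

    Negative Recall@10 is (assumed here to be) the mean of this
    indicator over all queries.
    """
    return float(any(doc in hard_negative_ids for doc in ranked_ids[:k]))

def sensitivity_range(map_scores_per_paraphrase):
    """Linguistic sensitivity: max-min mAP across a query's paraphrases."""
    return max(map_scores_per_paraphrase) - min(map_scores_per_paraphrase)
```

ΔmAP@10 then falls out as the difference between mAP@10 computed on a corpus without the hard negatives and mAP@10 with them included.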

Twenty‑plus zero‑shot models spanning four paradigms are benchmarked: (1) vanilla CLIP variants (Meta CLIP‑2, Apple DFN‑CLIP) with four fusion recipes (image‑only, text‑only, early‑fusion, SLERP); (2) CIR‑specific methods (MMRet, MagicLens, LinCIR, Pic2Word) using either direct embedding composition or proxy generation; (3) proxy‑based approaches that synthesize textual descriptions via GPT‑5 and retrieve with CLIP text embeddings; (4) a pure text‑only baseline using GPT‑5 retrieval. Results show that even the best model reaches only 28.5 % mAP@10, retrieves hard negatives 9 % of the time, and suffers a 25.1 % performance swing across paraphrases. Multi‑image queries cause a dramatic 40–70 % degradation, indicating that current architectures lack compositional reasoning across multiple visual inputs. Notably, the text‑only GPT‑5 baseline outperforms several specialized CIR methods, highlighting the power of large language models for visual grounding.
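Among the fusion recipes listed for the vanilla CLIP variants, SLERP (spherical linear interpolation) blends the image and text embeddings along the unit hypersphere rather than averaging them linearly. A minimal sketch, assuming unit-norm CLIP embeddings and a hypothetical balance parameter `alpha`:

```python
import numpy as np

def slerp(img_emb, txt_emb, alpha=0.5):
    """Spherical linear interpolation between two embeddings.

    Interpolates along the great circle connecting the normalized
    image and text embeddings; alpha=0 returns the image embedding,
    alpha=1 the text embedding. The 0.5 default is an assumption.
    """
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    dot = np.clip(np.dot(a, b), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1 - alpha) * a + alpha * b
    return (np.sin((1 - alpha) * theta) * a
            + np.sin(alpha * theta) * b) / np.sin(theta)
```

Unlike early fusion (a weighted sum), SLERP keeps the composed query on the unit sphere, which matches the cosine-similarity geometry that CLIP retrieval indexes assume.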

To address the identified gaps, the authors propose a training‑free, point‑wise reranking scheme using an off‑the‑shelf multimodal large language model (Qwen2.5‑VL‑7B). For each candidate retrieved in the first stage, the MLLM is prompted with the query image, instruction, and candidate image, asking a binary “yes/no” relevance question. The logits for “yes” and “no” are used as relevance scores to reorder the top‑N list. This reranker is model‑agnostic and requires no additional training. Across all evaluated systems, reranking improves mAP@10 by 4–6 % and reduces Negative Recall@10 by more than 30 %, demonstrating that a simple LLM‑based post‑processing step can substantially close the gap between current CIR performance and the richer evaluation criteria introduced by PinPoint.
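The point-wise reranking step described above can be sketched as follows. The MLLM call itself (prompting Qwen2.5-VL-7B with the query image, instruction, and candidate image, and reading out the logits of the "yes" and "no" answer tokens) is abstracted behind a `score_fn` callback here, since the exact prompt and API are not given in the summary; the softmax-over-two-tokens scoring is one natural reading of "the logits for 'yes' and 'no' are used as relevance scores".

```python
import math

def rerank_topn(candidates, score_fn):
    """Training-free, point-wise reranking of a first-stage top-N list.

    score_fn(candidate) -> (logit_yes, logit_no), standing in for an
    off-the-shelf MLLM asked a binary relevance question about the
    query image, instruction, and candidate image. Candidates are
    reordered by the resulting yes-probability, highest first.
    """
    def yes_prob(logits):
        ly, ln = logits
        m = max(ly, ln)  # subtract max for numerical stability
        ey, en = math.exp(ly - m), math.exp(ln - m)
        return ey / (ey + en)

    scored = [(cand, yes_prob(score_fn(cand))) for cand in candidates]
    return [cand for cand, _ in sorted(scored, key=lambda x: x[1], reverse=True)]
```

Because the reranker only reorders an existing top-N list, it is agnostic to the first-stage retriever, which is what lets it be applied on top of any of the benchmarked systems without training.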

In summary, PinPoint provides a comprehensive, real‑world testbed that captures multiple answers, explicit hard negatives, multi‑image composition, linguistic paraphrase robustness, and demographic fairness. Its extensive analysis reveals that state‑of‑the‑art zero‑shot CIR models still struggle with false positives, language variation, and multi‑image reasoning. The proposed LLM‑based reranking offers an immediate, training‑free remedy, and the released dataset, index, and benchmarking code enable the community to develop more robust, fair, and compositional image retrieval systems.

