Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Reading time: 5 minutes
...

📝 Original Info

  • Title: Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
  • ArXiv ID: 2512.21221
  • Date: 2025-12-24
  • Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

📝 Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
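
As a rough illustration of the two-stage design, the sketch below pairs a lightweight entity extractor with BM25 candidate filtering and leaves the reranking step as a stub. The spaCy and rank_bm25 tooling, the `extract_entities`/`filter_candidates` helpers, and the no-op `rerank` placeholder are illustrative assumptions, not the authors' actual implementation (see their repository for that).

```python
# Sketch of the two-stage pipeline: entity-guided BM25 filtering (stage 1)
# followed by vision-language reranking (stage 2).
# Assumed tooling: pip install spacy rank-bm25
#                  python -m spacy download en_core_web_sm
import spacy
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm")

def extract_entities(query: str) -> list[str]:
    """Pull salient named entities (people, places, orgs, dates) from a query."""
    return [ent.text.lower() for ent in nlp(query).ents]

# Stage 1: index the caption pool once, then filter per query with BM25.
captions = [
    "Leaders meet at the Dublin climate summit in October 2025.",
    "A football match draws record crowds in Ho Chi Minh City.",
]
bm25 = BM25Okapi([c.lower().split() for c in captions])

def filter_candidates(query: str, k: int = 100) -> list[int]:
    # Prefer entity terms; fall back to plain tokens if none are found.
    terms = extract_entities(query) or query.lower().split()
    scores = bm25.get_scores(" ".join(terms).split())
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Stage 2: rerank the surviving candidates with a multimodal model.
def rerank(query: str, candidate_ids: list[int]) -> list[int]:
    # Placeholder: a real implementation would score each (query, image)
    # pair with BEiT-3 and sort by that score; shown here as a no-op.
    return candidate_ids

top = rerank("Dublin climate summit", filter_candidates("Dublin climate summit"))
print(top)
```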

💡 Deep Analysis

Figure 1: System Architecture for Lightweight Entity-Guided Event-Based Image Retrieval

📄 Full Content

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Dao Sy Duy Minh* (23122041@student.hcmus.edu.vn), Huynh Trung Kiet* (23132039@student.hcmus.edu.vn), Nguyen Lam Phu Quy† (23122048@student.hcmus.edu.vn), Phu-Hoa Pham† (23122030@student.hcmus.edu.vn), Tran Chi Nguyen† (23122044@student.hcmus.edu.vn)
University of Science - VNUHCM, Ho Chi Minh City, Vietnam

*The first two authors contributed equally as lead authors. †The last three authors contributed equally in supporting roles.

MM '25, October 27–31, 2025, Dublin, Ireland

Figure 1: System Architecture for Lightweight Entity-Guided Event-Based Image Retrieval

CCS Concepts: • Computing methodologies → Visual content-based indexing and retrieval.

Keywords: image retrieval, multimodal dataset, real-world events, event-centric vision-language
1 Introduction

Retrieving images from natural language descriptions plays a central role in various applications such as web search, news archiving, e-commerce, and media curation. As the volume of multimodal content continues to grow rapidly, effective cross-modal retrieval systems are becoming increasingly important for organizing and accessing relevant visual information from textual inputs.

Most existing models, such as CLIP [12] and its variants, are trained primarily on clean visual descriptions, typically short image captions sourced from large-scale web datasets like LAION-400M [14]. While effective for general-purpose retrieval, these captions often lack the complexity, entity density, and contextual variability present in real-world queries. As a result, such models struggle when applied to domains like news or event retrieval, where queries are significantly more complex: they may involve multiple named entities, temporal references, or require event-centric grounding.

Furthermore, in many practical scenarios, captions are embedded within broader news content rather than standing alone as purely visual descriptions. This mixing of modalities introduces noise and ambiguity, making it difficult for conventional text-to-image retrieval models, optimized for literal, surface-level visual alignment, to perform effectively. In addition, natural queries are frequently truncated due to token limits in transformer-based models, leading to semantic misalignment between the i…
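
The truncation problem mentioned above is easy to reproduce: CLIP's text encoder has a 77-token context window, so the tail of a long, entity-dense caption is simply discarded. The snippet below demonstrates this with the Hugging Face CLIP tokenizer; this tooling choice and the example caption are illustrative assumptions, not something the paper prescribes.

```python
# Shows how CLIP's 77-token context window cuts off a long news-style
# caption, dropping whatever entities appear past the limit.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# A deliberately long, entity-dense query of the kind found in event captions.
caption = ("World leaders gathered in Dublin on 27 October 2025 for the annual "
           "climate summit, where delegates from dozens of countries debated "
           "emissions targets, ") * 4

full_ids = tokenizer(caption)["input_ids"]
kept_ids = tokenizer(caption, truncation=True, max_length=77)["input_ids"]

print(f"full query: {len(full_ids)} tokens; encoder sees only {len(kept_ids)}")
print("surviving text:", tokenizer.decode(kept_ids, skip_special_tokens=True))
```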

📸 Image Gallery

Images_pipeline.png

Reference

This content is AI-processed based on open access ArXiv data.
