AMAQA: A Metadata-based QA Dataset for RAG Systems

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Retrieval-augmented generation (RAG) systems are widely used in question-answering (QA) tasks, but current benchmarks lack metadata integration, limiting their evaluation in scenarios that require both textual data and external information. To address this, we present AMAQA, a new open-access QA dataset designed to evaluate tasks that combine text and metadata. Metadata integration is especially important in fields that require rapid analysis of large volumes of data, such as cybersecurity and intelligence, where timely access to relevant information is critical. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups, enriched with metadata such as timestamps and chat names, as well as 20,000 hotel reviews with metadata. Both the Telegram messages and the hotel reviews are additionally annotated with emotional tones or toxicity indicators. The dataset also provides 2,600 high-quality QA pairs built across both domains, making AMAQA a valuable resource for advancing research on metadata-driven QA and RAG systems. To the best of our knowledge, AMAQA is the first single-hop QA benchmark to incorporate metadata. We conduct extensive tests on the benchmark, setting a new reference point for future research, and show that leveraging metadata boosts accuracy from 0.50 to 0.86 for GPT-4o and from 0.27 to 0.76 for open-source LLMs, highlighting the value of structured context. We also assess the performance of known techniques designed to enhance RAG on our benchmark, highlighting the importance of properly managing metadata throughout the entire RAG pipeline.


💡 Research Summary

The paper introduces AMAQA, a novel open‑access question‑answering benchmark specifically designed to evaluate retrieval‑augmented generation (RAG) systems that can exploit both textual content and structured metadata. Existing QA and RAG benchmarks focus almost exclusively on raw text, ignoring auxiliary signals such as timestamps, source identifiers, emotional tone, or toxicity labels that are crucial in domains like cybersecurity, intelligence analysis, and real‑time decision making. AMAQA fills this gap by providing a unified dataset that combines two distinct corpora: (1) approximately 1.1 million English messages collected from 26 public Telegram groups during June‑August 2024, and (2) 20,000 hotel reviews spanning 2014‑2018. Each Telegram message is enriched with a rich set of metadata—including exact posting time, chat name, a multi‑label topic list (58 topics), an emotion label (seven categories based on Ekman’s model plus neutral), and a suite of toxicity scores (hate, profanity, identity attack, insult, threat) derived from the Perspective API with a 0.7 threshold. Hotel reviews carry basic metadata (date, location) and a single emotion label (joy, anger, etc.).
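The 0.7 threshold on Perspective-style attribute scores can be pictured as a simple flagging rule. The sketch below is illustrative only (the attribute names and score dictionary are assumptions modeled on the paper's description, not the authors' actual pipeline):

```python
def is_toxic(scores: dict, threshold: float = 0.7) -> bool:
    """Flag a message as toxic if any attribute score meets the threshold.

    `scores` maps attribute names (as described in the paper) to
    Perspective-style probabilities in [0, 1].
    """
    attributes = ["hate", "profanity", "identity_attack", "insult", "threat"]
    return any(scores.get(attr, 0.0) >= threshold for attr in attributes)

# Example: one attribute (profanity) crosses the 0.7 threshold.
msg_scores = {"hate": 0.12, "profanity": 0.81, "insult": 0.44}
print(is_toxic(msg_scores))  # True
```

A per-attribute decision like this also lets the benchmark expose fine-grained toxicity indicators rather than a single binary label.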

From these raw corpora the authors construct 2,600 high‑quality QA pairs that require the model to reason over both text and metadata. Example questions involve filtering messages by time window, chat channel, and emotional tone before extracting a numeric fact, thereby demanding a hybrid of information retrieval, metadata filtering, and natural‑language understanding.
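The metadata constraints embedded in such questions amount to a conjunctive filter applied before any answer extraction. A minimal sketch, assuming a simplified message schema (the `Message` fields are invented for illustration):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    text: str
    chat: str
    timestamp: datetime
    emotion: str

def filter_candidates(messages, chat, start, end, emotion):
    """Keep only messages that satisfy a question's metadata constraints:
    chat channel, time window, and emotion label."""
    return [m for m in messages
            if m.chat == chat
            and start <= m.timestamp <= end
            and m.emotion == emotion]

msgs = [
    Message("troops spotted at dawn", "geo_news", datetime(2024, 7, 1, 6, 0), "fear"),
    Message("great rally turnout today", "geo_news", datetime(2024, 7, 1, 18, 0), "joy"),
    Message("market update", "finance", datetime(2024, 7, 2, 9, 0), "neutral"),
]
hits = filter_candidates(msgs, "geo_news",
                         datetime(2024, 7, 1), datetime(2024, 7, 2), "fear")
print([m.text for m in hits])  # ['troops spotted at dawn']
```

Only after this pruning step does the model need to read the surviving texts and extract the requested fact, which is what makes the questions a joint retrieval-plus-reasoning task.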

The dataset creation pipeline is thoroughly described. Message collection leveraged Telegram’s public API and a seed‑topic selection process focused on geopolitics (Russia‑Ukraine conflict, US election, Israeli‑Palestinian conflict). Automatic labeling employed a zero‑shot classifier for emotions, GPT‑4o for topic extraction, and the Perspective API for toxicity. Because large‑scale LLM‑based labeling can be noisy, the authors performed systematic post‑processing: normalizing topic names, removing off‑list topics, and applying confidence thresholds. The final statistics reveal a heavily polarized Telegram subset (dominant “anger” emotion, high toxicity proportion, short message length) contrasted with a more positive, longer, and less toxic hotel‑review subset.

Four RAG configurations were evaluated: (i) Vanilla RAG (plain retrieval‑then‑generate), (ii) Metadata‑Filtering RAG (metadata constraints applied during retrieval), (iii) Vanilla + Re²G (a reranker that re‑orders retrieved passages), and (iv) Metadata‑Filtering + Re²G (combined metadata filtering and reranking). Experiments were run with two families of language models: the proprietary GPT‑4o and several open‑source LLMs (e.g., Llama‑2‑70B, Mistral‑7B). Performance was measured primarily by accuracy on the 2,600 QA pairs. Results demonstrate a dramatic boost when metadata is leveraged: GPT‑4o’s accuracy jumps from 0.50 (no metadata) to 0.86 (metadata + reranker), while open‑source models improve from 0.27 to 0.76 under the same conditions. The gains are attributed to metadata‑driven candidate pruning (e.g., time‑range filtering dramatically reduces the search space) and to the reranker’s ability to prioritize passages that satisfy both textual relevance and metadata constraints.
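One way to picture the Metadata-Filtering configuration is as a hard metadata filter applied before relevance scoring. The sketch below substitutes a toy term-overlap scorer for a real dense retriever, and the document fields (`text`, `meta`) are assumptions, not the paper's actual index schema:

```python
def retrieve(docs, query, constraints, k=3):
    """Filter docs by exact-match metadata constraints, then rank the
    survivors by naive term overlap with the query (stand-in for a
    dense retriever + reranker)."""
    candidates = [d for d in docs
                  if all(d["meta"].get(key) == val
                         for key, val in constraints.items())]
    terms = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda d: len(terms & set(d["text"].lower().split())),
                    reverse=True)
    return scored[:k]

docs = [
    {"text": "missile strike reported near the border",
     "meta": {"chat": "geo_news", "emotion": "anger"}},
    {"text": "lovely hotel breakfast and friendly staff",
     "meta": {"chat": "reviews", "emotion": "joy"}},
    {"text": "border crossing closed after the strike",
     "meta": {"chat": "geo_news", "emotion": "fear"}},
]
hits = retrieve(docs, "strike near the border", {"chat": "geo_news"})
print([h["text"] for h in hits])
```

The filter shrinks the candidate pool before any scoring happens, which mirrors the paper's explanation of why metadata-driven pruning (e.g., time-range filtering) improves accuracy: the ranker only ever competes over passages that already satisfy the question's constraints.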

The authors discuss several implications. First, metadata is not a peripheral signal; it can serve as a logical constraint that guides retrieval and reasoning, especially for single‑hop questions that embed temporal, source, or affective filters. Second, the dataset highlights the need for RAG pipelines that treat metadata as a first‑class citizen throughout indexing, retrieval, and generation stages. Third, limitations are acknowledged: the Telegram data is temporally narrow (summer 2024) and topic‑biased toward geopolitics, potentially limiting generalization; reliance on LLMs for annotation introduces systematic biases that require external validation; and the current benchmark focuses on single‑hop reasoning, leaving multi‑hop metadata‑driven queries for future work.

Future research directions suggested include extending AMAQA to multi‑hop scenarios, incorporating additional metadata types (e.g., user reputation, geolocation), exploring joint text‑metadata pre‑training objectives, and evaluating robustness across diverse domains (legal, medical, scientific literature). The authors also propose developing standardized evaluation protocols that separate the contribution of metadata filtering from pure textual retrieval, enabling clearer attribution of performance gains.

In conclusion, AMAQA is the first single‑hop QA benchmark that systematically integrates rich metadata with textual content, providing a valuable testbed for next‑generation RAG systems. The extensive experiments confirm that proper metadata handling can more than double the accuracy of state‑of‑the‑art LLMs, underscoring the importance of designing retrieval and generation pipelines that are metadata‑aware. This work thus opens a new research frontier at the intersection of information retrieval, natural language understanding, and structured data integration.

