Who Benefits from RAG? The Role of Exposure, Utility and Attribution Bias

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have achieved substantial improvements in accuracy by grounding their responses in external documents relevant to the user’s query. However, relatively little work has investigated the impact of RAG on fairness. In particular, it is not yet known whether queries associated with certain groups within a fairness category systematically receive higher accuracy, or larger accuracy improvements, in RAG systems compared to LLM-only systems, a phenomenon we refer to as query group fairness. In this work, we conduct extensive experiments to investigate the impact of three key factors on query group fairness in RAG, namely: Group exposure, i.e., the proportion of documents from each group appearing in the retrieved set, determined by the retriever; Group utility, i.e., the degree to which documents from each group contribute to improving answer accuracy, capturing retriever-generator interactions; and Group attribution, i.e., the extent to which the generator relies on documents from each group when producing responses. We examine group-level disparities in average accuracy and accuracy improvements across four fairness categories, using three datasets derived from the TREC 2022 Fair Ranking Track for two tasks: article generation and title generation. Our findings show that RAG systems suffer from the query group fairness problem and amplify disparities in average accuracy across queries from different groups compared to an LLM-only setting. Moreover, group utility, exposure, and attribution can exhibit strong positive or negative correlations with the average accuracy or accuracy improvements of queries from the corresponding group, highlighting their important role in fair RAG. Our data and code are publicly available on GitHub.


💡 Research Summary

This paper investigates fairness in Retrieval‑Augmented Generation (RAG) systems, focusing on whether certain demographic or topical groups systematically receive higher answer accuracy or larger accuracy gains when a retriever is added to a large language model (LLM). The authors introduce the notion of “query group fairness” and define two fairness criteria: Equitable Accuracy Improvements (EAI), which requires that all groups experience the same increase in accuracy after adding a retriever, and Equitable Accuracy (EA), which requires that the final accuracy be equal across groups.
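The two criteria can be made concrete with a small sketch: EA compares final per-group accuracies, while EAI compares the per-group *gains* from adding the retriever. The group names and accuracy numbers below are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch of the two fairness criteria (illustrative numbers).

def disparity(values):
    """Gap between the best- and worst-off group on a metric."""
    return max(values.values()) - min(values.values())

# Per-group accuracy without and with retrieval (made-up numbers).
llm_only = {"high_pop": 0.40, "low_pop": 0.38}
rag      = {"high_pop": 0.62, "low_pop": 0.41}

# Equitable Accuracy (EA): final accuracies should match across groups.
ea_gap = disparity(rag)

# Equitable Accuracy Improvements (EAI): accuracy *gains* should match.
gains = {g: rag[g] - llm_only[g] for g in rag}
eai_gap = disparity(gains)

print(ea_gap, eai_gap)  # nonzero gaps indicate violations of EA / EAI
```

A system can satisfy one criterion but not the other: here both gaps are nonzero, so this toy RAG system violates both EA and EAI.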

To study the problem, the authors construct three datasets from the TREC 2022 Fair Ranking Track, each centered on a Wikipedia project (Cities, Geography, Military History). Every document is annotated with four fairness categories—Age of Topic, Popularity, Age of Article, and Alphabetical order—each containing four groups. Two generation tasks are defined: (a) article generation, where a document title serves as the query and the model must produce the article body, and (b) title generation, where the article body is the query and the model must generate a title. Ground‑truth answers are the original article body or title respectively.

Three retrievers are employed: BM25 (lexical), SPLADE (learned sparse), and Contriever (dense). For each query, the top‑10 documents from each retriever are fed to the LLM using a consistent prompt template. Accuracy is measured against the ground truth using an automatic metric (E). The authors also compute per‑group averages of accuracy and accuracy improvement.
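The evaluation loop just described can be sketched as follows. The `retrieve` function below is a naive term-overlap stand-in for the real retrievers (BM25, SPLADE, Contriever), and the prompt template is a hypothetical approximation of the paper's "consistent prompt" setup, not the authors' actual template.

```python
# Minimal sketch of the retrieval-then-generate loop (stand-in retriever).

TOP_K = 10

def retrieve(query, corpus, k=TOP_K):
    # Stand-in for BM25/SPLADE/Contriever: rank by naive term overlap.
    overlap = lambda doc: len(set(query.split()) & set(doc.split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, docs):
    # One consistent template across all retrievers, as in the paper.
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return f"Context:\n{context}\n\nQuery: {query}\nAnswer:"

corpus = ["paris is the capital of france", "berlin is in germany"]
query = "capital of france"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt.splitlines()[1])  # the top-ranked document comes first
```

The resulting prompt would then be sent to the LLM, and the generated answer scored against the ground-truth article body or title.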

Crucially, the paper defines three group‑level factors that may influence fairness:

  1. Group Exposure (E(g)) – the proportion of documents from group g appearing in the retrieved set, averaged over queries; this reflects retriever bias.
  2. Group Utility (U(g)) – the average marginal gain in accuracy contributed by documents from group g when added individually to the LLM; this captures the retriever‑generator interaction.
  3. Group Attribution (A(g)) – the average score from a Natural Language Inference (NLI) model indicating whether the generator actually used documents from group g in producing the answer.
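For one query, the three factors can be computed from the retrieved list as sketched below. The per-document marginal gains and NLI scores are made-up placeholders: in the paper, the gain comes from re-running the LLM with each document added individually, and the attribution score from an NLI model applied to the document and the generated answer.

```python
# Illustrative per-query computation of E(g), U(g), A(g) (invented numbers).
from collections import defaultdict

# Top-k retrieval for one query: (group, marginal_gain, nli_score).
retrieved = [
    ("high_pop", 0.10, 0.9),
    ("high_pop", 0.06, 0.8),
    ("low_pop",  0.02, 0.1),
    ("low_pop", -0.01, 0.2),
]

def group_average(items, field):
    """Average a per-document signal (gain or NLI score) within each group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, gain, nli in items:
        sums[group] += gain if field == "utility" else nli
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

# E(g): proportion of the retrieved set coming from each group.
exposure = {g: sum(1 for r in retrieved if r[0] == g) / len(retrieved)
            for g, _, _ in retrieved}
utility     = group_average(retrieved, "utility")      # U(g)
attribution = group_average(retrieved, "attribution")  # A(g)
```

Averaging these per-query values over the full query set yields the group-level factors that the analysis correlates with accuracy and accuracy improvements.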

The experimental analysis yields several key findings:

  • RAG amplifies disparities – Across all four fairness categories, RAG systems typically violate both EAI and EA. For example, in the Popularity category, “high” popularity groups see large accuracy gains, while “low” popularity groups experience little improvement or even degradation.
  • Exposure correlates positively with accuracy – Groups that are retrieved more often tend to have higher average accuracy and larger accuracy gains. This effect is strongest for BM25, which over‑exposes popular groups.
  • Utility shows mixed effects – Some groups provide highly useful documents that substantially boost accuracy, while others contribute little or even add harmful noise. The distribution of utility varies by retriever; the dense retriever Contriever often surfaces newer articles (high Age-of-Article), but those documents have lower utility for the tasks.
  • Attribution reflects generator bias – NLI‑based attribution scores are strongly linked to accuracy improvements. When the generator actually incorporates information from a group (high attribution), that group’s accuracy improves. Low attribution groups are effectively ignored even if they appear in the retrieved set.
  • Retriever choice matters – BM25 tends to favor popular, lexically rich documents, leading to high exposure and utility for popular groups but poor exposure for niche groups. SPLADE offers a middle ground, while Contriever’s dense embeddings surface semantically similar but sometimes less useful documents.

The authors argue that achieving fair RAG requires balancing these three factors rather than merely improving retrieval relevance. Potential mitigation strategies include group‑aware re‑ranking to equalize exposure, utility‑based weighting of retrieved documents, and feedback loops where attribution signals guide the generator to consider under‑represented groups.
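One of the mitigation directions named above, group-aware re-ranking to equalize exposure, could be sketched as a round-robin interleaver over groups. This is a hedged illustration of the general idea, not the authors' method; the document IDs, groups, and the interleaving policy are all assumptions.

```python
# Hypothetical group-aware re-ranker: round-robin over groups so that
# no single group dominates the top-k, at some cost in relevance order.
from collections import deque

def fair_rerank(ranked, k):
    """ranked: list of (doc_id, group) in relevance order.
    Returns k docs, cycling over groups to equalize exposure."""
    queues = {}
    for doc in ranked:
        queues.setdefault(doc[1], deque()).append(doc)
    rotation = deque(queues)             # groups in first-seen order
    out = []
    while len(out) < k and rotation:
        g = rotation.popleft()
        if queues[g]:
            out.append(queues[g].popleft())
            rotation.append(g)           # group stays in the rotation
        # a group with no docs left simply drops out of the rotation
    return out

ranked = [("a", "high"), ("b", "high"), ("c", "high"), ("d", "low")]
print(fair_rerank(ranked, 2))  # one document from each group
```

Utility-based weighting and attribution-guided feedback would plug into the same pipeline at the re-ranking and generation stages respectively.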

All data and code are released on GitHub, enabling replication and extension to other domains (e.g., legal or medical QA). The paper thus provides a concrete empirical framework for diagnosing and addressing group‑level fairness issues in modern retrieval‑augmented language generation systems.

