Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks
Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static, predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In this work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack, from October 2024 and October 2025, used to answer questions about LangChain. Our analysis shows that all but one of the queries posed in 2024 remain fully supported by the 2025 corpus, in part because relevant documents “migrate” from LangChain to competitor repositories such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with strong overall correlation of up to 0.978 Kendall τ at Recall@50. These results suggest that retrieval benchmarks re-judged on evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh-stack/driftbench.
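The ranking-stability statistic quoted above (Kendall τ up to 0.978) can be made concrete with a small pure-Python sketch. The Recall@50 scores below are hypothetical placeholders, not the paper's measurements:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a rank correlation between two score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical Recall@50 scores for five retrieval models on the two snapshots.
recall_2024 = [0.62, 0.71, 0.58, 0.75, 0.66]
recall_2025 = [0.60, 0.72, 0.55, 0.74, 0.67]
print(kendall_tau(recall_2024, recall_2025))  # identical model orderings → 1.0
```

A τ near 1.0, as in this example, means the two snapshots rank the models almost identically even though the absolute scores differ.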
💡 Research Summary
The paper “Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks” investigates how the evolution of a technical document collection impacts the reliability of an information‑retrieval (IR) benchmark. Traditional IR test collections follow the Cranfield paradigm: a fixed corpus, a fixed set of queries, and static relevance judgments. While this works well for relatively stable domains, technical documentation—especially for rapidly evolving frameworks such as LangChain—undergoes frequent additions, deletions, and reorganizations that can render a benchmark stale.
To study this phenomenon, the authors focus on FreshStack, a retrieval benchmark that targets technical domains. They construct two independent snapshots of the underlying corpus: one from October 2024 and another from October 2025. Both snapshots consist of ten GitHub repositories that together host the documentation, code, Jupyter notebooks, and other artifacts relevant to retrieval‑augmented generation (RAG) frameworks (e.g., LangChain, LangChainJS, LlamaIndex, Transformers, Chroma, and Azure‑OpenAI samples). The authors replicate the FreshStack pipeline for each snapshot; the pipeline involves four main stages:
- Corpus Preparation – The latest commit before each snapshot date is checked out, files are chunked into ≤2048‑token pieces, and a unique identifier (repo name + file path + byte offsets) is assigned to each chunk.
- Nugget Generation – From the original Stack Overflow Q&A pairs used in FreshStack, atomic “nuggets” (key facts needed for a complete answer) are automatically extracted using GPT‑4o. These nuggets serve as fine‑grained relevance targets.
- Oracle Retrieval – A diverse pool of candidate documents is retrieved for each query using a hybrid fusion of five models: BM25 (lexical), BGE‑Gemma‑2 (dense, 3584‑dim embeddings, 8K context), E5‑Mistral‑7B (dense, 4096‑dim), Qwen3‑4B (dense, 4096‑dim, 32K context), and Qwen3‑8B (dense, larger). Scores from each model are normalized to
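One common way to combine heterogeneous retrieval scores in a hybrid fusion pipeline like the one above is per-query min-max normalization followed by summation. The sketch below illustrates that scheme; the normalization choice, document IDs, and scores are illustrative assumptions, not the paper's exact method:

```python
def min_max_normalize(scores):
    """Scale a dict of doc_id -> raw score into [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(runs):
    """Hybrid fusion: normalize each model's run, then sum scores per document."""
    fused = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-query scores from two of the five models (e.g. BM25 and a dense model).
bm25 = {"doc_a": 12.3, "doc_b": 9.1, "doc_c": 4.0}
dense = {"doc_a": 0.82, "doc_c": 0.91, "doc_d": 0.40}
print(fuse([bm25, dense]))  # doc_a ranks first: strong under both models
```

Min-max scaling puts BM25's unbounded lexical scores and the dense models' bounded similarity scores on a comparable [0, 1] scale before summing, so no single model dominates the fused ranking.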