Overview of the TREC 2025 Tip-of-the-Tongue track
Tip-of-the-tongue (ToT) known-item retrieval involves re-finding an item for which the searcher cannot reliably recall an identifier. ToT information requests (queries) are verbose and tend to exhibit several complex phenomena, making them especially difficult for existing information retrieval systems. The TREC 2025 ToT track focused on a single ad-hoc retrieval task. This year, we extended the track to the general domain and incorporated test queries from diverse sources: the MS-ToT dataset, manual topic development, and LLM-based synthetic query generation. In total, 9 groups (including the track coordinators) submitted 32 runs.
💡 Research Summary
The paper provides a comprehensive overview of the TREC 2025 Tip‑of‑the‑Tongue (ToT) track, detailing its motivation, data construction, participant activity, and experimental results. ToT known‑item retrieval concerns queries where users cannot recall a precise identifier (e.g., a movie title) but can describe the item using a mixture of semantic memories (features of the item) and episodic memories (the context in which they encountered it). Such queries are unusually verbose and embed linguistic phenomena—uncertainty expressions, exclusion criteria, relative comparisons, false memories, and social niceties—that are rarely seen in standard IR tasks and that defeat simple keyword‑matching approaches.
The 2025 track extended the previous domain-specific focus (movies in 2023; movies, landmarks, and celebrities in 2024) to a truly general-domain setting covering 53 entity types. The test set comprised 622 queries drawn from three sources: (1) 172 queries sampled from the Microsoft ToT Known-Item Retrieval Dataset (movie domain), providing a legacy benchmark; (2) 150 human-elicited queries, spanning movies, celebrities, and landmarks, collected by NIST assessors using an image-driven workflow that required assessors to recognize an entity, fail to recall its name, and then write a verbose description (minimum ~300 characters); and (3) 300 synthetic queries generated by large language models (150 by Llama-3.1-8B-Instruct and 150 by GPT-4o) using a domain-agnostic prompt that fed a Wikipedia article's title and summary into the model to produce a plausible ToT description. The synthetic pipeline sampled 50 domains and six articles per domain, and generated three ToT entities per model per domain.
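The sampling arithmetic behind the synthetic set can be checked directly. The sketch below assumes hypothetical constant and function names, and the prompt wording is illustrative only, not the track's actual prompt:

```python
# Sketch of the synthetic-query sampling plan described above.
# Prompt text and names are illustrative assumptions, not the
# track's actual prompt or code.

N_DOMAINS = 50            # sampled Wikipedia domains
ARTICLES_PER_DOMAIN = 6   # candidate articles drawn per domain
ENTITIES_PER_MODEL = 3    # ToT entities generated per model per domain
MODELS = ["Llama-3.1-8B-Instruct", "GPT-4o"]

def build_prompt(title: str, summary: str) -> str:
    """Assemble a domain-agnostic prompt from an article's title and summary."""
    return (
        "You vaguely remember an item but cannot recall its name.\n"
        f"Item: {title}\nSummary: {summary}\n"
        "Write a verbose tip-of-the-tongue description without naming the item."
    )

# Each model contributes 50 domains x 3 entities = 150 queries, so the
# two models together yield the 300 synthetic test queries reported above.
queries_per_model = N_DOMAINS * ENTITIES_PER_MODEL   # 150
total_synthetic = queries_per_model * len(MODELS)    # 300
```

The per-domain article pool (six articles) gives the generator candidates to choose from; only three entities per model per domain make it into the final set.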
All participants retrieved from a static Wikipedia snapshot (2023) containing 6,407,814 articles, each supplied with doc_id, URL, title, full text, and section metadata. The corpus was guaranteed to contain the correct answer for every training, development, and test query. Participants could submit rankings of up to 1,000 document IDs per query, with the official metric being NDCG@1000; additional metrics (R@1000, MAP, etc.) were reported for completeness. External resources such as Wikidata were permitted, but participants were explicitly warned not to train on the MS‑ToT dataset or the “I Remember This Movie…” community Q&A data.
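Because each query has exactly one correct answer in the corpus, the official NDCG@1000 metric reduces to a reciprocal-log discount of the answer's rank. A minimal sketch under that single-relevant-document assumption (function name is illustrative):

```python
import math

def ndcg_at_k(ranked_doc_ids, answer_id, k=1000):
    """NDCG@k when exactly one document is relevant (gain 1).

    DCG = 1 / log2(rank + 1) if the answer appears at 1-based `rank`
    within the top k, else 0. The ideal DCG (answer at rank 1) is 1,
    so NDCG equals the DCG directly.
    """
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id == answer_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

For example, the answer at rank 1 scores 1.0, at rank 3 it scores 1/log2(4) = 0.5, and a run that misses the answer in its top 1,000 scores 0.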
Nine groups submitted a total of 32 runs, including three baseline runs contributed by the organizers: two BM25 baselines (Anserini and PyTerrier) and a dense-retrieval baseline (Lightning IR). Of the 32 runs, 18 (including the baselines) used only the provided training data, six incorporated additional external datasets, and eight ignored the provided training data entirely. Seven runs leveraged the baseline runs as re-ranking candidates or negative samples, but none of these appeared among the top four performing systems.
Performance varied widely across runs. Among the baselines, PyTerrier-BM25 achieved the highest NDCG@1000. Correlation analysis showed that system scores on the synthetic and MS-ToT query sets were highly aligned (Kendall's τ = 0.847), whereas the human (NIST) queries correlated less strongly with the other two sets (τ = 0.703 with MS-ToT, τ = 0.737 with synthetic), suggesting that human-crafted ToT queries contain richer, more challenging linguistic constructs. Moreover, R@1000 correlated only weakly with the rank-sensitive metrics, indicating that systems which retrieve the correct answer somewhere in the top 1,000 often fail to rank it highly.
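The kind of system-ranking correlation reported above can be reproduced with a plain Kendall's τ over per-system scores on two query subsets. A pure-Python sketch of τ-a (no tie correction; the example score lists are made up, not the track's actual results):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie handling)."""
    assert len(xs) == len(ys) and len(xs) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1    # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1    # pair ordered oppositely
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

# Hypothetical per-system NDCG@1000 on two query subsets (illustrative only):
synthetic = [0.41, 0.35, 0.28, 0.19]
ms_tot    = [0.44, 0.30, 0.33, 0.21]
tau = kendall_tau(synthetic, ms_tot)  # one swapped pair -> tau = 2/3
```

A τ near 1 means the two query subsets rank the participating systems almost identically, which is the sense in which the synthetic and MS-ToT sets "agree" in the track's analysis.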
The authors conclude that the 2025 track successfully broadened the evaluation landscape for ToT retrieval by (1) incorporating diverse query generation methods (real, human‑elicited, synthetic), (2) providing a large, well‑structured Wikipedia corpus, and (3) allowing external knowledge sources while maintaining clear data‑usage policies. The strong alignment between synthetic and legacy queries validates synthetic generation as a cost‑effective supplement for future evaluations. However, the relatively lower performance on human queries highlights the need for models capable of handling uncertainty, exclusion, and multi‑hop reasoning inherent in genuine ToT descriptions. Future work is encouraged to explore advanced dense‑retrieval, LLM‑based re‑ranking, uncertainty modeling, and multi‑step reasoning to better capture the nuanced memory traces that characterize tip‑of‑the‑tongue information needs.