Hidden in the Haystack: Smaller Needles are More Difficult for LLMs to Find
Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information (“the needle”) must be drawn from a large pool of irrelevant context (“the haystack”). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size (the length of the answer-containing document) has received little attention. We present the first systematic study of gold context size in long-context question answering, spanning three diverse benchmarks (general knowledge, biomedical reasoning, and mathematical reasoning), eleven state-of-the-art LLMs (including recent reasoning models), and more than 150K controlled runs. Our experiments reveal that LLM performance drops sharply when the gold context is shorter: smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This effect persists under rigorous confounder analysis: even after controlling for gold context position, answer token repetition, gold-to-distractor ratio, distractor volume, and domain specificity, gold context size remains a decisive, independent predictor of success. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
💡 Research Summary
This paper investigates a previously under‑explored factor in long‑context question answering: the size of the gold context, i.e., the length of the document that actually contains the answer. While prior work has highlighted positional bias and the sheer number of distractor passages as major obstacles in needle‑in‑a‑haystack (NIAH) tasks, the authors ask whether the amount of relevant text itself influences model performance. To answer this, they construct a controlled experimental framework spanning three heterogeneous benchmarks—CARDBiomedBench (biomedical reasoning), NaturalQuestions (open‑domain general knowledge), and NuminaMath1.5 (high‑level mathematical reasoning). For each benchmark they create three nested gold variants: Small (minimal span sufficient for the answer), Medium (adds explanatory material), and Large (full reasoning chain). A fixed pool of distractor documents (~20 k tokens per benchmark) is interleaved with the gold passage, and the gold’s position within the context window is varied across six equally spaced points to probe positional effects.
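The context-construction procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual harness: the helper name, the six-slot layout, and the joining convention are assumptions, and the distractor pool stands in for the ~20k-token pools used in the paper.

```python
import random

def build_context(gold: str, distractors: list[str],
                  position: int, slots: int = 6, seed: int = 0) -> str:
    """Interleave one gold passage with a fixed distractor pool, placing the
    gold at one of `slots` equally spaced insertion points (illustrative
    sketch of the paper's controlled setup)."""
    assert 0 <= position < slots
    rng = random.Random(seed)
    pool = distractors[:]
    rng.shuffle(pool)  # document orderings are averaged over in the study
    # Partition the pool into `slots` roughly equal chunks; the gold passage
    # is inserted at the boundary indexed by `position`.
    bounds = [round(i * len(pool) / slots) for i in range(slots + 1)]
    docs = []
    for i in range(slots):
        if i == position:
            docs.append(gold)
        docs.extend(pool[bounds[i]:bounds[i + 1]])
    return "\n\n".join(docs)
```

Sweeping `position` from 0 to 5 while holding the gold variant and distractor pool fixed reproduces the positional probe; swapping in the Small, Medium, or Large gold variant isolates the size effect.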
Eleven state‑of‑the‑art language models are evaluated, including closed‑weight systems (GPT‑4o, Gemini‑Flash series, o3‑mini) and open‑weight models (DeepSeek‑R1, Phi‑4‑reasoning, LLaMA‑3.1‑405B, LLaMA‑3.3‑70B). Over 150 000 runs are performed, allowing the authors to average across random seeds, prompts, and document orderings. Baseline conditions (closed‑book, gold‑only, distractor‑only) confirm that each gold variant is sufficient to answer the question in isolation and that distractors alone do not provide the answer.
The results are strikingly consistent: larger gold contexts yield substantially higher accuracy across all models and domains. For example, Gemini‑2.0‑Flash’s accuracy on CARDBiomedBench rises from 48 % (small) to 73 % (large); GPT‑4o improves from 77 % to 98 %; LLaMA‑3.1‑405B climbs from 74 % to 92 %. Moreover, performance with large gold contexts approaches the gold‑only baselines (near‑perfect scores), indicating that the presence of long, informative passages essentially neutralizes the distractor interference.
Positional analysis reveals that small gold contexts are highly vulnerable to placement: accuracy drops dramatically when they appear later in the input, whereas large gold contexts degrade more gracefully. This amplifies the well‑known primacy bias—models attend more to early tokens—but shows that the bias is exacerbated when the relevant evidence is brief. The authors also examine answer‑token repetition, gold‑to‑distractor token ratios, total distractor length, and domain specificity. Using regression and stratified analyses, they demonstrate that gold‑size remains a significant predictor of success even after controlling for each of these confounders, confirming its status as an independent hidden variable.
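The stratified analysis can be illustrated with a small sketch: runs are binned by (position, gold size), and accuracy is compared across gold sizes within each position stratum, so any remaining gap cannot be explained by placement alone. The record fields below are hypothetical, not the paper's schema.

```python
from collections import defaultdict

def stratified_accuracy(runs: list[dict]) -> dict:
    """Group runs by (position, gold_size) and report per-cell accuracy.
    If small-gold accuracy trails large-gold accuracy within every position
    stratum, gold size predicts success independently of placement."""
    cells = defaultdict(lambda: [0, 0])  # (position, size) -> [n_correct, n_total]
    for r in runs:
        key = (r["position"], r["gold_size"])
        cells[key][0] += r["correct"]
        cells[key][1] += 1
    return {key: correct / total for key, (correct, total) in cells.items()}
```

The same binned counts feed naturally into a regression with gold size as a predictor and the confounders as covariates, mirroring the paper's two-pronged analysis.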
From a systems perspective, the findings imply that pipelines which combine extremely short, critical evidence with much longer, irrelevant material are intrinsically fragile. Practitioners are advised to monitor length disparities among retrieved documents, apply length‑based weighting or normalization, or employ multi‑stage retrieval‑aggregation where short passages are first isolated or expanded before being merged with longer context. Training data should also incorporate a variety of gold‑size examples to teach models to recognize and attend to concise evidence.
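One way to operationalize the length-monitoring advice is a pre-merge pass that flags or expands retrieved passages far shorter than the configured floor. This is a sketch under stated assumptions: the `expand` callback (e.g., fetching the enclosing section of a short hit) is a hypothetical hook, and whitespace splitting is a crude stand-in for real tokenization.

```python
from typing import Callable, Optional

def normalize_retrieved(docs: list[str], min_tokens: int = 200,
                        expand: Optional[Callable[[str], str]] = None) -> list[str]:
    """Expand retrieved passages shorter than `min_tokens`, so concise gold
    evidence is not drowned out by much longer distractors when merged.
    `expand` is a hypothetical callback that pulls in surrounding context."""
    out = []
    for doc in docs:
        n_tokens = len(doc.split())  # crude whitespace token count
        if n_tokens < min_tokens and expand is not None:
            doc = expand(doc)  # e.g., fetch the passage's enclosing section
        out.append(doc)
    return out
```

In a multi-stage pipeline, the same check can instead route short passages to an isolation step that answers from them alone before merging with the long tail.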
In conclusion, the study provides the first systematic evidence that gold context size dramatically affects LLM performance on long‑context NIAH tasks, independently of position, distractor volume, or answer token frequency. The authors suggest future work on multi‑needle scenarios, dynamic length‑aware attention mechanisms, and deeper interpretability analyses (e.g., attention heatmaps) to further mitigate this brittleness in real‑world agentic systems.