Less is More: Benchmarking LLM Based Recommendation Agents


Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash) across context lengths ranging from 5 to 50 items on the REGEN dataset. Surprisingly, our experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length: quality scores remain flat across all conditions (0.17–0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88% by using short contexts (5–10 items) instead of long histories (50 items), without sacrificing recommendation quality. We also analyze latency patterns across providers and identify model-specific behaviors that inform deployment decisions. This work challenges the prevailing "more context is better" paradigm and provides actionable guidelines for cost-effective, LLM-based recommendation systems.


💡 Research Summary

The paper “Less is More: Benchmarking LLM Based Recommendation Agents” investigates a widely held assumption in the recommender‑systems community: that providing a larger user purchase history (i.e., a longer context window) to a large language model (LLM) will improve the quality of product recommendations. To test this hypothesis, the authors conduct a systematic benchmark using four state‑of‑the‑art LLMs—OpenAI’s GPT‑4o‑mini, DeepSeek‑V3, Alibaba’s Qwen2.5‑72B, and Google’s Gemini 2.5 Flash—across five context lengths (5, 10, 15, 25, 50 most recent items). The experiments are performed on the REGEN dataset, which enriches Amazon product reviews with detailed metadata, user profiles, and narrative explanations of the next purchase.

A within‑subject design is employed: the same 50 users (randomly sampled with seed 42) are evaluated at every context length, ensuring that differences are not confounded by user variability. For each user, the most recent k items are formatted into a standardized prompt that lists item title, category, and rating, followed by the question “Based on this user’s purchase history, predict what product they will buy next.” The models generate a textual prediction and a brief reasoning statement.
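The paper does not publish its exact prompt template, but the described format (item title, category, and rating, followed by the fixed question) can be sketched as follows. The function name `format_prompt` and the item-dict schema are illustrative, not the authors' code:

```python
def format_prompt(purchase_history, k=5):
    """Format a user's k most recent purchases into a recommendation prompt.

    `purchase_history` is assumed ordered oldest-to-newest; each item is a
    dict with 'title', 'category', and 'rating' keys (illustrative schema).
    """
    recent = purchase_history[-k:]  # keep only the k most recent items
    lines = [
        f"- {item['title']} (category: {item['category']}, rating: {item['rating']}/5)"
        for item in recent
    ]
    return (
        "User purchase history (most recent last):\n"
        + "\n".join(lines)
        + "\n\nBased on this user's purchase history, "
          "predict what product they will buy next."
    )

history = [
    {"title": "Trail Running Shoes", "category": "Sports", "rating": 5},
    {"title": "Hydration Vest", "category": "Sports", "rating": 4},
]
print(format_prompt(history, k=2))
```

Varying only `k` while holding the user fixed is what realizes the within-subject comparison across context lengths.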

Quality is measured with a composite score that combines a keyword‑overlap metric (Jaccard‑like similarity between predicted and ground‑truth product keywords) weighted at 0.7, and a binary category‑match indicator weighted at 0.3. This mirrors evaluation practices in prior zero‑shot recommendation work. Latency is recorded as wall‑clock time from API request to response, and token usage is taken directly from each provider’s API as a proxy for computational cost.
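A minimal sketch of that composite score, assuming the stated weights (0.7 keyword overlap, 0.3 category match). The naive whitespace tokenization used here for keyword extraction is an assumption; the paper's actual keyword pipeline may differ:

```python
def quality_score(predicted, ground_truth, pred_category, true_category):
    """Composite quality: 0.7 * keyword Jaccard + 0.3 * binary category match."""
    pred_kw = set(predicted.lower().split())
    true_kw = set(ground_truth.lower().split())
    union = pred_kw | true_kw
    jaccard = len(pred_kw & true_kw) / len(union) if union else 0.0
    category_match = 1.0 if pred_category == true_category else 0.0
    return 0.7 * jaccard + 0.3 * category_match

score = quality_score(
    "wireless noise cancelling headphones",  # model prediction
    "wireless headphones",                   # ground-truth next purchase
    "Electronics", "Electronics",
)
```

Here the keyword Jaccard is 2/4 = 0.5 and the categories match, giving 0.7 × 0.5 + 0.3 × 1.0 = 0.65, comfortably above the 0.16–0.23 band reported for real predictions.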

Results show that across all four models, increasing context length from 5 to 50 items does not lead to statistically significant improvements in recommendation quality. Quality scores remain in a narrow band (0.16–0.23) with overlapping confidence intervals. Paired t‑tests for each model (5 vs. 50 items) yield p > 0.05, and a repeated‑measures ANOVA across all five lengths reports F(4,196)=1.12, p=0.35, confirming the null effect. The average change in quality is essentially zero (Δ ≈ ‑0.01).

In stark contrast, token consumption grows roughly eightfold when moving from 5 to 50 items (e.g., GPT‑4o‑mini: 293 → 2,381 tokens). Assuming typical per‑token pricing, this translates to an ≈ 88 % potential cost reduction if practitioners limit context to 5–10 items without sacrificing recommendation quality.
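Under linear per-token pricing, the headline savings follow directly from the reported GPT-4o-mini token counts:

```python
tokens_short = 293    # GPT-4o-mini, 5-item context (reported)
tokens_long = 2_381   # GPT-4o-mini, 50-item context (reported)

reduction = 1 - tokens_short / tokens_long
print(f"{reduction:.1%}")  # prints 87.7%, i.e. the ~88% quoted in the paper
```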

Latency analysis reveals model‑specific patterns. Qwen2.5‑72B maintains a stable latency around 4.1–4.4 seconds regardless of context length, making it well‑suited for real‑time scenarios. GPT‑4o‑mini shows moderate latency (4.5–5.9 seconds). DeepSeek‑V3 exhibits higher and more variable latency (6.4–10.3 seconds), while Gemini 2.5 Flash’s latency increases noticeably with longer prompts (10–15 seconds). The authors argue that for contexts under roughly 3,000 tokens, network round‑trip time and API overhead dominate over pure token‑processing time, especially for models with optimized inference pipelines.

The discussion attributes the flat quality curves to several known phenomena. First, the “Lost in the Middle” effect (Liu et al., 2024) suggests LLMs struggle to attend to information positioned in the middle of long sequences, causing most of the added items to be under‑utilized. Second, a recency bias in user behavior means the most recent few purchases already capture current preferences, aligning with findings from traditional sequential recommendation literature. Third, signal saturation implies diminishing returns: after a few items, additional history contributes more noise than useful signal. Finally, the intrinsic difficulty of predicting the exact next product imposes a performance ceiling that cannot be overcome merely by feeding more context.

Practical implications are clear. For large‑scale e‑commerce deployments that rely on API‑based LLM services, limiting the prompt to the most recent 5–10 items can cut token costs by up to 88 % and reduce latency, especially when paired with a model like Qwen2.5‑72B that shows stable response times. The paper also highlights the need for prompt‑compression techniques (e.g., LLMLingua, Selective Context) and more sophisticated evaluation metrics (MRR, NDCG) in future work to better capture ranking performance in realistic recommendation pipelines.
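The paper proposes ranking metrics such as MRR for future work without specifying them; for reference, a minimal Mean Reciprocal Rank implementation over ranked recommendation lists looks like this (item IDs are illustrative):

```python
def mean_reciprocal_rank(ranked_lists, relevant_items):
    """MRR: average of 1/rank of the first relevant item per user (0 if none)."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_items):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

rankings = [["A", "B", "C"], ["D", "E", "F"]]  # per-user ranked predictions
truth = [{"B"}, {"D"}]                         # per-user ground-truth items
print(mean_reciprocal_rank(rankings, truth))   # (1/2 + 1/1) / 2 = 0.75
```

Unlike the single-prediction composite score used in the paper, MRR requires the model to emit a ranked list, which is why the authors defer it to future work.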

In summary, the study provides robust empirical evidence that “more context is better” does not hold for LLM‑driven product recommendation tasks. Instead, a concise, recent history is sufficient, enabling significant cost and latency savings without compromising recommendation quality. This insight challenges prevailing design heuristics and offers actionable guidance for building efficient, scalable LLM‑based recommendation agents.

