Improve Large Language Model Systems with User Logs
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model’s prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .


💡 Research Summary

The paper “Improve Large Language Model Systems with User Logs” introduces UNO (User log‑driveN Optimization), a novel framework that enables continual learning for large language model (LLM) systems directly from real‑world user interaction logs. The authors argue that the traditional scaling paradigm—adding more data and parameters—is hitting diminishing returns due to the scarcity of high‑quality data and rising computational costs. In contrast, deployed LLMs generate massive streams of user logs (queries, model outputs, revisions, and implicit feedback) that contain valuable, authentic supervision signals. However, these logs are noisy, unstructured, and often collected under policies that differ from the current model, creating an off‑policy optimization problem and a “Signal‑or‑Noise Dilemma”.

UNO addresses these challenges in four stages. First, pre‑processing filters out empty or meaningless sessions and uses a prompting LLM to distill each interaction into a semi‑structured rule set (e.g., “use non‑technical language”, “highlight real‑world impact”). The original response is then regenerated under these rules, producing a preference pair (original vs. revised response). Second, clustering groups sessions by query similarity and rule content using agglomerative clustering, thereby managing heterogeneity across the log corpus. Third, a cognitive gap metric quantifies how well the base LLM already aligns with the distilled rules for each cluster. This gap is estimated by a simulated performance verifier (LLM‑as‑Judge) that scores the quality of a LoRA‑adapted model trained on the cluster’s data. If the gap is small and the verifier approves, the cluster is designated a Primary Experience cluster; otherwise it becomes a Reflective Experience cluster.
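The cluster-designation step above can be sketched as a simple threshold rule. The sketch below is illustrative only: the field names, thresholds, and the exact form of the gap and verifier scores are assumptions for exposition, not the paper's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    """A group of log sessions sharing similar queries and distilled rules."""
    name: str
    cognitive_gap: float   # how far the base model is from the cluster's rules (0 = aligned)
    verifier_score: float  # simulated LLM-as-Judge score for a LoRA model trained on the cluster

def designate(cluster: Cluster,
              gap_threshold: float = 0.3,       # hypothetical cutoff
              approve_threshold: float = 0.7    # hypothetical verifier bar
              ) -> str:
    """Route a cluster to a Primary or Reflective Experience module.

    A small cognitive gap plus verifier approval suggests the cluster's
    signal can safely be learned directly; otherwise it is used reflectively.
    """
    if cluster.cognitive_gap <= gap_threshold and cluster.verifier_score >= approve_threshold:
        return "primary"
    return "reflective"
```

In this reading, the verifier acts as a safety gate: even a well-aligned cluster is demoted to reflective use if the adapter trained on it fails the simulated quality check.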

For Primary Experience clusters, UNO trains a lightweight LoRA “Expert” adapter that directly generates answers at inference time. For Reflective Experience clusters, UNO trains a LoRA “Critic” adapter that does not modify the base model but instead provides actionable revision suggestions on the model’s initial output, enabling an iterative refine‑and‑regenerate loop. This dual‑module design ensures positive adaptivity (leveraging high‑value signals) while preserving noise robustness (avoiding degradation from low‑quality logs). Crucially, because the base model’s parameters remain untouched, the framework mitigates off‑policy risks that arise when logs are collected under older system policies.
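One way to picture the dual-module split is in the training targets each adapter sees. The sketch below assumes a simple supervised input/target format; the session field names are hypothetical, and the paper may well use a different objective (e.g., preference optimization on the pairs) rather than plain supervised targets.

```python
def expert_example(session: dict) -> dict:
    """Primary Experience: teach the Expert adapter to produce the
    rule-compliant (revised) answer directly from the query."""
    return {"input": session["query"],
            "target": session["revised_response"]}

def critic_example(session: dict) -> dict:
    """Reflective Experience: teach the Critic adapter to critique a draft
    rather than answer the query itself."""
    prompt = f"Query: {session['query']}\nDraft: {session['original_response']}"
    return {"input": prompt,
            "target": session["revision_suggestions"]}
```

The key design point survives the hedging: the Expert learns a mapping from query to answer, while the Critic learns a mapping from (query, draft) to feedback, which is why only the Expert path risks overwriting the base model's behavior.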

During inference, a new user query is matched to the nearest cluster. If the cluster hosts a Primary Experience module, the corresponding Expert LoRA is loaded and the answer is produced directly. If it hosts a Reflective Experience module, the base LLM first generates a draft, the Critic LoRA supplies feedback, and the LLM revises the response accordingly.
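The routing logic described above might look roughly like the following. This is a minimal sketch, assuming the expert, critic, and base model can be treated as plain callables; the prompt format for the revision step is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Module:
    """The experience module hosted by the cluster nearest to the query."""
    kind: str                                        # "primary" or "reflective"
    expert: Optional[Callable[[str], str]] = None    # Expert LoRA (primary)
    critic: Optional[Callable[[str, str], str]] = None  # Critic LoRA (reflective)

def answer(query: str, module: Module, base_llm: Callable[[str], str]) -> str:
    """Serve a query with whichever module its nearest cluster hosts."""
    if module.kind == "primary":
        return module.expert(query)               # Expert LoRA answers directly
    draft = base_llm(query)                       # base model produces a draft
    feedback = module.critic(query, draft)        # Critic LoRA suggests revisions
    return base_llm(f"{query}\nDraft: {draft}\nRevise per: {feedback}")
```

Note that on the reflective path the base model is called twice (draft, then revision), which is the refine-and-regenerate loop the summary describes.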

The authors evaluate UNO on MemoryBench, a comprehensive continual‑learning benchmark covering multiple datasets, languages, and domains, which provides realistic user‑log streams and downstream evaluation tasks. Compared against full fine‑tuning, Retrieval‑Augmented Generation (RAG), and several memory‑based baselines, UNO achieves state‑of‑the‑art performance: higher accuracy (4–7 percentage points), lower latency, and substantially reduced memory and compute overhead (≈30 % savings). Importantly, UNO maintains stable performance even when noisy logs are injected, demonstrating the effectiveness of its cognitive‑gap‑driven module selection and simulated verification.

In summary, UNO offers a practical roadmap for turning LLMs from static, pre‑trained artifacts into continuously evolving AI services. By converting raw, noisy user logs into structured rules and preference pairs, clustering them, assessing model‑log alignment, and deploying either expert or critic adapters, UNO extracts maximal learning signal while safeguarding against off‑policy and noise‑induced pitfalls. Future work suggested includes improving automatic rule extraction, exploring multi‑module composition strategies, and extending the framework to online, streaming log updates.
