PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating this decision process is crucial yet challenging. Scheduling logistics can drain hours, and human delegation often fails at scale, which motivates us to ask: Can we trust large language models (LLMs) or language agents to manage time? To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. In CalConflictBench, conflicts are presented to agents round-by-round over a calendar year, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has an average error rate of 35%. To address this gap, we propose PEARL, a reinforcement-learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance, time/location preferences), and (ii) optimizes the agent with round-wise rewards that directly supervise decision correctness, ranking quality, and memory usage across rounds. Experiments on CalConflictBench show that PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate compared to the strongest baseline.
💡 Research Summary
The paper tackles the practical problem of calendar conflict resolution, where busy professionals must repeatedly decide which overlapping meetings to attend, postpone, or decline. The authors formalize this as a long‑horizon sequential decision‑making task that requires an agent to infer and adapt to a user’s hidden preference principles (e.g., priority of attendees, importance of topics, time and location preferences) over many interaction rounds. To evaluate such agents systematically, they introduce CalConflictBench, a synthetic benchmark that generates realistic year‑long calendars for synthetic users representing various organizational roles (e.g., CEOs, PIs, postdocs). The benchmark proceeds week by week: each round presents a set of overlapping events together with contextual information (organization chart, current calendar state, event metadata). The agent must accept exactly one event, reject the rest, output a priority ranking, and provide a rationale. Evaluation metrics include per‑round decision accuracy, Optimal Rank Distance (ORD), average error rate across the trajectory, average ORD, and an Error Reduction Rate that measures how much the agent improves from the first to the last quarter of the year.
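The two trajectory-level metrics can be made concrete with a small sketch. The exact formulas are not spelled out in the summary, so the definitions below are plausible reconstructions: Optimal Rank Distance as the ground-truth event's offset from the top of the agent's ranking, and Error Reduction Rate as the relative drop in error rate from the first quarter of rounds to the last.

```python
from typing import Sequence


def optimal_rank_distance(ranking: Sequence[str], true_event: str) -> int:
    """Offset of the ground-truth event from the top of the agent's ranking.
    0 means the agent ranked the correct event first."""
    return ranking.index(true_event)


def error_reduction_rate(round_errors: Sequence[int]) -> float:
    """Relative improvement from the first to the last quarter of rounds
    (hypothetical formula; the paper only describes the metric informally).
    round_errors holds 1 for a wrong decision, 0 for a correct one."""
    n = len(round_errors)
    q = max(n // 4, 1)
    first = sum(round_errors[:q]) / q   # error rate in the first quarter
    last = sum(round_errors[-q:]) / q   # error rate in the last quarter
    return (first - last) / first if first > 0 else 0.0
```

Under these definitions, an agent that starts at a 100% error rate and ends error-free over the last quarter would score an error reduction rate of 1.0, matching the intuition that the metric rewards within-trajectory learning.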
Baseline experiments with a wide range of strong language models—open‑source (Qwen‑3‑8B/14B/30B‑Think, OLMo‑3‑7B/32B‑Think, LLaMA‑3.1‑8B) and proprietary (GPT‑5, Gemini‑2.5‑Flash)—as well as agentic rollouts (ReAct, Memory‑Augmented ReAct) reveal a striking weakness: despite impressive zero‑shot capabilities, these models achieve average error rates between 30% and 40% and show virtually no error reduction when the number of decision rounds increases. Moreover, performance degrades sharply as the number of conflicting events per round (M) grows, indicating that current LLM agents cannot maintain or refine preference representations over long horizons.
To address these deficiencies, the authors propose PEARL (Preference‑Evolving Agent with Reinforcement Learning). PEARL augments a base language model with an external, structured “Strategy Hub” memory that stores inferred preference states after each round. The memory is a key‑value store of interpretable strategy descriptors (e.g., “high‑priority attendee: senior manager”, “topic importance: project deadline”). At every round the agent retrieves relevant entries, updates them based on the latest feedback, and incorporates them into the prompt for the next decision. This explicit memory enables persistent preference modeling beyond the limited context window of the LLM.
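The Strategy Hub described above can be sketched as a small key-value store with retrieve/update operations. Class and method names here are illustrative assumptions, not the paper's actual API; the point is the interaction pattern of retrieving relevant descriptors for the prompt and updating them after feedback.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyHub:
    """Minimal sketch of PEARL's external preference memory.
    Keys and values are interpretable natural-language strategy descriptors."""
    entries: dict[str, str] = field(default_factory=dict)

    def update(self, key: str, strategy: str) -> None:
        """Insert or overwrite an inferred strategy after round feedback."""
        self.entries[key] = strategy

    def retrieve(self, keywords: list[str]) -> dict[str, str]:
        """Return entries whose key mentions any query keyword; these are
        serialized into the prompt for the next decision round."""
        return {k: v for k, v in self.entries.items()
                if any(w.lower() in k.lower() for w in keywords)}

    def to_prompt(self) -> str:
        """Render the current memory as a bullet list for the LLM context."""
        return "\n".join(f"- {k}: {v}" for k, v in self.entries.items())
```

Because the memory lives outside the model, inferred preferences persist across arbitrarily many rounds rather than being truncated with the context window, and each entry remains human-readable for inspection.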
Training PEARL uses a curriculum‑based reinforcement‑learning objective. The round‑wise reward combines three components: (1) decision correctness (binary reward for selecting the ground‑truth event), (2) ORD reward (higher reward for ranking the true event near the top), and (3) memory‑efficiency reward (penalizing unnecessary memory growth). Early curriculum stages place higher weight on preference inference, encouraging the agent to explore and populate the Strategy Hub. Later stages shift weight toward consistency, rewarding decisions that align with the accumulated preferences. The authors employ PPO for policy optimization and a KL‑regularization term to keep memory updates stable.
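A minimal sketch of the three-part round-wise reward is shown below. The coefficients and the exact shape of each term are assumptions for illustration (the summary does not publish them); only the structure — binary correctness, rank-proximity reward, memory-growth penalty, and stage-dependent weighting — follows the description above.

```python
def round_reward(correct: bool, ord_value: int, num_conflicts: int,
                 memory_growth: int, stage: str = "early") -> float:
    """Illustrative round-wise reward for PEARL-style training.

    ord_value: Optimal Rank Distance of the true event in the agent's ranking.
    memory_growth: number of Strategy Hub entries added this round.
    stage: curriculum stage ("early" favors preference inference,
    "late" favors decision consistency). All weights are hypothetical.
    """
    # (1) Decision correctness: binary reward for picking the true event.
    r_decision = 1.0 if correct else 0.0
    # (2) ORD reward: 1.0 when the true event is ranked first,
    # decaying linearly to 0 when it is ranked last.
    r_rank = 1.0 - ord_value / max(num_conflicts - 1, 1)
    # (3) Memory efficiency: penalize unnecessary memory growth.
    r_memory = -0.1 * max(memory_growth, 0)
    # Curriculum shaping: early stages upweight ranking/inference,
    # later stages upweight correct, consistent decisions.
    w_dec, w_rank = (0.3, 0.7) if stage == "early" else (0.7, 0.3)
    return w_dec * r_decision + w_rank * r_rank + r_memory
```

In an actual PPO loop this scalar would be assigned to each round's generated action, with the KL-regularization term the authors mention handled by the policy-optimization machinery rather than the reward itself.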
Empirical results on CalConflictBench demonstrate that PEARL achieves an error reduction rate of 0.76 and a 55% relative improvement in average error rate over the strongest baseline. Error reduction rates become positive and substantial, showing that the agent learns from early mistakes and refines its policy over time. PEARL's performance degrades much more gracefully as M increases, confirming that the external memory mitigates combinatorial explosion in local decision complexity. Ablation studies indicate that both the Strategy Hub and the curriculum-shaped reward are essential: removing the memory leads to baseline-level errors, while training without the curriculum yields slower convergence and lower final accuracy.
The paper’s contributions are threefold: (1) definition of calendar conflict resolution as a novel, long‑horizon, preference‑driven task for LLM agents; (2) creation of CalConflictBench, a synthetic yet human‑validated benchmark with detailed evaluation protocols; (3) introduction of PEARL, a reinforcement‑learning framework that couples an explicit preference memory with round‑wise rewards, substantially improving LLM‑based personal assistants’ reliability. By demonstrating that LLM agents can be equipped with persistent, interpretable memory and trained via RL to evolve preferences, the work opens a promising path toward trustworthy AI assistants capable of managing real‑world time‑sensitive tasks.