User-Centric Object Navigation: A Benchmark with Integrated User Habits for Personalized Embodied Object Search
In the evolving field of robotics, the challenge of Object Navigation (ON) in household environments has attracted significant interest. Existing ON benchmarks typically place objects in locations guided by general scene priors, without accounting for the specific placement habits of individual users. This omission limits the adaptability of navigation agents in personalized household environments. To address this, we introduce User-centric Object Navigation (UcON), a new benchmark that incorporates user-specific object placement habits, referred to as user habits. This benchmark requires agents to leverage these user habits for more informed decision-making during navigation. UcON encompasses approximately 22,600 user habits across 489 object categories and is, to our knowledge, the first benchmark that explicitly formalizes and evaluates habit-conditioned object navigation at scale, covering the widest range of target object categories. Additionally, we propose a habit retrieval module to extract and utilize habits related to target objects, enabling agents to infer their likely locations more effectively. Experimental results demonstrate that current SOTA methods exhibit substantial performance degradation under habit-driven object placement, while integrating user habits consistently improves success rates. Code is available at https://github.com/whcpumpkin/User-Centric-Object-Navigation.
💡 Research Summary
The paper introduces User‑centric Object Navigation (UcON), a novel benchmark that explicitly incorporates individual user habits into the object navigation (ON) problem. While existing ON benchmarks (e.g., Habitat‑Matterport3D, RoboTHOR, ProcTHOR) assume that objects follow generic scene priors—beds in bedrooms, sofas in living rooms—UcON acknowledges that personal placement habits can deviate significantly from these norms. To capture this, the authors generate a large‑scale User Habit Knowledge Base (UHKB) containing approximately 22,600 natural‑language habit statements covering 489 object categories. Each habit links an object to a spatial relation such as “place next to”, “place on top”, “place inside”, or “place under”. The habit statements and their associated placements are synthesized using GPT‑4, validated for physical plausibility in the Omnigibson simulator, and then used to re‑position objects in a “habit‑shaped scene”.
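To make the structure of a UHKB entry concrete, here is a minimal sketch in Python. The field names and the flat-list storage are illustrative assumptions, not the paper's actual schema; the paper stores habits as natural-language statements, each tied to an object and one of the four spatial relations.

```python
from dataclasses import dataclass

# Hypothetical schema for a UHKB entry; field and relation names are
# illustrative, not the benchmark's actual data format.
@dataclass(frozen=True)
class Habit:
    object_category: str  # e.g. "newspaper"
    relation: str         # "next_to" | "on_top" | "inside" | "under"
    anchor: str           # reference object/location, e.g. "kitchen table"
    statement: str        # natural-language habit text

uhkb = [
    Habit("newspaper", "on_top", "kitchen table",
          "I read the newspaper at breakfast and leave it on the kitchen table."),
    Habit("scissors", "inside", "desk drawer",
          "I keep the scissors in the top desk drawer."),
]

def habits_for(target: str, kb: list) -> list:
    """Filter the knowledge base down to habits about the target object."""
    return [h for h in kb if h.object_category == target]

print([h.anchor for h in habits_for("newspaper", uhkb)])  # ['kitchen table']
```

Keeping each habit as (object, relation, anchor, statement) makes both exact-match filtering and LLM prompting straightforward, since the raw statement is preserved alongside the structured fields.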
The task definition is straightforward: an agent starts at a random location, receives a target object category and access to the full UHKB, and must locate the object within 300 timesteps using a minimal action set (MoveAhead, RotateLeft/Right, LookUp/Down, Open, Done). Success is recorded when the agent calls Done while the target is visible within a 1‑meter distance threshold.
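The episode structure above can be sketched as a simple control loop. The simulator interface (`env.reset`, `env.step`, and the observation fields) is a hypothetical stand-in, not the benchmark's API; only the action set, the 300-step budget, and the 1-meter success threshold come from the task definition.

```python
# Minimal sketch of a UcON episode and its success criterion. The env/agent
# interface is an assumption; the action names, step budget, and distance
# threshold follow the task definition.
ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight",
           "LookUp", "LookDown", "Open", "Done"]
MAX_STEPS = 300
SUCCESS_DIST = 1.0  # meters

def run_episode(env, agent, target: str) -> bool:
    obs = env.reset(target=target)
    for _ in range(MAX_STEPS):
        action = agent.act(obs)
        assert action in ACTIONS
        if action == "Done":
            # Success only if the target is visible and within 1 m when
            # the agent declares completion.
            return bool(obs["target_visible"]
                        and obs["target_distance"] <= SUCCESS_DIST)
        obs = env.step(action)
    return False  # timestep budget exhausted
```

Note that success is gated on the agent explicitly calling Done, so merely passing near the target does not count.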
A central contribution is the Habit Retrieval Module (HRM). At each decision step, HRM extracts from the UHKB only those habits that are directly relevant to the current target object. These retrieved habits are formatted as prompts for a large language model (LLM), which then generates a high‑level navigation plan (e.g., “search near the kitchen table because the user reads the newspaper at breakfast”). Simultaneously, an object detector processes the current RGB‑D observation to provide concrete visual evidence. The LLM’s plan and detector output are fused to produce low‑level actions for the agent.
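The retrieval-then-prompt step of the HRM can be illustrated as follows. The simple substring retrieval and the prompt template are assumptions for the sketch; the actual module may use embedding-based retrieval and a different prompt format.

```python
# Sketch of the Habit Retrieval Module's retrieve-and-prompt step.
# String-match retrieval and the template wording are assumptions,
# not the paper's implementation.
def retrieve(target: str, uhkb: list) -> list:
    """Keep only habit statements that mention the target object."""
    return [h for h in uhkb if target.lower() in h.lower()]

def build_prompt(target: str, habits: list, detections: list) -> str:
    """Fuse retrieved habits with current detector output into an LLM prompt."""
    habit_text = "\n".join(f"- {h}" for h in habits) or "- (no relevant habits)"
    return (
        f"Target object: {target}\n"
        f"Relevant user habits:\n{habit_text}\n"
        f"Currently detected objects: {', '.join(detections)}\n"
        "Propose a high-level search plan for the target."
    )

uhkb = [
    "I leave the newspaper on the kitchen table after breakfast.",
    "I store umbrellas next to the front door.",
]
prompt = build_prompt("newspaper", retrieve("newspaper", uhkb),
                      detections=["kitchen table", "chair"])
print(prompt)
```

Filtering before prompting is the key design choice: the LLM sees only habits relevant to the current target, which keeps the context short enough for small on-device models.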
The authors evaluate several state‑of‑the‑art ON methods, including a PPO policy trained on Habitat, PixelNav (LLM‑guided pixel‑level navigation), and vision‑language methods such as L3MVN. Experiments reveal two key findings. First, when objects are placed according to user habits rather than generic priors, all baseline methods suffer a substantial drop in success rate (often more than 30 percentage points). This confirms that existing approaches rely heavily on scene‑level statistics and fail to generalize to personalized environments. Second, augmenting these methods with HRM‑derived habit information consistently improves performance: success rates increase by 12–18 percentage points, and the average number of steps to goal decreases markedly. Notably, even relatively small LLMs (7‑billion‑parameter models) achieve near‑state‑of‑the‑art results when combined with HRM, highlighting the efficiency of targeted habit retrieval.
Human validation was performed with 26 participants who judged the plausibility of a random sample of generated habits and placements; 98.5 % of habits were deemed feasible and 96.7 % of placements were judged consistent with the described habits, supporting the realism of the synthetic data.
The benchmark runs on the Omnigibson simulator, using 22 base scenes and dynamically re‑configuring them per episode. All processing is designed to run locally on consumer‑grade GPUs (RTX 3090/4090), addressing privacy concerns associated with cloud‑based services.
In summary, UcON establishes the first large‑scale, reproducible benchmark that evaluates an embodied agent’s ability to retrieve, filter, and exploit personalized habit knowledge for object search. It opens new research directions: (1) learning habits from long‑term interaction data, (2) resolving conflicting habits across multiple household members, and (3) integrating privacy‑preserving, on‑device reasoning for real‑world service robots. The provided codebase and dataset enable the community to explore these challenges and move toward truly user‑centric robotic assistants.