Scene-Adaptive Motion Planning with Explicit Mixture of Experts and Interaction-Oriented Optimization
Despite over a decade of development, trajectory planning for autonomous driving in complex urban environments still faces significant challenges: accommodating the multi-modal nature of trajectories, the limitations of single-expert models in handling diverse scenarios, and insufficient consideration of environmental interactions. To address these issues, this paper introduces EMoE-Planner, which incorporates three innovations. First, the Explicit MoE (Mixture of Experts) dynamically selects specialized experts through a shared scene router based on scenario-specific information. Second, the planner uses scene-specific queries to provide multi-modal priors, directing the model's attention toward relevant target areas. Third, it enhances the prediction model and loss calculation by considering interactions between the ego vehicle and other agents, substantially improving planning performance. Comparative experiments were conducted on the nuPlan dataset against state-of-the-art methods. The simulation results demonstrate that the model consistently outperforms SOTA models across nearly all test scenarios, and it is the first pure learning model to surpass rule-based algorithms in almost all nuPlan closed-loop simulations.
💡 Research Summary
The paper introduces EMoE‑Planner, a novel trajectory planning architecture designed for complex urban autonomous driving. The authors identify three persistent challenges in existing approaches: (1) the difficulty of handling the inherently multimodal nature of driving decisions, (2) the limitations of a single expert model across diverse scenarios, and (3) the insufficient modeling of ego‑agent interactions, which leads to higher collision rates in dense traffic. To address these, EMoE‑Planner combines three key innovations: an Explicit Mixture‑of‑Experts (EMoE) module, scene‑specific queries, and an interaction‑oriented loss function.
Explicit Mixture‑of‑Experts (EMoE).
Traditional MoE layers route inputs to experts via a black‑box router, which lacks interpretability and often suffers from load imbalance. EMoE replaces this with a single, supervised scene router that classifies each input scene into one of seven predefined scenario categories (left turn, straight at junction, right turn, straight, roundabout, U‑turn, and “others”). The router is trained on rule‑based labels, and its output is shared across all decoder layers. Each expert is permanently bound to a specific scenario, allowing it to specialize and reducing the learning burden per expert. Load balancing becomes data‑driven: the proportion of training samples for each scenario directly determines the workload of its expert. This design yields faster convergence, clearer interpretability, and the ability to fine‑tune experts by adjusting the data distribution.
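The routing idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the expert architecture, feature dimension, and the `ExplicitMoE`/`SCENARIOS` names are assumptions; the only faithful parts are the seven fixed scenario categories, the hard (argmax) routing, and the permanent one-expert-per-scenario binding.

```python
import torch
import torch.nn as nn

# The seven scenario categories named in the paper.
SCENARIOS = ["left_turn", "straight_junction", "right_turn",
             "straight", "roundabout", "u_turn", "other"]

class ExplicitMoE(nn.Module):
    """One expert permanently bound to each scenario; a supervised
    router (trained on rule-based labels) picks the expert."""

    def __init__(self, dim: int, num_scenarios: int = len(SCENARIOS)):
        super().__init__()
        # Router output is a scenario classification, not a learned soft gate.
        self.router = nn.Linear(dim, num_scenarios)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_scenarios)
        )

    def forward(self, scene_feat: torch.Tensor):
        # scene_feat: (B, dim) scene representation from the encoder.
        logits = self.router(scene_feat)          # (B, num_scenarios)
        scenario = logits.argmax(dim=-1)          # hard, interpretable choice
        out = torch.stack([self.experts[s](f)
                           for s, f in zip(scenario.tolist(), scene_feat)])
        return out, scenario                      # (B, dim), (B,)
```

Because each expert sees only its scenario's samples, the per-expert workload is set directly by the training-data distribution, as the paper describes.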
Scene‑Specific Queries.
Instead of using generic anchor‑free or global queries, the model generates queries that are conditioned on the scenario identified by the router. For a left‑turn scenario, the query focuses on the left‑front region; for a roundabout, it concentrates on the circular entry/exit zones. By constraining the output space to relevant maneuver regions, the model avoids generating unrealistic or low‑quality trajectories, stabilizes training, and preserves multimodality without suffering modal collapse.
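One simple way to realize scenario-conditioned queries is a learnable query bank indexed by the router's scenario decision. The class name, mode count, and dimensions below are illustrative assumptions; the paper's actual query construction may differ.

```python
import torch
import torch.nn as nn

class SceneSpecificQueries(nn.Module):
    """A bank of learnable mode queries per scenario. Each scenario's
    queries can specialize to its maneuver region (e.g. the left-front
    area for left turns), constraining the decoder's output space."""

    def __init__(self, num_scenarios: int = 7, num_modes: int = 6, dim: int = 128):
        super().__init__()
        self.query_bank = nn.Parameter(
            torch.randn(num_scenarios, num_modes, dim) * 0.02)

    def forward(self, scenario_idx: torch.Tensor) -> torch.Tensor:
        # scenario_idx: (B,) long tensor from the scene router.
        return self.query_bank[scenario_idx]      # (B, num_modes, dim)
```

Keeping several modes per scenario preserves multimodality within the relevant region, while the scenario indexing rules out queries aimed at implausible maneuvers.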
Interaction‑Oriented Loss.
The authors augment the standard imitation‑learning loss with two interaction‑aware components: (i) ego‑conditioned prediction of surrounding agents, where the future trajectories of other vehicles are predicted given the planned ego motion, and (ii) a penalty term that measures the deviation caused by ego‑agent mutual influence (e.g., proximity, potential collisions). This loss forces the planner to anticipate how its own trajectory will affect surrounding traffic and to choose actions that minimize adverse interactions.
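The two components can be sketched as follows. This is a hedged approximation: the `interaction_loss` function, the proximity-hinge form of the penalty, and the `safe_dist` threshold are assumptions standing in for the paper's exact formulation; what it preserves is (i) a loss on ego-conditioned agent predictions and (ii) a penalty on adverse ego-agent proximity.

```python
import torch

def interaction_loss(ego_plan: torch.Tensor,
                     agent_pred: torch.Tensor,
                     agent_gt: torch.Tensor,
                     safe_dist: float = 2.0) -> torch.Tensor:
    """
    ego_plan:   (B, T, 2)    planned ego positions
    agent_pred: (B, N, T, 2) ego-conditioned predictions for N agents
    agent_gt:   (B, N, T, 2) ground-truth agent futures
    """
    # (i) ego-conditioned prediction loss for surrounding agents.
    pred_loss = (agent_pred - agent_gt).norm(dim=-1).mean()

    # (ii) proximity penalty: punish predicted agent positions that come
    # closer to the planned ego position than safe_dist (hinge on distance).
    dist = (agent_pred - ego_plan.unsqueeze(1)).norm(dim=-1)   # (B, N, T)
    proximity_penalty = torch.relu(safe_dist - dist).mean()

    return pred_loss + proximity_penalty
```

Added to the imitation loss, such a term gives the planner gradient signal to pick ego trajectories whose induced agent reactions stay accurate and conflict-free.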
Architecture Overview.
Input data (ego state, dynamic agents, static objects, and map) are encoded via separate modules: ego/status and static encoders (using SDE and MLP), an agent encoder that applies Fourier positional encoding and an MLP‑Mixer for temporal compression, and a map encoder that processes spatial features with a linear layer followed by an MLP‑Mixer. All encoded features are concatenated and passed through a standard Transformer encoder to produce a scene representation Q_S. Q_S feeds the EMoE router, which selects the appropriate expert and supplies the scene‑specific query to the decoder. The decoder then produces two sets of outputs: (a) trajectory queries for surrounding agents (used for interaction‑aware prediction) and (b) multiple candidate ego trajectories with associated probabilities. The highest‑probability candidate is selected as the final plan, eliminating the need for a post‑processing selection module.
Experimental Evaluation.
The model is trained and evaluated on the nuPlan dataset, which contains a wide variety of urban scenarios. Comparisons are made against state‑of‑the‑art learning‑based planners (e.g., UniAD, PLUTO, DiffusionPlanner) and traditional rule‑based planners. Metrics include Average Displacement Error (ADE), Final Displacement Error (FDE), collision rate, and rule‑violation count. EMoE‑Planner consistently outperforms all baselines: it reduces ADE/FDE by roughly 12–18%, cuts collision rates by over 30%, and shows the most pronounced gains in data‑scarce scenarios such as roundabouts, U‑turns, and complex merges. In closed‑loop simulations, it is the first pure learning model to surpass rule‑based planners across almost every test case.
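For reference, the two displacement metrics are standard and easy to compute; the sketch below assumes (x, y) trajectories sampled at the same timesteps as the ground truth (the `ade_fde` helper name is ours).

```python
import torch

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """
    pred, gt: (B, T, 2) planned vs. ground-truth trajectories.
    ADE averages per-step displacement over the whole horizon;
    FDE measures displacement at the final step only.
    """
    disp = (pred - gt).norm(dim=-1)          # (B, T) per-step distances
    return disp.mean().item(), disp[:, -1].item() if disp.size(0) == 1 else disp[:, -1].mean().item()
```

For example, a prediction offset from the ground truth by a constant (3, 4) m at every step yields ADE = FDE = 5 m.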
Discussion and Limitations.
The explicit routing provides interpretability and easy diagnostics, but it relies on manually defined scenario labels, which could be noisy or incomplete. The fixed set of seven scenarios may limit scalability to novel or highly composite situations; extending the expert pool would require re‑defining the router and retraining. Moreover, because the router’s decision is made once per planning horizon, rapid scene changes within the horizon may not be captured.
Future Work.
Potential extensions include (1) automated scenario labeling using self‑supervised clustering, (2) dynamic addition/removal of experts to handle emerging maneuvers, and (3) tighter integration of the interaction loss with reinforcement learning to further improve closed‑loop robustness.
In summary, EMoE‑Planner offers a compelling solution to multimodal trajectory generation and interaction‑aware planning by marrying an interpretable, scenario‑driven expert mixture with targeted queries and a loss that explicitly accounts for ego‑agent dynamics. Its strong empirical performance on NuPlan suggests that explicit MoE architectures could become a new standard for safe, efficient, and scalable autonomous driving planners.