Evaluating the Robustness of Reinforcement Learning based Adaptive Traffic Signal Control
Reinforcement learning (RL) has attracted increasing interest for adaptive traffic signal control due to its model-free ability to learn control policies directly from interaction with the traffic environment. However, several challenges remain before RL-based signal control can be considered ready for field deployment. Many existing studies rely on simplified signal timing structures, the robustness of trained models under varying traffic demand conditions remains insufficiently evaluated, and runtime efficiency continues to pose challenges when training RL algorithms in microscopic traffic simulation environments. This study formulates an RL-based signal control algorithm capable of representing a full eight-phase ring-barrier configuration consistent with field signal controllers. The algorithm is trained and evaluated under varying traffic demand conditions and benchmarked against state-of-the-practice actuated signal control (ASC). To assess robustness, experiments are conducted across multiple traffic volumes and origin-destination (O-D) demand patterns with varying levels of structural similarity. To improve training efficiency, a distributed asynchronous training architecture is implemented that enables parallel simulation across multiple computing nodes. Results from a case study intersection show that the proposed RL-based signal control significantly outperforms optimized ASC, reducing average delay by 11–32% across movements. A model trained on a single O-D pattern generalizes well to similar unseen demand patterns but degrades under substantially different demand conditions. In contrast, a model trained on diverse O-D patterns demonstrates strong robustness, consistently outperforming ASC even under highly dissimilar unseen demand scenarios.
💡 Research Summary
This paper tackles four critical gaps that have prevented reinforcement‑learning (RL)‑based traffic‑signal control from moving beyond simulation labs into real‑world deployments: (1) most prior work uses oversimplified signal timing structures, (2) the robustness of trained policies under varying demand has been examined only superficially, (3) training in microscopic traffic simulators is computationally expensive, and (4) performance is frequently benchmarked only against fixed‑time control rather than the industry‑standard actuated signal control (ASC).
To address these issues, the authors develop an RL controller that directly models a full eight‑phase ring‑barrier configuration—the same architecture used in most field‑installed controllers. The state vector consists of lane‑level vehicle counts (assumed available from connected‑vehicle or video detection) and the current signal phase information (elapsed green time for the active phase, zero otherwise). Actions correspond to selecting the next compatible pair of phases (one from each ring), yielding eight discrete actions. An “Invalid Action Masking” (IAM) scheme prevents the policy from choosing infeasible phase transitions, thereby respecting minimum green times, clearance intervals, and barrier constraints.
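The masking idea can be made concrete with a short sketch. The helper below (hypothetical, not the paper's implementation) zeroes out the probability of infeasible phase transitions by setting their logits to negative infinity before the softmax, so the policy can never select an action that would violate minimum green, clearance, or barrier constraints:

```python
import numpy as np

def masked_action_probs(logits, valid_mask):
    """Invalid Action Masking: set logits of infeasible phase
    transitions to -inf so their softmax probability is exactly zero.
    (Illustrative helper; the paper's exact implementation may differ.)"""
    masked = np.where(valid_mask, logits, -np.inf)
    # Numerically stable softmax over the valid entries only.
    exp = np.exp(masked - masked[valid_mask].max())
    return exp / exp.sum()

# Eight discrete actions = eight compatible phase pairs (one per ring).
# Suppose only actions 0, 3, and 5 are currently feasible given the
# min-green and barrier rules (toy example).
logits = np.array([0.2, 1.1, -0.4, 0.9, 0.0, 0.3, 0.7, -1.2])
mask = np.array([True, False, False, True, False, True, False, False])
probs = masked_action_probs(logits, mask)
```

The probabilities of masked actions come out exactly zero, so the sampled action is always feasible by construction rather than penalized after the fact.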
The learning algorithm is Proximal Policy Optimization (PPO), an actor‑critic method that stabilizes updates through a clipped surrogate objective. To overcome the heavy simulation‑agent communication overhead, the authors implement a distributed asynchronous training architecture: multiple compute nodes run independent simulation episodes in parallel, while a central parameter server periodically aggregates and synchronizes policy weights. This design reduces wall‑clock training time by a factor of four to six compared with a single‑node setup, making it feasible to train on large, heterogeneous origin‑destination (O‑D) demand sets.
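PPO's clipped surrogate objective can be sketched in a few lines. The function below computes the standard PPO loss, L = E[min(r·A, clip(r, 1−ε, 1+ε)·A)], where r is the probability ratio between the new and old policies and A the advantage estimate; the numeric inputs are toy values for illustration, not from the paper:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (Schulman et al., 2017).
    ratio = pi_new(a|s) / pi_old(a|s); advantage = A(s, a) estimate.
    Negated so minimizing it maximizes the surrogate objective."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

# Toy batch: three (ratio, advantage) pairs.
ratio = np.array([0.8, 1.0, 1.5])
adv = np.array([1.0, -0.5, 2.0])
loss = ppo_clip_loss(ratio, adv)
```

The clipping caps how far a single update can move the policy away from the one that generated the data, which is what makes PPO tolerant of the asynchronous, slightly stale rollouts produced by the parallel simulation workers.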
Robustness is evaluated along two dimensions. First, traffic volume is scaled from 0.5× to 1.5× the baseline demand, creating five volume scenarios. Second, ten distinct O‑D matrices are generated and grouped by structural similarity (high, medium, low) using cosine similarity of origin‑to‑destination flow vectors. Two training regimes are compared: (a) a model trained on a single O‑D pattern, and (b) a model trained on a mixture of all ten O‑D patterns. Both models are tested on every volume and O‑D scenario and benchmarked against an optimized ASC that respects the same ring‑barrier constraints.
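The similarity grouping described above can be sketched as follows. This is an assumption about the mechanics (flattening each O‑D matrix into a flow vector and taking the cosine of the angle between vectors); the paper's exact vectorization may differ, and the matrices below are invented toy data:

```python
import numpy as np

def od_cosine_similarity(od_a, od_b):
    """Cosine similarity between two O-D matrices, each flattened into
    a flow vector. Equals 1.0 for identical demand structure (regardless
    of total volume) and decreases as the pattern diverges."""
    a = np.ravel(od_a).astype(float)
    b = np.ravel(od_b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3x3 O-D matrices (rows = origins, columns = destinations).
base = np.array([[0, 120, 80], [100, 0, 60], [90, 70, 0]])
similar = base * 1.1  # same structure, uniformly higher volume
dissimilar = base.T * 0.5 + np.array([[0, 10, 200], [5, 0, 3], [180, 2, 0]])
```

Note that uniform volume scaling leaves the cosine similarity at 1.0, which is why the study treats volume scaling and O‑D structural dissimilarity as two separate robustness axes.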
Results show that the RL controller consistently outperforms ASC, reducing average vehicle delay by 11–32% across all movements. The single‑pattern model generalizes well to O‑D patterns that are highly similar to the training case (≤10% performance loss) but degrades sharply (up to 22% higher delay than ASC) when confronted with low‑similarity patterns. In contrast, the multi‑pattern model maintains a delay reduction of at least 15% over ASC even for the most dissimilar O‑D demands, with performance loss never exceeding 5%. These findings demonstrate that exposure to a diverse set of demand patterns during training enables the policy to internalize a broader representation of traffic dynamics, thereby enhancing robustness to unseen conditions.
The study, however, is limited to a single isolated intersection. While the eight‑phase ring‑barrier model captures realistic phase constraints, coordination across adjacent intersections (multi‑agent RL) and network‑level optimization remain unexplored. Moreover, all experiments are conducted in the SUMO microscopic simulator; real‑world factors such as sensor noise, communication latency, heterogeneous vehicle types, and pedestrian flows are not represented. Future work should therefore (i) extend the framework to multi‑intersection MARL settings, (ii) validate the approach in field pilots with real detector data, and (iii) incorporate multi‑objective reward functions that account for emissions, fuel consumption, and vulnerable‑road‑user safety.
In summary, the paper makes three substantive contributions: (1) a faithful implementation of a full eight‑phase ring‑barrier signal plan within an RL framework, (2) a scalable asynchronous training pipeline that dramatically cuts simulation‑based training time, and (3) a systematic robustness analysis showing that training on heterogeneous O‑D demand patterns yields policies that are both more effective and more resilient than traditional ASC. These advances bring RL‑based adaptive signal control a significant step closer to practical deployment.