An Open-Source Framework for Adaptive Traffic Signal Control


Authors: Wade Genders, Saiedeh Razavi

JOURNAL OF TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. X, NO. X, AUGUST 2019

Abstract—Sub-optimal control policies in transportation systems negatively impact mobility, the environment and human health. Developing optimal transportation control systems at the appropriate scale can be difficult as cities' transportation systems can be large, complex and stochastic. Intersection traffic signal controllers are an important element of modern transportation infrastructure where sub-optimal control policies can incur high costs to many users. Many adaptive traffic signal controllers have been proposed by the community, but research is lacking regarding their relative performance; which adaptive traffic signal controller is best remains an open question. This research contributes a framework for developing and evaluating different adaptive traffic signal controller models, both learning and non-learning, in simulation, and demonstrates its capabilities. The framework is used first to investigate the performance variance of the modelled adaptive traffic signal controllers with respect to their hyperparameters and second to analyze the performance differences between controllers with optimal hyperparameters. The proposed framework contains implementations of some of the most popular adaptive traffic signal controllers from the literature: Webster's, Max-pressure and Self-Organizing Traffic Lights, along with deep Q-network and deep deterministic policy gradient reinforcement learning controllers. This framework will aid researchers by accelerating their work from a common starting point, allowing them to generate results faster with less effort. All framework source code is available at https://github.com/docwza/sumolights.
Index Terms—traffic signal control, adaptive traffic signal control, intelligent transportation systems, reinforcement learning, neural networks.

I. INTRODUCTION

Cities rely on road infrastructure for transporting individuals, goods and services. Sub-optimal control policies incur environmental, human mobility and health costs. Studies observe vehicles consume a significant amount of fuel accelerating, decelerating or idling at intersections [1]. Land transportation emissions are estimated to be responsible for one third of all mortality from fine particulate matter pollution in North America [2]. Globally, over three million deaths are attributed to air pollution per year [3]. In 2017, residents of three of the United States' biggest cities, Los Angeles, New York and San Francisco, spent between three and four days on average delayed in congestion over the year, respectively costing 19, 33 and 10 billion USD from fuel and individual time waste [4]. It is paramount to ensure transportation systems are optimal to minimize these costs.

W. Genders was a Ph.D. student with the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: genderwt@mcmaster.ca). S. Razavi is an Associate Professor, Chair in Heavy Construction and Director of the McMaster Institute for Transportation & Logistics at the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: razavi@mcmaster.ca).

20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Manuscript received August X, 2019; revised August X, 2019.
Automated control systems are used in many aspects of transportation systems. Intelligent transportation systems seek to develop optimal solutions in transportation using intelligence. Intersection traffic signal controllers are an important element of many cities' transportation infrastructure where sub-optimal solutions can contribute high costs. Traditionally, traffic signal controllers have functioned using primitive logic which can be improved. Adaptive traffic signal controllers can improve upon traditional traffic signal controllers by conditioning their control on current traffic conditions.

Traffic microsimulators such as SUMO [5], Paramics, VISSIM and AIMSUN have become popular tools for developing and testing adaptive traffic signal controllers before field deployment. However, researchers interested in studying adaptive traffic signal controllers are often burdened with developing their own adaptive traffic signal control implementations de novo. This research contributes an adaptive traffic signal control framework, including Webster's, Max-pressure, Self-organizing traffic lights (SOTL), deep Q-network (DQN) and deep deterministic policy gradient (DDPG) implementations for the freely available SUMO traffic microsimulator to aid researchers in their work. The framework's capabilities are demonstrated by studying the effect of optimizing traffic signal controller hyperparameters and comparing optimized adaptive traffic signal controllers' relative performance.

II. BACKGROUND

A. Traffic Signal Control

An intersection is composed of traffic movements, or ways that a vehicle can traverse the intersection beginning from an incoming lane to an outgoing lane. Traffic signal controllers use phases, combinations of coloured lights that indicate when specific movements are allowed, to control vehicles at the intersection.
Fundamentally, a traffic signal control policy can be decoupled into two sequential decisions at any given time: what should the next phase be, and for how long? A variety of models have been proposed as policies. The simplest and most popular traffic signal controller determines the next phase by displaying the phases in an ordered sequence known as a cycle, where each phase in the cycle has a fixed, potentially unique, duration; this is known as a fixed-time, cycle-based traffic signal controller. Although simple, fixed-time, cycle-based traffic signal controllers are ubiquitous in transportation networks because they are predictable, stable and effective, as traffic demands exhibit reliable patterns over regular periods (i.e., times of the day, days of the week). However, as ubiquitous as the fixed-time controller is, researchers have long sought to develop improved traffic signal controllers which can adapt to changing traffic conditions.

Actuated traffic signal controllers use sensors and boolean logic to create dynamic phase durations. Adaptive traffic signal controllers are capable of acyclic phase sequences and dynamic phase durations to adapt to changing intersection traffic conditions. Adaptive controllers attempt to achieve higher performance at the expense of complexity, cost and reliability. Various techniques have been proposed as the foundation for adaptive traffic signal controllers, from analytic mathematical solutions to heuristics and machine learning.

B. Literature Review

Developing an adaptive traffic signal controller ultimately requires some type of optimization technique.
For decades researchers have proposed adaptive traffic signal controllers based on a variety of techniques such as evolutionary algorithms [6], [7], [8], [9], [10], [11] and heuristics such as pressure [12], [13], [14], immunity [15], [16] and self-organization [17], [18], [19]. Additionally, many comprehensive adaptive traffic signal control systems have been proposed such as OPAC [20], SCATS [21], RHODES [22] and ACS-Lite [23]. Reinforcement learning has been demonstrated to be an effective method for developing adaptive traffic signal controllers in simulation [6], [24], [25], [26], [27], [28]. Recently, deep reinforcement learning has been used for adaptive traffic signal control with varying degrees of success [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. A comprehensive review of reinforcement learning adaptive traffic signal controllers is presented in Table I.

Readers interested in additional adaptive traffic signal control research can consult extensive review articles [40], [41], [42], [43], [44]. Although ample research exists proposing novel adaptive traffic signal controllers, it can be arduous to compare between previously proposed ideas. Developing adaptive traffic signal controllers can be challenging as many of them require defining many hyperparameters. The authors seek to address these problems by contributing an adaptive traffic signal control framework to aid researchers in their work.

C. Contribution

The authors' work contributes in the following areas:

• Diverse Adaptive Traffic Signal Controller Implementations: The proposed framework contributes adaptive traffic signal controllers based on a variety of paradigms, the broadest being non-learning (e.g., Webster's, SOTL, Max-pressure) and learning (e.g., DQN and DDPG).
The diversity of adaptive traffic signal controllers allows researchers to experiment at their leisure without investing time developing their own implementations.

• Scalable, Optimized: The proposed framework is optimized for use with parallel computation techniques leveraging modern multicore computer architecture. This feature significantly reduces the compute time of learning-based adaptive traffic signal controllers and the generation of results for all controllers. By making the framework computationally efficient, the search for optimal hyperparameters is tractable with modest hardware (e.g., an 8-core CPU). The framework was designed to scale to develop adaptive controllers for any SUMO network. All source code used in this manuscript can be retrieved from https://github.com/docwza/sumolights.

III. TRAFFIC SIGNAL CONTROLLERS

Before describing each traffic signal controller in detail, elements common to all are detailed. All of the included traffic signal controllers share the following: a set of intersection lanes L, decomposed into incoming lanes L_inc and outgoing lanes L_out, and a set of green phases P. The set of incoming lanes with green movements in phase p ∈ P is denoted as L_{p,inc} and their outgoing lanes as L_{p,out}.

A. Non-Learning Traffic Signal Controllers

1) Uniform: A simple cycle-based, uniform phase duration traffic signal controller is included for use as a baseline comparison to the other controllers. The uniform controller's only hyperparameter is the green duration u, which defines the same duration for all green phases; the next phase is determined by a cycle.

2) Webster's: Webster's method develops a cycle-based, fixed phase length traffic signal controller using phase flow data [53].
The authors propose an adaptive Webster's traffic signal controller by collecting data for a time interval W in duration and then using Webster's method to calculate the cycle and green phase durations for the next W time interval. This adaptive Webster's essentially uses the most recent W interval to collect data and assumes the traffic demand will be approximately the same during the next W interval. The selection of W is important and exhibits various trade-offs: smaller values allow for more frequent adaptations to changing traffic demands at the risk of instability, while larger values adapt less frequently but allow for increased stability. Pseudo-code for the Webster's traffic signal controller is presented in Algorithm 1.

Algorithm 1 Webster's Algorithm
1: procedure WEBSTER(c_min, c_max, s, F, R)
2:   # compute critical lanes for each phase
3:   Y = { max({ F_l / s for l in L_{p,inc} }) for p in P }
4:   # compute cycle length
5:   C = (1.5 * R + 5) / (1.0 − Σ Y)
6:   if C < c_min then
7:     C = c_min
8:   else if C > c_max then
9:     C = c_max
10:  end if
11:  G = C − R
12:  # allocate green time proportional to flow
13:  return C, { G * y / Σ Y for y in Y }
14: end procedure

In Algorithm 1, F represents the set of phase flows collected over the most recent W interval and R represents the total cycle lost time. In addition to the time interval hyperparameter W, the adaptive Webster's algorithm also has hyperparameters defining a minimum cycle duration c_min, maximum cycle duration c_max and lane saturation flow rate s.

3) Max-pressure: The Max-pressure algorithm develops an acyclic, dynamic phase length traffic signal controller. The Max-pressure algorithm models vehicles in lanes as a substance in a pipe and enacts control in a manner which attempts to maximize the relief of pressure between incoming and outgoing lanes [13]. For a given green phase p, the pressure is defined in (1):

Pressure(p) = Σ_{l ∈ L_{p,inc}} |V_l| − Σ_{l ∈ L_{p,out}} |V_l|   (1)

where L_{p,inc} represents the set of incoming lanes with green movements in phase p and L_{p,out} represents the set of outgoing lanes from all incoming lanes in L_{p,inc}. Pseudo-code for the Max-pressure traffic signal controller is presented in Algorithm 2.

Algorithm 2 Max-pressure Algorithm
1: procedure MAXPRESSURE(g_min, t_p, P)
2:   if t_p < g_min then
3:     t_p = t_p + 1
4:   else
5:     t_p = 0
6:     # next phase has largest pressure
7:     return argmax({ Pressure(p) for p in P })
8:   end if
9: end procedure

In Algorithm 2, t_p represents the time spent in the current phase. The Max-pressure algorithm requires a minimum green time hyperparameter g_min which ensures a newly enacted phase has a minimum duration.

TABLE I
ADAPTIVE TRAFFIC SIGNAL CONTROL RELATED WORK

Research | Network           | Intersections | Multi-agent    | RL           | Function Approximation
[45]     | Grid              | 15            | Max-plus       | Model-based  | N/A
[27]     | Grid, Corridor    | < 10          | None           | Q-learning   | Linear
[46]     | Springfield, USA  | 20            | Max-plus       | Q-learning   | N/A
[28]     | Toronto, Canada   | 59            | Game Theory    | Q-learning   | Tabular
[47]     | N/A               | 50            | Holonic        | Q-learning   | N/A
[48]     | Grid              | 22            | Reward Sharing | Q-learning   | Bayesian
[49]     | Grid              | 100           | Regional       | Q-learning   | Linear
[50]     | Barcelona, Spain  | 43            | Centralized    | DDPG         | DNN¹
[33]     | Tehran, Iran      | 50            | None           | Actor-Critic | RBF², Tile Coding
[51]     | Changsha, China   | 96            | Reward sharing | Q-learning   | Linear
[52]     | Luxembourg City   | 195           | None           | DDPG         | DNN¹

¹ Deep Neural Network (DNN). ² Radial Basis Function (RBF).
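To make the shape of Algorithms 1 and 2 concrete, the following is a minimal Python sketch of both. It assumes lane data arrives as plain dicts (phase-to-lane-flows for Webster's, lane vehicle counts and a phase-to-lanes mapping for Max-pressure); the function names and data layout are illustrative, not the framework's actual API.

```python
def webster_plan(phase_flows, c_min, c_max, sat_flow, lost_time):
    """Algorithm 1: cycle length and green splits from phase flow data.

    phase_flows maps each green phase to the flows (veh/h) of its
    incoming lanes; sat_flow is the lane saturation flow rate s and
    lost_time is the total cycle lost time R.
    """
    # critical flow ratio per phase (the highest-flow lane governs)
    Y = {p: max(f / sat_flow for f in lanes) for p, lanes in phase_flows.items()}
    C = (1.5 * lost_time + 5) / (1.0 - sum(Y.values()))
    C = min(max(C, c_min), c_max)  # clamp cycle to [c_min, c_max]
    G = C - lost_time              # total green time in the cycle
    # allocate green time proportional to each phase's critical flow ratio
    return C, {p: G * y / sum(Y.values()) for p, y in Y.items()}


def max_pressure_phase(phases, veh_counts, phase_lanes):
    """Algorithm 2 core: select the green phase with the largest pressure,
    i.e., incoming minus outgoing vehicle counts (Eq. 1)."""
    def pressure(p):
        inc, out = phase_lanes[p]  # (incoming lanes, their outgoing lanes)
        return sum(veh_counts[l] for l in inc) - sum(veh_counts[l] for l in out)
    return max(phases, key=pressure)
```

In a SUMO deployment the flows and counts would come from simulation queries each control step; here they are passed in directly so the timing arithmetic is easy to verify by hand.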
4) Self-Organizing Traffic Lights: Self-organizing traffic lights (SOTL) [17], [18], [19] develop a cycle-based, dynamic phase length traffic signal controller based on self-organizing principles, where a "...self-organizing system would be one in which elements are designed to dynamically and autonomously solve a problem or perform a function at the system level." [18, p. 2]. Pseudo-code for the SOTL traffic signal controller is presented in Algorithm 3.

Algorithm 3 SOTL Algorithm
1: procedure SOTL(t_p, g_min, θ, ω, µ)
2:   # accumulate red phase vehicle-time integral
3:   κ = κ + Σ_{l ∈ L_inc − L_{p,inc}} |V_l|
4:   if t_p > g_min then
5:     # vehicles approaching in current green phase
6:     # within ω distance of the stop line
7:     n = Σ_{l ∈ L_{p,inc}} |V_l|
8:     # only consider a phase change if there is no platoon
9:     # or the platoon is too large (n > µ)
10:    if n > µ or n == 0 then
11:      if κ > θ then
12:        κ = 0
13:        # next phase in cycle
14:        i = i + 1
15:        return P_{i mod |P|}
16:      end if
17:    end if
18:  end if
19: end procedure

The SOTL algorithm functions by changing lights according to a vehicle-time integral threshold θ constrained by a minimum green phase duration g_min. Additionally, small (i.e., n < µ) vehicle platoons are kept together by preventing a phase change if they are sufficiently close (i.e., at a distance < ω) to the stop line.

B. Learning Traffic Signal Controllers

Reinforcement learning uses the framework of Markov Decision Processes to solve goal-oriented, sequential decision-making problems by repeatedly acting in an environment. At discrete points in time t, a reinforcement learning agent observes the environment state s_t and then uses a policy π to determine an action a_t. After implementing its selected action, the agent receives feedback from the environment in the form of a reward r_t and observes a new environment state s_{t+1}.
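A single SOTL decision step (Algorithm 3) can be sketched as a small stateful function. This is an illustrative reading of the pseudo-code, not the framework's implementation: the controller state is a plain dict, and approach_count is assumed to already count only vehicles within ω of the stop line (that filtering would happen upstream when querying the simulator).

```python
def sotl_step(ctrl, red_count, approach_count, g_min, theta, mu, n_phases):
    """One SOTL step. ctrl holds 'kappa' (red-phase vehicle-time integral),
    't_p' (time in current phase) and 'i' (cycle index). red_count is the
    number of vehicles on red approaches this second; approach_count is the
    number of vehicles within omega of the stop line on green approaches.
    Returns the next phase index, or None to keep the current phase."""
    ctrl["kappa"] += red_count  # integrate demand waiting on red
    ctrl["t_p"] += 1
    if ctrl["t_p"] > g_min:
        n = approach_count
        # change only if no platoon is crossing, or the platoon is too large
        if (n > mu or n == 0) and ctrl["kappa"] > theta:
            ctrl["kappa"] = 0
            ctrl["t_p"] = 0
            ctrl["i"] += 1
            return ctrl["i"] % n_phases
    return None
```

With θ = 10 and three vehicles waiting on red each second, the threshold is crossed on the fourth step and the cycle advances to the next phase.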
The reward quantifies how 'well' the agent is achieving its goal (e.g., score in a game, completed tasks). This process is repeated until a terminal state s_terminal is reached, and then begins anew. The return G_t = Σ_{k=0}^{T} γ^k r_{t+k} is the accumulation of rewards by the agent over some time horizon T, discounted by γ ∈ [0, 1). The agent seeks to maximize the expected return E[G_t] from each state s_t. The agent develops an optimal policy π* to maximize the return.

There are many techniques for an agent to learn the optimal policy; however, most of them rely on estimating value functions. Value functions are useful to estimate future rewards. State value functions V^π(s) = E[G_t | s_t = s] represent the expected return starting from state s and following policy π. Action value functions Q^π(s, a) = E[G_t | s_t = s, a_t = a] represent the expected return starting from state s, taking action a and following policy π. In practice, value functions are unknown and must be estimated using sampling and function approximation techniques. Parametric function approximation, such as neural networks, uses a set of parameters θ to estimate an unknown function f(x | θ) ≈ f(x). To develop accurate approximations, the function parameters must be developed with some optimization technique.

Experiences are tuples e_t = (s_t, a_t, r_t, s_{t+1}) that represent an interaction between the agent and the environment at time t. A reinforcement learning agent interacts with its environment in trajectories, or sequences of experiences e_t, e_{t+1}, e_{t+2}, .... Trajectories begin in an initial state s_init and end in a terminal state s_terminal. To accurately estimate value functions, experiences are used to optimize the parameters. If neural network function approximation is used, the parameters are optimized using experiences to perform gradient-based techniques and backpropagation [54], [55].
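The discounted return G_t = Σ_k γ^k r_{t+k} defined above is usually computed backwards over a reward sequence, which both avoids repeated exponentiation and makes the recursion G_t = r_t + γ G_{t+1} explicit; a short sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards in O(n)
    via the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.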
Additional technical details regarding the proposed reinforcement learning adaptive traffic signal controllers can be found in the Appendix.

To train reinforcement learning controllers for all intersections, a distributed acting, centralized learning architecture is developed [56], [57], [58]. Using parallel computing, multiple actors and learners are created, illustrated in Figure 1. Actors have their own instance of the traffic simulation and neural networks for all intersections. Learners are assigned a subset of all intersections; for each they have a neural network and an experience replay buffer D. Actors generate experiences e_t for all intersections and send them to the appropriate learner. Learners only receive experiences for their assigned subset of intersections. The learner stores the experiences in an experience replay buffer, which is uniformly sampled for batches to optimize the neural network parameters. After computing parameter updates, learners send new parameters to all actors. There are many benefits to this architecture, foremost being that it makes the problem feasible; because there are hundreds of agents, distributing computation across many actors and learners is necessary to decrease training time. Another benefit is experience diversity, granted by multiple environments and varied exploration rates.

C. DQN

The proposed DQN traffic signal controller enacts control by choosing the next green phase without utilizing a phase cycle. This acyclic architecture is motivated by the observation that enacting phases in a repeating sequence may contribute to a sub-optimal control policy. After the DQN has selected the next phase, it is enacted for a fixed duration known as an action repeat a_repeat. After the phase has been enacted for the action repeat duration, a new phase is selected acyclically.
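The experience replay buffer D at the core of each learner is conceptually simple: a bounded store of (s, a, r, s') tuples sampled uniformly for training batches. A minimal sketch, assuming a fixed capacity with oldest-first eviction (the class name and interface are illustrative, not the framework's):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: stores (s, a, r, s_next) tuples up to a
    fixed capacity, evicting the oldest experiences first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform sampling without replacement within the batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In the distributed architecture described above, actors would call add() (indirectly, via messages to their learner) while the learner calls sample() to build minibatches for gradient updates.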
1) State: The proposed state observation for the DQN is a combination of the most recent green phase and the density and queue of incoming lanes at the intersection at time t. Assume each intersection has a set L of incoming lanes and a set P of green phases. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The most recent phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.

2) Action: The proposed action space for the DQN traffic signal controller is the next green phase. The DQN selects one action from a discrete set, in this model one of the many possible green phases a_t ∈ P. After a green phase has been selected, it is enacted for a duration equal to the action repeat a_repeat.

3) Reward: The reward used to train the DQN traffic signal controller is a function of vehicle delay. Delay d is the difference between a vehicle's free-flow travel time and its actual travel time. Specifically, the reward is the negative sum of all vehicles' delay at the intersection, defined in (2):

r_t = − Σ_{v ∈ V} d_t^v   (2)

where V is the set of all vehicles on incoming lanes at the intersection, and d_t^v is the delay of vehicle v at time t. Defined in this way, the reward is a punishment, with the agent's goal to minimize the amount of punishment it receives. Each intersection saves the reward with the largest magnitude experienced to perform minimum reward normalization r_t / |r_min|, scaling the reward to the range [−1, 0] for stability.

4) Agent Architecture: The agent approximates the action-value Q function with a deep artificial neural network. The action-value function Q is two hidden layers of 3|s_t| fully connected neurons with exponential linear unit (ELU) activation functions, and the output layer is |P| neurons with linear activation functions.
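The state construction above (normalized per-lane densities and queues concatenated with a one-hot phase encoding) is easy to express directly. A sketch, assuming densities and queues arrive as per-lane lists and that both are normalized by the same jam density, as the text describes; the function name and argument layout are illustrative:

```python
def dqn_state(densities, queues, phase_idx, n_phases, jam_density):
    """Assemble the DQN observation: per-lane density and queue, each
    normalized to [0, 1] by the jam density, concatenated with a one-hot
    encoding of the most recent phase. Index n_phases (the extra slot)
    encodes the all-red clearance phase."""
    dens = [d / jam_density for d in densities]
    que = [q / jam_density for q in queues]
    one_hot = [0.0] * (n_phases + 1)
    one_hot[phase_idx] = 1.0
    return dens + que + one_hot
```

For |L| = 2 lanes and |P| = 2 green phases this yields a vector of length 2|L| + |P| + 1 = 7, matching the state space S ∈ (R^{2|L|} × B^{|P|+1}).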
The Q function's input is the local intersection state s_t. A visualization of the DQN is presented in Fig. 1.

Fig. 1. Adaptive traffic signal control DDPG and DQN neural network agents (left) and distributed acting, centralized learning architecture (right) composed of actors and learners. Each actor has one SUMO network as an environment and neural networks for all intersections. Each learner is assigned a subset of intersections at the beginning of training and is only responsible for computing parameter updates for their assigned intersections, effectively distributing the computation load for learning. However, learners distribute parameter updates to all actors.

D. DDPG Traffic Signal Controller

The proposed DDPG traffic signal controller implements a cycle with dynamic phase durations. This architecture is motivated by the observation that cycle-based policies can maintain fairness and ensure a minimum quality of service between all intersection users. Once the next green phase has been determined using the cycle, the policy π is used to select its duration. Explicitly, the reinforcement learning agent is learning how long in duration to make the next green phase in the cycle to maximize its return. Additionally, the cycle skips phases when no vehicles are present on incoming lanes.

1) Actor State: The proposed state observation for the actor is a combination of the current phase and the density and queue of incoming lanes at the intersection at time t. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The current phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.
2) Critic State: The proposed state observation for the critic combines the state s_t and the actor's action a_t, depicted in Figure 1.

3) Action: The proposed action space for the adaptive traffic signal controller is the duration of the next green phase in seconds. The action controls the duration of the next phase; there is no agency over what the next phase is, only over how long it will last. The DDPG algorithm produces a continuous output, a real number over some range a_t ∈ R. Since the DDPG algorithm outputs a real number and the phase duration is defined in intervals of seconds, the output is rounded to the nearest integer. In practice, phase durations are bounded by minimum time g_min and maximum time g_max hyperparameters to ensure a minimum quality of service for all users. Therefore the agent selects an action {a_t ∈ Z | g_min ≤ a_t ≤ g_max} as the next phase duration.

4) Reward: The reward used to train the DDPG traffic signal controller is the same delay reward used by the DQN traffic signal controller, defined in (2).

5) Agent Architecture: The agent approximates the policy π and action-value Q function with deep artificial neural networks. The policy function is two hidden layers of 3|s_t| fully connected neurons, each with batch normalization and ELU activation functions, and the output layer is one neuron with a hyperbolic tangent activation function. The action-value function Q is two hidden layers of 3(|s_t| + |a_t|) fully connected neurons with batch normalization and ELU activation functions, and the output layer is one neuron with a linear activation function. The policy's input is the intersection's local traffic state s_t and the action-value function's input is the local state concatenated with the local action s_t + a_t. The action-value Q function also uses an L2 weight regularization of λ = 0.01.
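The paper states the actor ends in a tanh output while the enacted action is an integer duration in [g_min, g_max]. One common way to bridge the two, which may or may not match the framework's exact mapping, is an affine rescaling of the tanh range [−1, 1] followed by rounding and clamping; a sketch:

```python
def scale_action(tanh_out, g_min, g_max):
    """Map the actor's tanh output in [-1, 1] to an integer green phase
    duration in [g_min, g_max] seconds (affine rescale, round, clamp).
    An illustrative mapping, not necessarily the framework's exact one."""
    duration = g_min + (tanh_out + 1.0) * 0.5 * (g_max - g_min)
    return int(round(min(max(duration, g_min), g_max)))
```

The clamp also guards against exploration noise pushing the raw output slightly outside [−1, 1], which DDPG's additive action noise can do.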
By deep reinforcement learning standards, the networks used are not that deep; however, their architecture is selected for simplicity and can easily be modified within the framework. Simple deep neural networks were also implemented to allow for future scalability, as the proposed framework can be deployed to any SUMO network; to reduce the computational load, the default networks are simple.

IV. EXPERIMENTS

A. Hyperparameter Optimization

To demonstrate the capabilities of the proposed framework, experiments are conducted on optimizing adaptive traffic signal control hyperparameters. The framework is for use with the SUMO traffic microsimulator [5], which was used to evaluate the developed adaptive traffic signal controllers. Understanding how sensitive any specific adaptive traffic signal controller's performance is to changes in hyperparameters is important to instill confidence that the solution is robust. Determining optimal hyperparameters is necessary to ensure a balanced comparison between adaptive traffic signal control methods.

Fig. 2. Two intersection SUMO network used for hyperparameter experiments. In addition to this two intersection network, a single, isolated intersection is also included with the framework.

Using the hyperparameter optimization script included in the framework, a grid search is performed with the implemented controllers' hyperparameters on a two intersection network, shown in Fig. 2, under a simulated three hour dynamic traffic demand scenario. The results for each traffic signal controller are displayed in Fig. 3 and collectively in Fig. 4. As can be observed in Fig. 3 and Fig. 4, the choice of hyperparameters significantly impacts the performance of the given traffic signal controller. As a general trend observed in Fig.
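The grid search described above has a simple generic shape: evaluate every combination of hyperparameter values and keep the configuration with the lowest score (here, mean travel time over several seeded runs). A sketch of that loop, independent of the framework's actual optimization script; the evaluate callback and its scoring are assumptions for illustration:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive hyperparameter grid search.

    grid maps each hyperparameter name to its candidate values;
    evaluate(config) returns a score to minimize, e.g. mean travel
    time over several simulations with random seeds.
    """
    names = sorted(grid)
    best_cfg, best_score = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Note the combinatorial cost: the number of evaluations is the product of all value-list lengths, which is why the framework's parallel actors matter when each evaluation is a full SUMO simulation.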
3, methods with larger numbers of hyperparameters (e.g., SOTL, DDPG, DQN) exhibit greater performance variance than methods with fewer hyperparameters (e.g., Max-pressure). Directly comparing methods in Fig. 4 demonstrates the non-learning adaptive traffic signal control methods' (e.g., Max-pressure, Webster's) robustness to hyperparameter values and high performance (i.e., lowest travel time). Learning-based methods exhibit higher variance with changes in hyperparameters, DQN more so than DDPG. In the following section, the best hyperparameters for each adaptive traffic signal controller will be used to further investigate and compare performance.

B. Optimized Adaptive Traffic Signal Controllers

Using the optimized hyperparameters, all traffic signal controllers are subjected to an additional 32 simulations with random seeds to estimate their performance, quantified using network travel time and individual intersection queue and delay measures of effectiveness (MoE). Results are presented in Fig. 5 and Fig. 6. Observing the travel time boxplots in Fig. 5, the SOTL controller produces the worst results, exhibiting a mean travel time almost twice the next closest method and with many significant outliers. The Max-pressure algorithm achieves the best performance, with the lowest mean and median along with the lowest standard deviation. The DQN, DDPG, Uniform and Webster's controllers achieve approximately equal performance; however, the DQN controller has significant outliers, indicating some vehicles experience much longer travel times than most. Each intersection's queue and delay MoE with respect to each adaptive traffic signal controller is presented in Fig. 6. The results are consistent with previous observations from the hyperparameter search and travel time data; however, the reader's attention is directed to comparing the performance of DQN and DDPG in Fig. 6.
The DQN controller performs poorly (i.e., high queues and delay) at the beginning and end of the simulation when traffic demand is low. However, at the demand peak, the DQN controller performs just as well as, if not a little better than, every method except the Max-pressure controller. The DDPG controller's performance is the opposite of the DQN controller's: it achieves relatively low queues and delay at the beginning and end of the simulation and is then bested by the DQN controller in the middle of the simulation when the demand peaks. This performance difference can potentially be understood by considering the difference between the DQN and DDPG controllers. The DQN's ability to select the next phase acyclically under high traffic demand may allow it to reduce queues and delay more than the cycle-constrained DDPG controller. However, it is curious that the DQN controller's performance suffers under low demands, when it should be relatively simple to develop the optimal policy. The DQN controller may be overfitting to the periods in the environment when the magnitude of the rewards is large (i.e., in the middle of the simulation when the demand peaks) and converging to a policy that doesn't generalize well to the environment when the traffic demand is low. The authors present these findings to readers and suggest future research investigate this and other issues to understand the performance difference between reinforcement learning traffic signal controllers. Understanding the advantages and disadvantages of a variety of controllers can provide insight into developing future improvements.

V. CONCLUSION & FUTURE WORK

Learning and non-learning adaptive traffic signal controllers have been developed within an optimized framework for the traffic microsimulator SUMO for use by the research community.
The proposed framework's capabilities were demonstrated by studying adaptive traffic signal control algorithms' sensitivity to their hyperparameters: hyperparameter-rich (i.e., learning) controllers were found to be sensitive, while hyperparameter-sparse (i.e., heuristic) controllers were relatively insensitive. Poor hyperparameters can drastically alter the performance of an adaptive traffic signal controller, leading researchers to erroneous conclusions about a controller's performance. This research provides evidence that dozens or hundreds of hyperparameter configurations may have to be tested before selecting the optimal one. Using the optimized hyperparameters, each adaptive controller's performance was estimated and the Max-pressure controller was found to achieve the best performance, yielding the lowest travel times, queues and delay. This manuscript's research provides evidence that heuristics can offer powerful solutions even compared to complex deep-learning methods. This is not to suggest that this is definitively the case in all environments and circumstances. The authors hypothesize that learning-based controllers can be further developed to offer improved performance that may yet best the non-learning, heuristic-based methods detailed in this research. Promising extensions that have improved reinforcement learning in other applications and may do the same for adaptive traffic signal control include richer function approximators [59], [60], [61] and reinforcement learning algorithms [62], [63], [64]. The authors intend for the framework to grow, with the addition of more adaptive traffic signal controllers and features. In its current state, the framework can already aid adaptive traffic signal control researchers in rapidly experimenting on a SUMO network of their choice. Acknowledging the importance of optimizing our transportation systems, the authors hope this research helps others solve practical problems.

Fig. 3. Individual hyperparameter results for each traffic signal controller. Travel time is used as a measure of effectiveness and is estimated for each hyperparameter from eight simulations with random seeds in units of seconds (s). The coloured dots gradient from green (best) to red (worst) orders the hyperparameters by the sum of the travel time mean and standard deviation. Note differing scales between graph axes, making direct visual comparison biased.

Fig. 4. Comparison of all traffic signal controller hyperparameter travel time performance. Note both vertical and horizontal axis limits have been clipped at 200 to improve readability.
Fig. 5. Boxplots depicting the distribution of travel times (s) for each traffic signal controller. The solid white line represents the median, the solid coloured box the interquartile range (IQR) (i.e., from the first quartile (Q1) to the third quartile (Q3)), solid coloured lines Q1 - 1.5 IQR and Q3 + 1.5 IQR, and coloured crosses the outliers. The annotated (mean, standard deviation, median) travel times are: DDPG (72, 34, 65), DQN (78, 46, 66), Max-pressure (59, 21, 54), SOTL (158, 169, 85), Uniform (79, 37, 74) and Webster's (71, 30, 66).

Fig. 6. Comparison of traffic signal controller individual intersection (gneJ0, gneJ6) queue and delay measures of effectiveness in units of vehicles (veh) and seconds (s). Solid coloured lines represent the mean and shaded areas represent the 95% confidence interval. SOTL has been omitted to improve readability since its queue and delay values are exclusively outside the graph range.

APPENDIX A
DQN

Deep Q-Networks [65] combine Q-learning and deep neural networks to produce autonomous agents capable of solving complex tasks in high-dimensional environments. Q-learning [66] is a model-free, off-policy, value-based temporal difference [67] reinforcement learning algorithm which can be used to develop an optimal discrete action space policy for a given problem.
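As a concrete illustration of the Q-learning update that underlies DQN, a minimal tabular sketch follows. The states, actions and reward below are arbitrary placeholders, and the tabular form is only for illustration; DQN replaces the table with a neural network.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One temporal-difference Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)   # action-value table, initialised to zero
actions = [0, 1]
q_update(Q, "s0", 0, 1.0, "s1", actions)
# From a zero-initialised table, one update gives Q[("s0", 0)] = 0.1 * 1.0 = 0.1
```

Repeating this update over experienced transitions lets Q(s, a) bootstrap toward the expected return of acting optimally after taking a in s.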
Like other temporal difference algorithms, Q-learning uses bootstrapping [68] (i.e., using an estimate to improve future estimates) to develop an action-value function Q(s, a) which estimates the expected return of taking action a in state s and acting optimally thereafter. If the Q function can be estimated accurately, it can be used to derive the optimal policy π* = argmax_a Q(s, a). In DQN, the Q function is approximated with a deep neural network. The DQN algorithm utilizes two techniques to ensure stable development of the Q function with DNN function approximation: a target network and experience replay. Two parameter sets are used when training a DQN, online θ and target θ′. The target parameters θ′ are used to stabilize the return estimates when performing updates to the neural network and are periodically set to the online parameters (θ′ = θ) at a fixed interval. The experience replay is a buffer which stores the most recent D experience tuples to create a slowly changing dataset; batches of experiences are uniformly sampled from the replay to update the Q function online parameters θ. Training a deep neural network requires a loss function, which is used to determine how to change the parameters to achieve better approximations of the training data. Reinforcement learning develops value functions (e.g., neural networks) using experiences from the environment. The DQN's loss is the mean squared error between the return target y_t (an estimate of the return G_t) and the prediction, defined in (3):

y_t = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a | θ′) | θ′)
L_DQN(θ) = (y_t − Q(s_t, a_t | θ))²    (3)

APPENDIX B
DDPG

Deep deterministic policy gradients [69] are an extension of DQN to continuous action spaces. Similar to Q-learning, DDPG is a model-free, off-policy reinforcement learning algorithm.
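A minimal NumPy sketch of the DQN target and loss in (3) follows. This is an illustration only, not the framework's TensorFlow implementation; because both the argmax and the evaluation in (3) use the target parameters θ′, the target reduces to the maximum of the target network's Q-values for the next state.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, gamma=0.99):
    """y_t = r_t + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a | theta') | theta').
    next_q_target holds the target network's Q-values for each next state,
    shape (batch, n_actions), so the target is a row-wise max."""
    return rewards + gamma * next_q_target.max(axis=1)

def dqn_loss(y, q_pred):
    """Mean squared error between targets y_t and online predictions Q(s_t, a_t | theta)."""
    return float(np.mean((y - q_pred) ** 2))

rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0],   # illustrative target-network Q-values
                   [1.0, 0.0]])
y = dqn_targets(rewards, next_q)  # [1 + 0.99*2.0, 0 + 0.99*1.0] = [2.98, 0.99]
```

Gradients of this loss with respect to θ are then used to update the online network, while θ′ is only refreshed periodically (θ′ = θ).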
The DDPG algorithm is an example of actor-critic learning, as it develops a policy function π(s | φ) (i.e., actor) using an action-value function Q(s, a | θ) (i.e., critic). The actor interacts with the environment and modifies its behaviour based on feedback from the critic. The DDPG critic's loss is the mean squared error between the return target y_t (an estimate of the return G_t) and the prediction, defined in (4):

y_t = r_t + γ Q(s_{t+1}, π(s_{t+1} | φ′) | θ′)
L_Critic(θ) = (y_t − Q(s_t, a_t | θ))²    (4)

The DDPG actor's loss is the sampled policy gradient, defined in (5):

L_Actor(φ) = ∇_φ Q(s_t, π(s_t | φ) | θ)    (5)

Like DQN, DDPG uses two sets of parameters, online θ and target θ′, and experience replay [70] to reduce instability during training. DDPG performs updates on the parameters of both the actor and the critic by uniformly sampling batches of experiences from the replay. The target parameters are slowly updated towards the online parameters according to θ′ = (1 − τ)θ′ + τθ after every batch update.

APPENDIX C
TECHNICAL

Software used includes SUMO 1.2.0 [5], TensorFlow 1.13 [71], SciPy [72] and public code [73]. The neural network parameters were initialized with He initialization [74] and optimized using Adam [75]. To ensure intersection safety, two second yellow change and three second all-red clearance phases were inserted between all green phase transitions. For the DQN and DDPG traffic signal controllers, if no vehicles are present at the intersection, the phase defaults to all-red, which is considered a terminal state s_terminal. Each intersection's state observation is bounded by 150 m (i.e., the queue and density are calculated from vehicles up to a maximum of 150 m from the intersection stop line).

ACKNOWLEDGMENTS

This research was enabled in part by support in the form of computing resources provided by SHARCNET (www.
sharcnet.ca), their McMaster University staff and Compute Canada (www.computecanada.ca).

REFERENCES

[1] L. Wu, Y. Ci, J. Chu, and H. Zhang, "The influence of intersections on fuel consumption in urban arterial road traffic: a single vehicle test in Harbin, China," PLoS ONE, vol. 10, no. 9, 2015.
[2] R. A. Silva, Z. Adelman, M. M. Fry, and J. J. West, "The impact of individual anthropogenic emissions sectors on the global burden of human mortality due to ambient air pollution," Environmental Health Perspectives, vol. 124, no. 11, p. 1776, 2016.
[3] World Health Organization et al., "Ambient air pollution: A global assessment of exposure and burden of disease," 2016.
[4] G. Cookson, "INRIX global traffic scorecard," INRIX, Tech. Rep., 2018.
[5] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO - Simulation of Urban MObility," International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[6] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence. IEEE, 1994, pp. 223–228.
[7] J. Lee, B. Abdulhai, A. Shalaby, and E.-H. Chung, "Real-time optimization for adaptive traffic signal control using genetic algorithms," Journal of Intelligent Transportation Systems, vol. 9, no. 3, pp. 111–122, 2005.
[8] H. Prothmann, F. Rochner, S. Tomforde, J. Branke, C. Müller-Schloer, and H. Schmeck, "Organic control of traffic lights," in International Conference on Autonomic and Trusted Computing. Springer, 2008, pp. 219–233.
[9] L. Singh, S. Tripathi, and H. Arora, "Time optimization for traffic signal control using genetic algorithm," International Journal of Recent Trends in Engineering, vol. 2, no. 2, p. 4, 2009.
[10] E. Ricalde and W. Banzhaf, "Evolving adaptive traffic signal controllers for a real scenario using genetic programming with an epigenetic mechanism," in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 897–902.
[11] X. Li and J.-Q. Sun, "Signal multiobjective optimization for urban traffic network," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3529–3537, 2018.
[12] T. Wongpiromsarn, T. Uthaicharoenpong, Y. Wang, E. Frazzoli, and D. Wang, "Distributed traffic signal control for maximum network throughput," in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 588–595.
[13] P. Varaiya, "The max-pressure controller for arbitrary networks of signalized intersections," in Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 2013, pp. 27–66.
[14] J. Gregoire, X. Qian, E. Frazzoli, A. De La Fortelle, and T. Wongpiromsarn, "Capacity-aware backpressure traffic signal control," IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, 2015.
[15] S. Darmoul, S. Elkosantini, A. Louati, and L. B. Said, "Multi-agent immune networks to control interrupted flow at signalized intersections," Transportation Research Part C: Emerging Technologies, vol. 82, pp. 290–313, 2017.
[16] A. Louati, S. Darmoul, S. Elkosantini, and L. ben Said, "An artificial immune network to control interrupted flow at a signalized intersection," Information Sciences, vol. 433, pp. 70–95, 2018.
[17] C. Gershenson, "Self-organizing traffic lights," arXiv preprint nlin/0411066, 2004.
[18] S.-B. Cools, C. Gershenson, and B. D'Hooghe, "Self-organizing traffic lights: A realistic simulation," in Advances in Applied Self-Organizing Systems. Springer, 2013, pp. 45–55.
[19] S. Goel, S. F. Bush, and C. Gershenson, "Self-organization in traffic lights: Evolution of signal control with advances in sensors and communications," arXiv preprint, 2017.
[20] N. Gartner, "A demand-responsive strategy for traffic signal control," Transportation Research Record, vol. 906, pp. 75–81, 1983.
[21] P. Lowrie, "SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic," 1990.
[22] P. Mirchandani and L. Head, "A real-time traffic signal control system: architecture, algorithms, and analysis," Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[23] F. Luyanda, D. Gettman, L. Head, S. Shelby, D. Bullock, and P. Mirchandani, "ACS-Lite algorithmic architecture: applying adaptive control system technology to closed-loop traffic signal control systems," Transportation Research Record: Journal of the Transportation Research Board, no. 1856, pp. 175–184, 2003.
[24] T. L. Thorpe and C. W. Anderson, "Traffic light control using SARSA with three state representations," Citeseer, Tech. Rep., 1996.
[25] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[26] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[27] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2011.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[29] T. Rijken, "DeepLight: Deep reinforcement learning for signalised traffic control," Master's thesis, University College London, 2015.
[30] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Master's thesis, University of Amsterdam, 2016.
[31] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[32] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint, 2016, https://arxiv.org/abs/1611.01142.
[33] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[34] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, "Traffic light control using deep policy-gradient and value-function based reinforcement learning," arXiv preprint arXiv:1704.08883, 2017.
[35] X. Liang, X. Du, G. Wang, and Z. Han, "Deep reinforcement learning for traffic light control in vehicular networks," arXiv preprint arXiv:1803.11115, 2018.
[36] ——, "A deep reinforcement learning network for traffic light cycle control," IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
[37] S. Wang, X. Xie, K. Huang, J. Zeng, and Z. Cai, "Deep reinforcement learning-based traffic signal control using high-resolution event-based data," Entropy, vol. 21, no. 8, p. 744, 2019.
[38] W. Genders and S. Razavi, "Asynchronous n-step Q-learning adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 23, no. 4, pp. 319–331, 2019.
[39] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, 2019.
[40] A. Stevanovic, Adaptive Traffic Control Systems: Domestic and Foreign State of Practice, 2010, no. Project 20-5 (Topic 40-03).
[41] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Design of reinforcement learning parameters for seamless application of adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227–245, 2014.
[42] S. Araghi, A. Khosravi, and D. Creighton, "A review on computational intelligence methods for controlling traffic signal timing," Expert Systems with Applications, vol. 42, no. 3, pp. 1538–1550, 2015.
[43] P. Mannion, J. Duggan, and E. Howley, "An experimental review of reinforcement learning algorithms for adaptive traffic signal control," in Autonomic Road Transport Support Systems. Springer, 2016, pp. 47–66.
[44] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A survey on reinforcement learning models and algorithms for traffic signal control," ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[45] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis, "Multiagent reinforcement learning for urban traffic control using coordination graphs," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 656–671.
[46] J. C. Medina and R. F. Benekohal, "Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy," in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 596–601.
[47] M. Abdoos, N. Mozayani, and A. L. Bazzan, "Holonic multi-agent system for traffic signals control," Engineering Applications of Artificial Intelligence, vol. 26, no. 5, pp. 1575–1587, 2013.
[48] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
[49] T. Chu, S. Qu, and J. Wang, "Large-scale traffic grid signal control with regional reinforcement learning," in American Control Conference (ACC), 2016. IEEE, 2016, pp. 815–820.
[50] N. Casas, "Deep deterministic policy gradient for urban traffic light control," arXiv preprint, 2017.
[51] W. Liu, G. Qin, Y. He, and F. Jiang, "Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering," IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.
[52] W. Genders, "Deep reinforcement learning adaptive traffic signal control," Ph.D. dissertation, McMaster University, 2018.
[53] F. Webster, "Traffic signal settings, Road Research Technical Paper No. 39," Road Research Laboratory, 1958.
[54] S. Linnainmaa, "Taylor expansion of the accumulated rounding error," BIT Numerical Mathematics, vol. 16, no. 2, pp. 146–160, 1976.
[55] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[56] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[57] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, "Distributed prioritized experience replay," arXiv preprint, 2018.
[58] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," arXiv preprint arXiv:1802.01561, 2018.
[59] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 449–458.
[60] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[61] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, "Implicit quantile networks for distributional reinforcement learning," arXiv preprint arXiv:1806.06923, 2018.
[62] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[63] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[64] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran, "Learning to repeat: Fine grained action repetition for deep reinforcement learning," arXiv preprint arXiv:1702.06054, 2017.
[65] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[66] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[67] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[68] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
[69] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint, 2015.
[70] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.
[71] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[72] E. Jones, T. Oliphant, P. Peterson et al., "SciPy: Open source scientific tools for Python," 2001, http://www.scipy.org/.
[73] P. Tabor, 2019. [Online]. Available: https://github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/pendulum/tensorflow/ddpg_orig_tf.py
[74] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[75] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

Wade Genders earned a Software B.Eng. & Society in 2013, Civil M.A.Sc. in 2014 and Civil Ph.D. in 2018 from McMaster University. His research interests include traffic signal control, intelligent transportation systems, machine learning and artificial intelligence.
Saiedeh Razavi is the inaugural Chair in Heavy Construction, Director of the McMaster Institute for Transportation and Logistics and Associate Professor at the Department of Civil Engineering at McMaster University. Dr. Razavi has a multidisciplinary background and considerable experience in collaborating and leading national and international multidisciplinary team-based projects in sensing and data acquisition, sensor technologies, data analytics, data fusion and their applications in safety, productivity, and mobility of transportation, construction, and other systems. She combines several years of industrial experience with academic teaching and research. Her formal education includes degrees in Computer Engineering (B.Sc.), Artificial Intelligence (M.Sc.) and Civil Engineering (Ph.D.). Her research, funded by Canada's research council (NSERC) as well as the Ministry of Transportation of Ontario, focuses on connected and automated vehicles, on smart and connected work zones and on computational models for improving safety and productivity of highway construction. Dr. Razavi brings together the private and public sectors with academia for the development of high quality research in smarter mobility, construction and logistics. She has received several awards including the McMaster Students Union Merit Award for Teaching, the Faculty of Engineering Team Excellence Award, and the Construction Industry Institute best poster award.
