An Open-Source Framework for Adaptive Traffic Signal Control


Authors: Wade Genders, Saiedeh Razavi

JOURNAL OF TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. X, NO. X, AUGUST 2019

Abstract—Sub-optimal control policies in transportation systems negatively impact mobility, the environment and human health. Developing optimal transportation control systems at the appropriate scale can be difficult as cities' transportation systems can be large, complex and stochastic. Intersection traffic signal controllers are an important element of modern transportation infrastructure where sub-optimal control policies can incur high costs to many users. Many adaptive traffic signal controllers have been proposed by the community, but research is lacking regarding their relative performance; which adaptive traffic signal controller is best remains an open question. This research contributes a framework for developing and evaluating different adaptive traffic signal controller models, both learning and non-learning, in simulation, and demonstrates its capabilities. The framework is used first to investigate the performance variance of the modelled adaptive traffic signal controllers with respect to their hyperparameters and second to analyze the performance differences between controllers with optimal hyperparameters. The proposed framework contains implementations of some of the most popular adaptive traffic signal controllers from the literature: Webster's, Max-pressure and Self-Organizing Traffic Lights, along with deep Q-network and deep deterministic policy gradient reinforcement learning controllers. This framework will aid researchers by accelerating their work from a common starting point, allowing them to generate results faster with less effort. All framework source code is available at https://github.com/docwza/sumolights.
Index Terms—traffic signal control, adaptive traffic signal control, intelligent transportation systems, reinforcement learning, neural networks.

I. INTRODUCTION

Cities rely on road infrastructure for transporting individuals, goods and services. Sub-optimal control policies incur environmental, human mobility and health costs. Studies observe vehicles consume a significant amount of fuel accelerating, decelerating or idling at intersections [1]. Land transportation emissions are estimated to be responsible for one third of all mortality from fine particulate matter pollution in North America [2]. Globally, over three million deaths are attributed to air pollution per year [3]. In 2017, residents of three of the United States' biggest cities, Los Angeles, New York and San Francisco, spent between three and four days on average delayed in congestion over the year, respectively costing 19, 33 and 10 billion USD from fuel and individual time waste [4]. It is paramount to ensure transportation systems are optimal to minimize these costs.

W. Genders was a Ph.D. student with the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: genderwt@mcmaster.ca). S. Razavi is an Associate Professor, Chair in Heavy Construction and Director of the McMaster Institute for Transportation & Logistics at the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: razavi@mcmaster.ca).

20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Manuscript received August X, 2019; revised August X, 2019.
Automated control systems are used in many aspects of transportation systems. Intelligent transportation systems seek to develop optimal solutions in transportation using intelligence. Intersection traffic signal controllers are an important element of many cities' transportation infrastructure where sub-optimal solutions can contribute high costs. Traditionally, traffic signal controllers have functioned using primitive logic which can be improved. Adaptive traffic signal controllers can improve upon traditional traffic signal controllers by conditioning their control on current traffic conditions.

Traffic microsimulators such as SUMO [5], Paramics, VISSIM and AIMSUN have become popular tools for developing and testing adaptive traffic signal controllers before field deployment. However, researchers interested in studying adaptive traffic signal controllers are often burdened with developing their own adaptive traffic signal control implementations de novo. This research contributes an adaptive traffic signal control framework, including Webster's, Max-pressure, Self-organizing traffic lights (SOTL), deep Q-network (DQN) and deep deterministic policy gradient (DDPG) implementations for the freely available SUMO traffic microsimulator to aid researchers in their work. The framework's capabilities are demonstrated by studying the effect of optimizing traffic signal controller hyperparameters and comparing optimized adaptive traffic signal controllers' relative performance.

II. BACKGROUND

A. Traffic Signal Control

An intersection is composed of traffic movements, or ways that a vehicle can traverse the intersection beginning from an incoming lane to an outgoing lane. Traffic signal controllers use phases, combinations of coloured lights that indicate when specific movements are allowed, to control vehicles at the intersection.
Fundamentally, a traffic signal control policy can be decoupled into two sequential decisions at any given time: what should the next phase be, and for how long? A variety of models have been proposed as policies. The simplest and most popular traffic signal controller determines the next phase by displaying the phases in an ordered sequence known as a cycle, where each phase in the cycle has a fixed, potentially unique, duration; this is known as a fixed-time, cycle-based traffic signal controller. Although simple, fixed-time, cycle-based traffic signal controllers are ubiquitous in transportation networks because they are predictable, stable and effective, as traffic demands exhibit reliable patterns over regular periods (i.e., times of the day, days of the week). However, as ubiquitous as the fixed-time controller is, researchers have long sought to develop improved traffic signal controllers which can adapt to changing traffic conditions.

Actuated traffic signal controllers use sensors and boolean logic to create dynamic phase durations. Adaptive traffic signal controllers are capable of acyclic phase sequences and dynamic phase durations to adapt to changing intersection traffic conditions. Adaptive controllers attempt to achieve higher performance at the expense of complexity, cost and reliability. Various techniques have been proposed as the foundation for adaptive traffic signal controllers, from analytic mathematical solutions to heuristics and machine learning.

B. Literature Review

Developing an adaptive traffic signal controller ultimately requires some type of optimization technique.
For decades researchers have proposed adaptive traffic signal controllers based on a variety of techniques such as evolutionary algorithms [6], [7], [8], [9], [10], [11] and heuristics such as pressure [12], [13], [14], immunity [15], [16] and self-organization [17], [18], [19]. Additionally, many comprehensive adaptive traffic signal control systems have been proposed such as OPAC [20], SCATS [21], RHODES [22] and ACS-Lite [23]. Reinforcement learning has been demonstrated to be an effective method for developing adaptive traffic signal controllers in simulation [6], [24], [25], [26], [27], [28]. Recently, deep reinforcement learning has been used for adaptive traffic signal control with varying degrees of success [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. A comprehensive review of reinforcement learning adaptive traffic signal controllers is presented in Table I.

Readers interested in additional adaptive traffic signal control research can consult extensive review articles [40], [41], [42], [43], [44]. Although ample research exists proposing novel adaptive traffic signal controllers, it can be arduous to compare between previously proposed ideas. Developing adaptive traffic signal controllers can be challenging as many of them require defining many hyperparameters. The authors seek to address these problems by contributing an adaptive traffic signal control framework to aid researchers in their work.

C. Contribution

The authors' work contributes in the following areas:

• Diverse Adaptive Traffic Signal Controller Implementations: The proposed framework contributes adaptive traffic signal controllers based on a variety of paradigms, the broadest being non-learning (e.g., Webster's, SOTL, Max-pressure) and learning (e.g., DQN and DDPG).
The diversity of adaptive traffic signal controllers allows researchers to experiment at their leisure without investing time developing their own implementations.

• Scalable, Optimized: The proposed framework is optimized for use with parallel computation techniques leveraging modern multicore computer architecture. This feature significantly reduces the compute time of learning-based adaptive traffic signal controllers and the generation of results for all controllers. By making the framework computationally efficient, the search for optimal hyperparameters is tractable with modest hardware (e.g., an 8-core CPU). The framework was designed to scale to develop adaptive controllers for any SUMO network. All source code used in this manuscript can be retrieved from https://github.com/docwza/sumolights.

III. TRAFFIC SIGNAL CONTROLLERS

Before describing each traffic signal controller in detail, elements common to all are detailed. All of the included traffic signal controllers share the following: a set of intersection lanes L, decomposed into incoming lanes L_inc and outgoing lanes L_out, and a set of green phases P. The set of incoming lanes with green movements in phase p ∈ P is denoted as L_{p,inc} and their outgoing lanes as L_{p,out}.

A. Non-Learning Traffic Signal Controllers

1) Uniform: A simple cycle-based, uniform phase duration traffic signal controller is included for use as a baseline comparison to the other controllers. The uniform controller's only hyperparameter is the green duration u, which defines the same duration for all green phases; the next phase is determined by a cycle.

2) Webster's: Webster's method develops a cycle-based, fixed phase length traffic signal controller using phase flow data [53].
The authors propose an adaptive Webster's traffic signal controller by collecting data for a time interval W in duration and then using Webster's method to calculate the cycle and green phase durations for the next W time interval. This adaptive Webster's essentially uses the most recent W interval to collect data and assumes the traffic demand will be approximately the same during the next W interval. The selection of W is important and exhibits various trade-offs: smaller values allow for more frequent adaptations to changing traffic demands at the risk of instability, while larger values adapt less frequently but allow for increased stability. Pseudo-code for the Webster's traffic signal controller is presented in Algorithm 1.

Algorithm 1 Webster's Algorithm
1: procedure WEBSTER(c_min, c_max, s, F, R)
2:   # compute critical lanes for each phase
3:   Y = { max({ F_l / s for l in L_{p,inc} }) for p in P }
4:   # compute cycle length
5:   C = (1.5 * R + 5) / (1.0 − Σ Y)
6:   if C < c_min then
7:     C = c_min
8:   else if C > c_max then
9:     C = c_max
10:  end if
11:  G = C − R
12:  # allocate green time proportional to flow
13:  return C, { G * y / Σ Y for y in Y }
14: end procedure

In Algorithm 1, F represents the set of phase flows collected over the most recent W interval and R represents the total cycle lost time. In addition to the time interval hyperparameter W, the adaptive Webster's algorithm also has hyperparameters defining a minimum cycle duration c_min, maximum cycle duration c_max and lane saturation flow rate s.

3) Max-pressure: The Max-pressure algorithm develops an acyclic, dynamic phase length traffic signal controller. The Max-pressure algorithm models vehicles in lanes as a substance in a pipe and enacts control in a manner which attempts to maximize the relief of pressure between incoming and outgoing lanes [13]. For a given green phase p, the pressure is defined in (1):

Pressure(p) = Σ_{l ∈ L_{p,inc}} |V_l| − Σ_{l ∈ L_{p,out}} |V_l|   (1)

where L_{p,inc} represents the set of incoming lanes with green movements in phase p and L_{p,out} represents the set of outgoing lanes from all incoming lanes in L_{p,inc}. Pseudo-code for the Max-pressure traffic signal controller is presented in Algorithm 2.

Algorithm 2 Max-pressure Algorithm
1: procedure MAXPRESSURE(g_min, t_p, P)
2:   if t_p < g_min then
3:     t_p = t_p + 1
4:   else
5:     t_p = 0
6:     # next phase has largest pressure
7:     return argmax({ Pressure(p) for p in P })
8:   end if
9: end procedure

In Algorithm 2, t_p represents the time spent in the current phase. The Max-pressure algorithm requires a minimum green time hyperparameter g_min which ensures a newly enacted phase has a minimum duration.

TABLE I
ADAPTIVE TRAFFIC SIGNAL CONTROL RELATED WORK

Research | Network           | Intersections | Multi-agent    | RL           | Function Approximation
[45]     | Grid              | 15            | Max-plus       | Model-based  | N/A
[27]     | Grid, Corridor    | < 10          | None           | Q-learning   | Linear
[46]     | Springfield, USA  | 20            | Max-plus       | Q-learning   | N/A
[28]     | Toronto, Canada   | 59            | Game Theory    | Q-learning   | Tabular
[47]     | N/A               | 50            | Holonic        | Q-learning   | N/A
[48]     | Grid              | 22            | Reward Sharing | Q-learning   | Bayesian
[49]     | Grid              | 100           | Regional       | Q-learning   | Linear
[50]     | Barcelona, Spain  | 43            | Centralized    | DDPG         | DNN¹
[33]     | Tehran, Iran      | 50            | None           | Actor-Critic | RBF², Tile Coding
[51]     | Changsha, China   | 96            | Reward sharing | Q-learning   | Linear
[52]     | Luxembourg City   | 195           | None           | DDPG         | DNN¹

¹ Deep Neural Network (DNN). ² Radial Basis Function (RBF).
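To make the shape of Algorithms 1 and 2 concrete, the following is a minimal Python sketch of both. It assumes lane data arrives as plain dicts (phase-to-lane-flows for Webster's, lane vehicle counts and a phase-to-lanes mapping for Max-pressure); the function names and data layout are illustrative, not the framework's actual API.

```python
def webster_plan(phase_flows, c_min, c_max, sat_flow, lost_time):
    """Algorithm 1: cycle length and green splits from phase flow data.

    phase_flows maps each green phase to the flows (veh/h) of its
    incoming lanes; sat_flow is the lane saturation flow rate s and
    lost_time is the total cycle lost time R.
    """
    # critical flow ratio per phase (the highest-flow lane governs)
    Y = {p: max(f / sat_flow for f in lanes) for p, lanes in phase_flows.items()}
    C = (1.5 * lost_time + 5) / (1.0 - sum(Y.values()))
    C = min(max(C, c_min), c_max)  # clamp cycle to [c_min, c_max]
    G = C - lost_time              # total green time in the cycle
    # allocate green time proportional to each phase's critical flow ratio
    return C, {p: G * y / sum(Y.values()) for p, y in Y.items()}


def max_pressure_phase(phases, veh_counts, phase_lanes):
    """Algorithm 2 core: select the green phase with the largest pressure,
    i.e., incoming minus outgoing vehicle counts (Eq. 1)."""
    def pressure(p):
        inc, out = phase_lanes[p]  # (incoming lanes, their outgoing lanes)
        return sum(veh_counts[l] for l in inc) - sum(veh_counts[l] for l in out)
    return max(phases, key=pressure)
```

In a SUMO deployment the flows and counts would come from simulation queries each control step; here they are passed in directly so the timing arithmetic is easy to verify by hand.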
4) Self-Organizing Traffic Lights: Self-organizing traffic lights (SOTL) [17], [18], [19] develop a cycle-based, dynamic phase length traffic signal controller based on self-organizing principles, where a "...self-organizing system would be one in which elements are designed to dynamically and autonomously solve a problem or perform a function at the system level." [18, p. 2]. Pseudo-code for the SOTL traffic signal controller is presented in Algorithm 3.

Algorithm 3 SOTL Algorithm
1: procedure SOTL(t_p, g_min, θ, ω, µ)
2:   # accumulate red phase vehicle-time integral
3:   κ = κ + Σ_{l ∈ L_inc − L_{p,inc}} |V_l|
4:   if t_p > g_min then
5:     # vehicles approaching in current green phase
6:     # within ω distance of the stop line
7:     n = Σ_{l ∈ L_{p,inc}} |V_l|
8:     # only consider a phase change if there is no platoon
9:     # or the platoon is too large (n > µ)
10:    if n > µ or n == 0 then
11:      if κ > θ then
12:        κ = 0
13:        # next phase in cycle
14:        i = i + 1
15:        return P_{i mod |P|}
16:      end if
17:    end if
18:  end if
19: end procedure

The SOTL algorithm functions by changing lights according to a vehicle-time integral threshold θ constrained by a minimum green phase duration g_min. Additionally, small (i.e., n < µ) vehicle platoons are kept together by preventing a phase change if they are sufficiently close (i.e., at a distance < ω) to the stop line.

B. Learning Traffic Signal Controllers

Reinforcement learning uses the framework of Markov Decision Processes to solve goal-oriented, sequential decision-making problems by repeatedly acting in an environment. At discrete points in time t, a reinforcement learning agent observes the environment state s_t and then uses a policy π to determine an action a_t. After implementing its selected action, the agent receives feedback from the environment in the form of a reward r_t and observes a new environment state s_{t+1}.
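A single SOTL decision step (Algorithm 3) can be sketched as a small stateful function. This is an illustrative reading of the pseudo-code, not the framework's implementation: the controller state is a plain dict, and approach_count is assumed to already count only vehicles within ω of the stop line (that filtering would happen upstream when querying the simulator).

```python
def sotl_step(ctrl, red_count, approach_count, g_min, theta, mu, n_phases):
    """One SOTL step. ctrl holds 'kappa' (red-phase vehicle-time integral),
    't_p' (time in current phase) and 'i' (cycle index). red_count is the
    number of vehicles on red approaches this second; approach_count is the
    number of vehicles within omega of the stop line on green approaches.
    Returns the next phase index, or None to keep the current phase."""
    ctrl["kappa"] += red_count  # integrate demand waiting on red
    ctrl["t_p"] += 1
    if ctrl["t_p"] > g_min:
        n = approach_count
        # change only if no platoon is crossing, or the platoon is too large
        if (n > mu or n == 0) and ctrl["kappa"] > theta:
            ctrl["kappa"] = 0
            ctrl["t_p"] = 0
            ctrl["i"] += 1
            return ctrl["i"] % n_phases
    return None
```

With θ = 10 and three vehicles waiting on red each second, the threshold is crossed on the fourth step and the cycle advances to the next phase.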
The reward quantifies how 'well' the agent is achieving its goal (e.g., score in a game, completed tasks). This process is repeated until a terminal state s_terminal is reached, and then begins anew. The return G_t = Σ_{k=0}^{T} γ^k r_{t+k} is the accumulation of rewards by the agent over some time horizon T, discounted by γ ∈ [0, 1). The agent seeks to maximize the expected return E[G_t] from each state s_t. The agent develops an optimal policy π* to maximize the return.

There are many techniques for an agent to learn the optimal policy; however, most of them rely on estimating value functions. Value functions are useful to estimate future rewards. State value functions V^π(s) = E[G_t | s_t = s] represent the expected return starting from state s and following policy π. Action value functions Q^π(s, a) = E[G_t | s_t = s, a_t = a] represent the expected return starting from state s, taking action a and following policy π. In practice, value functions are unknown and must be estimated using sampling and function approximation techniques. Parametric function approximation, such as neural networks, uses a set of parameters θ to estimate an unknown function f(x | θ) ≈ f(x). To develop accurate approximations, the function parameters must be developed with some optimization technique.

Experiences are tuples e_t = (s_t, a_t, r_t, s_{t+1}) that represent an interaction between the agent and the environment at time t. A reinforcement learning agent interacts with its environment in trajectories, or sequences of experiences e_t, e_{t+1}, e_{t+2}, .... Trajectories begin in an initial state s_init and end in a terminal state s_terminal. To accurately estimate value functions, experiences are used to optimize the parameters. If neural network function approximation is used, the parameters are optimized using experiences to perform gradient-based techniques and backpropagation [54], [55].
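The discounted return G_t = Σ_k γ^k r_{t+k} defined above is usually computed backwards over a reward sequence, which both avoids repeated exponentiation and makes the recursion G_t = r_t + γ G_{t+1} explicit; a short sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards in O(n)
    via the recursion G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three unit rewards with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.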
Additional technical details regarding the proposed reinforcement learning adaptive traffic signal controllers can be found in the Appendix.

To train reinforcement learning controllers for all intersections, a distributed acting, centralized learning architecture is developed [56], [57], [58]. Using parallel computing, multiple actors and learners are created, illustrated in Figure 1. Actors have their own instance of the traffic simulation and neural networks for all intersections. Learners are assigned a subset of all intersections; for each they have a neural network and an experience replay buffer D. Actors generate experiences e_t for all intersections and send them to the appropriate learner. Learners only receive experiences for their assigned subset of intersections. The learner stores the experiences in an experience replay buffer, which is uniformly sampled for batches to optimize the neural network parameters. After computing parameter updates, learners send new parameters to all actors. There are many benefits to this architecture, foremost being that it makes the problem feasible; because there are hundreds of agents, distributing computation across many actors and learners is necessary to decrease training time. Another benefit is experience diversity, granted by multiple environments and varied exploration rates.

C. DQN

The proposed DQN traffic signal controller enacts control by choosing the next green phase without utilizing a phase cycle. This acyclic architecture is motivated by the observation that enacting phases in a repeating sequence may contribute to a sub-optimal control policy. After the DQN has selected the next phase, it is enacted for a fixed duration known as an action repeat a_repeat. After the phase has been enacted for the action repeat duration, a new phase is selected acyclically.
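The experience replay buffer D at the core of each learner is conceptually simple: a bounded store of (s, a, r, s') tuples sampled uniformly for training batches. A minimal sketch, assuming a fixed capacity with oldest-first eviction (the class name and interface are illustrative, not the framework's):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: stores (s, a, r, s_next) tuples up to a
    fixed capacity, evicting the oldest experiences first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform sampling without replacement within the batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In the distributed architecture described above, actors would call add() (indirectly, via messages to their learner) while the learner calls sample() to build minibatches for gradient updates.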
1) State: The proposed state observation for the DQN is a combination of the most recent green phase and the density and queue of incoming lanes at the intersection at time t. Assume each intersection has a set L of incoming lanes and a set P of green phases. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The most recent phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.

2) Action: The proposed action space for the DQN traffic signal controller is the next green phase. The DQN selects one action from a discrete set, in this model one of the many possible green phases a_t ∈ P. After a green phase has been selected, it is enacted for a duration equal to the action repeat a_repeat.

3) Reward: The reward used to train the DQN traffic signal controller is a function of vehicle delay. Delay d is the difference between a vehicle's free-flow travel time and its actual travel time. Specifically, the reward is the negative sum of all vehicles' delay at the intersection, defined in (2):

r_t = − Σ_{v ∈ V} d_t^v   (2)

where V is the set of all vehicles on incoming lanes at the intersection, and d_t^v is the delay of vehicle v at time t. Defined in this way, the reward is a punishment, with the agent's goal to minimize the amount of punishment it receives. Each intersection saves the reward with the largest magnitude experienced to perform minimum reward normalization r_t / |r_min|, scaling the reward to the range [−1, 0] for stability.

4) Agent Architecture: The agent approximates the action-value Q function with a deep artificial neural network. The action-value function Q is two hidden layers of 3|s_t| fully connected neurons with exponential linear unit (ELU) activation functions, and the output layer is |P| neurons with linear activation functions.
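The state construction above (normalized per-lane densities and queues concatenated with a one-hot phase encoding) is easy to express directly. A sketch, assuming densities and queues arrive as per-lane lists and that both are normalized by the same jam density, as the text describes; the function name and argument layout are illustrative:

```python
def dqn_state(densities, queues, phase_idx, n_phases, jam_density):
    """Assemble the DQN observation: per-lane density and queue, each
    normalized to [0, 1] by the jam density, concatenated with a one-hot
    encoding of the most recent phase. Index n_phases (the extra slot)
    encodes the all-red clearance phase."""
    dens = [d / jam_density for d in densities]
    que = [q / jam_density for q in queues]
    one_hot = [0.0] * (n_phases + 1)
    one_hot[phase_idx] = 1.0
    return dens + que + one_hot
```

For |L| = 2 lanes and |P| = 2 green phases this yields a vector of length 2|L| + |P| + 1 = 7, matching the state space S ∈ (R^{2|L|} × B^{|P|+1}).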
The Q function's input is the local intersection state s_t. A visualization of the DQN is presented in Fig. 1.

Fig. 1. Adaptive traffic signal control DDPG and DQN neural network agents (left) and distributed acting, centralized learning architecture (right) composed of actors and learners. Each actor has one SUMO network as an environment and neural networks for all intersections. Each learner is assigned a subset of intersections at the beginning of training and is only responsible for computing parameter updates for their assigned intersections, effectively distributing the computation load for learning. However, learners distribute parameter updates to all actors.

D. DDPG Traffic Signal Controller

The proposed DDPG traffic signal controller implements a cycle with dynamic phase durations. This architecture is motivated by the observation that cycle-based policies can maintain fairness and ensure a minimum quality of service between all intersection users. Once the next green phase has been determined using the cycle, the policy π is used to select its duration. Explicitly, the reinforcement learning agent is learning how long in duration to make the next green phase in the cycle to maximize its return. Additionally, the cycle skips phases when no vehicles are present on incoming lanes.

1) Actor State: The proposed state observation for the actor is a combination of the current phase and the density and queue of incoming lanes at the intersection at time t. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The current phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.
2) Critic State: The proposed state observation for the critic combines the state s_t and the actor's action a_t, depicted in Figure 1.

3) Action: The proposed action space for the adaptive traffic signal controller is the duration of the next green phase in seconds. The action controls the duration of the next phase; there is no agency over what the next phase is, only over how long it will last. The DDPG algorithm produces a continuous output, a real number over some range a_t ∈ R. Since the DDPG algorithm outputs a real number and the phase duration is defined in intervals of seconds, the output is rounded to the nearest integer. In practice, phase durations are bounded by minimum time g_min and maximum time g_max hyperparameters to ensure a minimum quality of service for all users. Therefore the agent selects an action {a_t ∈ Z | g_min ≤ a_t ≤ g_max} as the next phase duration.

4) Reward: The reward used to train the DDPG traffic signal controller is the same delay reward used by the DQN traffic signal controller, defined in (2).

5) Agent Architecture: The agent approximates the policy π and action-value Q function with deep artificial neural networks. The policy function is two hidden layers of 3|s_t| fully connected neurons, each with batch normalization and ELU activation functions, and the output layer is one neuron with a hyperbolic tangent activation function. The action-value function Q is two hidden layers of 3(|s_t| + |a_t|) fully connected neurons with batch normalization and ELU activation functions, and the output layer is one neuron with a linear activation function. The policy's input is the intersection's local traffic state s_t and the action-value function's input is the local state concatenated with the local action s_t + a_t. The action-value Q function also uses an L2 weight regularization of λ = 0.01.
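The paper states the actor ends in a tanh output while the enacted action is an integer duration in [g_min, g_max]. One common way to bridge the two, which may or may not match the framework's exact mapping, is an affine rescaling of the tanh range [−1, 1] followed by rounding and clamping; a sketch:

```python
def scale_action(tanh_out, g_min, g_max):
    """Map the actor's tanh output in [-1, 1] to an integer green phase
    duration in [g_min, g_max] seconds (affine rescale, round, clamp).
    An illustrative mapping, not necessarily the framework's exact one."""
    duration = g_min + (tanh_out + 1.0) * 0.5 * (g_max - g_min)
    return int(round(min(max(duration, g_min), g_max)))
```

The clamp also guards against exploration noise pushing the raw output slightly outside [−1, 1], which DDPG's additive action noise can do.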
By deep reinforcement learning standards, the networks used are not that deep; however, their architecture is selected for simplicity and can easily be modified within the framework. Simple deep neural networks were also implemented to allow for future scalability, as the proposed framework can be deployed to any SUMO network; to reduce the computational load, the default networks are simple.

IV. EXPERIMENTS

A. Hyperparameter Optimization

To demonstrate the capabilities of the proposed framework, experiments are conducted on optimizing adaptive traffic signal control hyperparameters. The framework is for use with the SUMO traffic microsimulator [5], which was used to evaluate the developed adaptive traffic signal controllers. Understanding how sensitive any specific adaptive traffic signal controller's performance is to changes in hyperparameters is important to instill confidence that the solution is robust. Determining optimal hyperparameters is necessary to ensure a balanced comparison between adaptive traffic signal control methods.

Fig. 2. Two intersection SUMO network used for hyperparameter experiments. In addition to this two intersection network, a single, isolated intersection is also included with the framework.

Using the hyperparameter optimization script included in the framework, a grid search is performed with the implemented controllers' hyperparameters on a two intersection network, shown in Fig. 2, under a simulated three hour dynamic traffic demand scenario. The results for each traffic signal controller are displayed in Fig. 3 and collectively in Fig. 4. As can be observed in Fig. 3 and Fig. 4, the choice of hyperparameters significantly impacts the performance of the given traffic signal controller. As a general trend observed in Fig.
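The grid search described above has a simple generic shape: evaluate every combination of hyperparameter values and keep the configuration with the lowest score (here, mean travel time over several seeded runs). A sketch of that loop, independent of the framework's actual optimization script; the evaluate callback and its scoring are assumptions for illustration:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive hyperparameter grid search.

    grid maps each hyperparameter name to its candidate values;
    evaluate(config) returns a score to minimize, e.g. mean travel
    time over several simulations with random seeds.
    """
    names = sorted(grid)
    best_cfg, best_score = None, float("inf")
    for values in product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Note the combinatorial cost: the number of evaluations is the product of all value-list lengths, which is why the framework's parallel actors matter when each evaluation is a full SUMO simulation.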
3, methods with larger numbers of hyperparameters (e.g., SOTL, DDPG, DQN) exhibit greater performance variance than methods with fewer hyperparameters (e.g., Max-pressure). Directly comparing methods in Fig. 4 demonstrates the non-learning adaptive traffic signal control methods' (e.g., Max-pressure, Webster's) robustness to hyperparameter values and high performance (i.e., lowest travel time). Learning-based methods exhibit higher variance with changes in hyperparameters, DQN more so than DDPG. In the following section, the best hyperparameters for each adaptive traffic signal controller will be used to further investigate and compare performance.

B. Optimized Adaptive Traffic Signal Controllers

Using the optimized hyperparameters, all traffic signal controllers are subjected to an additional 32 simulations with random seeds to estimate their performance, quantified using network travel time and individual intersection queue and delay measures of effectiveness (MoE). Results are presented in Fig. 5 and Fig. 6. Observing the travel time boxplots in Fig. 5, the SOTL controller produces the worst results, exhibiting a mean travel time almost twice the next closest method and with many significant outliers. The Max-pressure algorithm achieves the best performance, with the lowest mean and median along with the lowest standard deviation. The DQN, DDPG, Uniform and Webster's controllers achieve approximately equal performance; however, the DQN controller has significant outliers, indicating some vehicles experience much longer travel times than most. Each intersection's queue and delay MoE with respect to each adaptive traffic signal controller is presented in Fig. 6. The results are consistent with previous observations from the hyperparameter search and travel time data; however, the reader's attention is directed to comparing the performance of DQN and DDPG in Fig. 6.
The DQN controller performs poorly (i.e., high queues and delay) at the beginning and end of the simulation when traffic demand is low. However, at the demand peak, the DQN controller performs just as well as, if not a little better than, every method except the Max-pressure controller. The DDPG controller's performance is the opposite of the DQN controller's: it achieves relatively low queues and delay at the beginning and end of the simulation and is then bested by the DQN controller in the middle of the simulation when the demand peaks. This performance difference can potentially be understood by considering the difference between the DQN and DDPG controllers. The DQN's ability to select the next phase acyclically under high traffic demand may allow it to reduce queues and delay more than the cycle-constrained DDPG controller. However, it is curious that the DQN controller's performance suffers under low demands, when it should be relatively simple to develop the optimal policy. The DQN controller may be overfitting to the periods in the environment when the magnitude of the rewards is large (i.e., in the middle of the simulation when the demand peaks) and converging to a policy that doesn't generalize well to the environment when the traffic demand is low. The authors present these findings to readers and suggest future research investigate this and other issues to understand the performance difference between reinforcement learning traffic signal controllers. Understanding the advantages and disadvantages of a variety of controllers can provide insight into developing future improvements.

V. CONCLUSION & FUTURE WORK

Learning and non-learning adaptive traffic signal controllers have been developed within an optimized framework for the traffic microsimulator SUMO for use by the research community.
The proposed framework's capabilities were demonstrated by studying adaptive traffic signal control algorithms' sensitivity to their hyperparameters: hyperparameter-rich (i.e., learning) controllers were found to be sensitive, while hyperparameter-sparse (i.e., heuristic) controllers were relatively insensitive. Poor hyperparameters can drastically alter the performance of an adaptive traffic signal controller, leading researchers to erroneous conclusions about a controller's performance. This research provides evidence that dozens or hundreds of hyperparameter configurations may have to be tested before selecting the optimal one. Using the optimized hyperparameters, each adaptive controller's performance was estimated and the Max-pressure controller was found to achieve the best performance, yielding the lowest travel times, queues and delay. This manuscript's research provides evidence that heuristics can offer powerful solutions even compared to complex deep-learning methods. This is not to suggest that this is definitively the case in all environments and circumstances. The authors hypothesize that learning-based controllers can be further developed to offer improved performance that may yet best the non-learning, heuristic-based methods detailed in this research. Promising extensions that have improved reinforcement learning in other applications and may do the same for adaptive traffic signal control include richer function approximators [59], [60], [61] and reinforcement learning algorithms [62], [63], [64]. The authors intend for the framework to grow, with the addition of more adaptive traffic signal controllers and features. In its current state, the framework can already aid adaptive traffic signal control researchers in rapidly experimenting on a SUMO network of their choice. Acknowledging the importance of optimizing our transportation systems, the authors hope this research helps others solve practical problems.

Fig. 3. Individual hyperparameter results for each traffic signal controller. Travel time is used as a measure of effectiveness and is estimated for each hyperparameter from eight simulations with random seeds in units of seconds (s). The coloured dots gradient from green (best) to red (worst) orders the hyperparameters by the sum of the travel time mean and standard deviation. Note differing scales between graph axes, making direct visual comparison biased.

Fig. 4. Comparison of all traffic signal controller hyperparameter travel time performance. Note both vertical and horizontal axis limits have been clipped at 200 to improve readability.
Fig. 5. Boxplots depicting the distribution of travel times (s) for each traffic signal controller. The solid white line represents the median, the solid coloured box the interquartile range (IQR) (i.e., from the first quartile (Q1) to the third quartile (Q3)), solid coloured lines Q1 - 1.5 IQR and Q3 + 1.5 IQR, and coloured crosses the outliers. The annotated (mean, standard deviation, median) travel times are: DDPG (72, 34, 65), DQN (78, 46, 66), Max-pressure (59, 21, 54), SOTL (158, 169, 85), Uniform (79, 37, 74) and Webster's (71, 30, 66).

Fig. 6. Comparison of traffic signal controller individual intersection (gneJ0, gneJ6) queue and delay measures of effectiveness in units of vehicles (veh) and seconds (s). Solid coloured lines represent the mean and shaded areas represent the 95% confidence interval. SOTL has been omitted to improve readability since its queue and delay values are exclusively outside the graph range.

APPENDIX A
DQN

Deep Q-Networks [65] combine Q-learning and deep neural networks to produce autonomous agents capable of solving complex tasks in high-dimensional environments. Q-learning [66] is a model-free, off-policy, value-based temporal difference [67] reinforcement learning algorithm which can be used to develop an optimal discrete action space policy for a given problem.
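As a concrete illustration of the Q-learning update that underlies DQN, a minimal tabular sketch follows. The states, actions and reward below are arbitrary placeholders, and the tabular form is only for illustration; DQN replaces the table with a neural network.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One temporal-difference Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)   # action-value table, initialised to zero
actions = [0, 1]
q_update(Q, "s0", 0, 1.0, "s1", actions)
# From a zero-initialised table, one update gives Q[("s0", 0)] = 0.1 * 1.0 = 0.1
```

Repeating this update over experienced transitions lets Q(s, a) bootstrap toward the expected return of acting optimally after taking a in s.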
Like other temporal difference algorithms, Q-learning uses bootstrapping [68] (i.e., using an estimate to improve future estimates) to develop an action-value function Q(s, a) which estimates the expected return of taking action a in state s and acting optimally thereafter. If the Q function can be estimated accurately, it can be used to derive the optimal policy π* = argmax_a Q(s, a). In DQN, the Q function is approximated with a deep neural network. The DQN algorithm utilizes two techniques to ensure stable development of the Q function with DNN function approximation: a target network and experience replay. Two parameter sets are used when training a DQN, online θ and target θ′. The target parameters θ′ are used to stabilize the return estimates when performing updates to the neural network and are periodically set to the online parameters (θ′ = θ) at a fixed interval. The experience replay is a buffer which stores the most recent D experience tuples to create a slowly changing dataset; batches of experiences are uniformly sampled from the replay to update the Q function online parameters θ. Training a deep neural network requires a loss function, which is used to determine how to change the parameters to achieve better approximations of the training data. Reinforcement learning develops value functions (e.g., neural networks) using experiences from the environment. The DQN's loss is the mean squared error between the return target y_t (an estimate of the return G_t) and the prediction, defined in (3):

y_t = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a | θ′) | θ′)
L_DQN(θ) = (y_t − Q(s_t, a_t | θ))²    (3)

APPENDIX B
DDPG

Deep deterministic policy gradients [69] are an extension of DQN to continuous action spaces. Similar to Q-learning, DDPG is a model-free, off-policy reinforcement learning algorithm.
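A minimal NumPy sketch of the DQN target and loss in (3) follows. This is an illustration only, not the framework's TensorFlow implementation; because both the argmax and the evaluation in (3) use the target parameters θ′, the target reduces to the maximum of the target network's Q-values for the next state.

```python
import numpy as np

def dqn_targets(rewards, next_q_target, gamma=0.99):
    """y_t = r_t + gamma * Q(s_{t+1}, argmax_a Q(s_{t+1}, a | theta') | theta').
    next_q_target holds the target network's Q-values for each next state,
    shape (batch, n_actions), so the target is a row-wise max."""
    return rewards + gamma * next_q_target.max(axis=1)

def dqn_loss(y, q_pred):
    """Mean squared error between targets y_t and online predictions Q(s_t, a_t | theta)."""
    return float(np.mean((y - q_pred) ** 2))

rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0],   # illustrative target-network Q-values
                   [1.0, 0.0]])
y = dqn_targets(rewards, next_q)  # [1 + 0.99*2.0, 0 + 0.99*1.0] = [2.98, 0.99]
```

Gradients of this loss with respect to θ are then used to update the online network, while θ′ is only refreshed periodically (θ′ = θ).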
The DDPG algorithm is an example of actor-critic learning, as it develops a policy function π(s | φ) (i.e., actor) using an action-value function Q(s, a | θ) (i.e., critic). The actor interacts with the environment and modifies its behaviour based on feedback from the critic. The DDPG critic's loss is the mean squared error between the return target y_t (an estimate of the return G_t) and the prediction, defined in (4):

y_t = r_t + γ Q(s_{t+1}, π(s_{t+1} | φ′) | θ′)
L_Critic(θ) = (y_t − Q(s_t, a_t | θ))²    (4)

The DDPG actor's loss is the sampled policy gradient, defined in (5):

L_Actor(φ) = ∇_φ Q(s_t, π(s_t | φ) | θ)    (5)

Like DQN, DDPG uses two sets of parameters, online θ and target θ′, and experience replay [70] to reduce instability during training. DDPG performs updates on the parameters of both the actor and the critic by uniformly sampling batches of experiences from the replay. The target parameters are slowly updated towards the online parameters according to θ′ = (1 − τ)θ′ + τθ after every batch update.

APPENDIX C
TECHNICAL

Software used includes SUMO 1.2.0 [5], TensorFlow 1.13 [71], SciPy [72] and public code [73]. The neural network parameters were initialized with He initialization [74] and optimized using Adam [75]. To ensure intersection safety, two second yellow change and three second all-red clearance phases were inserted between all green phase transitions. For the DQN and DDPG traffic signal controllers, if no vehicles are present at the intersection, the phase defaults to all-red, which is considered a terminal state s_terminal. Each intersection's state observation is bounded by 150 m (i.e., the queue and density are calculated from vehicles up to a maximum of 150 m from the intersection stop line).

ACKNOWLEDGMENTS

This research was enabled in part by support in the form of computing resources provided by SHARCNET (www.
sharcnet.ca), their McMaster University staff and Compute Canada (www.computecanada.ca).

REFERENCES

[1] L. Wu, Y. Ci, J. Chu, and H. Zhang, "The influence of intersections on fuel consumption in urban arterial road traffic: a single vehicle test in Harbin, China," PLoS ONE, vol. 10, no. 9, 2015.
[2] R. A. Silva, Z. Adelman, M. M. Fry, and J. J. West, "The impact of individual anthropogenic emissions sectors on the global burden of human mortality due to ambient air pollution," Environmental Health Perspectives, vol. 124, no. 11, p. 1776, 2016.
[3] World Health Organization et al., "Ambient air pollution: A global assessment of exposure and burden of disease," 2016.
[4] G. Cookson, "INRIX global traffic scorecard," INRIX, Tech. Rep., 2018.
[5] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO - Simulation of Urban MObility," International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[6] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence. IEEE, 1994, pp. 223–228.
[7] J. Lee, B. Abdulhai, A. Shalaby, and E.-H. Chung, "Real-time optimization for adaptive traffic signal control using genetic algorithms," Journal of Intelligent Transportation Systems, vol. 9, no. 3, pp. 111–122, 2005.
[8] H. Prothmann, F. Rochner, S. Tomforde, J. Branke, C. Müller-Schloer, and H. Schmeck, "Organic control of traffic lights," in International Conference on Autonomic and Trusted Computing. Springer, 2008, pp. 219–233.
[9] L. Singh, S. Tripathi, and H. Arora, "Time optimization for traffic signal control using genetic algorithm," International Journal of Recent Trends in Engineering, vol. 2, no. 2, p. 4, 2009.
[10] E. Ricalde and W. Banzhaf, "Evolving adaptive traffic signal controllers for a real scenario using genetic programming with an epigenetic mechanism," in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 897–902.
[11] X. Li and J.-Q. Sun, "Signal multiobjective optimization for urban traffic network," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3529–3537, 2018.
[12] T. Wongpiromsarn, T. Uthaicharoenpong, Y. Wang, E. Frazzoli, and D. Wang, "Distributed traffic signal control for maximum network throughput," in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 588–595.
[13] P. Varaiya, "The max-pressure controller for arbitrary networks of signalized intersections," in Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 2013, pp. 27–66.
[14] J. Gregoire, X. Qian, E. Frazzoli, A. De La Fortelle, and T. Wongpiromsarn, "Capacity-aware backpressure traffic signal control," IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, 2015.
[15] S. Darmoul, S. Elkosantini, A. Louati, and L. B. Said, "Multi-agent immune networks to control interrupted flow at signalized intersections," Transportation Research Part C: Emerging Technologies, vol. 82, pp. 290–313, 2017.
[16] A. Louati, S. Darmoul, S. Elkosantini, and L. ben Said, "An artificial immune network to control interrupted flow at a signalized intersection," Information Sciences, vol. 433, pp. 70–95, 2018.
[17] C. Gershenson, "Self-organizing traffic lights," arXiv preprint nlin/0411066, 2004.
[18] S.-B. Cools, C. Gershenson, and B. D'Hooghe, "Self-organizing traffic lights: A realistic simulation," in Advances in Applied Self-Organizing Systems. Springer, 2013, pp. 45–55.
[19] S. Goel, S. F. Bush, and C. Gershenson, "Self-organization in traffic lights: Evolution of signal control with advances in sensors and communications," arXiv preprint, 2017.
[20] N. Gartner, "A demand-responsive strategy for traffic signal control," Transportation Research Record, vol. 906, pp. 75–81, 1983.
[21] P. Lowrie, "SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic," 1990.
[22] P. Mirchandani and L. Head, "A real-time traffic signal control system: architecture, algorithms, and analysis," Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[23] F. Luyanda, D. Gettman, L. Head, S. Shelby, D. Bullock, and P. Mirchandani, "ACS-Lite algorithmic architecture: applying adaptive control system technology to closed-loop traffic signal control systems," Transportation Research Record: Journal of the Transportation Research Board, no. 1856, pp. 175–184, 2003.
[24] T. L. Thorpe and C. W. Anderson, "Traffic light control using SARSA with three state representations," Citeseer, Tech. Rep., 1996.
[25] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[26] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[27] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2011.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[29] T. Rijken, "DeepLight: Deep reinforcement learning for signalised traffic control," Master's thesis, University College London, 2015.
[30] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Master's thesis, University of Amsterdam, 2016.
[31] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[32] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," arXiv preprint, 2016, https://arxiv.org/abs/1611.01142.
[33] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[34] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, "Traffic light control using deep policy-gradient and value-function based reinforcement learning," arXiv preprint arXiv:1704.08883, 2017.
[35] X. Liang, X. Du, G. Wang, and Z. Han, "Deep reinforcement learning for traffic light control in vehicular networks," arXiv preprint arXiv:1803.11115, 2018.
[36] ——, "A deep reinforcement learning network for traffic light cycle control," IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
[37] S. Wang, X. Xie, K. Huang, J. Zeng, and Z. Cai, "Deep reinforcement learning-based traffic signal control using high-resolution event-based data," Entropy, vol. 21, no. 8, p. 744, 2019.
[38] W. Genders and S. Razavi, "Asynchronous n-step Q-learning adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 23, no. 4, pp. 319–331, 2019.
[39] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, 2019.
[40] A. Stevanovic, Adaptive Traffic Control Systems: Domestic and Foreign State of Practice, 2010, no. Project 20-5 (Topic 40-03).
[41] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Design of reinforcement learning parameters for seamless application of adaptive traffic signal control," Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227–245, 2014.
[42] S. Araghi, A. Khosravi, and D. Creighton, "A review on computational intelligence methods for controlling traffic signal timing," Expert Systems with Applications, vol. 42, no. 3, pp. 1538–1550, 2015.
[43] P. Mannion, J. Duggan, and E. Howley, "An experimental review of reinforcement learning algorithms for adaptive traffic signal control," in Autonomic Road Transport Support Systems. Springer, 2016, pp. 47–66.
[44] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A survey on reinforcement learning models and algorithms for traffic signal control," ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[45] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis, "Multiagent reinforcement learning for urban traffic control using coordination graphs," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 656–671.
[46] J. C. Medina and R. F. Benekohal, "Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy," in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 596–601.
[47] M. Abdoos, N. Mozayani, and A. L. Bazzan, "Holonic multi-agent system for traffic signals control," Engineering Applications of Artificial Intelligence, vol. 26, no. 5, pp. 1575–1587, 2013.
[48] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
[49] T. Chu, S. Qu, and J. Wang, "Large-scale traffic grid signal control with regional reinforcement learning," in American Control Conference (ACC), 2016. IEEE, 2016, pp. 815–820.
[50] N. Casas, "Deep deterministic policy gradient for urban traffic light control," arXiv preprint, 2017.
[51] W. Liu, G. Qin, Y. He, and F. Jiang, "Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks' dynamic clustering," IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.
[52] W. Genders, "Deep reinforcement learning adaptive traffic signal control," Ph.D. dissertation, McMaster University, 2018.
[53] F. Webster, "Traffic signal settings, Road Research Technical Paper No. 39," Road Research Laboratory, 1958.
[54] S. Linnainmaa, "Taylor expansion of the accumulated rounding error," BIT Numerical Mathematics, vol. 16, no. 2, pp. 146–160, 1976.
[55] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[56] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[57] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, "Distributed prioritized experience replay," arXiv preprint, 2018.
[58] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., "IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures," arXiv preprint arXiv:1802.01561, 2018.
[59] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 449–458.
[60] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[61] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, "Implicit quantile networks for distributional reinforcement learning," arXiv preprint arXiv:1806.06923, 2018.
[62] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[63] P.-L. Bacon, J. Harb, and D. Precup, "The option-critic architecture," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[64] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran, "Learning to repeat: Fine grained action repetition for deep reinforcement learning," arXiv preprint arXiv:1702.06054, 2017.
[65] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[66] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[67] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[68] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
[69] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint, 2015.
[70] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.
[71] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[72] E. Jones, T. Oliphant, P. Peterson et al., "SciPy: Open source scientific tools for Python," 2001, http://www.scipy.org/.
[73] P. Tabor, 2019. [Online]. Available: https://github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/pendulum/tensorflow/ddpg_orig_tf.py
[74] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[75] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

Wade Genders earned a Software B.Eng. & Society in 2013, Civil M.A.Sc. in 2014 and Civil Ph.D. in 2018 from McMaster University. His research interests include traffic signal control, intelligent transportation systems, machine learning and artificial intelligence.
Saiedeh Razavi is the inaugural Chair in Heavy Construction, Director of the McMaster Institute for Transportation and Logistics and Associate Professor at the Department of Civil Engineering at McMaster University. Dr. Razavi has a multidisciplinary background and considerable experience in collaborating and leading national and international multidisciplinary team-based projects in sensing and data acquisition, sensor technologies, data analytics, data fusion and their applications in safety, productivity, and mobility of transportation, construction, and other systems. She combines several years of industrial experience with academic teaching and research. Her formal education includes degrees in Computer Engineering (B.Sc.), Artificial Intelligence (M.Sc.) and Civil Engineering (Ph.D.). Her research, funded by Canada's research council (NSERC) as well as the Ministry of Transportation of Ontario, focuses on connected and automated vehicles, on smart and connected work zones and on computational models for improving safety and productivity of highway construction. Dr. Razavi brings together the private and public sectors with academia for the development of high quality research in smarter mobility, construction and logistics. She has received several awards including the McMaster Students Union Merit Award for Teaching, the Faculty of Engineering Team Excellence Award, and the Construction Industry Institute best poster award.
