Dynamic Pricing and Fleet Management for Electric Autonomous Mobility on Demand Systems
Authors: Berkay Turan, Ramtin Pedarsani, Mahnoosh Alizadeh
Abstract: The proliferation of ride-sharing systems is a major drive in the advancement of autonomous and electric vehicle technologies. This paper considers the joint routing, battery charging, and pricing problem faced by a profit-maximizing transportation service provider that operates a fleet of autonomous electric vehicles. We first establish the static planning problem by considering time-invariant system parameters and determine the optimal static policy. While the static policy provides stability of the customer queues waiting for rides even when the system dynamics are taken into account, we see that a static policy is inefficient to deploy, as it can lead to long wait times for customers and low profits. To accommodate the stochastic nature of trip demands, renewable energy availability, and electricity prices, and to optimally manage the autonomous fleet given the need to generate integer allocations, a real-time policy is required. The optimal real-time policy that executes actions based on full state information of the system is the solution of a complex dynamic program. However, we argue that solving for the optimal policy with exact dynamic programming methods is intractable, and we therefore apply deep reinforcement learning to develop a near-optimal control policy. The two case studies we conducted in Manhattan and San Francisco demonstrate the efficacy of our real-time policy in terms of network stability and profits, while keeping the queue lengths up to 200 times shorter than under the static policy.
Keywords — autonomous mobility-on-demand systems, optimization and optimal control, reinforcement learning

1 Introduction

The rapid evolution of enabling technologies for autonomous driving, coupled with advancements in eco-friendly electric vehicles (EVs), has facilitated state-of-the-art transportation options for urban mobility. Owing to these developments in automation, it is possible for an autonomous-mobility-on-demand (AMoD) fleet of autonomous EVs to serve society's transportation needs, and multiple companies are now heavily investing in AMoD technology [1]. (This work is supported by NSF Grant 1847096. B. Turan, R. Pedarsani, and M. Alizadeh are with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 USA; e-mail: {bturan,ramtin,alizadeh}@ucsb.edu.)

The introduction of autonomous vehicles for mobility-on-demand services provides an opportunity for better fleet management. Specifically, idle vehicles can be rebalanced throughout the network in order to prevent them from accumulating at certain locations and to serve induced demand at every location. Autonomous vehicles allow rebalancing to be performed centrally by a platform operator who observes the state of all the vehicles and the demand, rather than locally by individual drivers. Furthermore, EVs provide opportunities for cheap and environmentally friendly energy resources (e.g., solar energy). However, electricity supplies and prices differ across the network both geographically and temporally. This diversity can be exploited for cheaper energy options when the fleet is operated by a platform operator that is aware of the electricity prices throughout the whole network. Moreover, a dynamic pricing scheme for rides is essential to maximize the profits earned by serving customers.
Coupling an optimal fleet management policy with a dynamic pricing scheme allows revenues to be maximized while reducing the rebalancing cost and the waiting time of customers by adjusting the induced demand.

We consider a model that captures the opportunities and challenges of an AMoD fleet of EVs and that consists of complex state and action spaces. In particular, the platform operator has to consider the number of customers waiting to be served at each location (ride request queue lengths), the electricity prices, traffic conditions, and the states of the EVs (locations, battery energy levels) in order to make decisions. These decisions consist of prices for rides for every origin-destination (OD) pair and routing/charging decisions for every vehicle in the network. Upon taking an action, the state of the network undergoes a stochastic transition due to the randomness in customer behavior, electricity prices, and travel times.

We first adopt the common approach of network flow modeling to develop an optimal static pricing, routing, and charging policy that we use as a baseline in this paper. However, flow-based solutions generate fractional flows that cannot be implemented directly. Moreover, a static policy executes the same actions independent of the network state and is oblivious to the stochastic events that occur in the real setting. Hence, it is not optimal to utilize the static policy in a real dynamic environment. Therefore, a real-time policy that generates integer solutions and acknowledges the network state is required, and it can be determined by solving the underlying dynamic program. Due to the continuous and high-dimensional state-action spaces, however, it is infeasible to develop an optimal real-time policy using exact dynamic programming algorithms. As such, we utilize deep reinforcement learning (RL) to develop a near-optimal policy.
Specifically, we show that it is possible to learn a policy via Proximal Policy Optimization (PPO) [2] that increases the total profits generated by jointly managing the fleet of EVs (by making routing and charging decisions) and pricing the rides. We demonstrate the performance of our policy using the total profits generated and the queue lengths as metrics. Our contributions can be summarized as follows:

1. We formalize a vehicle and network model that captures the aforementioned characteristics of an AMoD fleet of EVs as well as the stochasticity in demand and electricity prices.

2. We analyze the static problem, where we consider a time-invariant environment (time-invariant arrivals, electricity prices, etc.), to characterize the family of policies that guarantee stability of the dynamic system, to gain insight towards the actual dynamic problem, and to further provide a baseline for comparison.

3. We employ deep RL methods to learn a joint pricing, routing, and charging policy that effectively stabilizes the queues and increases the profits.

Figure 1: The schematic diagram of our framework. Our deep RL agent processes the state of the vehicles, queues, and electricity prices and outputs a control policy for pricing as well as the autonomous EVs' routing and charging.

Figure 2: (a) The optimal static policy manages to stabilize the queues over a very long time period but is unable to clear them, whereas (b) the RL control policy stabilizes the queues and manages to keep them significantly low (note the scales).
We visualize our real-time framework as a schematic diagram in Figure 1 and preview our results in Figure 2, showing that a real-time pricing and routing policy can successfully keep the queue lengths 400 times lower than the static policy. This policy is also able to decrease the charging costs by 25% by utilizing smart charging strategies (which will be demonstrated in Section 5).

Related work: Comprehensive research on various aspects of AMoD systems has been conducted in the literature. Studies surrounding fleet management focus on optimal EV charging in order to reduce electricity costs, as well as optimal vehicle routing in order to serve the customers and to rebalance the empty vehicles throughout the network so as to reduce the operational costs and the customers' waiting times. Time-invariant control policies adopting queueing-theoretical [3], fluidic [4], network flow [5], and Markovian [6] models have been developed by using the steady state of the system. The authors of [7] consider ride-sharing systems with mixed autonomy. However, the control policies proposed in these papers are not adaptive to the time-varying nature of the future demand. As such, there is work on developing time-varying model predictive control (MPC) algorithms [8-12]. The authors of [10, 11] propose data-driven algorithms, and the authors of [12] propose a stochastic MPC algorithm focusing on vehicle rebalancing. In [8], the authors also consider a fleet of EVs and hence propose an MPC approach that optimizes vehicle routing and scheduling subject to energy constraints. Using a fluid-based optimization framework, the authors of [13] investigate trade-offs between fleet size, rebalancing cost, and queueing effects in terms of passenger and vehicle flows under time-varying demand. The authors of [14] develop a parametric controller that approximately solves the intractable dynamic program for rebalancing over an infinite horizon.
Similar to AMoD, carsharing systems also require rebalancing in order to operate efficiently. By adopting a Markovian model, the authors of [15] introduce a dynamic proactive rebalancing algorithm for carsharing systems that takes into account an estimate of the future demand based on historical data. In [16], the authors develop an integrated framework combining multi-objective mixed-integer linear programming optimization and discrete event simulation to optimize vehicle and personnel rebalancing in an electric carsharing system. Using a network-flow-based model, the authors of [17] propose a two-stage approximation scheme to establish a real-time rebalancing algorithm for shared mobility systems that accounts for stochasticity in customer demand and journey valuations. Aside from these, there are studies on applications of RL methods in transportation, such as adaptive routing [18], traffic management [19, 20], traffic signal control [21, 22], and dynamic routing of autonomous vehicles with the goal of reducing congestion in mixed-autonomy traffic networks [23]. Studies relevant to our work aim to develop dynamic policies for rebalancing as well as ride request assignment via decentralized reinforcement learning approaches [24-27]. In these works, however, the policies are developed and applied locally by each autonomous vehicle, and this decentralized approach may sacrifice system-level optimality. A centralized deep RL approach tackling the rebalancing problem is proposed in [28], which is closest to the approach we adopt in this paper. Although their study adopts a centralized deep RL approach similar to ours, they have a different system model, focus solely on the rebalancing problem, and consider neither pricing for rides as a control variable for the queues nor the charging problem of EVs, as reviewed next.
Regarding charging strategies for large populations of EVs, [29-31] provide in-depth reviews and studies of smart charging technologies. An agent-based model to simulate the operations of an AMoD fleet of EVs under various vehicle and infrastructure scenarios has been examined in [32]. By augmenting the classic dial-a-ride problem (DARP) with optimal battery management of autonomous electric vehicles, the authors of [33] introduce the electric autonomous DARP, which aims to minimize the total travel time of all the vehicles and riders. The authors of [34] propose an online charge scheduling algorithm for EVs providing AMoD services. By adopting a static network flow model, [35] investigates the benefits of smart charging and derives approximate closed-form expressions that highlight the trade-off between operational costs and charging costs. Furthermore, [36] studies interactions between AMoD systems and the power grid. In addition, [37] studies the implications of pricing schemes on an AMoD fleet of EVs. In [38], the authors propose a dynamic joint pricing and routing strategy for non-electric shared mobility-on-demand services. [39] studies a quadratic programming problem in order to jointly optimize vehicle dispatching, charge scheduling, and charging infrastructure, while the demand is defined exogenously. To the best of our knowledge, there is no existing work on centralized real-time management of electric AMoD systems that addresses the joint optimization of vehicle routing and charging as well as pricing for the rides.
In this paper, we aim to highlight the benefits of a real-time controller that jointly: (i) routes the vehicles throughout the network in order to serve the demand for rides as well as to relocate the empty vehicles for further use; (ii) executes smart charging strategies by exploiting the diversity in electricity prices (both geographic and temporal) in order to minimize charging costs; and (iii) adjusts the demand for rides by setting prices in order to stabilize the system (i.e., the queues of customers waiting for rides) while maximizing profits.

Paper Organization: The remainder of the paper is organized as follows. In Section 2, we present the system model and define the platform operator's optimization problem. In Section 3, we discuss the static planning problem associated with the system model and characterize the optimal static policy. In Section 4, we propose a method for developing a near-optimal real-time policy using deep reinforcement learning. In Section 5, we present the numerical results of the case studies we have conducted in Manhattan and San Francisco to demonstrate the performance of our real-time control policy. Finally, we conclude the paper in Section 6.

2 System Model and Problem Definition

Network and Demand Models: We consider a fleet of AMoD EVs operating within a transportation network characterized by a fully connected graph consisting of m nodes, M = {1, ..., m}, each of which can serve as a trip origin or destination. We study a discrete-time system with time periods normalized to integral units t ∈ {0, 1, 2, ...}. In this discrete-time system, we model the arrival of potential riders with OD pair (i, j) as a Poisson process with an arrival rate of λ_ij(t) in period t, where λ_ii(t) = 0. We adopt the price-responsive rider model studied in [40]. We assume that the riders are heterogeneous in terms of their willingness to pay.
In particular, if the price for receiving a ride from node i to node j in period t is set to ℓ_ij(t), the induced arrival rate for rides from i to j is given by Λ_ij(t) = λ_ij(t)(1 − F(ℓ_ij(t))), where F(·) is the cumulative distribution of riders' willingness to pay, with support [0, ℓ_max].¹ Thus, the number of new ride requests in time period t is A_ij(t) ∼ Pois(Λ_ij(t)) for OD pair (i, j).

Vehicle Model: To capture the effect of trip demand and the associated charging and routing decisions (routing also implies rebalancing of the empty vehicles) on the costs associated with operating the fleet (maintenance, mileage, etc.), we assume that each autonomous vehicle in the fleet has a per-period operational cost of β. Furthermore, as the vehicles are electric, they have to sustain charge in order to operate. Without loss of generality, we assume there is a charging station placed at each node i ∈ M. To charge at node i during time period t, the operator pays a price of electricity p_i(t) per unit of energy. We assume that all EVs in the fleet have a battery capacity denoted v_max ∈ Z₊; therefore, each EV has a discrete battery energy level v ∈ V, where V = {v ∈ N | 0 ≤ v ≤ v_max}. In our discrete-time model, we assume each vehicle takes one period to charge one unit of energy and τ_ij periods to travel between OD pair (i, j), while consuming v_ij units of energy.²

¹ For brevity of notation, we uniformly set ℓ_max to be the maximum willingness to pay for all OD pairs without loss of generality. Our results can be derived in a similar fashion by replacing ℓ_max with ℓ_max^{ij}, where ℓ_max^{ij} is the maximum willingness to pay for OD pair (i, j).

Ride-Hailing Model: The platform operator dynamically routes the fleet of EVs in order to serve the demand at each node. Customers that purchase a ride are not immediately matched with a ride, but enter the queue for OD pair (i, j).
After the platform operator executes routing decisions for the fleet, the customers in the queue for OD pair (i, j) are matched with rides and served in a first-come, first-served discipline. A measure of the expected wait time is not available to each arriving customer. However, the operator knows that longer wait times will negatively affect their business and hence seeks to minimize the total wait time experienced by users. Denote the queue length for OD pair (i, j) by q_ij(t). If, after serving the customers, the queue length q_ij(t) > 0, the platform operator is penalized by a fixed cost of w per person in the queue to account for the value of the customers' time.

Platform Operator's Problem: We consider a profit-maximizing AMoD operator that manages a fleet of EVs that make trips to provide transportation services to customers. The operator's goal is to maximize profits by 1) setting prices for rides and hence managing customer demand at each node; and 2) optimally operating the AMoD fleet (i.e., charging and routing) to minimize operational and charging costs. We will study two types of control policies that the platform operator may utilize: 1) a static policy, where the pricing, routing, and charging decisions are time-invariant and independent of the state of the system; and 2) a real-time policy, where the pricing, routing, and charging decisions depend on the system state.

3 Analysis of the Static Problem

In this section, we establish and discuss the static planning problem to provide a measure for comparison and to demonstrate the efficacy of the real-time policy (which will be discussed in Section 4). To do so, we consider the fluid scaling of the dynamic network and characterize the static problem via a network flow formulation.
Under this setting, we use the expected values of the variables (arrivals and prices of electricity) and ignore their time-dependent dynamics, while allowing the vehicle routing decisions to be flows (real numbers) rather than integers. The static problem is convenient for determining the optimal static pricing, routing, and charging policy, under which the queueing network of the dynamic system is stable [41].³

3.1 Static Profit Maximization Problem

We formulate the static optimization problem via a network flow model that aims to maximize the platform operator's profits. The platform operator maximizes its profits by setting prices and making routing and charging decisions such that the system remains stable.

² In this paper, we consider the travel times to be constant and exogenously defined for the time period the policy is developed for. This is because we assume that the number of AMoD vehicles is small compared to the rest of the traffic. Also, to account for changing traffic conditions throughout the day, it is possible to train multiple static and real-time control policies for the different time intervals.

³ The stability condition that we are interested in is rate stability of all queues. A queue for OD pair (i, j) is rate stable if lim_{t→∞} q_ij(t)/t = 0.

Let ℓ_ij be the price for rides for OD pair (i, j), x_ij^v be the number of vehicles at node i with energy level v being routed to node j, and x_ic^v be the number of vehicles charging at node i starting with energy level v.
We state the platform operator's profit maximization problem as follows:

$$\begin{aligned}
\max_{x^v_{ic},\,x^v_{ij},\,\ell_{ij}}\quad & \sum_{i\in\mathcal{M}}\sum_{j\in\mathcal{M}} \lambda_{ij}\,\ell_{ij}\,(1-F(\ell_{ij})) \;-\; \sum_{i\in\mathcal{M}}\sum_{v=0}^{v_{\max}-1}(\beta+p_i)\,x^v_{ic} \;-\; \beta\sum_{i\in\mathcal{M}}\sum_{j\in\mathcal{M}}\sum_{v=v_{ij}}^{v_{\max}} x^v_{ij}\,\tau_{ij} && \text{(1a)}\\
\text{subject to}\quad & \lambda_{ij}(1-F(\ell_{ij})) \le \sum_{v=v_{ij}}^{v_{\max}} x^v_{ij} \quad \forall i,j\in\mathcal{M}, && \text{(1b)}\\
& x^v_{ic} + \sum_{j\in\mathcal{M}} x^v_{ij} = x^{v-1}_{ic} + \sum_{j\in\mathcal{M}} x^{v+v_{ji}}_{ji} \quad \forall i\in\mathcal{M},\ \forall v\in\mathcal{V}, && \text{(1c)}\\
& x^{v_{\max}}_{ic} = 0 \quad \forall i\in\mathcal{M}, && \text{(1d)}\\
& x^v_{ij} = 0 \quad \forall v<v_{ij},\ \forall i,j\in\mathcal{M}, && \text{(1e)}\\
& x^v_{ic}\ge 0,\ x^v_{ij}\ge 0 \quad \forall i,j\in\mathcal{M},\ \forall v\in\mathcal{V}, && \text{(1f)}\\
& x^v_{ic} = x^v_{ij} = 0 \quad \forall v\notin\mathcal{V},\ \forall i,j\in\mathcal{M}. && \text{(1g)}
\end{aligned}$$

The first term in the objective function in (1) accounts for the aggregate revenue the platform generates by providing rides to λ_ij(1 − F(ℓ_ij)) riders at price ℓ_ij. The second term is the operational and charging costs incurred by the charging vehicles (assuming that p_i(t) = p_i ∀t under the static setting), and the last term is the operational costs of the trip-making vehicles (including rebalancing trips).

The constraint (1b) requires the platform to operate at least as many vehicles as needed to serve all the induced demand between any two nodes i and j (the rest are vehicles travelling without passengers, i.e., rebalancing vehicles). We will refer to this as the demand satisfaction constraint. The constraint (1c) is the flow balance constraint for each node and each battery energy level, which restricts the number of available vehicles at node i and energy level v to be the sum of arrivals from all nodes (including idle vehicles) and vehicles that finish charging from energy level v − 1. The constraint (1d) ensures that vehicles with full batteries do not charge further, and the constraint (1e) ensures that vehicles have enough charge to travel between OD pair (i, j). The solution to the optimization problem in (1) is the optimal static policy, consisting of optimal prices as well as optimal vehicle routing and charging decisions.
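To make the objective (1a) concrete, the sketch below evaluates it for a candidate fractional solution, assuming the uniform willingness-to-pay distribution adopted later in this section. The function names, data layout, and constants here are our own illustrative choices, not from the paper.

```python
# Sketch: evaluating the static objective (1a) for a candidate solution,
# assuming uniform willingness to pay F(l) = l / L_MAX. All names and
# parameter values below are illustrative.

L_MAX = 10.0   # maximum willingness to pay (support of F)
BETA = 0.1     # per-period operational cost beta

def induced_demand(lam, price):
    """Induced arrival rate lambda_ij * (1 - F(l_ij)) under uniform F."""
    return lam * (1.0 - price / L_MAX)

def static_profit(lam, prices, tau, x_trip, x_charge, p_elec):
    """Objective (1a): revenue minus charging and trip operational costs.

    lam[i][j]       -- nominal arrival rates lambda_ij
    prices[i][j]    -- ride prices l_ij
    tau[i][j]       -- trip durations tau_ij
    x_trip[i][j][v] -- flow of vehicles at node i with energy v routed to j
                       (entries with v < v_ij are zero, per constraint (1e))
    x_charge[i][v]  -- flow of vehicles charging at node i from level v
    p_elec[i]       -- static electricity price p_i at node i
    """
    m = len(lam)
    # Aggregate revenue: sum of (induced demand) * price over OD pairs.
    revenue = sum(induced_demand(lam[i][j], prices[i][j]) * prices[i][j]
                  for i in range(m) for j in range(m) if i != j)
    # Operational + energy cost of charging vehicles.
    charge_cost = sum((BETA + p_elec[i]) * x
                      for i in range(m) for x in x_charge[i])
    # Operational cost of trip-making (incl. rebalancing) vehicles.
    trip_cost = BETA * sum(x_trip[i][j][v] * tau[i][j]
                           for i in range(m) for j in range(m) if i != j
                           for v in range(len(x_trip[i][j])))
    return revenue - charge_cost - trip_cost
```

With all flows set to zero, the function returns pure revenue; each unit of charging flow at node i then reduces the profit by β + p_i, mirroring the second term of (1a).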
This policy cannot be implemented directly in a real environment because it does not yield integer-valued solutions. It is possible to generate integer-valued solutions for a real environment using the fractional flows (e.g., by randomizing the vehicle decisions according to the flows, which we do in Section 5), yet the methodology is not the focus of our work. Instead, we highlight a sufficient condition for a realizable policy (generating integer-valued actions) to provide stability according to the feasible solutions of (1):

Proposition 1. Let {ℓ̃_ij, x̃^v_ij, x̃^v_ic} be a feasible solution of (1). Let μ be a policy that generates integer actions and can be implemented in the real environment. Then, μ guarantees stability of the system if, for all OD pairs (i, j):

1. the time average of the induced arrivals equals λ_ij(1 − F(ℓ̃_ij)), and

2. the time average of the routed vehicles equals $\sum_{v=v_{ij}}^{v_{\max}} \tilde{x}^v_{ij}$.

The proof of Proposition 1 is provided in Appendix A. According to Proposition 1, for a static pricing policy with the optimal prices ℓ*_ij, there exists an integer-valued routing and charging policy that maintains stability of the system.

Corollary 1.1. An example policy that generates integer-valued actions is randomizing according to the flows. Precisely, given a feasible solution {ℓ̃_ij, x̃^v_ij, x̃^v_ic} of (1), integer-valued actions can be generated by routing a vehicle at node i with energy level v to node j with probability

$$\psi^v_{ij} = \frac{\tilde{x}^v_{ij}}{\sum_{k=1}^{m} \tilde{x}^v_{ik} + \tilde{x}^v_{ic}},$$

and charging it with probability

$$\psi^v_{ic} = \frac{\tilde{x}^v_{ic}}{\sum_{k=1}^{m} \tilde{x}^v_{ik} + \tilde{x}^v_{ic}},$$

for all i, j ∈ M and all v ∈ V. Combining this randomized policy with a static pricing policy ℓ_ij(t) = ℓ̃_ij, ∀t, results in a policy satisfying the criteria in Proposition 1.

The optimization problem in (1) is non-convex for a general F(·).
Nonetheless, when the platform's profits are concave in the induced demand λ_ij(1 − F(·)), the problem can be rewritten as a convex optimization problem and solved exactly. Hence, we assume that the riders' willingness to pay is uniformly distributed in [0, ℓ_max], i.e., F(ℓ_ij) = ℓ_ij/ℓ_max.⁴

Marginal Pricing: The prices for rides are a crucial component of the profits generated. The next proposition highlights how the optimal prices ℓ*_ij for rides are related to the network parameters, the prices of electricity, and the operational costs.

Proposition 2. Let ν*_ij be the optimal dual variable corresponding to the demand satisfaction constraint for OD pair (i, j). The optimal prices ℓ*_ij are:

$$\ell^*_{ij} = \frac{\ell_{\max} + \nu^*_{ij}}{2}. \qquad \text{(2)}$$

These prices can be upper bounded by:

$$\ell^*_{ij} \le \frac{\ell_{\max} + \beta(\tau_{ij} + \tau_{ji} + v_{ij} + v_{ji}) + v_{ij}\,p_j + v_{ji}\,p_i}{2}. \qquad \text{(3)}$$

⁴ It is also possible to use other distributions that might reflect real willingness-to-pay distributions more accurately (such as the Pareto, exponential, triangular, constant-elasticity, and normal distributions). Among these, the Pareto, exponential, and constant-elasticity distributions preserve convexity, and therefore the static planning problem can be solved efficiently. The triangular and normal distributions are not convex in their support, so the static planning problem is not a convex optimization problem; nevertheless, it can still be solved numerically for the optimal static policy. Using these distributions, however, we cannot derive the closed-form results that allow us to interpret the pricing policy of the platform operator. The real-time policy proposed in Section 4 uses model-free reinforcement learning and therefore can be applied with other distributions or any other customer price response model.
Moreover, with these optimal prices ℓ*_ij, the profit generated per period is:

$$P = \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{\lambda_{ij}}{\ell_{\max}} \left(\ell_{\max} - \ell^*_{ij}\right)^2. \qquad \text{(4)}$$

The proof of Proposition 2 is provided in Appendix B. Observe that the profits in Equation (4) decrease as the prices for rides increase. Thus, expensive rides generate less profit than cheaper rides, and it is more beneficial if the optimal dual variables ν*_ij are small and the prices are close to ℓ_max/2. We can interpret the dual variable ν*_ij as the cost to the platform of providing a single ride between i and j. In the worst-case scenario, every single requested ride from node i requires rebalancing and charging both at the origin and at the destination. Hence, the upper bound in (3) includes the operational costs of passenger-carrying, rebalancing, and charging vehicles (both at the origin and the destination), as well as the energy consumed by both passenger-carrying and rebalancing trips multiplied by the price of electricity at the trip destinations. Similar to taxes applied on products, whose burden is shared between the supplier and the customer, the costs associated with rides are shared between the platform operator and the riders (which is why the price paid by the riders includes half of the cost of the ride).

Although the static policy guarantees stability (given an appropriate implementation of integer-valued actions as dictated by Proposition 1), it does not perform well in a real dynamic setting because it does not acknowledge the stochastic dynamics of the system. On the other hand, a real-time policy that executes decisions based on the current state of the environment would likely perform better (e.g., if the queue length for OD pair (i, j) is very large, then it is probably better for the platform operator to set higher prices to prevent the queue from growing further). Accordingly, we present a practical way of implementing a real-time policy in the next section.
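The closed forms in (2) and (4) can be checked numerically for a single OD pair: interpreting ν*_ij as the marginal cost of providing one ride (as discussed above), the per-period profit λ_ij(1 − F(ℓ))(ℓ − ν*_ij) under uniform F is maximized at the price in (2), where it equals the corresponding summand of (4). The parameter values below are illustrative.

```python
# Numeric check of Proposition 2 (uniform willingness to pay, one OD pair).
# Illustrative values: l_max = support of F, lam = nominal arrival rate,
# nu = dual variable / marginal cost of a ride.
l_max, lam, nu = 10.0, 4.0, 3.0

def profit(l):
    """Per-period profit lam * (1 - F(l)) * (l - nu) with F(l) = l / l_max."""
    return lam * (1.0 - l / l_max) * (l - nu)

l_star = (l_max + nu) / 2.0                        # Equation (2)
closed_form = lam * (l_max - l_star) ** 2 / l_max  # summand of Equation (4)

# l* maximizes the per-OD profit (local check)...
assert profit(l_star) >= max(profit(l_star - 0.01), profit(l_star + 0.01))
# ...and the closed form matches the direct evaluation.
assert abs(profit(l_star) - closed_form) < 1e-9
```

The substitution ℓ* − ν = ℓ_max − ℓ* (which follows directly from (2)) is what turns the direct profit expression into the squared form of (4).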
4 The Real-Time Policy

The static policy established in the previous section has three major issues:

1. Because it is based on a flow model, it generates static fractional flows that are not directly implementable in the real setting.

2. It neglects the stochastic events that occur in the dynamic setting (e.g., the induced arrivals) and assumes everything is deterministic. Hence, it does not account for unexpected occurrences when executing actions (e.g., queues might build up in the dynamic setting, whereas the static model assumes no queues).

3. It assumes perfect knowledge of the network parameters (arrivals, trip durations, energy consumption of the trips, and prices of electricity).

For the above reasons, it is impractical to implement the static policy in the dynamic environment. A real-time policy is necessary, one that generates integer solutions and takes into account the current state of the network, which is essential for decision making; it can be determined by solving the dynamic program that describes the system (with full knowledge of the network parameters) for the optimal policy. Such solutions would address issues 1 and 2 outlined above. Inspired by our theoretical model, the state information that fully describes the network consists of the vehicle states (locations, energy levels), the queue lengths for each OD pair, and the electricity prices at each node. Upon obtaining the full state information, actions have to be executed for ride pricing and fleet management (vehicle routing and charging). Consequent to taking actions, the platform operator observes a reward (consisting of the revenue gained from arrivals, queue costs, and operational and charging costs), and the network transitions into a new state. (Although the transition into the new state is stochastic, the random processes that govern this stochastic transition are known if the network parameters are known.)
The solution of this dynamic program is the optimal policy that determines which action to take in each state the system is in, and it can nominally be derived using classical exact dynamic programming algorithms (e.g., value iteration). However, the complexity and the scale of our dynamic problem present a difficulty here: aside from having a high-dimensional state space (for instance, with m = 10, v_max = 5, and τ_ij = 3 ∀i, j, the state has dimension 1240) and action space, the cardinalities of these spaces are not finite (queues can grow unboundedly, prices are continuous). Considering that the computational complexity per iteration is O(|A||S|²) for value iteration and O(|A||S|² + |S|³) for policy iteration [42], where S and A are the state space and the action space, respectively, the problem is computationally intractable to solve using classical dynamic programming. Even if we made these spaces finite by capping the queue lengths and discretizing the prices, the curse of dimensionality would render the problem intractable for classical exact dynamic programming algorithms. As such, we resort to approximate dynamic programming methods. Specifically, we define the policy via a deep neural network that takes the full state information of the network as input and outputs the best action.⁵ Subsequently, we apply a model-free reinforcement learning algorithm to train the neural network in order to improve the performance of the policy. Since the algorithm is model-free, it does not require a model of the network (hence, it does not require knowledge of the network parameters), which resolves the third issue associated with the static policy. We adopt a practical policy gradient method called Proximal Policy Optimization (PPO), developed in [2], which is effective for optimizing large nonlinear policies such as neural networks. We chose PPO mainly because it supports continuous state-action spaces and guarantees monotonic improvement.⁶
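The mechanism at the heart of PPO, the clipped surrogate objective from [2], can be illustrated with a minimal scalar sketch; in practice it is computed over batches of trajectories with a neural-network policy and an automatic-differentiation library. The function name, ratio, and advantage values below are ours, for illustration only.

```python
# Sketch of PPO's clipped surrogate objective [2] for one (state, action)
# sample, in plain Python. Values are illustrative.

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L^CLIP = min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- estimated advantage of action a in state s
    eps       -- clipping parameter (0.2 is the value suggested in [2])
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is clipped, bounding the update:
assert abs(ppo_clip_objective(1.5, 2.0) - 2.4) < 1e-12
# A small ratio with negative advantage is bounded the same way:
assert abs(ppo_clip_objective(0.5, -1.0) + 0.8) < 1e-12
```

Taking the minimum of the unclipped and clipped terms removes the incentive to move the new policy far from the old one in a single update, which is what underlies PPO's stable improvement behavior.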
^5 In general, the policy is a stochastic policy and determines the probabilities of taking the actions rather than deterministically producing an action.
^6 Although the policy outputs a continuous set of actions, integer actions can be generated by randomizing. This is done during both training and testing; therefore the RL agent observes the integer state transitions and learns as if the policy outputs integer actions. We discuss how to generate integer actions in more detail in Section 4.1.

We note that it is possible to apply reinforcement learning to learn a policy in any environment, real or artificial, as long as data is available. In this work we use our theoretical model described in Section 2 to create the environment and generate data, mainly because there is no electric AMoD microsimulation environment available and also to verify our findings about the static policy. Developing a microsimulator for electric AMoD (like SUMO [43]) and integrating it with a deep reinforcement learning library to create a framework for real-traffic experiments remains future work. To ensure that our numerical experiments are reproducible, in the next subsection we describe the Markov Decision Process (MDP) that governs this dynamic environment, which is a direct extension of our static model. It is also possible to enrich the environment and the MDP to reflect real-life constraints more accurately, such as road capacity and charging station constraints. Since the approach we adopt to develop the real-time policy is model-free, it can be applied identically. In Section 5 we present numerical results on real-time policies developed through reinforcement learning based on dynamic environments generated through our theoretical model. The goal of the experiments is primarily to answer the following questions:
1. Can we develop a real-time control and pricing policy for AMoD using reinforcement learning, and what are its potential benefits over the static policy?
2. How does a policy trained for a specific network perform if the network parameters change?
3. Can we develop a global policy that can be utilized in any network with moderate fine-tuning?

The reader may skip Section 4.1 if they are not interested in the details of the MDP model used in our numerical experiments.

4.1 The Real-Time Problem as an MDP

We define the MDP by the tuple (S, A, T, r), where S is the state space, A is the action space, T is the state transition operator, and r is the reward function. We describe these elements as follows:

1. S: The state space consists of the prices of electricity at each node, the queue lengths for each origin-destination pair, and the number of vehicles at each node and each energy level. However, since travelling from node i to node j takes τ_ij periods of time, we need to define intermediate nodes. As such, we define τ_ij − 1 intermediate nodes between each origin-destination pair, for each battery energy level v. Hence, the state space consists of vectors in R^{s_d}_{≥0} of dimension

s_d = m² + (v_max + 1) ((Σ_{i=1}^m Σ_{j=1}^m τ_ij) − m² + 2m).

(We include all non-negative-valued vectors; however, only the m² − m queue-length entries can grow to infinity, while the rest are always upper bounded by the fleet size or the maximum price of electricity.) As such, we define the state vector at time t as s(t) = [p(t) q(t) s_veh(t)], where p(t) = [p_i(t)]_{i∈M} is the electricity prices state vector, q(t) = [q_ij(t)]_{i,j∈M; i≠j} is the queue lengths state vector, and s_veh(t) = [s^v_{ijk}(t)]_{∀i,j,k,v} is the vehicle state vector, where s^v_{ijk}(t) is the number of vehicles at vehicle state (i, j, k, v).
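As a quick sanity check on the state-space dimension formula above, the following sketch (with a hypothetical helper name, and assuming τ_ii = 0 so that only the m² − m off-diagonal trips contribute travel times) reproduces the example dimension of 1240 quoted earlier for m = 10, v_max = 5, τ_ij = 3:

```python
def state_dim(m: int, v_max: int, tau) -> int:
    """Dimension s_d = m^2 + (v_max + 1) * (sum_{i,j} tau_ij - m^2 + 2m).

    `tau` is an m x m matrix of travel times with tau[i][i] = 0.
    The m^2 term counts electricity prices (m) plus queues (m^2 - m);
    the second term counts vehicle states over all energy levels.
    """
    tau_sum = sum(tau[i][j] for i in range(m) for j in range(m))
    return m * m + (v_max + 1) * (tau_sum - m * m + 2 * m)

# Example from the text: m = 10, v_max = 5, tau_ij = 3 for i != j.
m, v_max = 10, 5
tau = [[0 if i == j else 3 for j in range(m)] for i in range(m)]
print(state_dim(m, v_max, tau))  # → 1240
```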
The vehicle state (i, j, k, v) specifies the location of a vehicle travelling between OD pair (i, j) as the k'th intermediate node between nodes i and j, and specifies the battery energy level of the vehicle as v. (The state of a vehicle at node i ∈ M with energy level v is denoted by (i, i, 0, v).)

2. A: The action space consists of prices for rides at each origin-destination pair and routing/charging decisions for vehicles at nodes i ∈ M at each energy level v. The price actions are continuous in the range [0, ℓ_max]. Each vehicle at state (i, i, 0, v) (∀i ∈ M, ∀v ∈ V) can either charge, stay idle, or travel to one of the remaining m − 1 nodes. To allow for different transitions for vehicles at the same state (some might charge, some might travel to another node), we define the action taken at time t for vehicles at state (i, i, 0, v) as an (m + 1)-dimensional probability vector with entries in [0, 1] that sum up to 1:

α^v_i(t) = [α^v_i1(t) . . . α^v_im(t) α^v_ic(t)],

where α^{v_max}_ic(t) = 0 and α^v_ij(t) = 0 if v < v_ij. The action space is then the set of all vectors a of dimension a_d = m² − m + (v_max + 1)(m² + m), whose first m² − m entries are the prices and the rest are the probability vectors satisfying the aforementioned properties. As such, we define the action vector at time t as a(t) = [ℓ(t) α(t)], where ℓ(t) = [ℓ_ij]_{i,j∈M, i≠j} is the vector of prices and α(t) = [α^v_i(t)]_{∀i,v} is the vector of routing/charging actions.

3. T: The transition operator is defined as T_ijk = Pr(s(t+1) = j | s(t) = i, a(t) = k).
We can define the transition probabilities for the electricity prices p(t+1), queue lengths q(t+1), and vehicle states s_veh(t+1) as follows:

Electricity Price Transitions: Since we assume that the dynamics of electricity prices are exogenous to our AMoD system, Pr(p(t+1) = p₂ | p(t) = p₁, a(t)) = Pr(p(t+1) = p₂ | p(t) = p₁), i.e., the price dynamics are independent of the action taken. Depending on the setting, new prices might either be deterministic or distributed according to some probability density function at time t: p(t) ∼ P(t), which is determined by the electricity provider.

Vehicle Transitions: For each vehicle at node i and energy level v, the transition probability is defined by the action probability vector α^v_i(t). Each vehicle transitions into state (i, j, 1, v − v_ij) with probability α^v_ij(t), stays idle in state (i, i, 0, v) with probability α^v_ii(t), or charges and transitions into state (i, i, 0, v + 1) with probability α^v_ic(t). The vehicles at intermediate states (i, j, k, v) transition into state (i, j, k + 1, v) if k < τ_ij − 1, or into (j, j, 0, v) if k = τ_ij − 1, with probability 1. The total transition probability to the vehicle states s_veh(t+1) given s_veh(t) and α(t) is the sum of the probabilities of all feasible transitions from s_veh(t) to s_veh(t+1) under α(t), where the probability of a feasible transition is the product of the individual vehicle transition probabilities (since the vehicle transition probabilities are independent). Note that instead of gradually dissipating the energy of the vehicles along their route, we immediately discharge the energy required for the trip from their batteries and keep it constant during the trip.
This ensures that the vehicles have enough battery to complete the ride and does not violate the model, because the vehicles arrive at their destinations with the true value of energy, and a new action will only be taken when they reach the destination.

Queue Transitions: The queue lengths transition according to the prices and the vehicle routing decisions. For prices ℓ_ij(t) and induced arrival rate Λ_ij(t), the probability that A_ij(t) new customers arrive in queue (i, j) is:

Pr(A_ij(t)) = e^{−Λ_ij(t)} Λ_ij(t)^{A_ij(t)} / (A_ij(t))!

Let us denote the total number of vehicles routed from node i to j at time t as x_ij(t), which is given by:

x_ij(t) = Σ_{v=v_ij}^{v_max} x^v_ij(t) = Σ_{v=v_ij}^{v_max} s^{v−v_ij}_{ij1}(t+1).   (5)

Figure 3: The schematic diagram representing the state transition of our MDP. Upon taking an action, a vehicle at state (i, i, 0, v) charges for a price of p_i(t) and transitions into state (i, i, 0, v+1) with probability α^v_ic(t), stays idle at state (i, i, 0, v) with probability α^v_ii(t), or starts traveling to another node j and transitions into state (i, j, 1, v − v_ij) with probability α^v_ij(t). Furthermore, A_ij(t) new customers arrive to queue (i, j) depending on the price ℓ_ij(t).

After the routing and charging decisions are executed for all the EVs in the fleet, the queues are modified. Given s_veh(t+1) and x_ij(t), the probability that the queue length q_ij(t+1) = q is:

Pr(q_ij(t+1) = q | s(t), a(t), s_veh(t+1)) = Pr(A_ij(t) = q − q_ij(t) + x_ij(t)) if q > 0, and Pr(A_ij(t) ≤ −q_ij(t) + x_ij(t)) if q = 0.

Since the arrivals are independent, the total probability that the queue vector q(t+1) = q is:

Pr(q(t+1) = q | s(t), a(t), s_veh(t+1)) = Π_{i∈M} Π_{j∈M, j≠i} Pr(q_ij(t+1) | s(t), a(t), s_veh(t+1)).
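The vehicle and queue transitions above can be simulated directly; a minimal sketch (with a hypothetical tuple encoding of vehicle states, and Poisson arrivals sampled by inverse transform so only the standard library is needed):

```python
import math
import random

def step_vehicle(state, alpha, v_req, tau, rng=random):
    """Sample one vehicle's next state (i, j, k, v) per the rules above.

    Vehicles at a node (j == i, k == 0) follow the categorical action
    vector alpha = [a_i1 .. a_im, a_ic]; vehicles in transit advance
    deterministically. Trip energy v_req[i][j] is deducted up front.
    """
    i, j, k, v = state
    if k > 0:  # in transit: advance one hop with probability 1
        return (i, j, k + 1, v) if k < tau[i][j] - 1 else (j, j, 0, v)
    m = len(alpha) - 1
    choice = rng.choices(range(m + 1), weights=alpha)[0]
    if choice == m:          # charge
        return (i, i, 0, v + 1)
    if choice == i:          # stay idle
        return (i, i, 0, v)
    return (i, choice, 1, v - v_req[i][choice])  # start a trip

def sample_poisson(lam, rng=random):
    """Inverse-transform sample of a Poisson(lam) arrival count."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def queue_step(q, lam, x, rng=random):
    """Queue recursion implied above: q(t+1) = max(0, q(t) + A(t) - x(t))."""
    return max(0, q + sample_poisson(lam, rng) - x)
```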
Hence, the transition probability is defined as:

Pr(s(t+1) | s(t), a(t)) = Pr(p(t+1) | p(t)) × Pr(s_veh(t+1) | s(t), α(t)) × Pr(q(t+1) | s(t), α(t), s_veh(t+1)).   (6)

We illustrate in Figure 3 how the vehicles and queues transition into new states consequent to an action.

4. r: The reward function r(t) is a function of the state-action pair at time t: r(t) = r(a(t), s(t)). Let x^v_ic(t) denote the number of vehicles charging at node i starting with energy level v at time period t. The reward function r(t) is defined as:

r(t) = Σ_{i∈M} Σ_{j∈M, j≠i} ℓ_ij(t) A_ij(t) − w Σ_{i∈M} Σ_{j∈M, j≠i} q_ij(t) − Σ_{i∈M} Σ_{v=0}^{v_max−1} (β + p_i) x^v_ic(t) − β Σ_{i∈M} Σ_{j∈M, j≠i} x_ij(t) − β Σ_{i∈M} Σ_{j∈M, j≠i} Σ_{k=1}^{τ_ij−1} Σ_{v=0}^{v_max−1} s^v_{ijk}(t)   (7)

The first term corresponds to the revenue generated by the passengers that request a ride for a price ℓ_ij(t), the second term is the queue cost of the passengers that have not yet been served, the third term is the charging and operational cost of the charging vehicles, and the last two terms are the operational costs of the vehicles making trips. Note that revenue is added to the reward function immediately when passengers enter the network, rather than after they are served. Since the reinforcement learning approach is based on maximizing the cumulative reward, all passengers eventually have to be served in order to prevent the queues from blowing up; hence adding the revenues immediately does not violate the model. Using the definitions of the tuple (S, A, T, r), we model the dynamic problem as an MDP. Given large-dimensional state and action spaces with infinite cardinality, we cannot solve the MDP using exact dynamic programming methods.
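For concreteness, reward (7) transcribes directly into code; a sketch with hypothetical dictionary-based inputs in place of the paper's vectors:

```python
def reward(prices, arrivals, queues, charging, trips, in_transit, w, beta, p_elec):
    """Reward (7): ride revenue minus queue, charging, and operating costs.

    prices / arrivals / queues / trips: dicts keyed by OD pair (i, j), i != j.
    charging: dict keyed by (i, v) -> number of vehicles charging at node i
    from energy level v. in_transit: total count of vehicles at intermediate
    states. p_elec: dict node -> electricity price.
    """
    revenue = sum(prices[od] * arrivals[od] for od in arrivals)
    queue_cost = w * sum(queues.values())
    charge_cost = sum((beta + p_elec[i]) * n for (i, _v), n in charging.items())
    trip_cost = beta * (sum(trips.values()) + in_transit)
    return revenue - queue_cost - charge_cost - trip_cost
```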
As a solution, we characterize the real-time policy via a deep neural network and apply reinforcement learning in order to develop a real-time policy.

4.2 Reinforcement Learning Method

In this subsection, we go through the preliminaries of reinforcement learning and briefly explain the idea behind the algorithm we adopted.

4.2.1 Preliminaries

The real-time policy associated with the MDP is defined as a function parameterized by θ: π_θ(a|s) = π : S × A → [0, 1], i.e., a probability distribution over the state-action space. Given a state s, the policy returns the probability of taking action a (for all actions), and samples an action according to this probability distribution. The goal is to derive the optimal policy π*, which maximizes the discounted cumulative expected reward J_π:

J_{π*} = max_π J_π = max_π E_π[ Σ_{t=0}^∞ γ^t r(t) ],   π* = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(t) ],

where γ ∈ (0, 1] is the discount factor. The value of taking an action a in state s and following the policy π afterwards is characterized by the value function Q_π(s, a):

Q_π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r(t) | s(0) = s, a(0) = a ].

The value of being in state s is formalized by the value function V_π(s):

V_π(s) = E_{a(0), π}[ Σ_{t=0}^∞ γ^t r(t) | s(0) = s ],

and the advantage of taking the action a in state s and following the policy π thereafter is defined as the advantage function A_π(s, a):

A_π(s, a) = Q_π(s, a) − V_π(s).

The methods used by reinforcement learning algorithms can be divided into three main groups: 1) critic-only methods, 2) actor-only methods, and 3) actor-critic methods, where the word critic refers to the value function and the word actor refers to the policy [44]. Critic-only (or value-function-based) methods (such as Q-learning [45] and SARSA [46]) improve a deterministic policy using the value function by iterating:

a* = argmax_a Q_π(s, a),   π(a*|s) ← 1.
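The discounted objective and the advantage defined above can be computed directly from sampled quantities; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t r(t) over one sampled trajectory (backward recursion)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def advantage(q_value, v_value):
    """A(s, a) = Q(s, a) - V(s)."""
    return q_value - v_value

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # → 1.75
```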
Actor-only methods (or policy gradient methods), such as Williams' REINFORCE algorithm [47], improve the policy by updating the parameter θ by gradient ascent, without using any form of a stored value function:

θ(t+1) = θ(t) + α ∇_θ E_{π_θ(t)}[ Σ_τ γ^τ r(τ) ].

The advantage of policy gradient methods is their ability to generate actions from a continuous action space by utilizing a parameterized policy. Finally, actor-critic methods [48, 49] make use of both value functions and policy gradients:

θ(t+1) = θ(t) + α ∇_θ E_{π_θ(t)}[ Q_{π_θ(t)}(s, a) ].

Actor-critic methods are able to produce actions in a continuous action space while reducing the high variance of the policy gradients by adding a critic (value function). All of these methods aim to update the parameters θ (or, for critic-only methods, directly update the policy π) to improve the policy. In deep reinforcement learning, the policy π is defined by a deep neural network whose weights constitute the parameter θ. To develop a real-time policy for our MDP, we adopt a practical policy gradient method called Proximal Policy Optimization (PPO).

4.2.2 Proximal Policy Optimization

PPO is a practical policy gradient method developed in [2], and is effective for optimizing large nonlinear policies such as deep neural networks. It preserves some of the benefits of trust region policy optimization (TRPO) [50], such as monotonic improvement, but is much simpler to implement because it can be optimized by a first-order optimizer, and is empirically shown to have better sample complexity.
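The two PPO update rules detailed in the remainder of this subsection — the clipped surrogate objective of (11) and the adaptive KL penalty — each reduce to a few lines; a plain-Python sketch, not the Stable Baselines implementation used in our experiments:

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Empirical L^CLIP of (11): mean of min(r_t*A_t, clip(r_t,1-eps,1+eps)*A_t)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1 - eps, min(r, 1 + eps))
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)

def update_kl_penalty(beta, d, d_targ):
    """Adaptive KL penalty: halve/double beta when d strays from d_targ."""
    if d < d_targ / 1.5:
        return beta / 2
    if d > d_targ * 1.5:
        return beta * 2
    return beta

# A ratio of 2 with positive advantage is clipped to 1 + eps:
print(clipped_surrogate([2.0], [1.0], eps=0.2))  # → 1.2
```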
In TRPO, an objective function (the "surrogate" objective) is maximized subject to a constraint on the size of the policy update, so that the new policy is not too far from the old policy:

maximize_θ  Ê_t[ (π_θ(a_t|s_t) / π_θold(a_t|s_t)) Â_t ]   (8a)
subject to  Ê_t[ KL[π_θold(·|s_t), π_θ(·|s_t)] ] ≤ δ,   (8b)

where π_θ is a stochastic policy and Â_t is an estimator of the advantage function at timestep t. The expectation Ê_t[. . .] indicates the empirical average over a finite batch of samples, and KL[π_θold(·|s_t), π_θ(·|s_t)] denotes the Kullback–Leibler divergence between π_θold and π_θ. Although TRPO solves the above constrained maximization problem using conjugate gradient, the theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem

maximize_θ  Ê_t[ (π_θ(a_t|s_t) / π_θold(a_t|s_t)) Â_t − β KL[π_θold(·|s_t), π_θ(·|s_t)] ],   (9)

for some penalty coefficient β. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well. To overcome this issue and develop a first-order algorithm that emulates the monotonic improvement of TRPO (without solving the constrained optimization problem), two PPO algorithms are constructed: 1) clipping the surrogate objective and 2) using an adaptive KL penalty coefficient [2].

1. Clipped Surrogate Objective: Let r_t(θ) denote the probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t), so that r_t(θ_old) = 1. TRPO maximizes

L(θ) = Ê_t[ (π_θ(a_t|s_t) / π_θold(a_t|s_t)) Â_t ] = Ê_t[ r_t(θ) Â_t ]   (10)

subject to the KL divergence constraint. Without a constraint, however, this would lead to a large policy update.
To prevent this, PPO modifies the surrogate objective to penalize changes to the policy that move r_t(θ) away from 1:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],   (11)

where ε is a hyperparameter, usually 0.1 or 0.2. The term clip(r_t(θ), 1 − ε, 1 + ε) Â_t modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving r_t outside of the interval [1 − ε, 1 + ε]. By taking the minimum of the clipped and the unclipped objectives, the final objective becomes a lower bound on the unclipped objective.

2. Adaptive KL Penalty Coefficient: Another approach is to use a penalty on the KL divergence and to adapt the penalty coefficient so that some target value d_targ of the KL divergence is achieved at each policy update. In each policy update, the following steps are performed:

• Using several epochs of minibatch SGD, optimize the KL-penalized objective

L^KLPEN(θ) = Ê_t[ (π_θ(a_t|s_t) / π_θold(a_t|s_t)) Â_t − β KL[π_θold(·|s_t), π_θ(·|s_t)] ]   (12)

• Compute d = Ê_t[ KL[π_θold(·|s_t), π_θ(·|s_t)] ]
  – If d < d_targ / 1.5, β ← β / 2
  – If d > d_targ × 1.5, β ← β × 2.

The updated β is then used for the next policy update. This scheme allows β to adjust when the KL divergence is significantly different from d_targ, so that the desired KL divergence between the old and the updated policy is attained.

A PPO algorithm using fixed-length trajectory segments is summarized in Algorithm 1. In each iteration, each of N (parallel) actors collects T timesteps of data. Then the surrogate loss on these NT timesteps of data is constructed and optimized with minibatch SGD for K epochs.

Algorithm 1: PPO, Actor-Critic Style
for iteration = 0, 1, 2, . . . do
    for actor = 1, 2, . . . , N do
        Run policy π_θold in environment for T timesteps.
        Compute advantage estimates Â_1, . . . , Â_T
    end
    Optimize surrogate L^CLIP or L^KLPEN w.r.t.
θ, with K epochs and minibatch size M ≤ NT.
    θ_old ← θ
end

In this work, we use the PPO algorithm with the clipped surrogate objective, because it is experimentally shown to have better performance than the PPO algorithm with the adaptive KL penalty coefficient [2]. We refer the reader to [2] for a comprehensive study of PPO algorithms. In the next section, we present our numerical studies demonstrating the performance of the RL policy.

5 Numerical Study

In this section, we discuss the numerical experiments and results for the performance of the reinforcement learning approach to the dynamic problem, and compare it with the performance of several static policies, including the optimal static policy outlined in Section 3. We solved for the optimal static policy using CVX, a package for specifying and solving convex programs [51]. To implement a dynamic environment compatible with reinforcement learning algorithms, we used the Gym toolkit [52] developed by OpenAI. For the implementation of the PPO algorithm, we used the Stable Baselines toolkit [53]. We chose an operational cost of β = $0.1 (by normalizing the average price of an electric car over 5 years [54]) and a maximum willingness to pay ℓ_max = $30. For the prices of electricity p_i(t), we generated random prices for different locations and different times using the statistics of locational marginal prices in [55]. We chose a maximum battery capacity of 20 kWh and discretized the battery energy into 5 units, where one unit of battery energy is 4 kWh. The time it takes to deliver one unit of charge is taken as one time epoch, which is equal to 5 minutes in our setup. The waiting time cost for one period is w = $2 (the average hourly wage is around $24 in the United States [56]). Note that the dimension of the state space grows significantly with battery capacity v_max, because it expands the set of states each vehicle can occupy by a factor of v_max.
Therefore, for computational purposes, we conducted two case studies: 1) a non-electric AMoD case study with a larger network in Manhattan, and 2) an electric AMoD case study with a smaller network in San Francisco. We picked two different real-world networks in order to demonstrate the universality of the reinforcement learning method in establishing a real-time policy. In particular, our intention is to support the claim that the success of the reinforcement learning method is not restricted to a single network, but generalizes to multiple real-world networks. Both experiments were performed on a laptop computer with an Intel Core i7-8750H CPU (6 × 2.20 GHz) and 16 GB DDR4 2666 MHz RAM.

5.1 Case Study in Manhattan

In a non-electric AMoD network, the energy dimension v vanishes. Because there is no charging action,^7 we can perform coarser discretizations of time. Specifically, we can allow each discrete time epoch to cover 5 × min_{i,j: i≠j} τ_ij minutes, and normalize the travel times τ_ij and w accordingly. (For EVs, because charging takes a non-negligible but shorter time than travelling, in general we have τ_ij > 1 and a larger number of states.) The static profit maximization problem in (1) for AMoD with non-electric vehicles can be rewritten as:

max_{x_ij, ℓ_ij}  Σ_{i∈M} Σ_{j∈M} λ_ij ℓ_ij (1 − F(ℓ_ij)) − β_g Σ_{i∈M} Σ_{j∈M} x_ij τ_ij
subject to  λ_ij (1 − F(ℓ_ij)) ≤ x_ij  ∀ i, j ∈ M,
            Σ_{j∈M} x_ij = Σ_{j∈M} x_ji  ∀ i ∈ M,
            x_ij ≥ 0  ∀ i, j ∈ M.   (13)

The operational cost β_g = $2.5 (per 10 minutes, [57]) is different from that of electric vehicles. Because there is no charging (or refueling action, since it takes negligible time), β_g also includes the fuel cost. The optimal static policy is used to compare against and highlight the performance of the real-time policy.^8

Figure 4: Manhattan divided into m = 10 regions.
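To make (13) concrete, the sketch below evaluates its objective and constraints for a candidate solution, under an assumed exponential willingness-to-pay distribution F(ℓ) = 1 − e^{−ℓ/ℓ̄} (the paper solves the actual program with CVX; this is only an illustrative evaluator with an assumed demand model):

```python
import math

def static_profit(x, prices, lam, tau, beta_g, mean_wtp):
    """Objective of (13) under the assumed F(l) = 1 - exp(-l / mean_wtp):
    ride revenue lam_ij * l_ij * (1 - F(l_ij)) minus operating cost
    beta_g * x_ij * tau_ij, summed over all OD pairs."""
    m = len(lam)
    survive = [[math.exp(-prices[i][j] / mean_wtp) for j in range(m)]
               for i in range(m)]  # 1 - F(l_ij)
    revenue = sum(lam[i][j] * prices[i][j] * survive[i][j]
                  for i in range(m) for j in range(m))
    cost = beta_g * sum(x[i][j] * tau[i][j] for i in range(m) for j in range(m))
    return revenue - cost

def is_feasible(x, prices, lam, mean_wtp, tol=1e-9):
    """Check the demand-service, flow-balance, and non-negativity constraints."""
    m = len(lam)
    for i in range(m):
        for j in range(m):
            demand = lam[i][j] * math.exp(-prices[i][j] / mean_wtp)
            if demand > x[i][j] + tol or x[i][j] < -tol:
                return False
    return all(abs(sum(x[i][j] for j in range(m)) -
                   sum(x[j][i] for j in range(m))) < tol for i in range(m))
```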
We divided Manhattan into 10 regions as in Figure 4, and using the yellow taxi data from the New York City Taxi and Limousine Commission dataset [58] for Saturday, May 4, 2019, between 18.00-20.00, we extracted the average arrival rates for rides and the trip durations between the regions (we exclude rides occurring within the same region). We trained our model by creating new induced random arrivals with the same average arrival rate, using prices determined by our policy. For the fleet size, we used a fleet of 1200 autonomous vehicles (according to the optimal fleet size emerging from the static problem). For training, we used a neural network with 4 hidden layers and 128 neurons in each hidden layer. The rest of the parameters are left at the defaults specified by the Stable Baselines toolkit [53]. In order to get the best policy, we trained 3 different models using DDPG [59], TRPO [50], and PPO. We trained the models for 10 million iterations, and the performances of the trained models are summarized in Table 1 using average rewards and queue lengths as metrics. Our experiments indicate that the model trained using PPO performs the best among the three; hence we use that model as our real-time policy.

^7 The vehicles still refuel; however, this takes negligible time compared to the trip durations.
^8 The solution of the static problem yields vehicle flows. In order to make the policy compatible with our environment and to generate integer actions that can be applied in a dynamic setting, we randomized the actions by dividing each flow for OD pair (i, j) (and energy level v) by the total number of vehicles in i (at energy level v) and used that fraction as the probability of sending a vehicle from i to j (with energy level v).

Table 1: Performances of RL policies trained with different algorithms.

Metric               | DDPG    | TRPO     | PPO
Average Rewards      | 9825.69 | 13142.47 | 15527.34
Average Queue Length | 431.76  | 87.96    | 68.11
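Footnote 8's randomization — turning the static flows into per-vehicle routing probabilities — can be sketched as follows (hypothetical names; leftover mass is assigned to idling so each vehicle's choice is a valid categorical draw):

```python
def flows_to_probabilities(flows, n_vehicles):
    """Convert static flows x_ij out of one node i into routing probabilities.

    `flows` maps destination j -> flow from node i; `n_vehicles` is the
    number of vehicles currently at i (at the relevant energy level).
    """
    if n_vehicles == 0:
        return {}
    probs = {j: min(x / n_vehicles, 1.0) for j, x in flows.items()}
    probs["idle"] = max(0.0, 1.0 - sum(probs.values()))
    return probs

print(flows_to_probabilities({1: 3.0, 2: 1.0}, 8))
# → {1: 0.375, 2: 0.125, 'idle': 0.5}
```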
We compare the performance of different policies using the rewards and the total queue length as metrics. The results are demonstrated in Figure 5. In Figure 5a we compare the rewards generated and the total queue length when applying the static and the real-time policies as defined in Sections 3 and 4. We can observe that while the optimal static policy provides rate stability in a dynamic setting (since the queues do not blow up), it fails to generate profits as it is not able to clear the queues. On the other hand, the real-time policy is able to keep the total length of the queues 100 times shorter than the static policy while generating higher profits. The optimal static policy fails to generate profits and is not necessarily the best static policy to apply in a dynamic setting. As such, in Figure 5b we demonstrate the performance of a sub-optimal static policy, where the prices are 5% higher than the optimal static prices in order to reduce the arrival rates and hence the queue lengths. Observe that the profits generated are higher than those generated using the optimal static policy for the static planning problem, while the total queue length is lower. This result indicates that under the stochasticity of the dynamic setting, a sub-optimal static policy can perform better than the optimal static policy. Furthermore, we summarize the performances of other static policies with higher static prices, namely 5%, 10%, 20%, 30%, and 40% higher than the optimal static prices, in Table 2. Among these, an increase of 10% performs the best in terms of rewards. Nevertheless, this policy still does worse in terms of rewards and total queue length than the real-time policy, which generates around 10% more rewards and results in 70% shorter queues. Lastly, we note that although a 40% increase in prices results in the minimum average queue length, this is a result of significantly reduced induced demand, and therefore it generates very low rewards.
Table 2: Performances of static pricing policies for the Manhattan case study.

Metric               | 105%     | 110%     | 120%     | 130%     | 140%
Average Rewards      | 12234.13 | 14112.77 | 13739.35 | 12046.91 | 9625.82
Average Queue Length | 584.05   | 231.93   | 74.64    | 30.88    | 14.20

Next, we showcase that even some heuristic modifications resembling what is done in practice can do better than the optimal static policy. We utilize the optimal static policy, but additionally employ a surge-pricing policy. The surge-pricing policy aims to decrease the arrival rates for longer queues, so that the queues stay shorter and the rewards increase.

Figure 5: Comparison of different policies for the Manhattan case study. Panels: (a) Real-Time vs. Static Policy; (b) Real-Time vs. Static Policy (+5% Prices); (c) Real-Time vs. Surge Pricing Policy; (d) Real-Time vs. Static Policy (Next Week). The legends of all panels match the top left panel, where red lines correspond to the real-time policy and blue lines correspond to the static policies (we excluded the running averages in (d), because the static policy diverges). In all scenarios, we use the rewards generated and the total queue length as metrics. In (a), we demonstrate the results from applying the real-time policy and the optimal static policy.
In (b), we compare the real-time policy with the static policy that uses 5% higher prices than the optimal static policy. In (c), we employ a surge-pricing policy along with the optimal static policy and compare it with the real-time policy. In (d), we apply the real-time policy and the static policy developed for Saturday, May 4, 2019, to the arrivals of Saturday, May 11, 2019.

At each time period, for all OD pairs, the policy is to increase the price by 50% if the queue is longer than 100% of the induced arrival rate. The results are displayed in Figure 5c. New arrivals bring higher revenue per person and the total queue length is decreased, which stabilizes the network while generating more profits than the optimal static policy. The surge-pricing policy results in stable short queues and higher rewards compared to the optimal static policy for the static setting; however, both the real-time policy and the static pricing policy with 10% higher prices are superior. Performances of other surge-pricing policies that multiply the prices by 1.25/1.5/2 if the queue is longer than 50%/100%/200% of the induced arrival rates can be found in Table 3. Accordingly, the best surge-pricing policy in terms of rewards is to multiply the prices by 1.25 if the queue is longer than 50% of the induced arrival rate. Yet, our real-time policy still generates around 20% more rewards and results in 32% shorter queues. We note that a surge-pricing policy that multiplies the prices by 2 when the queues are longer than 50% of the induced arrival rates minimizes the queues by decreasing the induced arrival rates significantly, which results in substantially low rewards.
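The surge heuristic described above is a one-line rule per OD pair; a sketch (illustrative parameter names):

```python
def surge_price(base_price, queue_len, arrival_rate,
                multiplier=1.5, threshold=1.0):
    """Multiply the static price when the queue exceeds a fraction of the
    induced arrival rate. The defaults multiplier=1.5, threshold=1.0
    reproduce the rule in the text: increase the price by 50% if the
    queue is longer than 100% of the induced arrival rate."""
    if queue_len > threshold * arrival_rate:
        return base_price * multiplier
    return base_price

print(surge_price(10.0, queue_len=12, arrival_rate=10))  # → 15.0
print(surge_price(10.0, queue_len=8, arrival_rate=10))   # → 10.0
```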
Table 3: Performances of surge pricing policies for the Manhattan case study.

                 | Threshold 50%     | Threshold 100%    | Threshold 200%
Surge Multiplier | Queue  | Rewards  | Queue  | Rewards  | Queue  | Rewards
1.25             | 101.25 | 13022.83 | 186.56 | 12897.30 | 380.34 | 12357.33
1.5              | 91.89  | 12602.90 | 178.22 | 12589.71 | 370.18 | 12233.95
2                | 83.15  | 5272.04  | 162.99 | 6224.69  | 337.01 | 7485.75

Finally, we test how robust the static and the real-time policies are to variations in the input statistics. We compare the rewards generated and the total queue length when applying the static and the real-time policies to the arrival rates of Saturday, May 11, 2019, between 18.00-20.00. The results are displayed in Figure 5d. Even though the arrival rates of May 11 and May 4 do not differ much, the static policy is not resilient and fails to stabilize the network when there is a slight change in it. The real-time policy, on the other hand, is still able to stabilize the network and generate profits. The neural-network-based policy is able to determine the correct pricing and routing decisions by considering the current state of the network, even under different arrival rates. These experiments show that we can indeed develop a real-time policy using deep reinforcement learning and that this policy is resilient to small changes in the network parameters. The next study investigates the idea of generality, i.e., whether we can develop a global real-time policy and fine-tune it to a specific environment with a few shots of training, rather than developing a new policy from scratch.

Few-shot Learning: A common problem with reinforcement learning approaches is that because the agent is trained for a specific environment, it fails to respond to a slightly changed environment. Hence, one would need to train a different model for each environment (different network configurations, different arrival rates). However, this is not a feasible solution considering that training one model takes millions of iterations.
As a more tractable solution, one could train a global model using different environments and then calibrate it to the desired environment with fewer iterations rather than training a new model from scratch. We tested this idea by training a global model for Manhattan using various arrival rates and network configurations that we extracted from different 2-hour intervals (we trained the global model for 10 million iterations). We then fine-tuned this model for the network configuration and arrival rates of Monday, May 6, 2019 between 15.00-17.00. The results are displayed in Figure 6. Even with no additional training, the global model performs better than the specific model trained from scratch for 2 million iterations. Furthermore, with only a few iterations, it is possible to improve the performance of the global model significantly. This is an anticipated result: although the network configurations and arrival rates for different 2-hour intervals differ, the environments are not fundamentally different (the state transitions are governed by similar random processes), and hence it is possible to generalize a global policy and fine-tune it to the desired environment with a smaller number of iterations.

Figure 6: Performances of the specific model trained from scratch (2 million iterations) and of the fine-tuned global model (no additional training, and 1k/10k/100k fine-tuning iterations): rewards (left) and queue lengths (right).

Figure 7: San Francisco divided into m = 7 regions. Map obtained from the San Francisco County Transportation Authority [60].
5.2 Case Study in San Francisco

We conducted the case study in San Francisco with an EV fleet of 420 vehicles. We divided San Francisco into 7 regions as in Figure 7 and, using the taxi cab mobility traceset from CRAWDAD [61], we obtained the average arrival rates and travel times between regions (we exclude rides occurring within the same region). In Figure 8, we compare the charging costs paid under the real-time policy and the static policy. The static policy is generated using the average value of the electricity prices, whereas the real-time policy takes the current electricity prices into account before executing an action. Therefore, the real-time policy provides cheaper charging options by utilizing smart charging strategies, decreasing the average charging costs by 25%.

In Figure 9a, we compare the rewards and the total queue lengths resulting from the real-time policy and the static policy. In Figure 9b, we compare the RL policy to the static policy with 5% higher prices than the optimal static policy, and summarize the performances of several other static pricing policies in Table 5. In Figure 9c, we use the static policy but also apply a surge pricing policy that multiplies the prices by 1.5 if the queues are longer than 100% of the induced arrival rates. The performances of other surge pricing policies are displayed in Table 4. Similar to the Manhattan case study, the results demonstrate that the performance of the trained real-time policy is superior to the other policies. In particular, the RL policy generates around 24% more rewards and results in around 75% shorter queues than the best heuristic policy, which uses 30% higher static prices than the optimal static policy.

Figure 9: Comparison of different policies for the San Francisco case study: (a) real-time policy vs. the optimal static policy; (b) real-time policy vs. a sub-optimal static policy with 5% higher prices; (c) real-time policy vs. the surge pricing policy. The legends are the same as in the top left figure, where red lines correspond to the real-time policy and blue lines correspond to the static policies. In all scenarios, we use the rewards generated and the total queue length as metrics.

                              Queue threshold
                     50%                 100%                200%
Surge multiplier  Queue   Rewards     Queue   Rewards     Queue   Rewards
1.25              67.62    718.66     75.92    715.02     99.56    687.45
1.5               25.16    650.90     34.32    687.71     49.94    708.38
2                 14.06    331.21     20.55    455.25     44.44    611.23

Table 4: Performances of surge pricing policies for the San Francisco case study.

% of opt. static prices    105%      110%     120%     130%     140%
Average Rewards            4.98    485.65   696.38   721.89   682.76
Average Queue Length     456.83    211.04    87.15    45.28    25.66

Table 5: Performances of static pricing policies for the San Francisco case study.

6 Conclusion
Figure 8: Charging costs for the optimal static policy and the real-time policy in the San Francisco case study.

In this paper, we developed a real-time control policy based on deep reinforcement learning for operating an AMoD fleet of EVs as well as for pricing rides. Our real-time control policy jointly makes decisions for: 1) vehicle routing, in order to serve passenger demand and to rebalance the empty vehicles; 2) vehicle charging, in order to sustain energy for rides while exploiting geographical and temporal diversity in electricity prices for cheaper charging options; and 3) pricing for rides, in order to adjust the potential demand so that the network is stable and the profits are maximized. Furthermore, we formulated the static planning problem associated with the dynamic problem in order to define the optimal static policy. When implemented correctly, the static policy provides stability of the queues in the dynamic setting, yet it is not optimal in terms of profits and of keeping the queues sufficiently short. Finally, we conducted case studies in Manhattan and San Francisco that demonstrate the performance of our developed policy. The two case studies on different networks indicate that reinforcement learning can be a universal method for establishing well-performing real-time policies that can be applied to many real-world networks. Lastly, by conducting the Manhattan study with non-electric vehicles and the San Francisco study with electric vehicles, we have also demonstrated that a real-time policy using reinforcement learning can be established for both electric and non-electric AMoD systems.

References

[1] [Online]. Available: https://www.cbinsights.com/research/autonomous-driverless-vehicles-corporations-list/.

[2] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[3] R.
Zhang and M. Pavone, “Control of robotic Mobility-on-Demand systems: A queueing-theoretical perspective,” Int. Journal of Robotics Research, vol. 35, no. 1–3, pp. 186–203, 2016.

[4] M. Pavone, S. L. Smith, E. Frazzoli, and D. Rus, “Robotic load balancing for Mobility-on-Demand systems,” Int. Journal of Robotics Research, vol. 31, no. 7, pp. 839–854, 2012.

[5] F. Rossi, R. Zhang, Y. Hindy, and M. Pavone, “Routing autonomous vehicles in congested transportation networks: Structural properties and coordination algorithms,” Autonomous Robots, vol. 42, no. 7, pp. 1427–1442, 2018.

[6] M. Volkov, J. Aslam, and D. Rus, “Markov-based redistribution policy model for future urban mobility networks,” Conference Record - IEEE Conference on Intelligent Transportation Systems, pp. 1906–1911, 09 2012.

[7] Q. Wei, J. A. Rodriguez, R. Pedarsani, and S. Coogan, “Ride-sharing networks with mixed autonomy,” arXiv preprint arXiv:1903.07707, 2019.

[8] R. Zhang, F. Rossi, and M. Pavone, “Model predictive control of autonomous mobility-on-demand systems,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016.

[9] F. Miao, S. Han, S. Lin, J. A. Stankovic, H. Huang, D. Zhang, S. Munir, T. He, and G. J. Pappas, “Taxi dispatch with real-time sensing data in metropolitan areas: A receding horizon control approach,” CoRR, vol. abs/1603.04418, 2016. [Online]. Available: http://arxiv.org/abs/1603.04418

[10] R. Iglesias, F. Rossi, K. Wang, D. Hallac, J. Leskovec, and M. Pavone, “Data-driven model predictive control of autonomous mobility-on-demand systems,” CoRR, vol. abs/1709.07032, 2017. [Online]. Available: http://arxiv.org/abs/1709.07032

[11] F. Miao, S. Han, A. M. Hendawi, M. E. Khalefa, J. A. Stankovic, and G. J.
Pappas, “Data-driven distributionally robust vehicle balancing using dynamic region partitions,” in 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), April 2017, pp. 261–272.

[12] M. Tsao, R. Iglesias, and M. Pavone, “Stochastic model predictive control for autonomous mobility on demand,” CoRR, vol. abs/1804.11074, 2018. [Online]. Available: http://arxiv.org/abs/1804.11074

[13] K. Spieser, S. Samaranayake, and E. Frazzoli, “Vehicle routing for shared-mobility systems with time-varying demand,” in 2016 American Control Conference (ACC), July 2016, pp. 796–802.

[14] R. M. A. Swaszek and C. Cassandras, “Load balancing in mobility-on-demand systems: Reallocation via parametric control using concurrent estimation,” 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 2148–2153, 2019.

[15] M. Repoux, M. Kaspi, B. Boyacı, and N. Geroliminis, “Dynamic prediction-based relocation policies in one-way station-based carsharing systems with complete journey reservations,” Transportation Research Part B: Methodological, vol. 130, pp. 82–104, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S019126151930102X

[16] B. Boyacı, K. G. Zografos, and N. Geroliminis, “An integrated optimization-simulation framework for vehicle and personnel relocations of electric carsharing systems with reservations,” Transportation Research Part B: Methodological, vol. 95, pp. 214–237, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0191261515301119

[17] J. Warrington and D. Ruchti, “Two-stage stochastic approximation for dynamic rebalancing of shared mobility systems,” Transportation Research Part C: Emerging Technologies, vol. 104, pp. 110–134, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X18314104

[18] C. Mao and Z.
Shen, “A reinforcement learning framework for the adaptive routing problem in stochastic time-dependent network,” Transportation Research Part C: Emerging Technologies, vol. 93, pp. 179–197, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X18307617

[19] F. Zhu and S. V. Ukkusuri, “Accounting for dynamic speed limit control in a stochastic traffic environment: A reinforcement learning approach,” Transportation Research Part C: Emerging Technologies, vol. 41, pp. 30–47, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X1400028X

[20] E. Walraven, M. T. Spaan, and B. Bakker, “Traffic flow optimization: A reinforcement learning approach,” Engineering Applications of Artificial Intelligence, vol. 52, pp. 203–212, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0952197616000038

[21] F. Zhu, H. A. Aziz, X. Qian, and S. V. Ukkusuri, “A junction-tree based learning algorithm to optimize network wide traffic control: A coordinated multi-agent framework,” Transportation Research Part C: Emerging Technologies, vol. 58, pp. 487–501, 2015, Special Issue: Advanced Road Traffic Control. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X14003593

[22] L. Li, Y. Lv, and F. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.

[23] D. A. Lazar, E. Bıyık, D. Sadigh, and R. Pedarsani, “Learning how to dynamically route autonomous vehicles on shared roads,” arXiv preprint, 2019.

[24] M. Han, P. Senellart, S. Bressan, and H. Wu, “Routing an autonomous taxi with reinforcement learning,” in CIKM, 2016.

[25] M. Guériau and I.
Dusparic, “SAMoD: Shared autonomous mobility-on-demand using decentralized reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Nov 2018, pp. 1558–1563.

[26] J. Wen, J. Zhao, and P. Jaillet, “Rebalancing shared mobility-on-demand systems: A reinforcement learning approach,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Oct 2017, pp. 220–225.

[27] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 1774–1783. [Online]. Available: https://doi.org/10.1145/3219819.3219993

[28] C. Mao, Y. Liu, and Z.-J. M. Shen, “Dispatch of autonomous vehicles for taxi services: A deep reinforcement learning approach,” Transportation Research Part C: Emerging Technologies, vol. 115, p. 102626, 2020. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X19312227

[29] E. Veldman and R. A. Verzijlbergh, “Distribution grid impacts of smart electric vehicle charging from different perspectives,” IEEE Transactions on Smart Grid, vol. 6, no. 1, pp. 333–342, Jan 2015.

[30] W. Su, H. Eichi, W. Zeng, and M. Chow, “A survey on the electrification of transportation in a smart grid environment,” IEEE Transactions on Industrial Informatics, vol. 8, no. 1, pp. 1–10, Feb 2012.

[31] J. C. Mukherjee and A. Gupta, “A review of charge scheduling of electric vehicles in smart grid,” IEEE Systems Journal, vol. 9, no. 4, pp. 1541–1553, Dec 2015.

[32] T. D. Chen, K. M. Kockelman, and J. P. Hanna, “Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions,” Transportation Research Part A: Policy and Practice, vol.
94, pp. 243–254, 2016.

[33] C. Bongiovanni, M. Kaspi, and N. Geroliminis, “The electric autonomous dial-a-ride problem,” Transportation Research Part B: Methodological, vol. 122, pp. 436–456, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0191261517309669

[34] N. Tucker, B. Turan, and M. Alizadeh, “Online Charge Scheduling for Electric Vehicles in Autonomous Mobility on Demand Fleets,” in Proc. IEEE Int. Conf. on Intelligent Transportation Systems, 2019.

[35] B. Turan, N. Tucker, and M. Alizadeh, “Smart Charging Benefits in Autonomous Mobility on Demand Systems,” in Proc. IEEE Int. Conf. on Intelligent Transportation Systems, 2019. [Online]. Available: https://arxiv.org/abs/1907.00106

[36] F. Rossi, R. Iglesias, M. Alizadeh, and M. Pavone, “On the interaction between autonomous mobility-on-demand systems and the power network: models and coordination algorithms,” Robotics: Science and Systems XIV, Jun 2018.

[37] T. D. Chen and K. M. Kockelman, “Management of a shared autonomous electric vehicle fleet: Implications of pricing schemes,” Transportation Research Record, vol. 2572, no. 1, pp. 37–46, 2016.

[38] Y. Guan, A. M. Annaswamy, and H. E. Tseng, “Cumulative prospect theory based dynamic pricing for shared mobility on demand services,” CoRR, vol. abs/1904.04824, 2019. [Online]. Available: http://arxiv.org/abs/1904.04824

[39] C. J. R. Sheppard, G. S. Bauer, B. F. Gerke, J. B. Greenblatt, A. T. Jenn, and A. R. Gopal, “Joint optimization scheme for the planning and operations of shared autonomous electric vehicle fleets serving mobility on demand,” Transportation Research Record, vol. 2673, no. 6, pp. 579–597, 2019. [Online]. Available: https://doi.org/10.1177/0361198119838270

[40] K. Bimpikis, O. Candogan, and D. Sabán, “Spatial pricing in ride-sharing networks,” Operations Research, vol. 67, pp. 744–769, 2019.

[41] R. Pedarsani, J. Walrand, and Y.
Zhong, “Robust scheduling for flexible processing networks,” Advances in Applied Probability, vol. 49, no. 2, pp. 603–628, 2017.

[42] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.

[43] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using SUMO,” in The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, 2018. [Online]. Available: https://elib.dlr.de/124092/

[44] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, Nov 2012.

[45] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992698

[46] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” Tech. Rep., 1994.

[47] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

[48] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, Sep. 1983.

[49] I. H. Witten, “An adaptive optimal controller for discrete-time Markov environments,” Information and Control, vol. 34, pp. 286–295, 1977.

[50] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online].
Available: http://arxiv.org/abs/1502.05477

[51] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 2.1,” http://cvxr.com/cvx, Mar. 2014.

[52] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” 2016.

[53] A. Hill, A. Raffin, M. Ernestus, A. Gleave, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, “Stable baselines,” https://github.com/hill-a/stable-baselines, 2018.

[54] The average electric car in the US is getting cheaper. [Online]. Available: https://qz.com/1695602/the-average-electric-vehicle-is-getting-cheaper-in-the-us/.

[55] [Online]. Available: http://oasis.caiso.com

[56] United States Average Hourly Wages. [Online]. Available: https://tradingeconomics.com/united-states/wages.

[57] How much does driving your car cost, per minute? [Online]. Available: https://www.bostonglobe.com/ideas/2014/08/08/how-much-driving-really-costs-per-minute/BqnNd2q7jETedLhxxzY2CI/story.html.

[58] [Online]. Available: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

[59] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.

[60] [Online]. Available: http://tncstoday.sfcta.org/

[61] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, “CRAWDAD dataset epfl/mobility (v. 2009-02-24),” downloaded from https://crawdad.org/epfl/mobility/20090224, Feb. 2009.

[62] J. G. Dai, “On positive Harris recurrence of multiclass queueing networks: A unified approach via fluid limit models,” Annals of Applied Probability, vol. 5, pp. 49–77, 1995.
Appendices

A Proof of Proposition 1

To prove Proposition 1, we first formulate the static optimization problem via a network flow model that characterizes the capacity region of the network for a given set of prices $\ell_{ij}(t) = \ell_{ij}\ \forall t$ (hence, $\Lambda_{ij}(t) = \Lambda_{ij}\ \forall t$). The capacity region is defined as the set of all arrival rates $[\Lambda_{ij}]_{i,j\in\mathcal{M}}$ for which there exists a charging and routing policy under which the queueing network of the system is stable. Let $x_i^v$ be the number of vehicles available at node $i$ with energy level $v$, $\alpha_{ij}^v$ be the fraction of vehicles at node $i$ with energy level $v$ being routed to node $j$, and $\alpha_{ic}^v$ be the fraction of vehicles charging at node $i$ starting with energy level $v$. We say the static vehicle allocation for node $i$ and energy level $v$ is feasible if $\alpha_{ic}^v + \sum_{j\in\mathcal{M},\, j\neq i} \alpha_{ij}^v \leq 1$. The optimization problem that characterizes the capacity region of the network ensures that the total number of vehicles routed from $i$ to $j$ is at least as large as the nominal arrival rate to the queue $(i,j)$. Namely, the vehicle allocation problem can be formulated as follows:

\begin{align}
\min_{x_i^v,\, \alpha_{ij}^v,\, \alpha_{ic}^v}\quad & \rho & & (14a)\\
\text{subject to}\quad & \Lambda_{ij} \leq \sum_{v=v_{ij}}^{v_{\max}} x_i^v \alpha_{ij}^v & \forall i,j \in \mathcal{M}, & \quad (14b)\\
& \rho \geq \alpha_{ic}^v + \sum_{j\in\mathcal{M},\, j\neq i} \alpha_{ij}^v & \forall i \in \mathcal{M},\ \forall v \in \mathcal{V}, & \quad (14c)\\
& x_i^v = x_i^{v-1}\alpha_{ic}^{v-1} + \sum_{j\in\mathcal{M}} x_j^{v+v_{ji}}\alpha_{ji}^{v+v_{ji}} & \forall i \in \mathcal{M},\ \forall v \in \mathcal{V}, & \quad (14d)\\
& \alpha_{ic}^{v_{\max}} = 0 & \forall i \in \mathcal{M}, & \quad (14e)\\
& \alpha_{ij}^v = 0 & \forall v < v_{ij},\ \forall i,j \in \mathcal{M}, & \quad (14f)\\
& x_i^v \geq 0,\ \alpha_{ij}^v \geq 0,\ \alpha_{ic}^v \geq 0 & \forall i,j \in \mathcal{M},\ \forall v \in \mathcal{V}, & \quad (14g)\\
& x_i^v = \alpha_{ic}^v = \alpha_{ij}^v = 0 & \forall v \notin \mathcal{V},\ \forall i,j \in \mathcal{M}. & \quad (14h)
\end{align}

The constraint (14c) upper bounds the allocation of vehicles for each node $i$ and energy level $v$. The constraints (14d)-(14f) are similar to those of optimization problem (1) with $x_i^v = x_{ic}^v + \sum_{j\in\mathcal{M}} x_{ij}^v$, $\alpha_{ic}^v = x_{ic}^v / x_i^v$, and $\alpha_{ij}^v = x_{ij}^v / x_i^v$.

Lemma 1. Let the optimal value of (14) be $\rho^*$.
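For concreteness, the load $\rho$ in (14c) and the demand constraints (14b) can be checked numerically for a candidate static allocation (a minimal sketch; the array names and shapes are our own conventions, not from the paper):

```python
import numpy as np

def allocation_load(alpha_c, alpha_r):
    """rho: largest total allocated fraction over (node, energy-level) pairs.

    alpha_c[i, v]    -- fraction of vehicles at node i with energy v that charge
    alpha_r[i, j, v] -- fraction of vehicles at node i with energy v routed to j
    """
    return float((alpha_c + alpha_r.sum(axis=1)).max())

def demand_satisfied(x, alpha_r, Lam, v_min):
    """Check (14b): Lambda_ij <= sum over v >= v_ij of x_i^v * alpha_ij^v.

    x[i, v]     -- number of vehicles at node i with energy level v
    Lam[i, j]   -- nominal arrival rate for OD pair (i, j)
    v_min[i, j] -- minimum energy level v_ij needed for the (i, j) trip
    """
    m, _, V = alpha_r.shape
    for i in range(m):
        for j in range(m):
            served = sum(x[i, v] * alpha_r[i, j, v]
                         for v in range(v_min[i, j], V))
            if Lam[i, j] > served + 1e-9:
                return False
    return True
```

A feasible static allocation is then one with `allocation_load(...) <= 1` and `demand_satisfied(...) == True`, mirroring the role of $\rho^* \leq 1$ below.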
Then, $\rho^* \leq 1$ is a necessary and sufficient condition for rate stability of the system under some routing and charging policy.

Proof. Consider the fluid scaling of the queueing network, $Q_{ij}^{rt} = q_{ij}(\lfloor rt \rfloor)/r$ (see [62] for more discussion on the stability of fluid models), and let $Q_{ij}^t$ be the corresponding fluid limit. The fluid model dynamics are as follows:

$$Q_{ij}^t = Q_{ij}^0 + A_{ij}^t - X_{ij}^t,$$

where $A_{ij}^t$ is the total number of riders from node $i$ to node $j$ that have arrived to the network until time $t$ and $X_{ij}^t$ is the total number of vehicles routed from node $i$ to $j$ up to time $t$.

Suppose that $\rho^* > 1$ and there exists a policy under which for all $t \geq 0$ and for all origin-destination pairs $(i,j)$, $Q_{ij}^t = 0$. Pick a point $t_1$ where $Q_{ij}^{t_1}$ is differentiable for all $(i,j)$. Then, for all $(i,j)$, $\dot{Q}_{ij}^{t_1} = 0$. Since $\dot{A}_{ij}^{t_1} = \Lambda_{ij}$, this implies $\dot{X}_{ij}^{t_1} = \Lambda_{ij}$. On the other hand, $\dot{X}_{ij}^{t_1}$ is the total number of vehicles routed from $i$ to $j$ at $t_1$. This implies $\Lambda_{ij} = \sum_{v=v_{ij}}^{v_{\max}} x_i^v \alpha_{ij}^v$ for all $(i,j)$, and there exist $\alpha_{ij}^v$ and $\alpha_{ic}^v$ at time $t_1$ such that the flow balance constraints hold and the allocation vector $[\alpha_{ij}^v\ \alpha_{ic}^v]$ is feasible, i.e., $\alpha_{ic}^v + \sum_{j=1,\, j\neq i}^{m} \alpha_{ij}^v \leq 1$. This contradicts $\rho^* > 1$.

Now suppose $\rho^* \leq 1$ and $\alpha^* = [\alpha_{ij}^{v*}\ \alpha_{ic}^{v*}]$ is an allocation vector that solves the static problem. The cumulative number of vehicles routed from node $i$ to $j$ up to time $t$ is $S_{ij}^t = \sum_{v=v_{ij}}^{v_{\max}} x_i^v \alpha_{ij}^v t = \sum_{v=0}^{v_{\max}} x_i^v \alpha_{ij}^v t \geq \Lambda_{ij} t$. Suppose that for some origin-destination pair $(i,j)$, the queue $Q_{ij}^{t_1} \geq \epsilon > 0$ for some positive $t_1$ and $\epsilon$. By continuity of the fluid limit, there exists $t_0 \in (0, t_1)$ such that $Q_{ij}^{t_0} = \epsilon/2$ and $Q_{ij}^t > 0$ for $t \in [t_0, t_1]$. Then, $\dot{Q}_{ij}^t > 0$ implies $\Lambda_{ij} > \sum_{v=v_{ij}}^{v_{\max}} x_i^v \alpha_{ij}^v$, which is a contradiction.
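The rate-stability argument above can be illustrated with a discretized fluid queue (a minimal sketch; the function, step size, and clipping at zero are our own choices, not from the paper):

```python
def fluid_queue(q0, arrival_rate, service_rate, horizon, dt=0.01):
    """Euler discretization of the fluid dynamics
    dQ/dt = arrival_rate - service_rate, with the queue clipped at zero.
    Returns the final queue length after `horizon` time units."""
    q = q0
    for _ in range(int(horizon / dt)):
        q = max(0.0, q + (arrival_rate - service_rate) * dt)
    return q
```

When the routed service rate matches or exceeds the arrival rate (the $\rho^* \leq 1$ case), the fluid queue drains to zero and stays there; when arrivals exceed service, it grows linearly.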
By Lemma 1, the capacity region $\mathcal{C}_\Lambda$ of the network is the set of all $\Lambda_{ij} \in \mathbb{R}_+$ for which the corresponding optimal solution to the optimization problem (14) satisfies $\rho^* \leq 1$. As long as $\rho^* \leq 1$, there exists a routing and charging policy such that the queues are bounded away from infinity. The platform operator's goal is to maximize its profits by setting prices and making routing and charging decisions such that the system remains stable. In its most general form, the problem can be formulated as follows:

$$\max_{\ell_{ij},\, x_i^v,\, \alpha_{ij}^v,\, \alpha_{ic}^v} \ U\big(\Lambda_{ij}(\ell_{ij}), x_i^v, \alpha_{ij}^v, \alpha_{ic}^v\big) \quad \text{subject to} \quad [\Lambda_{ij}(\ell_{ij})]_{i,j\in\mathcal{M}} \in \mathcal{C}_\Lambda, \qquad (15)$$

where $U(\cdot)$ is the utility function that depends on the prices, the demand for rides, and the vehicle decisions. Recall that $x_{ic}^v = x_i^v \alpha_{ic}^v$ and $x_{ij}^v = x_i^v \alpha_{ij}^v$. Using these variables and noting that $\alpha_{ic}^v + \sum_{j\in\mathcal{M}} \alpha_{ij}^v = 1$ when $\rho^* \leq 1$, the platform operator's profit maximization problem can be stated as (1). A feasible solution of (1) guarantees rate stability of the system, since the corresponding vehicle allocation problem (14) has a solution with $\rho^* \leq 1$.

B Proof of Proposition 2

For brevity of notation, let $P_i = \beta + p_i$. Let $\nu_{ij}$ be the dual variables corresponding to the demand satisfaction constraints and $\mu_i^v$ be the dual variables corresponding to the flow balance constraints. Since the optimization problem (1) is a convex quadratic maximization problem (given a uniform $F(\cdot)$) and Slater's condition is satisfied, strong duality holds. We can write the dual problem as:

\begin{align}
\min_{\nu_{ij},\, \mu_i^v}\ \max_{\ell_{ij}}\quad & \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_{ij}\Big(1 - \frac{\ell_{ij}}{\ell_{\max}}\Big)(\ell_{ij} - \nu_{ij}) & (16a)\\
\text{subject to}\quad & \nu_{ij} \geq 0, & (16b)\\
& \nu_{ij} + \mu_i^v - \mu_j^{v - v_{ij}} - \beta\,\tau_{ij} \leq 0, & (16c)\\
& \mu_i^v - \mu_i^{v+1} - P_i \leq 0 \quad \forall i, j, v. & (16d)
\end{align}

For fixed $\nu_{ij}$ and $\mu_i^v$, the inner maximization results in the optimal prices:

$$\ell_{ij}^* = \frac{\ell_{\max} + \nu_{ij}}{2}. \qquad (17)$$
By strong duality, the optimal primal solution satisfies (17) with the optimal dual variables $\nu_{ij}^*$ and $\mu_i^{v*}$, which completes the first part of the proposition. The dual problem with the optimal prices (17) can be written as:

\begin{align}
\min_{\nu_{ij},\, \mu_i^v}\quad & \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{\lambda_{ij}}{\ell_{\max}} \Big(\frac{\ell_{\max} - \nu_{ij}}{2}\Big)^2 & (18a)\\
\text{subject to}\quad & \nu_{ij} \geq 0, & (18b)\\
& \nu_{ij} + \mu_i^v - \mu_j^{v - v_{ij}} - \beta\,\tau_{ij} \leq 0, & (18c)\\
& \mu_i^v - \mu_i^{v+1} - P_i \leq 0 \quad \forall i, j, v. & (18d)
\end{align}

The objective function (18a) evaluated at the optimal dual variables, together with (17), gives

$$\mathcal{P} = \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{\lambda_{ij}}{\ell_{\max}} (\ell_{\max} - \ell_{ij}^*)^2,$$

where the profits $\mathcal{P}$ equal the optimal value of both the primal and the dual problems. To get the upper bound on prices, we go through the following algebraic calculations using the constraints. The inequality (18d) gives:

$$\mu_i^{v - v_{ji}} \leq v_{ji} P_i + \mu_i^v, \qquad (19)$$

and, equivalently,

$$\mu_j^{v - v_{ij}} \leq v_{ij} P_j + \mu_j^v. \qquad (20)$$

The inequalities (18c) and (18b) yield:

$$\mu_i^v - \mu_j^{v - v_{ij}} - \beta\,\tau_{ij} \leq 0,$$

and, equivalently,

$$\mu_j^v - \mu_i^{v - v_{ji}} - \beta\,\tau_{ji} \leq 0. \qquad (21)$$

Combining inequalities (19) and (21):

$$\mu_j^v \leq \mu_i^v + \beta\,\tau_{ji} + v_{ji} P_i. \qquad (22)$$

Finally, the constraint (18c) gives:

$$\nu_{ij} \leq \beta\,\tau_{ij} + \mu_j^{v - v_{ij}} - \mu_i^v \overset{(20)}{\leq} \beta\,\tau_{ij} + v_{ij} P_j + \mu_j^v - \mu_i^v \overset{(22)}{\leq} \beta\,\tau_{ij} + v_{ij} P_j + \beta\,\tau_{ji} + v_{ji} P_i.$$

Replacing $P_i = p_i + \beta$ and rearranging the terms:

$$\nu_{ij} \leq \beta(\tau_{ij} + \tau_{ji} + v_{ij} + v_{ji}) + v_{ij} p_j + v_{ji} p_i. \qquad (23)$$

Using this upper bound on the dual variables $\nu_{ij}$ and (17), we can upper bound the optimal prices.
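As a quick numerical sanity check on the closed-form price $\ell^* = (\ell_{\max}+\nu)/2$, one can verify by grid search that it maximizes the per-pair inner objective $\lambda(1-\ell/\ell_{\max})(\ell-\nu)$ of (16a) (a minimal sketch with made-up parameter values):

```python
def dual_objective(price, lam, l_max, nu):
    """Per-OD-pair term of the inner maximization in (16a):
    induced demand rate times the net margin (price - nu)."""
    return lam * (1.0 - price / l_max) * (price - nu)

# Grid search over candidate prices for illustrative values
# lambda = 100, l_max = 10, nu = 4 (our own numbers).
lam, l_max, nu = 100.0, 10.0, 4.0
grid = [l_max * k / 10000 for k in range(10001)]
best = max(grid, key=lambda p: dual_objective(p, lam, l_max, nu))
# The maximizer matches the closed form (l_max + nu) / 2.
```

The objective is a concave quadratic in the price, so the grid maximizer coincides with the analytical optimum $(\ell_{\max}+\nu)/2 = 7$ here.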