Self-Driving Scale Car Trained by Deep Reinforcement Learning
Authors: Qi Zhang, Tao Du, Changzheng Tian
Qi Zhang, North China University of Technology; Tao Du, North China University of Technology; Changzheng Tian, North China University of Technology

ABSTRACT
This paper considers the problem of self-driving algorithms based on deep learning. This is a hot topic because self-driving is the most important application field of artificial intelligence. Existing work focuses on deep learning, which can learn "end-to-end" self-driving control directly from raw sensory data, but this method is merely a mapping between images and driving commands. We instead use deep reinforcement learning to train a self-driving car in a virtual simulation environment created with Unity and then migrate the result to reality. Deep reinforcement learning gives the machine driving decision-making ability similar to a human's. The virtual-to-real training method efficiently handles the problem that reinforcement learning requires rewards from the environment, which in the real world could damage the car. We derive a theoretical model and analyze how Deep Q-learning can be used to control a car. We carry out simulations in the Unity virtual environment to evaluate performance. Finally, we successfully migrate the model to the real world and realize self-driving.

Keywords
Deep reinforcement learning, Unity, self-driving car, double deep Q networks

1. INTRODUCTION
The automotive industry is a special industry: to keep passengers safe, any accident is unacceptable. Reliability and security must therefore satisfy stringent standards, and the sensors and algorithms of a self-driving vehicle are required to be extremely accurate and robust. On the other hand, self-driving cars are products for average consumers, so their cost must be controlled.
High-precision sensors [1] can improve the accuracy of the algorithms but are very expensive; this is a difficult contradiction to resolve. Recently, the rapid development of artificial intelligence, especially deep learning, has produced major breakthroughs in fields such as image recognition and intelligent control. Deep learning techniques, typically convolutional neural networks, are widely used in many kinds of image processing, which makes them suitable for self-driving applications. Researchers have used deep learning to build end-to-end self-driving cars whose core is a neural network trained under supervision to learn a mapping relationship and thereby replicate driving skills [2]. While end-to-end driving is easy to scale and adapt, it has limited ability to handle long-term planning, which follows from the nature of imitation learning [3,4]. We prefer to let scale cars learn how to drive on their own rather than under human supervision, because this replication pattern has many problems, especially on the sensor side. The traffic accidents of Tesla were caused by failure of the perception module in a bright-light environment; deep reinforcement learning can still make appropriate decisions even when some modules fail [5]. This paper focuses on self-driving based on deep reinforcement learning: we modify a 1:16 RC car and train it with a double deep Q network. We use a virtual-to-reality process, which means training the car in a virtual environment and testing it in reality. In order to obtain a reliable simulation environment, we create a Unity simulation training environment based on OpenAI Gym. We set up a reasonable reward mechanism and modify the double deep Q-learning network to make the algorithm suitable for training a self-driving car. The car was trained in the Unity simulation environment for many episodes.
In the end, the scale car learned a good policy to drive itself, and we successfully transferred the learned policy to the real world.

Figure 1: The reinforcement learning Donkey car based on DDQN.

2. RELATED WORK
Our aim is to build a self-driving car trained by deep reinforcement learning. Currently, the most common methods to make a car drive itself are behavioral cloning and line following. At a high level, behavioral cloning uses a convolutional neural network to learn, through supervised learning, a mapping between car images (taken by the front camera) and steering-angle and throttle values. The other method, line following, uses computer-vision techniques to track the middle line and a PID controller to make the car follow it. Aditya Kumar Jain used CNN technology to build a self-driving car with a camera [6]. Kaspar Sakmann proposed a behavioral-cloning method [7], collecting human driving data through a camera and then learning to drive through a CNN, which is typical supervised learning. Kwabena Agyeman designed a car using linear regression and blob tracking. However, these are capabilities obtained under manual intervention; we hope that cars can learn to drive by themselves, which is a more intelligent way. In 1989, Watkins proposed the noted Q-learning algorithm, which is based on a Q table that records the value of each state-action pair and updates those values every episode. In 2013, Mnih, Volodymyr, et al. pioneered the concept of deep reinforcement learning [9] and successfully applied it to Atari games; in 2015 they improved the model [10]. DQN uses two identically structured networks: a behavior network and a target network. Although this improves the stability of the model, it does not solve Q-learning's problem of overestimating values.
To solve this problem, Hasselt proposed the Double Q-learning method; applied to DQN, it becomes Double DQN (DDQN) [11]. The idea of Double Q-learning is to implement the selection of actions and the evaluation of actions with different value functions. Recently, the approach of training reinforcement learning models with virtual simulation and then migrating them to reality has been widely verified. OpenAI has developed a robotic hand system called Dactyl [12] that trains AI in a virtual environment and finally applies it to a physical robot. In later research, the approach was verified on tasks such as picking up and placing objects [13], visual servoing [14], and agile locomotion [15], all indicating its feasibility. In 2019, Luo, Wenhan, et al. proposed an end-to-end active target tracking method based on reinforcement learning, which trained a robust active tracker in a virtual environment through a custom reward function and environment-augmentation techniques. From the above work we can see that many visual autopilot algorithms learn through a neural network under supervision, obtain a mapping relationship, and then use it for control; but this is not smart enough. Tesla's driverless accident was caused by perception-module failure in a bright-light environment, whereas reinforcement learning can keep working even when certain modules fail. Reinforcement learning also makes it easier to learn a sequence of behaviors: automated driving requires a series of correct actions to succeed. If the model is trained only on annotated data, each step's small offset can accumulate into a large one by the end, with devastating consequences; reinforcement learning can learn to correct the offset automatically. The key to a truly autonomous vehicle is self-learning, and adding more sensors does not solve the problem; it requires better coordination [16].
For these reasons, we use a deep reinforcement learning algorithm to build our self-driving car.

3. Proposed method
3.1 Self-driving scale car
Autonomous vehicles are usually composed of an on-board sensing system, a computer decision system, and a driving control system [17]. The function of the sensing system is to capture the surrounding environmental information and the vehicle's driving state, and to provide information support for decision control. According to the scope of perception, it can be divided into environmental-information perception and vehicle-state perception. Environmental information includes roads, pedestrians, obstacles, traffic control signals, and the vehicle's geographic location; vehicle information includes driving speed, gear position, engine speed, wheel speed, remaining fuel, and so on. According to the implementation technology, sensors can be divided into ultrasonic radar, video acquisition sensors, and positioning devices [18]. In our experiment we only need visual data as the sensing input. We use an RC car as the basis for retrofitting. The hardware used is:

Raspberry Pi 3: a low-cost computer with a 1.2 GHz processor and 1 GB of memory. It runs a customized Linux system, supports Bluetooth and WiFi communication, and offers rich support for protocols such as I2C as well as GPIO ports; it is the computational brain of our self-driving car.

Servo driver PCA9685: an I2C-controlled PWM driver with a built-in clock, used to drive the modified steering and throttle servos.

Wide-angle Raspberry Pi camera: resolution 2592 x 1944 with a 160-degree viewing angle. It is our only environmental sensing device, i.e. our eyes.

Other: following the design provided by the Donkey Car community, we 3D-printed a car bracket to carry the various hardware devices.
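For illustration, the conversion from a normalized steering command to a PCA9685 PWM tick count can be sketched as follows. The calibration endpoints (290 and 490 ticks) are hypothetical example values, not measurements from our car; each servo must be calibrated individually.

```python
def angle_to_pwm(steering, pwm_left=290, pwm_right=490):
    """Map a normalized steering command in [-1, 1] to a PCA9685
    12-bit PWM on-time tick count (hypothetical calibration values)."""
    steering = max(-1.0, min(1.0, steering))        # clamp to valid range
    center = (pwm_left + pwm_right) / 2.0
    half_span = (pwm_right - pwm_left) / 2.0
    return int(round(center + steering * half_span))

# Full left, center, full right
print(angle_to_pwm(-1.0), angle_to_pwm(0.0), angle_to_pwm(1.0))
```

In a real deployment the returned tick count would be written to the appropriate PCA9685 channel over I2C; the linear mapping itself is the only part shown here.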
Figure 2: A 1:16 scale car. There is an open-source DIY self-driving platform for small-scale cars called Donkey Car (see donkeycar.com).

3.2 Environment requirements
3.2.1 Donkey Car simulator
The first step is to create a high-fidelity simulator for the Donkey Car. Fortunately, someone from the Donkey Car community has generously created a Donkey Car simulator in Unity. However, it is specifically designed for behavioral learning (i.e., it saves the camera images with the corresponding steering angles and throttle values to a file for supervised learning) and does not cater for reinforcement learning at all. What we want is an OpenAI Gym-like interface where we can manipulate the simulated environment by calling reset() to reset the environment and step(action) to step through it. We made some modifications to make the simulator compatible with reinforcement learning. Since we write our reinforcement learning code in Python, we first have to figure out a way for Python to communicate with the Unity environment. It turns out that the Unity simulator created by Tawn Kramer also comes with Python code for communicating with Unity. The communication is done through the WebSocket protocol, which, unlike HTTP, allows two-way communication between server and client: our Python "server" can push messages directly to Unity (e.g., steering and throttle actions), and our Unity "client" can push information (e.g., states and rewards) back to the Python server.

3.2.2 Creating a customized OpenAI Gym environment for Donkey Car
The next step is to create an OpenAI Gym-like interface for training reinforcement learning algorithms. Those who have trained reinforcement learning algorithms before will be accustomed to using a set of APIs through which the RL agent interacts with the environment; the common ones are reset(), step(), is_game_over(), etc.
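A minimal skeleton of such an interface might look like the following. The class name and the stubbed-out simulator connection are illustrative, not the actual Donkey Car Gym wrapper; the real version exchanges actions and observations over the WebSocket link described above.

```python
import numpy as np

class DonkeySimEnv:
    """Gym-style wrapper sketch: reset()/step() would delegate to a
    simulator connection; here the connection is stubbed out."""

    def __init__(self):
        self.t = 0
        self.max_steps = 1000

    def reset(self):
        # The real wrapper sends a 'reset' message over the WebSocket
        # connection and waits for the first camera frame.
        self.t = 0
        return np.zeros((80, 80, 4), dtype=np.float32)

    def step(self, action):
        # action = (steering, throttle); the real wrapper pushes it to
        # Unity and receives (observation, reward, done) back.
        self.t += 1
        obs = np.zeros((80, 80, 4), dtype=np.float32)
        reward = 1.0            # e.g. +1 per step the car stays on track
        done = self.t >= self.max_steps
        return obs, reward, done, {}

env = DonkeySimEnv()
state = env.reset()
state, reward, done, info = env.step((0.0, 0.3))
```

With this shape in place, any standard RL training loop can drive the simulator without knowing about the underlying transport.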
We can customize our own Gym environment by extending the OpenAI Gym class and implementing the methods above. The resulting environment is compatible with OpenAI Gym, so we can interact with the Donkey environment through the familiar Gym-like interface. The environment also allows us to set frame skipping and to train the RL agent in headless mode (i.e., without the Unity GUI). We therefore have a virtual environment that we can use. We take the pixel images from the front camera of the Donkey car and perform the following transformations:
1. Resize the image from (120,160) to (80,80).
2. Convert it to grayscale.
3. Frame stacking: stack 4 frames from previous time steps together.
4. The final state has dimension (1,80,80,4).

3.3 Algorithm
3.3.1 The reinforcement learning model
Figure 3: The process of reinforcement learning.
Figure 3 shows the elements and processes of reinforcement learning. The agent takes an action and interacts with the environment; the environment returns a reward and moves to the next state. Through many interactions the agent gains experience and searches it for the optimal strategy. This interactive learning process is similar to the human learning style; its main features are trial and error and delayed return. The learning process can be represented by a Markov decision process (MDP), given by the tuple $(S, A, P, r)$, where
$$S = \{s_1, s_2, \dots\}, \qquad A = \{a_1, a_2, \dots\},$$
$$P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad r = r(s, a).$$
Here $S$ is the set of all states; $A$ is the set of all actions; $P$ is the state-transition probability, i.e., $P(s' \mid s, a)$ is the probability that taking action $a$ in state $s$ moves the agent to state $s'$; and $r$ is the reward function, giving the reward for taking action $a$ in state $s$.
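The four observation transformations listed in Section 3.2.2 can be sketched as follows. To keep the sketch self-contained it resizes with nearest-neighbor index sampling in NumPy; the actual pipeline could equally use cv2.resize.

```python
from collections import deque
import numpy as np

def preprocess(frame):
    """frame: (120, 160, 3) uint8 camera image -> (80, 80) float grayscale.
    Nearest-neighbor resize via index sampling, to avoid external deps."""
    gray = frame.mean(axis=2)                              # to grayscale
    rows = np.linspace(0, frame.shape[0] - 1, 80).astype(int)
    cols = np.linspace(0, frame.shape[1] - 1, 80).astype(int)
    return gray[np.ix_(rows, cols)] / 255.0                # resize + rescale

frames = deque(maxlen=4)                                   # frame stacking
for _ in range(4):
    raw = np.zeros((120, 160, 3), dtype=np.uint8)          # dummy camera frame
    frames.append(preprocess(raw))

state = np.stack(frames, axis=-1)[np.newaxis]              # (1, 80, 80, 4)
print(state.shape)
```

The deque with maxlen=4 automatically discards the oldest frame as new ones arrive, which is exactly the behavior frame stacking needs during an episode.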
The agent forms an interaction trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T)$ in each round of interaction with the environment. The cumulative return from time $t$ is
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \quad (1)$$
where $\gamma \in [0, 1]$ is the discount coefficient of the return, used to weigh current returns against long-term returns: the larger the value, the more attention is paid to long-term returns, and vice versa. The goal of reinforcement learning is to learn a policy that maximizes the expected cumulative return:
$$\pi^*(a \mid s) = \arg\max_{\pi} \mathbb{E}_{\pi}[G_t] \quad (2)$$
To solve for the optimal policy, the state value function and the action value function are introduced to evaluate how good a given state or action is. The state value function is defined as
$$V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right] \quad (3)$$
and the action value function as
$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right] \quad (4)$$
Methods for solving the value function and the action value function are based either on tables or on function approximation [19]. Traditional dynamic programming, Monte Carlo, and temporal-difference (TD) algorithms are all table methods: they create a table Q(s, a), with states as rows and actions as columns, and continuously update its values by iterative computation. When the state space is relatively small this is completely feasible, but when it is large the traditional method breaks down. Fitting the action value function with the approximation ability of a deep neural network, so that $Q(s, a) \approx Q(s, a; \phi)$, has therefore become a research hot spot. In 2013, DeepMind introduced the famous DQN algorithm [9], which opened a new era of deep reinforcement learning. The algorithm uses a convolutional neural network to approximate the action value function, taking the raw screen pixels as input to directly learn Atari game strategies.
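The tabular method described above can be illustrated with the classic Q-learning update rule. This is a toy sketch with made-up states, actions, and rewards, not the paper's driving setup:

```python
import random
from collections import defaultdict

# Q table: maps (state, action) -> estimated return
Q = defaultdict(float)
alpha, gamma = 0.1, 0.9          # learning rate and discount factor
actions = ["left", "straight", "right"]

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy interaction: only "straight" is rewarded while in state "lane"
random.seed(0)
for _ in range(200):
    a = random.choice(actions)
    q_update("lane", a, 1.0 if a == "straight" else 0.0, "lane")

print("Q(lane, straight) =", round(Q[("lane", "straight")], 2))
```

The same update is what DQN approximates with a neural network once the state space (here raw camera pixels) is far too large for a table.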
At the same time, DQN uses the experience replay mechanism [20]: training samples are stored in a memory pool, and each time a fixed-size batch is sampled at random to train the neural network. This removes the correlation between training samples and improves training stability.

3.3.2 Self-driving algorithm based on DDQN
Given a friendly training environment for reinforcement learning models, we plan to use a reinforcement learning algorithm as our control algorithm for automatic driving. We chose the DDQN algorithm because it is relatively simple to implement. Below we introduce this method and how to apply it to the autopilot model. In the DQN algorithm, the authors creatively proposed an approximate representation of the value function [9], successfully handling problems whose state space is too large for tabular methods.

Figure 4: A neural network is used to represent the state value function.

However, DQN does not necessarily guarantee convergence of the Q network; that is, we may not be able to obtain converged Q-network parameters, which would result in a poorly trained model. To address the overestimation underlying this problem, the DDQN algorithm proposed by van Hasselt et al. [11] decouples the selection of the target Q-value action from the calculation of the target Q value. DDQN has two Q-network structures like the DQN algorithm, but it no longer finds the maximum Q value over the actions directly in the target Q network; instead, it first finds the action with the maximum Q value in the current Q network, and then evaluates that action with the target Q network.

4. EXPERIMENT
4.1 Simulation
Essentially, we want our RL agent to base its output decision (i.e., steering) only on the location and orientation of the lane lines and neglect everything else in the background.
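The DDQN target computation described in Section 3.3.2 can be sketched in NumPy. Here random-looking arrays stand in for the two networks' outputs; the function itself is the generic DDQN rule, not our exact training code.

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Compute DDQN targets for a batch of transitions.

    q_online_next: (batch, n_actions) Q-values of s' from the current network
    q_target_next: (batch, n_actions) Q-values of s' from the target network
    The current network SELECTS the action; the target network EVALUATES it.
    """
    best_actions = np.argmax(q_online_next, axis=1)                   # selection
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]  # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated

# Tiny batch of 2 transitions with 3 actions (steer left / straight / right)
q_online = np.array([[0.1, 0.5, 0.2], [0.9, 0.0, 0.3]])
q_target = np.array([[0.2, 0.4, 0.1], [0.8, 0.1, 0.2]])
targets = ddqn_targets(q_online, q_target,
                       rewards=np.array([1.0, 1.0]),
                       dones=np.array([0.0, 1.0]))
```

In the first transition the online network picks action 1, but the value used is the target network's 0.4, which is what suppresses the overestimation of plain DQN; terminal transitions (dones = 1) reduce to the immediate reward.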
However, since we give it the full pixel camera images as inputs, it might overfit to the background patterns instead of recognizing the lane lines. This is especially problematic in real-world settings, where there might be undesirable objects lying next to the track (e.g., tables and chairs) and people walking around it. If we want to transfer the learned policy from the simulation to the real world, we should get the agent to neglect the background noise and focus only on the track lines. To address this problem, we created a pre-processing pipeline that segments out the lane lines from the raw pixel images before feeding them into the CNN. The procedure is as follows:
1. Detect and extract all edges using the Canny edge detector.
2. Identify straight lines through the Hough line transform.
3. Separate the straight lines into positively and negatively sloped groups (candidates for the left and right lines of the track).
4. Reject all straight lines that do not belong to the track, using the slope information.
Each resulting transformed image consists of 0 to 2 straight lines representing the lane, as illustrated below.
Figure 5: Segmented lane-line images.
We then took the segmented images, resized them to (80,80), stacked 4 successive frames together, and used the result as the new input state. We trained DDQN again with the new states, and the resulting RL agent was again able to learn a good policy to drive the car. With this setup, we trained DDQN for around 100 episodes on a single CPU and a GTX 1080 GPU; the entire training took around 2 to 3 hours, after which the car was able to learn a pretty good policy to drive itself.
Figure 6: The Donkey car in the Unity simulation. With the trained model, the car learns to drive and stays in the center of the lane most of the time.

4.2 Simulation to reality
We had a 3.5 m x 4 m physical track custom-made; it closely reproduces the Unity environment and resembles a real-life road (following China's drive-on-the-right standard). We modified the program to change the trained model's input from Unity's output to the camera's real-time input, configured the program on the Raspberry Pi, and finally tested it. The good news is that the car successfully followed the rules it needed to follow, kept to the right, and turned automatically.
Figure 7: Our road for the Donkey car.
Figure 8: The trained car driving itself.
We achieved the goal of autonomous driving, but after the image pre-processing was added, training took noticeably longer, and the learned policy was also less stable: the car wriggled frequently, especially when making turns. After analysis, we found that this is because segmentation discards useful background information and line-curvature information. In return, the agent is less prone to overfitting and can generalize to previously unseen, real-world tracks. Our paper demonstrates that deep reinforcement learning, coupled with training in the Unity simulator, produces a car that drives itself within acceptable tolerances.

5. CONCLUSION
In this article, we use a reinforcement learning algorithm, plus virtual simulation training in Unity, to build a model that can drive autonomously with just one camera, and the resulting autonomous car completes the established driving goals. Training a reinforcement learning model in a virtual environment and then transferring it to real life is a very feasible approach.

6. REFERENCES
[1] Janai, J., Güney, F., Behl, A., et al.
Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art. 2017.
[2] Abraham, Hillary, Chaiwoo Lee, Samantha Brady, Craig Fitzgerald, Bruce Mehler, Bryan Reimer, and Joseph F. Coughlin. "Autonomous vehicles, trust, and driving alternatives: A survey of consumer preferences." Massachusetts Inst. Technol, AgeLab, Cambridge (2016): 1-16.
[3] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[4] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
[5] Lin, Long-Ji. Reinforcement learning for robots using neural networks. No. CMU-CS-93-103. Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
[6] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double q-learning." Thirtieth AAAI Conference on Artificial Intelligence. 2016.
[7] Jain, Aditya Kumar. "Working model of Self-driving car using Convolutional Neural Network, Raspberry Pi and Arduino." 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). IEEE, 2018.
[8] Santana, Eder, and George Hotz. "Learning a driving simulator." arXiv preprint arXiv:1608.01230 (2016).
[9] https://openmv.io/blogs/news/linear-regression-line-following
[10] https://medium.com/@ksakmann/behavioral-cloning-make-a-car-drive-like-yourself-dc6021152713
[11] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
[12] Watkins, Christopher John Cornish Hellaby. "Learning from delayed rewards." (1989).
[13] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529.
[14] Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
[15] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double q-learning." Thirtieth AAAI Conference on Artificial Intelligence. 2016.
[16] Luo, Wenhan, et al. "End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[17] James, Stephen, Andrew J. Davison, and Edward Johns. "Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task." arXiv preprint arXiv:1707.02267 (2017).
[18] Andrychowicz, Marcin, et al. "Learning dexterous in-hand manipulation." arXiv preprint arXiv:1808.00177 (2018).
[19] Sadeghi, Fereshteh, et al. "Sim2real view invariant visual servoing by recurrent control." arXiv preprint.
[20] Tan, Jie, et al. "Sim-to-real: Learning agile locomotion for quadruped robots." arXiv preprint arXiv:1804.10332 (2018).
[21] Kendall, Alex, et al. "Learning to Drive in a Day." arXiv preprint arXiv:1807.00412 (2018).
[22] Dörr, Dominik, David Grabengiesser, and Frank Gauterin. "Online driving style recognition using fuzzy logic." 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2014.
[23] Bojarski, Mariusz, et al. "End to end learning for self-driving cars." arXiv preprint arXiv:1604.07316 (2016).
[24] Codevilla, Felipe, et al. "End-to-end driving via conditional imitation learning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[25] Shalev-Shwartz, Shai, Shaked Shammah, and Amnon Shashua. "Safe, multi-agent, reinforcement learning for autonomous driving." arXiv preprint arXiv:1610.03295 (2016).