Autonomous Identification and Goal-Directed Invocation of Event-Predictive Behavioral Primitives
Authors: Christian Gumbsch, Martin V. Butz, Georg Martius
A Preprint

Christian Gumbsch, Max Planck Institute for Intelligent Systems & University of Tübingen, Tübingen, Germany, christian.gumbsch@tuebingen.mpg.de
Martin V. Butz, University of Tübingen, Tübingen, Germany, martin.butz@uni-tuebingen.de
Georg Martius, Max Planck Institute for Intelligent Systems, Tübingen, Germany, georg.martius@tuebingen.mpg.de

October 30, 2024

Abstract

Voluntary behavior of humans appears to be composed of small, elementary building blocks or behavioral primitives. While this modular organization seems crucial for the learning of complex motor skills and the flexible adaptation of behavior to new circumstances, the problem of learning meaningful, compositional abstractions from sensorimotor experiences remains an open challenge. Here, we introduce a computational learning architecture, termed surprise-based behavioral modularization into event-predictive structures (SUBMODES), that explores behavior and identifies the underlying behavioral units completely from scratch. The SUBMODES architecture bootstraps sensorimotor exploration using a self-organizing neural controller. While exploring the behavioral capabilities of its own body, the system learns modular structures that predict the sensorimotor dynamics and generate the associated behavior. In line with recent theories of event perception, the system uses unexpected prediction error signals, i.e., surprise, to detect transitions between successive behavioral primitives. We show that, when applied to two robotic systems with completely different body kinematics, the system manages to learn a variety of complex and realistic behavioral primitives.
Moreover, after initial self-exploration the system can use its learned predictive models progressively more effectively for invoking model-predictive planning and goal-directed control in different tasks and environments.

Keywords: sensorimotor learning, developmental robotics, event cognition, skill acquisition and planning, self-organizing behavior

1 Introduction

Opening the fridge, grasping the milk and drinking from the bottle – behavioral sequences, composed of multiple, smaller units of behavior, are ubiquitous in our minds [1, 2, 3]. More generally speaking, we humans seem to organize our behavior and the accompanying perception into small, compositional structures in a highly systematic manner [4]. These structures are often referred to as building blocks of behavior or behavioral primitives and can be viewed as elementary units of behavior above the level of single motor commands [5]. A large challenge for the brain as well as artificial cognitive systems lies in the effective segmentation of our continuous perceptual stream of sensorimotor information into such behavioral primitives. When does a particular behavior commence? When does it end? How are individual behavioral primitives encoded compactly? In most cognitive systems approaches so far, behavioral primitives are segmented by hand, pre-programmed into the system, or learned by demonstration [6, 7, 8, 9]. In all cases, though, the primitives are made explicit to the system, that is, the learning system does not need to identify the primitives autonomously. Our brain, however, seems to identify such primitives on its own, starting with bodily self-exploration. Here, we introduce a computational architecture, termed SUrprise-based Behavioral MODularization into Event-predictive Structures (SUBMODES), that learns behavioral primitives as well as behavioral transitions completely from scratch.
The SUBMODES architecture learns such primitives by exploring the behavioral repertoire of an embodied agent. Initial exploration is realized by a closed-loop control scheme that adapts quickly to the sensorimotor feedback. In particular, we use differential extrinsic plasticity (DEP) [10], which causes the agent to explore body-motor-environment interaction dynamics. DEP essentially fosters the exploration of coordinated, rhythmical sensorimotor patterns, including a tendency to ‘zoom’ into particular dynamic attractors, stay and explore them for a while, and upon small perturbations leave one attractor in favor of another one. Starting with this self-exploration mechanism, the algorithm learns internal models that are trained to predict the motor commands and the resulting sensory consequences of the currently performed behavior. The SUBMODES system uses an unexpected increase in prediction error to detect the transition from one behavioral primitive to another. If such a ‘surprising’ error signal is perceived, the internal predictive model either switches to a previously learned model or a new model is generated if the behavior was never experienced before. In this way, the agent systematically structures its perceived continuous stream of sensorimotor information online into modular, compositional models of behavioral primitives as well as predictive event-transition models. We show that a large variety of behavioral primitives can be learned from scratch even in robotic systems that have many degrees of freedom and interact with complex, noisy environments. Moreover, we show that after initial self-exploration the agent can use its learned predictive models progressively more effectively for invoking goal-directed planning and control.
In effect, the system learns predictive behavioral primitives and event-transition models to invoke hierarchical, model-predictive planning [11, 12], anticipating the sensory consequences of the available behaviors and choosing those behavioral primitives that are believed to bring the system closer to a desired goal state. In sum, the main contributions of this work are as follows: (i) we show how a self-organizing behavior control principle can be utilized to systematically explore the sensorimotor abilities of embodied agents; (ii) we introduce an online event segmentation mechanism, which automatically structures the generated sensorimotor experiences into predictive behavioral and event-transition encodings; (iii) we show how such encodings can be used for hierarchical planning and goal-directed behavioral control. We evaluate the novel techniques in complex, simulated robots that are acting in noisy, physics-based environments.

2 System Motivation and Related Work

The problem of abstracting our sensorimotor experiences into conceptual, compositionally meaningfully re-combinable units of thought is a long-standing challenge in cognitive science, including cognitive linguistics, cognitive robotics, and neuroscience-inspired models [11, 1, 3, 13, 14, 15, 16]. One important type of such units concerns concrete behavioral interactions with the environment, regardless of whether they lead to transitive motions of the body or of other objects. Depending on the level of abstraction and the field of research, different synonyms can be found in the literature [2], such as ‘behavioral primitives’ [5], ‘movement primitives’ [6], ‘motor primitives’ [7], ‘motor schemas’ [17], or ‘movemes’ [18]. It has been suggested that our ability to serially combine these compositional elements is crucial for our ability to quickly learn complex motor skills and to flexibly adjust our behavior to new tasks [6].
Furthermore, the assumption that there exists a limited repertoire of behavior has been proposed as a way to deal with the curse of dimensionality and redundancy at different levels of the motor hierarchy, moving from simple behavioral primitives towards an ontology of more sophisticated interaction complexes [2, 19, 8, 9]. Although the acquisition and application of behavioral primitives has been extensively studied in cognitive robotics and related fields, it is still not clear how we discover, encode, and ultimately use these behavioral primitives for the effective invocation of goal-directed behavioral control. The remainder of this section is structured as follows: In Section 2.1 we introduce cognitive and computational theories on how goal-directed behavioral control is learned by both humans and artificial systems. In Section 2.2 we provide an overview on how continuous sensorimotor information can be converted into compositional, temporally predictive encodings of behavior. Finally, in Section 2.3 we outline how these compositional abstractions can guide higher-order, hierarchical planning.

2.1 Goal-directed behavioral control

According to the Ideo-Motor Principle [20, 21, 22], encodings of behavior are closely linked to their sensory effects. The main idea is that initially purely reflex-like actions are paired with the sensory effects they cause. At a later point in time, when the previously learned effects become desirable, the behavior can be applied again [22, 1]. While the Ideo-Motor Principle was heavily criticized and ridiculed during the beginning of the 20th century and in the era of Behaviorism, it has seen a revival over the last decades in various fields of cognitive science, as, for example, manifested in the propositions of the Anticipatory Behavioral Control (ABC) theory [21] as well as the Theory of Event Coding (TEC) [23].
TEC suggests that perceptual information and action plans are encoded in a common representation. According to TEC, actions and their consequent perceptual effects are encoded in a common predictive network, which allows the anticipation of perceptual action consequences and the inverse, goal-directed invocation of the associated motor commands. TEC implies that behavior is primarily learned with respect to the effects that it produces. The ABC theory focuses even more on the learning of sensorimotor structures. According to ABC, the critical conditions for the application of an action-effect encoding are learned by focusing on (unexpected) perceptual changes, which lead to a further differentiation of conditional structures [24]. For example, it can be learned that an object first needs to be in reach before we are able to grasp it [1, 25]. In sum, both theories emphasize that our brain encodes behavior with respect to the effect it entails and it does so because the resulting structures enable the selective and highly flexible activation of an action-effect complex depending on the current context and desired goal states. Along similar lines, Wolpert and Kawato have proposed that our brain may learn modular forward-inverse model pairs to acquire progressively more complex motor skills [26]. The proposition was implemented later on in the MOSAIC system [27]. The MOSAIC system learns sets of discrete, internal models, each consisting of a forward model, which predicts the sensory consequence of an action, and a paired inverse model, which generates the required motor commands. For each internal model, the forward model is used to determine which behavior is most likely responsible for the observed sensory dynamics, while the inverse model can generate the associated motor commands. The learning of behavioral control has also been examined within the reinforcement learning (RL) framework [28].
In RL one particular control policy is trained to maximize given rewards. Under appropriate conditions, such a policy can correspond to a particular behavioral primitive trained on a specific task. The learning and task-dependent optimization of movement primitives has, for example, been investigated in an Actor-Critic framework [6]. It has been shown that complex movement primitives in realistic settings, such as ‘hitting a baseball with a bat’, can achieve nearly optimal performance when applying policy gradient based optimization [29]. Various alternative approaches have been investigated and contrasted [30, 7, 31, 8, 32]. In all cases, the beginning and end of a movement primitive is predefined and not autonomously discovered by the system itself. Furthermore, classical model-free RL methods typically require much more time to learn complex behavioral dynamics than predictive, model-based approaches.

2.2 Learning sensorimotor abstractions

While the outlined theories give an account of how behavior can be encoded, they do not explain how the continuous stream of sensorimotor information may be structured systematically to infer the underlying behavioral primitives. Event segmentation theory (EST) [33] gives a concrete formulation of how our brain might be able to segment the perceptual stream into discrete representations. According to EST, humans perceive activity in terms of discrete conceptual events. An event is defined as “a segment of time at a given location that is conceived by an observer to have a beginning and an end” [34, p. 3]. This definition of an event is rather general, containing both short sensorimotor events, such as ‘grasping a mug’, but also potentially long segments with multiple agents and ongoing activities, e.g., a concert. When considering the learning of behavioral primitives, we can focus solely on the individual sensorimotor level of events.
According to EST, our perceptual process is guided by a set of internal models, which continuously predict what is perceived next. A specific set of event models is active over the course of one event, i.e., until a transient increase in prediction error occurs. Such a transient error signal may result in a change in the currently active internal models. EST further suggests that such a prediction error-based segmentation mechanism might occur on different levels of abstraction, resulting in a hierarchical, taxonomic organization of events [33, 34]. Hence, according to EST a cognitively plausible way to conceptualize the continuous sensorimotor stream into compositional behavioral models is based on transient error signals of internal predictive models – essentially a more concrete formalism that dovetails with the ABC theory. Segmentation mechanisms based on transient prediction error signals have been studied in various computational models: Predicting movements in video sequences of actors performing everyday motions, paired with the dedicated processing of transient prediction error signals, led to the discovery and encoding of simple movement primitives in a recurrent neural network [35]. Similarly, learning predictive models and using an unexpected increase in prediction error has been used to learn forward models of different object interaction events in simple, physics-based simulation environments [36, 37]. In both systems, the prediction error-based detection mechanism works online. The basic principle can be closely related to a surprise-based perceptual processing mechanism, which has been shown to segment a hierarchically structured environment (four-rooms problem) into its sub-components (individual rooms) even in the case of very high noise [38].
Related mechanisms that use perceptual prediction errors or prediction confidence to gate the learning signal while learning different types of behavior have been applied in various control systems [27, 39, 40, 41, 42, 43]. Mechanisms that focus on learning progress or more graph-based algorithms to detect transitions have been proposed as well [44, 45, 46, 47].

2.3 Planning based on hierarchical structures

Learning temporal abstractions of behavior enormously simplifies goal-directed planning in high-dimensional systems. If the right behavioral primitives are available, rather complex tasks, such as ‘drinking from a mug’, can be decomposed into a sequence of primitives (‘reaching’, ‘grasping’, ‘lifting’, etc.). This drastically reduces the search space for planning and control [5, 2, 44]: Instead of choosing a motor command from the entire space of possible motor actions, once the next primitive is identified, a much smaller subspace of actions can be analyzed to determine the next motor command. From a predictive coding-inspired, neuro-robotics perspective, hierarchical behavioral planning was, for instance, implemented in a recurrent neural network architecture [42]. A two-level hierarchy is employed where the levels interact in a bottom-up and top-down manner: The higher level produces top-down expectations of the ongoing behavior, essentially encoding sequences of behavioral primitives. The lower level produces sensorimotor predictions based on the perceptual input and the top-down estimations. Prediction errors from the low level are, in turn, used to update activity of the high level in a bottom-up fashion. Related approaches integrate multiple time-scales for the adaptation within the different levels of the hierarchy [48, 49, 50]. Discovering behavioral primitives and applying them for high-level goal-directed control is closely related to hierarchical RL and the options framework [11, 51, 52].
An option is defined as a “generalization of primitive actions to include temporally extended courses of action” ([52], p. 186). In the right setting, i.e., an embodied, robotic agent with an elementary action corresponding to a single motor command, an option can resemble either a behavioral primitive or a series of behavioral primitives, e.g., ‘grasping an object’. In the options framework a particular option is typically defined with respect to a specific subgoal state. For example, the ‘grasping an object’ option might terminate when the object is held by the hand of the agent. An option can then be trained by comparing the outcome of performing the option with the desired subgoal to determine a pseudo-reward and updating the internal structures reward-dependently [51]. While recent implementations of hierarchical deep RL have shown remarkable performance in rather challenging video gaming tasks [53], self-motivated behavioral exploration and effective subgoal identification remain open challenges.

3 Overview of the SUBMODES architecture

We propose a computational architecture, termed SUrprise-based Behavioral MODularization into Event-predictive Structures (SUBMODES), to discover behavioral primitives and learn event-predictive models of the corresponding behavior for an embodied agent completely from scratch. The SUBMODES architecture uses different modular components to explore and learn behavioral primitives and detect transitions in behavior, illustrated in Fig. 1. In this section we give an overview of the system. In Appendices A–D further algorithmic details are provided. In Appendix F the system is described in terms of pseudocode. The SUBMODES architecture is composed of different modular components, responsible for exploring behavior, learning models for different behavioral modes, and detecting and encoding transitions in behavior.
The different behavioral primitives learned by the system are encoded in behavioral models of our learning architecture. These models receive sensorimotor perceptions about the agent as an input and produce a predicted sensorimotor state, anticipating future sensorimotor perceptions and actions. We assume that the system switches between its behavioral modes in a predictable fashion, whereby the occurrence of such transitions is detected by error models. Upon detecting a transition, transition models are trained to encode the critical conditions that enable such a change in behavior and the sensory consequences thereof. Initially, behavioral exploration is bootstrapped by an explorative controller and the behavioral models are trained on the perceived sensorimotor experiences. At a later phase, the explorative controller is deactivated and the system can use its learned representations of behavior for anticipatory goal-directed control.

Figure 1: Illustration of the SUBMODES architecture during the learning of behavior. An explorative controller generates motor commands based on the current proprioceptive input to explore self-organizing behavior. One of multiple, internal behavioral models attempts to predict the motor commands and sensory consequences of the ongoing behavior. The predicted sensorimotor state is compared to the actual state to compute the prediction error and update the active behavioral model. For each behavioral model an error model is trained, estimating the prediction confidence. If surprise is detected, i.e., a strong error signal outside the usual prediction confidence, the system is allowed to exchange the active behavioral model. For each transition between two different behavioral models a transition model is learned. During goal-directed control, the explorative controller is deactivated and the active behavioral model determines the next action (dashed line).
The SUBMODES system learns behavioral primitives based on the experienced sensorimotor time series. We bootstrap this learning process by invoking motor commands via a neural network controller that is updated using differential extrinsic plasticity (DEP) [10]. At every discrete time step t the controller transforms proprioceptive sensor values x(t) = (x_1, x_2, ..., x_n) into motor commands y(t) = (y_1, y_2, ..., y_m). Here, we use a one-layered feed-forward neural network, as

    y_i(t) = tanh( Σ_{j=1}^{n} W_ij x_j(t) + h_i ),    (1)

for a motor neuron i, with W_ij the weight connecting input j with the output neuron i and a bias term h_i. With fixed weights W the controller would continuously generate motor commands corresponding to one particular behavioral pattern. However, the network weights W_ij are constantly changed by applying the DEP learning rule. This learning rule essentially updates the weights based on correlations of sensoric velocities over some time lag φ, i.e.,

    ΔW(t) ∝ M ẋ(t − φ) ẋ(t)^T,    (2)

with M an inverse model describing the relationship between motor actions and proprioceptive sensor values (details in Appendix A). Besides the weight updates, changes in behavior can also arise from a bias dynamics, which after some time of inactivation shifts the bias value h_i for the most inactive motor neurons i. When applying the explorative controller using the DEP learning rule to an embodied agent, the controller typically discovers different dynamic sensorimotor attractors, which correspond to behavioral dynamics that unfold relatively uniformly over time. These behavioral dynamics can be seen as behavioral primitives, since they typically correspond to simple elementary actions like ‘crawling’, ‘shaking hands’ or ‘wiping a table’ [54]. However, upon perturbations the controller might leave one sensorimotor attractor and some time later discover a new one, resulting in a change in behavior.
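The controller of Eqs. (1) and (2) can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation: the class name, the learning rate eta, and the choice of a plain identity-like inverse model M are illustrative assumptions (the actual details are deferred to Appendix A).

```python
import numpy as np

class DEPController:
    """Minimal sketch of the explorative controller (Eq. 1) with a
    DEP-style weight update (Eq. 2). `eta` and the identity-like
    inverse model `M` are illustrative assumptions."""

    def __init__(self, n_sensors, n_motors, phi=5, eta=0.01):
        self.W = np.zeros((n_motors, n_sensors))  # controller weights W_ij
        self.h = np.zeros(n_motors)               # bias terms h_i
        self.M = np.eye(n_motors, n_sensors)      # simple stand-in inverse model
        self.phi = phi                            # time lag for velocity correlation
        self.eta = eta
        self.x_hist = []                          # sensor history for x_dot(t - phi)

    def act(self, x):
        """Eq. 1: y_i(t) = tanh(sum_j W_ij x_j(t) + h_i)."""
        return np.tanh(self.W @ x + self.h)

    def update(self, x):
        """Eq. 2: Delta W(t) ~ M x_dot(t - phi) x_dot(t)^T (outer product)."""
        self.x_hist.append(np.asarray(x, dtype=float))
        if len(self.x_hist) < self.phi + 2:
            return  # not enough history for the delayed velocity yet
        x_dot_now = self.x_hist[-1] - self.x_hist[-2]
        x_dot_del = self.x_hist[-1 - self.phi] - self.x_hist[-2 - self.phi]
        self.W += self.eta * np.outer(self.M @ x_dot_del, x_dot_now)
```

Because the motor output passes through tanh, commands stay within the [−1, +1] range expected by the simulated actuators, while the correlation-driven weight update lets rhythmic sensorimotor patterns self-amplify.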
Such perturbations can be caused by a sudden change in the interaction of the agent with its environment, e.g., by hitting an obstacle, or by changes within the sensorimotor loop, e.g., the activation of a bias neuron. This property makes the DEP-controller an ideal candidate for behavioral exploration of a complex, embodied agent.

The SUBMODES architecture encodes the explored behavioral primitives through a set of modular, predictive behavioral models B. One behavioral model B_i ∈ B attempts to encode one particular behavioral primitive previously demonstrated by the explorative controller. Each model B_i is a single-layered neural network (no hidden layer) receiving the current sensory state x(t) as an input and predicting the next motor command y′(t) and the sensory consequence of this particular action Δx′(t + 1). At a certain point in time t only one model B(t) = B_i is active. The sensorimotor predictions produced by the active model B(t) are compared to the perceived change in sensory values Δx(t + 1) and motor command y(t) and the prediction error is computed as the deviation between prediction and sensation. The error signal is then used to update the active model B_i using delta-rule based gradient descent. To maintain minimal statistics about the accuracy of the sensory predictions, the system contains a set of error models E. For each behavioral model B_i an error distribution E_i ∈ E is learned, which is estimated by means of a normal distribution. Each error model E_i maintains a moving average ē_i(t) and variance σ̄_i(t) of the sensory prediction error, thus estimating the first two moments of the prediction error for each behavioral model. We assume that changes in behavior result in a strong, unexpected increase in the sensory prediction error e(t) for the currently predicting model.
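The behavioral and error models described above can be sketched as follows. This is a hedged sketch: the class names, the learning rate, and the exponential smoothing factor gamma are illustrative assumptions; the paper defers the exact update details to the appendices.

```python
import numpy as np

class BehavioralModel:
    """Sketch of one behavioral model B_i: a single-layer network mapping
    the sensory state x(t) to a predicted motor command y'(t) and a
    predicted sensory change Delta x'(t+1). `lr` is an assumption."""

    def __init__(self, n_sensors, n_motors, lr=0.01):
        self.V = np.zeros((n_motors + n_sensors, n_sensors))  # output = [y'; dx']
        self.b = np.zeros(n_motors + n_sensors)
        self.lr = lr
        self.n_motors = n_motors

    def predict(self, x):
        out = self.V @ x + self.b
        return out[:self.n_motors], out[self.n_motors:]  # y'(t), dx'(t+1)

    def update(self, x, y, dx):
        """Delta-rule gradient step on the sensorimotor prediction error."""
        target = np.concatenate([y, dx])
        err = target - (self.V @ x + self.b)
        self.V += self.lr * np.outer(err, x)
        self.b += self.lr * err
        return float(np.linalg.norm(err))

class ErrorModel:
    """Error model E_i: moving average and variance of the prediction
    error, i.e., its first two moments. `gamma` is an assumption."""

    def __init__(self, gamma=0.95):
        self.mean, self.var, self.gamma = 0.0, 1.0, gamma

    def update(self, e):
        self.mean = self.gamma * self.mean + (1 - self.gamma) * e
        self.var = self.gamma * self.var + (1 - self.gamma) * (e - self.mean) ** 2

    @property
    def std(self):
        return self.var ** 0.5
```

Repeatedly training a model on the same behavioral pattern shrinks its prediction error, which is exactly what lets the error model's confidence region tighten around well-practiced primitives.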
The system detects such a surprise for time step t if the error is outside of a certain confidence region of the error statistics,

    e(t) > ē_i(t) + θ σ̄_i(t),    (3)

with e(t) the current sensory prediction error¹, ē_i(t) the moving average and σ̄_i(t) the moving error deviation of the currently active behavioral model B_i, and θ the threshold [38]. If a surprise signal is detected, the system is allowed to switch its active behavioral model B(t). To determine the new model, the system enters a searching period. In this mode, the mean prediction error of all existing models is monitored and if there is one which shows a non-surprising error (determined by Equation 3), this model takes over. If after a maximum amount of time steps the mean prediction error of every model is considered surprising, a new model B_j is generated and added to B. In this way, the system is able to switch between previously learned behavioral models and to generate new models on the fly. While transitions in behavior are initially detected based on strong increases in prediction error, we assume that the system switches predictably between such behavioral primitives. For example, some transitions in behavior may only occur in a specific context; for instance, a transition from walking to swimming may only occur in shallow water. To model the critical conditions leading to a transition in behavior and, thus, enable the system to accurately predict such a transition, we train a set of transition models T. For each transition from model B_i to model B_j a transition model T_{i→j} ∈ T is trained. One transition model T_{i→j} attempts to identify the sensory state that allows this particular transition in behavior to take place and learns to predict how such a transition typically unfolds. Transition models are updated once a transition in behavior occurs (further described in Appendix B).
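The surprise test of Eq. (3) and the searching period can be condensed into two small functions. A sketch under stated assumptions: representing each model's error statistics as a (mean, std) pair and returning None to signal "create a new model" are illustrative choices, not the authors' interface.

```python
def surprised(e, mean_e, std_e, theta=3.0):
    """Eq. 3: flag a surprise when the current prediction error exceeds
    the confidence region mean + theta * std of the active model."""
    return e > mean_e + theta * std_e

def searching_period(mean_errors, error_stats, theta=3.0):
    """Sketch of the searching period: return the index of the first
    existing model whose recent mean error is non-surprising, or None
    to signal that a new behavioral model should be generated.
    `error_stats` holds one (mean, std) pair per model (an assumption)."""
    for i, (e, (mean_e, std_e)) in enumerate(zip(mean_errors, error_stats)):
        if not surprised(e, mean_e, std_e, theta):
            return i
    return None
```

In the full system the candidate errors would be averaged over a short window (the paper uses 25 time steps) before this test is applied, so a single noisy sample does not trigger a spurious model switch.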
Hence, by learning models of transitions in behavior, the SUBMODES architecture not only learns how one stable behavioral primitive unfolds – encoded by its behavioral models B – but also how different behavioral primitives are connected through transitions in behavior – encoded through transition models T.

After an initial exploration and learning of behavioral abilities, the SUBMODES architecture can perform model-predictive planning to generate goal-directed behavior. The predictive design of the internal models allows the system to directly use its learned structures for goal-directed control by minimizing the difference between anticipated and desired perceptions. For goal-directed behavioral control, the motor command y(t) is determined directly by the active behavioral model B(t). To plan behavior the system receives a desired sensory goal state x_G(t) at every time step t. The system first considers which subset of behavioral models B(t) ⊆ B is applicable given the current sensory state, using its transition models T. Then, the system ‘imagines’ how the sensorimotor time series will unfold for each applicable behavior B_j ∈ B(t) over a fixed time horizon (details in Appendix C). By comparing the predicted time series with the goal state, the system can activate the behavioral model whose predictions are closest to the goal state.

4 Simulations

The experiments were conducted in the physically realistic rigid body simulator LPZROBOTS [56]. We tested the SUBMODES system on two robots, the Spherical robot and the Hexapod. The system was updated with a frequency of 50 Hz, each time receiving new sensor readings and setting motor commands. The Spherical robot, illustrated in Fig. 2, has a ball-shaped body that contains three internal masses.
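The model selection by imagined rollouts can be sketched as follows, assuming each behavioral model exposes a predict(x) returning (y', dx') as above; the function name, the Euclidean distance to the goal, and comparing only the final imagined state are illustrative simplifications of the procedure detailed in Appendix C.

```python
import numpy as np

def plan(x, goal, applicable_models, horizon=50):
    """Sketch of the model-predictive selection step: roll each applicable
    behavioral model forward in imagination for a fixed horizon and pick
    the one whose imagined final sensory state lies closest to the goal."""
    best, best_dist = None, np.inf
    for model in applicable_models:
        x_sim = np.asarray(x, dtype=float).copy()
        for _ in range(horizon):
            _, dx = model.predict(x_sim)   # imagined sensory change
            x_sim = x_sim + dx             # imagined next sensory state
        dist = float(np.linalg.norm(x_sim - goal))
        if dist < best_dist:
            best, best_dist = model, dist
    return best
```

The transition models restrict which models enter `applicable_models` in the first place, so planning only searches over behaviors that are actually reachable from the current sensory context.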
The actuators move the masses along the axes, where the target locations are specified by the motor commands: 0 corresponding to a centered position and +1 or −1 to the outer positions. As sensory information the projection of the axes' directions onto the z-component of the world-coordinate system is available, illustrated in Fig. 2 (b). The robot is equipped with a spherical head atop of its body to visualize the current rolling direction of the Spherical robot. There is no physical interaction between the head and the body. The head always ‘hovers’ above the body and rotates around its z-axis to face the current rolling direction. The Hexapod is a six-legged robot inspired by a stick insect. It has 18 actuated degrees of freedom, 3 in each leg. Like in real stick insects, each leg is partitioned into three parts: femur, tibia, and tarsus. The femur is connected to the body by a two-dimensional coxa joint, which is able to perform forward-backward and upward-downward rotations of the leg with respect to the body. Femur and tibia are connected by a one-dimensional knee joint, which is able to rotate the tibia upward or downward with respect to the femur. The motor values correspond to nominal angles of the joints, where −1 is associated with the minimal joint angle and +1 with the maximal angle. Tarsi and antennae are attached by spring joints and are not actuated. For both robots, the SUBMODES system receives the current proprioceptive sensory information as an input.

¹ In practice, we compute e(t) over a short time frame of 25 time steps.

Figure 2: Spherical robot and its axis orientation sensors. (a) shows a screenshot from simulation. (b) shows a schematic illustration of how the axis orientation sensor values x_i are determined (taken from [55]).
When using the Hexapod, the delayed sensor values of the 12 coxa joints, with a small temporal delay of δ = 8 time steps, are additionally provided. Besides the proprioceptive sensory information, the velocity of the robot's body movement v and the current orientation α are available as sensory input. The orientation α is provided in the form of sin(α) and cos(α). Gaussian distributed noise is added to the proprioceptive sensor values (σ = 0.05) and motor commands (Spherical: σ = 0.05, Hexapod: σ = 0.1).

5 Results

5.1 Learned behavioral primitives

In a first test, we examined which behavior is generated by the DEP-controller for the different robots, and how the SUBMODES architecture segments the explored stream of sensorimotor information into different behavioral primitives. For that purpose, we let the SUBMODES system explore different behaviors for 90 minutes of simulation time. The Spherical robot was tested in a large quadratic arena surrounded by walls. When applied to the Spherical robot, the DEP-controller typically generates different rolling motions, where one of its internal masses is kept fixed at the center of the respective axis, while the other two masses periodically oscillate with a certain phase shift. Thereby, the robot's body rotates around one of its axes, while this axis is kept approximately parallel to the ground. If the robot hits a wall, the sensorimotor dynamics are strongly perturbed. These strong perturbations of the dynamics are amplified by the DEP learning rule, which can result in the generation of a new rolling behavior. If the robot continues one rolling motion long enough, the bias dynamics that we added to the original DEP-controller is activated and the previously centered weight is shifted to one side. This results in a turning motion where the robot turns either left or right while rotating around the axis with the shifted internal mass.
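The Hexapod's sensor preprocessing described above (Gaussian noise with σ = 0.05 on the proprioceptive values plus a copy of the coxa angles delayed by δ = 8 steps) can be sketched with a simple ring buffer. The class name and buffer handling are illustrative assumptions; only the noise level and delay come from the text.

```python
import numpy as np
from collections import deque

class SensorPipeline:
    """Sketch of the Hexapod sensor preprocessing: Gaussian noise on the
    proprioceptive values (sigma = 0.05) plus a delayed copy of the coxa
    joint angles (delta = 8 time steps). Buffer handling is an assumption."""

    def __init__(self, delay=8, sigma=0.05, rng=None):
        self.buffer = deque(maxlen=delay + 1)  # holds the last delay+1 readings
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def step(self, coxa_angles):
        noisy = coxa_angles + self.rng.normal(0.0, self.sigma, len(coxa_angles))
        self.buffer.append(noisy)
        delayed = self.buffer[0]               # values from delta steps ago
        return np.concatenate([noisy, delayed])
```

Feeding both the current and the delayed joint angles gives the otherwise memoryless single-layer models a short temporal context, which is presumably why the delayed coxa values are provided.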
In 90 minutes of simulation time of exploring behavior for the Spherical robot, the SUBMODES system learned on average 15 behavioral models (σ = 1.7) over 10 simulations. Surprise is typically detected by the system once the Spherical robot hits a wall or switches from rolling straight to driving a curve. The upper part of Fig. 3 shows the detection of surprise for one exemplary transition in behavior. In this example, the robot first rolls in a straight line by rotating its body around its internal green axis. Upon hitting a wall, the previously demonstrated behavior stops and for a short period of time all internal masses start moving. This results in a strong increase in prediction error outside the confidence of the active model B(t). After some time, the motion of one of the internal masses decreases (red mass) until this mass stops moving and is kept fixed at the center of the axis. Since this behavior was demonstrated for the first time, no fitting model is found during the searching period and a new model is generated. While the system performs the new rolling behavior, the predictions of this new behavioral model improve and its confidence bound decreases.

Figure 3: Exemplary surprise detection for the Spherical robot and the Hexapod, shown through the development of the internal error statistics over time. The plots show the current prediction error (e(t)), the mean prediction error of the active model (ē_i(t)), and the confidence of the active model (ē_i(t) + θσ̄_i(t)) over time. Marks along the x-axis denote 10-second intervals. The pictures show the surprise detection in simulation. The fourth frame depicts the time step when surprise was detected. The inter-frame interval is approximately 0.5 seconds. See the text for qualitative descriptions of the changes in behavior.
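The surprise test described above, comparing the current prediction error e(t) against the active model's confidence bound ē_i(t) + θσ̄_i(t), can be sketched as follows. The class and method names are ours, and e(t) is assumed to be already smoothed over a short window (cf. the 25-time-step footnote):

```python
import math

class SurpriseDetector:
    """Minimal sketch of the surprise test: the current prediction error
    e(t) is compared against the active model's confidence bound
    e_bar_i(t) + theta * sigma_bar_i(t). Hypothetical naming."""

    def __init__(self, theta=3.0):
        self.theta = theta   # sensitivity of the surprise threshold
        self.errors = []     # error history of the currently active model

    def update(self, e_t):
        """Record e(t); return True if it exceeded the confidence bound."""
        surprised = False
        if len(self.errors) >= 2:
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            surprised = e_t > mean + self.theta * math.sqrt(var)
        self.errors.append(e_t)
        return surprised
```

A model that predicts its behavior well accumulates small, stable errors, so the bound tightens over time; a sudden perturbation (e.g., hitting a wall) then pushes e(t) outside the bound and triggers surprise.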
Further transitions in behavior are shown in Video 1 (youtu.be/DKblfeM2Jys). The behavior explored by the SUBMODES system for the Spherical robot can be described in terms of the angular velocity ω_i for each axis i. The angular velocity ω_i states how fast the body of the robot rotates around the internal axis i, as illustrated in Fig. 4 (a). Fig. 4 (b)-(d) depict rolling behaviors of the Spherical robot from one simulation in terms of angular velocities. Since the change of orientation α̇ is not reflected in ω, we separate the behavior for driving straight (Fig. 4 (b)), driving a right curve (Fig. 4 (c)), and driving a left curve (Fig. 4 (d)). Curved rolling corresponds to rotating around axis i, where the mass of axis i is shifted to the right or left side of axis i. The color of each point shows the clustering of behavior through the behavioral models of the SUBMODES system. In this simulation the system learned 17 models. A clear partition can be observed, where different behavioral models are active depending on the angular velocities ω_i and the turning velocity α̇ (straight/left/right) of the point in behavioral space. Note that both ω_i and α̇ were not directly available to the system; instead, the system used its internal predictions of changes in the sensory values to systematically structure the experienced behavior.

We tested the Hexapod robot in an open field without any obstacles. When applied to the Hexapod, the DEP-controller with a particular inverse model M generates different gaits with circular or oval forward movements of each leg. The performed gaits vary in the strength of the leg movements and the phase relationships between leg movements. One of the emerging gaits for the Hexapod is the tripod gait, as previously observed in [10]. The tripod gait, shown in Fig.
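The straight/left/right split used for visualizing the behavioral space can be sketched as a simple threshold test on the turning velocity α̇, using the 0.3° threshold from Fig. 4. The function name and the use of degrees are our assumptions; the paper uses this split only for plotting:

```python
def driving_direction(alpha_dot_deg, threshold=0.3):
    """Label a sample of the Spherical robot's behavior by its turning
    velocity alpha_dot (in degrees per step): positive above the
    threshold counts as a left turn, negative below as a right turn."""
    if alpha_dot_deg > threshold:
        return "left"
    if alpha_dot_deg < -threshold:
        return "right"
    return "straight"
```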
5 (a), can be characterized as always having three legs on the ground, with the ipsilateral front and back leg and the contralateral middle leg moving together and in phase [57]. Moreover, a synchronous trot gait could emerge, where two legs at opposing sides of the body move synchronously and hind and front leg movements are synchronized [10], as shown in Fig. 5 (b). Additionally, various hybrid forms of these gaits emerged, for example, front and middle legs moving as during the tripod gait and hind legs moving synchronized and in phase. When activating the bias dynamics of the DEP-controller, the legs on one side of the body are offset either dorsally or ventrally along the rotational axes of the coxa joints.

Figure 4: Behavioral space of the Spherical robot discovered by the SUBMODES architecture in one simulation. (a) illustrates the angular velocity ω_i around the internal axes. Each point in (b)-(d) shows the behavior of the robot in terms of the angular velocities ω_i at that time. (b) shows the behavior for rolling in an approximately straight line, i.e., with changes in driving direction |α̇| < 0.3°. (c) shows the behavior for turning left (α̇ > 0.3°) and (d) shows the behavior for turning right (α̇ < −0.3°). The color of each point depicts which behavioral model B_i was active and predicting the behavior at that time. For clarity, only every 50th time step of the simulation is shown.

Figure 5: Exemplary gaits discovered by the SUBMODES system for the Hexapod. Each gait was encoded by a single behavioral model B_i. (a)-(d) show gaits in an open field. (e) and (f) show gaits in different terrains (see Section 5.3). In (e) snow slows down leg movements within it. In (f) a low ceiling limits the upward movement range of the legs. The inter-frame interval for the shown images is approximately 0.2 seconds.
This causes the legs on one side to rotate with a smaller amplitude, resulting in the robot crawling in a left or right curve, as shown in Fig. 5 (c)-(d). In 90 minutes of exploring behavior for the Hexapod, the SUBMODES system learned on average 18 behavioral models (σ = 2.9) over 10 simulations. Surprise is typically detected when the amplitude or phase relation between the circular joint movements changes, i.e., when the robot changes its gait, changes from crawling straight to crawling in a curve, or alters the overall velocity of the gait. An example of a change from tripod gait to curved locomotion, with the respective surprise detection, is shown in the lower row of Fig. 3. Video 2 (youtu.be/qeUpOqs9PCo) shows more transitions in behavior for the Hexapod.

5.2 Goal-directed locomotion

In a second test we analyzed how the SUBMODES system can use its learned behavioral encodings for goal-directed planning and control. We demonstrate this in a goal-reaching locomotion task. In all experiments goals were small, circular areas. Using an agent-centric frame of reference, we define goals by means of a target orientation and velocity. After either reaching the goal state or failing to reach it in time, the robot was reset and a new goal area was generated. One simulation of this experiment consisted of 100 training episodes. Each episode was composed of three different phases:

• Exploration phase: During the exploration phase the system was allowed to discover and learn new types of behavior for five minutes of simulation time. In this phase all motor commands were generated by the DEP-controller.

• Training phase: During the training phase the DEP-controller was deactivated and the motor commands were produced by the active behavioral models with the aim of reaching the given goal. During the training phase the internal models of the system were updated. In each training phase three goals were presented.
• Testing phase: The testing phase is equivalent to the training phase, except that no model updates occur during testing. This phase is included to measure the learning progress of the system over time. Each testing phase consists of five randomly generated goal areas.

One training episode typically lasts between 12 and 17 minutes of simulation time. The Spherical robot (diameter = 1 unit) was tested in a large quadratic arena (size = 300 × 300 units) surrounded by walls. Circular goal areas (radius = 1 unit) were randomly generated at a fixed distance around the center of the arena (distance = 60 units). The Spherical robot was given a maximum of 140 seconds to reach a goal area before being reset. Video 3 (youtu.be/i0oovLnqF9A) shows some exemplary runs. Fig. 6 shows the results of the goal-reaching task for the Spherical robot, with the SUBMODES system shown in black. Fig. 6 (a) shows the average time spent to reach the goal area. Over the first 50 training episodes the time required for goal-directed locomotion continuously decreases. While in the first testing episodes the system required approximately 90 seconds per goal, during the last testing episodes it took less than 60 seconds. As a reference, we include the hypothetical optimal performance of approximately 27 seconds, which assumes that no acceleration or turning is required and that the robot can simply drive towards the goal at maximum speed. Most of the behavioral models for the Spherical robot were discovered during the first 25 exploration phases, i.e., 125 minutes of exploring behavior. The number of behavioral models increased only slightly afterwards (see Fig. 6 (c)). Similarly, the percentage of goal areas reached within the maximal amount of time increased strongly over the first training episodes (see Fig. 6 (b)). Already after the second training episode the SUBMODES system managed to reach over 70% of the goal areas in time.
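The three-phase episode structure above can be sketched as a simple loop. The function names and callback signatures are our own stand-ins, not the paper's API; only the durations and goal counts come from the text:

```python
def run_episode(explore, control, evaluate):
    """Run one episode as described for the Spherical robot:
    exploration -> training (3 goals) -> testing (5 goals)."""
    log = []
    # 1) Exploration: five minutes of DEP-driven exploration; new
    #    behavioral models may be discovered and learned here.
    log.append(explore(minutes=5))
    # 2) Training: three goals; behavioral models produce motor commands
    #    and the internal models are updated.
    log.append([control(goal, update_models=True) for goal in range(3)])
    # 3) Testing: five goals with frozen models, to measure progress.
    log.append([evaluate(goal, update_models=False) for goal in range(5)])
    return log
```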
After 25 training episodes the system was able to reach more than 90% of the goal areas.

We compare the performance of the SUBMODES system to different ablations of the system, also plotted in Fig. 6. To determine the effectiveness of self-organized exploration combined with surprise-based segmentation, we compare the system to behavioral control using random controllers. In this setting, the system does not explore its behavioral abilities but is instead equipped with 30 neural network controllers whose fixed weights were randomly generated following a uniform distribution (∈ [−1, 1]). The system can use these controller models for planning and goal-directed control. Additionally, we compare the system to a random segmentation baseline. For this baseline the system is given 30 behavioral models B_i, and during exploration a randomly selected model is activated every 5 seconds of simulation time. This baseline is used to determine the effect of surprise-based segmentation compared to random time-based segmentation. Moreover, we tested the SUBMODES system without transition models. In this case, exploration and segmentation are applied normally, but no transition models are learned for transitions between behavioral primitives. Thus, the system cannot know whether a transition between two models is possible and cannot anticipate how a transition may affect its future sensory states.

Figure 6: Results of the goal-reaching task for the Spherical robot over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (140 s). (c) shows the mean number of behavioral models discovered. The black line depicts the SUBMODES architecture, with the shaded area showing the standard deviation. Other line styles and colors show different baselines (see the text for further explanations).
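The random-controller ablation can be illustrated with a minimal fixed-weight network. The single-layer architecture and the tanh squashing are our assumptions; the paper specifies only that the 30 controllers have fixed weights drawn uniformly from [−1, 1]:

```python
import math
import random

def make_random_controller(n_sensors, n_motors, seed=None):
    """Sketch of one 'random controller': a fixed single-layer network
    with weights drawn uniformly from [-1, 1] (hypothetical structure)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_sensors)]
               for _ in range(n_motors)]

    def controller(sensors):
        # tanh keeps the motor commands in the valid [-1, 1] range
        return [math.tanh(sum(w * s for w, s in zip(row, sensors)))
                for row in weights]

    return controller
```

Because the weights are never adapted, each such controller produces one fixed sensorimotor mapping; the ablated system can only select among them, not refine them.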
This setting is included to test the effect of learning transition models on goal-directed planning.

As shown in Fig. 6, the SUBMODES system clearly outperforms all of its ablations with respect to the number of goals reached and the time required per goal. In the random controller models setting, the system learns that some of the controllers can be used for locomotion; however, it finds no reliable way of changing direction. As a result, using random controllers the robot only managed to reach goal areas if, by chance, it ended up with the right orientation towards the goal. This is strongly reflected in the percentage of goal areas reached in time, which is on average below 20% for all testing episodes. Similar results can be observed for the random segmentation setting. In this setting most of the learned models do not represent a consistent type of behavior. Hence, the system managed to reach goal areas only by chance and, as a result, on average reached less than 20% of the goals during all episodes. Without transition models the system not only took more time to reach the goal areas, but also on average reached only approximately 60% of the goal areas in time. We assume that, without transition models, the system makes errors in planning when predicting changes in behavior, resulting in worse performance.

The Hexapod robot (length = 1 unit) was tested in a large area without any obstacles. Circular goal areas (radius = 1 unit) were randomly generated around the reset point of the robot at a fixed distance (distance = 60 units). The Hexapod was reset if it did not reach a given goal area within 200 seconds of simulation time. Video 4 (youtu.be/1h083TjLDK8) shows some exemplary runs. Fig. 7 depicts the results of the goal-reaching task for the Hexapod robot when using the SUBMODES architecture (black line). Already after the first training episode the system was able to reach 80% of the goal areas within the maximal amount of time.
From the 10th training episode onward, more than 90% of the presented goal areas were reached in time. The time required to reach the goal areas rapidly decreases over the first training episodes. From the 60th episode onward all goals were successfully reached. In the last testing episodes the system needed on average 70 seconds of simulation time to reach the goal areas. The hypothetical optimal performance of approximately 22 seconds again assumes constant maximal speed directly towards the goal, which cannot be reached in practice. The system continuously discovers new behavioral models over the course of the exploration phases. As before, we compare the performance of the SUBMODES system to different ablations of the system, see Fig. 7. When using random controller models, the Hexapod never managed to reach a goal area. While in some simulations we observed that some random controllers could be used for changing the orientation of the robot, not once was a controller generated that could be used for locomotion. Thus, using random controllers, the Hexapod never managed to actually move to the goal areas. When applying random segmentation, the robot reached approximately 10–15% of the goal areas in time during the first two episodes, but only very rarely reached a goal area afterwards.

Figure 7: Results of the goal-reaching task for the Hexapod over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (200 s). (c) shows the mean number of behavioral models discovered. The black line shows the performance of the SUBMODES architecture, with the shaded area showing the standard deviation. Other line styles and colors show different baselines (see the text for further explanations).

The cause for this
could be that, without the surprise-based segmentation, one specific behavioral model does not correspond to a particular behavioral primitive; instead, each model is trained on various different types of behavior. Even if by chance one model encodes a consistent behavioral primitive, it might get overwritten very quickly, resulting in a degradation of performance. As for the Spherical robot, the system without transition models performs worse in the goal-reaching task in terms of the time required to reach a goal area and the number of goals reached in time.

5.3 Terrain-dependent locomotion

The previous tests showed that the SUBMODES system is able to identify self-explored behavioral primitives and to learn models of these behavioral units, and of transitions thereof, that can be applied for goal-directed locomotion. In a third test we want to further examine whether the system is also able to distinguish between different external events affecting the behavior of the robot. For this purpose, we test it in an environment consisting of three different terrain types: a cave, an open field, and a snow field. The cave has a low ceiling 1.1h above the ground, with h being the combined length of the Hexapod's tibia and tarsus. Thus, the Hexapod is not able to fully lift its legs when positioned in the cave. However, the ceiling and the floor of the cave have a low friction, which allows the Hexapod to locomote forward using mostly forward-backward motions of its legs. The second environment is an open field without obstacles and a floor with normal friction (as in the previous experiments). The third environment is a snow environment, in which a 0.4h tall snow layer covers the ground. All movements inside the snow layer are severely slowed down (by the factor 0.8) due to the high friction of the snow. The SUBMODES system was given 60 minutes of simulation time of behavioral exploration in each of the three environments.
Afterwards, the robot was placed in an obstacle course consisting of all three environment types (each with a size of 60 × 60 units), shown in Fig. 8 (a), and had to use its previously learned models for goal-directed control. The robot starts in the center of the cave facing the north wall. The first goal is to crawl out of the cave through an opening at the right side of the cave. After reaching the opening, a goal area was randomly positioned in the snow field, and the task was to move over the open field and the snow layer to the goal position. Fig. 8 (a) shows the possible positions of the goal areas in white. If the robot reached the goal area or did not reach it within an upper time limit (400 seconds of simulation time), it was reset inside the cave. As in the previous tests, goal positions were defined with respect to the desired orientation α and velocity v of the robot. We tested the system for 100 training episodes, where each episode was composed of a training phase, during which one goal area was presented and the internal models were updated, and a testing phase, with five goal areas and without any model updates.

The SUBMODES system discovered new behavioral models for each of the three environments. In the cave the system found different crawling motions, which allowed the Hexapod to move using only little upward movement of the legs. One behavior that was discovered in the cave in every simulation is tripod crawling. During this behavior the legs are moved forward and backward as during the tripod gait, but are only lifted slightly, as shown in Fig. 5 (f).

Figure 8: Trajectories of the Hexapod for goal-directed locomotion in different terrains. (a) illustrates the obstacle course consisting of the three different environments. Textures depict the type of environment and black lines represent walls. White areas show possible goal positions.
The first goal is always positioned at the exit of the cave; the second goal is positioned inside the snow field. (b)-(d) show exemplary trajectories from the last testing phases of different simulations. The color of each line denotes the environment in which the used behavioral model was first discovered.

In the snow environment, the system discovered interesting gaits for fast movement despite the high friction of the snow layer. During most of the gaits discovered in snow, at least two legs are periodically lifted out of the snow while the other legs move only little and their feet constantly stay within the snow, as for example shown in Fig. 5 (e). Some behavioral models were activated in more than one type of environment, but these behaviors mostly resemble standing still or performing little leg movement. The system discovered on average 34 behavioral models (σ = 3.8, n = 10) during the 180 minutes of simulation time of exploration. On average, 9 models were discovered in the cave (σ = 2.3), 15 models in the open field (σ = 3.5), and 10 models in the snow environment (σ = 2.0).

The results for goal-directed locomotion in the obstacle course are shown in Fig. 9, with the black line depicting the SUBMODES architecture. The time spent to reach a goal area and the percentage of goal areas reached by the system rapidly improve over the first couple of training episodes. Already after seven training episodes the system was able to reach more than 80% of the goal areas in time. The percentage of goal areas reached in time further increased, such that the system reached more than 95% of the goal areas during the last couple of episodes. Furthermore, the time spent to reach a goal area is approximately halved over the course of training. Video 5 (youtu.be/xhEmmm6VMg8) shows one exemplary run of the Hexapod through the obstacle course. In Fig.
8 (b)-(d) some trajectories generated by the SUBMODES system for this task are illustrated. The background pattern denotes the type of environment, and the color of the lines shows in which environment the active behavioral model was first discovered. One can see that the system mostly applies behavioral models in the specific environment in which they were first discovered. Hence, the system seems to distinguish between different types of behaviors based on the three different environments and learns which behaviors are applicable in which environment. Note that the system does not receive direct information about its current environment. The applicability of a behavioral primitive is determined purely by the prediction errors of the internal models and by learning the transition probabilities between different behavioral models. The necessity of transition models for this task is clearly reflected in the performance of the ablated system without transition models (see Fig. 9, blue line). Without learning transition models, the system takes longer to improve its performance for goal-directed locomotion and never reaches more than 30% of the goal areas in time.

Figure 9: Results for the terrain-dependent goal-reaching task for the Hexapod over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (400 s). The solid black line depicts the SUBMODES system, with the shaded area showing the standard deviation; the dashed blue line shows the performance of the system without transition models; the solid green line shows an estimate of the hypothetical optimal performance.

6 Discussion and Future Work

We have proposed a novel computational architecture, the SUBMODES architecture, for surprise-based learning of modular, event-predictive behavioral primitives.
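Learning transition probabilities between behavioral models, as used for planning above, can be sketched as a count-based estimator. The data structure and naming are ours, and this sketch covers only the probability part; the paper's transition models additionally predict how a transition affects the sensory state:

```python
from collections import defaultdict

class TransitionModel:
    """Minimal count-based sketch of transition-probability learning
    between behavioral models (hypothetical structure)."""

    def __init__(self):
        # counts[a][b] = number of observed switches from model a to b
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, from_model, to_model):
        """Record one surprise-triggered switch between models."""
        self.counts[from_model][to_model] += 1

    def probability(self, from_model, to_model):
        """Estimated probability that `to_model` follows `from_model`."""
        total = sum(self.counts[from_model].values())
        if total == 0:
            return 0.0  # never observed: treated as not applicable
        return self.counts[from_model][to_model] / total
```

Under this scheme, a primitive that was only ever entered from models discovered in one terrain receives zero estimated probability elsewhere, which matches the observed terrain-specific activation of models.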
We showed through different simulations that this system is able to discover and detect a variety of behavioral primitives in highly complex, dynamic systems without the provision of any signal indicating the existence of a behavioral unit or the beginning or end of such a unit. Instead, the system uncovered different behavioral primitives from a continuous, self-explored sensorimotor stream in a self-supervised fashion, purely based on the detection of surprise and principles of event-predictive cognition [33, 4]. This allowed our system to discretize the continuous stream of information experienced by an embodied agent online, while simultaneously learning models of the performed behavior and of transitions in behavior. In this way, the SUBMODES system was able to learn a repertoire of various behaviors for two complex robotic agents from scratch.

In this work, the behavioral capabilities were initially explored by means of self-organizing behavior, which was generated by the differential extrinsic plasticity (DEP) controller [10]. This controller was able to produce various complex, highly coordinated behavioral patterns for the two robots with completely different body kinematics. Without specifying a goal, various rolling motions for the Spherical robot and crawling behaviors for the Hexapod emerged, most notably the tripod gait also known from real insects [10]. While it has been shown before that DEP can discover and produce interesting types of behavior [10, 54], the controller was, to the best of our knowledge, never before used to bootstrap behavioral learning. The SUBMODES system demonstrated that DEP is highly suitable for sensorimotor exploration in a self-supervised learning architecture. However, the SUBMODES architecture does not rely on this particular controller.
Other forms of behavioral exploration or learning by demonstration could in principle be applied as well, including predictive information maximization [58], intrinsically motivated goal exploration processes [59], or human demonstration.

Traditionally, learning behavioral primitives was investigated by learning only one primitive in isolation or by providing either explicit labels of the ongoing primitives or labels signaling transitions between primitives [6, 7, 8, 9, 29, 30, 31, 32]. Our system segments behavioral primitives without any supervised information or explicit labels. Classical approaches for self-supervised, online segmentation of behavior were applied in much simpler toy scenarios or in sensory spaces of lower complexity [27, 38, 46, 47]. Related systems for learning predictive behavioral encodings for more complex robotic systems learn based on replays of manually demonstrated primitive motions, e.g., [42, 48], which simplifies the segmentation problem because the trajectories during training have smaller variations, such that transitions are more apparent. We have shown that our surprise-based segmentation mechanism works well for high-dimensional, noisy, self-generated streams of sensorimotor information.

Besides the segmentation and learning abilities, we showed that the SUBMODES system can use its learned behavioral representations progressively more effectively for solving various goal-reaching tasks. The improvement in performance over time is accomplished by three main mechanisms: (1.) Over time, the system discovers new types of behavior, which may be more effective for the tested tasks. (2.) The system continues to improve the accuracy of the available behavioral models, enabling a more accurate anticipation of the sensory consequences of each associated behavior. (3.)
The system improves its predictive models of behavioral transitions, learning when transitions between different types of behavior can be applied and how a specific transition affects the sensory state. The learning of modular behavioral models, paired with sensorimotor exploration, allows the SUBMODES system to rapidly acquire models suitable for goal-directed control. This cannot be achieved by model-free approaches to behavioral control, such as model-free RL methods. For example, when applying Soft Actor-Critic to the MuJoCo Ant-v1 (a four-legged robot similar to the Hexapod but with fewer degrees of freedom), more than 1M update steps were needed to achieve reasonable forward locomotion alone [60]. In comparison, the SUBMODES system applied to the Hexapod managed to learn at least one good model for locomotion already during the first exploration phase, which takes less than 15k update steps. Hence, our system appears to be roughly two orders of magnitude faster than a state-of-the-art deep RL approach when applied to similar robots. On top of that, our system learns additional models for locomotion and turning and is able to use the learned models for flexible goal-reaching of target areas within a restricted time interval. A success rate of more than 90% was reached in less than 400k update steps. While the system manages to improve its capabilities to perform goal-directed control, both in terms of the number of goal states reached and the time required to reach these goals, it currently does not quite achieve optimal performance on the examined tasks. However, note that the learned representations were not optimized for any of the tested objectives. Instead, the system learned general, abstract representations of behavior that can in principle be applied in various tasks.
If one wishes to further optimize the performance of SUBMODES with respect to a specific task, various methods could be applied in addition to the processes already involved: Seeing that the learned models are differentiable, goal-directed active inference could be applied [61, 62], adjusting the motor command of each behavioral model on the fly depending on the desired sensory outcome. Furthermore, if a criterion for successful performance in a specific task is known, for example achieving high velocity in a locomotion task, the models could be further optimized towards this criterion by means of model-free RL [28] and policy gradient approaches [29].

SUBMODES modularizes the experienced behavior by encoding behavioral primitives through discrete, individual models. While this modularization protects the system from catastrophic forgetting [63], the time required to learn different behaviors could be further reduced by sharing information among models. Hence, for future work we want to apply the principles employed here to a more general forward architecture, akin to the network architectures in [64], and explore how behavioral representations can be modularized by selectively activating sub-components within the same network structure, as for example demonstrated by the REPRISE architecture [40, 41].

Besides further behavioral optimization and a less strict modularization, we intend to explore the applicability of SUBMODES to more complex tasks. One challenge in this respect is higher task complexity, where multiple intermediate goals need to be accomplished to reach a desired final goal state, such as when the hand first needs to move to a bottle before moving the hand to the mouth in order to drink out of the bottle. We expect that such tasks require non-greedy, conceptual planning mechanisms that unfold on deeper, conceptualized levels of abstraction.
Finally, we intend to tackle the visual sensory challenge and apply SUBMODES to real robots, where precise location, state, and motion information are not available but need to be inferred indirectly from the given sensory information.

References

[1] Martin V. Butz and Esther F. Kutter. How the Mind Comes Into Being: Introducing Cognitive Science from a Functional and Computational Perspective. Oxford University Press, Oxford, UK, 2017.
[2] Tamar Flash and Binyamin Hochner. Motor primitives in vertebrates and invertebrates. Current Opinion in Neurobiology, 15(6):660–666, 2005.
[3] Peter Gärdenfors. The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press, London, England, 2014.
[4] Martin V. Butz. Towards a unified sub-symbolic computational theory of cognition. Frontiers in Psychology, 7(925), 2016.
[5] Darrin C. Bentivegna, Christopher G. Atkeson, and Gordon Cheng. Learning from observation and practice using primitives. In AAAI 2004 Fall Symposium on Real-life Reinforcement Learning. Citeseer, 2004.
[6] Stefan Schaal. Dynamic movement primitives: a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pages 261–280. Springer, 2006.
[7] Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328–373, 2013.
[8] Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive Processing, 12:319–340, 2011.
[9] F. Wörgötter, E. E. Aksoy, N. Krüger, J. Piater, A. Ude, and M. Tamosiunaite. A simple ontology of manipulation actions based on hand-object relations. IEEE Transactions on Autonomous Mental Development, 5(2):117–134, 2013.
[10] Ralf Der and Georg Martius. Novel plasticity rule can explain the development of sensorimotor intelligence.
Pr oceedings of the National Academy of Sciences , 112(45):E6224–E6232, 2015. [11] Andre w G. Barto and Sridhar Mahade van. Recent advances in hierarchical reinforcement learning. Discr ete Event Dynamic Systems , 13:341–379, 2003. [12] Matthe w Botvinick and Ari W einstein. Model-based hierarchical reinforcement learning and human action control. Philosophical T ransactions of the Royal Society of London B: Biological Sciences , 369(1655), 2014. [13] Martin A. Giese and T omaso Poggio. Neural mechanisms for the recogniton of biological mo vements. Natur e Revie ws Neur oscience , 4:179–192, 2003. [14] Oliv er Herbort, Martin V . Butz, and Joachim Hoffmann. T ow ards an adapti ve hierarchical anticipatory behavioral control system. In F r om Reactive to Anticipatory Cognitive Embodied Systems: P apers fr om the AAAI F all Symposium , pages 83–90, Menlo Park, CA, 2005. AAAI Press. [15] Y uuya Sugita, Jun T ani, and Martin V Butz. Simultaneously emerging brai tenberg codes and compositionality . Adaptive Behavior , 19:295–316, 2011. [16] Jun T ani. Learning to percei ve the w orld as articulated: An approach for hierarchical learning in sensory-motor systems. Neur al Networks , 12:1131–1141, 1999. [17] Ronald C Arkin. Motor schema—based mobile robot na vigation. The International journal of r obotics r esear ch , 8(4):92–112, 1989. [18] Christoph Bregler . Learning and recognizing human dynamics in video sequences. In Computer V ision and P attern Recognition, 1997. Pr oceedings., 1997 IEEE Computer Society Conference on , pages 568–574. IEEE, 1997. [19] D. Kraft, N. Pugeault, E. Baseski, M. Popo vic, D. Kragic, S. Kalkan, F . Wörgötter , and N. Krüger . Birth of the object: Detection of objectness and e xtraction of object shape through object action complex es. International Journal of Humanoid Robotics , 5(2):247–265, 2008. [20] W illiam James. The Principles of Psychology , volume I,II. Cambridge, MA: Harvard Uni versity Press, 1890. [21] Joachim Hof fmann. 
Anticipatory behavioral control. In Anticipatory behavior in adaptive learning systems , pages 44–65. Springer , 2003. [22] Armin Stock and Claudia Stock. A short history of ideo-motor action. Psychological r esearc h , 68(2-3):176–188, 2004. [23] W . Prinz. A common coding approach to perception and action. In O. Neumann and W . Prinz, editors, Relationships between per ception and action , pages 167–201. Springer V erlag, Berlin, 1990. [24] Martin V . Butz and Joachim Hoffmann. Anticipations control behavior: Animal behavior in an anticipatory learning classifier system. Adaptive Behavior , 10:75–96, 2002. [25] Martin V . Butz. Which structures are out there. In Thomas K. Metzinger and W anja W iese, editors, Philosophy and Pr edictive Processing , chapter 8. MIND Group, Frankfurt am Main, 2017. [26] Daniel M W olpert and Mitsuo Kaw ato. Multiple paired forward and in verse models for motor control. Neural networks , 11(7-8):1317–1329, 1998. [27] Masahiko Haruno, Daniel M W olpert, and Mitsuo Kawato. Mosaic model for sensorimotor learning and control. Neural computation , 13(10):2201–2220, 2001. [28] Richard S. Sutton and Andrew G. Barto. Reinfor cement Learning: An Intr oduction . MIT Press, 2nd edition, 2018. [29] Jan Peters and Stefan Schaal. Natural actor-critic. Neur ocomputing , 71(7-9):1180–1190, 2008. 16 G U M B S C H , B U T Z , A N D M A RT I U S [30] S. Calinon and A. Billard. Statistical learning by imitation of competing constraints in joint space and task space. Advanced Robotics , 23(15):2059–2076, 2009. [31] Jens K ober and Jan Peters. Policy search for motor primitiv es in robotics. Machine Learning , 84:171–203, 2011. [32] O. Sigaud, C. Salaun, and V . Padois. On-line regression algorithms for learning mechanical models of robots: a surve y . Robotics and Autonomous Systems , 59(12):1115–1129, December 2011. [33] Jef frey M Zacks, Nicole K Speer , Khena M Swallo w , T odd S Braver , and Jeremy R Reynolds. 
Event perception: a mind-brain perspectiv e. Psycholo gical bulletin , 133(2):273, 2007. [34] Jeffre y M Zacks and Barbara Tversky . Event structure in perception and conception. Psycholo gical bulletin , 127(1):3–21, 2001. [35] Jeremy R Re ynolds, Jef frey M Zacks, and T odd S Bra ver . A computational model of e vent segmentation from perceptual prediction. Co gnitive Science , 31(4):613–643, 2007. [36] Christian Gumbsch, Jan Kneissler , and Martin V Butz. Learning beha vior-grounded event segmentations. In Pr oceedings of the 38th Annual Meeting of the Cognitive Science Society , pages 1787–1792, 2016. [37] Christian Gumbsch, Sebastian Otte, and Martin V Butz. A computational model for the dynamical learning of ev ent taxonomies. In Pr oceedings of the 39th Annual Meeting of the Cognitive Science Society , pages 452–457, 2017. [38] Martin V Butz, Samarth Swarup, and David E Goldberg. Effecti ve online detection of task-independent landmarks. Urbana , 51:61801, 2004. [39] Robert A Jacobs, Michael I Jordan, Steven J No wlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation , 3(1):79–87, 1991. [40] Martin V . Butz, Da vid Bilkey , Alistair Knott, and Sebastian Otte. Reprise: A retrospecti ve and prospecti ve inference scheme. Pr oceedings of the 40th Annual Meeting of the Cognitive Science Society , 2018. [41] Martin V Butz, David Bilk ey , Dania Humaidan, Alistair Knott, and Sebastian Otte. Learning, planning, and control in a monolithic neural ev ent inference architecture. arXiv pr eprint arXiv:1809.07412 , 2018. [42] Jun T ani. Learning to generate articulated beha vior through the bottom-up and the top-do wn interaction processes. Neural Networks , 16(1):11–23, 2003. [43] Shingo Murata, Hiroaki Arie, T etsuya Ogata, Shigeki Sugano, and Jun T ani. Learning to generate proactive and reactiv e behavior using a dynamic neural network model with time-v arying variance prediction mechanism. Advanced Robotics , 28(17):1189–1203, 2014. 
[44] Nicolas Duminy, Sao Mai Nguyen, and Dominique Duhaut. Learning a set of interrelated tasks by using a succession of motor policies for a socially guided intrinsically motivated learner. Frontiers in Neurorobotics, 12, 2018.
[45] Anna C. Schapiro, Timothy T. Rogers, Natalia I. Cordova, Nicholas B. Turk-Browne, and Matthew M. Botvinick. Neural representations of events arise from temporal community structure. Nature Neuroscience, 16(4):486–492, April 2013.
[46] Özgür Şimşek and Andrew G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. Proceedings of the Twenty-First International Conference on Machine Learning (ICML-2004), pages 751–758, 2004.
[47] Özgür Şimşek and Andrew G. Barto. Skill characterization based on betweenness. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1497–1504. Curran Associates, Inc., Red Hook, NY, 2009.
[48] Yuichi Yamashita and Jun Tani. Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment. PLoS Computational Biology, 4(11):e1000220, 2008.
[49] Shingo Murata, Yuichi Yamashita, Hiroaki Arie, Tetsuya Ogata, Shigeki Sugano, and Jun Tani. Learning to perceive the world as probabilistic or deterministic via interaction with others: A neuro-robotics experiment. IEEE Transactions on Neural Networks and Learning Systems, 28(4):830–848, 2017.
[50] Junpei Zhong, Angelo Cangelosi, Tetsuya Ogata, and Xinzheng Zhang. Encoding longer-term contextual information with predictive coding and ego-motion. Complexity, 2018, 2018.
[51] Matthew Botvinick, Yael Niv, and Andrew C. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280, 2009.
[52] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[53] Tejas D. Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.
[54] Georg Martius, Rafael Hostettler, Alois Knoll, and Ralf Der. Compliant control for soft robots: emergent behavior of a tendon driven anthropomorphic arm. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 767–773. IEEE, 2016.
[55] Georg Martius. Goal-oriented control of self-organizing behavior in autonomous robots. PhD thesis, Göttingen University, 2010.
[56] Georg Martius, Frank Hesse, F. Güttler, and Ralf Der. LpzRobots: A free and powerful robot simulator, 2010.
[57] J. J. Collins and Ian Stewart. Hexapodal gaits and coupled nonlinear oscillator models. Biological Cybernetics, 68(4):287–298, 1993.
[58] Georg Martius, Ralf Der, and Nihat Ay. Information driven self-organization of complex robotic behaviors. PLoS ONE, 8(5):e63400, 2013.
[59] Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
[60] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[61] Sebastian Otte, Theresa Schmitt, Karl Friston, and Martin V. Butz. Inferring adaptive goal-directed behavior within recurrent neural networks. 26th International Conference on Artificial Neural Networks (ICANN17), pages 227–235, 2017.
[62] Karl Friston, Francesco Rigoli, Dimitri Ognibene, Christoph Mathys, Thomas FitzGerald, and Giovanni Pezzulo. Active inference and epistemic value. Cognitive Neuroscience, 6:187–214, 2015.
[63] Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[64] Jun Tani. Exploring Robotic Minds. Oxford University Press, Oxford, UK, 2017.
[65] Georg Martius and J. Michael Herrmann. Variants of guided self-organization for robot control. Theory in Biosciences, 131(3):129–137, 2012.
[66] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
[67] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Figure 10: Network architecture of the DEP-controller (adapted from [54]). The left side illustrates the neural network controller generating motor commands y(t) based on the proprioceptive sensory input x(t). The right side shows the DEP learning rule, multiplying the derivative of a sensor value $\dot{x}(t)$ with the inferred motor changes $\tilde{\dot{y}}(t)$, generated by the inverse model M from some future input's derivative $\dot{x}(t + \phi)$.

A Behavioral exploration using DEP

For the parametric setup of the DEP-controller we follow [10]. The complete controller architecture is illustrated in Fig. 10. The DEP-controller receives an n-dimensional sensory input x(t) and generates an m-dimensional motor command y(t) at every discrete time step t. We assume that the system has a basic understanding of the causal relationship between motor actions and proprioceptive sensor values [10].
This ‘understanding’ is imprinted into an inverse model M, which relates sensory values x(t + φ) back to motor commands y(t) with a certain time lag φ. When focusing on changes in sensory and motor values, we get

\[ \tilde{\dot{y}}(t) = M \dot{x}(t + \phi), \tag{4} \]

where M is the inverse model, simplified as a linear model in the form of an m × n matrix, and the time lag is φ = 1. The controller weights are then updated using the differential extrinsic plasticity (DEP) rule:

\[ \Delta W_{ij} = \varepsilon_W \left( \tilde{\dot{y}}_i(t)\, \dot{x}_j(t) - W_{ij} \right), \tag{5} \]

where ε_W = 0.1 is a learning rate and −W_ij is a damping term. Since $\tilde{\dot{y}}(t)$ is a linear transformation of $\dot{x}(t + \phi)$, the synaptic weights of the controller change based on correlations between changes in sensor values $\dot{x}$ with a time lag φ. Thereby, the inverse model M states how correlations between $\dot{x}_i(t + \phi)$ and $\dot{x}_j(t)$ impact the weights W. As in [10], we use an appropriate normalization of the controller weights W. There are two options to perform weight normalization: global normalization and individual normalization. For global normalization the entire weight matrix is normalized:

\[ W \leftarrow \kappa \frac{W}{\lVert W \rVert + p} \tag{6} \]

with κ an empirical gain factor and a regularization term p = 10^{-12} that becomes effective near the singularity (‖W‖ = 0). In individual normalization each motor neuron is normalized individually, with

\[ W_{ij} \leftarrow \kappa \frac{W_{ij}}{\lVert W_i \rVert + p}, \tag{7} \]

where ‖W_i‖ is the norm of the i-th row of W, consisting of all weights that connect to motor neuron i. The type of normalization applied has a strong effect on the resulting behavior: while individual normalization leads to behaviors that involve all motors, global normalization restricts the overall activity to a subset of motors. For the Spherical robot we apply global normalization, which results in a behavior in which two internal masses are constantly moved while the third mass is stationary.
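The DEP update (Eq. 5) and the two normalization schemes (Eqs. 6 and 7) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; the function names are ours, and the inverse model is reduced to an identity matrix as in the standard case.

```python
import numpy as np

def dep_update(W, x_dot_future, x_dot, eps_W=0.1):
    """One DEP step (Eq. 5): correlate inferred motor changes with sensor changes."""
    M = np.eye(W.shape[0], W.shape[1])       # inverse model, here simply the identity (Eq. 4)
    y_tilde_dot = M @ x_dot_future           # inferred motor change from a future sensor change
    dW = eps_W * (np.outer(y_tilde_dot, x_dot) - W)  # Hebbian-like term with damping
    return W + dW

def normalize_global(W, kappa=1.5, p=1e-12):
    """Global normalization (Eq. 6): scale the whole m x n weight matrix at once."""
    return kappa * W / (np.linalg.norm(W) + p)

def normalize_individual(W, kappa=2.2, p=1e-12):
    """Individual normalization (Eq. 7): scale each motor neuron's row separately."""
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    return kappa * W / (row_norms + p)
```

The choice between the two normalizations mirrors the text: global normalization (used for the Spherical robot) lets a few motors dominate, while individual normalization (used for the Hexapod) keeps every motor active.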
For the Hexapod robot we apply individual normalization, resulting in all joints being involved in locomotion. The gain factor κ regulates the overall feedback strength of the sensorimotor loop; we set κ = 1.5 for the Spherical robot and κ = 2.2 for the Hexapod.

Figure 11: Prestructuring of the inverse model M of the DEP-controller when using the Hexapod. ↔ depicts the up-down dimension of the coxa joint, ↕ depicts the forward-backward dimension. An arrow from joint i to j describes the entry M_ij of the inverse model matrix. +-arrows represent a positive connection (M_ij = 1), −-arrows represent a negative connection (M_ij = −1).

The controller additionally uses a bias dynamics that we added to the original DEP-controller. The bias dynamics changes the value of one bias neuron every τ_h time steps. The bias to be altered is chosen as the bias neuron connecting to the motor neuron i whose incoming controller weights W_i have changed the least. Based on this heuristic, we introduce activity to motor neurons that did not change their activity much in the recent past. When the bias neuron is activated, its activity is randomly set to either h_i = 1.5 or h_i = −1.5. After τ_h time steps all bias neurons are deactivated again and this process is repeated. For the Spherical robot we set τ_h = 5000 (100 seconds). Since the behavior demonstrated by the DEP-controller naturally changes more often for the Hexapod robot, we chose a larger time horizon of τ_h = 10000 (200 seconds). The DEP-controller applied to the Spherical robot uses three bias neurons, one for each motor neuron. For the Hexapod we use four bias neurons. Two bias neurons are connected to the forward-backward coxa joints of either the right legs or the left legs, and two bias neurons are connected to the upward-downward coxa joints of either the right or left legs.
With this wiring, the activation of one bias neuron can offset the coxa joint positions of the legs on one body side in four different directions (upward, downward, forward, backward). The inverse model M of the DEP-controller states how sensory changes relate back to changes in the motor commands of the system, as defined by Equation 4. If the DEP-controller uses only proprioceptive sensory information as input and motor commands of the same joints as output, we can set M = I, the identity matrix. This design corresponds to the idea that changes in the proprioception of joint i are caused by changes in the motor command of joint i. This setting can be considered the standard case of applying the DEP-controller, which we also use for the Spherical robot. However, the inverse model M can also be prestructured by adding connections between joints within M where correlations or anticorrelations of the joint velocities are desired. The underlying idea is that we can add connections for joints i and j to increase either positive correlations (M_ij > 0) or negative correlations (M_ij < 0) between their velocities over time. We apply this form of guided self-organization of behavior [65] when using the Hexapod, as previously done in [10]. For the Hexapod, the inverse model M assumes a positive correlation between changes of joint angles and changes in motor commands for the same joint, i.e., M_ii = 1 for a joint i. Furthermore, the time-delayed sensor for forward-backward angles is positively linked to the downward-upward angle of the same coxa joint (see Fig. 11 (a)). This connection facilitates circular leg motions over time [10]; e.g., once the leg moves forward, it is desired that the leg moves downward some time later. To further facilitate locomotion we additionally want to obtain antiphasic forward-backward motions of subsequent legs on the same side.
For this purpose, negative links are included in M between the forward-backward sensors and motors of subsequent legs on the same side (see Fig. 11 (b)).

B Learning transitions in behavior

To enable the accurate prediction of a sensorimotor time series consisting of a variety of different behaviors, it is important not only to consider how the sensorimotor information unfolds during each stable behavioral mode, but also to model the transitions between two subsequent behavioral primitives. For this purpose, the SUBMODES system incorporates a set of transition models T. If a transition from model B_i to model B_j occurs, the transition model T_{i→j} ∈ T is updated. Each transition model consists of three subcomponents: P_{i→j}, t̄_{i→j}, and F_{i→j}. Some transitions require a specific context to occur; e.g., a transition from ‘walking’ to ‘swimming’ can only occur if the agent is standing in shallow water. To model the critical conditions for a transition in behavior, T_{i→j} contains a transition probability network P_{i→j}. This network aims to predict the probability of a successful transition from B_i to B_j given the current sensory state x(t). P_{i→j} is a single-layered feed-forward neural network mapping a sensory state x(t) to a probability in [0, 1]. If a transition was initiated at time step t, then P_{i→j} receives x(t) as an input to train the network. If the system activated model B_j after the transition, then P_{i→j} is trained on the deviation of its prediction from the target probability 1. If the system planned to reach B_j when initiating the transition, but ended up using a different model, P_{i→j} is updated using the target probability 0. Thus, the network estimates the probability of being able to switch from B_i to B_j given the current sensory state.
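The described training of a transition probability network can be sketched as a single logistic unit pushed toward target 1 on a successful transition and target 0 otherwise. The class name and the plain gradient step are illustrative assumptions; the paper specifies only a single-layered network with targets 1 and 0.

```python
import numpy as np

class TransitionProbabilityNet:
    """Single-layer network mapping a sensory state x to a probability in [0, 1]."""
    def __init__(self, n_sensors, lr=0.05):
        self.w = np.zeros(n_sensors)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # Sigmoid output: estimated probability of a successful transition B_i -> B_j
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def train(self, x, transition_succeeded):
        """Target is 1 if model B_j became active after the transition, else 0."""
        target = 1.0 if transition_succeeded else 0.0
        error = target - self.predict(x)   # gradient of a cross-entropy loss w.r.t. the logit
        self.w += self.lr * error * x
        self.b += self.lr * error
```

Repeated successful transitions from the same region of sensory space thus drive the predicted probability toward 1, and failed ones toward 0.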
Transitions in behavior may take different amounts of time to complete, since every transition in behavior is preceded by a searching period. Hence, T_{i→j} contains the component t̄_{i→j}, an estimate of the time required to perform a transition from B_i to B_j. Currently, t̄_{i→j} is computed as the mean number of time steps that passed between the initiation of a transition from model B_i and successively activating model B_j. Transitions in behavior can also entail a strong, sudden sensory change. For example, a transition from ‘running’ to ‘standing still’ typically results in a strong decrease in velocity. To predict the sensory changes occurring during a transition between models, an additional single-layered feed-forward neural network F_{i→j} is trained. F_{i→j} learns a mapping from a sensory state x to a change in sensory states Δx. When a transition from model B_i is initiated at time t and model B_j is activated at time step t + t_{i→j}, then F_{i→j} is trained on the input x(t) and the nominal output x(t + t_{i→j}) − x(t). Hence, F_{i→j} predicts how the sensory state will change from the onset of a transition until the transition is finished. Overall, one transition model T_{i→j} can be used to estimate (1.) where in sensory space such a transition is applicable, (2.) how long the transition in behavior will take until the next model is active, and (3.) how the sensory state will change over the course of the transition, by means of P_{i→j}, t̄_{i→j}, and F_{i→j}, respectively. This results in a directed graph representation of behavioral primitives, as illustrated in Fig. 12 (a). Each node of the graph represents a stable behavioral mode with uniformly unfolding sensorimotor dynamics, encoded by a single behavioral model B_i. The edges between two nodes are transitions in behavior, represented by a transition model T_{i→j}.
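The duration estimate t̄_{i→j} and the sensory-change predictor F_{i→j} can be sketched together as follows. This is a simplified stand-in: F is reduced to a linear map trained by a delta rule rather than the paper's single-layered network, and the class name is ours.

```python
import numpy as np

class TransitionModel:
    """Transition T_{i->j}: mean duration t_bar and sensory-change predictor F."""
    def __init__(self, n_sensors, lr=0.01):
        self.t_bar = 0.0       # running mean of observed transition durations
        self.count = 0
        self.F = np.zeros((n_sensors, n_sensors))  # linear map x -> delta x
        self.lr = lr

    def update(self, x_start, x_end, duration):
        # Update the mean duration incrementally
        self.count += 1
        self.t_bar += (duration - self.t_bar) / self.count
        # Train F on the observed sensory change over the whole transition
        target = x_end - x_start
        error = target - self.F @ x_start
        self.F += self.lr * np.outer(error, x_start)

    def predict_state_after(self, x):
        # Predicted state roughly t_bar steps after initiating the transition (cf. Eq. 9)
        return x + self.F @ x
```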
The availability of an edge given the current sensory state is encoded by the transition probability model P_{i→j}. This graph representation of behavior and transitions in behavior is crucial to allow hierarchical, goal-directed planning of behavior.

C Goal-directed planning

When switching into goal-directed control, the explorative controller is deactivated and behavioral models and model transitions are invoked purposefully to minimize the difference between anticipated and desired perceptions. This process of greedy planning is schematically illustrated in Fig. 12 (b). During goal-directed control at time t, the system considers which subset of behaviors $\mathcal{B}(t) \subseteq \mathcal{B}$ is applicable given the current sensory state x(t) and the currently active model B(t) = B_i. Whether a behavior B_j is an element of $\mathcal{B}(t)$ is determined stochastically using the transition probability network P_{i→j}. The system determines the probability of B_j being an element of $\mathcal{B}(t)$ as

\[ P(B_j \in \mathcal{B}(t) \mid B_i, x(t)) = P_{i \to j}(x(t)), \tag{8} \]

with B_i the active model and x(t) the current sensory state. As a next step, the system predicts how its sensory state will change when transitioning from the current behavior B_i to a new behavior $B_j \in \mathcal{B}(t)$. The sensory state x′(t + t̄_{i→j} | B_i → B_j), describing the sensory state after a transition from the active model B_i to B_j, is determined as

\[ x'(t + \bar{t}_{i \to j} \mid B_i \to B_j) = x(t) + F_{i \to j}(x(t)), \tag{9} \]

with t̄_{i→j} the estimated time required for the transition and F_{i→j} the transition network predicting the sensory change during a transition from B_i to B_j.

Figure 12: Illustration of the representations of behavior learned by the SUBMODES architecture and their use for goal-directed planning. (a) The learned representations form a directed graph with the behavioral models B_i as nodes.
Each edge represents one step of sensory prediction, either by staying in the same model or by transitioning to a new behavior. A transition to behavior B_j from the current behavior B_i is considered in a stochastic fashion according to the probability P_{i→j} given the current sensory state x(t). In this example B_1 is active and B_2 and B_4 can be reached. (b) shows how the prediction can be used for greedy planning. × marks a goal state and the dotted lines show the predicted trajectory when using an associated behavioral model B_i. In this example the system chooses B_4, since a part of the predicted trajectory, marked by a grey background, has the lowest mean distance to the goal state. (c) shows how replanning allows the system to concatenate different behavioral primitives for accurate goal-directed control.

Then, the system predicts how the sensory information will evolve over a planning horizon τ_p when staying in B_j. The succeeding sensory states x′(t + u | B_i → B_j) are computed iteratively via

\[ x'(t + u + 1 \mid B_i \to B_j) \leftarrow x'(t + u \mid B_i \to B_j) + B_j\big(x'(t + u \mid B_i \to B_j)\big), \tag{10} \]

starting with u = t̄_{i→j} until u = τ_p. Given a goal state x_G(t), the distance of a predicted sensory state x′(t + u) with respect to the goal can be computed as

\[ D_G(x'(t + u)) = \lvert x_G(t) - x'(t + u) \rvert_M, \tag{11} \]

for some metric M. In the current experiments, M was chosen as the squared distance between the task-relevant sensory information of x_G(t) and x′(t + u). In our examples, the task-relevant coordinates are the orientation α and the velocity v. The next behavioral model B(t + 1) is chosen as

\[ B(t+1) \leftarrow \underset{B_j \in \mathcal{B}(t)}{\arg\min}\; \min_{\bar{t}_{i \to j} < \tau < \tau_p} \frac{1}{\tau - \bar{t}_{i \to j}} \sum_{u = \bar{t}_{i \to j}}^{\tau} D_G\big(x'(t + u \mid B(t) \to B_j)\big). \tag{12} \]

Hence, the next model B(t + 1) is determined as the applicable behavior that predicts the sensory time series with the lowest mean distance to the goal.
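Equations (8)–(12) amount to the following greedy selection loop. The behavioral and transition models are passed in as plain callables; this is an illustrative simplification under our own naming, not the authors' code.

```python
import numpy as np

def plan_next_behavior(x, current_i, models, transitions, x_goal, tau_p=500, rng=None):
    """Greedy behavior selection (Eqs. 8-12): pick the applicable behavior whose
    predicted sensory rollout has the lowest mean distance to the goal.

    models[j](x)        -> predicted sensory change under behavior B_j (Eq. 10)
    transitions[(i, j)] -> (P_ij, t_bar_ij, F_ij): probability callable, mean
                           transition duration, sensory-change callable (Eqs. 8, 9)
    """
    rng = rng or np.random.default_rng(0)
    best_j, best_score = current_i, np.inf
    for j in models:
        if j == current_i:
            continue
        P, t_bar, F = transitions[(current_i, j)]
        if rng.random() > P(x):               # Eq. 8: stochastic applicability check
            continue
        x_pred = x + F(x)                     # Eq. 9: state after the transition
        dists = []
        for _ in range(int(t_bar), tau_p):    # Eq. 10: roll the forward model out
            x_pred = x_pred + models[j](x_pred)
            dists.append(np.sum((x_goal - x_pred) ** 2))  # Eq. 11, squared distance
        if not dists:
            continue
        running_mean = np.cumsum(dists) / np.arange(1, len(dists) + 1)
        score = running_mean.min()            # Eq. 12: best mean-distance prefix
        if score < best_score:
            best_j, best_score = j, score
    return best_j
```

Taking the minimum over running means (rather than the final distance) is what lets the planner prefer a behavior whose trajectory passes close to the goal partway through the rollout.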
The predicted time series has a maximal length of τ_p to ensure an upper limit on the computational complexity; we set τ_p = 500. After activating the next model B(t + 1), the system initiates a searching period to determine whether the transition to this model was successful, as described in Section 3. The transition model T_{i→j} is then updated depending on the success of the initiated transition. As soon as the system is certain about which behavioral model is currently active, i.e., after the searching period, it is allowed to replan. In this way, the system can serially concatenate single behavioral primitives into a chain of more complex behavior that allows it to accurately reach a given goal state, as illustrated in Fig. 12 (c).

D Parametric setup

All neural network models of our system, i.e., B_i, F_{i→j}, and P_{i→j}, are single-layered neural networks mapping directly from the sensory input space X to their respective output spaces (B_i: X × Y, F_{i→j}: X, P_{i→j}: [0, 1]). For networks predicting sensory changes or motor commands, i.e., B_i and F_{i→j}, output neurons use a tanh activation function and a squared error loss is used for backpropagation. To enforce sparsity in the network weights, an L1 weight regularization term is added to the loss functions [66], with regularization constant λ = 0.005. For networks predicting probabilities, i.e., P_{i→j}, output neurons use a sigmoid activation function and backpropagation is based on a balanced cross-entropy loss [67]. The different types of networks use different learning rates (B_i: ε_B = 0.005, F_{i→j}: ε_F = 0.01, P_{i→j}: ε_P = 0.05). To enable fast learning while avoiding local overfitting, each network is equipped with a replay buffer with a large capacity (capacity = 10000) that stores a new input-output pair in each training step.
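The replay mechanism can be sketched as a bounded buffer from which extra samples are drawn per update. This is a generic sketch assuming uniform sampling; the capacity follows the value in the text, and the class name is ours.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO buffer of (input, target) pairs for additional replay training."""
    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest pairs are dropped when full
        self.rng = random.Random(seed)

    def store(self, x, target):
        self.buffer.append((x, target))

    def sample(self, s):
        """Draw s random stored pairs (s = 2 for the Spherical robot, 25 for the Hexapod)."""
        k = min(s, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)
```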
During each network update, s additional samples are randomly drawn from the buffer and the neural network models are additionally trained on the drawn samples. For the Spherical robot, s = 2 samples are additionally drawn during each network update. Seeing that the behaviors change faster for the Hexapod, we used a larger sampling rate of s = 25 in that scenario. The error models E_i ∈ E are estimated as normal distributions. The normal distributions are initialized with ē_i(0) ← 0.05 and σ̄_i(0) ← 0. To allow each error model to quickly keep track of the prediction accuracy of its respective behavioral model, ē_i and σ̄_i are updated as an exponential moving average and variance, with a timescale of 1000 steps (ε_E = 0.001). To enable the detection of surprise, we compute the prediction error e(t) as a simple moving average of the sensory prediction error over a short time interval (25 time steps or 0.5 seconds). Comparing the prediction error e(t) to the active error model E_i allows the system to detect surprise (as defined in Equation 3). The surprise threshold θ determines the confidence threshold above which an error is considered ‘surprising’. Seeing that we face highly noisy scenarios in our experiments, we chose a small threshold θ = 2 to achieve a fine-grained segmentation. However, depending on the general predictability of the scenario and the desired level of abstraction, a larger θ can be applied as well [37]. Upon detecting surprise, the system enters a searching period with a duration in [τ_s,min, τ_s,max] to determine the next behavioral model. In our simulations, the searching period takes at least τ_s,min = 50 time steps (1 second) to have a sufficient number of data points for comparing the predictions of all models. The searching period takes at most τ_s,max = 700 time steps (14 seconds) before a new model is created.
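The error model and surprise check (Eq. 3) can be sketched as exponential moving statistics compared against a θ-scaled threshold. The incremental variance update below is a standard exponentially weighted form; it is our assumption about the exact implementation, with initialization and rates taken from the text.

```python
import math

class ErrorModel:
    """Tracks the mean and variance of a model's prediction error as exponential moving stats."""
    def __init__(self, eps=0.001, theta=2.0):
        self.mean = 0.05      # initialization e_bar(0) from the text
        self.var = 0.0        # initialization sigma_bar(0) from the text
        self.eps = eps        # update rate, i.e., a timescale of ~1000 steps
        self.theta = theta    # surprise threshold

    def update(self, e):
        delta = e - self.mean
        self.mean += self.eps * delta
        # exponentially weighted moving variance
        self.var = (1 - self.eps) * (self.var + self.eps * delta * delta)

    def surprised(self, e):
        """Eq. 3: the error counts as surprising if it exceeds mean + theta * std."""
        return e > self.mean + self.theta * math.sqrt(self.var)
```

With θ = 2, an error more than two (moving) standard deviations above the moving mean triggers a searching period.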
Seeing that the DEP-controller maintains one type of behavior for a relatively long time (typically longer than a minute), we can use such a long searching period. In this way, small irregularities in behavior, such as the Hexapod stumbling, are ignored instead of resulting in the generation of a new behavioral model. However, for other exploration mechanisms with faster changes in behavior, a shorter searching period is recommended.

E Processing sensory changes

To allow the surprise-based segmentation to take all sensory dimensions into account equally, it is necessary that every sensory dimension x_i changes at a similar rate. This can be achieved in two ways: (1.) choosing an appropriate time frame for determining the change in sensory information Δx, and (2.) scaling each dimension i of Δx by a constant factor c_i, such that all Δx_i lie within the same interval. For the Spherical robot, Δx is computed as the change of sensory information over one time step (i.e., Δx(t + 1) = x(t + 1) − x(t)). For the Hexapod, Δx is computed as the mean change over 10 time steps. By computing Δx in this way, changes in proprioception are typically within the same interval (Δx_i ∈ [−0.1, 0.1]). To ensure that other changes in sensory information are within this interval as well, Δsin(α), Δcos(α), and Δv are multiplied by a constant factor c (c = 10 for the Spherical robot, c = 15 for the Hexapod).

F Pseudocode

In this section we provide pseudocode for the SUBMODES algorithm. Algorithm 1 describes the main loop of the system. Algorithms 2–5 are separated from the main algorithm to improve readability. Algorithm 1 receives the surprise threshold θ, the minimal and maximal duration of the searching interval [τ_s,min, τ_s,max], and the planning horizon τ_p as input parameters.
Algorithm 1 SUBMODES: main algorithm
 1: procedure SUBMODES(θ, τ_s,min, τ_s,max, τ_p)
 2:   t ← 0
 3:   B_0 ← CREATE_NEW_MODEL()
 4:   B ← {B_0}, E ← {E_0}, T ← {}
 5:   B(t) ← B_0, E(t) ← E_0
 6:   initialize DEP
 7:   x(t) ← sense current sensory state
 8:   y(t) ← action from DEP given x(t)
 9:   p(t) ← prediction from B(t) given x(t)
10:   e(t) ← 0                          ▹ prediction error
11:   exploration ← true               ▹ exploration vs. planning?
12:   searching ← false                ▹ in searching phase?
13:   t_s ← 0                           ▹ time spent searching
14:   execute action y(t)
15:   while simulation is running do
16:     ▹ 1. surprise detection and model updates
17:     t ← t + 1
18:     x(t) ← sense current sensory state
19:     update e(t) based on ‖x(t) − p(t)‖
20:     if SURPRISE(e(t), E(t)) and not searching then
21:       t_s ← 0
22:       searching ← true
23:     if searching then
24:       t_s ← t_s + 1
25:       B(t), searching ← SEARCH_STEP(t_s)
26:       E(t) ← error model associated with B(t)
27:     else
28:       update B(t) based on x(t−1), y(t), x(t)
29:       update E(t) based on e(t)
30:     ▹ 2. action generation and planning
31:     exploration ← exploration or planning phase?
32:     if exploration then            ▹ DEP-based exploration
33:       update DEP based on x(t) (Eq. 5)
34:       y(t) ← action from DEP given x(t) (Eq. 1)
35:     else                            ▹ goal-directed planning
36:       if not searching then
37:         x_G ← receive goal state
38:         B(t) ← PLANNING(x_G)
39:         if B(t) ≠ B(t−1) then
40:           t_s ← 0
41:           searching ← true
42:       y(t) ← action from B(t) given x(t)
43:     ▹ 3. next prediction
44:     p(t) ← prediction from B(t) given x(t)
45:     execute action y(t)

Algorithm 2 SUBMODES: model creation
 1: procedure CREATE_NEW_MODEL
 2:   B_i ← create new behavioral model
 3:   E_i ← create new error model
 4:   for B_j ∈ B do
 5:     T_{i→j} ← create new transition model
 6:     T_{j→i} ← create new transition model
 7:   add created models to B, E, and T, respectively
 8:   return B_i

Algorithm 3 SUBMODES: surprise detection
 1: procedure SURPRISE(e, E_i)
 2:   ē ← mean of E_i
 3:   σ̄ ← standard deviation of E_i
 4:   return e > ē + θσ̄ (Eq. 3)

Algorithm 4 SUBMODES: one step of searching
 1: procedure SEARCH_STEP(t_s)
 2:   for B_i ∈ B do
 3:     ē_s(i) ← update mean prediction error of B_i during the searching phase
 4:   if t_s > τ_s,min then              ▹ try to determine next model
 5:     B′ ← set of B_i ∈ B with no SURPRISE(ē_s(i), E_i)
 6:     if B′ ≠ ∅ then                  ▹ found suitable models
 7:       B_next ← arg min_{B_i ∈ B′} ē_s(i)
 8:       if B_next ≠ B(t) then         ▹ model transition
 9:         update transition models
10:       return B_next, searching = false
11:     else if t_s > τ_s,max then      ▹ no model found in time
12:       B_next ← CREATE_NEW_MODEL()
13:       update transition models
14:       return B_next, searching = false
15:   return B(t), searching = true     ▹ continue searching

Algorithm 5 SUBMODES: planning the next behavior
 1: procedure PLANNING(x_G)
 2:   B_next ← determine best behavioral model for the goal x_G over τ_p steps (Eq. 12)
 3:   return B_next