Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
Published as a conference paper at ICLR 2026

Yanning Dai 1*, Yuhui Wang 1*†, Dylan R. Ashley 1,2,3,4, Jürgen Schmidhuber 1,2,3,4

1 Center of Excellence for Generative AI, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. 2 Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Lugano, Switzerland. 3 Università della Svizzera italiana (USI), Lugano, Switzerland. 4 Scuola universitaria professionale della Svizzera italiana (SUPSI), Lugano, Switzerland.

ABSTRACT

Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control's adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control's adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.
Figure 1: Showcasing how our Stackelberg PPO can autonomously design task-specific robots for the "Pusher" task, starting with a bare-bones structure and ultimately evolving into a sophisticated design with arm-like structures for pushing boxes and leg-like limbs for movement. This highlights the method's ability to create adaptive and complex designs. In comparison, the traditional PPO method generates simpler structures that cannot support more complex behaviors. For more examples of evolved designs and animations, as well as open-source code, visit: https://yanningdai.github.io/stackelberg-ppo-co-design

* Equal contribution. † Corresponding author.

1 INTRODUCTION

Morphology-control co-design optimizes both an agent's body structure and its control policy. The morphology defines the agent's design, including topology, geometry, joint layout, and actuation limits, while the control policy dictates how this structure produces behavior to interact with the environment (Paul, 2006; Ha & Schmidhuber, 2018). Both aspects are critical for task performance. For instance, a quadruped robot with rigid legs cannot walk without an appropriate gait policy, and a locomotion policy is ineffective if the robot's morphology lacks the necessary joints to support movement. These examples show that morphology and control must be co-designed to ensure complementarity. Agents optimized under this paradigm are typically more versatile, robust, and efficient than those optimized for only one aspect (Sims, 1994; Lipson & Pollack, 2000; Bongard et al., 2006; Kriegman et al., 2020).
A key challenge in morphology-control co-design is the dynamic interplay between morphology and control: the control policy must adapt to the evolving morphology to fully realize its potential; without this, the morphology's true performance may be underestimated (Schaff & Walter, 2022). Existing methods often treat morphology and control as separate optimization processes (Cheney et al., 2018; Lu et al., 2025; Yuan et al., 2022). Specifically, morphology updates are typically made assuming a fixed control policy, neglecting that control must adapt to structural changes. As a result, morphology updates become misaligned with the optimal control response, leading to unstable and inefficient optimization in the morphology space and, ultimately, degraded performance.

In this paper, we revisit the co-design problem from a game-theoretic perspective, formulating it as phase-separated Stackelberg Markov Games (SMGs), where the leader first updates the morphology and the follower adapts its control policy accordingly. The key insight is that the leader must anticipate how morphology changes will influence the follower's control dynamics, enabling more effective designs. This perspective leads to the development of Stackelberg Implicit Differentiation (SID), a technique that incorporates the follower's adaptive dynamics into morphology optimization. However, applying SID to morphology-control co-design is non-trivial due to the phase-separated nature of the interaction and the non-differentiable interface between the leader and follower. To address these challenges, we derive Stackelberg policy gradients tailored to phase-separated SMGs with non-differentiable interfaces, building on SID to explicitly account for the follower's adaptation in the morphology optimization process.
Since direct differentiation is obstructed by the non-differentiable leader-follower interface, we apply the log-derivative technique (Williams, 1992) to derive a new Stackelberg surrogate formulation that bypasses this issue and provides a tractable gradient estimator. We further provide theoretical guarantees, showing that these surrogates are locally equivalent to the true Stackelberg gradients. To stabilize training under large policy shifts, we adapt PPO's likelihood-ratio clipping to our Stackelberg framework, ensuring robust optimization of the surrogate gradients. Our method, Stackelberg PPO, outperforms state-of-the-art baselines by 20.66% on average and by 32.02% on complex 3D tasks.

2 RELATED WORK

Morphology-Control Co-design. Co-optimizing morphology and control is attracting increasing attention in embodied intelligence (Li et al., 2024; Huang et al., 2024b; Liu et al., 2025). Prior work optimizes only continuous attributes under a fixed topology, without generating new structural topologies (Banarse et al., 2019; Huang et al., 2024a). Early topology-editing work treated co-design as a discrete, non-differentiable search solved using evolutionary strategies (Sims, 1994; Cheney et al., 2018), requiring each morphology to be paired with a separately trained controller and thus incurring high computational cost. Subsequent methods introduced structural priors and parameter sharing to reuse experience across designs (Dong et al., 2023; Zhao et al., 2020; Wang et al., 2019; Xiong et al., 2023). More recent RL-based approaches cast structure generation as sequential edits in an MDP (Gupta et al., 2021; Yuan et al., 2022), with graph and attention architectures improving representation quality (Chen et al., 2024; Lu et al., 2025; Yuan et al., 2022).
However, the discrete nature of morphology-editing operations blocks gradient propagation across the morphology-control interface, preventing efficient learning that captures their coupled dynamics. Our work establishes a gradient-driven pathway that allows controller adaptation to directly affect morphology updates.

Stackelberg Game. Learning systems with asymmetric components often exhibit directional dependencies, where one module's decisions influence another's adaptation in a non-reciprocal manner (Schmidhuber, 2015). Stackelberg games formalize this asymmetry, with a leader committing to strategies that followers then best-respond to. Classical approaches study static normal-form games (Başar & Olsder, 1998; Conitzer & Sandholm, 2006; Von Stengel & Zamir, 2010), while more recent extensions integrate this structure into sequential decision processes and RL frameworks (Gerstgrasser & Parkes, 2023; Zhong et al., 2023; Bai et al., 2021), often incorporating confidence-aware or optimistic mechanisms to manage follower uncertainty (Ling et al., 2023; Kao et al., 2022; Kar et al., 2015; Mishra et al., 2020). A complementary line of research leverages implicit differentiation to propagate follower gradients to the leader (Zheng et al., 2022; Yang et al., 2023; Vu et al., 2022), typically under DDPG-style settings with explicit action-level coupling and alternating updates. Our problem differs in two key aspects: leader actions (morphology edits) cannot be directly transmitted to the follower, and both agents use non-alternating PPO-style updates.
We extend implicit Stackelberg gradient methods to this more general regime, enabling (to our knowledge) the first application of implicit Stackelberg differentiation to morphology-control co-design under the PPO algorithm.

3 PRELIMINARIES

Proximal Policy Optimization (PPO). Our method builds upon PPO due to its empirical stability and simplicity. In reinforcement learning, an agent interacts with the environment by observing a state $s_t$, selecting an action $a_t$ according to its policy $\pi_\theta$, and receiving feedback in the form of rewards. Vanilla policy gradient methods (Sutton, 1984; Williams, 1992; Sutton et al., 1999) optimize $\pi_\theta$ using a surrogate objective that locally approximates the true performance, which has been shown to cause instability when policy updates become too large (Schulman et al., 2015). PPO (Schulman et al., 2017) addresses this by constraining the likelihood ratio between new and old policies through a clipping technique:
$$L^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta), 1-\epsilon, 1+\epsilon\big)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_o}(a_t \mid s_t)}$ is the likelihood ratio and $\hat{A}_t$ is an estimator of the advantage function. The clipping mechanism prevents the likelihood ratio $r_t(\theta)$ from deviating excessively, thereby limiting policy updates and balancing stability with performance improvement.

Stackelberg Game. A Stackelberg game models a sequential and asymmetric interaction in which a leader first commits to a strategy, and a follower subsequently optimizes its strategy in response. One key characteristic of a Stackelberg game is that the leader explicitly accounts for the follower's best-response dynamics when making its decision. Let $\theta^L$ and $\theta^F$ denote the decision variables of the leader and the follower, and let $J^L(\theta^L, \theta^F)$ and $J^F(\theta^L, \theta^F)$ denote their respective objectives.
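Returning to the PPO objective above, the clipped surrogate can be written out in a few lines. The following is a minimal NumPy sketch (illustrative only; the function and argument names are ours, not from the paper's code, and the advantages are assumed to be given):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate (to be maximized), averaged over samples.

    logp_new / logp_old: log pi_theta(a_t|s_t) under the new and old policies.
    advantages: estimates of A_t (assumed precomputed, e.g. via GAE).
    """
    ratio = np.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise min keeps the pessimistic (lower) bound, then average.
    return np.mean(np.minimum(unclipped, clipped))
```

Note how the `min` makes the bound one-sided: a ratio of 2 with positive advantage is clipped to 1.2, while the same ratio with negative advantage is left unclipped, so harmful updates are never hidden by clipping.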
The leader solves the following bi-level optimization problem, referred to as the Stackelberg objective:
$$\max_{\theta^L} J^L\big(\theta^L, \theta^{F*}(\theta^L)\big) \quad \text{s.t.} \quad \theta^{F*}(\theta^L) = \arg\max_{\theta^F} J^F(\theta^L, \theta^F) \tag{1}$$
where $\theta^{F*}(\theta^L)$ denotes the follower's best response. The leader's gradient can be written as
$$\nabla_{\theta^L} J^L\big(\theta^L, \theta^{F*}(\theta^L)\big) = \underbrace{\nabla_{\theta^L} J^L(\theta^L, \theta^F)}_{\text{direct gradient}} + \underbrace{\nabla_{\theta^L}\theta^{F*}(\theta^L)^\top\, \nabla_{\theta^F} J^L(\theta^L, \theta^F)}_{\text{indirect gradient via influencing follower}} \tag{2}$$
The first term, the direct gradient, captures the steepest direction along which the leader can directly improve its objective. The second term, the indirect gradient via influencing the follower, captures the indirect strategic effect: it characterizes the direction along which the leader can further improve its objective by influencing the follower's response. In particular, the term $\nabla_{\theta^L}\theta^{F*}(\theta^L)$ captures how the follower's optimal decision changes in response to variations in the leader's decision. The Jacobian $\nabla_{\theta^L}\theta^{F*}(\theta^L)$ follows from the first-order optimality condition of the follower's maximization problem, $\nabla_{\theta^F} J^F(\theta^L, \theta^{F*}) = 0$, which indicates that $\theta^{F*}(\theta^L)$ is an implicit function of the leader's variable $\theta^L$. This yields
$$\nabla_{\theta^L}\theta^{F*}(\theta^L)^\top = -\nabla_{\theta^L\theta^F} J^F(\theta^L, \theta^F)\, \big(\nabla^2_{\theta^F} J^F(\theta^L, \theta^F)\big)^{-1} \tag{3}$$

Figure 2: Illustration of the phase-separated Stackelberg Markov Game for morphology-control co-design. In the leader phase (blue part), the agent incrementally edits the morphology via discrete topology-altering actions, producing a terminal morphology $s^L_T$. In the follower phase (green part), the control policy is optimized based on this morphology.
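Equations (2) and (3) can be checked on a toy bi-level problem with a closed-form best response. The sketch below (a hypothetical quadratic example of our own construction, not from the paper) applies the implicit-function formula and recovers the analytic total derivative:

```python
import numpy as np

# Toy bi-level problem (scalar, for illustration only):
#   follower: J_F(x, y) = -0.5 * (y - a*x)**2   ->  best response y*(x) = a*x
#   leader:   J_L(x, y) = -(x - 1)**2 + y
a = 3.0

def stackelberg_grad(x):
    y = a * x                        # follower best response y*(x)
    dJL_dx = -2.0 * (x - 1.0)        # direct gradient term in eq. (2)
    dJL_dy = 1.0                     # dJ_L/dy evaluated at (x, y*(x))
    d2JF_dxdy = a                    # cross-derivative of J_F
    d2JF_dy2 = -1.0                  # follower "Hessian" (scalar, here < 0)
    dy_dx = -d2JF_dxdy / d2JF_dy2    # implicit-function theorem, eq. (3)
    return dJL_dx + dy_dx * dJL_dy   # total Stackelberg gradient, eq. (2)
```

Since $J^L(x, y^*(x)) = -(x-1)^2 + a x$ here, the analytic derivative is $-2(x-1) + a$, which matches what the implicit-differentiation route returns.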
This implicit differentiation framework, often referred to as Stackelberg implicit differentiation (SID), is widely used in Stackelberg reinforcement learning (e.g., Stackelberg DDPG (Yang et al., 2023)), bi-level optimization (Zucchet & Sacramento, 2022), and meta-learning (Pan et al., 2023).

Morphology-Control Co-Design. Morphology-control co-design aims to jointly optimize a robot's morphology (body structure) and its control policy. The morphology defines the robot's structural and physical properties, such as topology (limb and joint connectivity), geometry (body proportions, limb lengths), and material properties for soft robots. The controller determines the robot's actions based on its body, using motor commands, torque signals, or higher-level behaviors.

The process unfolds in two stages. First, the morphology is constructed incrementally through a sequence of step-by-step editing actions (e.g., adding/removing limbs, adjusting lengths, attaching joints), rather than producing a complete morphology in a single step. This incremental approach is necessary due to the high-dimensional and combinatorial morphology space, which makes one-shot generation intractable. Formally, starting from an initial morphology $s^L_0$, morphology-editing actions $a^L_t$ are applied sequentially until a terminal morphology $s^L_T$ is obtained:
$$a^L_t \sim \pi^L(\cdot \mid s^L_t), \quad s^L_{t+1} \sim P^L(\cdot \mid s^L_t, a^L_t), \quad t = 0, 1, \cdots, T-1.$$
The transition function $P^L$ updates the morphology through discrete topology-altering operations, making the morphology dynamics non-differentiable. Next, the control policy is optimized based on the generated terminal morphology $s^L_T$. The morphology defines the controller's action space (e.g., available joints and actuators), state space (e.g., joint positions, forces, velocities), and underlying dynamics.
At each timestep, given a state $s^F_t$, the controller applies a control action $a^F_t$ and transitions to a new state:
$$a^F_t \sim \pi^F(\cdot \mid s^F_t; s^L_T), \quad s^F_{t+1} \sim P^F(\cdot \mid s^F_t, a^F_t; s^L_T), \quad t = T, T+1, \cdots.$$
The overall process is illustrated in Figure 2.

Existing works (Lu et al., 2025; Yuan et al., 2022) commonly model the co-design problem as a bi-level optimization structure, similar to the formulation in eq. (1), where $\theta^L$ and $\theta^F$ represent the parameters of the morphology policy $\pi^L_{\theta^L}$ and the controller policy $\pi^F_{\theta^F}$, respectively. However, to simplify implementation, these approaches typically adopt a single-level shared objective, given by
$$\max_{\theta^L, \theta^F} J^{\mathrm{shared}}(\theta^L, \theta^F) = \mathbb{E}\left[\sum_{t=T}^{\infty} \gamma^{t-T} R^F(s^F_t, a^F_t; s^L_T)\right], \tag{4}$$
where $R^F(s^F_t, a^F_t; s^L_T)$ measures the controller's performance under the specified morphology, such as locomotion speed or task success. This shared-objective formulation treats the controller parameters $\theta^F$ as jointly optimized with $\theta^L$, ignoring that $\theta^{F*}(\theta^L)$ is implicitly determined by the morphology. Consequently, the leader update includes only the direct gradient term and omits the implicit response term, which may steer morphology optimization in a direction misaligned with the controller's best response.

4 A STACKELBERG PERSPECTIVE ON CO-DESIGN

In this section, we formalize morphology-control co-design from a Stackelberg game-theoretic perspective. We first introduce asymmetric objectives for the morphology designer and the controller, and then cast their interaction as a leader-follower Stackelberg game.

Asymmetric Objectives. In contrast to classical approaches that adopt a single shared objective (eq. 4) (Lu et al.
, 2025), we introduce asymmetric objectives for morphology and control:
$$J^L(\theta^L, \theta^F) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t R^L(s^L_t, a^L_t) + \sum_{t=T}^{\infty} \gamma^{t-T} R^F(s^F_t, a^F_t; s^L_T)\right] \tag{5}$$
The reward $R^L(s^L_t, a^L_t)$ provides immediate feedback for morphology-editing actions, typically reflecting structural costs such as material usage or design complexity. The morphology objective therefore combines immediate morphology-editing rewards with long-term control performance. The control objective focuses solely on maximizing its long-term return, conditioned on the terminal morphology induced by $\pi^L_{\theta^L}$:
$$J^F(\theta^L, \theta^F) = \mathbb{E}\left[\sum_{t=T}^{\infty} \gamma^{t-T} R^F(s^F_t, a^F_t; s^L_T)\right] \tag{6}$$
This formulation makes the asymmetry explicit: control adapts to a fixed morphology, while morphology optimizes both its structural objectives and downstream control performance.

Stackelberg Formulation. Although existing work formulates co-design as a bi-level optimization problem, it typically treats the control policy as fixed when optimizing the morphology (see eq. 4) (Lu et al., 2025; Yuan et al., 2022), thereby failing to capture adaptive control dynamics. We revisit this structure from a Stackelberg game-theoretic perspective, where the morphology policy acts as the leader and the control policy acts as the follower. This perspective emphasizes strategic anticipation: the leader commits to a decision while accounting for the follower's rational response.

This perspective naturally gives rise to Stackelberg Implicit Differentiation (SID), leading to different optimization behavior by incorporating the follower's adaptive dynamics into morphology optimization (see eq. 2). However, applying SID to morphology-control co-design is non-trivial due to two intrinsic characteristics: a) a phase-separated interaction structure and b) a non-differentiable leader-follower interface, both of which obstruct gradient propagation across the interface.
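Before formalizing these two characteristics, the phase-separated interaction can be sketched as a two-phase rollout. The following toy loop is illustrative only; `env` and its `reset_morphology`, `edit`, `reset_control`, and `step` methods are hypothetical interfaces of our own, not from the paper's code:

```python
def phase_separated_rollout(leader_policy, follower_policy, env, T=5, H=20):
    """Toy rollout of a phase-separated Stackelberg Markov Game.

    Phase 1: the leader edits the morphology for T steps.
    Phase 2: the follower controls the finished body for H steps.
    """
    s_L = env.reset_morphology()
    leader_traj = []
    for _ in range(T):                       # leader phase: discrete edits
        a_L = leader_policy(s_L)
        s_L_next, r_L = env.edit(s_L, a_L)   # non-differentiable transition P^L
        leader_traj.append((s_L, a_L, r_L))
        s_L = s_L_next
    s_LT = s_L                               # terminal morphology: the interface

    s_F = env.reset_control(s_LT)
    follower_traj = []
    for _ in range(H):                       # follower phase: control on s_LT
        a_F = follower_policy(s_F, s_LT)
        s_F, r_F = env.step(s_F, a_F, s_LT)
        follower_traj.append((s_F, a_F, r_F))
    return leader_traj, s_LT, follower_traj
```

The key structural point is that the only object crossing from phase 1 to phase 2 is `s_LT`, produced by discrete edits, which is exactly why no gradient can flow back through it.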
a) Phase-separated interaction. Unlike standard Stackelberg Markov Games (SMGs) in which the leader and follower alternate actions (Li et al., 2020), the interaction in the co-design problem is phase-separated: the leader acts for $T$ steps, after which the follower acts for the remaining horizon. Formally, we define this structure as a Phase-Separated Stackelberg Markov Game.

Definition 1. A Phase-Separated SMG between a leader policy $\pi^L_{\theta^L}$ and a follower policy $\pi^F_{\theta^F}$, parameterized by $\theta^L$ and $\theta^F$ respectively, is defined as $\mathcal{G} = \big((\mathcal{S}^L, \mathcal{A}^L, P^L, R^L, \mu^L, T), (\mathcal{S}^F, \mathcal{A}^F, P^F, R^F, \mu^F), \gamma\big)$. (i) The leader's components are given by its state space $\mathcal{S}^L$, action space $\mathcal{A}^L$, transition function $P^L$, reward function $R^L$, initial state distribution $\mu^L$, and acting horizon $T$. (ii) The leader-follower interaction is phase-separated (i.e., non-alternating): the leader first acts for $T$ steps, producing a terminal state $s^L_T$, after which the follower acts until termination. (iii) The leader and follower interact through the terminal state $s^L_T \in \mathcal{S}^L$, induced by the leader's action sequence under the transition dynamics $P^L$. The follower acts conditioned on the terminal state $s^L_T$, and all its components $(\mathcal{S}^F, \mathcal{A}^F, P^F, R^F, \mu^F)$ are defined conditionally on $s^L_T$. (iv) The leader aims to solve the Stackelberg objective defined in eq. (1).

b) Non-differentiable interfaces. In morphology-control co-design, the resulting Phase-Separated SMG exhibits an inherently non-differentiable leader-follower interface. Specifically, the leader's transition function $P^L$ updates the morphology through discrete topology-altering actions, so the terminal state $s^L_T$ linking the leader and follower does not permit direct gradient propagation from the follower to the leader.

Value Functions. For subsequent analysis, we introduce value functions induced by the asymmetric objectives.
Analogous to standard RL, the leader's Q-function, induced from its objective in eq. (5), is defined as
$$Q^L_{\pi^L, \pi^F}\big(s^L_{t'}, a^L_{t'}\big) = \mathbb{E}\left[\sum_{t=t'}^{T-1} \gamma^{t-t'} R^L(s^L_t, a^L_t) + \sum_{t=T}^{\infty} \gamma^{t-t'} R^F(s^F_t, a^F_t; s^L_T) \;\middle|\; s^L_{t'}, a^L_{t'}, \pi^L, \pi^F\right]$$
This Q-function captures the leader's long-term return from a given state-action pair, accounting for both its rewards before morphology finalization and the follower's rewards conditioned on the final morphology. From this, the leader's advantage function is defined as
$$A^L_{\pi^L, \pi^F}\big(s^L_{t'}, a^L_{t'}\big) = Q^L_{\pi^L, \pi^F}\big(s^L_{t'}, a^L_{t'}\big) - \mathbb{E}_{a^L_{t'} \sim \pi^L}\left[Q^L_{\pi^L, \pi^F}\big(s^L_{t'}, a^L_{t'}\big)\right].$$
The advantage function measures how much better (or worse) a specific action $a^L_{t'}$ is compared to the leader's average behavior at state $s^L_{t'}$. A follower's advantage function $A^F_{\pi^F}\big(s^F_{t'}, a^F_{t'}; s^L_T\big)$ can be defined analogously from its objective in eq. (6).

5 METHOD

In this section, we present how to realize Stackelberg Implicit Differentiation (SID; see Section 3) within the phase-separated Stackelberg Markov Game (SMG) framework introduced in Section 4. This enables us to account for the follower's adaptive response during morphology optimization, which is typically ignored in prior approaches (Lu et al., 2025; Yuan et al., 2022).

A direct way to implement SID is via backpropagation-through-interface, which propagates gradients from the follower to the leader through the leader-follower interface, as in Stackelberg MADDPG (Yang et al., 2023). However, as highlighted in Section 4, the phase-separated SMG formulation of the co-design problem exhibits a non-differentiable leader-follower interface and a temporally separated interaction structure, both of which complicate gradient propagation across the interface. These challenges require new derivations of the Stackelberg gradients, as presented below.
5.1 Stackelberg Policy Gradient

We now introduce Stackelberg implicit differentiation into our phase-separated Stackelberg Markov Game defined in Definition 1. Since this formulation departs from the classical SMG, we develop new derivations for all gradient components in eqs. (2) and (3). We present each term in turn.

Cross-Derivative $\nabla_{\theta^L\theta^F} J^F(\theta^L, \theta^F)$ (see eq. (3)). This is the most challenging term. Unlike classical SMGs where the follower directly takes the leader's action as input, in our setting the interface is the terminal state $s^L_T$, generated through the transition $P^L$. Backpropagation-through-interface methods (e.g., Stackelberg MADDPG) are infeasible here, since reaching $\theta^L$ would require differentiating through the non-differentiable transition $P^L$. Instead, we derive the cross-derivative using the log-derivative technique, analogous to the stochastic policy gradient (Sutton, 1984; Williams, 1992; Sutton et al., 1999), which bypasses the transition's non-differentiability while relying only on sampled trajectories. Let $(\theta^L_o, \theta^F_o)$ denote the parameters of the behavior policies used for collecting data. Formally, we obtain the following theorem.

Theorem 1. We define the surrogate
$$L^{F}_{L,F}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big) = c\, \mathbb{E}\left[\frac{\pi^L_{\theta^L}\big(a^L \mid s^L\big)}{\pi^L_{\theta^L_o}\big(a^L \mid s^L\big)}\, \gamma^T\, \mathbb{E}\left[\frac{\pi^F_{\theta^F}\big(a^F \mid s^F; s^L_T\big)}{\pi^F_{\theta^F_o}\big(a^F \mid s^F; s^L_T\big)}\, A^F_{\pi^F_{\theta^F_o}}\big(s^F, a^F; s^L_T\big)\right]\right] \tag{7}$$
Then, we have $\nabla_{\theta^L\theta^F} J^F(\theta^L, \theta^F)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o} = \nabla_{\theta^L\theta^F} L^{F}_{L,F}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o}$.

In eq. (7), the outer expectation is taken over $s^L \sim d^L_{\theta^L_o}$, $a^L \sim \pi^L_{\theta^L_o}$, $s^L_T \sim d^{L,T}_{\theta^L_o}$, where $d^{L,t}_{\theta^L_o}(s^L) = P(s^L_t = s^L; \pi^L_{\theta^L_o})$ is the visitation probability of the leader policy at step $t$, and $d^L_{\theta^L_o}(s^L) \triangleq \frac{1}{T}\sum_t d^{L,t}_{\theta^L_o}(s^L)$.
The inner expectation is taken over $s^F \sim d^F_{\theta^F_o}(\cdot\,; s^L_T)$, $a^F \sim \pi^F_{\theta^F_o}(\cdot\,; s^L_T)$, where $d^F_{\theta^F_o}$ denotes the follower's visitation distribution. The constant $c = T/(1-\gamma)$ normalizes the distribution, and its effect can be absorbed by the learning rate in practice. Proofs of this and subsequent theorems are provided in Appendix B. This theorem shows that the cross-derivative can be expressed as an expectation involving likelihood-ratio (importance-weighted) advantage estimators, thereby extending the classical policy gradient theorem to capture leader-follower coupling in our phase-separated SMG.

First-Order Derivatives $\nabla_{\theta^L} J^L(\theta^L, \theta^F)$ and $\nabla_{\theta^F} J^L(\theta^L, \theta^F)$ (see eq. 2). These first-order terms are relatively straightforward, as they follow the same structure as the policy gradient theorem (Sutton, 1984; Williams, 1992; Sutton et al., 1999). They quantify how the leader's objective changes with respect to its own parameters (the leader's direct gradient) and with respect to the follower's parameters. Both can be expressed using advantage functions under importance weighting, as follows.

Proposition 1. We have
$$\nabla_{\theta^L} J^L(\theta^L, \theta^F)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o} = \nabla_{\theta^L} L^{L}_{L}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o}$$
$$\nabla_{\theta^F} J^L(\theta^L, \theta^F)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o} = \nabla_{\theta^F} L^{L}_{F}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big)\big|_{\theta^L=\theta^L_o,\, \theta^F=\theta^F_o}$$
where
$$L^{L}_{L}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big) = \mathbb{E}\left[\frac{\pi^L_{\theta^L}\big(a^L \mid s^L\big)}{\pi^L_{\theta^L_o}\big(a^L \mid s^L\big)}\, A^L_{\pi^L_{\theta^L_o}, \pi^F_{\theta^F_o}}\big(s^L, a^L\big)\right]$$
$$L^{L}_{F}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big) = \mathbb{E}\left[\gamma^T\, \frac{\pi^F_{\theta^F}\big(a^F \mid s^F; s^L_T\big)}{\pi^F_{\theta^F_o}\big(a^F \mid s^F; s^L_T\big)}\, A^F_{\pi^F_{\theta^F_o}}\big(s^F, a^F; s^L_T\big)\right] \tag{8}$$

Inverse of the Second-Order Derivative (Hessian) $\big(\nabla^2_{\theta^F} J^F(\theta^L, \theta^F)\big)^{-1}$ (see eq. 3). This last component involves the inverse Hessian.
Although the Hessian can be computed from the derived loss function (see Appendix Proposition 2), it is typically indefinite due to the advantage term, making its inversion unstable. A standard remedy is to approximate it by the Fisher information matrix,
$$F(\theta^F) = \mathbb{E}\left[\nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F \mid s^F; s^L_T)\, \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F \mid s^F; s^L_T)^\top\right],$$
which is positive semi-definite and can be estimated via the KL divergence between policies:
$$F(\theta^F) = \nabla^2_{\theta^F} L^F_{\mathrm{KL}}(\theta^L, \theta^F; \theta^L_o, \theta^F_o) = \nabla^2_{\theta^F}\, \mathbb{E}\left[\mathrm{KL}\big(\pi^F_{\theta^F}(\cdot \mid s^F; s^L_T)\, \big\|\, \pi^F_{\theta^F_o}(\cdot \mid s^F; s^L_T)\big)\right]. \tag{9}$$
This natural-gradient approximation, used in methods such as natural policy gradient and trust region policy optimization (Kakade, 2001; Schulman et al., 2015), avoids indefiniteness and improves stability. Further stability is obtained by regularizing the Hessian with a small multiple of the identity, $\big(\nabla^2_{\theta^F} L^F_{\mathrm{KL}} + \lambda I\big)^{-1}$ with $\lambda > 0$, which has been shown to interpolate between the standard policy gradient (when $\lambda \to \infty$) and the Stackelberg gradient (when $\lambda \to 0$) (Yang et al., 2023).

5.2 Algorithms

Based on the surrogate functions in eqs. (7) to (9), we compute the leader's Stackelberg gradient in eq. (2). Since these surrogates are locally equivalent to the true Stackelberg gradients, we adopt the likelihood-ratio clipping technique from PPO (Schulman et al., 2017) to constrain policy divergence and ensure stable optimization. Note that this application is not a simple reuse of PPO clipping. Rather, it is grounded in our local-approximation theory on the newly derived Stackelberg surrogate (see Theorem 1). Moreover, the expectation terms are estimated from sampled trajectories.
This yields sample-based surrogates with PPO clipping, denoted by $\widehat{L}$, and the corresponding estimate of the leader's Stackelberg gradient can be expressed as
$$\nabla_{\theta^L} \widehat{J}^L\big(\theta^L, \theta^{F*}(\theta^L)\big) = \nabla_{\theta^L} \widehat{L}^{L}_{L} - \nabla_{\theta^L\theta^F} \widehat{L}^{F}_{L,F}\, \underbrace{\overbrace{\big(\nabla^2_{\theta^F} \widehat{L}^{F}_{\mathrm{KL}} + \lambda I\big)^{-1}\, \nabla_{\theta^F} \widehat{L}^{L}_{F}}^{\text{step 1}}}_{\text{step 2}} \tag{10}$$

Figure 3: Performance curves with respect to the number of follower steps during training. Panels: Crawler, Cheetah, Swimmer, Walker-Hard, Glider-Hard, TerrainCrosser, Pusher, Stepper-Regular, and Stepper-Hard; methods compared: Stackelberg PPO (ours), BodyGen, Transform2Act, NGE, and ESS. Shaded regions denote standard error across seven random seeds.

We refer to this overall procedure as Stackelberg PPO, which integrates PPO-style clipping into the Stackelberg gradient computation. We first compute step 1 in the above equation, which can be efficiently implemented using the conjugate gradient method. Conjugate gradient only requires Hessian-vector products, which can be obtained without explicitly constructing the Hessian via Pearlmutter's method (Pearlmutter, 1994; Møller, 1993): $\nabla^2_\theta L(\theta)\, v = \nabla_\theta\big(\nabla_\theta L(\theta)^\top v\big)$. We then compute step 2, which in turn only requires Jacobian-vector products. These can likewise be computed efficiently without explicitly forming the full Jacobian by using the Jacobian-vector product operation provided by automatic differentiation frameworks.
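Step 1 above is a damped linear solve that needs only Hessian-vector products. A minimal NumPy conjugate-gradient sketch follows (illustrative; the `hvp` callback stands in for Pearlmutter-style products from an autodiff framework, and the paper's actual implementation details may differ):

```python
import numpy as np

def conjugate_gradient(hvp, g, lam=1.0, iters=50, tol=1e-10):
    """Solve (H + lam*I) x = g using only Hessian-vector products.

    hvp(v) returns H @ v; the matrix H is never formed explicitly.
    Assumes H + lam*I is symmetric positive definite.
    """
    x = np.zeros_like(g)
    r = g.copy()               # residual g - (H + lam*I) x, with x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = hvp(p) + lam * p  # damped Hessian-vector product
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Step 2 then applies one Jacobian-vector product to the returned vector, so neither the Fisher matrix nor the cross-derivative Jacobian ever needs to be materialized.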
6 EXPERIMENTS

Our goal is to test whether leveraging Stackelberg implicit differentiation to regularize the leader's gradient can improve sample efficiency and final performance. All experiments are conducted on MuJoCo-based morphology-control co-design tasks. We adopt benchmarks from prior work, including five flat-terrain tasks (Crawler, Cheetah, Swimmer, Glider, Walker) and one complex-terrain task (TerrainCrosser) (Lu et al., 2025). To further evaluate performance under more challenging conditions, we introduce two new tasks, Stepper-Regular and Stepper-Hard, where the agent must climb stair-like structures. These tasks require morphologies capable of effective climbing in addition to robust control. To test generality beyond locomotion, we also include a contact-rich 3D manipulation task, Pusher, designed to evaluate whether co-design methods can evolve structures aligned with manipulation objectives. Additional results on other tasks are provided in Appendix C due to space constraints. In all environments, morphologies are represented as tree structures with constraints on depth, branching factor, and joint degrees of freedom. While structural complexity and terrain difficulty vary, the reward function consistently emphasizes forward velocity, ensuring fair comparisons of how different methods balance morphology and control. Each algorithm is evaluated with seven random seeds per task.

Figure 4: (a) Evolved morphologies. Ablation studies on (b) a λ sweep from 0 to ∞ and (c) the Fisher information matrix on/off (Fisher gradient vs. analytic gradient), both evaluated on the Stepper-Regular task.
All reported learning curves show mean values with shaded areas representing standard deviations. Further details and visualizations of the environments are provided in Appendix C.

6.1 COMPARISON WITH BASELINES

We implement our Stackelberg PPO on top of BodyGen (Lu et al., 2025), a PPO-based framework that employs transformer-based co-design with graph-aware positional encodings, optimizing morphology and control independently under shared rewards. BodyGen serves as our primary baseline, with the only modification being the use of Stackelberg policy gradients. Implementation details are provided in Appendix C. In addition to BodyGen, we compare Stackelberg PPO against several advanced co-design methods:

• Evolutionary Structure Search (ESS) (Sims, 1994): A canonical evolutionary-algorithm approach to robot design, where candidate morphologies are scored by handcrafted fitness functions. Here we instead use a lightweight RL-based training loop for principled evaluation.
• Neural Graph Evolution (NGE) (Wang et al., 2019): Evolutionary search over graph-structured morphologies with GNN controllers, each generation training the inherited parent controller.
• Transform2Act (Yuan et al., 2022): Concurrent RL co-design using separate GNNs for morphology and control within unified PPO training, with joint-specific MLP heads for universal control.

Figure 3 presents the learning curves across all environments. Stackelberg PPO consistently achieves the best performance, yielding an average +20.66% improvement over the strongest baseline. Compared to evolutionary approaches (ESS, NGE), it attains substantially higher sample efficiency by avoiding the costly rollouts required to evaluate each morphological candidate. Relative to the vanilla gradient method without Stackelberg differentiation (BodyGen), Stackelberg PPO achieves superior results in both sample efficiency and final performance.
The advantage is most evident on challenging 3D tasks with large design spaces (Crawler, Stepper-Regular, Stepper-Hard, Pusher), where our method delivers an average +32.02% improvement. Figure 4(a) showcases examples of the evolved creatures generated by our method. Additional morphology examples and evolution processes are provided in Appendices E.1 and E.5.

6.2 ABLATION STUDIES

We conduct ablation studies to validate key components of Stackelberg PPO, including (1) the regularization parameter λ that controls gradient interpolation (eq. 10); and (2) the Fisher gradient approximation of the Hessian for stability (eq. 9).

Figure 5: (a) Reward curves and (b) KL-divergence traces for different clipping thresholds ϵ ∈ {0.1, 0.2, 0.4, 0.6, 0.8} and with no clipping. (c) Performance comparison under varying leader horizons T. All evaluated on Stepper-Regular.

    T     Stackelberg PPO       BodyGen
    3     6188.99 ± 681.06      3663.06 ± 571.30
    5     7215.20 ± 449.02      4685.94 ± 645.23
    7     8260.74 ± 148.58      6879.60 ± 175.41
    9     6739.51 ± 631.35      3375.11 ± 486.54
    11    6874.34 ± 604.42      3216.77 ± 657.61

Regularization Parameter λ (eq. 10). The parameter λ interpolates between pure Stackelberg gradients and standard policy gradients. We evaluate Stackelberg PPO on the Stepper-Regular environment with λ ∈ {0.0, 0.5, 1.0, 5.0, 10.0, ∞}, where λ = 0 corresponds to no regularization and λ = ∞ reduces to the vanilla gradient without Stackelberg differentiation. Figure 4(b) shows robust performance for λ ∈ [0.5, 10], with degradation only at the extremes (λ = 0 or ∞).
This highlights both the robustness of the method to λ values and the necessity of regularization.

Hessian Computation (eq. 9). We compare our Fisher approximation with direct analytic second-order gradients (eq. 11). As shown in Figure 4(c), the Fisher approximation achieves stable learning with nearly twice the performance of the analytic gradient (6000 vs. 2500). This improvement arises from the positive semi-definiteness of the Fisher matrix, which avoids the numerical instabilities caused by the indefinite raw Hessian.

Sensitivity to PPO Clipping Threshold ϵ. We evaluate the sensitivity of Stackelberg PPO to the clipping parameter ϵ by sweeping over multiple thresholds and measuring the effect on task performance and KL-divergence stability. Figures 5(a) and (b) show that moderate clipping (e.g., ϵ ≤ 0.4) yields stable learning with low KL divergence, while removing clipping causes rapid KL growth and clear performance degradation. Full quantitative results are reported in Appendix E.3.

Leader Horizon T (eq. 7). We examine how the leader horizon T affects structural optimization. As shown in Figure 5(c), larger horizons generally improve performance by enabling richer morphology edits, while overly large values (e.g., T = 11) become harder to optimize and cause mild degradation, yet still outperform very small horizons such as T = 3. Importantly, increasing T does not introduce higher variance in the leader-gradient update of eq. 7 relative to the BodyGen baseline, confirming that the Stackelberg update remains stable across a wide range of horizon lengths.

7 CONCLUSIONS

We introduced Stackelberg Proximal Policy Optimization (Stackelberg PPO), a reinforcement learning framework grounded in the Stackelberg game paradigm, which explicitly captures the leader–follower coupling between high-level design decisions and adaptive control responses. While this formulation is general, we instantiate it in the context of morphology–control co-design, where the leader specifies the body structure and the follower adapts the control policy. Instead of treating design and control as independent, Stackelberg PPO exploits the leader–follower coupling to anticipate how the follower will adapt, enabling the leader to update its policy toward morphologies that are more compatible with downstream control. Experiments demonstrate that this coupling yields superior performance and stability over standard PPO, particularly on complex locomotion tasks where tight coordination between morphology and control is essential.

Despite these promising results, several avenues remain for future work. A key direction is sim-to-real transfer, which remains challenging due to unmodeled hardware constraints and material dynamics. Bridging this gap could enable the real-world deployment of self-evolving robotic systems. We further envision advances in this area leading to truly adaptive artificial life forms capable of self-directed evolution, reshaping our understanding of intelligence, embodiment, and the boundary between designed and evolved systems.

REPRODUCIBILITY STATEMENT

The demonstrations and open-source code are available at: https://yanningdai.github.io/stackelberg-ppo-co-design. The computational requirements, hyperparameters, and key implementation details are provided in Appendix D.

ACKNOWLEDGEMENTS

We gratefully acknowledge the insightful comments from the ICLR 2026 reviewers. The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST) - Center of Excellence for Generative AI, under award number 5940. This work was additionally supported by the European Research Council (ERC, Advanced Grant Number 742870).
For computer time, this research used Ibex, managed by the Supercomputing Core Laboratory at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia.

ETHICS STATEMENT

As fundamental AI research and to the best of the authors' knowledge, there are no clear ethical risks associated with this work beyond the risks already posed by prior work.

REFERENCES

Yu Bai, Chi Jin, Huan Wang, and Caiming Xiong. Sample-efficient learning of Stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems (NeurIPS), 34:25799–25811, 2021.

Dylan Banarse, Yoram Bachrach, Siqi Liu, Guy Lever, Nicolas Heess, Chrisantha Fernando, Pushmeet Kohli, and Thore Graepel. The body is not a given: Joint agent policy learning and morphology evolution. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS'19), pp. 1134–1142, 2019.

Tamer Başar and Geert Jan Olsder. Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics, 1998.

Josh Bongard, Victor Zykov, and Hod Lipson. Resilient machines through continuous self-modeling. Science, 314(5802):1118–1121, 2006.

Runfa Chen, Ling Wang, Yu Du, Tianrui Xue, Fuchun Sun, Jianwei Zhang, and Wenbing Huang. Subequivariant reinforcement learning in 3D multi-entity physical environments. arXiv preprint arXiv:2407.12505, 2024.

Nick Cheney, Josh Bongard, Vytas SunSpiral, and Hod Lipson. Scalable co-optimization of morphology and control in embodied machines. Journal of the Royal Society Interface, 15(143):20170937, 2018.

Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 82–90, 2006.

Heng Dong, Junyu Zhang, Tonghan Wang, and Chongjie Zhang. Symmetry-aware robot design with structured subgroups. In International Conference on Machine Learning (ICML), pp. 8334–8355.
PMLR, 2023.

Matthias Gerstgrasser and David C. Parkes. Oracles & followers: Stackelberg equilibria in deep multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), pp. 11213–11236. PMLR, 2023.

Agrim Gupta, Silvio Savarese, Surya Ganguli, and Li Fei-Fei. Embodied intelligence via learning and evolution. Nature Communications, 12(1):5721, 2021.

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Kangyao Huang, Di Guo, Xinyu Zhang, Xiangyang Ji, and Huaping Liu. CompetEvo: Towards morphological evolution from competition. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2024a.

Suning Huang, Boyuan Chen, Huazhe Xu, and Vincent Sitzmann. DittoGym: Learning to control soft shape-shifting robots. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024b.

Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), volume 14, 2001.

Hsu Kao, Chen-Yu Wei, and Vijay Subramanian. Decentralized cooperative reinforcement learning with hierarchical information structure. In International Conference on Algorithmic Learning Theory (ALT), pp. 573–605. PMLR, 2022.

Debarun Kar, Fei Fang, Francesco Delle Fave, Nicole Sintov, and Milind Tambe. "A game of thrones": When human behavior models compete in repeated Stackelberg security games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 1381–1390, 2015.

Sam Kriegman, Douglas Blackiston, Michael Levin, and Josh Bongard. Scalable sim-to-real transfer of soft robot designs. In IEEE International Conference on Soft Robotics (RoboSoft), pp. 2187–2200, 2020.

Henger Li, Wen Shen, and Zizhan Zheng.
Spatial-temporal moving target defense: A Markov Stackelberg game model. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2020.

Muhan Li, David Matthews, and Sam Kriegman. Reinforcement learning for freeform robot design. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 8799–8806. IEEE, 2024.

Chun Kai Ling, Zico Kolter, and Fei Fang. Function approximation for solving Stackelberg equilibrium in large perfect information games. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5764–5772, 2023.

Hod Lipson and Jordan B. Pollack. Automatic design and manufacture of robotic lifeforms. Nature, 406(6799):974–978, 2000.

Huaping Liu, Di Guo, and Angelo Cangelosi. Embodied intelligence: A synergy of morphology, action, perception and learning. ACM Computing Surveys, 57(7):1–36, 2025.

Haofei Lu, Zhe Wu, Junliang Xing, Jianshu Li, Ruoyu Li, Zhe Li, and Yuanchun Shi. BodyGen: Advancing towards efficient embodiment co-design. In International Conference on Learning Representations (ICLR), 2025.

Rajesh K. Mishra, Deepanshu Vasal, and Sriram Vishwanath. Model-free reinforcement learning for stochastic Stackelberg security games. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 348–353. IEEE, 2020.

M. F. Møller. Exact calculation of the product of the Hessian matrix of feed-forward network error functions and a vector in O(N) time. Technical Report PB-432, Computer Science Department, Aarhus University, Denmark, 1993.

Yunian Pan, Tao Li, Henger Li, Tianyi Xu, Zizhan Zheng, and Quanyan Zhu. A first order meta Stackelberg method for robust federated learning. In ICML Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers'23), 2023.

Chandana Paul. Morphological computation: A basis for the analysis of morphology and control requirements. Robotics and Autonomous Systems, 54(8):619–630, 2006.
Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

Charles Schaff and Matthew R. Walter. N-LIMB: Neural limb optimization for efficient morphological design. arXiv preprint arXiv:2207.11773, 2022.

Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint arXiv:1511.09249, 2015.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), pp. 1889–1897. PMLR, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Karl Sims. Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '94), pp. 15–22, 1994.

Richard S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 1984.

Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), volume 12, 1999.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, 2012.

Bernhard von Stengel and Shmuel Zamir. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446–457, 2010.

Quoc-Liem Vu, Zane Alumbaugh, Ryan Ching, Quanchen Ding, Arnav Mahajan, Benjamin Chasnov, Sam Burden, and Lillian J. Ratliff. Stackelberg policy gradient: Evaluating the performance of leaders and followers.
In ICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022.

Tingwu Wang, Yuhao Zhou, Sanja Fidler, and Jimmy Ba. Neural graph evolution: Towards efficient automatic robot design. In International Conference on Learning Representations (ICLR), 2019.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Zheng Xiong, Jacob Beck, and Shimon Whiteson. Universal morphology control via contextual modulation. In International Conference on Machine Learning (ICML), pp. 38286–38300. PMLR, 2023.

Boling Yang, Liyuan Zheng, Lillian J. Ratliff, Byron Boots, and Joshua R. Smith. Stackelberg games for learning emergent behaviors during competitive autocurricula. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 5501–5507, 2023.

Ye Yuan, Yuda Song, Zhengyi Luo, Wen Sun, and Kris M. Kitani. Transform2Act: Learning a transform-and-control policy for efficient agent design. In International Conference on Learning Representations (ICLR), 2022.

Allan Zhao, Jie Xu, Mina Konaković-Luković, Josephine Hughes, Andrew Spielberg, Daniela Rus, and Wojciech Matusik. RoboGrammar: Graph grammar for terrain-optimized robot design. ACM Transactions on Graphics (TOG), 39(6):1–16, 2020.

Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, and Lillian J. Ratliff. Stackelberg actor-critic: Game-theoretic reinforcement learning algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 9217–9224, 2022.

Han Zhong, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Can reinforcement learning find Stackelberg-Nash equilibria in general-sum Markov games with myopically rational followers? Journal of Machine Learning Research, 24(35):1–52, 2023.

Nicolas Zucchet and João Sacramento.
Beyond backpropagation: Bilevel optimization through implicit differentiation and equilibrium propagation. Neural Computation, 34(12):2309–2346, 2022.

A LLM USAGE STATEMENT

We used LLMs for drafting and refining text extensively throughout the paper. LLMs were not used to develop algorithms, provide theoretical results, run experiments, or contribute in any other way to the work beyond the aforementioned writing help.

B THEORETICAL ANALYSIS

In this section, we provide the theoretical foundations of our approach. We first present the trajectory factorization in the proposed phase-separated Stackelberg Markov Game, which serves as the basis for proving all the theorems. Given the trajectory $\tau \triangleq \big(\{s^L_t, a^L_t\}_{t=0}^{T-1},\; s^L_T,\; \{s^F_t, a^F_t\}_{t=T}^{\infty}\big)$, the trajectory distribution naturally factorizes into two phases:

$$P\big(\tau; \pi^L_{\theta^L}, \pi^F_{\theta^F}\big) = \mu^L(s^L_0) \prod_{t=0}^{T-1} \pi^L_{\theta^L}(a^L_t \mid s^L_t)\, P^L(s^L_{t+1} \mid s^L_t, a^L_t) \cdot \mu^F(s^F_0; s^L_T) \prod_{t=T}^{\infty} \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, P^F(s^F_{t+1} \mid s^F_t, a^F_t; s^L_T)$$

Proof of Theorem 1. Based on the policy gradient theorem (Sutton, 1984; Williams, 1992; Sutton et al.
, 1999), we have

$$\nabla_{\theta^F} J^F(\theta^L, \theta^F) = \int P(\tau; \theta^L, \theta^F)\, \gamma^T \sum_{t=T}^{\infty} \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t \, d\tau$$

Differentiating both sides with respect to $\theta^L$ yields the cross derivative:

$$\begin{aligned}
\nabla^2_{\theta^L, \theta^F} J^F(\theta^L, \theta^F) &= \nabla_{\theta^L} \nabla_{\theta^F} J^F(\theta^L, \theta^F) \\
&= \nabla_{\theta^L} \int P(\tau; \theta^L, \theta^F) \Big[\gamma^T \sum_{t=T}^{\infty} \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big] d\tau \\
&= \int \nabla_{\theta^L} P(\tau; \theta^L, \theta^F) \Big[\gamma^T \sum_{t=T}^{\infty} \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big]^\top d\tau \\
&= \int P(\tau; \theta^L, \theta^F)\, \nabla_{\theta^L} \log P(\tau; \theta^L, \theta^F) \Big[\gamma^T \sum_{t=T}^{\infty} \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big]^\top d\tau \\
&= \int P(\tau; \theta^L, \theta^F) \Big[\sum_{t=0}^{T-1} \nabla_{\theta^L} \log \pi^L_{\theta^L}(a^L_t \mid s^L_t)\Big] \Big[\gamma^T \sum_{t=T}^{\infty} \nabla_{\theta^F} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big]^\top d\tau \\
&= \nabla_{\theta^L, \theta^F} \int P(\tau; \theta^L, \theta^F) \Big[\sum_{t=0}^{T-1} \log \pi^L_{\theta^L}(a^L_t \mid s^L_t)\Big] \Big[\gamma^T \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big]^\top d\tau
\end{aligned}$$

Evaluating this identity at the reference parameters $(\theta^L, \theta^F) = (\theta^L_o, \theta^F_o)$ gives

$$\begin{aligned}
&\nabla^2_{\theta^L, \theta^F} J^F(\theta^L, \theta^F)\big|_{\theta^L = \theta^L_o,\, \theta^F = \theta^F_o} \\
&= \nabla^2_{\theta^L, \theta^F} \int P(\tau; \theta^L_o, \theta^F_o) \Big[\sum_{t=0}^{T-1} \frac{\pi^L_{\theta^L}(a^L_t \mid s^L_t)}{\pi^L_{\theta^L_o}(a^L_t \mid s^L_t)}\Big] \Big[\gamma^T \sum_{t=T}^{\infty} \frac{\pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)}{\pi^F_{\theta^F_o}(a^F_t \mid s^F_t; s^L_T)}\, A^F_{\pi^F_{\theta^F}}(s^F_t, a^F_t; s^L_T)\Big] d\tau \\
&= \nabla^2_{\theta^L, \theta^F} \int P(\tau^L_{0:T}; \theta^L_o, \theta^F_o) \Big[\sum_{t=0}^{T-1} \frac{\pi^L_{\theta^L}(a^L_t \mid s^L_t)}{\pi^L_{\theta^L_o}(a^L_t \mid s^L_t)}\Big] \int P(\tau^F_{T:\infty}; \theta^L_o, \theta^F_o, s^L_T) \Big[\gamma^T \sum_{t'=T}^{\infty} \frac{\pi^F_{\theta^F}(a^F_{t'} \mid s^F_{t'}; s^L_T)}{\pi^F_{\theta^F_o}(a^F_{t'} \mid s^F_{t'}; s^L_T)}\, A^F_{\pi^F_{\theta^F}}(s^F_{t'}, a^F_{t'}; s^L_T)\Big] d\tau \\
&= \nabla^2_{\theta^L, \theta^F} \int d^L_{\theta^L_o}(s^L)\, \pi^L_{\theta^L_o}(a^L \mid s^L)\, d^{L,T}_{\theta^L_o}(s^L_T) \Big[\frac{\pi^L_{\theta^L}(a^L \mid s^L)}{\pi^L_{\theta^L_o}(a^L \mid s^L)}\Big] \int d^F_{\theta^F_o}(s^F; s^L_T)\, \pi^F_{\theta^F_o}(a^F \mid s^F; s^L_T) \Big[\gamma^T \frac{\pi^F_{\theta^F}(a^F \mid s^F; s^L_T)}{\pi^F_{\theta^F_o}(a^F \mid s^F; s^L_T)}\, A^F_{\pi^F_{\theta^F}}(s^F, a^F; s^L_T)\Big] d\tau \\
&= \nabla_{\theta^L, \theta^F} L^F_{L,F}\big(\theta^L, \theta^F; \theta^L_o, \theta^F_o\big)\big|_{\theta^L = \theta^L_o,\, \theta^F = \theta^F_o}
\end{aligned}$$

Proof of Proposition 1. The result follows directly by applying the likelihood-ratio trick in the same way as the standard proof of the policy gradient theorem (Sutton, 1984; Williams, 1992; Sutton et al., 1999).

Proposition 2. We have

$$\nabla^2_{\theta^F} J^F(\theta^L, \theta^F) = \nabla^2_{\theta^F} \mathbb{E}_{\pi^L_{\theta^L}, \pi^F_{\theta^F}}\Big[\sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big] + \mathbb{E}_{\pi^L_{\theta^L_o}, \pi^F_{\theta^F_o}}\Big[\Big(\nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\Big) \Big(\nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big)^\top\Big] \qquad (11)$$

Proof of Proposition 2. Based on the policy gradient theorem (Sutton, 1984; Williams, 1992; Sutton et al., 1999), we have

$$\nabla_{\theta^F} J^F(\theta^L, \theta^F) = \int P(\tau; \theta^L, \theta^F)\, \nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t \, d\tau$$

Differentiating this expression again with respect to $\theta^F$, we obtain

$$\begin{aligned}
\nabla^2_{\theta^F} J^F(\theta^L, \theta^F) &= \nabla_{\theta^F} \int P(\tau; \theta^L, \theta^F)\, \nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t \, d\tau \\
&= \int P(\tau; \theta^L, \theta^F)\, \nabla^2_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t \, d\tau + \int P(\tau; \theta^L, \theta^F)\, \nabla_{\theta^F} \log P(\tau; \theta^L, \theta^F) \Big(\nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\Big)^\top d\tau \\
&= \int P(\tau; \theta^L, \theta^F)\, \nabla^2_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t \, d\tau + \int P(\tau; \theta^L, \theta^F) \Big(\nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\Big) \Big(\nabla_{\theta^F} \sum_{t=T}^{\infty} \log \pi^F_{\theta^F}(a^F_t \mid s^F_t; s^L_T)\, A^F_t\Big)^\top d\tau
\end{aligned}$$

C ENVIRONMENT DETAILS

In this section, we provide comprehensive details about the nine environments employed in our experimental evaluation. To ensure fair comparison with existing methods, we adopt all six environments from the BodyGen framework: Crawler, Cheetah, Glider, Walker, Swimmer, and TerrainCrosser.
Additionally, we introduce three novel environments designed to evaluate different aspects of our algorithm: Stepper-Regular and Stepper-Hard feature complex topographical structures to test robustness in challenging terrain, while Pusher evaluates manipulation capabilities. These additional environments are specifically crafted to assess the robustness and adaptability of our algorithm under more challenging conditions, thereby providing a more rigorous assessment of the proposed method's capabilities. Figure 6 provides visualizations of all nine environments.

Each agent undergoes dynamic morphological evolution through topological and attribute modifications during training. The observation includes the root body's spatial position and velocity, all joints' angular positions and velocities, and motor gear parameters. All joints use hinge connections enabling single-axis rotation. Joint attributes encompass bone vectors, sizes, and motor gear values. The action space consists of one-dimensional control signals applied to each joint's motor.

Crawler operates in a 3D environment, where agents exhibit quadrupedal crawling locomotion. The initial morphology consists of a central root node with four limb branches extending outward. The body tree is constrained to a maximum depth of 4 levels, with each non-root node supporting at most 2 child limbs. Episodes are terminated when the agent's body height exceeds 2.0 units to prevent unrealistic vertical extensions. The reward function encourages forward movement while penalizing excessive control effort:

$$r_t = \frac{x_{t+1} - x_t}{\tau} - w \cdot \frac{1}{N} \sum_{j \in \mathcal{J}_t} \big\|u^j_t\big\|^2 \qquad (12)$$

where $x_t$ denotes the agent's forward position at timestep $t$, $u^j_t$ represents the effective control input applied to joint $j$ at time $t$ (i.e., the raw action scaled by the joint's gear ratio), $w = 0.0001$ is the control regularization coefficient, $N$ is the total number of joints, and $\tau = 0.04$.
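The Crawler reward of eq. 12 can be sketched in a few lines. This is a minimal illustration with our own function name and example inputs, not the released environment code; `controls` is assumed to already hold the effective (gear-scaled) control input of each joint.

```python
import numpy as np

def crawler_reward(x_t, x_t1, controls, w=0.0001, tau=0.04):
    """Eq. 12: forward velocity minus a mean squared control penalty.

    x_t, x_t1 : forward position before and after the step
    controls  : effective control input u_j per joint (gear-scaled action)
    """
    forward_velocity = (x_t1 - x_t) / tau
    control_penalty = w * np.mean([u ** 2 for u in controls])
    return forward_velocity - control_penalty

# Hypothetical step: the agent advances 0.02 units with four active joints.
r = crawler_reward(0.0, 0.02, [0.5, -0.3, 0.1, 0.2])  # ≈ 0.49999
```

With w = 0.0001 the control penalty is tiny relative to the velocity term, so the reward is dominated by forward progress, as the text describes.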
Cheetah features 2D locomotion focused on fast running gaits. The agent begins with an initial design comprising a root body connected to one primary limb segment. The morphological search allows a maximum tree depth of 4 with up to 3 child limbs per node. The root body's angular orientation is constrained within 20 degrees to maintain a stable running posture. Episode termination occurs when body height falls below 0.7 or exceeds 2.0 units. The reward follows the velocity-based formulation:

$$r_t = \frac{x_{t+1} - x_t}{\tau} \qquad (13)$$

where $\tau = 0.008$.

Glider and Walker enable 2D aerial and terrestrial locomotion. Both environments share the same base morphology: agents start from an initial configuration with three limb segments attached to a central root, each limb node can support up to three children, and joints can oscillate within a 60° range to accommodate wide-range motion. The reward structure follows eq. 13, emphasizing forward displacement. The two tasks differ only in their allowable morphology depth: Glider restricts the body tree to a maximum depth of 3, while Walker permits up to 4 levels.

Swimmer enables undulatory, snake-like locomotion in a 2D aquatic environment. The agent evolves in water with a viscosity coefficient of 0.1. The initial morphology consists of a root body connected to a single limb segment, and each limb node may support up to three child segments, enabling flexible articulated structures suited for wave-based propulsion. This task serves as a lightweight validation environment, and thus imposes no early-termination conditions such as height limits or joint-rotation thresholds. The reward structure follows eq. 13, emphasizing forward displacement under hydrodynamic resistance.

TerrainCrosser presents a challenging 2D terrain navigation task using the Cheetah agent configuration.
The environment features fixed terrain heights with maximum elevation differences of $z_{\max} = 0.5$. Agents must traverse gaps generated from single-channel height maps. Height constraints maintain agent stability between 0.7 and 2.0 units, with violations leading to episode termination. The reward function prioritizes forward progress as defined in eq. 13.

Stepper-Regular and Stepper-Hard introduce challenging staircase navigation tasks that test agents' morphological adaptation capabilities for vertical terrain traversal. Both environments utilize the Crawler agent configuration in a 3D setting. Stepper-Regular features stairs with a step width of 1.0 units and a height of 0.4 units; Stepper-Hard increases the difficulty by elevating the step height to 0.8 units while maintaining the same width. Unlike the standard Crawler environment, height termination constraints are removed to allow full exploration of vertical climbing capabilities. The reward function follows eq. 12, focusing solely on forward progression, thereby maintaining reward consistency across environments.

Pusher is a challenging 3D manipulation task designed to evaluate whether the co-design system can generate morphologies and control strategies that effectively interact with external objects. This environment reuses the Crawler agent configuration in a 3D setting. A rigid cube of side length 1.0 m is placed in front of the agent and constrained to move in the horizontal $(x, y)$ plane. The observation space augments the agent state with the 3D relative position between the agent's root body and the cube. The reward encourages forward displacement of the cube, penalizes lateral motion, and provides an auxiliary shaping term based on the proximity between the agent and the cube. A control-effort penalty identical to eq. 12 is applied.
Formally, the reward is

$$r_t = \frac{x^{\mathrm{cube}}_{t+1} - x^{\mathrm{cube}}_t}{\tau} - \kappa \cdot \frac{\big|y^{\mathrm{cube}}_{t+1} - y^{\mathrm{cube}}_t\big|}{\tau} + \frac{1}{1 + \big\|p^{\mathrm{cube}}_t - p^{\mathrm{root}}_t\big\|} - w \cdot \frac{1}{N} \sum_{j \in \mathcal{J}_t} \big\|u^j_t\big\|^2 \qquad (14)$$

where $x^{\mathrm{cube}}_t$ and $y^{\mathrm{cube}}_t$ denote the cube's forward and lateral positions at timestep $t$, $p^{\mathrm{cube}}_t$ and $p^{\mathrm{root}}_t$ are the 3D positions of the cube and the agent's root body, and $\kappa = 0.1$ controls the lateral-motion penalty.

Figure 6: Visualization of the nine benchmark environments used in our experiments. Crawler, Stepper-Regular, Stepper-Hard, and Pusher are 3D tasks; the others are 2D (x-z or x-y plane). The environments differ substantially in required morphology depth, symmetry, and limb arrangement, enabling evaluation of the generality of morphology–control co-design across locomotion and manipulation tasks.

D IMPLEMENTATION DETAILS

D.1 COMPUTATION COST

Following standard reinforcement learning practices, we utilize distributed trajectory sampling across multiple CPU threads to enhance training efficiency. All models are trained with seven random seeds on a high-performance computing cluster equipped with dual Intel® Xeon® processors (totaling 64 cores) and 24 NVIDIA A100 GPUs. Our implementation uses PyTorch 2.0.1 for all neural network models and the MuJoCo 2.1.0 (Todorov et al., 2012) physics engine for the morphology-control simulation environments. The training process is computationally efficient, requiring approximately 30 hours per model when utilizing 10 CPU cores alongside a single NVIDIA A100 GPU across all experimental environments.
D.2 HYPERPARAMETER CONFIGURATION

Stackelberg PPO (Ours): Our method introduces several Stackelberg-specific hyperparameters that require careful tuning. We conduct a grid search over key parameters: Fisher information matrix regularization coefficient λ ∈ {0.5, 1.0, 5.0, 10.0}, maximum conjugate gradient (CG) steps ∈ {10, 20, 30}, and follower sampling steps per episode ∈ {6, 15, 30, 60, 100} during the leader update. For the underlying network architecture, we maintain the same configuration as BodyGen (Lu et al., 2025) without modification to ensure fair comparison, including their MoSAT transformer blocks and all network-related parameters. The final hyperparameter configuration, along with the underlying BodyGen network architecture we adopt, is detailed in Table 1.

BodyGen: We follow their original implementation and released code, adopting the same hyperparameter configuration as reported in their work (Lu et al., 2025). The settings include MoSAT Pre-LN normalization, SiLU activation, hidden dimension 64, policy learning rate 5e-5, value learning rate 3e-4, and other parameters as detailed in Table 1.

Transform2Act: Following the original implementation (Yuan et al., 2022), this baseline uses GraphConv layers, policy GNN size (64, 64, 64), policy learning rate 5e-5, value GNN size (64, 64, 64), value learning rate 3e-4, JSMLP activation Tanh, JSMLP size (128, 128, 128) for policy networks, and MLP size (512, 256) for value functions.

NGE: Based on the original implementation (Wang et al., 2019), this evolutionary baseline uses 125 generations, population size 20, elimination rate 0.15, with GraphConv layers, Tanh activation, policy GNN size (64, 64, 64), policy MLP size (128, 128), value GNN size (64, 64, 64), value MLP size (512, 256), policy learning rate 5e-5, and value learning rate 3e-4.

D.3 GRADIENT NORMALIZATION

Recall Eq.
2, where the Stackelberg gradient for the leader decomposes into a direct term and a response-induced term. To avoid scale imbalance between these components, we scale the response-induced term by a data-dependent factor α computed from the relative norms of the two terms (no extra hyperparameters). Let $g_{\text{dir}} := \nabla_{\theta_L} J_L(\theta_L, \theta_F)$ and $g_{\text{resp}} := \left(\nabla_{\theta_L} \theta_F^*(\theta_L)\right)^\top \nabla_{\theta_F} J_L(\theta_L, \theta_F)$. We update the leader using

$$ \widehat{\nabla}_{\theta_L} J_L = g_{\text{dir}} - \alpha\, g_{\text{resp}}, \qquad \alpha = \min\!\left(1, \frac{\| g_{\text{dir}} \|_2}{\| g_{\text{resp}} \|_2 + \varepsilon}\right), \tag{15} $$

where $\varepsilon > 0$ is a small numerical constant for stability. This rule guarantees $\alpha \| g_{\text{resp}} \|_2 \leq \| g_{\text{dir}} \|_2$, ensuring the follower-implicit component never dominates while preserving its direction. We use α = 1 across all experiments for simplicity and consistency.

Table 1: Hyperparameters of Stackelberg PPO adopted in all experiments

Hyperparameter | Value
Fisher Regularization Coefficient λ | 5.0
Maximum Conjugate Gradient Steps | 20
CG Relative Error Tolerance | 1e-3
Follower Sampling Steps per Episode | 6
Gradient Normalization Ratio α | 1.0
Structure Design Steps T_stru | 5
Attribute Design Steps T_attr | 1
Transformer Layer Normalization | Pre-LN
Transformer Activation Function | SiLU
FNN Scaling Ratio r | 4
Transformer Blocks (Policy Network) | 3
Transformer Blocks (Value Network) | 3
Transformer Hidden Dimension (Policy Network) | 64
Transformer Hidden Dimension (Value Network) | 64
Optimizer | Adam
Policy Learning Rate | 5e-5
Value Learning Rate | 3e-4
Clip Gradient Norm | 40.0
PPO Clip ϵ | 0.2
PPO Batch Size | 50000
PPO Minibatch Size | 2048
PPO Iterations Per Batch | 10
Training Epochs | 1000
Discount Factor γ | 0.995
GAE Parameter λ_GAE | 0.95

E ADDITIONAL RESULTS

E.1 VISUALIZATION AND QUALITATIVE RESULTS

Figure 7 presents the diverse morphologies discovered by our Stackelberg PPO framework across different environments.
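To make the balancing rule of Eq. 15 concrete, the following minimal sketch applies it to flattened gradient vectors (plain Python lists stand in for parameter tensors; the function and variable names are illustrative, not from our released code):

```python
import math

def balance_leader_gradient(g_dir, g_resp, eps=1e-8):
    """Combine the direct and response-induced leader gradients (Eq. 15).

    Scales the response-induced term by
        alpha = min(1, ||g_dir||_2 / (||g_resp||_2 + eps)),
    which guarantees alpha * ||g_resp||_2 <= ||g_dir||_2 while preserving
    the direction of g_resp.
    """
    norm_dir = math.sqrt(sum(x * x for x in g_dir))
    norm_resp = math.sqrt(sum(x * x for x in g_resp))
    alpha = min(1.0, norm_dir / (norm_resp + eps))
    # Leader update direction: g_dir - alpha * g_resp
    return [d - alpha * r for d, r in zip(g_dir, g_resp)], alpha
```

For example, with `g_dir` of norm 5 and `g_resp` of norm 10, α comes out to 0.5, shrinking the response-induced term to the scale of the direct term.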
The evolved body designs reveal the sophisticated structural complexity achieved by our approach, confirming that the Stackelberg game formulation enables continuous co-adaptation between morphology and control without premature convergence to suboptimal simple structures. Remarkably, these designs demonstrate emergent functional differentiation, developing specialized appendages for complementary tasks such as maintaining equilibrium versus providing propulsive forces.

As illustrated in the training curves presented in Fig. 3, we provide quantitative performance comparisons across all evaluated environments. Table 2 summarizes the final episode rewards achieved by each method, reporting mean values and standard deviations computed over seven random seeds. All baseline methods are configured with their optimal hyperparameter settings as reported in the prior literature, with detailed specifications provided in Appendix D.2.

Table 2: Performance comparison of Stackelberg PPO against baseline methods across morphology-control co-design environments. Results show mean episode rewards and standard deviations over seven random seeds.

Methods | Crawler | Cheetah | Swimmer
Stackelberg PPO (Ours) | 11047.90 ± 126.20 | 13514.94 ± 653.62 | 1334.98 ± 16.06
BodyGen (Lu et al., 2025) | 9098.72 ± 558.26 | 11575.87 ± 640.65 | 1302.64 ± 3.71
Transform2Act (Yuan et al., 2022) | 3950.80 ± 268.43 | 8297.90 ± 825.02 | 737.90 ± 21.04
NGE (Wang et al., 2019) | 1482.45 ± 524.97 | 2534.76 ± 428.68 | 384.45 ± 112.03
ESS (Sims, 1994) | 631.67 ± 122.41 | 671.67 ± 134.65 | 190.62 ± 37.84

Methods | Walker-Hard | Glider-Hard | TerrainCrosser
Stackelberg PPO (Ours) | 13612.32 ± 501.26 | 12414.50 ± 498.53 | 4488.07 ± 467.98
BodyGen (Lu et al., 2025) | 11645.89 ± 797.77 | 11049.95 ± 468.44 | 4103.25 ± 871.90
Transform2Act (Yuan et al., 2022) | 4420.63 ± 267.48 | 6120.62 ± 1086.62 | 2364.63 ± 473.80
NGE (Wang et al., 2019) | 1504.55 ± 553.15 | 2081.25 ± 348.17 | 827.15 ± 427.21
ESS (Sims, 1994) | 636.03 ± 125.74 | 541.55 ± 107.56 | 426.81 ± 168.30

Methods | Pusher | Stepper-Regular | Stepper-Hard
Stackelberg PPO (Ours) | 3462.77 ± 368.09 | 7215.20 ± 449.02 | 6003.59 ± 1027.77
BodyGen (Lu et al., 2025) | 2779.95 ± 509.18 | 4685.94 ± 845.23 | 4685.41 ± 800.09
Transform2Act (Yuan et al., 2022) | 1015.28 ± 247.09 | 2325.69 ± 664.00 | 1192.39 ± 544.20
NGE (Wang et al., 2019) | 551.57 ± 120.65 | 870.56 ± 215.45 | 509.12 ± 207.01
ESS (Sims, 1994) | 243.14 ± 95.93 | 351.02 ± 136.28 | 392.54 ± 151.83

Figure 7: Visualization of co-evolved body designs generated through our Stackelberg PPO (panels: Cheetah, Crawler, Glider, TerrainCrosser, Stepper-Regular, Stepper-Hard, Pusher, Walker, Swimmer).

E.2 EXTENDED ENVIRONMENT EVALUATION

Beyond the environments reported in the main paper, we also include results for the full sets of Glider and Walker tasks, each provided in three difficulty levels: regular, medium, and hard. In the main text we present only the hard variants, as they offer the largest design spaces and naturally encompass the easier tiers, providing a clearer view of morphology–control co-design under less restrictive structural budgets. Here we report the complete results for completeness and to illustrate how our method behaves under different morphology complexity limits. The six environments differ only in their structural allowances: Glider uses a maximum tree depth of 3 and Walker a maximum depth of 4, while the regular/medium/hard variants correspond to maximum child counts of {1, 2, 3}. All other environment settings are identical. Figure 8 presents the training curves and the final generated morphologies across all six tasks.
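As a quick sanity check on these structural budgets, the maximum number of body elements each variant admits is the size of a complete tree under the given depth and branching caps. The helper below is our own illustrative bookkeeping (it assumes the root sits at depth 0 and every node may use the full child budget), not part of the benchmark code:

```python
def max_bodies(max_depth, max_children):
    """Upper bound on body count for a morphology tree: a complete tree
    where every node up to `max_depth` has `max_children` children,
    with the root at depth 0."""
    return sum(max_children ** level for level in range(max_depth + 1))

# Walker (depth 4), regular/medium/hard child caps 1, 2, 3:
walker_bounds = [max_bodies(4, c) for c in (1, 2, 3)]  # [5, 31, 121]
# Glider (depth 3):
glider_bounds = [max_bodies(3, c) for c in (1, 2, 3)]  # [4, 15, 40]
```

Under this counting, the hard tiers admit orders of magnitude more structures than the regular tiers, which is consistent with the widening design spaces discussed above.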
Our method outperforms the baseline across all difficulty levels, with the performance gap widening as the design space becomes larger and more challenging. Interestingly, the three difficulty tiers within each environment achieve similar final performance, suggesting that overall task success is not strictly tied to structural complexity: even simpler configurations can discover diverse, correct, and high-quality locomotion patterns.

E.3 ADDITIONAL ABLATION STUDIES AND MECHANISM ANALYSIS

Quantitative Results for PPO Clipping Sensitivity. To complement the qualitative trends shown in Figure 5(a), we provide the full quantitative statistics for the clipping-sweep experiment. The purpose of this analysis is to examine how the surrogate objective behaves under different clipping thresholds and to identify when the underlying assumptions of policy-gradient theory remain valid. From a theoretical perspective, large policy updates can cause the surrogate objective to diverge from the true return, leading to instability. PPO addresses this by bounding the likelihood ratio $\pi_\theta(a|s)/\pi_{\theta_0}(a|s)$ within $[1 - \epsilon, 1 + \epsilon]$, which prevents overly aggressive updates and ensures that the surrogate remains a reliable approximation. In this experiment, we vary the clipping parameter ϵ and measure three quantities that together characterize the stability of the update rule: (1) average performance, (2) likelihood-ratio constraint violations, and (3) KL divergence. Table 3 reports the full numerical results corresponding to the curves shown in the main text.

Table 3: Sensitivity of Stackelberg PPO to the clipping threshold ϵ.

Clipping Parameter | Performance | Likelihood Ratio Violations (%) | Average KL Divergence
ϵ = 0.1 | 4934.52 ± 646.40 | 14.82 ± 0.55 | 0.0030 ± 0.0006
ϵ = 0.2 | 7215.20 ± 449.02 | 13.39 ± 0.83 | 0.0196 ± 0.0025
ϵ = 0.4 | 7907.01 ± 208.02 | 9.92 ± 0.94 | 0.0343 ± 0.0153
ϵ = 0.6 | 4778.18 ± 407.84 | 9.10 ± 1.30 | 0.0665 ± 0.0188
ϵ = 0.8 | 2656.92 ± 503.93 | 7.02 ± 0.81 | 0.1340 ± 0.0388
No Clipping | 1233.26 ± 443.98 | 0 | 1.7726 ± 0.1539

Ablation on SID Components and PPO Clipping. To further disentangle the contributions of our Stackelberg Implicit Differentiation (SID) estimator and PPO clipping, we conduct an additional controlled ablation. Specifically, we evaluate three variants under the same phase-separated, non-differentiable Stackelberg setup:

• SID+PPO (full): our complete method using both SID and PPO clipping,
• PPO-only: standard PPO updates without SID,
• SID-only: our SID estimator without PPO clipping.

This ablation assesses whether (i) our SID estimator meaningfully improves leader optimization and (ii) PPO clipping is required to stabilize the induced surrogate objectives. As shown in Table 4, both components provide clear performance gains, and the full algorithm consistently achieves the highest returns across four environments.
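The quantities swept in the clipping-sensitivity study follow directly from PPO's clipped objective. A minimal per-sample sketch (the helper names are illustrative, not from the training code):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's per-sample clipped surrogate:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def ratio_violation(ratio, eps=0.2):
    """True when the likelihood ratio pi_theta / pi_theta0 leaves the
    trust interval [1 - eps, 1 + eps]; the fraction of such samples is
    the violation rate reported in the clipping sweep."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps
```

With ϵ = 0.2, a sample whose ratio drifts to 1.5 contributes at most 1.2 · A for a positive advantage, so overly aggressive updates stop receiving extra credit; without clipping, the surrogate grows unboundedly with the ratio.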
Figure 8: Extended evaluation on Glider and Walker environments under different morphology complexity budgets. (a) Training curves (reward vs. environment steps, 0–50M) for the regular, medium, and hard variants of each environment, comparing Stackelberg PPO (ours), BodyGen, Transform2Act, NGE, and ESS. (b) Final generated morphologies under each complexity tier (Walker: max depth 4, max children per node 1/2/3; Glider: max depth 3, max children per node 1/2/3).

Figure 9: Comparison of morphology evolution between Stackelberg PPO (ours) and standard PPO (BodyGen) across four environments (Walker, Stepper-Regular, Crawler, Pusher; snapshots at 5%–100% of training). BodyGen tends to collapse early into low-complexity designs, while Stackelberg PPO continues exploring structurally richer morphologies, yielding more capable final structures.

Table 4: Ablation studies on the components of our SID estimator and PPO clipping, evaluated under the same phase-separated, non-differentiable Stackelberg setting.
Environment | SID+PPO (full) | PPO-only (no SID) | SID-only (no clipping)
Stepper-Regular | 7215.20 ± 449.02 | 4685.94 ± 845.23 | 1257.33 ± 530.25
Crawler | 11047.90 ± 126.20 | 9098.72 ± 558.26 | 35.77 ± 12.25
Cheetah | 13514.94 ± 653.62 | 11575.87 ± 640.65 | 472.89 ± 77.40
Glider | 12414.50 ± 498.53 | 11049.95 ± 468.44 | 566.81 ± 89.96

Effect of Leader Gradients on Controller Adaptation. To better understand the mechanism behind Stackelberg PPO's performance gains, we analyze how morphology updates interact with controller adaptation. Specifically, we investigate whether the improved performance originates from faster controller adaptation under changing morphologies, or from more informative leader gradients that guide the structure search more effectively. To isolate these effects, we extract ten intermediate checkpoints from a BodyGen training run (spanning 10%–100% of training progress). From each checkpoint, we initialize both methods with identical morphology, controller parameters, and optimizer state, and then train each method for a single epoch. This setup ensures that any difference in performance improvement reflects differences in the leader update rule, rather than controller initialization or long-term training.

As shown in Table 5, Stackelberg PPO consistently achieves a larger one-epoch performance improvement than standard PPO (BodyGen). This indicates that Stackelberg PPO does not rely on faster controller adaptation; instead, it provides more informative leader gradients that enable the morphology to improve even when the controller is only partially adapted. These results highlight the role of the Stackelberg update in stabilizing and accelerating the joint morphology–control optimization process.

Table 5: Average performance change after one epoch of training from the same checkpoint model, averaged over 10 checkpoints and 7 seeds, evaluated on Stepper-Regular.
Metric | Stackelberg PPO (Ours) | BodyGen (PPO)
Performance Change After 1 Epoch | +0.392 ± 0.075% | +0.224 ± 0.043%

We further provide a visual comparison of morphology evolution to illustrate this effect (Figure 9). Across multiple environments, BodyGen tends to converge early to low-complexity designs, which restricts later improvements even as the controller becomes stronger. In contrast, Stackelberg PPO continues meaningful structural exploration throughout training, enabling richer and more adaptive morphologies. These qualitative trajectories align with the adaptation results above, reinforcing that the Stackelberg update produces more informative and better-aligned structural gradients.

E.4 SAMPLE AND TRAINING EFFICIENCY

Sample Efficiency. To assess the efficiency of different co-design algorithms, we measure how many environment interaction samples are required to reach a predefined performance threshold. As reported in Table 6, Stackelberg PPO consistently converges with substantially fewer samples across all environments. On average, it reaches the threshold with approximately 39% fewer samples than BodyGen. In contrast, Transform2Act, NGE, and ESS fail to reach any threshold within the available training budget. These results highlight the advantage of explicitly modeling morphology–control coupling via a Stackelberg formulation, enabling faster convergence and more stable co-design dynamics.

Table 6: Sample efficiency comparison: number of samples (in millions) required to reach the performance threshold.

Environment | Threshold | Stackelberg PPO | BodyGen | Transform2Act | NGE | ESS
Crawler | 9000 | 25.8 | 47.2 | ∞ | ∞ | ∞
Cheetah | 11000 | 19.2 | 42.1 | ∞ | ∞ | ∞
Swimmer | 1200 | 14.8 | 17.0 | ∞ | ∞ | ∞
Walker-Hard | 10000 | 18.1 | 30.3 | ∞ | ∞ | ∞
Glider-Hard | 11000 | 23.6 | 49.7 | ∞ | ∞ | ∞
TerrainCrosser | 3500 | 23.9 | 33.8 | ∞ | ∞ | ∞
Pusher | 2500 | 29.3 | 39.1 | ∞ | ∞ | ∞
Stepper-Regular | 4500 | 18.5 | 40.4 | ∞ | ∞ | ∞
Stepper-Hard | 4500 | 27.2 | 43.1 | ∞ | ∞ | ∞

Training Efficiency. Despite incorporating a bilevel update, Stackelberg PPO introduces only modest computational overhead. The method avoids explicit Hessian construction or inversion; instead, the conjugate-gradient step relies solely on efficient Hessian–vector products (approximately one backward pass each). As a result, its cost scales linearly with morphology and controller dimensionality, rather than quadratically. Moreover, rollout collection dominates overall computation in all co-design settings, so the additional optimization cost has limited influence on total training time. Table 7 summarizes the training time under different morphology/control design spaces. Increasing the structural search space does not incur superlinear overhead, confirming the scalability of Stackelberg PPO. The comparison with ES-based approaches in Table 8 further shows that ES reduces wall-clock time only when substantial CPU parallelization is available, while its resulting designs remain far less effective than those produced by PPO-based methods. Overall, our method achieves strong efficiency–performance trade-offs:

• Compared to BodyGen, Stackelberg PPO achieves substantially better sample efficiency, requiring 39% fewer samples to reach the performance threshold while also obtaining +20.66% higher final scores. In terms of wall-clock time, the difference between the two methods is modest (+13%), keeping the overall training cost comparable.
• Compared to ES-based baselines, although ESS attains shorter wall-clock time using 6× more CPU cores (64 cores), its performance is extremely poor, reaching only a 0.16 fraction of our method's performance.

Table 7: Wall-clock training time comparison across environments with different design space sizes (10 CPU cores + A100 GPU).
Environment | Space Size (mean) | Space Size (max) | Stackelberg PPO | BodyGen (PPO)
TerrainCrosser | 4.50 ± 0.76 | 14 | 33.88 ± 0.42 | 27.87 ± 1.27
Swimmer | 5.50 ± 0.76 | 14 | 32.64 ± 0.74 | 28.13 ± 0.45
Cheetah | 6.57 ± 0.90 | 14 | 32.96 ± 0.67 | 29.52 ± 1.03
Glider-Hard | 7.33 ± 1.49 | 9 | 32.93 ± 0.71 | 28.93 ± 1.50
Walker-Hard | 8.43 ± 1.50 | 27 | 32.54 ± 0.62 | 30.21 ± 1.22
Stepper-Hard | 9.57 ± 0.90 | 29 | 32.70 ± 1.01 | 30.25 ± 2.06
Pusher | 14.33 ± 4.07 | 29 | 33.41 ± 0.82 | 29.24 ± 1.12
Stepper-Regular | 16.40 ± 4.69 | 29 | 32.83 ± 0.87 | 30.17 ± 1.41
Crawler | 18.25 ± 1.29 | 29 | 33.73 ± 0.64 | 30.54 ± 1.33

Table 8: Wall-clock training time across methods. NGE results are shown under both 10 CPU cores and 64 cores to illustrate parallelization effects.

Method | Stackelberg PPO (10 cores) | BodyGen (10 cores) | NGE (10 cores) | NGE (64 cores)
Wall-clock Time | 33.07 ± 0.49 h | 29.43 ± 0.97 h | 45.16 ± 3.72 h | 13.52 ± 1.52 h

E.5 MORPHOLOGY EVOLUTION PROCESS VISUALIZATION

Figure 10 showcases the morphological evolution trajectories discovered by our Stackelberg PPO framework across diverse locomotion tasks and environments. Each row represents a distinct embodiment (Crawler, Cheetah, Swimmer, Glider, Stepper-Regular, Stepper-Hard, TerrainCrosser, Walker, and Pusher), and the columns depict the progressive morphological changes from early evolution (5%) through convergence (100%). The evolution demonstrates emergent specialization of appendages for task-specific locomotion requirements.
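The matrix-free conjugate-gradient step described in the training-efficiency analysis of Appendix E.4 can be sketched generically: the curvature matrix is accessed only through a matrix-vector product callback, mirroring how Hessian-vector products replace explicit Hessian construction. The solver below is a textbook CG on a toy system, not our exact implementation; default step and tolerance values echo Table 1:

```python
def conjugate_gradient(mvp, b, max_steps=20, rel_tol=1e-3):
    """Solve A x = b using only matrix-vector products `mvp(v) = A v`
    (A symmetric positive definite, never formed explicitly)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # initial search direction
    rs_old = sum(v * v for v in r)
    b_norm = sum(v * v for v in b) ** 0.5 or 1.0
    for _ in range(max_steps):
        Ap = mvp(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new ** 0.5 / b_norm < rel_tol:
            break                    # relative residual small enough
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Because each iteration needs only one `mvp` call (one Hessian-vector product, roughly one extra backward pass in practice), the per-update cost grows linearly with parameter dimensionality, which is the scaling argument made above.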
Figure 10: Morphological evolution trajectories across nine environments. Each row represents a distinct robot embodiment (Crawler, Cheetah, Glider, Stepper-Regular, Stepper-Hard, TerrainCrosser, Walker, Swimmer, Pusher), with columns showing progressive stages of morphological adaptation from 5% to 100% of training progress.

E.6 RESULTS UNDER REALISTIC CO-DESIGN CONSTRAINTS

In the main paper, we adopt a unified forward-progress reward to ensure fair comparison across algorithms and to avoid introducing task-specific reward biases. While this setup is standard and suitable for benchmarking algorithmic contributions, real-world robot design is often shaped by additional engineering constraints. To better understand the practical co-design behavior of Stackelberg PPO, we further evaluate five common realistic constraints under an identical crawler task and training budget. These constraints span both morphology- and control-level considerations, including power usage, manufacturability, torque limits, payload handling, and robustness. Several of these factors are already captured by our experimental setup:

• Power usage: Energy expenditure is discouraged through a small effort penalty included in the reward (Eq. 12).
• Torque limits: Joint torque capacity is implicitly limited by bounding the "allowable torque" attribute during morphology design.
• Manufacturability: Physical realizability is enforced by constraining morphology-editing attributes such as limb length, joint count, and topology depth (see Appendix C).
• Robustness: Robustness naturally emerges from the evaluation protocol: each morphology–controller pair is scored using multiple rollouts, causing non-robust designs to yield lower averaged returns.

To complement these built-in constraints, we further provide detailed quantitative experiments that isolate and measure their individual effects.

Power Constraint. We evaluate performance under various power penalty coefficients (0.001, 0.01, 0.1), extending beyond the mild penalty (0.0001) used in the main experiments. Table 9 reports the detailed performance and control-effort statistics under each penalty coefficient. The generated morphologies are visualized in Figure 11. Increasing the penalty produces three consistent effects:

• Impact on performance (velocity reward). As the penalty coefficient increases, both methods experience reduced forward-progress reward. However, Stackelberg PPO exhibits substantially smaller degradation, maintaining stronger performance across all tested settings.
• Impact on control effort. Larger penalties encourage more conservative actuation strategies for both approaches, reflected by the lower penalty terms in the table.
• Impact on morphology–control co-design. With stronger penalties, the optimized morphologies tend to adopt shorter, thicker, and more symmetric limbs, paired with low-torque gaits characteristic of energy-efficient locomotion.

Figure 11: Power constraint under different penalty coefficients (0.0001 vs. 0.1). As the penalty increases, the co-designed morphologies transition toward shorter, thicker, and more symmetric limbs.

Table 9: Power-constraint setting: performance and power penalties under different penalty coefficients.

Penalty Coef. | Performance (Ours) | Performance (BodyGen) | Power Penalty (Ours) | Power Penalty (BodyGen)
0.0001 | 11047.90 ± 126.20 | 9098.72 ± 558.26 | 5631.42 ± 674.03 | 2745.84 ± 284.55
0.001 | 10191.15 ± 371.81 | 7501.61 ± 671.33 | 5342.16 ± 584.25 | 5948.16 ± 254.84
0.01 | 9853.19 ± 229.37 | 8304.00 ± 497.56 | 1582.48 ± 697.61 | 468.12 ± 516.33
0.1 | 10585.25 ± 146.80 | 8974.48 ± 574.29 | 25.50 ± 22.27 | 26.34 ± 23.64

Manufacturability Constraint. A manufacturability penalty is applied by incorporating two components into the leader objective: structural complexity, measured by the number of body elements, and material cost, defined as the total mass. Table 10 summarizes the resulting performance and morphology characteristics under different penalty coefficients. The trends are consistent with those observed in the power-constraint experiments: our method consistently achieves better reward–cost tradeoffs across all penalty levels. As shown in Figure 12, the generated morphologies are more compact than the unconstrained structures, with fewer distal branches, shorter limbs, and mass concentrated near the root. These structures exhibit lower inertia and more efficient force transmission, supporting stable forward locomotion under cost constraints.

Figure 12: Manufacturability constraint under different penalty coefficients (0: 18 bodies / 1.73 kg; 1: 15 bodies / 1.56 kg; 10: 8 bodies / 0.99 kg). Higher penalties on structural complexity and material mass encourage designs with fewer body elements, reduced branching, and mass concentrated near the root, producing compact morphologies that are easier to fabricate.

Table 10: Manufacturability constraint setting: performance, morphology complexity, and material cost under different penalty levels.

Penalty Coef. | Performance (Ours) | Performance (BodyGen) | Complexity (Ours) | Complexity (BodyGen) | Material Cost (Ours) | Material Cost (BodyGen)
0 | 11047.90 ± 126.20 | 9098.72 ± 558.26 | 16.40 ± 2.45 | 13.67 ± 2.08 | 1.71 ± 0.23 | 1.57 ± 0.32
1 | 7892.93 ± 349.84 | 6531.37 ± 437.26 | 13.67 ± 2.03 | 9.67 ± 1.61 | 1.59 ± 0.18 | 1.32 ± 0.16
10 | 6825.47 ± 303.09 | 5372.10 ± 364.79 | 8.25 ± 0.91 | 7.50 ± 1.24 | 0.94 ± 0.08 | 0.93 ± 0.10

Torque Limits Constraint. A torque-limit penalty is incorporated by enforcing a 50 N·m cap on all joints and adding a proportional violation cost to the leader objective. Table 11 summarizes the quantitative results, and the morphological effects are shown in Figure 13. As in the manufacturability and control-effort settings, our method achieves stronger reward–cost tradeoffs when the controller retains sufficient expressiveness (penalty = 0.01). Under the stronger penalty (0.1), the tightened actuation constraints reduce the feasible morphology space for all methods, narrowing the performance gap.

Figure 13: Torque limits constraint under different penalty coefficients (0, 0.01, 0.1). Tighter actuation limits lead to noticeably simpler and more compact structures, with shorter limbs and reduced distal branching.

Table 11: Torque limits constraint: performance and torque-violation penalties under different torque-penalty coefficients.

Penalty Coef. | Performance (Ours) | Performance (BodyGen) | Violation Penalty (Ours) | Violation Penalty (BodyGen)
0.01 | 7893.42 ± 84.62 | 6311.75 ± 98.31 | 20210.50 ± 6503.22 | 11350.45 ± 5338.31
0.1 | 3133.01 ± 70.44 | 3121.80 ± 54.03 | 1106.45 ± 64.69 | 899.35 ± 49.40

Payload Constraint. To evaluate the agent's ability to maintain locomotion under additional load, we attach an extra mass to the root link as a payload. During training, the payload value is randomized within a fixed range (0–0.6 kg) to promote generalization. After training, we evaluate each method under three fixed payload levels (0.2 kg, 0.4 kg, 0.6 kg).
As shown in Table 12, Stackelberg PPO consistently maintains higher forward progress across all payload settings. Figure 14 further compares morphologies trained with and without payload. Under load, the evolved structures become more symmetric and better support the additional mass, indicating that Stackelberg PPO adapts the topology itself rather than relying solely on controller compensation.

Figure 14: Morphology comparison trained with and without payload. Payload induces more symmetric and load-supporting structures.

Table 12: Payload constraint: performance comparison under different payload weights.

Payload Weight | Stackelberg PPO (Ours) | BodyGen
0.2 kg | 8675.08 ± 286.42 | 5186.31 ± 659.87
0.4 kg | 6966.34 ± 473.43 | 5116.54 ± 546.55
0.6 kg | 7347.46 ± 478.01 | 4523.89 ± 536.99

Robustness Evaluation. We evaluate robustness under two settings: random external forces applied to the root body at every control step, and terrain friction noise created by randomly varying the ground's friction in each episode. For each disturbance level, all policies are tested across multiple stochastic rollouts, and we report the resulting forward-progress reward. Tables 13 and 14 summarize the results. Across all disturbance magnitudes, Stackelberg PPO consistently demonstrates substantially higher robustness. For example, when external forces increase from 2 N to 6 N, performance decreases by only 5.91% for Stackelberg PPO, compared to a much larger 59.57% decline for BodyGen. A similar pattern holds under terrain friction noise. These improvements arise primarily from more symmetric, mechanically balanced morphologies that better tolerate external forces and friction variability.

Table 13: Robustness evaluation: performance under different levels of external disturbance forces.
Level | Stackelberg PPO (Ours) | BodyGen
2.0 N | 11557.31 ± 124.68 | 6963.05 ± 450.48
4.0 N | 11290.13 ± 164.54 | 4621.82 ± 597.71
6.0 N | 10875.23 ± 250.97 | 2816.16 ± 857.83

Table 14: Robustness evaluation: performance under different levels of terrain friction noise.

Level | Stackelberg PPO (Ours) | BodyGen
30% | 11424.66 ± 112.08 | 7326.04 ± 421.73
50% | 11333.43 ± 141.42 | 6795.55 ± 493.31
70% | 10892.09 ± 149.72 | 5062.85 ± 579.66

E.7 DISCUSSION AND EXTENDED EVALUATION ON REALISTIC CO-DESIGN CHALLENGES

In this section, we present broader analyses of morphology–control co-design and extend our results along four representative challenge dimensions: (i) diverse co-design environments, (ii) multi-objective and role-specific rewards, (iii) robustness and generalization under unseen disturbances, and (iv) the use of morphology priors. These studies highlight both the empirical advantages of Stackelberg PPO and the conceptual benefits of explicitly decoupling structure design from control learning. Together, they demonstrate that our framework scales naturally to more complex co-design settings that better reflect real-world robotic demands, and they point toward promising directions for building more adaptive and physically grounded morphology–control systems.

Diverse co-design environments. Standard co-design benchmarks focus almost exclusively on flat-terrain locomotion, which poses limited structural or behavioral challenge. To expose a broader range of morphology–control interactions, we introduce more demanding environments, most notably difficult terrain and manipulation, that require non-periodic motions, contact management, and functional differentiation across limbs. In the Stepper environments, agents must coordinate structure and control to handle large discontinuities without exteroceptive sensing.
On low stairs, they develop stable stepping and small hops; on high stairs, the difficulty induces long-range, high-amplitude jumping behaviors. These emergent solutions reflect the stronger morphological and dynamical adaptation required by complex terrain. In the Pusher task, co-design must jointly support locomotion and precise force application. Learned morphologies exhibit clear role specialization: some limbs provide acceleration and stability, while others regulate contact orientation and apply controlled pushing forces. Baseline methods typically recover only the locomotion component, relying on collision-based propulsion. These environments reveal aspects of the co-design problem that flat locomotion cannot capture, and they demonstrate that Stackelberg PPO scales to richer settings requiring terrain adaptation, contact reasoning, and multi-role morphology design.

Multi-objective and role-specific reward design. As shown earlier in Appendix E.6, our framework naturally accommodates additional objectives such as power consumption or payload capacity. The resulting morphologies and controllers smoothly adapt to the trade-offs introduced by these objectives, validating the method's multi-objective co-design capability. Furthermore, the leader–follower decomposition allows reward terms to be assigned selectively to the structure-design or control-learning stages. For example, complexity or material-cost penalties can be applied only to the leader (structure) updates, enabling constraints on morphology without interfering with controller learning. This role-specific reward routing provides a high degree of flexibility for real-world design requirements.
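A minimal sketch of this role-specific routing (the term names and the additive aggregation are illustrative assumptions, not the paper's exact reward interface):

```python
def route_rewards(terms, leader_only=("complexity_penalty", "material_cost")):
    """Split reward terms between the two Stackelberg roles.

    The leader (structure) objective sees every term, while terms listed
    in `leader_only` are withheld from the follower (controller)
    objective, so design penalties cannot interfere with control
    learning. `terms` maps a term name to its scalar contribution.
    """
    leader_reward = sum(terms.values())
    follower_reward = sum(v for k, v in terms.items() if k not in leader_only)
    return leader_reward, follower_reward
```

For instance, with a forward-progress term of 10.0, an effort penalty of -0.5, and a material-cost penalty of -2.0, the leader optimizes 7.5 while the follower optimizes 9.5, leaving the controller's learning signal untouched by the design penalty.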
Robustness and generalization under unseen disturbances. Although our current setting does not include exteroceptive sensing and is not intended for zero-shot transfer to arbitrary unseen worlds, we evaluate generalization and robustness on an obstacle-navigation task not seen during training. Policies are trained only on flat terrain (Crawler task) and then tested in environments containing either sparse or dense grids of square obstacles. As reported in Table 15, Stackelberg PPO obtains higher forward progress than BodyGen across both difficulty levels. The visualization in Figure 15 further shows that the morphologies produced by our method maintain more consistent forward motion, whereas baseline agents more frequently stall or deviate under unexpected contacts. These results illustrate that the co-designed morphology–policy pair exhibits meaningful robustness to previously unseen disturbances and obstacle interactions.

Figure 15: Visualization of the unseen obstacle-navigation task environments (panels: Dense Obstacle Navigation; Sparse Obstacle Navigation).

Table 15: Performance in the unseen obstacle-navigation task under two obstacle densities.

Obstacle Type      Spacing                   Stackelberg PPO (ours)    BodyGen
Sparse Obstacle    16 m (~4× robot width)    1790.45 ± 161.77          1061.55 ± 228.40
Dense Obstacle     8 m (~2× robot width)     1698.52 ± 733.02          1007.21 ± 157.42
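The robustness protocol used above (train once in a nominal environment, then evaluate without retraining across a sweep of disturbance levels) can be sketched abstractly. The `evaluate` helper, the toy policy, and the toy environment below are all assumptions for illustration; the actual benchmark environments are physics simulations, not this scalar stand-in.

```python
# Minimal sketch of a fixed-policy robustness sweep: disturbances are
# injected only at evaluation time, never during training. Everything
# here is a toy stand-in, not the paper's benchmark code.
import random


def evaluate(policy, env_step, noise_level, episodes=5, horizon=50, seed=0):
    """Average episodic return of a frozen policy under a disturbance level."""
    rng = random.Random(seed)
    returns = []
    for _ in range(episodes):
        total, obs = 0.0, 0.0
        for _ in range(horizon):
            # The policy is unchanged; only the disturbance magnitude varies.
            action = policy(obs) + rng.gauss(0.0, noise_level)
            obs, reward = env_step(obs, action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


def policy(obs):
    # Toy "task": keep the effective action at zero.
    return 0.0


def env_step(obs, action):
    # Reward strictly decreases with disturbance magnitude.
    return obs, -abs(action)


sweep = {level: evaluate(policy, env_step, level) for level in (0.0, 0.1, 0.3)}
```

With this construction the return at zero noise is exactly 0.0 and degrades monotonically in expectation as the disturbance grows, mirroring the trend in Tables 13-15.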
Stackelberg PPO consistently obtains higher final reward and requires fewer steps under both the "with prior" and "without prior" conditions. Figure 16 visualizes representative morphologies produced under this setup. While priors accelerate training, it is generally advisable to choose priors that encode broadly useful structural patterns, such as stable support geometries or balanced limb arrangements, rather than narrowly specialized solutions. Such general-purpose priors provide a more flexible foundation for downstream adaptation and reduce the risk of over-constraining the design space.

Figure 16: Cross-task reuse of morphology priors: Crawler prior (left) and the resulting Pusher morphology (right).

Table 16: Performance and sample efficiency in the Pusher task with and without morphology priors.

                 Performance                                      Steps to Threshold (2500 reward)
Condition        Stackelberg PPO (ours)    BodyGen                Stackelberg PPO (ours)    BodyGen
With Prior       4822.59 ± 114.32          4575.52 ± 112.78       ~8 M                      ~9 M
Without Prior    3462.77 ± 368.09          2779.95 ± 509.18       ~32 M                     ~44 M
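A warm-start from a morphology prior, as in the Crawler-to-Pusher transfer above, can be sketched as follows. The flat parameter-vector encoding, the `init_morphology` helper, and the small-jitter perturbation are assumptions made for illustration; the paper's actual morphology representation is structured, not a flat vector.

```python
# Hedged sketch of initializing co-design from a morphology prior versus
# from scratch. The encoding and helper names are hypothetical.
import random


def init_morphology(n_params, prior=None, jitter=0.05, seed=0):
    """Start from a prior design (plus small jitter) or from scratch."""
    rng = random.Random(seed)
    if prior is not None:
        # Perturb the prior slightly so downstream adaptation can still
        # move away from a narrowly specialized structure.
        return [p + rng.gauss(0.0, jitter) for p in prior]
    # No prior: sample an uninformed initial design.
    return [rng.uniform(-1.0, 1.0) for _ in range(n_params)]


crawler_prior = [0.3, -0.1, 0.8, 0.5]  # e.g. limb lengths / joint parameters
pusher_init = init_morphology(4, prior=crawler_prior)   # "with prior"
scratch_init = init_morphology(4)                       # "without prior"
```

The jitter term reflects the recommendation above: a general-purpose prior should seed, not fix, the downstream design.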