Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control


Authors: Zelin Tao, Zeran Su, Peiran Liu

X-Humanoid Heracles Project Team

Abstract

Achieving general-purpose humanoid control requires a delicate balance between the precise execution of commanded motions and the flexible, anthropomorphic adaptability needed to recover from unpredictable environmental perturbations. Current general controllers predominantly formulate motion control as a rigid reference-tracking problem. While effective in nominal conditions, these trackers often exhibit brittle, non-anthropomorphic failure modes under severe disturbances, lacking the generative adaptability inherent to human motor control. To overcome this limitation, we propose Heracles, a novel state-conditioned diffusion middleware that bridges precise motion tracking and generative synthesis. Rather than relying on rigid tracking paradigms or complex explicit mode-switching, Heracles operates as an intermediary layer between high-level reference motions and low-level physics trackers. By conditioning on the robot's real-time state, the diffusion model implicitly adapts its behavior: it approximates an identity map when the state closely aligns with the reference, preserving zero-shot tracking fidelity. Conversely, when encountering significant state deviations, it seamlessly transitions into a generative synthesizer to produce natural, anthropomorphic recovery trajectories. Our framework demonstrates that integrating generative priors into the control loop not only significantly enhances robustness against extreme perturbations but also elevates humanoid control from a rigid tracking paradigm to an open-ended, generative general-purpose architecture.

1. Introduction

Humanoid robots are rapidly transitioning from structured laboratory settings to complex, unstructured real-world environments.
This shift demands general-purpose control architectures capable of executing precise, goal-directed motions while maintaining the flexible, anthropomorphic resilience characteristic of human motor control. In biological systems, motor behavior is rarely a rigid execution of a predefined plan; rather, humans seamlessly blend exact task execution with intuitive, generative recovery when faced with unexpected physical disturbances. Emulating this dual capability remains a formidable challenge in modern robotics.

The prevailing paradigm for general-purpose humanoid control relies heavily on reference-driven tracking. Recent advancements formulate motion control primarily as a problem of minimizing the kinematic deviation between the robot's current state and a provided reference trajectory. Powered by deep reinforcement learning, these tracking-based controllers He et al. (2025); Luo et al. (2025); Wang et al. (2026); Yin et al. (2025), such as those mimicking reference motions or utilizing universal trackers, excel in nominal conditions. They enable humanoids to faithfully replicate highly diverse sets of motion capture (MoCap) data, achieving impressive zero-shot execution for an array of highly agile and dynamic skills.

However, formulating general control strictly as a rigid tracking objective introduces a critical vulnerability: catastrophic and non-anthropomorphic failure modes under severe environmental perturbations. When a robot is pushed far from its reference trajectory, a pure tracker blindly attempts to minimize the immediate state error, often resulting in rigid, physically infeasible joint torques that lead to unnatural and unrecoverable falls. Conversely, entirely dropping the tracking objective in favor of pure generative models Li et al. (2026); Zeng et al.
(2025), such as end-to-end behavior cloning, yields more natural, human-like movements but fundamentally sacrifices the spatial and temporal precision required for strict task execution. A significant gap remains: bridging the exactitude of precise trackers with the generative adaptability of robust human recovery.

© 2026 X-Humanoid. All rights reserved.

Figure 1: Heracles synthesizes diverse, anthropomorphic recovery motions via state-conditioned diffusion. In contrast to recovery policies trained with termination tricks that converge to a limited set of stereotypical maneuvers, Heracles leverages its generative middleware to produce a rich repertoire of agile, human-like recovery behaviors, enabling more general and robust responses across a wide range of extreme perturbations.

To overcome this fundamental limitation, we propose Heracles, a novel state-conditioned diffusion middleware designed to bridge precise motion tracking and generative synthesis (Fig. 1). Rather than engineering a monolithic end-to-end controller or relying on complex, explicit state machines to switch between tracking and recovery modes, Heracles operates elegantly as an intermediary layer. Situated between the high-level original motion commands and the low-level physical execution policy, it injects powerful generative priors directly into the control loop without disrupting the high-frequency physics execution.

The core innovation of Heracles lies in its implicit, state-driven adaptability. By tightly conditioning the diffusion process on the robot's real-time physical state, the model dynamically shapes its output without explicit heuristics.
When the robot's state closely aligns with the reference command, the diffusion process approximates an identity map, allowing the commands to pass through with near-zero modification to preserve strict tracking fidelity. Conversely, when encountering significant state deviations—such as a severe push or an imminent fall—the model seamlessly transitions into a generative synthesizer. It synthesizes natural, anthropomorphic recovery trajectories that guide the underlying physics tracker back to stability. This mechanism not only enhances physical robustness but elevates humanoid control from a rigid tracking paradigm to an open-ended, generative architecture.

In summary, the primary contributions of this work are three-fold:

• A Generative Control Middleware Paradigm: We introduce Heracles, a novel state-conditioned diffusion middleware that uniquely bridges the precision of motion tracking with the flexibility of generative synthesis, effectively decoupling high-level intent generation from low-level physical execution.

• Enhanced Architecture for General-Purpose Control: We improve the underlying physics tracker and the overall control framework to better serve general-purpose tasks. This optimized architecture ensures the retention of high-fidelity tracking characteristics while seamlessly integrating with the generative priors from the middleware.

• Robust Anthropomorphic Recovery and Motion Generalization: We successfully deploy the proposed framework on physical humanoid robots. Extensive hardware experiments demonstrate emergent, human-like recovery behaviors and robust generative adaptability under severe, out-of-distribution physical disturbances.

2. Related Work

2.1. General Humanoid Motion Controller

Recent advances in deep reinforcement learning catalyze the development of general-purpose controllers for high-degree-of-freedom humanoid robots. These architectures aim to provide a unified execution policy for diverse motor behaviors, predominantly dividing into mimic-based reference trackers and unsupervised reinforcement learning (URL) methods.

Mimic-Based Motion Trackers. A dominant paradigm formulates motor behavior execution as a high-fidelity motion imitation task. Pioneered by DeepMimic Peng et al. (2018), which demonstrated that deep reinforcement learning can train physics-based characters to closely imitate reference motion clips, this paradigm has since been scaled to general-purpose humanoid control. Foundational large-scale frameworks, including the Generalized Motion Tracker (GMT) Chen et al. (2025), demonstrate that adaptive sampling and mixture-of-experts (MoE) architectures enable humanoids to track diverse motions via a single unified policy. Extending this paradigm, recent efforts structurally scale these models. SONIC Luo et al. (2025) expands network capacity and datasets to over 100 million frames, establishing motion tracking as a robust, scalable foundation task. To break the representation bottleneck in multi-motion RL optimization, OmniXtreme Wang et al. (2026) introduces a flow-matching policy to decouple general motor skill learning from sim-to-real physical refinement. Beyond pure kinematic tracking, researchers continuously enhance the versatility and stability of these frameworks. HOVER He et al. (2025) employs multi-mode policy distillation to consolidate specific control modes—specifically navigation and loco-manipulation—into a unified controller. Concurrently, AMS Pan et al. (2026) proposes a hybrid reward scheme combining agile human MoCap data with physically constrained synthetic balance motions.
More recently, BeyondMimic Liao et al. (2025) integrates a guided diffusion mechanism directly into the tracking formulation, allowing the system to solve diverse downstream tasks via classifier guidance. Beyond pure kinematics, Sun et al. (2024) pioneer the integration of large language models with adversarial imitation learning for zero-shot task execution through quantized skill representations. RoboGhost Li et al. (2026) integrates language-grounded motion latents from a motion generator with reinforcement learning for a retarget-free whole-body controller. OmniRetarget Yang et al. (2025) incorporates human-scene interaction (HSI) and human-object interaction (HOI) constraints into the motion retargeting pipeline, improving the physical plausibility of transferred motions. Along similar lines, recent works have further advanced humanoid parkour-style locomotion Zhu et al. (2026); Zhuang et al. (2026) and contact-rich interaction tasks Lin et al. (2026). MeshMimic Zhang et al. (2026) further advances the paradigm by incorporating 3D scene reconstruction from monocular video, enabling humanoid robots to learn coupled motion-terrain interactions on complex, non-flat terrains. Despite these structural advancements, tracking-first controllers remain fundamentally anchored to their reference trajectories and typically require computationally expensive test-time guidance for generation. Under severe out-of-distribution physical perturbations, relying purely on the tracking formulation predominantly results in rigid, physically infeasible joint torques rather than seamless, real-time anthropomorphic recovery.

Unsupervised RL and Skill Discovery.
Conversely, unsupervised reinforcement learning (URL) and the emerging paradigm of Behavioral Foundation Models (BFMs) seek to equip robots with a diverse repertoire of motor skills devoid of explicit reference trajectories. Recent extensive surveys on BFMs Yuan et al. (2025) delineate a clear trajectory toward leveraging large-scale pre-training to capture broad behavioral priors. Zhang et al. (2024) extended this paradigm to full-size humanoid robots by developing an adversarial motion prior framework that achieves human-comparable whole-body locomotion performance. The foundational BFM framework Zeng et al. (2025) utilizes masked online distillation alongside Conditional Variational Autoencoders (CVAEs) to model behavioral distributions flexibly from unstructured data. Pushing the boundaries of autonomous exploration, cutting-edge URL approaches establish entirely reference-free policies. BFM-Zero Li et al. (2026) leverages unsupervised RL and Forward-Backward (FB) representations to create an objective-centric, promptable latent space, enabling a single generalist policy to perform zero-shot tasks and reward inference seamlessly in the real world. Several recent works Chen et al. (2026); Luo et al. (2024); Wang et al. (2025); Yu et al. (2025); Zhang et al. (2026) leverage motion tracking as a foundational mechanism to acquire human athletic skills, enabling humanoid robots to perform highly dynamic ball sports. Because these URL-driven policies and foundation models explore the state-action space unconstrained by rigid references, they inherently exhibit remarkable physical compliance and naturalistic robustness when perturbed. However, mapping these autonomously discovered, unconstrained latent spaces to high-precision, strict-fidelity spatial tracking tasks remains a formidable challenge.
They fundamentally struggle to achieve the exactitude characteristic of dedicated mimic controllers in complex, dynamic execution scenarios.

2.2. Motion Generation

Synthesizing diverse, naturalistic human movements represents a foundational pursuit within computer animation. Motion-X Lin et al. (2023) provides a large-scale multi-modal human motion dataset comprising over 81K motion sequences with unified whole-body annotations spanning face, hands, and body, while its successor Motion-X++ Zhang et al. (2025) further extends the scale and diversity by incorporating additional motion sources and richer semantic labels to support more comprehensive whole-body motion generation and understanding. Early paradigm shifts leveraged score-based generative models, with the Human Motion Diffusion Model (MDM) Tevet et al. (2023) establishing a robust transformer-based baseline for text-driven kinematic synthesis. Subsequently, researchers integrated motion generation into the Large Language Model (LLM) ecosystem. MotionGPT Jiang et al. (2023) first demonstrated that treating human motion as a foreign language and unifying motion-text tasks via discrete tokenization yields strong multi-task performance. Building upon this insight, MotionGPT-2 Wang et al. (2024) further quantizes multimodal inputs—including text and single-frame poses—into LLM-interpretable tokens for unified generation and understanding. Meanwhile, the MotionMillion framework Fan et al. (2025) demonstrates that million-scale high-quality datasets coupled with autoregressive architectures unlock unprecedented zero-shot capabilities. Expanding modality fusion, OmniMotion Li et al. (2025) utilizes a continuous masked autoregressive transformer to seamlessly integrate text, speech, and music into a cohesive whole-body generation mechanism. Pushing model capacity limits, HY-Motion 1.0 Wen et al.
(2025) successfully scales diffusion transformer-based flow matching models to the billion-parameter regime, yielding instruction-following digital animations with unparalleled fidelity. Despite achieving remarkable anthropomorphism and diversity, these generative foundation models fundamentally operate in an open-loop, purely kinematic domain. They synthesize trajectories comprising joint angles and global translations completely devoid of physical constraints. Implementing these raw kinematic outputs directly on a physical humanoid invariably triggers dynamics mismatches, leading to instability and falls, because the generative process ignores crucial embodied parameters including joint torque limits, contact friction, and center-of-mass dynamics.

2.3. Hybrid Architectures for Motion Tracking and Synthesis

To overcome the inherent limitations of isolated physical trackers and open-loop generative models, recent research actively constructs hybrid architectures integrating generative synthesis with tracking objectives. Within the kinematic domain, frameworks attempt to constrain generation via tracking formulations; COMET Lee et al. (2025) employs a conditional VAE framework with a reference-guided feedback loop to prevent long-term motion degradation, while MotionStreamer Xiao et al. (2025) and DART Zhao et al. (2024) enforce sequential synthesis driven by rigorous spatial constraints. Transitioning to physically simulated characters, building upon the foundational reference-tracking paradigm of DeepMimic Peng et al. (2018), Adversarial Skill Embeddings (ASE) Peng et al. (2022) constructs continuous latent spaces from large-scale unstructured motion data, enabling characters to maintain highly anthropomorphic behaviors during diverse downstream tasks.
Advancing this trajectory, the Versatile Motion Priors (VMP) framework Serifi et al. (2024) optimizes the robust control of physical characters by distilling multipurpose motion priors through a two-stage variational approach, significantly enhancing both reference trajectory tracking and resilience against external perturbations. Recent investigations further deepen this paradigm: AMOR Alegre et al. (2025) proposes multi-objective reinforcement learning to train weight-conditioned policies spanning the Pareto front of reward trade-offs, while adversarial differential discriminators Zhang et al. (2025) eliminate the need for manually designed reward functions in physics-based motion imitation. In the multi-agent competitive domain, RoboStriker Yin et al. (2026) constructs a hierarchical framework that decouples high-level strategic reasoning from low-level physical execution via topologically constrained latent manifolds, demonstrating emergent boxing behaviors with sim-to-real transfer. However, these hybrid control paradigms retain structural deficiencies when confronting extreme, out-of-distribution physical disturbances. They typically couple high-level generative priors with low-level execution policies loosely, preventing high-frequency physical state deviations from reshaping the generative target in real time. Consequently, when encountering severe imbalance, the system fails to transition implicitly from nominal tracking to generative recovery, often reverting to rigid, explicit state-switching mechanisms. Our proposed Heracles framework resolves this bottleneck through a state-conditioned diffusion middleware that dynamically modulates the generative output based on real-time state deviations, achieving a seamless unification of precise zero-shot tracking and anthropomorphic generative recovery within a closed control loop.

3. Method

3.1. System Overview and Problem Formulation

General humanoid control requires executing desired motion commands while maintaining physical balance against unpredictable environmental disturbances. We formulate this dual objective—high-fidelity motion tracking and robust physical recovery—within a hierarchical control architecture. The proposed framework, Heracles, intrinsically decouples high-level intent generation from high-frequency physical execution (Fig. 2). It comprises two primary components: a state-conditioned generative middleware and a low-level, general-purpose physics tracking policy.

In standard tracking paradigms Chen et al. (2025); Luo et al. (2025); Peng et al. (2018), a policy minimizes the kinematic deviation between the robot's real-time proprioceptive state p_t and a reference motion command m_t. When the state remains close to the reference manifold, directly tracking m_t yields optimal zero-shot performance. However, when severe physical perturbations push the state into out-of-distribution (OOD) regions, forcing strict adherence to m_t produces rigid, myopic corrective actions that lack the coordinated multi-step reasoning required for physical recovery, invariably precipitating catastrophic failure.

Figure 2: Overview of the Heracles framework. (a) A flow matching model \hat{D}_\theta learns to synthesize feasible keyframe trajectories conditioned on the current state. (b) Reference motions are quantized into discrete tokens via FSQ, shared by reconstruction and action prediction heads. (c) At inference, the middleware generates trajectories through closed-loop replanning for the motion tracker to execute.
To resolve this limitation, we formulate the core problem as learning an intermediary mapping that dynamically modulates the reference command based on real-time physical feasibility. The generative middleware functions as this state-conditioned trajectory synthesizer. Operating at a lower planning frequency, it observes the current state p_t and the original reference m_t, predicting a short-horizon, dynamically feasible keyframe trajectory:

    \tau_t = f_\theta(p_t, m_t) \in \mathbb{R}^{K \times D},    (1)

where K denotes the number of keyframes and D represents the state dimension. This mapping design unifies nominal tracking and OOD recovery: when the robot operates near the reference manifold, f_\theta approximates an identity transformation to preserve high-fidelity tracking; conversely, under large disturbances, f_\theta synthesizes entirely new, physically feasible transition trajectories that guide the robot back toward the reference manifold. The synthesized keyframes \tau_t are subsequently densified and written into a reference buffer consumed by the low-level physics tracker.

We model this continuous tracking process as a discounted Markov decision process (MDP) defined by the tuple \mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma). At each high-frequency control step, the tracking policy \pi receives an observation:

    o_t = \{p_t, m'_t, z_d\},    (2)

where m'_t denotes the modulated reference commands sampled from the densified \tau_t, and z_d constitutes a high-level motion embedding (detailed in Sec. 3.3). The policy outputs joint-level actions a_t \in \mathcal{A} to maximize the expected discounted return:

    J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right],    (3)

where r_t represents a task reward composed of tracking precision and physical regularization terms. Deployment follows a receding-horizon replanning loop.
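Concretely, the middleware replans at a low frequency while the tracker consumes a densified reference buffer at every control step. The following is a minimal, toy sketch of that loop; the linear `middleware` and the one-line "tracker" merely stand in for the learned components, and all names, horizons, and frequencies are illustrative rather than taken from the paper:

```python
import numpy as np

K, D = 8, 35       # keyframes per plan, state dimension (assumed values)
N_EXEC = 20        # control steps between middleware replans
DENSIFY = 5        # dense reference samples per keyframe interval

def middleware(p_t, m_t):
    """Hypothetical stand-in for f_theta: keyframes that simply
    interpolate from the current state toward the command."""
    alphas = np.linspace(0.0, 1.0, K)[:, None]        # (K, 1)
    return (1.0 - alphas) * p_t + alphas * m_t        # (K, D)

def densify(tau):
    """Linear densification placeholder (the paper uses cubic splines
    for joints and slerp for root orientations)."""
    xs = np.linspace(0.0, K - 1, K * DENSIFY)
    return np.stack(
        [np.interp(xs, np.arange(K), tau[:, d]) for d in range(D)], axis=1
    )

def control_loop(p_t, m_t, n_steps):
    """Receding-horizon loop: replan every N_EXEC steps, track densely."""
    buffer, executed = None, 0
    for step in range(n_steps):
        if step % N_EXEC == 0:                 # low-frequency replanning
            buffer = densify(middleware(p_t, m_t))
        ref = buffer[min(step % N_EXEC, len(buffer) - 1)]
        p_t = p_t + 0.5 * (ref - p_t)          # toy "tracker" pulls state to ref
        executed += 1
    return p_t, executed
```

The point of the sketch is the frequency separation: the expensive generative call runs once per N_EXEC steps, while the dense buffer is consumed at the full control rate.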
Every N_exec control steps, the middleware updates \tau_t from the latest proprioceptive observation, while the tracking policy executes the dense trajectory at the fundamental control frequency, forming a seamless, closed-loop tracking-generation architecture.

3.2. State-Conditioned Generative Middleware via Flow Matching

We formulate the trajectory generation process as a conditional flow matching problem over a geometrically constrained residual space, bridging exact motion tracking and generative synthesis without relying on explicit mode-switching heuristics.

Geometric Residual Parameterization. Directly predicting absolute state coordinates wastes model capacity on approximating the identity mapping during near-nominal execution. Instead, we predict residual trajectories relative to the current state. Let \beta_t denote the static baseline anchored at the current proprioceptive state:

    \beta_{t,k} = p_t, \quad \forall k \in \{0, \ldots, K-1\}.    (4)

The middleware predicts only the residual deviation r_t, recovering the final trajectory as:

    \tau_t = \beta_t + r_t.    (5)

Under this parameterization, the residual directly encodes the motion increment from the current state over the planning horizon. When the robot closely follows the reference manifold (p_t ≈ m_t), the synthesized trajectory closely reproduces the original reference motion, and the middleware effectively acts as an identity map on the command signal. Under severe deviations, the model leverages the conditioning gap between p_t and m_t to synthesize recovery trajectories that diverge from the reference in favor of physical plausibility. Crucially, the target command m_t enters exclusively through the conditioning vector, ensuring that the residual prediction target remains independent of the state-command distance.

Continuous Conditional Flow Matching.
To synthesize the complex, multimodal distributions of these recovery residuals, we employ continuous flow matching Lipman et al. (2023). Let x_0 denote the normalized ground-truth residual data and x_1 \sim \mathcal{N}(0, I) represent the prior Gaussian noise. We define a probability path via linear interpolation:

    x_t = (1 - t) x_0 + t x_1, \quad t \in [0, 1].    (6)

The model is trained to regress the underlying continuous vector field by minimizing the velocity-matching objective:

    \mathcal{L}_{\text{vel}} = \mathbb{E}_{t, x_0, x_1}\left[ \| \hat{v}(x_t, t, c_t) - (x_1 - x_0) \|_2^2 \right],    (7)

where c_t = [p_t, m_t] serves as the strict state-conditioning vector. During inference, physically viable recovery trajectories are sampled by integrating the learned velocity field \hat{v} from t = 1 to t = 0 utilizing a minimal number of Euler steps.

Architecture and Kinematic Continuity. The velocity field is parameterized by an AdaLN-modulated Transformer Peebles and Xie (2023), where the conditioning vector c_t and the flow timestep embedding are injected into each block via adaptive shift-scale-gate modulation. To guarantee kinematic continuity between the real-time physical state and the synthesized trajectory, we pin the first residual token via an inpainting constraint during integration:

    x_t[0] = (1 - t_{\text{next}}) r_0 + t_{\text{next}} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),    (8)

where r_0 strictly defines the zero-residual anchor corresponding to the initial state.

Receding-Horizon Planning with Directional Warm Start. A key design principle is that the middleware always predicts a fixed temporal window of motion, regardless of the distance to the target command. The target is set to the current reference frame m_t, and both the temporal window \Delta t and execution interval N_exec are held constant.
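As a minimal numerical sketch of the velocity-matching objective (Eqs. 6-7) and few-step Euler sampling, with `model` an arbitrary callable standing in for the AdaLN-modulated Transformer (shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_loss(model, x0, cond):
    """Monte-Carlo estimate of Eq. 7: sample t and Gaussian x1, build
    x_t = (1 - t) x0 + t x1 (Eq. 6), and regress the constant path
    velocity x1 - x0 with a squared error."""
    t = rng.uniform(size=(x0.shape[0], 1))
    x1 = rng.standard_normal(x0.shape)        # prior Gaussian sample
    xt = (1.0 - t) * x0 + t * x1              # linear probability path
    pred = model(xt, t, cond)
    return float(np.mean(np.sum((pred - (x1 - x0)) ** 2, axis=-1)))

def euler_sample(model, cond, shape, steps=4):
    """Integrate the learned field from t = 1 down to t = 0 with a few
    Euler steps: x <- x - (1/steps) * v_hat(x, t, cond)."""
    x = rng.standard_normal(shape)            # start from the noise prior
    for i in range(steps):
        t = np.full((shape[0], 1), 1.0 - i / steps)
        x = x - (1.0 / steps) * model(x, t, cond)
    return x
```

A model that always outputs zero leaves the Gaussian sample unchanged; a trained field instead transports it onto the residual distribution, which is why a handful of Euler steps suffices at deployment time.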
Whether the robot is closely tracking the reference or operating far from the reference manifold, the generated keyframes consistently encode the next \Delta t seconds of motion. Long-horizon recovery emerges autoregressively through successive replan cycles, rather than requiring the model to reason about the full trajectory in a single pass. During inference, we initialize the ODE solver with a directional motion prior rather than pure Gaussian noise. We construct an initial residual trajectory linearly interpolating toward the target:

    r_k^{\text{init}} = \frac{k}{K - 1}(m_t - p_t), \quad k \in \{0, \ldots, K-1\},    (9)

and begin ODE integration from a partially noised version of this prior at t_start < 1:

    x_{t_{\text{start}}} = (1 - t_{\text{start}})\, \text{normalize}(r^{\text{init}}) + t_{\text{start}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),    (10)

where t_start controls the noise-to-prior ratio. Inspired by SDEdit Meng et al. (2022), this warm start provides the linear prior as an immediate directional estimate, while the learned velocity field refines it into a natural trajectory within a reduced number of ODE steps. The resulting K keyframes are then densified to the tracker's operating frequency via cubic spline interpolation for joint positions and spherical linear interpolation for root orientations, yielding the complete dense reference signal consumed by the tracking policy.

3.3. General-Purpose Physics Tracker

3.3.1. Motion Tracking Formulation

Observation Space. The observation space is formally defined as a composite vector o_t = {p_t, m_t, z_d}, comprising the robot's proprioceptive state p_t, the kinematic motion reference trajectory m_t, and the discrete high-level motion embedding z_d.
Specifically, the proprioceptive state p_t encapsulates the immediate physical condition of the robot:

    p_t = [g_t^{\text{proj}}, \omega_t, q_t - q_0, \dot{q}_t, a_{t-1}],    (11)

where g_t^proj \in \mathbb{R}^3 denotes the gravity vector projected into the local root frame, \omega_t \in \mathbb{R}^3 represents the root angular velocity, q_t \in \mathbb{R}^{29} and \dot{q}_t \in \mathbb{R}^{29} represent the current joint positions and velocities respectively, q_0 defines the default nominal joint configuration, and a_{t-1} \in \mathbb{R}^{29} records the previous action to ensure temporal smoothness. The reference observation m_t provides per-step target kinematics:

    m_t = [v_t^{\text{ref}}, \omega_t^{\text{ref}}, e_t^{\text{root}}, q_t^{\text{ref}}].    (12)

During training, m_t is extracted directly from the motion dataset. At deployment, it is seamlessly replaced by the densified middleware output m'_t without any modification to the tracker architecture. Here, v_t^ref \in \mathbb{R}^3 and \omega_t^ref \in \mathbb{R}^3 denote the reference root linear and angular velocities expressed in the body frame, and q_t^ref \in \mathbb{R}^{29} specifies the target joint positions. The root orientation error e_t^root \in \mathbb{R}^6 is parameterized using a 6D continuous rotation feature (Rot6D), computed from the first two columns of the relative rotation matrix R_root^des R_root^T. The discrete motion token z_d captures temporally coherent, high-level motion semantics and is detailed in the subsequent policy architecture description.

Action Space. The policy \pi outputs target joint positions a_t \in \mathbb{R}^{29}. Each physical joint tracks its respective target using a low-level Proportional-Derivative (PD) controller operating at high frequency, ensuring stable torque generation.

Rewards and Domain Randomization. The reward combines positive tracking terms—covering root velocities, body-link orientations, and joint-position matching—with regularization penalties on action rate, joint-limit violations, and undesired contacts (Tab. 1a).
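Each tracking term in Tab. 1a has the form r_i = w_i exp(-||e_i||^2 / \sigma_i^2). A small sketch using the root-level entries of the table (the error vectors passed in are illustrative, not from the paper):

```python
import numpy as np

# Weight / sigma pairs for a few tracking terms, following Tab. 1a.
TRACKING_TERMS = {
    "root_position":    (0.1, 0.30),
    "root_orientation": (0.5, 0.40),
    "root_lin_vel":     (1.0, 0.50),
    "root_ang_vel":     (1.0, 1.00),
}

def tracking_reward(errors):
    """Sum of exponential tracking terms r_i = w_i * exp(-||e_i||^2 / sigma_i^2).

    `errors` maps each term name to its error vector; perfect tracking
    (all-zero errors) therefore yields exactly the sum of the weights."""
    total = 0.0
    for name, (w, sigma) in TRACKING_TERMS.items():
        e = np.asarray(errors[name], dtype=float)
        total += w * np.exp(-float(e @ e) / sigma ** 2)
    return total
```

With these four terms, perfect tracking returns 0.1 + 0.5 + 1.0 + 1.0 = 2.6, and each term decays smoothly as its error grows, so the reward stays informative even far from the reference.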
To ensure sim-to-real transferability, we inject comprehensive domain randomization during training, perturbing both the simulator's physical properties (friction, center-of-mass offsets) and the command-level kinematic targets (Tab. 1b).

Table 1: Training configuration for the general-purpose physics tracker. Left: reward function with exponential tracking terms and regularization penalties. Right: domain randomization ranges for command-level, physical, and external disturbance parameters.

(a) Reward function. Tracking: r_i = w_i exp(-||e_i||^2 / \sigma_i^2).

    Tracked Quantity         w      \sigma
    Root Position            0.1    0.30
    Root Orientation         0.5    0.40
    Root Linear Vel.         1.0    0.50
    Root Angular Vel.        1.0    1.00
    Rel. Body Position       1.0    0.30
    Rel. Body Orientation    1.0    0.40
    Body Linear Vel.         1.0    1.00
    Body Angular Vel.        1.0    \sqrt{\pi}

    Regularization        w       Penalty
    Action Rate           -0.1    ||a_t - a_{t-1}||^2
    Joint Limit           -10     \sum max(0, q - q_lim)
    Undesired Contacts    -0.1    \sum 1(F_c > 1)

(b) Domain randomization ranges.

    Parameter                   Range
    Command-level perturbations
    Target Joint Pos. (rad)     ±0.01
    Target Ang. Vel. (rad/s)    ±0.2
    Target Root Rot. (rad)      ±0.05
    Default Joint Pos. (rad)    ±0.01
    Physical properties
    CoM Offset (m)              x: ±0.5; y, z: ±0.1
    Static Friction             [0.3, 1.6]
    Dynamic Friction            [0.3, 1.2]
    External disturbances
    Push Frequency (s)          1.0 - 3.0
    Push Lin. Vel. (m/s)        xy: ±0.5; z: ±0.2
    Push Ang. Vel. (rad/s)      RP: ±0.52; Y: ±0.78

3.3.2. General Motion Tracking Policy

Our tracking policy is built on a shared motion-latent representation with an encoder-quantizer-decoder structure. A motion encoder maps the kinematic observations m_t to a continuous latent z_c, which is then discretized into tokenized codes z_d.
Two parallel heads consume z_d: a reconstruction decoder for representation learning, and an action decoder that fuses z_d with the proprioceptive state p_t to produce control actions.

Improved Discrete Quantization. We adopt an improved Finite Scalar Quantization (iFSQ) Lin et al. (2026) to distill high-frequency kinematic signals into compact semantic tokens. Given continuous latent features z_c ∈ R^{N×d}, with N the batch size and d the embedding dimension, each channel is bounded to [−1, 1] and quantized into L = 2K + 1 uniformly spaced levels, where the odd level count guarantees an exact zero state. Element-wise quantization maps each dimension to an integer index z_d ∈ {0, …, L − 1} via:

z_d = round( (L − 1)/2 · (f(z_c) + 1) ).  (13)

Rather than the standard tanh bounding, we employ a sigmoid-based mapping that improves bin utilization while preserving uniform quantization intervals:

f(x) = 2.0 σ(1.6 x) − 1.  (14)

We apply the straight-through estimator (stop-gradient) to the rounding operation, yielding the discrete tokens z_d injected into the policy observation.

Encoder–Decoder Architecture. The motion encoder ingests a 10-frame future window M_{t:t+9} and produces a continuous embedding z_c ∈ R^d. After iFSQ discretization, the reconstruction decoder maps z_d back to the full 10-frame motion sequence M̂_{t:t+9}, optimized via:

ℒ_rec = (1/10) Σ_{k=0}^{9} ‖ m̂_{t+k} − m_{t+k} ‖²₂.  (15)

The action decoder concatenates z_d with a 10-step proprioceptive history P_{t−9:t} to produce the control action a_t.

Adaptive Motion Sampling. Training a unified policy over a large, heterogeneous motion corpus introduces severe optimization imbalance: uniform sampling overfits to trivial locomotion while underperforming on dynamic, agile skills.
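The iFSQ mapping of Eqs. (13)-(14) is compact enough to state directly in code. The sketch below shows only the NumPy forward pass, with the straight-through estimator indicated in a comment; the level count K = 4 is an assumed value for illustration, since the text does not fix it here:

```python
import numpy as np

def ifsq_quantize(z_c, K=4):
    """Improved FSQ forward pass (Eqs. 13-14).
    L = 2K + 1 levels per channel; the odd count guarantees an exact
    zero level. Returns integer indices in {0, ..., L-1} together with
    the bounded pre-quantization values."""
    L = 2 * K + 1
    f = 2.0 / (1.0 + np.exp(-1.6 * z_c)) - 1.0   # Eq. (14): sigmoid bounding to (-1, 1)
    z_d = np.round((L - 1) / 2 * (f + 1.0))      # Eq. (13): uniform rounding
    # Training would apply the straight-through estimator, e.g. in PyTorch:
    #   z_q = f + (dequantize(z_d) - f).detach()
    return z_d.astype(int), f
```

An input of exactly zero lands on the center level K, which is what the "exact zero state" guarantee refers to.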
Inspired by the bin-level adaptive curriculum in ZEST Sleiman et al. (2026), we design a continuous temporal-bin variant with smoothed difficulty propagation.

We partition the concatenated motion corpus into B uniformly spaced temporal bins. At each episode termination, the per-step tracking score s_t ∈ [0, 1] is averaged over the episode to yield a difficulty estimate d = 1 − s̄, which is accumulated into the corresponding bin via an exponential moving average:

F_b ← α d_b + (1 − α) F_b,  (16)

where F_b denotes the difficulty of bin b and α controls the update rate. To prevent isolated hard bins from dominating the sampling distribution and to propagate difficulty to temporally adjacent regions, we apply a 1D kernel smoothing operation followed by an outlier cap:

F̃_b = (𝒦 * F̂)_b,   F̂_b = min(F_b, c · F̄),  (17)

where 𝒦 is an exponentially decaying kernel and c bounds the maximum bin weight relative to the global mean F̄. The final sampling distribution mixes the smoothed difficulty with a uniform baseline to ensure exploration:

P_b = η F̃_b + (1 − η)/B,  (18)

where η controls the balance between difficulty-driven and uniform sampling. This mechanism continuously steers the training distribution toward challenging temporal regions of the motion manifold, while the kernel smoothing ensures that difficulty information propagates to neighboring segments, preventing abrupt sampling discontinuities.

3.4. Training Paradigm for Unified Tracking and Generation

Dataset Construction. Training tuples for the state-conditioned middleware are generated from a diverse motion corpus using a receding-horizon sampling strategy. For each motion sequence, segment starting points are selected at regular intervals, and the temporal segment length ℓ is drawn from a log-uniform distribution:

ℓ ∼ exp( 𝒰(log ℓ_min, log ℓ_max) ),  (19)

where ℓ_min = H is set equal to the planning horizon.
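Eqs. (16)-(19) can be sketched together as a small sampler. The kernel radius, decay rate, and all default constants below are illustrative assumptions; note also that Eq. (18) implicitly treats F̃ as a distribution before mixing, which the sketch makes explicit by normalizing:

```python
import numpy as np

def update_bin_difficulty(F, b, score, alpha=0.1):
    """EMA difficulty update (Eq. 16) with d = 1 - mean episode score."""
    F[b] = alpha * (1.0 - score) + (1.0 - alpha) * F[b]
    return F

def sampling_distribution(F, eta=0.5, c=3.0, decay=0.5, radius=2):
    """Capped, kernel-smoothed, uniform-mixed bin distribution (Eqs. 17-18)."""
    B = len(F)
    F_hat = np.minimum(F, c * F.mean())                       # outlier cap
    kernel = decay ** np.abs(np.arange(-radius, radius + 1))  # exp. decaying kernel K
    kernel /= kernel.sum()
    F_tilde = np.convolve(F_hat, kernel, mode="same")         # Eq. (17)
    P = eta * F_tilde / max(F_tilde.sum(), 1e-8) + (1.0 - eta) / B  # Eq. (18)
    return P / P.sum()

def sample_segment_length(l_min, l_max, rng):
    """Log-uniform segment length (Eq. 19); l_min equals the horizon H."""
    return float(np.exp(rng.uniform(np.log(l_min), np.log(l_max))))
```

A single hard bin raises its own sampling weight and, through the smoothing kernel, that of its temporal neighbors, while the uniform term keeps every bin reachable.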
From each segment, K uniformly spaced keyframes are extracted covering only the first H frames (corresponding to Δt = H / fps seconds), regardless of the total segment length. The conditioning vector c_t = [p̃_t, m_t] pairs the start state with the reference command at the segment end point. The residual supervision is computed against the static baseline β_{t,k} = p̃_t (Eq. (5)), so each training target encodes the motion increment over the next Δt seconds from the current state. This design ensures that (i) the model never needs to predict trajectories exceeding a fixed temporal horizon, bounding the residual magnitude regardless of the state-command gap; (ii) training naturally covers every phase of long recovery sequences, as successive starting points within the same motion yield overlapping local windows; and (iii) eliminating near-zero-length segments (ℓ < H) removes the trivial zero-residual bias that otherwise dominates under mean-squared-error training.

Noisy-State Augmentation. Deployment introduces a systematic discrepancy between noisy physical state estimation and the clean reference commands available during training. To close this gap, we apply asymmetric start-state perturbations: only the initial proprioceptive state is corrupted with channel-wise Gaussian noise,

p̃_t = p_t + ε,  (20)

while the target reference command remains clean. The noise magnitude is scaled per channel to reflect the varying sensitivity of different state dimensions. This asymmetric augmentation mirrors the deployment scenario in which the robot's proprioceptive state reflects both sensor noise and accumulated tracking errors while the reference command is always clean, improving robustness to real-world state-command discrepancies.

Kinematics-Aware Loss Weighting. Beyond the primary velocity-matching objective (Eq.
(7)), we introduce a kinematics-aware weighting scheme motivated by the observation that identical joint-space errors can induce vastly different body-space deviations depending on the current kinematic configuration. Concretely, we weight each state dimension q_d by a pose-dependent Jacobian magnitude:

w_d(q) = Σ_b ‖ ∂p_b / ∂q_d ‖²₂ ≈ Σ_b ‖ ( p_b(q + δ e_d) − p_b(q − δ e_d) ) / (2δ) ‖²₂,  (21)

where p_b is the Cartesian position of tracked body link b. This approximates the diagonal of J⊤J evaluated independently at each training pose and keyframe, producing a weight tensor of shape (N, K, D). It captures pose-dependent lever-arm geometry: a shoulder joint in a T-pose configuration, for example, commands a longer moment arm than when the arm hangs at rest, and receives a correspondingly larger gradient signal. The weights are normalized to unit mean per dimension and clamped to a minimum value to prevent zero-gradient dimensions. In practice, all weights are precomputed via differentiable forward kinematics during dataset construction and cached alongside the training tuples, eliminating all online overhead.

State Representation and Model Variants. We evaluate two state parameterizations: a 38D configuration comprising joint positions (29D), root position (3D), and root orientation in a 6D continuous rotation representation, and a 35D variant omitting the global root position. The 38D formulation enables the middleware to synthesize root translational commands, allowing the robot to autonomously correct global positional drift relative to the reference trajectory. The 35D variant delegates root translation entirely to the reference motion, decoupling the middleware from global localization.
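The two training-time mechanisms of Eqs. (20)-(21) can be sketched as follows. The paper computes the weights with the robot's differentiable forward kinematics; lacking that model, the sketch substitutes a toy planar two-link `link_positions` (purely hypothetical) to illustrate the central-difference approximation:

```python
import numpy as np

def corrupt_start_state(p_t, sigma_per_channel, rng):
    """Asymmetric noisy-state augmentation (Eq. 20): only the start
    state is perturbed; the reference command stays clean."""
    return p_t + rng.normal(0.0, sigma_per_channel, size=p_t.shape)

def link_positions(q):
    """Toy planar 2-link FK standing in for the real robot model."""
    p1 = np.array([np.cos(q[0]), np.sin(q[0])])
    p2 = p1 + np.array([np.cos(q[0] + q[1]), np.sin(q[0] + q[1])])
    return [p1, p2]

def kinematics_weights(q, fk=link_positions, delta=1e-4, w_min=0.1):
    """Pose-dependent weights (Eq. 21): per-dimension sum of squared
    central-difference Jacobian columns over tracked links, normalized
    to unit mean and clamped from below."""
    D = len(q)
    w = np.zeros(D)
    for d in range(D):
        e = np.zeros(D)
        e[d] = delta
        for pb_plus, pb_minus in zip(fk(q + e), fk(q - e)):
            w[d] += np.sum(((pb_plus - pb_minus) / (2 * delta)) ** 2)
    w /= w.mean()
    return np.maximum(w, w_min)
```

For the toy arm at rest, the base joint moves both links (lever arms 1 and 2) and so receives five times the weight of the elbow, which moves only the distal link; this is the lever-arm effect the weighting scheme exploits.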
While this sacrifices autonomous position correction, it eliminates dependence on external positioning systems, making the variant directly deployable with onboard proprioception and an IMU alone. Both configurations retain equivalent fall-recovery and general motion tracking performance. All quantitative results reported in this work use the 38D configuration unless otherwise noted.

4. Experiments

4.1. Implementation Details

Simulation Environment. All experiments are conducted on the Unitree G1 humanoid platform, a full-size bipedal robot standing approximately 1.32 m tall with a total mass of roughly 35 kg. The robot features 29 actuated degrees of freedom spanning the torso, two 7-DoF arms, and two 6-DoF legs, all driven by proprietary electric actuators. Training is carried out in IsaacLab Mittal et al. (2025), a GPU-accelerated simulator built on NVIDIA Isaac Sim, where we instantiate 16,384 parallel environments on a single NVIDIA A100 (80 GB) GPU. The physics simulation runs at a 5 ms timestep (200 Hz) on flat ground with randomized friction coefficients (Tab. 1b), while the control policy queries observations and emits actions at 50 Hz (every 4 simulation substeps). Each action is converted to joint torques by a per-joint PD controller executing at the full 200 Hz rate. For evaluation, all policies are tested in the MuJoCo physics engine on a held-out set of 101 unseen motion sequences spanning locomotion, dance, martial arts, daily activities, fall-and-recovery, acrobatic jumps, and discretized motion clips in which continuous reference trajectories are replaced with piecewise-constant signals consisting of static poses separated by abrupt transitions, thereby removing all smooth interpolation and testing the policy's ability to track discontinuous commands. Each sequence is rolled out for its full duration (up to 20 s).

Motion Dataset.
The training corpus is curated from diverse, complementary sources, comprising selected clips from LAFAN1 Harvey et al. (2020), 100STYLE Mason et al. (2022), SnapMoGen Guo et al. (2025), AMASS Mahmood et al. (2019), and proprietary in-house motion capture recordings. The assembled dataset spans locomotion, martial arts, dance, daily activities, fall-and-recovery sequences, and jumping motions, providing broad stylistic and temporal coverage for evaluating tracking fidelity, robustness, and execution stability under heterogeneous motion commands.

Tracker Training. The general-purpose physics tracker is trained via Proximal Policy Optimization (PPO) within an asymmetric actor–critic framework, following established practices in prior whole-body tracking systems Li et al. (2026); Liao et al. (2025); Luo et al. (2025). The iFSQ encoder, reconstruction decoder, and action decoder are trained jointly end-to-end with the PPO objective, where the reconstruction loss (Eq. (15)) and the RL policy gradient share the same encoder–quantizer pathway. The adaptive motion sampling curriculum (Eq. (18)) is activated after an initial warm-up phase to allow early-stage uniform coverage. The complete reward function and domain randomization ranges are specified in Tab. 1; all remaining hyperparameters are listed in Tab. 2.

Trajectory Generator Training. The state-conditioned flow matching trajectory generator is trained offline on paired trajectory tuples extracted from the motion corpus following the dataset construction procedure described in Eq. (19). The velocity field v̂ is parameterized by an AdaLN-modulated Transformer Peebles and Xie (2023) that predicts K = 8 uniformly spaced keyframes over a fixed planning horizon of Δt = 0.2 s.
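As a minimal sketch of two generator mechanics named here and in the deployment settings below (AdaLN conditioning, and the 5-step Euler rollout from t = 0.9 to 0): the modulation weights, the convention that t = 0 is the data end of the flow so integration steps t downward, and the stand-in velocity field are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def adaln_modulate(h, cond, W_scale, W_shift, eps=1e-6):
    """AdaLN: layer-normalize h, then scale and shift with parameters
    regressed from the conditioning embedding (state + command + timestep)."""
    h_norm = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + eps)
    return h_norm * (1.0 + cond @ W_scale) + cond @ W_shift

def euler_sample(v_field, x_start, cond, t_start=0.9, n_steps=5):
    """Warm-started Euler rollout: integrate the learned velocity field
    from t = t_start down to t = 0; x_start is the directionally
    warm-started initial sample (Eq. (10))."""
    x, t = np.array(x_start, dtype=float), t_start
    dt = t_start / n_steps
    for _ in range(n_steps):
        x = x - dt * v_field(x, t, cond)   # step toward the data end (t = 0)
        t -= dt
    return x
```

With a constant velocity field the rollout simply translates the warm-started sample by 0.9 times that velocity, which is a useful sanity check on the step-size bookkeeping.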
We train two state representation variants: a 38D configuration encoding 29 joint positions, 3D root position, and 6D root orientation, and a 35D variant that omits root position entirely. The 38D model retains full awareness of global positioning, enabling recovery toward the reference command manifold in both pose and location. The 35D model relies solely on joint encoders and an IMU for root orientation, making it directly deployable on hardware without external localization; global position tracking is delegated to the reference motion source. We optimize the velocity matching objective (Eq. (7)) jointly with the kinematics-aware loss weighting (Eq. (21)). Noisy-state augmentation (Eq. (20)) is applied throughout training, with reduced channel-wise Gaussian noise magnitudes for the root pose and orientation channels. During deployment, trajectory samples are generated via 5 Euler integration steps from t = 0.9 to t = 0, initialized with the directional warm start (Eq. (10)) at t_start = 0.9. The generator replans every N_exec = 2 control steps (0.04 s), yielding a closed-loop replanning frequency of 25 Hz. Full hyperparameters for both the tracker and the trajectory generator are listed in Tab. 2.

Table 2: Training hyperparameters for the physics tracker and the trajectory generator. Left: PPO-based tracker training configuration. Right: flow matching trajectory generator architecture, optimization, and inference settings.
Tracker Hyperparameter            Value
Parallel environments             16,384
Rollout horizon                   24 steps
Discount factor γ                 0.99
GAE λ                             0.95
PPO epochs / mini-batches         5 / 4
Clipping ratio ε                  0.2
Actor learning rate               2 × 10⁻³
Critic learning rate              1 × 10⁻³
KL target                         0.01
Entropy coefficient               0.005
Gradient clip norm                1.0
Total training iterations         ~100,000

Generator Hyperparameter          Value
Attention blocks / heads / dim    6 / 4 / 512
Conditioning injection            AdaLN (c_t = [p_t, m_t] + timestep)
Keyframes K / horizon Δt          8 / 0.2 s
State dimension D                 38 / 35
Optimizer                         AdamW (β₁ = 0.9, β₂ = 0.999)
Learning rate                     1 × 10⁻⁴ (cosine decay)
Weight decay                      10⁻⁴
Batch size                        256
Training epochs                   4,000
Parameters                        22.9 M
Inference ODE steps               5 (Euler, t: 0.9 → 0)
Warm-start t_start                0.9
Replan interval N_exec            2 steps (0.04 s, 25 Hz)

4.2. Comparisons

We evaluate eight model configurations that systematically vary four design axes (policy architecture, motion tokenizer, observation design, and generative trajectory planning) to isolate the contribution of each component. All variants share the same simulator, robot morphology, reward function, and training budget, and are evaluated on the identical held-out motion set.

Compared Methods. All policies receive a 10-frame proprioceptive history and a 10-frame future motion reference window (frames t through t+9) unless otherwise noted. We organize the evaluated methods into two groups; detailed network architecture specifications are provided in Tab. 3.

Architecture and external baselines. MLP employs a standard MLP actor–critic with BeyondMimic-style (BM) observations Liao et al. (2025). Transformer follows the TokenHSI architecture Pan et al.
(2025), encoding proprioceptive and motion inputs into per-modality tokens that are processed by a multi-head Transformer with a learnable aggregation token before an MLP action head. VQ-VAE replaces the iFSQ quantizer with a VQ-VAE tokenizer Sun et al. (2024) while retaining the same encoder–decoder policy structure. SONIC Luo et al. (2025) is reproduced from its official open-source release with the original observation design.

Heracles variants. iFSQ_BM pairs the iFSQ tokenizer and encoder–decoder policy with BeyondMimic observations augmented by height features (BM+H), isolating the tokenizer contribution under a standard observation design. iFSQ+H combines the iFSQ-based policy with the proposed observation design (Sec. 3.3) including explicit height features. iFSQ uses the proposed observation design without height, representing the best standalone tracker configuration. Heracles augments the iFSQ tracker with the state-conditioned trajectory generator (Eq. (1)), constituting the full proposed system.

Evaluation Protocol. All methods are evaluated on the same held-out set of 101 motion sequences unseen during training, spanning locomotion, dance, martial arts, daily activities, and fall-and-recovery. Each sequence is rolled out for its full duration (up to 20 s). We report five metrics: (i) Completion Rate (CR), the fraction of reference frames for which the policy maintains a root height error below 0.3 m and a root orientation error below 1.2 rad; (ii) Joint Position Error, the L2 norm of joint-position deviations; (iii) Root Height Error, the absolute height deviation in the world frame; (iv) Root Orientation Error, the orientation error excluding yaw; and (v) Root Linear Velocity Error, the velocity error in the body frame.

Table 3: Network architecture specifications for all evaluated methods. [·] denotes MLP hidden-layer widths.
iFSQ* covers all iFSQ variants (iFSQ_BM, iFSQ+H, iFSQ), which share the same network but differ in observation design.

Method       Component    Configuration
MLP          Actor        [4096, 2048, 1024, 512, 256] MLP
             Critic       [4096, 2048, 1024, 512, 256] MLP
Transformer  Tokenizer    [512, 512] MLP × 2 modalities → 3 tokens (512-dim)
             Backbone     3-layer, 4-head, 512-dim Transformer
             Action head  [2048, 1024, 256] MLP
             Critic       [3072, 1536, 768, 512] MLP
VQ-VAE       Quantizer    Codebook |𝒞| = 10,240, dim = 512
             Policy       Enc-Dec (identical to iFSQ)
SONIC        –            Official release Luo et al. (2025); stride-5 reference sampling
iFSQ*        Policy       iFSQ Enc-Dec (Sec. 3.3)
Heracles     Tracker      iFSQ Enc-Dec (Sec. 3.3)
             Traj. gen.   6-layer, 4-head, 512-dim AdaLN Transformer Peebles and Xie (2023)

Results. Tab. 4(a) summarizes the component configuration of each variant, and quantitative tracking performance is reported in Tab. 4(b). We highlight four principal findings.

Robustness. The three variants equipped with the proposed observation design and iFSQ tokenizer (iFSQ+H, iFSQ, Heracles) consistently outperform all baselines in completion rate, achieving 87.3%, 87.2%, and 90.6% respectively. Heracles attains the highest completion rate, exceeding the best external baseline VQ-VAE (86.0%) by 4.6 percentage points and MLP (84.8%) by 5.8 points. Among external baselines, completion rates range from 79.3% (SONIC) to 86.0% (VQ-VAE). Switching from BM to the proposed observation design while keeping

Table 4: Component configuration and quantitative comparison of all evaluated methods. (a) Obs. column: BM = BeyondMimic-style Liao et al. (2025); † = proposed (Sec. 3.3); +H = with height features. ✓/✗ indicates presence or absence of the trajectory generator.
(b) Completion rate measures the fraction of reference frames for which the root height error remains below 0.3 m and the root orientation error stays below 1.2 rad; tracking errors are reported for joint positions, root height, root orientation (excl. yaw), and root linear velocity. Blue cells mark the best result per metric. Colored percentages show relative change from MLP: teal = better, red = worse.

(a) Component configuration

Method       Obs.    Tokenizer  Policy    Traj. Gen.
MLP          BM      –          MLP       ✗
Transformer  BM      –          Trans.    ✗
VQ-VAE       BM      VQ         Enc-Dec   ✗
SONIC        SONIC   –          SONIC     ✗
iFSQ_BM      BM+H    iFSQ       Enc-Dec   ✗
iFSQ+H       †+H     iFSQ       Enc-Dec   ✗
iFSQ         †       iFSQ       Enc-Dec   ✗
Heracles     †       iFSQ       Enc-Dec   ✓

(b) Tracking performance

Method       CR (%) ↑        Joint Err (rad) ↓  Height Err (m) ↓  Ori Err (rad) ↓  LinVel Err (m/s) ↓
MLP          84.8            1.1572             0.1194            0.3590           0.2230
Transformer  80.6 (−5.0%)    1.5436 (−33.4%)    0.1426 (−19.4%)   0.3410 (+5.0%)   0.2046 (+8.3%)
VQ-VAE       86.0 (+1.4%)    2.3013 (−98.9%)    0.1077 (+9.8%)    0.3675 (−2.4%)   0.2376 (−6.5%)
SONIC        79.3 (−6.5%)    1.9828 (−71.3%)    0.1402 (−17.4%)   0.3771 (−5.0%)   0.2334 (−4.7%)
iFSQ_BM      85.1 (+0.4%)    1.4760 (−27.5%)    0.1096 (+8.2%)    0.3474 (+3.2%)   0.2362 (−5.9%)
iFSQ+H       87.3 (+2.9%)    1.2924 (−11.7%)    0.0271 (+77.3%)   0.1539 (+57.1%)  0.1709 (+23.4%)
iFSQ         87.2 (+2.8%)    1.1863 (−2.5%)     0.0955 (+20.0%)   0.3614 (−0.7%)   0.1561 (+30.0%)
Heracles     90.6 (+6.8%)    1.3272 (−14.7%)    0.0764 (+36.0%)   0.2728 (+24.0%)  0.2325 (−4.3%)

the iFSQ tokenizer fixed (iFSQ_BM → iFSQ) raises completion from 85.1% to 87.2%, confirming the role of observation design in robust tracking.

Tokenizer effectiveness.
Comparing VQ-VAE and iFSQ_BM, which share the same encoder–decoder policy and BM-style observation design but differ only in the quantizer, reveals that iFSQ reduces the joint-position error from 2.3013 to 1.4760 rad (−35.8%) while maintaining a comparable completion rate (85.1% vs. 86.0%). This marked reduction in tracking error highlights the superior codebook utilization of finite scalar quantization over a conventional VQ-VAE in this high-frequency control domain.

Observation design and height features. Among the iFSQ variants, iFSQ+H achieves the lowest height error (0.0271 m) and orientation error (0.1539 rad) across all methods, while iFSQ attains the best linear-velocity tracking (0.1561 m/s) and a competitive joint-position error (1.1863 rad). Removing explicit height features (iFSQ+H → iFSQ) yields lower joint error (1.1863 vs. 1.2924 rad) at the expense of higher height error (0.0955 vs. 0.0271 m) and markedly degraded orientation control (0.3614 vs. 0.1539 rad), confirming that the height channel is critical for vertical and orientation precision.

Trajectory generator. Heracles achieves the highest completion rate (90.6%) among all methods, a 6.8% relative improvement over MLP and a 3.9% improvement over the standalone iFSQ tracker, while maintaining competitive tracking quality. Compared to iFSQ, the trajectory generator reduces root orientation error from 0.3614 to 0.2728 rad (a 24.5% relative reduction) and height error from 0.0955 to 0.0764 m (a 20.0% reduction), at a modest cost in joint-position error (1.3272 vs. 1.1863 rad). This indicates that the state-conditioned generative planner synthesizes spatially aware recovery trajectories that refine both vertical and heading control, a capability absent in the reactive tracker alone.

We present qualitative sim-to-sim evaluation results on an out-of-distribution martial arts sequence in MuJoCo, as shown in Fig. 3.
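As a concrete reading of the Completion Rate column used throughout these tables (fraction of frames with root height error below 0.3 m and root orientation error below 1.2 rad, per the evaluation protocol), a minimal sketch; this is our interpretation of the metric, not the released evaluation script:

```python
import numpy as np

def completion_rate(height_err, ori_err, h_thresh=0.3, ori_thresh=1.2):
    """Fraction of reference frames where BOTH the root height error and
    the root orientation error (excluding yaw) stay within threshold."""
    ok = (np.asarray(height_err) < h_thresh) & (np.asarray(ori_err) < ori_thresh)
    return float(np.mean(ok))
```

A frame counts as completed only if both conditions hold simultaneously, so a fall registers through either the height drop or the orientation excursion.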
Among the baseline methods, MLP, Transformer, and SONIC fail to maintain balance and collapse early in the sequence, while VQ-VAE barely tracks the motion throughout. For our ablation variants,

Figure 3: Qualitative sim-to-sim comparison on an out-of-distribution martial arts sequence (panels: MLP, VQ-VAE, Transformer, SONIC, iFSQ_BM, iFSQ+H, iFSQ, Heracles). Each row shows a different method tracking the same reference motion on a Unitree G1 humanoid alongside a reference ghost in MuJoCo. MLP, Transformer, and SONIC collapse early; VQ-VAE barely tracks the motion. Among our ablations, iFSQ_BM and iFSQ survive without falling, while iFSQ+H falls but later recovers. Heracles (Ours) tracks the full sequence most accurately, demonstrating the strongest robustness to OOD motions.

iFSQ_BM and iFSQ successfully survive the entire sequence without falling; however, iFSQ+H experiences a fall at an intermediate stage but manages to recover afterwards. In contrast, Heracles not only survives the full sequence but also accurately tracks the root position and body pose across all frames, demonstrating the strongest robustness to out-of-distribution motions among all evaluated approaches.

We further validate our method through real-world deployment on a Unitree G1 humanoid robot, as illustrated in Fig. 4. The experiments span a broad spectrum of behaviors, ranging from everyday locomotion such as walking and running, to highly dynamic skills including kicking and full 360° kicks, as well as human-object interaction (HOI) scenarios.

Figure 4: Real-world motion tracking across diverse and dynamic behaviors (panels: Walk, Run, HOI, Kick, 360° Kick).
Real-world experiments demonstrate that our model generalizes to a broad spectrum of motions, from everyday locomotion (walk, run) to highly dynamic skills (kick, 360° kick) and human-object interaction.

4.3. Fall-and-Recovery Evaluation

To specifically assess the fall-and-recovery capabilities that distinguish Heracles from pure tracking approaches, we conduct a dedicated evaluation on a curated subset of fall-and-recovery motion sequences extracted from the test corpus. These sequences encompass diverse recovery scenarios including lie-to-stand, prone-to-stand, and stand-to-lie transitions, with varied initial fallen configurations and recovery directions. All sequences are evaluated in both their original continuous form and discretized variants, in which the trajectories are replaced with piecewise-constant poses, to assess robustness under discontinuous reference signals. Quantitative results are summarized in Tab. 5; qualitative sim-to-sim comparisons are shown in Fig. 5.

Table 5: Fall-and-recovery evaluation on challenging lie-to-stand, prone-to-stand, and stand-to-lie sequences. CR denotes completion rate. Blue marks the best result per metric. Colored percentages show relative change from MLP: teal = better, red = worse.

Method       CR (%) ↑         Joint Err (rad) ↓  Height Err (m) ↓  Ori Err (rad) ↓   LinVel Err (m/s) ↓
MLP          44.0             2.1720             0.3586            1.0157            0.2710
Transformer  40.6 (−7.7%)     2.8309 (−30.3%)    0.3706 (−3.3%)    0.8355 (+17.7%)   0.2708 (+0.1%)
VQ-VAE       69.8 (+58.6%)    2.5700 (−18.3%)    0.1898 (+47.1%)   0.4307 (+57.6%)   0.2719 (−0.3%)
SONIC        42.8 (−2.7%)     2.9897 (−37.6%)    0.3342 (+6.8%)    0.8629 (+15.0%)   0.2895 (−6.8%)
iFSQ_BM      52.4 (+19.1%)    2.4482 (−12.7%)    0.2880 (+19.7%)   0.9109 (+10.3%)   0.3134 (−15.6%)
iFSQ+H       52.7 (+19.8%)    1.7660 (+18.7%)    0.0419 (+88.3%)   0.2744 (+73.0%)   0.2793 (−3.1%)
iFSQ         48.2 (+9.5%)     2.0236 (+6.8%)     0.3024 (+15.7%)   1.0488 (−3.3%)    0.2405 (+11.3%)
Heracles     90.0 (+104.5%)   1.4114 (+35.0%)    0.0762 (+78.7%)   0.2427 (+76.1%)   0.2830 (−4.4%)

The fall-and-recovery evaluation reveals a stark performance divide that underscores the fundamental limitations of pure tracking paradigms (Tab. 5). All reactive tracking baselines (MLP, Transformer, and SONIC) achieve completion rates below 45%, indicating a near-complete inability to execute fall-recovery motions. While these methods function adequately under nominal tracking conditions (Tab. 4), they fail catastrophically when the reference motion demands transitions through the extreme pose configurations inherent to fall-and-recovery sequences. Among the compared methods, VQ-VAE demonstrates unexpected resilience (CR = 69.8%), suggesting that its less precise but more flexible latent representation provides some implicit generalization to extreme poses. The iFSQ tracker variants without the generative middleware show mixed results: iFSQ+H achieves the highest completion rate among standalone trackers (52.7%) and the lowest height error (0.0419 m) due to its explicit height features, while iFSQ_BM and iFSQ achieve completion rates of 52.4% and 48.2% respectively, with substantially higher tracking errors. Heracles achieves the highest completion rate by a decisive margin (90.0%), exceeding the second-best method VQ-VAE by 20.2 percentage points, a 104.5% relative improvement over MLP.
Critically, Heracles also attains the lowest joint-position error (1.4114 rad) and orientation error (0.2427 rad) among all methods, confirming that the state-conditioned generative middleware is essential for maintaining coherent tracking through the extreme state transitions characteristic of fall-and-recovery motions. By dynamically synthesizing feasible recovery trajectories conditioned on the robot's real-time physical state, Heracles bridges the gap between the reference motion and the robot's actual configuration, enabling graceful execution of motions that drive purely reactive trackers to catastrophic failure.

4.4. Ablation Studies and Architectural Analysis

To isolate the contribution of each design choice within the generative middleware, we conduct ablation experiments on the trajectory generator while keeping the iFSQ tracker fixed. All ablations are evaluated on the full 101-sequence test set spanning the complete diversity of the evaluation corpus. Results are summarized in Tab. 6.

Table 6: Ablation study on the trajectory generator's key design components. All variants are evaluated on the full 101-sequence test set using the same iFSQ tracker. Blue marks the best result per metric. Colored percentages show relative change from Heracles (full): teal = better, red = worse.

Variant                          CR (%) ↑        Joint Err (rad) ↓  Height Err (m) ↓  Ori Err (rad) ↓  LinVel Err (m/s) ↓
Heracles (full)                  90.6            1.3272             0.0764            0.2728           0.2325
w/o directional warm start       87.2 (−3.8%)    1.6236 (−22.3%)    0.0962 (−25.9%)   0.3393 (−24.4%)  0.2423 (−4.2%)
w/o noisy-state augmentation     78.6 (−13.2%)   1.8896 (−42.4%)    0.1463 (−91.5%)   0.4318 (−58.3%)  0.2182 (+6.1%)
w/o kinematics-aware weighting   82.1 (−9.4%)    1.6931 (−27.6%)    0.1200 (−57.1%)   0.4055 (−48.6%)  0.2394 (−3.0%)

Directional Warm Start. Replacing the directional motion prior (Eq.
(10)) with pure Gaussian initialization degrades all metrics: completion drops from 90.6% to 87.2% (−3.8%) and joint error increases by 22.3%. The warm start seeds the ODE solver with a coarse linear interpolation toward the target, enabling the learned velocity field to focus its refinement budget on naturalness rather than gross direction estimation. Without this prior, the generator must expend additional integration steps to discover the correct recovery heading, yielding failures particularly on fall-and-recovery sequences.

Noisy-State Augmentation. Removing the asymmetric noise injection (Eq. (20)) during training produces the most severe degradation in completion rate among all ablation variants, with CR falling to 78.6% (−13.2%) and joint error increasing by 42.4%. Height error nearly doubles (+91.5%), and orientation error increases by 58.3%. Without noise augmentation, the generator overfits to clean state inputs; at deployment, accumulated tracking drift and sensor noise push the conditioning state away from the training distribution, causing catastrophic failure on out-of-distribution motions. The augmented variant bridges this train-deploy distribution gap,

Figure 5: Qualitative sim-to-sim comparison on an OOD lie-to-stand sequence (panels: MLP, VQ-VAE, Transformer, SONIC, iFSQ_BM, iFSQ+H, iFSQ, Heracles). Same setup as Fig. 3. MLP, Transformer, SONIC, and iFSQ_BM fail to stand up; VQ-VAE, iFSQ+H, and iFSQ partially track the motion. Heracles (Ours) completes the full lie-to-stand transition and most accurately tracks the root position.

enabling robust trajectory generation even from noisy proprioceptive readings.

Kinematics-Aware Loss Weighting. Removing the Jacobian-based weighting (Eq.
(21)) produces the largest degradation in completion rate after noisy-state augmentation, with CR falling to 82.1% (−9.4%). Height error increases by 57.1% and orientation error by 48.6%. The severity of this ablation indicates that pose-dependent lever-arm geometry is critical for robust tracking: a unit-radian shoulder error in an extended-arm configuration induces far larger Cartesian displacement than the same error with arms at rest, and the weighting scheme enables the generator to prioritize these geometrically sensitive configurations.

Learned Discrete Representation. Beyond component-level ablations, we examine whether the iFSQ tokenizer acquires a semantically structured codebook after training on the full motion corpus. Fig. 6 visualizes the discrete code activations projected into a three-dimensional embedding space, with each point representing a quantized motion token colored by its source motion category. The visualization reveals clearly separable clusters corresponding to distinct motor skills (walking, running, jumping, martial arts, dance, parkour, crawling, balance, and fall recovery), despite the quantizer receiving no explicit category labels during training. Notably, semantically related skills occupy neighboring regions (the walking and running clusters lie adjacent, while crawling and balance form a separate group), indicating that the iFSQ codebook captures meaningful kinematic similarity structure. This emergent organization confirms that the finite scalar quantization not only compresses high-frequency motion signals into compact tokens but also distills a structured motion taxonomy that enables the downstream action decoder and trajectory generator to reason over semantically coherent motion abstractions.

Figure 6: Emergent semantic clustering in the learned iFSQ codebook.
Each point represents a quantized motion token projected into 3D via PCA; colors denote motion categories. Despite receiving no category labels during training, the codebook self-organizes into semantically coherent clusters corresponding to distinct motor skills.

Architectural Motivation. The ablation results collectively reveal why a hierarchical architecture, with a dedicated generative planner layered above a physics tracker, is preferable to monolithic alternatives for general-purpose humanoid control. On the 101-sequence evaluation, removing any single component reduces completion rate by 3.8–13.2%, demonstrating that all three design choices are essential for robust performance. Noisy-state augmentation has the largest impact on tracking quality (CR: −13.2%, height error: +91.5%), underscoring that bridging the train–deploy distribution gap is the most critical challenge for sustained tracking fidelity. Kinematics-aware weighting produces the second-largest CR degradation (−9.4%) alongside substantial increases in height (+57.1%) and orientation error (+48.6%), revealing that accurate lever-arm modeling is essential for navigating complex multi-step transitions. These findings reinforce the design principle of frequency separation: the generative middleware reasons over a 0.2 s planning window at 25 Hz, synthesizing temporally coherent recovery strategies that the tracker then faithfully executes at 50 Hz, mirroring the hierarchical structure of biological motor systems.

4.5. Recovery Behaviors and Analysis

Beyond quantitative metrics, we examine the qualitative character of recovery behaviors to understand how the generative middleware transforms the robot's response to severe disturbances.
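The frequency-separation principle discussed above (a 25 Hz planner over a 0.2 s window feeding a 50 Hz tracker) can be sketched as a toy two-rate loop. The bodies of `plan_window` and `track_step` below are illustrative placeholders (a straight-line ramp and a proportional step), not the paper's networks, and all function names are assumptions:

```python
# Toy two-rate control loop illustrating frequency separation.
# plan_window() stands in for the generative middleware,
# track_step() for the low-level physics tracker.

TRACKER_HZ = 50          # tracker control rate (paper: 50 Hz)
PLANNER_HZ = 25          # middleware replanning rate (paper: 25 Hz)
WINDOW_S = 0.2           # planning horizon (paper: 0.2 s)
STEPS_PER_PLAN = TRACKER_HZ // PLANNER_HZ   # tracker ticks per plan (2)
WINDOW_STEPS = int(WINDOW_S * TRACKER_HZ)   # frames per planned window (10)

def plan_window(state, reference):
    # Placeholder planner: a straight-line ramp from the current
    # state toward the reference over the planning window.
    return [state + (reference - state) * (i + 1) / WINDOW_STEPS
            for i in range(WINDOW_STEPS)]

def track_step(state, target):
    # Placeholder tracker: a proportional step toward the planned frame.
    return state + 0.5 * (target - state)

def run(state, reference, seconds=1.0):
    window, trace = [], []
    for tick in range(int(seconds * TRACKER_HZ)):
        if tick % STEPS_PER_PLAN == 0:           # replan at 25 Hz
            window = plan_window(state, reference)
        target = window[tick % STEPS_PER_PLAN]   # consume frames at 50 Hz
        state = track_step(state, target)
        trace.append(state)
    return trace

trace = run(state=0.0, reference=1.0)
```

Even with these trivial placeholders, the loop exhibits the intended division of labor: the slow layer decides where the system should go over the next window, while the fast layer only has to follow the most recent plan.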
This analysis reveals a fundamental distinction between tracking-based and generation-based control paradigms.

Tracking-Only Failure Modes. When subjected to large external perturbations, standalone trackers (including all baselines and the iFSQ variant) exhibit a characteristic failure pattern. Upon being displaced from the reference trajectory, the tracker computes a single-step corrective action that minimizes the instantaneous state-reference error. This myopic strategy produces rigid, jerky corrective torques that lack the temporal coordination required for dynamic balance recovery. In the most severe cases, the tracker's insistence on returning to the exact reference pose forces physically infeasible joint configurations, accelerating rather than preventing the fall. Even when the tracker avoids catastrophic failure, its recovery motions appear distinctly non-anthropomorphic: abrupt whole-body stiffening, unnatural arm postures, and a conspicuous absence of the compensatory stepping strategies that characterize human balance recovery.

Generative Recovery Behaviors. Heracles produces qualitatively different recovery dynamics. When a large perturbation displaces the robot from the reference manifold, the generative middleware detects the state-reference discrepancy and synthesizes a short-horizon trajectory that prioritizes physical feasibility over immediate reference fidelity. This manifests as emergent human-like recovery strategies: compensatory stepping to widen the base of support, coordinated arm counter-motions to redistribute angular momentum, and gradual torso realignment before resuming the original motion. Crucially, these behaviors are not hand-designed or reward-engineered; they emerge naturally from the flow matching model's learned distribution over physically plausible motion transitions, conditioned on the robot's real-time state.
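The implicit transition between tracking and synthesis can be illustrated with a minimal flow-matching-style sampler. This is a sketch under strong simplifications: the closed-form `toy_velocity_field` stands in for the learned, state-conditioned network, and the interpolation and noise coefficients of the directional warm start are made-up values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_velocity_field(x, t, target):
    # Stand-in for the learned velocity field. Under a linear
    # (rectified-flow-style) probability path, the ideal field points
    # from the current sample toward the target.
    return (target - x) / max(1.0 - t, 1e-3)

def generate(state, target, steps=8, warm_start=True):
    # Euler integration of the sampling ODE. With warm_start=True the
    # solver is seeded near a coarse interpolation between the robot's
    # current state and the target (plus small noise) rather than pure
    # Gaussian noise, so refinement starts from a sensible heading.
    if warm_start:
        x = 0.5 * state + 0.5 * target + 0.1 * rng.standard_normal(state.shape)
    else:
        x = rng.standard_normal(state.shape)
    t, dt = 0.0, 1.0 / steps
    for _ in range(steps):
        x = x + dt * toy_velocity_field(x, t, target)
        t += dt
    return x

recovery_pose = generate(state=np.zeros(3), target=np.ones(3))
```

When the robot's state already matches the reference, the warm start places the initial sample essentially on the target, so the sampler acts close to an identity map; under large deviations the same machinery performs genuine synthesis starting from the perturbed state.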
From Tracking to Planning: Rethinking General Humanoid Control. The observed behavioral difference reveals a deeper conceptual insight into what constitutes a truly general-purpose humanoid controller. The dominant tracking paradigm implicitly assumes that control reduces to minimizing the deviation between the robot's state and a predefined kinematic reference. While effective for nominal execution, this formulation conflates two fundamentally distinct objectives: executing a desired motor intent and maintaining physical viability. Human motor control does not operate as a rigid reference tracker. When a person stumbles, they do not attempt to snap back to a pre-planned gait trajectory. Instead, the motor system rapidly revises the intended trajectory itself, generating a new plan that accounts for the current physical state, gravitational constraints, and available momentum. The original intent is temporarily deprioritized in favor of a dynamically feasible recovery path, and only once stability is restored does the system smoothly re-engage with the original task objective. Heracles embodies precisely this principle through its state-conditioned middleware. The generative planner continuously modulates the reference signal based on real-time physical feasibility: passing commands through unmodified when tracking is viable, but seamlessly rewriting them when the physical state demands a different motor strategy. This transforms the controller from a passive trajectory follower into an active trajectory synthesizer that reasons about what the robot should do given its current physical reality, rather than what it was told to do by a reference signal computed without knowledge of real-time dynamics.

Implications for General-Purpose Deployment. This paradigm shift has concrete implications for deploying humanoid robots in unstructured environments.
Real-world scenarios invariably introduce perturbations absent from any training distribution: unexpected collisions, terrain irregularities, payload changes, or degraded actuation. A tracking-only controller can only succeed if its training-time domain randomization happens to cover the encountered disturbance; beyond this envelope, it fails in a brittle manner. The generative middleware, by contrast, provides a principled mechanism for open-ended adaptation: as long as the learned motion prior contains transitions between sufficiently diverse physical states, the model can compose novel recovery strategies for previously unseen perturbations. This compositional generalization, the ability to recombine learned motion primitives into new sequences conditioned on novel states, is what distinguishes a truly general controller from one that merely covers a large but finite set of pre-trained behaviors.

Figure 7: Omnidirectional fall recovery motion tracking. Real-world experiments demonstrate that our model generalizes to arbitrary lie-to-stand recovery motions, successfully handling varied initial fallen configurations and recovery directions without task-specific engineering.

We further evaluate this generalization capability on omnidirectional fall recovery tasks in both simulation and the real world. As shown in Fig. 5, Heracles completes the full lie-to-stand transition in MuJoCo while all baseline methods fail or only partially succeed. Fig. 7 presents the corresponding real-world results: across three trials, the robot is initialized in distinct fallen configurations (supine, lateral, and prone postures). In all cases, the robot successfully executes a complete lie-to-stand recovery by following the reference motion, progressively transitioning through intermediate support phases.
Critically, the recovery directions vary across trials, with the robot rising toward different orientations relative to its initial fallen heading, demonstrating true omnidirectional recovery rather than a memorized fixed-direction strategy.

5. Conclusion

This work presents Heracles, a state-conditioned generative middleware that resolves the longstanding dichotomy between strict kinematic tracking and robust physical recovery in humanoid robotics. By embedding a continuous flow matching process within a closed-loop control architecture, the framework dynamically bridges high-fidelity intent execution with anthropomorphic resilience. Without relying on explicit mode-switching heuristics, Heracles intrinsically preserves exact zero-shot tracking under nominal conditions while seamlessly synthesizing dynamically feasible recovery maneuvers during severe environmental perturbations. Ultimately, this unified paradigm liberates embodied control from rigid, reference-bound execution, establishing a scalable foundation for deploying agile and resilient general-purpose humanoids in complex physical environments.

6. X-Humanoid Heracles Project Team

This report reflects a collaborative effort by the X-Humanoid Heracles project team. The roles and contributors are listed below.

Project Leader. Qiang Zhang
Equal Contribution. Zelin Tao, Zeran Su
Project Team Members. Peiran Liu, Jingkai Sun, Wenqiang Que, Jiahao Ma, Jialin Yu, Jiahang Cao, Pihai Sun, Hao Liang
Technical Support. Gang Han, Wen Zhao, Zhiyuan Xu, Yijie Guo, Jian Tang
