SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

Yifan Wang¹, Bolian Li¹, David Cho¹, Ruqi Zhang¹, Fanping Sui², Ananth Grama¹
¹ Purdue University  ² Texas Instruments

Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show that SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gain of 9.1% under PPO and 11.6% under GRPO on math tasks and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond strong performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable and exploratory training and more generalizable reasoning ability.

Keywords: Structure-aware reinforcement learning, label-free RLVR, LLM post-training
Date: 2026-03-29
Code: https://github.com/cacayaya/SARL
Contact: wang5617@purdue.edu

1 Introduction

Recently, Large Reasoning Models (LRMs) have demonstrated transformative capabilities in tackling complex, reasoning-intensive tasks, especially in mathematics and code generation [5, 29]. Central to this progress is the Chain-of-Thought (CoT) paradigm, where models generate explicit, step-by-step reasoning before producing a final answer [27, 25]. These reasoning capabilities are typically catalyzed by supervised fine-tuning (SFT) and, more crucially, by reinforcement learning (RL) [15, 18]. By rewarding correct final outcomes, RL allows models to autonomously explore and refine their internal logic, leading to the emergence of sophisticated "thinking" behaviors [10, 23].

However, the current success of reasoning models is largely tethered to the Reinforcement Learning from Verifiable Rewards (RLVR) framework [5, 18]. RLVR relies heavily on tasks with objective ground truths and automated verifiers to provide reward signals. This dependency creates a "bottleneck of verifiability": in domains where ground truth is ambiguous, expensive to label, or non-existent, such as open-ended strategic planning or complex philosophical inquiry, the RL signal largely vanishes. Furthermore, current reward modeling predominantly focuses on outcome-level rewards, which treat the reasoning process as a "black box" and offer limited control over intermediate steps. While Process Reward Models (PRMs) [10, 13, 7] have been introduced to provide more granular feedback, they are still constrained by the cost of step-level supervision and by limited scalability to tasks without clear, rule-based verification.
However, to some extent, PRMs still optimize reasoning primarily through its contribution to final-answer correctness, offering limited control over how intermediate thoughts are structured or organized. This raises a central question: Can complex reasoning be optimized not through supervision derived from ground-truth answers, but by directly shaping how the model thinks?

A compelling answer to this question lies in the macro-scale functional architecture of the human brain [20]. Neuroscientific evidence suggests that human high-level cognition is facilitated by the small-world organization of the brain's connectome [3]. This topology is characterized by dense local clusters responsible for specialized information processing and a few strategic long-range shortcuts that minimize the path length between disparate functional modules [26]. Such a structure allows the brain to achieve an optimal balance between functional segregation and global integration, which enables efficient relational reasoning and creative insight even in the absence of external feedback [24]. In this biological framework, a thinking process is essentially a traversal across this small-world network. Here, the quality of reasoning is determined by its navigational efficiency and topological coherence [17]. Inspired by this architecture, we hypothesize that reasoning models can benefit from structural priors that encourage more efficient exploration in latent conceptual space.

In this work, we introduce Structure-Aware Reinforcement Learning (SARL), a label-free training framework that improves reasoning by optimizing the topology of the reasoning process itself rather than relying on ground-truth labels. SARL constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world structure, encouraging reasoning trajectories that are both locally coherent and globally efficient.

Our main contributions are as follows:
• We propose SARL, a label-free RL framework that optimizes the structure of reasoning rather than its outcome, by rewarding the small-world topology of per-response Reasoning Maps.
• We show that SARL is effective in both mathematical and open-ended reasoning, outperforming prior label-free baselines and matching or surpassing ground-truth-based RL in several settings.
• We find that SARL induces a more stable training regime, with lower policy drift and higher entropy, indicating more exploration and better generalization ability.

2 Related Work

Reasoning Structure in Human Cognition and Topology of LRM Reasoning. The human brain's connectome exhibits a small-world topology [26, 3]: dense local clustering within specialized functional modules coexists with short global path lengths across the entire network, enabling simultaneous functional segregation and rapid cross-module integration [24]. This structural property is not merely architectural; it underlies the brain's capacity to navigate abstract cognitive spaces efficiently, coordinating between specialized processing units at minimal energetic cost [17, 20].

Figure 1: Overview of Structure-Aware Reinforcement Learning (SARL). Left: SARL replaces outcome-based supervision with a label-free structure reward computed from the model's reasoning process. Middle: the reward is motivated by the small-world prior, which balances local specialization and global integration. Right: construction of the SARL Reasoning Map and Structure Reward.

Recent empirical work has begun uncovering analogous structural signatures in Large Reasoning Models (LRMs). Minegishi et al. [14] extract reasoning graphs from the hidden-state trajectories of LRMs and find that distilled reasoning models exhibit substantially higher cyclicity, diameter, and small-world index than base models, with these structural advantages scaling with task difficulty and correlating positively with accuracy. From a geometric perspective, Zhou et al. [33] model chain-of-thought reasoning as smooth flows in the model's representation space, showing that logical transitions induce structured geometric trajectories governed by the semantics of intermediate propositions. At the level of trace analysis, Feng et al. [4] introduce graph-based structural metrics for CoT traces and demonstrate that the fraction of abandoned reasoning branches, rather than trace length or self-review frequency, is the strongest predictor of correctness, providing causal evidence that structural organization shapes reasoning outcomes. Tan et al. [21] further show that topological data analysis features of reasoning traces carry substantially higher predictive power for reasoning quality than standard graph-connectivity metrics. Collectively, these studies establish a robust empirical link between the structural organization of reasoning trajectories and reasoning quality. Our work takes the natural next step: rather than analyzing these structural properties post hoc, we optimize the small-world topology of the reasoning graph built from thinking steps, using it directly as an intrinsic reasoning-quality signal.

Label-Free RLVR. Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the dominant post-training paradigm for LRMs [5], using binary outcome rewards in domains, primarily mathematics and code, where correctness can be automatically verified. Algorithms such as GRPO [18] enable stable, scalable RL training; yet RLVR is fundamentally constrained to verifiable tasks. In open-ended domains such as creative writing, strategic planning, or advisory reasoning, no automated verifier exists and the RL signal vanishes entirely. Alternative label-free approaches have emerged to enable RLVR without ground-truth annotations. TTRL [34] derives reward signals from majority-vote consistency across multiple rollouts, treating the consensus answer as a pseudo-label.
While effective on closed-form reasoning tasks, this voting-based binary reward is ill-suited for open-ended generation where no single canonical answer exists. Entropy-based methods offer a more general alternative: EMPO [30] and concurrent work [1] minimize the predictive entropy of model outputs in semantic space, using the model's own uncertainty as the sole training signal without requiring any labeled data. More directly targeting the open-ended setting, Intuitor [31] introduces Reinforcement Learning from Internal Feedback (RLIF), where the model's self-certainty scores serve as reward signals within a GRPO framework, demonstrating generalization to diverse out-of-domain tasks such as code generation where verifiable external rewards are unavailable. Other work uses output format compliance and response length as surrogate rewards [28], limiting applicability to math tasks with prescribed answer formats. In contrast to these methods, our approach derives the label-free reward signal from the structure of the model's reasoning process, and it applies not only to highly verifiable tasks such as math but also to open-ended, complex tasks in broader real-world reasoning settings. Table 1 summarizes the key differences between these label-free approaches.

Table 1: Comparison of label-free RL methods. "Algorithm-agnostic" indicates whether the reward can be used beyond some specific RL methods (e.g., not limited to GxPO).

| Method | Reward Signal | Algorithm-agnostic | Open-ended | Verifiable |
|---|---|---|---|---|
| EMPO [30] | Entropy minimization | ✗ | ✓ | ✓ |
| TTRL [34] | Majority voting | ✗ | ✗ | ✓ |
| Format & Length [28] | Format + length | ✓ | ✗ | ✓ |
| SARL (Ours) | Reasoning Structure | ✓ | ✓ | ✓ |

3 Method

SARL replaces outcome-based supervision with structural supervision on the reasoning process itself. Given a generated trajectory, SARL constructs a per-response Reasoning Map from the intermediate thinking steps and assigns a reward according to its small-world topology.
Intuitively, SARL favors reasoning traces that are locally coherent within functional subproblems while remaining globally well connected across different reasoning modes. This section describes the training objective, the construction of the Reasoning Map, and the resulting reward.

3.1 Problem Formulation

Let $x$ denote an input question and let $\pi_\theta$ be a language model policy parameterized by $\theta$. Following the chain-of-thought paradigm, the model produces a trajectory $\tau = (s_1, s_2, \ldots, s_T, a)$, where $s_1, \ldots, s_T$ are intermediate reasoning steps inside the <think>...</think> block and $a$ is the final answer. Standard reinforcement learning from verifiable rewards optimizes an outcome-level objective of the form

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_\theta(\cdot \mid x)} \big[ \mathbb{1}[a = y^*] \big], \tag{1}$$

where $y^*$ is a ground-truth answer. While effective in verifiable domains, Eq. (1) depends on access to labels and provides no direct signal about the quality of the reasoning trace itself.

SARL replaces the outcome reward with a structural reward computed from the generated reasoning trajectory. For each rollout, we build a Reasoning Map $\mathcal{G}(\tau) = (\mathcal{V}, \mathcal{E})$ whose nodes represent latent reasoning types and whose edges represent transitions between them. The SARL objective is

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_\theta(\cdot \mid x)} \big[ \mathrm{SR}\big(\mathcal{G}(\tau)\big) \big], \tag{2}$$

where SR denotes the Structure Reward and measures the small-world organization of the resulting graph. Crucially, Eq. (2) does not rely on $y^*$, allowing the policy to be trained even when correctness labels are unavailable or expensive to obtain.

3.2 Structure-Aware Reinforcement Learning

We next present the overall SARL pipeline, from Reasoning Map Construction (§3.2.1) to the definition of the Structure Reward (SR) (§3.2.2). The full pipeline is summarized in Algorithm 1.

Algorithm 1 Structure-Aware Reinforcement Learning (label-free)
Require: Policy $\pi_\theta$, embedding model $\mathcal{M}$, training set $\mathcal{D}$, RL algorithm $\mathcal{A}$, rollouts per question $G$
for each training iteration do
    Sample batch $\{x_i\}$ from $\mathcal{D}$
    for each question $x_i$ do
        Generate $G$ trajectories $\{\tau_i^{(g)}\}_{g=1}^{G} \sim \pi_\theta(\cdot \mid x_i)$
        for each trajectory $\tau_i^{(g)}$ do
            Extract reasoning steps $\{s_t\}$ from the <think>...</think> block
            Compute step embeddings $\{e_t\}$ and cluster them into nodes $\mathcal{V}$ with assignments $\{z_t\}$
            Build $\mathcal{G}(\tau_i^{(g)})$ from transitions $\{(v_{z_t}, v_{z_{t+1}}) : z_t \neq z_{t+1}\}$
            Compute $r_i^{(g)} \leftarrow \mathrm{SR}\big(\mathcal{G}(\tau_i^{(g)})\big)$
        end for
    end for
    Update $\theta \leftarrow \mathcal{A}\big(\theta, \{(\tau_i^{(g)}, r_i^{(g)})\}\big)$
end for

3.2.1 Reasoning Map Construction

Unlike corpus-level graph analyses that aggregate traces across many examples, SARL constructs a separate Reasoning Map for each generated response. This design makes the reward fully online and compatible with standard policy optimization.

Step Extraction. We first extract the content of the <think>...</think> block and split it into reasoning steps at newline boundaries. This choice provides a practical middle ground: token-level segmentation is too fine and tends to produce degenerate dense graphs, while paragraph-level segmentation is too coarse and can merge distinct operations into a single unit.

Step Embedding. Each step $s_t$ is mapped to a unit-normalized embedding $e_t \in \mathbb{R}^d$,

$$e_t = \frac{\mathcal{M}(s_t)}{\lVert \mathcal{M}(s_t) \rVert_2},$$

where $\mathcal{M}$ is a text embedding model. In practice, we compute these embeddings with a lightweight external encoder so that reward construction does not interfere with the main rollout engine. Implementation details are provided in Appendix A.1.

Latent reasoning types. Individual steps may differ in wording while still serving the same functional role. For example, two algebraic manipulations may appear lexically different but correspond to the same reasoning type.
SARL captures this structure by clustering the step embeddings $\{e_t\}_{t=1}^{T}$ into $K$ latent reasoning types,

$$\mathcal{V} = \{v_1, v_2, \ldots, v_K\}. \tag{3}$$

Each node $v_k$ corresponds to one cluster of semantically similar steps, and we use $z_t \in \{1, \ldots, K\}$ to denote the cluster assignment of step $s_t$. Clustering is performed independently for each response. We consider both KMeans and HDBSCAN for this step; the specific settings and the regimes in which each choice is more suitable are discussed in Appendix A.2.

Reasoning transitions. The edge set $\mathcal{E}$ captures transitions between reasoning types. We add an undirected edge $(v_i, v_j) \in \mathcal{E}$ whenever two consecutive steps $s_t$ and $s_{t+1}$ belong to different clusters, namely when $z_t = i \neq j = z_{t+1}$. Repeated transitions between the same pair of reasoning types contribute only one edge. We use an undirected graph because the reward is intended to measure structural organization and connectivity rather than directional control flow.

3.2.2 Structure Reward (SR)

Small-world prior. SARL is motivated by the observation that effective reasoning should balance local specialization with global integration. In graph terms, this suggests two desirable properties: high local clustering, which reflects coherent functional substructures, and short global path lengths, which reflect efficient transitions across different reasoning modes.

Let $\mathcal{N}(v_k)$ denote the neighbor set of node $v_k$, and let $\mathcal{V}_{\geq 2} = \{v_k \in \mathcal{V} : |\mathcal{N}(v_k)| \geq 2\}$ be the set of nontrivial nodes. We define the average clustering coefficient as

$$C(\mathcal{G}) = \frac{1}{|\mathcal{V}_{\geq 2}|} \sum_{v_k \in \mathcal{V}_{\geq 2}} \frac{\big|\{(v_i, v_j) \in \mathcal{E} : v_i, v_j \in \mathcal{N}(v_k)\}\big|}{|\mathcal{N}(v_k)|\,\big(|\mathcal{N}(v_k)| - 1\big)/2}, \tag{4}$$

which measures how strongly neighboring reasoning types form locally coherent modules.
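
As an illustration only (not the authors' implementation), the construction above can be sketched in plain Python. The sketch takes the cluster assignments $z_1, \ldots, z_T$ as given, since the embedding model and clustering settings live in the appendices, and builds the undirected Reasoning Map before evaluating Eq. (4). For completeness it also evaluates the path-length term and the combined reward that the remainder of this subsection defines in Eqs. (5) and (6). All function names are hypothetical. Two conventions are assumptions of this sketch: Eq. (6) is read as the average of the two terms, and a graph with no edges receives reward 0.

```python
from collections import deque
from itertools import combinations


def extract_steps(think_text):
    """Split the thinking block into non-empty steps at newline boundaries."""
    return [line.strip() for line in think_text.split("\n") if line.strip()]


def build_reasoning_map(assignments):
    """Return undirected adjacency sets built from consecutive assignments."""
    adj = {z: set() for z in assignments}
    for a, b in zip(assignments, assignments[1:]):
        if a != b:  # only transitions between different reasoning types
            adj[a].add(b)
            adj[b].add(a)  # sets collapse repeated transitions into one edge
    return adj


def clustering_coefficient(adj):
    """Eq. (4): mean local clustering over nodes with at least two neighbors."""
    scores = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        scores.append(links / (k * (k - 1) / 2))
    return sum(scores) / len(scores) if scores else 0.0


def average_path_length(adj):
    """Eq. (5): mean hop distance over ordered, reachable pairs of nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def structure_reward(adj):
    """Eq. (6), read as the average of both terms (assumption of this sketch)."""
    if not any(adj.values()):  # assumption: no transitions means zero reward
        return 0.0
    local_depth = clustering_coefficient(adj)
    global_flow = 1.0 / (1.0 + average_path_length(adj))
    return 0.5 * (local_depth + global_flow)


# Toy trace: five steps assigned to four reasoning types.
toy_map = build_reasoning_map([0, 1, 2, 0, 3])
print(round(structure_reward(toy_map), 3))  # -> 0.603
```

In the toy trace, types 0, 1, and 2 form a triangle and type 3 hangs off type 0, so $C = 7/9$ and $L = 4/3$. Under the averaged reading this gives a reward of $38/63 \approx 0.603$.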
We also define the average shortest path length over all reachable node pairs $\mathcal{P} = \{(v_i, v_j) : v_i \neq v_j,\ v_j \text{ reachable from } v_i\}$ as

$$L(\mathcal{G}) = \frac{1}{|\mathcal{P}|} \sum_{(v_i, v_j) \in \mathcal{P}} \delta(v_i, v_j), \tag{5}$$

where $\delta(v_i, v_j)$ denotes hop-count distance. Lower values of $L(\mathcal{G})$ indicate more efficient communication across distinct reasoning types.

Classical small-world indices compare a graph against a random reference graph, but they are not ideal as direct reinforcement learning rewards. In particular, the resulting score is unbounded and depends on a stochastic reference graph, which can increase reward variance and make training less stable. We therefore define the Structure Reward (SR) directly as a combination of the two desired structural properties:

$$\mathrm{SR}(\mathcal{G}) = \frac{1}{2} \Big( \underbrace{C(\mathcal{G})}_{\text{Local Depth}} + \underbrace{\frac{1}{1 + L(\mathcal{G})}}_{\text{Global Flow}} \Big). \tag{6}$$

As defined in Eq. (4) and Eq. (5), $C(\mathcal{G})$ captures local specialization, while $L(\mathcal{G})$ captures global efficiency. Together, the two terms discourage traces that collapse into a single reasoning mode or drift through long unstructured chains of transitions.

4 Experiments

4.1 Experimental Setup

Our experiments are designed to answer a key question: Can optimizing the structure of reasoning processes, rather than focusing solely on final outcomes, lead to good reasoning performance? Specifically, we investigate small-world topology as a structural prior for reasoning processes and evaluate its effectiveness on two different types of reasoning tasks: Verifiable Math Reasoning and Open-Ended Reasoning.

Models. We conduct all experiments using Qwen3-4B [29], a strong reasoning model that generates explicit chain-of-thought reasoning within <think>...</think> blocks. This reasoning format enables us to directly construct Reasoning Maps from intermediate reasoning traces, which are then used to compute the proposed Structure Reward.

Training Data.
We evaluate SARL under two distinct training regimes that cover both verifiable and non-verifiable reasoning domains.
• Verifiable reasoning (Math). We train on mathematical reasoning problems drawn from historical AIME competitions (1983–2024). These problems require multi-step symbolic reasoning and provide deterministic correctness signals, making them a standard benchmark for evaluating reasoning ability.
• Non-verifiable reasoning (Open-ended). We train on OpenRubrics-v2 [12], a large-scale preference dataset spanning diverse domains including creative writing, planning, coding, advice, and analytical reasoning. Unlike mathematical tasks, these problems lack explicit correctness labels, making them suitable for evaluating label-free training methods.
Dataset filtering and preparation details are provided in Appendix B.1.

Baselines. For the verifiable setting (Math Reasoning), we compare against two prominent label-free RL baselines: EMPO [30] and TTRL [34]. For the non-verifiable setting (Open-Ended Reasoning), we additionally compare against DPO [16] trained on preference data. TTRL is not applicable to open-ended reasoning because it requires guessing binary labels.

Training and Evaluation. All reinforcement learning experiments are implemented using the veRL framework [19]. Preference-based training with DPO is implemented using LlamaFactory [32]. Full training details, including hyperparameters and optimization settings, are provided in Appendix B.2. For evaluation, we assess mathematical reasoning performance on four benchmarks: MATH500 [6], AIME25 [2], AMC23 [8], and Minerva Math [9]. We adopt avg@8 as our primary evaluation metric, which computes the expected accuracy across multiple samples and provides a more stable assessment of model capability than single-run pass@1 scores.
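
To make the metric concrete, the following is a minimal sketch of avg@k under the assumption that it is the per-problem fraction of correct samples, averaged over problems and reported as a percentage. The function name and the input layout are hypothetical and are not taken from the evaluation code.

```python
def avg_at_k(correct):
    """Average accuracy over k samples per problem, as a percentage.

    correct[i][j] is True when sample j for problem i is judged correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems with k = 8 samples each: one always solved, one solved half the time.
results = [[True] * 8, [True, False] * 4]
print(avg_at_k(results))  # -> 75.0
```

Averaging over eight samples smooths out the sampling noise that a single pass@1 run would carry, which is the stability argument made above.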
For open-ended reasoning evaluation, we employ the WildBench Elo rating system and task-category macro scores from the WildBench leaderboard [11].

4.2 Verifiable Reasoning (Mathematical Tasks) Results

We first validate the effectiveness of Structure Reward in the well-established verifiable setting of mathematical reasoning, where RLVR methods are most commonly applied.

Structure reward closes the gap to ground-truth supervision. As shown in Table 2, across both optimization algorithms, Structure Reward achieves average performance that matches or surpasses ground-truth RL despite requiring no correctness labels. Under GRPO, Structure Reward achieves the best average performance among all label-free methods, with an average improvement of +7.65 points, outperforming EMPO (+6.94) and TTRL (+6.61), and even exceeding ground-truth RL (+7.15). Under PPO, Structure Reward achieves an improvement of +5.87 points, again surpassing ground-truth PPO training (+5.67). These results indicate that structural signals alone can provide sufficiently strong learning guidance to rival, and in some cases exceed, correctness-based supervision. Notably, unlike EMPO and TTRL, which rely on group-level optimization and are restricted to GRPO-style training, SARL generalizes across both PPO and GRPO frameworks. This algorithm-agnostic compatibility highlights its practical applicability across diverse reinforcement learning pipelines.

| Method | AIME25 | AMC23 | MATH500 | Minerva | Avg (Δ) |
|---|---|---|---|---|---|
| Base | 31.67 | 82.81 | 90.10 | 53.45 | 64.51 |
| Label-free RL | | | | | |
| PPO w/ Structure Reward (Ours) | 42.92 | 85.00 | 92.53 | 61.08 | 70.38 (+5.87) |
| Ground-Truth RL (Oracle) | | | | | |
| PPO w/ Ground-Truth | 41.67 | 86.56 | 92.75 | 59.74 | 70.18 (+5.67) |
| Label-free RL | | | | | |
| GRPO w/ Entropy Minimization (EMPO) | 44.58 | 86.56 | 93.23 | 61.44 | 71.45 (+6.94) |
| GRPO w/ Majority Voting (TTRL) | 42.91 | 86.56 | 93.15 | 61.86 | 71.12 (+6.61) |
| GRPO w/ Structure Reward (Ours) | 45.83 | 87.50 | 93.30 | 61.99 | 72.16 (+7.65) |
| Ground-Truth RL (Oracle) | | | | | |
| GRPO w/ Ground-Truth | 46.67 | 84.38 | 93.15 | 62.45 | 71.66 (+7.15) |

Table 2: Results on mathematical reasoning benchmarks. EMPO and TTRL rely on group-level optimization signals and are therefore only applicable to GRPO-style training, while Structure Reward is compatible with both PPO and GRPO.

The largest gains from Structure Reward training occur on AIME25, the most challenging benchmark, which requires long multi-step mathematical derivations. Under GRPO, performance improves from 31.67 to 45.83, a gain of +14.16 points (+45% relative improvement). Such improvements are substantially larger than those observed on easier benchmarks such as MATH500 and AMC23, suggesting that structural constraints become increasingly beneficial as reasoning becomes more complex. This pattern is consistent with the small-world structural prior: complex problems require coordinated multi-phase reasoning, where local coherence and global connectivity must be balanced across long reasoning chains.

4.3 Non-Verifiable Reasoning (Open-Ended Tasks) Results

A core motivation of our work is to enable reinforcement learning beyond verifiable domains using label-free structural rewards. To test this, we apply SARL on OpenRubrics-v2 [12], a diverse open-ended QA dataset.

| Method | Creative | Planning | Math | Info | Code | WB Score (Δ) |
|---|---|---|---|---|---|---|
| Base | 51.01 | 36.23 | 16.35 | 48.71 | 14.72 | 29.91 |
| DPO | 51.16 | 37.82 | 17.06 | 48.37 | 14.34 | 30.34 (+0.43) |
| PPO w/ Structure Reward (Ours) | 57.05 | 45.87 | 27.70 | 53.47 | 30.05 | 40.26 (+10.35) |
| GRPO w/ Entropy Minimization (EMPO) | 51.01 | 36.11 | 17.62 | 45.69 | 12.92 | 29.20 (-0.71) |
| GRPO w/ Structure Reward (Ours) | 55.34 | 43.56 | 26.83 | 51.98 | 29.91 | 39.01 (+9.10) |

Table 3: WildBench evaluation results for open-ended reasoning training. Numbers in parentheses indicate absolute improvement over the base model.

Structure reward enables reinforcement learning beyond verifiable domains.
The most striking observation is that Structure Reward achieves substantial performance gains despite receiving no labels or preference signals. Under PPO, the WB Score improves from 29.91 to 40.26 (+10.35), while under GRPO it improves to 39.01 (+9.10). These improvements are substantially larger than those obtained by alternative training strategies, indicating that structural optimization alone can provide a sufficiently strong learning signal for open-ended reasoning.

Performance improvements are broad across diverse reasoning tasks. Structure Reward improves performance across all five task categories under both PPO and GRPO. Beyond notable improvements on Math and Coding tasks, which we have established benefit from structured reasoning, the most significant finding is the substantial gains in truly open-ended domains: Creative writing (+6.04 for PPO, +4.33 for GRPO), Planning (+9.64 and +7.33), and Info/Advice (+4.76 and +3.27). These domains are less formally structured and require flexible reasoning and knowledge synthesis. The consistent gains across such diverse categories indicate that the structural prior does not merely benefit mathematical reasoning, but generalizes to a wide range of open-ended cognitive tasks.

Comparison with applicable baselines in the open-ended setting. Our primary comparison focuses on methods that are genuinely label-free. Among these, TTRL [34] requires voting on binary labels, and format-based label-free rewards [28] only define math-specific output formats, so neither generalizes to open-ended tasks. We therefore compare primarily against EMPO [30], which can be extended to open-ended tasks by clustering multiple sampled responses in semantic space and using the resulting semantic entropy as the reward signal. We additionally report DPO [16] as a reference with stronger supervision, since it uses preference labels.
Even with this extra supervision, DPO yields only a marginal improvement (+0.43 WB Score), while EMPO slightly decreases performance (-0.71). In contrast, Structure Reward consistently delivers large gains, suggesting that the structural prior provides a more general and effective signal for open-ended tasks.

4.4 Analysis

Figure 2: Training dynamics of different methods (reward signals) under GRPO.

Training dynamics analysis. Figure 2 reveals a clear pattern: SARL achieves the strongest performance while inducing the smallest policy drift. Its KL divergence remains near zero throughout training, whereas TTRL and EMPO exhibit steadily increasing KL, indicating much larger deviation from the reference policy. This behavior reflects a common failure mode of label-free RL methods: excessive policy drift can destabilize optimization and eventually lead to model collapse.

At the same time, SARL maintains high policy entropy over most of training, indicating sustained exploration rather than early collapse to narrow reasoning patterns. In contrast, training directly on ground-truth rewards often drives entropy down quickly, suggesting premature exploitation that may hurt generalization. The central insight is that SARL improves reasoning through a stable optimization regime that combines minimal policy shift with relatively high exploration.

| Setting | Method | Avg Tokens | Δ vs. Base |
|---|---|---|---|
| Verifiable reasoning (Math) | Base | 5736.6 | 0.0 |
| Verifiable reasoning (Math) | PPO w/ Structure Reward | 5075.6 | -661.0 |
| Verifiable reasoning (Math) | GRPO w/ Structure Reward | 4587.8 | -1148.8 |
| Non-verifiable reasoning (Open-ended) | Base | 2677.91 | 0.00 |
| Non-verifiable reasoning (Open-ended) | PPO w/ Structure Reward | 2303.17 | -374.74 |
| Non-verifiable reasoning (Open-ended) | GRPO w/ Structure Reward | 2285.87 | -392.04 |

Table 4: Average response length comparison.

Length analysis. One possible concern is that the gains from Structure Reward may come simply from generating longer responses. Table 4 shows that this is not the case.
Structure Reward substantially reduces average response length in both open-ended and mathematical reasoning settings while achieving much higher performance. These results suggest that the benefit comes from producing more efficient and better organized reasoning traces rather than simply extending length. Full length comparisons against other baselines are provided in Appendix B.3.

5 Limitations

SARL is most effective for tasks that involve sufficiently long and complex reasoning processes. For very straightforward or simple tasks with only a few steps, the benefit of enforcing structural organization may be limited, as such tasks do not require complex reasoning. In addition, although the small-world structure proves effective across the benchmarks considered in this work, it may not always be the optimal structural prior for every domain. Different tasks may favor distinct reasoning structures, and more fine-grained or domain-specific structural priors could further improve performance. We emphasize that structure-aware training should be viewed as a general framework rather than a fixed design. Our results suggest that structural priors offer a promising direction for reinforcement learning in reasoning tasks, while leaving substantial room for future work to explore richer and more specialized structural priors tailored to different problem domains.

6 Conclusion

In this work, we introduced structure-aware reinforcement learning (SARL), a label-free framework that improves reasoning by rewarding the organization of the reasoning process rather than the correctness of final answers. By constructing Reasoning Maps and optimizing their small-world topology, SARL provides a scalable training signal that applies to both verifiable and non-verifiable tasks.
Empirically, SARL matches or surpasses ground-truth-based RL on mathematical reasoning benchmarks, outperforms existing label-free baselines such as EMPO and TTRL, and delivers large gains on open-ended WildBench evaluation without using correctness labels or preference supervision. Our analysis further shows that these improvements are not driven by longer outputs, but by a more stable optimization regime with minimal policy drift and sustained exploration. Taken together, these results suggest that reasoning structure itself is a meaningful and general learning signal, opening a path toward reinforcement learning beyond the bottleneck of verifiability.

References

[1] Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint, 2025.
[2] Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/.
[3] Danielle S Bassett and Edward T Bullmore. Small-world brain networks revisited. The Neuroscientist, 23(5):499–516, 2017.
[4] Yunzhen Feng, Julia Kempe, Cheng Zhang, Parag Jain, and Anthony Hartshorn. What characterizes effective reasoning? revisiting length, review, and structure of cot. arXiv preprint, 2025.
[5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint, 2025.
[6] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[7] Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025.
[8] knoveleng. Amc-23. https://huggingface.co/datasets/knoveleng/AMC-23, 2025.
[9] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857, 2022.
[10] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[11] Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint, 2024.
[12] Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025.
[13] Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080, 2023.
[14] Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Topology of reasoning: Understanding large reasoning models through reasoning graph properties. arXiv preprint arXiv:2506.05744, 2025.
[15] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[16] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.
[17] Caio Seguin, Martijn P Van Den Heuvel, and Andrew Zalesky. Navigation of brain networks. Proceedings of the National Academy of Sciences, 115(24):6297–6302, 2018.
[18] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint, 2024.
[19] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[20] Olaf Sporns. Networks of the Brain. MIT Press, 2016.
[21] Xue Wen Tan, Nathaniel Tan, Galen Lee, and Stanley Kok. The shape of reasoning: Topological analysis of reasoning traces in large language models. arXiv preprint, 2025.
[22] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl.
[23] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024.
[24] Rong Wang, Mianxin Liu, Xinhong Cheng, Ying Wu, Andrea Hildebrandt, and Changsong Zhou. Segregation, integration, and balance of large-scale resting brain networks configure different cognitive abilities. Proceedings of the National Academy of Sciences, 118(23):e2022288118, 2021.
[25] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[26] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[27] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[28] Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, and Bingning Wang. Surrogate signals from format and length: Reinforcement learning for solving mathematical problems without ground truth answers. arXiv preprint, 2025.
[29] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint, 2025.
[30] Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.
[31] Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. arXiv preprint, 2025.
[32] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics. URL http://arxiv.org/abs/2403.13372.
[33] Yufa Zhou, Yixiao Wang, Xunjian Yin, Shuyan Zhou, and Anru R Zhang. The geometry of reasoning: Flowing logics in representation space. arXiv preprint, 2025.
[34] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.

A Implementation Details

A.1 Step embedding extraction

We provide two practical implementations for computing step embeddings $e_t$ used in §3.

Implementation A (directly from the policy model hidden states during training). Given a generated trajectory $\tau = (s_0, \ldots, s_T)$, we run the policy model itself on the full sequence (or on per-step chunks) and extract the token hidden states from layer $\ell$. Thus, the step embedding can be obtained directly from the policy model, without requiring a separate embedding model. Let $L_t$ denote the number of tokens in step $s_t$. If $h^{\ell}_{t,j}$ is the hidden state of the $j$-th token in step $s_t$ at layer $\ell$, we define

\[ e_t = \frac{1}{L_t} \sum_{j=1}^{L_t} h^{\ell}_{t,j}. \]

This implementation is straightforward during training, when the policy model's hidden states are accessible, for example using HuggingFace TRL library [22] for RL training.

Implementation B (serving-style rollout in RL using vLLM / sglang). In many RL pipelines, rollouts are generated by high-throughput inference engines (e.g., vLLM or sglang). In practice, these serving stacks may not expose the current policy model's intermediate hidden states. In that case, we can compute step embeddings by running a separate text embedding model on each step string (after grouping tokens into steps using the newline delimiter) and using the resulting sentence embedding as $e_t$. In our experiments we use Qwen/Qwen3-Embedding-0.6B as this separate embedding model.

Remarks.
Both implementations produce similar performance. Implementation B is preferred for efficiency and is normally much faster than Implementation A, since it can use high-throughput inference engines (e.g., vLLM or sglang) to generate rollouts.

A.2 Clustering details for functional nodes

We consider two practical choices for clustering step embeddings into latent reasoning types: KMeans and HDBSCAN.

KMeans. KMeans is well suited to settings where the number of reasoning steps is not very large and the step structure is relatively regular. In our experiments, we use KMeans for the mathematical reasoning setting, where trajectories often follow a more regular progression such as setup, derivation, verification, and answer extraction, and we find KMeans empirically stable and reproducible across runs. Since KMeans requires specifying the number of clusters, we use $k \approx \sqrt{M}$, where $M$ is the number of reasoning steps in the current response.

HDBSCAN. HDBSCAN is attractive when the step structure is more heterogeneous because it can infer the number of clusters automatically and can better capture non-spherical cluster geometry. In our experiments, we use HDBSCAN for the open-ended setting, where the reasoning patterns are more diverse across prompts. For HDBSCAN, we use Euclidean distance (metric='euclidean'), i.e., $d(e, e') = \lVert e - e' \rVert_2$. We set min_cluster_size = max(2, min(5, M/4)) and min_samples = min_cluster_size − 1, where M is the total number of step embeddings. Noise points (assigned label −1) are either grouped into a separate cluster or, if all points are noise, assigned to a default cluster.

B Experimental Details

B.1 Data Preparation

AIME Historical Problems (Math Setting). We use competition problems from the American Invitational Mathematics Examination (AIME) spanning 1983 to 2024, sourced from the publicly available AIME Problems 1983–2024 dataset on Kaggle.¹ This dataset contains around 1,000 problems in total.
Each problem is formatted as a single-turn prompt with the instruction “Let’s think step by step and output the final answer within \boxed{}.” appended to guide the model to solve the problem and box the final answer.

OpenRubrics-v2 (Open-Ended Setting). We use the OpenRubrics/OpenRubric-v2 dataset from HuggingFace,² which contains preference pairs across diverse task categories including creative writing, planning, coding, and advice. We use only the instruction field as the training prompt and apply the following filtering pipeline to select high-quality prompts. First, we perform response length filtering: since this dataset contains reference responses, we use response length as a proxy to identify questions that elicit sufficiently complex reasoning. We require both reference responses to fall within the token range [512, 4096]. Responses shorter than 512 tokens likely correspond to trivial questions that do not require deep reasoning, while responses longer than 4096 tokens are excluded to keep training efficient. This filtering selects prompts that are likely to generate appropriately complex reasoning traces. Second, we deduplicate instructions (identified by hash) to prevent duplicate training samples. After filtering, we retain around 3,000 samples.

B.2 Training Configuration

All training runs are implemented mainly with the verl library [19], though we also have a TRL [22] version. All experiments are conducted on a single node equipped with 8 NVIDIA A100 GPUs using Fully Sharded Data Parallel (FSDP2) for model parallelism. The Structure Reward requires step embeddings computed from each rollout's block. To avoid interfering with the main policy rollout (which runs under vLLM), we launch a standalone embedding server using vllm.entrypoints.openai.api_server with the Qwen/Qwen3-Embedding-0.6B model on a dedicated port, consuming only 5% of one GPU's memory.
The reward function queries this server via HTTP for each generated trajectory. Tables 5 and 6 summarize the training hyperparameters used in our PPO and GRPO experiments, respectively. Across both algorithms, the Math and Open-Ended settings mainly differ in the training data, maximum response length, total epochs, and clustering method.

¹ https://www.kaggle.com/datasets/tourist800/aime-problems-1983-to-2024
² https://huggingface.co/datasets/OpenRubrics/OpenRubric-v2

Hyperparameter                    Math Setting                  Open-Ended Setting
RL algorithm                      PPO                           PPO
Advantage estimator               GAE                           GAE
Training data                     AIME 1983–2024                OpenRubrics-v2
Base model                        Qwen/Qwen3-4B                 Qwen/Qwen3-4B
Rollout engine                    vLLM                          vLLM
Rollouts per prompt (G)           8                             8
Train batch size                  256                           256
Policy mini batch size            256                           256
Policy micro batch size / GPU     32                            32
Critic micro batch size / GPU     32                            32
Actor learning rate               1 × 10^-6                     1 × 10^-6
Critic learning rate              1 × 10^-5                     1 × 10^-5
Max prompt length                 1,024 tokens                  1,024 tokens
Max response length               8,192 tokens                  4,096 tokens
Filter overlong prompts           True                          True
Truncation mode                   error                         error
Data shuffle                      False                         False
KL loss coefficient               0.001                         0.001
KL loss type                      low_var_kl                    low_var_kl
Entropy coefficient               0                             0
Use KL in reward                  False                         False
Critic warmup                     0                             0
Total epochs                      15                            5
Save frequency                    20                            20
Test frequency                    5                             5
Parallelism strategy              FSDP2                         FSDP2
Dynamic batch sizing              True                          True
Gradient checkpointing            True                          True
Activation offload                True                          True
Rollout GPU memory utilization    0.7                           0.7
Embedding model                   Qwen/Qwen3-Embedding-0.6B     Qwen/Qwen3-Embedding-0.6B
Embedding server                  standalone vLLM API server    standalone vLLM API server
Clustering method                 KMeans                        HDBSCAN

Table 5: PPO training hyperparameters for the Math and Open-Ended settings.
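The embedding-server query described above, together with the clustering rules from Appendix A.2, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released implementation: the endpoint URL and port, all helper names, the rounding used for k ≈ √M, and the integer division used for M/4 are assumptions. Only the model name, the newline step delimiter, and the parameter formulas come from the text. The request shape follows the OpenAI-compatible embeddings route served by the vLLM API server.

```python
import json
import math
import urllib.request

# Hypothetical endpoint: the text only says the server runs "on a dedicated port".
EMBED_URL = "http://localhost:8001/v1/embeddings"
EMBED_MODEL = "Qwen/Qwen3-Embedding-0.6B"


def split_steps(response_text):
    """Group a response into reasoning steps using the newline delimiter."""
    return [line.strip() for line in response_text.split("\n") if line.strip()]


def build_embedding_request(steps, model=EMBED_MODEL):
    """Build the JSON payload for an OpenAI-compatible embeddings endpoint."""
    return {"model": model, "input": steps}


def parse_embedding_response(body):
    """Return one embedding per step, ordered by each item's 'index' field."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_steps(steps, url=EMBED_URL, timeout=30.0):
    """POST the step strings to the embedding server (requires a running server)."""
    payload = json.dumps(build_embedding_request(steps)).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_embedding_response(json.load(reply))


def kmeans_num_clusters(num_steps):
    """k ~ sqrt(M). Rounding to nearest and the floor of 1 are assumptions."""
    return max(1, round(math.sqrt(num_steps)))


def hdbscan_params(num_steps):
    """min_cluster_size = max(2, min(5, M/4)); min_samples = min_cluster_size - 1.

    Integer division for M/4 is an assumption; the text does not say how the
    ratio is rounded.
    """
    min_cluster_size = max(2, min(5, num_steps // 4))
    return {
        "metric": "euclidean",
        "min_cluster_size": min_cluster_size,
        "min_samples": min_cluster_size - 1,
    }
```

A reward function along these lines would call split_steps on a rollout, pass the result to embed_steps, and then choose kmeans_num_clusters (Math setting) or hdbscan_params (Open-Ended setting) based on the number of steps returned.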
Hyperparameter                    Math Setting                  Open-Ended Setting
RL algorithm                      GRPO                          GRPO
Training data                     AIME 1983–2024                OpenRubrics-v2
Base model                        Qwen/Qwen3-4B                 Qwen/Qwen3-4B
Rollout engine                    vLLM                          vLLM
Rollouts per prompt (G)           8                             8
Train batch size                  256                           256
Policy mini batch size            256                           256
Policy micro batch size / GPU     32                            32
Actor learning rate               1 × 10^-6                     1 × 10^-6
Max prompt length                 1,024 tokens                  1,024 tokens
Max response length               8,192 tokens                  4,096 tokens
Filter overlong prompts           True                          True
Truncation mode                   error                         error
Data shuffle                      False                         False
KL loss coefficient               0.001                         0.001
KL loss type                      low_var_kl                    low_var_kl
Entropy coefficient               0                             0
Use KL in reward                  False                         False
Critic warmup                     0                             0
Total epochs                      15                            5
Save frequency                    20                            20
Test frequency                    5                             5
Parallelism strategy              FSDP2                         FSDP2
Dynamic batch sizing              True                          True
Gradient checkpointing            True                          True
Activation offload                True                          True
Rollout GPU memory utilization    0.7                           0.7
Embedding model                   Qwen/Qwen3-Embedding-0.6B     Qwen/Qwen3-Embedding-0.6B
Embedding server                  standalone vLLM API server    standalone vLLM API server
Clustering method                 KMeans                        HDBSCAN

Table 6: GRPO training hyperparameters for the Math and Open-Ended settings.

B.3 Additional Length Analysis

For completeness, we report the full response length comparisons against all available baselines for both the open-ended and math settings.

Method                        Avg Tokens    Δ vs. Base
Base                          2677.91       0.00
DPO                           2661.46       -16.45
EMPO                          2731.92       +54.01
PPO w/ Structure Reward       2303.17       -374.74
GRPO w/ Structure Reward      2285.87       -392.04

Table 7: Full average response length comparison on WildBench.

Method                                  AIME25    AMC23     MATH500   Minerva   Avg       Δ vs. Base
Base                                    7728.0    5884.1    4182.6    5151.7    5736.6    0.0
PPO w/ Ground-Truth                     7047.7    4634.7    3124.9    4021.7    4707.3    -1029.3
PPO w/ Structure Reward                 7296.6    5069.4    3451.1    4485.4    5075.6    -661.0
GRPO w/ Ground-Truth                    6477.7    3956.6    2559.0    3143.8    4034.3    -1702.3
GRPO w/ Entropy Minimization (EMPO)     6651.5    4058.6    2702.4    3327.9    4185.1    -1551.5
GRPO w/ Majority Voting (TTRL)          6635.0    4057.8    2663.5    3258.5    4153.7    -1582.9
GRPO w/ Structure Reward                6984.9    4506.9    3029.4    3829.9    4587.8    -1148.8

Table 8: Full average response length comparison on mathematical reasoning benchmarks.
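The Avg and Δ vs. Base columns of Table 8 can be re-derived from the per-benchmark columns. The sketch below assumes that Avg is the unweighted mean of the four benchmark lengths; the text does not state this explicitly, but the reported values agree with it to within rounding. The function name length_summary is hypothetical.

```python
# Per-benchmark average response lengths (tokens), copied from Table 8.
# Column order: AIME25, AMC23, MATH500, Minerva.
MATH_LENGTHS = {
    "Base": [7728.0, 5884.1, 4182.6, 5151.7],
    "PPO w/ Ground-Truth": [7047.7, 4634.7, 3124.9, 4021.7],
    "PPO w/ Structure Reward": [7296.6, 5069.4, 3451.1, 4485.4],
    "GRPO w/ Ground-Truth": [6477.7, 3956.6, 2559.0, 3143.8],
    "GRPO w/ Entropy Minimization (EMPO)": [6651.5, 4058.6, 2702.4, 3327.9],
    "GRPO w/ Majority Voting (TTRL)": [6635.0, 4057.8, 2663.5, 3258.5],
    "GRPO w/ Structure Reward": [6984.9, 4506.9, 3029.4, 3829.9],
}


def length_summary(lengths, baseline="Base"):
    """Map each method to (mean length, mean length minus the baseline mean)."""
    means = {name: sum(vals) / len(vals) for name, vals in lengths.items()}
    base_mean = means[baseline]
    return {name: (mean, mean - base_mean) for name, mean in means.items()}


summary = length_summary(MATH_LENGTHS)
for name, (mean, delta) in summary.items():
    # Assumed interpretation: unweighted mean over the four benchmarks.
    print(f"{name:<38} avg={mean:8.1f}  delta={delta:+9.1f}")
```

Under that assumption, GRPO with Structure Reward averages roughly 1,149 fewer tokens than the base model, a reduction of about 20%, while the ground-truth and other label-free GRPO baselines shorten responses further.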
