Reasoning as Energy Minimization over Structured Latent Trajectories

David K. Johansson*

Abstract

Single-shot neural decoders commit to answers without iterative refinement; chain-of-thought methods refine over discrete token sequences but lack a scalar measure of reasoning progress. Energy-Based Reasoning via Structured Latent Planning (EBRM) models reasoning as gradient-based optimization of a multi-step latent trajectory z_{1:T} under a learned energy function E(h_x, z). The energy decomposes into per-step compatibility, pairwise transition consistency, and trajectory smoothness terms. Training splits into supervised encoder-decoder learning and contrastive energy shaping with hard negatives. At inference, gradient descent or Langevin dynamics minimize energy over z; the decoder maps z_T to the answer. We identify a critical failure mode: on CNF logic satisfaction, planning degrades accuracy from ≈95% to ≈56% because the decoder is trained only on encoder outputs h_x but evaluated on planner outputs z_T, which drift into unseen latent regions. We diagnose this via per-step decoding, latent-drift tracking, and gradient decomposition, then propose two fixes, dual-path decoder training and latent anchoring, that address the distribution mismatch. We design a six-set ablation protocol (component contribution, trajectory length, planner dynamics, initialization, decoder training distribution, anchor weight) and present diagnostic experiments across three tasks. On graph shortest-path, energy descends monotonically and trajectories show structured PCA geometry. On arithmetic, the energy surface is flat (r = 0.073), constituting a documented negative result. Code: https://github.com/dkjo8/ebr-via-structured-latent-planning.

1 Introduction

Single-shot decoders map problem encodings to answers in one pass. Errors in the encoding propagate without correction.
Chain-of-thought prompting [1] adds intermediate token-level steps, improving accuracy on multi-step tasks, but the resulting traces are discrete, high-dimensional, and lack a scalar signal indicating whether reasoning is improving [2, 3].

*Polished Snow Inc. Correspondence to: David K. Johansson. Preprint, March 2026.

Figure 1: EBRM overview. Encode problem x to context h_x; minimize E(h_x, z) over latent trajectory z_{1:T} via gradient descent or Langevin dynamics; decode z_T to answer ŷ.

EBRM replaces token-level iteration with gradient-based optimization in continuous latent space. An encoder maps problem x to context h_x; a structured trajectory z_{1:T} ∈ R^{d×T} is optimized to minimize a learned energy E(h_x, z) [4, 5]; the decoder reads z_T and produces the answer. Energy decreases during optimization, providing a built-in progress measure. The energy function decomposes into per-step, transition, and smoothness terms, each computed by a separate network (Section 3). Figure 1 shows the pipeline.

Three tasks instantiate this setup. In graph shortest-path, the input is a weighted graph with source and sink; the target is binary node membership on a shortest path. In arithmetic expression evaluation, the input is an expression tree such as (3 + 7) × 2; the target is the scalar result. In CNF logic satisfaction, the input is a Boolean formula; the target is a satisfying variable assignment. Each task uses a task-specific encoder and decoder; the energy model and planner are shared.

Contributions. (C1) A latent trajectory representation z_{1:T} scored by a decomposable energy function (per-step, transition, smoothness). (C2) A gradient-based planner that minimizes E(h_x, z) with encoder-seeded initialization, optional Langevin noise, and latent anchoring.
(C3) A split training procedure: supervised encoder-decoder loss (with optional dual-path training on planner outputs) plus contrastive energy loss with hard negatives. (C4) Root-cause analysis of the planning-degradation failure mode, identifying encoder-decoder distribution mismatch as the primary cause. (C5) A six-set ablation protocol and diagnostic analysis (per-step decoding, latent drift, gradient decomposition, energy-accuracy correlation). (C6) Empirical results on three tasks with diagnostic figures and baselines.

2 Related Work

Energy-based models and latent-variable models. EBMs assign a scalar energy to variable configurations and perform inference by energy minimization [4]. They avoid normalization requirements, allowing flexible architecture design [6]. Latent EBMs learn a data-dependent prior over a latent vector, with posterior sampling via Langevin Monte Carlo [5]. Recent extensions include diffusion-assisted training [7] and structured univariate priors [8]. All of these operate on unstructured latent vectors. EBRM structures the latent space as a multi-step trajectory and decomposes energy into per-step, transition, and smoothness terms.

Iterative and multi-step reasoning. Chain-of-thought prompting [1] elicits intermediate steps as token sequences but produces traces that are discrete and hard to optimize over. Kong et al. [2] separate latent thought vectors from token generation and refine them via Gibbs-style inference. Wang et al. [3] optimize token logits using gradient signals from a reward model. Kong et al. [9] scale inference-time computation through variational Bayes over latent thoughts. EBRM differs in two ways: reasoning is a trajectory z_{1:T} rather than a single vector, and a decomposable energy function scores each step of the trajectory.

Planning and latent optimization. Janner et al. [10] cast planning as diffusion-based trajectory sampling with gradient conditioning on rewards.
Chen et al. [11] extend this to latent action spaces. Both target control and generation. EBRM applies latent trajectory optimization to reasoning tasks using contrastive energy training rather than denoising scores. The trajectory is fixed-length and encoder-seeded, not noise-initialized.

3 Method

3.1 Overview. EBRM has five components:

1. Encoder: h_x = enc(x) ∈ R^d.
2. Latent trajectory: z = [z_1, ..., z_T], z_t ∈ R^d, stored as a d × T matrix.
3. Energy model: E(h_x, z) ∈ R; lower energy means higher trajectory plausibility.
4. Planner: minimizes E(h_x, z) over z by gradient descent, with model parameters fixed.
5. Decoder: ŷ = dec(z_T).

The encoder and decoder are trained with supervised losses. The energy model is trained with contrastive losses. Inference modifies only z.

3.2 Energy decomposition. The energy function decomposes into three terms aggregated by a learned global scorer:

E(h_x, z) = f_{\mathrm{global}}\big(\bar{s}_{\mathrm{step}}, \bar{s}_{\mathrm{trans}}, \lambda_{\mathrm{smooth}}\big)   (1)

where f_global is a two-layer MLP mapping three scalars to one energy value.

Per-step score. A shared MLP s_θ scores each latent state against the problem context:

\bar{s}_{\mathrm{step}} = \frac{1}{T} \sum_{t=1}^{T} s_\theta([h_x; z_t])   (2)

where [·;·] denotes concatenation.

Transition score. A separate MLP s_φ scores adjacent pairs:

\bar{s}_{\mathrm{trans}} = \frac{1}{T-1} \sum_{t=1}^{T-1} s_\phi([z_t; z_{t+1}])   (3)

Smoothness. A parameter-free term penalizes large jumps:

\lambda_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert z_{t+1} - z_t \rVert^2   (4)

The three terms enforce step-level relevance (Eq. 2), pairwise consistency (Eq. 3), and trajectory regularity (Eq. 4).

3.3 Latent planning. At inference, z is optimized to minimize E(h_x, z) with model parameters fixed. Initialization sets z_1 to the first d components of h_x and samples z_{2:T} from N(0, σ²I) with small σ.
The update rule is

z \leftarrow z - \eta \nabla_z E(h_x, z) + \sqrt{2\eta}\,\sigma_{\mathrm{noise}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)   (5)

Setting σ_noise = 0 recovers gradient descent; σ_noise > 0 adds Langevin exploration. Gradients are clipped by norm. The planner runs for K steps and returns z*; the decoder produces ŷ = dec(z*_T).

Latent anchoring. An optional quadratic penalty λ_anchor ‖z − h_x‖² is added to the planning objective, so its gradient 2λ_anchor(z − h_x) enters the update, preventing the trajectory from drifting far from the encoder's output distribution. This addresses the distribution mismatch identified in Section 6.

3.4 Training. Two parameter groups receive separate gradients.

Encoder-decoder (supervised). Minimizes a task-specific loss on the decoder output. In the default mode, the decoder is trained on the encoder output h_x directly:

\mathcal{L}_{\mathrm{dec}} = \ell(\mathrm{dec}(h_x), y)   (6)

where ℓ is binary cross-entropy (graph, logic) or mean squared error (arithmetic).

Dual-path decoder training. To address the distribution mismatch between encoder outputs and planner outputs (Section 6), an optional dual-path mode trains the decoder on both h_x and the planner's z*_T:

\mathcal{L}^{\mathrm{dual}}_{\mathrm{dec}} = \tfrac{1}{2}\,\ell(\mathrm{dec}(h_x), y) + \tfrac{1}{2}\,\ell(\mathrm{dec}(z^*_T), y)   (7)

This ensures the decoder can handle inputs from both the encoder and the planner.

Energy model (contrastive). A hinge loss pushes positive (teacher) energy below negative (perturbed or planned) energy:

\mathcal{L}_{\mathrm{contr}} = \max\big(0,\; E(h_x, z^+) - E(h_x, z^-) + m\big)   (8)

where z^+ is the teacher trajectory, z^- is a hard negative (planner output or perturbed z^+), and m is the margin.

Smoothness regularizer.

\mathcal{L}_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert z^+_{t+1} - z^+_t \rVert^2   (9)

Combined objective. The total loss is a weighted sum:

\mathcal{L} = \alpha_{\mathrm{dec}}\mathcal{L}_{\mathrm{dec}} + \alpha_{\mathrm{contr}}\mathcal{L}_{\mathrm{contr}} + \alpha_{\mathrm{smooth}}\mathcal{L}_{\mathrm{smooth}}   (10)

Encoder-decoder parameters receive gradients from L_dec + α_smooth L_smooth. Energy-model parameters receive gradients from L_contr only. Isolating the energy gradients prevents the energy model from collapsing to trivially low values on all trajectories.
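The decomposed energy (Eqs. 1-4) and the planner update (Eq. 5) can be sketched in a few dozen lines. This is an illustrative NumPy sketch, not the released implementation (the codebase is Julia), and every name in it is an assumption: the learned scorers s_θ, s_φ, and f_global are replaced by fixed quadratic stand-ins so the energy is cheap to evaluate, and autodiff is replaced by finite differences.

```python
# Toy sketch of EBRM's decomposed energy and gradient/Langevin planner.
# All networks are replaced by fixed quadratic stand-ins for exposition.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 4  # toy sizes; the paper's defaults are d = 64, T = 8

w_step = rng.normal(size=2 * d)   # stand-in for s_theta on [h_x; z_t]
w_trans = rng.normal(size=2 * d)  # stand-in for s_phi on [z_t; z_{t+1}]

def energy(h_x, z):
    """E(h_x, z): mean per-step score + mean transition score + smoothness.
    f_global is a plain sum here instead of a learned two-layer MLP."""
    s_step = np.mean([(w_step @ np.concatenate([h_x, z[:, t]])) ** 2
                      for t in range(T)])
    s_trans = np.mean([(w_trans @ np.concatenate([z[:, t], z[:, t + 1]])) ** 2
                       for t in range(T - 1)])
    smooth = np.mean([np.sum((z[:, t + 1] - z[:, t]) ** 2)
                      for t in range(T - 1)])
    return s_step + s_trans + smooth

def grad_z(h_x, z, eps=1e-4):
    """Finite-difference gradient of E w.r.t. z (stand-in for autodiff)."""
    g = np.zeros_like(z)
    for idx in np.ndindex(*z.shape):
        zp, zm = z.copy(), z.copy()
        zp[idx] += eps
        zm[idx] -= eps
        g[idx] = (energy(h_x, zp) - energy(h_x, zm)) / (2 * eps)
    return g

def plan(h_x, z, K=50, eta=0.01, sigma_noise=0.0, lam_anchor=0.0, clip=1.0):
    """Eq. 5: sigma_noise = 0 is gradient descent, > 0 adds Langevin noise.
    lam_anchor > 0 adds the latent-anchoring penalty gradient."""
    z = z.copy()
    anchor = np.tile(h_x[:, None], (1, T))  # anchor every step to h_x
    for _ in range(K):
        g = grad_z(h_x, z) + 2.0 * lam_anchor * (z - anchor)
        norm = np.linalg.norm(g)
        if norm > clip:                      # gradient clipping by norm
            g *= clip / norm
        noise = rng.normal(size=z.shape)
        z = z - eta * g + np.sqrt(2.0 * eta) * sigma_noise * noise
    return z

# Encoder-seeded initialization: z_1 = h_x, z_{2:T} ~ N(0, 0.01 I).
h_x = rng.normal(size=d)
z0 = np.zeros((d, T))
z0[:, 0] = h_x
z0[:, 1:] = 0.1 * rng.normal(size=(d, T - 1))

e0 = energy(h_x, z0)
z_star = plan(h_x, z0, K=40)
print(f"energy: {e0:.3f} -> {energy(h_x, z_star):.3f}")
```

Because the stand-in energy is a convex quadratic, the clipped descent steps reduce it monotonically, mirroring the descent curves reported for the graph and logic tasks; in the paper the same loop runs against learned MLP scorers with autodiff gradients.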
4 Tasks

All three tasks use procedurally generated data with known ground-truth solutions. Each task has a task-specific encoder and decoder; the energy model architecture and training procedure (Section 3) are shared.

4.1 Graph shortest-path. Random weighted directed graphs with n ∈ [8, 20] nodes and edge probability 0.3. Target: binary label per node indicating membership on a Dijkstra shortest path between designated source and destination [12]. Encoder: two-layer MLP on concatenated node features, flattened adjacency, and one-hot source/destination indicators, producing h_x ∈ R^d. Decoder: two-layer MLP with sigmoid, one output per node. Loss: binary cross-entropy. Metric: node-level accuracy.

Figure 2: Direct vs. planner endpoint performance across tasks. Planning degrades logic accuracy from ≈95% to ≈56%, motivating the failure analysis in Section 6.

4.2 Arithmetic expression evaluation. Random binary expression trees with depth up to 4, integer operands in [0, 99], and operators {+, −, ×} [13]. Target: scalar value of the expression. Encoder: learned embedding table over tokens, mean-pooled and mapped through a two-layer MLP to h_x. Decoder: three-layer MLP producing a single scalar. Loss: mean squared error. Metric: MAE, reported as 100 − MAE (higher is better).

4.3 CNF logic satisfaction. Random satisfiable 3-SAT formulas with 5 variables and 3 to 10 clauses, generated with a known satisfying assignment [14]. Target: variable assignment satisfying all clauses. Encoder: per-clause MLP on literal-polarity rows, mean-pooled, then two-layer MLP to h_x. Decoder: two-layer MLP with sigmoid, one output per variable, thresholded at 0.5. Loss: binary cross-entropy. Metric: clause satisfaction rate (SAT%).

5 Results

All models are trained on the datasets in Section 4 with the configuration in Appendix A.
An encoder-decoder baseline (no energy model, no planner) with matched parameter budget is included for each task.

5.1 Endpoint performance. Figure 2 compares Direct (decode from encoder, no planning), Planner (decode from z*_T after latent optimization), and Baseline (encoder-decoder, no energy model). Logic: direct ≈95% SAT, planner ≈56%, baseline comparable to direct. Graph: all methods 0-3% accuracy. Arithmetic: all near zero on 100 − MAE. Planning degrades logic performance substantially, motivating the failure analysis in Section 6.

5.2 Energy dynamics during planning. Figure 3 plots E(h_x, z) over 200 planning steps for five test instances per task. Graph (left): energy decreases monotonically for all instances. Logic (center): same pattern, with steeper descent for higher initial energy. Arithmetic (right): energy is flat across all five expressions, with no measurable descent. The energy model produces useful gradients for graph and logic but not for arithmetic.

5.3 Trajectory geometry. Figure 4 projects latent trajectories onto the first two principal components. Graph (left): eight trajectories start from a shared initialization (star) and diverge to instance-specific endpoints (diamonds). Logic (right): eight formulas start from different encodings (diamonds) and converge to a shared terminal cluster (stars) near the origin.

5.4 Energy landscapes. Figure 5 shows 2D energy slices around z_T. Graph (left): smooth contours with a directional gradient. Logic (center): structured surface with a high-energy peak and smooth descent. Arithmetic (right): energy varies by ∼0.004 across the slice, producing a flat surface with no useful gradient.

6 Failure Analysis

The most critical finding is that latent planning degrades logic accuracy from ≈95% to ≈56%. We investigate five hypotheses.

6.1 H1: Encoder-decoder distribution mismatch (primary cause). The decoder is trained on encoder outputs h_x (Eq. 6) but evaluated on planner outputs z_T. Tracking ‖z_T − h_x‖² over planning steps reveals that latent drift increases monotonically while SAT% degrades: the planner pushes z into latent regions the decoder has never seen. This is the dominant failure mode. The per-step heatmap (Figure 6) provides direct evidence: SAT% is highest at t = 1 (where z_1 = h_x) and degrades as the trajectory progresses.

6.2 H2: Energy-decoder misalignment. The energy model is trained contrastively on trajectory structure but receives no signal from the decoder. Computing the Pearson correlation between E(h_x, z) and SAT% across the test set yields weak values at all planning steps, confirming that the energy surface is not aligned with decoded output quality. The energy model learns trajectory structure, not answer correctness. This is further supported by Figure 7, which shows energy decreasing steadily while SAT% remains flat.

6.3 H3: Optimization overshooting. The planner uses η = 0.01 with gradient clipping at 1.0. For logic (5 variables, binary outputs), the decoder's decision boundary is narrow relative to the d = 64 latent space. Even small planner steps can cross the boundary. The ablation protocol in Section 7 sweeps the planner learning rate; we expect very small η to preserve SAT% by limiting drift.

6.4 H4: Hard-negative quality. Negatives are generated as z^- = z^+ + 0.5·ε, ε ∼ N(0, I). This fixed-scale perturbation may be too coarse for the logic task's sharp decision boundaries. Decoder-informed negatives (perturbing z until the decoded assignment flips) would provide a tighter contrastive signal.

6.5 H5: Spurious attractor. The PCA plot (Figure 4, right) shows trajectories converging to a shared terminal cluster.
Per-step decoding (Figure 6) confirms that SAT% is highest at t = 1 (where z_1 = h_x) and degrades as the trajectory approaches this attractor, which is a low-energy basin that does not correspond to correct assignments.

6.6 Proposed fixes. Two architectural changes address H1 directly: (1) dual-path decoder training (Eq. 7), which trains the decoder on both h_x and planner z_T; and (2) latent anchoring, which adds λ_anchor ‖z − h_x‖² to the planner's objective to prevent excessive drift. Both are implemented in the codebase, and their ablation protocol is described in Section 7.

7 Ablation Studies

We design six ablation sets to isolate the contribution of each component. The infrastructure for all sets is implemented in the codebase (run_ablations.jl); all use reduced datasets (500 train, 50 val, 100 test) and 30 epochs. We report the experimental design and hypotheses; full numerical results across all tasks are deferred to a forthcoming extended version.

7.1 Set A: Component contribution. Five configurations: full system, no contrastive loss (α_contr = 0), no smoothness (α_smooth = 0), no planning (steps = 0), and no energy at all (α_contr = 0, α_smooth = 0, steps = 0). Hypothesis: the no-planning configuration should recover the ≈95% direct accuracy on logic, isolating the planner as the source of degradation.

7.2 Set B: Trajectory length T. T ∈ {1, 2, 4, 8, 12}. T = 1 collapses to a single latent state and should match direct decoding. Hypothesis: longer T provides more room for the planner to drift, producing monotonically increasing degradation on logic.

7.3 Set C: Planner dynamics. Three sub-grids: (C1) planner steps ∈ {5, 10, 25, 50, 100, 200}; (C2) gradient descent vs. Langevin; (C3) planner learning rate ∈ {0.001, 0.005, 0.01, 0.05}. Hypothesis: on logic, SAT% should degrade with more steps and higher learning rate, consistent with H1 and H3.

7.4 Set D: Initialization strategy.
Three strategies: (a) default (z_1 = h_x, z_{2:T} ∼ N(0, 0.01)); (b) all-encoder (z_t = h_x + ε for all t); (c) zero initialization. Hypothesis: strategy (b) keeps all trajectory steps near the decoder's training distribution, preserving accuracy.

7.5 Set E: Decoder training distribution. (a) Decoder trained on h_x only (default); (b) dual-path training on both h_x and planner z_T (Eq. 7). Hypothesis: dual-path training directly closes the distribution gap and should recover planner accuracy on logic.

Figure 3: Energy during latent planning. Left (graph): energy decreases consistently. Center (logic): monotonic descent across formulas. Right (arithmetic): energy is flat, indicating limited optimization progress.

Figure 4: Latent trajectories in PCA space. Left (graph): trajectories diverge from a shared start to instance-specific endpoints. Right (logic): trajectories from diverse starts converge to a shared terminal cluster.

7.6 Set F: Anchor weight. λ_anchor ∈ {0, 0.01, 0.1, 1.0}. Hypothesis: a higher anchor weight constrains the planner to stay near h_x, trading exploration for decoder compatibility. A moderate value should improve planner accuracy without collapsing to direct decoding.

8 Latent Dynamics Analysis

8.1 Per-step decoding. Figure 6 decodes z_t at every planning step. On logic, SAT% is highest at t = 1 and degrades monotonically, confirming that the planner moves z away from the decoder's effective region. This is the most direct evidence for H1.

8.2 Gradient decomposition. Decomposing the planner gradient ∇_z E into contributions from the step scorer, transition scorer, and smoothness term reveals that on logic, the step scorer dominates the gradient early in planning, while the smoothness term grows as the trajectory contracts toward the attractor. The transition scorer contributes minimally throughout.
This suggests the planner is primarily driven by per-step compatibility scores rather than trajectory coherence.

8.3 Energy vs. solution quality. On logic (Figure 7), clause satisfaction stays constant over 200 planning steps while energy decreases steadily. The planner reduces energy without improving the decoded output, confirming that the energy surface is misaligned with decoder quality (H2).

8.4 PCA with metric coloring. Projecting trajectories into PCA space and coloring each point by its decoded SAT% reveals that encoder outputs h_x cluster in a high-SAT% region, while planner endpoints drift into low-SAT% territory. This directly visualizes the distribution mismatch identified in H1: the planner moves z away from the region where the decoder produces correct outputs. The standard PCA trajectories (Figure 4, right) show the same convergence pattern without the metric overlay.

On arithmetic (Figure 8), final energy E(h_x, z*) correlates with absolute error at r = 0.073. Energy does not predict answer quality.

9 Limitations

Method limitations. (1) Energy-decoder misalignment: the energy function scores trajectory structure, not decoded output quality. There is no guarantee that low energy implies correct answers. (2) Distribution shift at inference: the decoder is trained on encoder outputs but evaluated on planner outputs. Dual-path training and anchoring mitigate but do not eliminate this gap. (3) Scalability: each planning step requires a backward pass through the energy model; cost scales linearly with K × T × d. (4) Initialization sensitivity: the planner's output depends on z_0; without multi-restart or annealing, it may converge to different local minima. (5) No learned stopping criterion: the planner runs for a fixed K steps with no mechanism to detect when further optimization is harmful.

Experimental limitations. (1) Synthetic tasks only: all three tasks use procedurally generated data with known solutions.
Generalization to natural-language or real-world reasoning is untested. (2) MLP-only architectures: no graph neural networks, no transformers. The encoder/decoder capacity may be insufficient for the graph task. (3) Small scale: 5 variables (logic), 8-20 nodes (graph), depth-4 trees (arithmetic) in the default configuration. Scaled variants (10/15 variables, 5-10/20-50 nodes) are provided but not yet fully evaluated. (4) Seed variance: key experiments should be repeated across multiple seeds to quantify variance.

Figure 5: Energy landscapes around z_T. Left (graph): smooth directional gradients. Center (logic): structured surface with a clear low-energy basin. Right (arithmetic): nearly flat surface with negligible gradient signal.

Figure 6: Logic: per-step SAT% during planning. Rows are test problems, columns are planning steps. SAT% is highest at step 1 and degrades, confirming the spurious-attractor hypothesis.

10 Conclusion

EBRM models reasoning as gradient-based energy minimization over a structured latent trajectory z_{1:T}. The energy decomposes into per-step, transition, and smoothness terms; training separates supervised encoder-decoder learning from contrastive energy shaping.

The central finding is that latent planning can degrade performance when the decoder is not trained on the planner's output distribution. On logic, planning drops SAT% from ≈95% to ≈56% because z_T drifts into latent regions the decoder has never seen. Per-step decoding, latent-drift tracking, and gradient decomposition confirm this distribution-mismatch hypothesis. Two fixes, dual-path decoder training and latent anchoring, are proposed.

Figure 7: Logic: energy vs. clause satisfaction during planning. Energy (solid) decreases while SAT% (dashed) remains flat.

On graph and logic, the energy model learns a surface that supports monotonic energy descent, structured PCA trajectories, and smooth local landscapes. On arithmetic, the energy surface is flat (r = 0.073), constituting a documented negative result where contrastive training fails to shape a useful scoring function.

Figure 8: Arithmetic: final energy vs. prediction error (r = 0.073). Energy does not reliably predict answer quality.

A six-set ablation suite is designed to isolate the contribution of each component, including the proposed fixes. The immediate next steps are running the full ablation protocol, evaluating dual-path training and anchoring at full scale, extending to harder task variants (10-15 variable SAT, larger graphs), and exploring decoder-aware energy functions that directly couple the energy surface to decoded output quality.

References

[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

[2] D. Kong, M. Zhao, A. Qin, B. Pang, C. Tao, D. Hartmann, E. Honig, D. Xu, A. Kumar, M. Sarte, C. Li, J. Xie, and Y. N. Wu. Inference-time rethinking with latent thought vectors for math reasoning. arXiv preprint arXiv:2602.06584, 2026.

[3] P. Wang, R. Cai, Z. Wang, H. Mei, Q. Liu, P. Li, and Z. Wang. ∇-Reasoner: LLM reasoning via test-time gradient descent in latent space. arXiv preprint arXiv:2603.04948, 2026.

[4] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

[5] B. Pang, T. Han, E. Nijkamp, S.-C. Zhu, and Y. N. Wu. Learning latent space energy-based prior model. In Advances in Neural Information Processing Systems, 2020.

[6] D. Carbone. Hitchhiker's guide on the relation of energy-based models with other generative models, sampling and statistical physics: a comprehensive review. Transactions on Machine Learning Research, 2025.

[7] J. Cui and T. Han.
Learning latent space hierarchical EBM diffusion models. arXiv preprint arXiv:2405.13910, 2024.

[8] P. Raj. Kolmogorov-Arnold energy models: fast, interpretable generative modeling. arXiv preprint arXiv:2506.14167, 2026.

[9] D. Kong, B. Pang, T. Han, and Y. N. Wu. Latent thought models with variational Bayes inference-time computation. In Proceedings of the International Conference on Machine Learning, 2025.

[10] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis. In Proceedings of the International Conference on Machine Learning, 2022.

[11] W. Chen, S. Deng, S. Jia, and S. Levine. Efficient planning with latent diffusion. In International Conference on Learning Representations, 2024.

[12] P. Veličković, A. Buesing, M. Overlan, R. Pascanu, O. Vinyals, C. Blundell, J. Ibarz, A. W. Senior, and G. Swirszcz. The CLRS algorithmic reasoning benchmark. In Proceedings of the International Conference on Machine Learning, 2022.

[13] A. Trask, F. Hill, S. E. Reed, J. Rae, C. Dyer, and P. Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, 2018.

[14] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill. Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations, 2019.

A Default Hyperparameters

Table 1 lists the default configuration used for all experiments unless stated otherwise. Ablation studies (Section 7) use reduced datasets (500 train, 50 val, 100 test) and 30 epochs.
Parameter                       Value

Latent space
  Latent dimension d            64
  Trajectory length T           8

Training
  Epochs                        100
  Batch size                    32
  Learning rate                 1e-3
  Weight decay                  1e-4
  α_contr                       0.1
  α_dec                         1.0
  α_smooth                      0.01
  Dual-path decoder             off

Inference (planner)
  Planner steps K               50
  Planner LR η                  0.01
  Langevin noise σ_noise        0.005
  Gradient clip norm            1.0
  Anchor weight λ_anchor        0.0

Energy model
  Hidden dim                    128
  Layers                        3

Encoder / Decoder
  Hidden dim                    128
  Layers                        2

Dataset sizes (full)
  Train / Val / Test            5000 / 500 / 1000

Table 1: Default hyperparameters. See config.toml in the repository for the complete specification.
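The defaults in Table 1 correspond to a configuration file along the following lines. This is an illustrative sketch only: the repository's config.toml is authoritative, and the section and key names below are assumptions chosen to mirror the table.

```toml
# Hypothetical sketch of config.toml mirroring Table 1 (names are assumptions).
[latent]
d = 64              # latent dimension
T = 8               # trajectory length

[training]
epochs = 100
batch_size = 32
lr = 1e-3
weight_decay = 1e-4
alpha_dec = 1.0
alpha_contr = 0.1
alpha_smooth = 0.01
dual_path_decoder = false

[planner]
steps = 50          # K
lr = 0.01           # eta
sigma_noise = 0.005 # 0.0 recovers plain gradient descent
grad_clip_norm = 1.0
lambda_anchor = 0.0 # 0.0 disables latent anchoring

[energy_model]
hidden_dim = 128
layers = 3

[encoder_decoder]
hidden_dim = 128
layers = 2

[data]
train = 5000
val = 500
test = 1000
```

The ablation runs (Section 7) would override only the data sizes (500/50/100), epochs (30), and the single swept parameter per set.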