Reasoning as Energy Minimization over Structured Latent Trajectories

David K. Johansson*

Abstract

Single-shot neural decoders commit to answers without iterative refinement; chain-of-thought methods refine over discrete token sequences but lack a scalar measure of reasoning progress. Energy-Based Reasoning via Structured Latent Planning (EBRM) models reasoning as gradient-based optimization of a multi-step latent trajectory z_{1:T} under a learned energy function E(h_x, z). The energy decomposes into per-step compatibility, pairwise transition consistency, and trajectory smoothness terms. Training splits into supervised encoder-decoder learning and contrastive energy shaping with hard negatives. At inference, gradient descent or Langevin dynamics minimize energy over z; the decoder maps z_T to the answer. We identify a critical failure mode: on CNF logic satisfaction, planning degrades accuracy from ≈95% to ≈56% because the decoder is trained only on encoder outputs h_x but evaluated on planner outputs z_T, which drift into unseen latent regions. We diagnose this via per-step decoding, latent-drift tracking, and gradient decomposition, then propose two fixes, dual-path decoder training and latent anchoring, that address the distribution mismatch. We design a six-set ablation protocol (component contribution, trajectory length, planner dynamics, initialization, decoder training distribution, anchor weight) and present diagnostic experiments across three tasks. On graph shortest-path, energy descends monotonically and trajectories show structured PCA geometry. On arithmetic, the energy surface is flat (r = 0.073), constituting a documented negative result. Code: https://github.com/dkjo8/ebr-via-structured-latent-planning.

1 Introduction

Single-shot decoders map problem encodings to answers in one pass. Errors in the encoding propagate without correction.
Chain-of-thought prompting [1] adds intermediate token-level steps, improving accuracy on multi-step tasks, but the resulting traces are discrete, high-dimensional, and lack a scalar signal indicating whether reasoning is improving [2, 3].

*Polished Snow Inc. Correspondence to: David K. Johansson. Preprint, March 2026.

Figure 1: EBRM overview. Encode problem x to context h_x; minimize E(h_x, z) over latent trajectory z_{1:T} via gradient descent or Langevin dynamics; decode z_T to answer ŷ.

EBRM replaces token-level iteration with gradient-based optimization in continuous latent space. An encoder maps problem x to context h_x; a structured trajectory z_{1:T} ∈ R^{d×T} is optimized to minimize a learned energy E(h_x, z) [4, 5]; the decoder reads z_T and produces the answer. Energy decreases during optimization, providing a built-in progress measure. The energy function decomposes into per-step, transition, and smoothness terms, each computed by a separate network (Section 3). Figure 1 shows the pipeline.

Three tasks instantiate this setup. In graph shortest-path, the input is a weighted graph with source and sink; the target is binary node membership on a shortest path. In arithmetic expression evaluation, the input is an expression tree such as (3 + 7) × 2; the target is the scalar result. In CNF logic satisfaction, the input is a Boolean formula; the target is a satisfying variable assignment. Each task uses a task-specific encoder and decoder; the energy model and planner are shared.

Contributions. (C1) A latent trajectory representation z_{1:T} scored by a decomposable energy function (per-step, transition, smoothness). (C2) A gradient-based planner that minimizes E(h_x, z) with encoder-seeded initialization, optional Langevin noise, and latent anchoring.
(C3) A split training procedure: supervised encoder-decoder loss (with optional dual-path training on planner outputs) plus contrastive energy loss with hard negatives. (C4) Root-cause analysis of the planning-degradation failure mode, identifying encoder-decoder distribution mismatch as the primary cause. (C5) A six-set ablation protocol and diagnostic analysis (per-step decoding, latent drift, gradient decomposition, energy-accuracy correlation). (C6) Empirical results on three tasks with diagnostic figures and baselines.

2 Related Work

Energy-based models and latent-variable models. EBMs assign a scalar energy to variable configurations and perform inference by energy minimization [4]. They avoid normalization requirements, allowing flexible architecture design [6]. Latent EBMs learn a data-dependent prior over a latent vector, with posterior sampling via Langevin Monte Carlo [5]. Recent extensions include diffusion-assisted training [7] and structured univariate priors [8]. All of these operate on unstructured latent vectors. EBRM structures the latent space as a multi-step trajectory and decomposes energy into per-step, transition, and smoothness terms.

Iterative and multi-step reasoning. Chain-of-thought prompting [1] elicits intermediate steps as token sequences but produces traces that are discrete and hard to optimize over. Kong et al. [2] separate latent thought vectors from token generation and refine them via Gibbs-style inference. Wang et al. [3] optimize token logits using gradient signals from a reward model. Kong et al. [9] scale inference-time computation through variational Bayes over latent thoughts. EBRM differs in two ways: reasoning is a trajectory z_{1:T} rather than a single vector, and a decomposable energy function scores each step of the trajectory.

Planning and latent optimization. Janner et al. [10] cast planning as diffusion-based trajectory sampling with gradient conditioning on rewards.
Chen et al. [11] extend this to latent action spaces. Both target control and generation. EBRM applies latent trajectory optimization to reasoning tasks using contrastive energy training rather than denoising scores. The trajectory is fixed-length and encoder-seeded, not noise-initialized.

3 Method

3.1 Overview. EBRM has five components:

1. Encoder: h_x = enc(x) ∈ R^d.
2. Latent trajectory: z = [z_1, ..., z_T], z_t ∈ R^d, stored as a d × T matrix.
3. Energy model: E(h_x, z) ∈ R; lower energy means higher trajectory plausibility.
4. Planner: minimizes E(h_x, z) over z by gradient descent, with model parameters fixed.
5. Decoder: ŷ = dec(z_T).

The encoder and decoder are trained with supervised losses. The energy model is trained with contrastive losses. Inference modifies only z.

3.2 Energy decomposition. The energy function decomposes into three terms aggregated by a learned global scorer:

E(h_x, z) = f_{\mathrm{global}}\big(\bar{s}_{\mathrm{step}}, \bar{s}_{\mathrm{trans}}, \lambda_{\mathrm{smooth}}\big)   (1)

where f_global is a two-layer MLP mapping three scalars to one energy value.

Per-step score. A shared MLP s_θ scores each latent state against the problem context:

\bar{s}_{\mathrm{step}} = \frac{1}{T} \sum_{t=1}^{T} s_\theta([h_x; z_t])   (2)

where [·;·] denotes concatenation.

Transition score. A separate MLP s_φ scores adjacent pairs:

\bar{s}_{\mathrm{trans}} = \frac{1}{T-1} \sum_{t=1}^{T-1} s_\phi([z_t; z_{t+1}])   (3)

Smoothness. A parameter-free term penalizes large jumps:

\lambda_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert z_{t+1} - z_t \rVert^2   (4)

The three terms enforce step-level relevance (Eq. 2), pairwise consistency (Eq. 3), and trajectory regularity (Eq. 4).

3.3 Latent planning. At inference, z is optimized to minimize E(h_x, z) with model parameters fixed. Initialization sets z_1 to the first d components of h_x and samples z_{2:T} from N(0, σ²I) with small σ.
The update rule is

z \leftarrow z - \eta \nabla_z E(h_x, z) + \sqrt{2\eta}\,\sigma_{\mathrm{noise}}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)   (5)

Setting σ_noise = 0 recovers gradient descent; σ_noise > 0 adds Langevin exploration. Gradients are clipped by norm. The planner runs for K steps and returns z*; the decoder produces ŷ = dec(z*_T).

Latent anchoring. An optional quadratic penalty λ_anchor ‖z − h_x‖² is added to the planning objective, so its gradient 2λ_anchor(z − h_x) enters the update, preventing the trajectory from drifting far from the encoder's output distribution. This addresses the distribution mismatch identified in Section 6.

3.4 Training. Two parameter groups receive separate gradients.

Encoder-decoder (supervised). Minimizes a task-specific loss on the decoder output. In the default mode, the decoder is trained on the encoder output h_x directly:

\mathcal{L}_{\mathrm{dec}} = \ell(\mathrm{dec}(h_x), y)   (6)

where ℓ is binary cross-entropy (graph, logic) or mean squared error (arithmetic).

Dual-path decoder training. To address the distribution mismatch between encoder outputs and planner outputs (Section 6), an optional dual-path mode trains the decoder on both h_x and the planner's z*_T:

\mathcal{L}^{\mathrm{dual}}_{\mathrm{dec}} = \tfrac{1}{2}\,\ell(\mathrm{dec}(h_x), y) + \tfrac{1}{2}\,\ell(\mathrm{dec}(z^*_T), y)   (7)

This ensures the decoder can handle inputs from both the encoder and the planner.

Energy model (contrastive). A hinge loss pushes positive (teacher) energy below negative (perturbed or planned) energy:

\mathcal{L}_{\mathrm{contr}} = \max\big(0,\; E(h_x, z^+) - E(h_x, z^-) + m\big)   (8)

where z^+ is the teacher trajectory, z^- is a hard negative (planner output or perturbed z^+), and m is the margin.

Smoothness regularizer.

\mathcal{L}_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \lVert z^+_{t+1} - z^+_t \rVert^2   (9)

Combined objective. The total loss is a weighted sum:

\mathcal{L} = \alpha_{\mathrm{dec}}\mathcal{L}_{\mathrm{dec}} + \alpha_{\mathrm{contr}}\mathcal{L}_{\mathrm{contr}} + \alpha_{\mathrm{smooth}}\mathcal{L}_{\mathrm{smooth}}   (10)

Encoder-decoder parameters receive gradients from L_dec + α_smooth L_smooth. Energy-model parameters receive gradients from L_contr only. Isolating the energy gradients prevents the energy model from collapsing to trivially low values on all trajectories.
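The decomposed energy (Eqs. 1-4) and the planner update (Eq. 5) can be sketched in a few dozen lines. This is an illustrative NumPy sketch, not the released implementation (the codebase is Julia), and every name in it is an assumption: the learned scorers s_θ, s_φ, and f_global are replaced by fixed quadratic stand-ins so the energy is cheap to evaluate, and autodiff is replaced by finite differences.

```python
# Toy sketch of EBRM's decomposed energy and gradient/Langevin planner.
# All networks are replaced by fixed quadratic stand-ins for exposition.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 4  # toy sizes; the paper's defaults are d = 64, T = 8

w_step = rng.normal(size=2 * d)   # stand-in for s_theta on [h_x; z_t]
w_trans = rng.normal(size=2 * d)  # stand-in for s_phi on [z_t; z_{t+1}]

def energy(h_x, z):
    """E(h_x, z): mean per-step score + mean transition score + smoothness.
    f_global is a plain sum here instead of a learned two-layer MLP."""
    s_step = np.mean([(w_step @ np.concatenate([h_x, z[:, t]])) ** 2
                      for t in range(T)])
    s_trans = np.mean([(w_trans @ np.concatenate([z[:, t], z[:, t + 1]])) ** 2
                       for t in range(T - 1)])
    smooth = np.mean([np.sum((z[:, t + 1] - z[:, t]) ** 2)
                      for t in range(T - 1)])
    return s_step + s_trans + smooth

def grad_z(h_x, z, eps=1e-4):
    """Finite-difference gradient of E w.r.t. z (stand-in for autodiff)."""
    g = np.zeros_like(z)
    for idx in np.ndindex(*z.shape):
        zp, zm = z.copy(), z.copy()
        zp[idx] += eps
        zm[idx] -= eps
        g[idx] = (energy(h_x, zp) - energy(h_x, zm)) / (2 * eps)
    return g

def plan(h_x, z, K=50, eta=0.01, sigma_noise=0.0, lam_anchor=0.0, clip=1.0):
    """Eq. 5: sigma_noise = 0 is gradient descent, > 0 adds Langevin noise.
    lam_anchor > 0 adds the latent-anchoring penalty gradient."""
    z = z.copy()
    anchor = np.tile(h_x[:, None], (1, T))  # anchor every step to h_x
    for _ in range(K):
        g = grad_z(h_x, z) + 2.0 * lam_anchor * (z - anchor)
        norm = np.linalg.norm(g)
        if norm > clip:                      # gradient clipping by norm
            g *= clip / norm
        noise = rng.normal(size=z.shape)
        z = z - eta * g + np.sqrt(2.0 * eta) * sigma_noise * noise
    return z

# Encoder-seeded initialization: z_1 = h_x, z_{2:T} ~ N(0, 0.01 I).
h_x = rng.normal(size=d)
z0 = np.zeros((d, T))
z0[:, 0] = h_x
z0[:, 1:] = 0.1 * rng.normal(size=(d, T - 1))

e0 = energy(h_x, z0)
z_star = plan(h_x, z0, K=40)
print(f"energy: {e0:.3f} -> {energy(h_x, z_star):.3f}")
```

Because the stand-in energy is a convex quadratic, the clipped descent steps reduce it monotonically, mirroring the descent curves reported for the graph and logic tasks; in the paper the same loop runs against learned MLP scorers with autodiff gradients.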
4 Tasks

All three tasks use procedurally generated data with known ground-truth solutions. Each task has a task-specific encoder and decoder; the energy model architecture and training procedure (Section 3) are shared.

4.1 Graph shortest-path. Random weighted directed graphs with n ∈ [8, 20] nodes and edge probability 0.3. Target: binary label per node indicating membership on a Dijkstra shortest path between designated source and destination [12]. Encoder: two-layer MLP on concatenated node features, flattened adjacency, and one-hot source/destination indicators, producing h_x ∈ R^d. Decoder: two-layer MLP with sigmoid, one output per node. Loss: binary cross-entropy. Metric: node-level accuracy.

Figure 2: Direct vs. planner endpoint performance across tasks. Planning degrades logic accuracy from ≈95% to ≈56%, motivating the failure analysis in Section 6.

4.2 Arithmetic expression evaluation. Random binary expression trees with depth up to 4, integer operands in [0, 99], and operators {+, −, ×} [13]. Target: scalar value of the expression. Encoder: learned embedding table over tokens, mean-pooled and mapped through a two-layer MLP to h_x. Decoder: three-layer MLP producing a single scalar. Loss: mean squared error. Metric: MAE, reported as 100 − MAE (higher is better).

4.3 CNF logic satisfaction. Random satisfiable 3-SAT formulas with 5 variables and 3 to 10 clauses, generated with a known satisfying assignment [14]. Target: variable assignment satisfying all clauses. Encoder: per-clause MLP on literal-polarity rows, mean-pooled, then two-layer MLP to h_x. Decoder: two-layer MLP with sigmoid, one output per variable, thresholded at 0.5. Loss: binary cross-entropy. Metric: clause satisfaction rate (SAT%).

5 Results

All models are trained on the datasets in Section 4 with the configuration in Appendix A.
An encoder-decoder baseline (no energy model, no planner) with matched parameter budget is included for each task.

5.1 Endpoint performance. Figure 2 compares Direct (decode from encoder, no planning), Planner (decode from z*_T after latent optimization), and Baseline (encoder-decoder, no energy model). Logic: direct ≈95% SAT, planner ≈56%, baseline comparable to direct. Graph: all methods 0-3% accuracy. Arithmetic: all near zero on 100 − MAE. Planning degrades logic performance substantially, motivating the failure analysis in Section 6.

5.2 Energy dynamics during planning. Figure 3 plots E(h_x, z) over 200 planning steps for five test instances per task. Graph (left): energy decreases monotonically for all instances. Logic (center): same pattern, with steeper descent for higher initial energy. Arithmetic (right): energy is flat across all five expressions, with no measurable descent. The energy model produces useful gradients for graph and logic but not for arithmetic.

5.3 Trajectory geometry. Figure 4 projects latent trajectories onto the first two principal components. Graph (left): eight trajectories start from a shared initialization (star) and diverge to instance-specific endpoints (diamonds). Logic (right): eight formulas start from different encodings (diamonds) and converge to a shared terminal cluster (stars) near the origin.

5.4 Energy landscapes. Figure 5 shows 2D energy slices around z_T. Graph (left): smooth contours with a directional gradient. Logic (center): structured surface with a high-energy peak and smooth descent. Arithmetic (right): energy varies by ∼0.004 across the slice, producing a flat surface with no useful gradient.

6 Failure Analysis

The most critical finding is that latent planning degrades logic accuracy from ≈95% to ≈56%. We investigate five hypotheses.

6.1 H1: Encoder-decoder distribution mismatch (primary cause). The decoder is trained on encoder outputs h_x (Eq. 6) but evaluated on planner outputs z_T. Tracking ‖z_T − h_x‖² over planning steps reveals that latent drift increases monotonically while SAT% degrades: the planner pushes z into latent regions the decoder has never seen. This is the dominant failure mode. The per-step heatmap (Figure 6) provides direct evidence: SAT% is highest at t = 1 (where z_1 = h_x) and degrades as the trajectory progresses.

6.2 H2: Energy-decoder misalignment. The energy model is trained contrastively on trajectory structure but receives no signal from the decoder. Computing the Pearson correlation between E(h_x, z) and SAT% across the test set yields weak values at all planning steps, confirming that the energy surface is not aligned with decoded output quality. The energy model learns trajectory structure, not answer correctness. This is further supported by Figure 7, which shows energy decreasing steadily while SAT% remains flat.

6.3 H3: Optimization overshooting. The planner uses η = 0.01 with gradient clipping at 1.0. For logic (5 variables, binary outputs), the decoder's decision boundary is narrow relative to the d = 64 latent space. Even small planner steps can cross the boundary. The ablation protocol in Section 7 sweeps the planner learning rate; we expect very small η to preserve SAT% by limiting drift.

6.4 H4: Hard-negative quality. Negatives are generated as z^- = z^+ + 0.5·ε, ε ∼ N(0, I). This fixed-scale perturbation may be too coarse for the logic task's sharp decision boundaries. Decoder-informed negatives (perturbing z until the decoded assignment flips) would provide a tighter contrastive signal.

6.5 H5: Spurious attractor. The PCA plot (Figure 4, right) shows trajectories converging to a shared terminal cluster.
Per-step decoding (Figure 6) confirms that SAT% is highest at t = 1 (where z_1 = h_x) and degrades as the trajectory approaches this attractor, which is a low-energy basin that does not correspond to correct assignments.

6.6 Proposed fixes. Two architectural changes address H1 directly: (1) dual-path decoder training (Eq. 7), which trains the decoder on both h_x and planner z_T; and (2) latent anchoring, which adds λ_anchor ‖z − h_x‖² to the planner's objective to prevent excessive drift. Both are implemented in the codebase, and their ablation protocol is described in Section 7.

7 Ablation Studies

We design six ablation sets to isolate the contribution of each component. The infrastructure for all sets is implemented in the codebase (run_ablations.jl); all use reduced datasets (500 train, 50 val, 100 test) and 30 epochs. We report the experimental design and hypotheses; full numerical results across all tasks are deferred to a forthcoming extended version.

7.1 Set A: Component contribution. Five configurations: full system, no contrastive loss (α_contr = 0), no smoothness (α_smooth = 0), no planning (steps = 0), and no energy at all (α_contr = 0, α_smooth = 0, steps = 0). Hypothesis: the no-planning configuration should recover the ≈95% direct accuracy on logic, isolating the planner as the source of degradation.

7.2 Set B: Trajectory length T. T ∈ {1, 2, 4, 8, 12}. T = 1 collapses to a single latent state and should match direct decoding. Hypothesis: longer T provides more room for the planner to drift, producing monotonically increasing degradation on logic.

7.3 Set C: Planner dynamics. Three sub-grids: (C1) planner steps ∈ {5, 10, 25, 50, 100, 200}; (C2) gradient descent vs. Langevin; (C3) planner learning rate ∈ {0.001, 0.005, 0.01, 0.05}. Hypothesis: on logic, SAT% should degrade with more steps and higher learning rate, consistent with H1 and H3.

7.4 Set D: Initialization strategy.
Three strategies: (a) default (z_1 = h_x, z_{2:T} ∼ N(0, 0.01)); (b) all-encoder (z_t = h_x + ε for all t); (c) zero initialization. Hypothesis: strategy (b) keeps all trajectory steps near the decoder's training distribution, preserving accuracy.

7.5 Set E: Decoder training distribution. (a) Decoder trained on h_x only (default); (b) dual-path training on both h_x and planner z_T (Eq. 7). Hypothesis: dual-path training directly closes the distribution gap and should recover planner accuracy on logic.

Figure 3: Energy during latent planning. Left (graph): energy decreases consistently. Center (logic): monotonic descent across formulas. Right (arithmetic): energy is flat, indicating limited optimization progress.

Figure 4: Latent trajectories in PCA space. Left (graph): trajectories diverge from a shared start to instance-specific endpoints. Right (logic): trajectories from diverse starts converge to a shared terminal cluster.

7.6 Set F: Anchor weight. λ_anchor ∈ {0, 0.01, 0.1, 1.0}. Hypothesis: a higher anchor weight constrains the planner to stay near h_x, trading exploration for decoder compatibility. A moderate value should improve planner accuracy without collapsing to direct decoding.

8 Latent Dynamics Analysis

8.1 Per-step decoding. Figure 6 decodes z_t at every planning step. On logic, SAT% is highest at t = 1 and degrades monotonically, confirming that the planner moves z away from the decoder's effective region. This is the most direct evidence for H1.

8.2 Gradient decomposition. Decomposing the planner gradient ∇_z E into contributions from the step scorer, transition scorer, and smoothness term reveals that on logic, the step scorer dominates the gradient early in planning, while the smoothness term grows as the trajectory contracts toward the attractor. The transition scorer contributes minimally throughout.
This suggests the planner is primarily driven by per-step compatibility scores rather than trajectory coherence.

8.3 Energy vs. solution quality. On logic (Figure 7), clause satisfaction stays constant over 200 planning steps while energy decreases steadily. The planner reduces energy without improving the decoded output, confirming that the energy surface is misaligned with decoder quality (H2).

8.4 PCA with metric coloring. Projecting trajectories into PCA space and coloring each point by its decoded SAT% reveals that encoder outputs h_x cluster in a high-SAT% region, while planner endpoints drift into low-SAT% territory. This directly visualizes the distribution mismatch identified in H1: the planner moves z away from the region where the decoder produces correct outputs. The standard PCA trajectories (Figure 4, right) show the same convergence pattern without the metric overlay.

On arithmetic (Figure 8), final energy E(h_x, z*) correlates with absolute error at r = 0.073. Energy does not predict answer quality.

9 Limitations

Method limitations. (1) Energy-decoder misalignment: the energy function scores trajectory structure, not decoded output quality. There is no guarantee that low energy implies correct answers. (2) Distribution shift at inference: the decoder is trained on encoder outputs but evaluated on planner outputs. Dual-path training and anchoring mitigate but do not eliminate this gap. (3) Scalability: each planning step requires a backward pass through the energy model; cost scales linearly with K × T × d. (4) Initialization sensitivity: the planner's output depends on z_0; without multi-restart or annealing, it may converge to different local minima. (5) No learned stopping criterion: the planner runs for a fixed K steps with no mechanism to detect when further optimization is harmful.

Experimental limitations. (1) Synthetic tasks only: all three tasks use procedurally generated data with known solutions.
Generalization to natural-language or real-world reasoning is untested. (2) MLP-only architectures: no graph neural networks, no transformers. The encoder/decoder capacity may be insufficient for the graph task. (3) Small scale: 5 variables (logic), 8-20 nodes (graph), depth-4 trees (arithmetic) in the default configuration. Scaled variants (10/15 variables, 5-10/20-50 nodes) are provided but not yet fully evaluated. (4) Seed variance: key experiments should be repeated across multiple seeds to quantify variance.

Figure 5: Energy landscapes around z_T. Left (graph): smooth directional gradients. Center (logic): structured surface with a clear low-energy basin. Right (arithmetic): nearly flat surface with negligible gradient signal.

Figure 6: Logic: per-step SAT% during planning. Rows are test problems, columns are planning steps. SAT% is highest at step 1 and degrades, confirming the spurious-attractor hypothesis.

10 Conclusion

EBRM models reasoning as gradient-based energy minimization over a structured latent trajectory z_{1:T}. The energy decomposes into per-step, transition, and smoothness terms; training separates supervised encoder-decoder learning from contrastive energy shaping.

The central finding is that latent planning can degrade performance when the decoder is not trained on the planner's output distribution. On logic, planning drops SAT% from ≈95% to ≈56% because z_T drifts into latent regions the decoder has never seen. Per-step decoding, latent-drift tracking, and gradient decomposition confirm this distribution-mismatch hypothesis. Two fixes, dual-path decoder training and latent anchoring, are proposed.

Figure 7: Logic: energy vs. clause satisfaction during planning. Energy (solid) decreases while SAT% (dashed) remains flat.

On graph and logic, the energy model learns a surface that supports monotonic energy descent, structured PCA trajectories, and smooth local landscapes. On arithmetic, the energy surface is flat (r = 0.073), constituting a documented negative result where contrastive training fails to shape a useful scoring function.

Figure 8: Arithmetic: final energy vs. prediction error (r = 0.073). Energy does not reliably predict answer quality.

A six-set ablation suite is designed to isolate the contribution of each component, including the proposed fixes. The immediate next steps are running the full ablation protocol, evaluating dual-path training and anchoring at full scale, extending to harder task variants (10-15 variable SAT, larger graphs), and exploring decoder-aware energy functions that directly couple the energy surface to decoded output quality.

References

[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

[2] D. Kong, M. Zhao, A. Qin, B. Pang, C. Tao, D. Hartmann, E. Honig, D. Xu, A. Kumar, M. Sarte, C. Li, J. Xie, and Y. N. Wu. Inference-time rethinking with latent thought vectors for math reasoning. arXiv preprint arXiv:2602.06584, 2026.

[3] P. Wang, R. Cai, Z. Wang, H. Mei, Q. Liu, P. Li, and Z. Wang. ∇-Reasoner: LLM reasoning via test-time gradient descent in latent space. arXiv preprint arXiv:2603.04948, 2026.

[4] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. J. Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

[5] B. Pang, T. Han, E. Nijkamp, S.-C. Zhu, and Y. N. Wu. Learning latent space energy-based prior model. In Advances in Neural Information Processing Systems, 2020.

[6] D. Carbone. Hitchhiker's guide on the relation of energy-based models with other generative models, sampling and statistical physics: a comprehensive review. Transactions on Machine Learning Research, 2025.

[7] J. Cui and T. Han.
Learning latent space hierarchical EBM diffusion models. arXiv preprint arXiv:2405.13910, 2024.

[8] P. Raj. Kolmogorov-Arnold energy models: fast, interpretable generative modeling. arXiv preprint arXiv:2506.14167, 2026.

[9] D. Kong, B. Pang, T. Han, and Y. N. Wu. Latent thought models with variational Bayes inference-time computation. In Proceedings of the International Conference on Machine Learning, 2025.

[10] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis. In Proceedings of the International Conference on Machine Learning, 2022.

[11] W. Chen, S. Deng, S. Jia, and S. Levine. Efficient planning with latent diffusion. In International Conference on Learning Representations, 2024.

[12] P. Veličković, A. Buesing, M. Overlan, R. Pascanu, O. Vinyals, C. Blundell, J. Ibarz, A. W. Senior, and G. Swirszcz. The CLRS algorithmic reasoning benchmark. In Proceedings of the International Conference on Machine Learning, 2022.

[13] A. Trask, F. Hill, S. E. Reed, J. Rae, C. Dyer, and P. Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, 2018.

[14] D. Selsam, M. Lamm, B. Bünz, P. Liang, L. de Moura, and D. L. Dill. Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations, 2019.

A Default Hyperparameters

Table 1 lists the default configuration used for all experiments unless stated otherwise. Ablation studies (Section 7) use reduced datasets (500 train, 50 val, 100 test) and 30 epochs.
Parameter                       Value

Latent space
  Latent dimension d            64
  Trajectory length T           8

Training
  Epochs                        100
  Batch size                    32
  Learning rate                 1e-3
  Weight decay                  1e-4
  α_contr                       0.1
  α_dec                         1.0
  α_smooth                      0.01
  Dual-path decoder             off

Inference (planner)
  Planner steps K               50
  Planner LR η                  0.01
  Langevin noise σ_noise        0.005
  Gradient clip norm            1.0
  Anchor weight λ_anchor        0.0

Energy model
  Hidden dim                    128
  Layers                        3

Encoder / Decoder
  Hidden dim                    128
  Layers                        2

Dataset sizes (full)
  Train / Val / Test            5000 / 500 / 1000

Table 1: Default hyperparameters. See config.toml in the repository for the complete specification.
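The defaults in Table 1 correspond to a configuration file along the following lines. This is an illustrative sketch only: the repository's config.toml is authoritative, and the section and key names below are assumptions chosen to mirror the table.

```toml
# Hypothetical sketch of config.toml mirroring Table 1 (names are assumptions).
[latent]
d = 64              # latent dimension
T = 8               # trajectory length

[training]
epochs = 100
batch_size = 32
lr = 1e-3
weight_decay = 1e-4
alpha_dec = 1.0
alpha_contr = 0.1
alpha_smooth = 0.01
dual_path_decoder = false

[planner]
steps = 50          # K
lr = 0.01           # eta
sigma_noise = 0.005 # 0.0 recovers plain gradient descent
grad_clip_norm = 1.0
lambda_anchor = 0.0 # 0.0 disables latent anchoring

[energy_model]
hidden_dim = 128
layers = 3

[encoder_decoder]
hidden_dim = 128
layers = 2

[data]
train = 5000
val = 500
test = 1000
```

The ablation runs (Section 7) would override only the data sizes (500/50/100), epochs (30), and the single swept parameter per set.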