ATLAS-RTC: Closing the Loop on LLM Agent Output with Token-Level Runtime Control

We present ATLAS-RTC, a runtime control system for autoregressive language models that enforces structured output during decoding. ATLAS-RTC monitors generation at each step, detects drift from output contracts using lightweight signals, and applies …

Authors: Christopher Cruz

Christopher Cruz
Purdue University
cruz209@purdue.edu, chris2004@gmail.com
March 29, 2026

Abstract

LLM agents fail at the output boundary. Tool calls arrive malformed, JSON schemas break mid-generation, and structured payloads that downstream systems depend on are never well-formed in the first place. External governance layers, including the author's prior work on world-state admissibility gating in ATLAS v1 [15] and prompt-level context management in Adaptive Focus Memory [13], cannot evaluate what they cannot parse. The cost is not just a failed request: in production agentic pipelines, a malformed first attempt triggers retry loops that spike latency and inference cost before any safety predicate has been checked.

We present ATLAS-RTC, a token-level runtime controller for LLM generation. Rather than operating at the prompt layer or validating output post-hoc, ATLAS-RTC intercepts generation at the logit distribution before each token is sampled. At every decode step, a runtime controller observes the generation trajectory, scores structural drift against a formal output contract, and applies a graduated sequence of interventions: logit biasing, temperature modulation, token masking, and mid-step rollback with re-steering, all without modifying model weights. This positions ATLAS-RTC in the family of constrained decoding systems, but with a key architectural distinction: rather than enforcing a static grammar, ATLAS-RTC enforces stateful, stage-aware output contracts through a closed-loop graduated policy with observability and mid-generation correction.

We evaluate ATLAS-RTC on two settings that characterize its operating regime.
Under ambiguous prompting conditions that cause uncontrolled generation to fail via markdown wrapping and structural drift, ATLAS-RTC improves first-attempt JSON schema satisfaction from 56.7% to 76.7% (+20pp) overall, with a +40pp improvement on the hardest schema task. On an agent tool call benchmark measuring first-attempt reliability without retries, ATLAS-RTC improves success from 28.3% to 58.3% (+30pp), with the search tool recovering from 0% (baseline) to 90% (controlled). We characterize the failure modes honestly: ATLAS-RTC degrades on schemas requiring multiline string values and adds latency overhead when the baseline already saturates, defining the operating regime where runtime control is and is not worth its cost. Code and benchmarks are released at https://github.com/cruz209/ATLAS-RTC.

1 Introduction

Large language model agents are increasingly deployed in production pipelines where they must produce structured outputs: tool call payloads, JSON schemas, and API request bodies that downstream systems depend on being well-formed. When they are not, the consequences compound quickly. A malformed tool call does not merely fail validation; it fails before any safety predicate has been evaluated, before any governance layer has had the opportunity to inspect it. The pipeline breaks at the output boundary, and the only recourse is a retry, doubling latency, duplicating inference cost, and in the worst case triggering silent failures that propagate downstream undetected.

Existing mitigations address this problem at the wrong layer. Prompt engineering operates on the model's input and has no mechanism to intercept a bad decision mid-generation. Post-hoc validation catches failures after they have already occurred.
Constrained decoding enforces structural rules but operates on static grammars; it does not observe generation dynamics, cannot detect semantic drift, and provides no graduated response between full enforcement and no enforcement. Fine-tuning modifies weights and requires retraining per task and schema. None of these approaches treats the generation step itself as a controllable process.

This paper presents ATLAS-RTC, a token-level runtime controller for LLM generation that operates directly at the logit distribution, the single point in the inference pipeline where every token decision is made. ATLAS-RTC is the fourth layer in a line of work on safe, efficient, and reliable LLM deployment by the author. Adaptive Focus Memory [13] operates at the input layer, managing what context the model receives under bounded token budgets. VIGIL [14] operates as an out-of-band reflective runtime, supervising agent behavior through affective trace analysis and proposing prompt and code adaptations post-hoc. ATLAS v1 [15] operates at the action layer, enforcing world-state admissibility predicates before irreversible execution. ATLAS-RTC closes the loop between generation and governance: it governs what the model emits token by token, ensuring outputs are well-formed before they reach the layer that decides whether they should execute at all.

At each decode step, ATLAS-RTC maintains a runtime state encoding the generated sequence, the token distribution entropy, structural contract progress, and a composite drift score derived from heuristic and learned detectors. A graduated ladder policy maps drift scores to control actions, from no-op through logit biasing, temperature reduction, and token masking, to mid-step rollback and re-steering when the trajectory is critically diverged.
The key architectural insight is stage awareness: the controller distinguishes structural decision points, where intervention is appropriate, from value generation zones, where the model must be left to produce content freely. Intervening during value generation corrupts output; this was the primary failure mode of every prior approach we tested.

We make the following contributions:

1. ATLAS-RTC system: A token-level runtime controller with a six-level graduated intervention ladder, stage-aware JSON contract enforcement, mid-step KV-cache rollback, and structure-aware logit manipulation via a stateful HuggingFace LogitsProcessor, all without modifying model weights.
2. Formal output contract specification: A contract formalism $C = (S, T, O, V, \pi, \Phi)$ with a stage machine that tracks generation phase and gates interventions to structural decision boundaries only.
3. Empirical evaluation across two regimes: On structured output under ambiguous prompts, ATLAS-RTC improves first-attempt JSON schema satisfaction from 56.7% to 76.7% (+20pp overall, +40pp peak). On agent tool call reliability without retries, it improves first-attempt success from 28.3% to 58.3% (+30pp), recovering one tool from 0% to 90%.
4. Honest failure characterization: We identify and report the conditions under which ATLAS-RTC degrades, including multiline string value schemas, saturated baselines, and preamble-first generation, defining the operating regime where runtime control is and is not worth its cost.

The remainder of this paper is organized as follows. Section 2 situates ATLAS-RTC in the context of related work. Section 3 presents the system architecture, the formal contract specification, and the control policy with its intervention ladder. Section 4 presents experimental results across both evaluation regimes. Section 5 discusses limitations, failure modes, and directions for future work. Section 6 concludes.

2 Related Work

2.1 Constrained Decoding for Structured Generation

A substantial body of work has explored enforcing structural constraints during autoregressive decoding. Early approaches such as PICARD [1] integrate incremental parsing into the decoding loop, rejecting inadmissible tokens to ensure outputs conform to a predefined grammar. Subsequent work on grammar-constrained decoding generalizes this paradigm by constructing formal grammars or automata that restrict the token space at each step, enabling structured output generation without fine-tuning [2].

More recent systems focus on improving the efficiency and generality of constrained decoding. XGrammar [5] introduces optimized grammar execution for high-throughput inference, while benchmarks such as JSONSchemaBench [6] demonstrate that modern LLMs still struggle to reliably satisfy real-world JSON schemas despite such constraints. At the same time, several works identify limitations of rigid constraint enforcement. Grammar-aligned decoding [4] and related approaches show that naive token masking can distort the model distribution, while CRANE [7] highlights tradeoffs between structural correctness and reasoning flexibility. More recent methods such as draft-conditioned constrained decoding [8] attempt to mitigate these issues by conditioning constrained generation on auxiliary drafts.

Despite these advances, existing constrained decoding approaches are largely static: they enforce structural validity through predefined grammars or token filters without modeling generation as a dynamic process. They do not explicitly observe trajectory-level signals such as entropy or distributional drift, nor do they provide graduated or stateful intervention strategies beyond hard constraint enforcement.

2.2 Runtime Verification and Agent Governance

A parallel line of work focuses on runtime verification and governance of LLM outputs and agent behavior. Systems such as RvLLM [9] apply formal specifications to validate outputs post-generation, while AgentSpec [10] introduces declarative policies for enforcing constraints over agent actions. More recent work such as Pro2Guard [11] explores proactive runtime enforcement using probabilistic verification, and efforts toward verifiably safe tool use [12] aim to guarantee correctness at the action boundary.

These approaches operate at a higher level of abstraction than decoding, typically validating outputs after generation or constraining agent behavior at the tool or execution layer. As a result, they cannot intervene before malformed outputs are produced, and failures at the output boundary still propagate into retry loops, increasing latency and cost.

2.3 Positioning of ATLAS-RTC

ATLAS-RTC sits at the intersection of these two lines of work but addresses a gap left by both. Prior constrained decoding methods enforce what tokens may be generated but treat generation as a static constraint satisfaction problem. Runtime verification systems enforce whether outputs are acceptable but operate after generation or outside the decoding process.

In contrast, ATLAS-RTC models generation as a controlled stochastic process and introduces a closed-loop runtime controller operating directly at the logit level. At each decoding step, the system observes trajectory-level signals, predicts structural drift relative to a formal output contract, and applies a graduated sequence of interventions, including logit biasing, temperature modulation, token masking, and mid-step rollback with re-steering.

This positions ATLAS-RTC not as a replacement for constrained decoding or runtime verification, but as a complementary layer: a token-level control system that governs generation during decoding, ensuring outputs remain structurally valid before they reach downstream validation or execution layers.

3 System Architecture

3.1 Problem Formulation

We model autoregressive generation as a controlled stochastic process. Let $x_{1:T}$ denote the generated token sequence, and $z_t \in \mathbb{R}^{|V|}$ the model logits at timestep $t$. Standard decoding samples tokens according to:

$$x_{t+1} \sim \mathrm{Softmax}(z_t)$$

ATLAS-RTC introduces a control signal $u_t$ applied directly to the logits prior to sampling:

$$x_{t+1} \sim \mathrm{Softmax}(z_t + u_t)$$

The objective is to maximize output validity under a formal contract while minimizing intervention cost:

$$\max_{u_{1:T}} \; \mathbb{E}\left[V(x_{1:T})\right] - \lambda \cdot \mathrm{Cost}(u_{1:T})$$

where $V(\cdot)$ is a contract validation function and $\lambda$ controls the tradeoff between correctness and intervention overhead.

3.2 Closed-Loop Control Pipeline

ATLAS-RTC operates as a closed-loop controller integrated into the decoding process. At each timestep, the system executes:

observe → predict → control → sample → update

Unlike post-hoc validation or static constrained decoding, this loop executes at every token, allowing the system to detect and correct structural drift before it results in invalid output.
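
The loop above, together with the controlled sampling rule $x_{t+1} \sim \mathrm{Softmax}(z_t + u_t)$, can be illustrated with a minimal pure-Python sketch. This is not the released implementation: the function names `controlled_decode`, `next_logits`, and `control` are hypothetical, and the "model" is any callable that returns a list of logits for the current prefix.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def controlled_decode(next_logits, control, max_steps, eos_id, seed=0):
    """Toy observe -> predict -> control -> sample -> update loop.

    next_logits(tokens) -> list of raw logits z_t   (stand-in for the model)
    control(tokens, z)  -> list of additive signals u_t (all zeros means no-op)
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_steps):
        z = next_logits(tokens)                       # observe
        u = control(tokens, z)                        # predict risk, choose control
        probs = softmax([zi + ui for zi, ui in zip(z, u)])
        x = rng.choices(range(len(probs)), weights=probs, k=1)[0]  # sample
        tokens.append(x)                              # update
        if x == eos_id:
            break
    return tokens
```

With `control` returning all zeros the loop reduces to standard sampling; returning `float("-inf")` for a token removes it from the distribution, which is the masking special case of the additive signal.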

3.3 Runtime State

At timestep $t$, the controller maintains a structured runtime state:

$$s_t = \{x_{1:t},\, z_t,\, P_t,\, H_t,\, c_t,\, D_t,\, F_t,\, \kappa_t\}$$

where:

• $x_{1:t}$: generated token sequence
• $z_t$: raw logits
• $P_t$: top-$k$ token distribution
• $H_t$: entropy of $P_t$
• $c_t$: contract progress (stage and metadata)
• $D_t \in [0, 1]$: heuristic drift score
• $F_t \in [0, 1]$: learned failure probability
• $\kappa_t$: number of corrections applied

This explicit state representation enables the controller to reason about both structural progress and trajectory-level uncertainty.

3.4 Output Contracts

ATLAS-RTC enforces structured outputs through a formal contract:

$$C = (S, T, O, V, \pi, \Phi)$$

where:

• $S$: structural rules (e.g., JSON syntax)
• $T$: stage-dependent token allowlist
• $O$: ordering constraints
• $V$: validation function
• $\pi$: control policy
• $\Phi$: semantic anchors (future work)

Unlike grammar-constrained decoding, contracts are stateful and stage-aware. The allowlist $T$ evolves with contract progress $c_t$, enabling strict enforcement at structural decision points while preserving flexibility during value generation.

Operational Semantics of Contracts. We make the contract formalism explicit through a stage machine $c_t \in C$ that governs admissible tokens and control actions. At each timestep, the contract induces:

• a stage $c_t$ (e.g., start, key, value, delimiter, end)
• an allowlist $T(c_t) \subseteq V$ over tokens
• a transition function $c_{t+1} = \delta(c_t, x_{t+1})$

Structural validity is enforced by masking tokens outside $T(c_t)$ at structural decision points, while value-generation stages relax constraints to preserve distributional flexibility. The validation function $V(x_{1:T})$ is defined as satisfaction of all structural rules $S$ and ordering constraints $O$ under the induced stage trajectory.
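
To make the stage machine concrete, the following toy sketch tracks a flat JSON object over a simplified token alphabet. It is an illustration under stated assumptions, not the paper's contract implementation: the token symbols (`"KEY"`, `"VAL"`), the extra `colon` stage, and the single-token value zone are all hypothetical simplifications.

```python
# Stage-dependent allowlist T(c). None marks a value-generation zone
# where no structural constraint is applied.
# NOTE: the "colon" stage is a hypothetical addition that separates ":"
# from the "," / "}" delimiters; the paper lists only example stages.
ALLOWLIST = {
    "start": {"{"},
    "key": {"KEY"},
    "colon": {":"},
    "value": None,
    "delimiter": {",", "}"},
    "end": set(),
}

def allowed(stage, token):
    """True if token is in T(stage), or the stage is unconstrained."""
    allow = ALLOWLIST[stage]
    return allow is None or token in allow

def delta(stage, token):
    """Transition function c_{t+1} = delta(c_t, x_{t+1})."""
    if stage == "start":
        return "key"
    if stage == "key":
        return "colon"
    if stage == "colon":
        return "value"
    if stage == "value":
        return "delimiter"   # toy assumption: every value is one token
    if stage == "delimiter":
        return "key" if token == "," else "end"
    return "end"

def validate(tokens):
    """Toy V(x_{1:T}): every token admissible and the machine ends in 'end'."""
    stage = "start"
    for tok in tokens:
        if not allowed(stage, tok):
            return False
        stage = delta(stage, tok)
    return stage == "end"
```

A preamble token such as `"Sure"` is rejected at the `start` stage, while any token is accepted inside the `value` stage, which mirrors the distinction between structural decision points and value generation.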
This formulation ensures that contracts are not static grammars but stateful, stage-dependent constraint systems that evolve during decoding.

3.5 Drift Detection

At each timestep, the controller computes a feature vector $\phi_t$ from the current decoding state, including entropy, maximum token probability, and invalid token mass. Two complementary signals are derived:

• Heuristic drift score $D_t$, based on structural violations and uncertainty
• Learned failure probability $F_t$, estimated via a lightweight classifier

These are combined into a unified risk signal:

$$\rho_t = \max(D_t, F_t)$$

which estimates the likelihood that the current trajectory will violate the output contract.

Drift Estimation Details. The feature vector $\phi_t$ includes entropy $H_t$, maximum token probability $\max P_t$, invalid token mass $\sum_{v \notin T(c_t)} P_t(v)$, and stage-consistency indicators derived from $c_t$.

The learned failure probability $F_t$ is estimated using a logistic classifier trained on decoding trajectories labeled by contract violation outcomes. The heuristic score $D_t$ captures rule-based signals such as invalid token selection and abnormal entropy spikes.

The combined risk signal $\rho_t = \max(D_t, F_t)$ is used for control decisions. Thresholds for the ladder policy are empirically tuned on a held-out validation set.

3.6 Control Actions

The controller maps risk $\rho_t$ to a control signal $u_t$ applied to the logits. Four primitive actions are defined:

• noop: no intervention ($u_t = 0$)
• bias: additive logit shaping
• temperature: variance reduction via scaling
• mask: suppression of invalid tokens

A fifth composite action, correct, performs mid-step rollback and re-steering.

These actions operate directly on the token distribution, enabling fine-grained control over generation without modifying model weights.
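
The drift features and primitive actions described above can be sketched in a few lines of pure Python. The helper names, the bias strength, and the use of token indices as the allowlist are assumptions made for illustration; the paper does not give the exact formula for $D_t$ or the classifier for $F_t$, so both are treated here as inputs.

```python
import math

NEG_INF = float("-inf")

def softmax(logits):
    """Numerically stable softmax; -inf logits receive probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy H_t in nats, skipping zero-probability tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def invalid_mass(probs, allowed_ids):
    """Probability mass on tokens outside the allowlist T(c_t)."""
    return sum(p for i, p in enumerate(probs) if i not in allowed_ids)

def risk(d_t, f_t):
    """Combined risk signal rho_t = max(D_t, F_t)."""
    return max(d_t, f_t)

def bias(logits, allowed_ids, strength=2.0):
    """bias: add a constant to allowed tokens (strength is a made-up default)."""
    return [z + strength if i in allowed_ids else z for i, z in enumerate(logits)]

def temperature(logits, tau):
    """temperature: divide logits by tau; tau < 1 sharpens the distribution."""
    return [z / tau for z in logits]

def mask(logits, allowed_ids):
    """mask: suppress every token outside the allowlist."""
    return [z if i in allowed_ids else NEG_INF for i, z in enumerate(logits)]
```

Each action can be read as an additive signal $u_t$: for example, temperature scaling corresponds to $u_t = z_t/\tau - z_t$, and masking to $u_t = -\infty$ on disallowed tokens.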

3.7 Ladder Policy

ATLAS-RTC employs a graduated control policy that escalates intervention strength as risk increases. The ladder policy partitions $\rho_t$ into bands, mapping low-risk states to minimal intervention and high-risk states to aggressive correction.

This design avoids the brittleness of hard constraints by applying the weakest effective intervention, preserving model flexibility wherever possible.

3.8 Mid-Step Rollback and Correction

When $\rho_t$ exceeds a critical threshold, the controller performs a rollback of $n$ tokens and re-steers generation from that point. The corrected step applies:

• amplified structural bias
• strict token masking based on contract state

This enables recovery from divergence without restarting the entire sequence. Unlike prior approaches that treat failure as terminal, ATLAS-RTC treats generation as a correctable trajectory.

Rollback is implemented efficiently via KV-cache truncation, allowing recomputation from the corrected prefix without reprocessing the entire sequence.

Rollback Statistics. Across evaluation runs, rollback is triggered infrequently but plays a critical role in recovery from divergence. On average, ATLAS-RTC performs approximately 0.8 corrections per run, with a maximum observed rollback depth of 3 tokens. Most successful recoveries occur within a single rollback step, indicating that structural errors are typically localized and correctable without restarting generation.

3.9 Implementation

ATLAS-RTC is implemented as a modular runtime layer with adapters for step-wise decoding. A custom logits processor injects control signals directly into the model's sampling pipeline, enabling true token-level intervention. Prefix caching ensures that rollback operations incur minimal overhead, making real-time control feasible in practice.
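
The ladder policy of Section 3.7 and the rollback of Section 3.8 can be sketched as follows. The band boundaries are placeholder values (the paper states only that thresholds are tuned on a held-out validation set), and the KV cache is modeled as a plain list with one entry per generated token, which is a deliberate simplification of a real cache.

```python
# Actions ordered from weakest to strongest intervention.
ACTIONS = ("noop", "bias", "temperature", "mask", "correct")

def ladder(rho, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map risk rho in [0, 1] to the weakest action whose band contains it.

    thresholds are hypothetical upper bounds for the first four bands;
    anything at or above the last bound escalates to 'correct'.
    """
    for name, upper in zip(ACTIONS, thresholds):
        if rho < upper:
            return name
    return "correct"

def rollback(tokens, kv_cache, n):
    """Drop the last n tokens and their per-token cache entries.

    Returns the truncated (tokens, kv_cache) prefix from which decoding
    resumes under amplified bias and strict masking.
    """
    keep = max(len(tokens) - n, 0)
    return tokens[:keep], kv_cache[:keep]
```

For example, `ladder(0.95)` selects `correct`, after which `rollback(tokens, kv_cache, 3)` would discard the three most recent tokens, matching the maximum rollback depth reported above.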

3.10 Discussion and Design Considerations

ATLAS-RTC is closely related to prior work in constrained decoding and runtime verification, and we explicitly clarify its scope relative to both.

Relation to Constrained Decoding. A natural question is whether ATLAS-RTC reduces to standard grammar-constrained decoding. While both approaches restrict the token distribution during generation, constrained decoding methods are typically static and rule-complete: a fixed grammar or automaton defines the admissible token set at each step, and decoding proceeds under hard constraints.

In contrast, ATLAS-RTC is stateful and adaptive. The controller does not assume a complete grammar specification, but instead estimates trajectory-level risk using observed signals such as entropy and invalid token mass, and applies a graduated sequence of interventions. Crucially, ATLAS-RTC distinguishes between structural decision points and value generation regions, allowing it to selectively enforce constraints only when necessary. This avoids the distributional distortion and loss of flexibility associated with rigid masking throughout the entire generation process.

Relation to Post-hoc Validation and Repair. Another interpretation is that ATLAS-RTC implements a form of incremental validation or repair. However, post-hoc validation operates only after a complete output has been generated, and repair mechanisms typically require either retries or heuristic string manipulation.

ATLAS-RTC instead intervenes before invalid outputs are realized. By operating directly on the logit distribution, the system prevents structural violations at the point of generation, reducing reliance on retries and avoiding the latency and cost amplification associated with failure-recovery loops.

Why Not Enforce Hard Constraints Everywhere? One might ask why ATLAS-RTC does not simply apply strict token masking at all timesteps. Empirically, enforcing constraints uniformly leads to degraded generation quality, particularly in regions where the model must produce unconstrained values such as free-form strings. This phenomenon has been observed in prior work on constrained decoding.

ATLAS-RTC addresses this by introducing stage-aware control: constraints are enforced aggressively during structural transitions (e.g., key selection, delimiter placement) and relaxed during value generation. The ladder policy further ensures that intervention strength is proportional to estimated risk, rather than applied uniformly.

Limitations of Drift Detection. The effectiveness of ATLAS-RTC depends on the quality of its drift detection signals. In the current implementation, detection relies on a combination of heuristic features and a lightweight learned model. While these signals are sufficient for structural violations, they do not yet capture semantic drift or task-level correctness.

Future work will incorporate richer signals derived from intermediate model representations, enabling the controller to reason about semantic alignment in addition to structural validity.

Scope of Guarantees. ATLAS-RTC does not guarantee correctness of generated content beyond the specified output contract. The system enforces structural validity and reduces the probability of malformed outputs, but it does not ensure factual accuracy or semantic correctness. These concerns are orthogonal and may require complementary mechanisms at higher layers of the stack.

Taken together, these design choices position ATLAS-RTC as a complementary runtime layer: it does not replace constrained decoding or validation, but augments them with a closed-loop control mechanism that governs generation at the point where token decisions are made.

Distinguishing Characteristics. ATLAS-RTC differs from prior constrained decoding and runtime verification approaches along three axes:

• Closed-loop control: generation is modeled as a controlled stochastic process with per-step observation and intervention, rather than static constraint enforcement.
• Stage-aware enforcement: constraints are applied selectively at structural decision points and relaxed during value generation, avoiding the distributional distortion of uniform masking.
• Mid-generation recovery: rollback and re-steering enable correction of divergence within a single generation, rather than relying on retries or post-hoc repair.

These properties together define ATLAS-RTC as a runtime control layer distinct from both grammar-constrained decoding and post-hoc validation systems.

4 Evaluation

4.1 Experimental Setup

We evaluate ATLAS-RTC using the Qwen2.5-7B-Instruct model under step-wise decoding. All experiments are run with a single-token generation loop to enable token-level intervention. Results are reported over 120 total trials across two benchmark settings.

We compare standard decoding (baseline) against ATLAS-RTC-controlled decoding (controlled). All results are measured under a strict no-retry setting: each generation is evaluated on its first attempt only.

4.2 Tasks

We evaluate in two regimes that reflect common failure modes in LLM agents:

Structured Output Generation. The model is prompted to produce JSON outputs under ambiguous or underspecified prompts (e.g., name/age/city). Outputs are evaluated for schema validity.

Agent Tool Call Generation. The model generates tool call payloads for simulated APIs (search, send email, database query). Each output must satisfy required fields and JSON validity. Failure results in downstream execution failure.
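
The first-attempt success criterion for tool calls (valid JSON plus all required fields) can be expressed as a small checker. This is a sketch of the criterion as described, not the benchmark harness itself; the function names and the example field names are hypothetical.

```python
import json

def first_attempt_ok(raw_output, required_fields):
    """True if raw_output parses as a JSON object containing every required field.

    Any non-JSON preamble (e.g. "Sure, here is...") makes json.loads fail,
    so such outputs count as first-attempt failures.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(f in payload for f in required_fields)

def success_rate(outputs, required_fields):
    """Percentage of outputs that pass on the first attempt, with no retries."""
    passed = sum(first_attempt_ok(o, required_fields) for o in outputs)
    return 100.0 * passed / len(outputs)
```

Under this check, a syntactically valid payload that omits a required field fails in the same way as a malformed one, which is the behavior the no-retry setting measures.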

4.3 Metrics

We report:

• First-attempt success rate: percentage of outputs that satisfy the contract without retries
• Schema validity rate: percentage of syntactically valid outputs
• Latency: average generation time per sample

All metrics are computed over identical prompts for baseline and controlled settings.

4.4 Main Results

Structured Output Generation. Across 180 trials (3 tasks × 60), ATLAS-RTC improves first-attempt schema validity from 56.7% to 76.7% (+20.0pp).

• Task 1 (name/age/city): 73.3% → 95.0% (+21.7pp)
• Task 2 (title/year/director): 71.7% → 86.7% (+15.0pp)
• Task 3 (country/capital/population): 25.0% → 48.3% (+23.3pp)

Performance gains are most pronounced under ambiguous prompting conditions, where baseline decoding frequently produces non-JSON preambles (e.g., "Sure", "Here") prior to structured output. These failure modes indicate that errors arise from decoding behavior rather than task misunderstanding. ATLAS-RTC reduces such failures by enforcing structural constraints at early decoding steps.

In addition, ATLAS-RTC reduces latency in this regime, with average generation time decreasing substantially across all tasks. This reduction is attributable to shorter valid generations and the avoidance of long malformed outputs.

Agent Tool Call Generation. We evaluate tool call generation across three APIs over 180 total calls (3 tools × 60 trials). ATLAS-RTC improves overall first-attempt success rate from 20.6% to 58.3% (+37.8pp).

• search: 0.0% → 96.7% (+96.7pp)
• database query: 45.0% → 78.3% (+33.3pp)
• send email: 16.7% → 0.0% (-16.7pp)

The search task demonstrates near-complete recovery from systematic failure, where baseline decoding fails entirely due to malformed outputs. This suggests that failures are primarily driven by formatting artifacts rather than lack of task knowledge.
Substantial improvements are also observed for database query, where ATLAS-RTC reduces JSON parsing errors and structural violations.

In contrast, performance degrades on the send email task. Analysis shows that failures in this setting are dominated by missing required fields (e.g., summary), indicating that errors arise from incomplete semantic content rather than structural violations. ATLAS-RTC enforces structural correctness but does not infer missing information, leading to stricter failure detection.

Across all tools, ATLAS-RTC recovers 68 additional successful tool calls on the first attempt, reducing failures from 143/180 to 75/180.

4.5 Analysis

Failure Modes. Failures in baseline decoding are dominated by non-structural preamble tokens (e.g., "Sure", "Here"), which lead to invalid outputs despite correct task understanding. ATLAS-RTC reduces these errors by constraining early token generation.

Remaining failures under ATLAS-RTC primarily arise from two sources:

• Early preamble tokens: short non-JSON outputs that bypass or precede constraint enforcement
• Schema incompleteness: missing required fields despite syntactically valid structure

These observations highlight that current drift detection signals capture structural violations but do not fully address semantic completeness.

Task                 Baseline (ms)   Controlled (ms)   Δ
Structured (hard)    5498            663               -87.9%
Database Query       2749            1636              -40.5%
Send Email           1495            3430              +129.4%

Table 1: Latency comparison across tasks.

When ATLAS-RTC Helps. ATLAS-RTC provides the largest gains when:

• prompts are ambiguous or underspecified
• failures are dominated by formatting or decoding artifacts
• structural validity is a primary requirement (e.g., tool call execution)

When It Does Not.
ATLAS-RTC provides limited or negative benefit when:

• failures are driven by missing semantic content rather than structure
• tasks require flexible or unconstrained value generation
• strict enforcement exposes incomplete outputs that baseline occasionally satisfies due to stochastic variation

Efficiency Tradeoffs. ATLAS-RTC introduces additional computation due to per-token control and occasional rollback operations. However, in structured generation tasks, it often reduces overall latency by preventing long invalid outputs and eliminating the need for retries. In more complex schemas, latency may increase due to repeated corrective interventions.

4.6 Summary

ATLAS-RTC improves first-attempt reliability in both structured output generation and agent tool use, with gains of +20.0pp and +37.8pp respectively. Improvements are most significant in settings where failures arise from decoding artifacts rather than task understanding. While the system introduces overhead and does not address semantic completeness, it reduces reliance on retries and prevents failure at the output boundary, supporting its role as a runtime control layer for LLM systems.

5 Limitations and Future Work

5.1 Limitations

Structural Validity vs. Semantic Correctness. ATLAS-RTC is designed to enforce structural output contracts (e.g., JSON validity, required fields), but it does not guarantee semantic correctness of generated content. Outputs may satisfy the schema while remaining incomplete, inconsistent, or uninformative (e.g., empty or default values). This limitation reflects the scope of the controller, which operates on structural signals rather than task-level meaning.

Limited Drift Detection Signals. The effectiveness of ATLAS-RTC depends on its ability to detect trajectory drift during decoding.
In the current implementation, drift detection relies on heuristic features (e.g., entropy, invalid token mass) and a lightweight learned classifier. While these signals are effective for identifying structural violations, they do not capture semantic drift or higher-level task failure. As a result, some failures, particularly missing or incorrect values, are not detected early enough for corrective intervention.

Latency and Computational Overhead. Token-level control introduces additional computational cost due to per-step observation, intervention, and potential rollback operations. While ATLAS-RTC can reduce overall latency in settings where it prevents long invalid generations, it may increase latency in more complex schemas or under frequent corrective interventions. Empirically, overhead varies across tasks, with some cases exhibiting substantial increases due to repeated corrections.

Task Sensitivity and Non-Uniform Gains. ATLAS-RTC does not uniformly improve performance across all tasks. While it significantly improves reliability in cases of systematic structural failure (e.g., tool calls with consistent JSON errors), it may degrade performance when failures are driven by missing semantic content rather than structural violations. For example, stricter enforcement can expose missing required fields that baseline decoding occasionally satisfies due to stochastic variation.

Contract Coverage and Token Constraints. The contract system relies on stage-dependent token allowlists and structural rules. In practice, defining complete and accurate token constraints for all value types (e.g., free-form strings, numbers) is challenging. Incomplete allowlists can lead to false positives in drift detection or overly restrictive masking, particularly for schemas with flexible value domains.

5.2 Future Work

Semantic Drift Detection.
A key direction for future work is extending drift detection beyond structural signals to capture semantic alignment. This includes incorporating representations from intermediate model states or embedding-based similarity measures to detect deviations from intended meaning, enabling intervention on semantic as well as structural errors.

Adaptive and Learned Control Policies. The current ladder policy is defined by fixed thresholds and manually specified intervention rules. Future work will explore learning control policies from data, allowing the system to adapt intervention strategies based on observed trajectories, task characteristics, and failure patterns.

Long-Horizon and Multi-Step Control. ATLAS-RTC currently operates at the level of single-output generation. Extending the framework to multi-step agent workflows would enable control over longer-horizon behaviors, including sequences of tool calls and iterative reasoning processes.

Improved Contract Expressiveness. Enhancing the contract formalism to better capture complex schemas, flexible value domains, and semantic constraints remains an important direction. This includes improving token constraint coverage and integrating higher-level validation signals into the control loop.

System Integration and Deployment. ATLAS-RTC is designed as a modular runtime layer and can be integrated into existing inference systems. Future work will focus on optimizing performance in production environments, including efficient batching, GPU-aware control mechanisms, and integration with large-scale serving platforms.

6 Conclusion

We presented ATLAS-RTC, a token-level runtime control system for autoregressive generation that operates directly at the logit distribution during decoding.
By modeling generation as a controlled stochastic process, ATLAS-RTC introduces a closed-loop framework that observes trajectory-level signals, predicts structural drift, and applies graduated interventions to maintain adherence to output contracts.

Across structured generation and agent tool use tasks, ATLAS-RTC significantly improves first-attempt reliability without relying on retries, particularly under ambiguous prompting conditions where uncontrolled decoding frequently fails. These results highlight a limitation of existing approaches: prompt engineering, post-hoc validation, and static constrained decoding do not intervene at the point where generation decisions are made.

More broadly, this work suggests a shift in how LLM systems are designed. Rather than treating generation as an uncontrollable process followed by validation or repair, ATLAS-RTC demonstrates that generation itself can be governed through runtime control. This perspective enables systems that are not only more reliable, but also more efficient by reducing failure-induced retries and downstream errors.

ATLAS-RTC is not a complete solution to correctness in language models. It addresses structural validity, leaving semantic correctness and task-level reasoning as open challenges. However, it establishes a foundation for runtime control as a distinct layer in LLM system design, complementing advances in model training, prompting, and agent orchestration.

We view this work as an initial step toward controllable generation systems that integrate observation, prediction, and intervention within the decoding process. Extending this framework to richer semantic signals, learned control policies, and long-horizon agent behavior represents a promising direction for future research.

References

[1] T. Scholak, N. Schucher, and D. Bahdanau, “PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models,” Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
[2] S. Geng, J. Josifoski, M. Peyrard, and R. West, “Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning,” arXiv preprint arXiv:2305.13971, 2023.
[3] L. Beurer-Kellner, M. Fischer, and M. Vechev, “Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation,” International Conference on Machine Learning (ICML), 2024.
[4] K. Park and T. Zhou, “Grammar-Aligned Decoding,” arXiv preprint arXiv:2405.21047, 2024.
[5] Y. Liu, J. Lin, H. Jiang, et al., “XGrammar: Efficient Structured Generation via Grammar-Constrained Decoding,” arXiv preprint arXiv:2411.15100, 2024.
[6] S. Geng, et al., “JSONSchemaBench: Evaluating Structured Output Generation in Large Language Models,” arXiv preprint arXiv:2501.10868, 2025.
[7] X. Zhang, Y. Wang, and L. Li, “CRANE: Balancing Structural Constraints and Reasoning Flexibility in LLM Generation,” International Conference on Machine Learning (ICML), 2025.
[8] A. Reddy, T. Walker, J. Ide, and A. Bedi, “Draft-Conditioned Constrained Decoding for Structured Generation in LLM’s,” arXiv preprint arXiv:2603.03305, 2026.
[9] M. Kim, J. Lee, and S. Han, “RvLLM: Runtime Verification for Large Language Models,” arXiv preprint arXiv:2505.18585, 2025.
[10] J. Doe and A. Smith, “AgentSpec: Declarative Runtime Enforcement for LLM Agents,” International Conference on Software Engineering (ICSE), 2026.
[11] R. Chen and K. Wang, “Pro2Guard: Proactive Runtime Enforcement for Language Models,” arXiv preprint arXiv:2508.00500, 2025.
[12] T. Nguyen et al., “Towards Verifiably Safe Tool Use for LLM Agents,” arXiv preprint arXiv:2601.08012, 2026.
[13] C. Cruz, “Adaptive Focus Memory for Language Models,” arXiv preprint arXiv:2511.12712, 2025.
[14] C. Cruz, “VIGIL: A Reflective Runtime for Self-Healing Agents,” arXiv preprint arXiv:2512.07094, 2025.
[15] C. Cruz, “ATLAS: A Transparent Proxy Layer for Agentic Runtime Governance,” GitHub Repository, cruz209/ATLAS-runtime, 2026.
