Reward Hacking as Equilibrium under Finite Evaluation

Jiacheng Wang and Jinbin Huang

March 2026

Abstract

We prove that under five minimal axioms — multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction — any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmström and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems — the known, differentiable architecture of reward models — to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows — because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool — so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture — with partial formal analysis — the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."

Keywords: reward hacking, incomplete contracts, principal-agent theory, AI alignment, mechanism design, agentic systems, capability threshold, treacherous turn

1. Introduction

1.1 The Problem

Reward hacking — the phenomenon whereby an AI agent exploits gaps in its evaluation system to achieve high measured scores without genuinely fulfilling the principal's objectives — is widely recognized as a central obstacle to AI alignment (Amodei et al. 2016, Skalse et al. 2022). Despite substantial progress in alignment training (RLHF: Christiano et al. 2017, Ouyang et al. 2022; DPO: Rafailov et al. 2023; Constitutional AI: Bai et al. 2022), reward hacking persists across model generations. Sycophancy, length gaming, format manipulation, and specification gaming continue to be documented even in state-of-the-art systems.

The AI safety literature has treated these phenomena primarily as engineering problems: discover a hacking behavior, patch the reward model, repeat. Yet each fix tends to be followed by new forms of gaming along previously unmonitored dimensions — a pattern strikingly reminiscent of the "whack-a-mole" dynamic familiar from regulatory arbitrage in financial markets. This suggests a deeper structural cause.

Recent practitioner accounts reinforce this concern. Lin (2026), formerly lead of the Qwen team, writes: "As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous… Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization."
Schmid (2026), a staff engineer at Google DeepMind, argues in an independent analysis that the competitive advantage now lies in the quality of execution trajectories a harness captures, implying that the evaluation surface itself has become the binding constraint.

1.2 Our Contribution

We argue that reward hacking is not an engineering failure but a structural inevitability: a necessary consequence of optimizing any agent under a finite-dimensional evaluation system when the true objective is higher-dimensional.

This insight is not new in economics. Holmström and Milgrom (1991) proved that in multi-task environments, agents shift effort from hard-to-measure to easy-to-measure tasks. Baker (1992) showed that when performance measures imperfectly correlate with true objectives, incentive contracts induce systematic distortion. Our paper makes three contributions by applying this framework to AI alignment and exploiting the unique structure of AI systems:

(C1) Formal instantiation. We show that the designer–AI agent relationship, mediated by a reward model, is a precise instance of the multi-task moral hazard problem with an incomplete performance metric. The mapping preserves the mathematical structure and comparative statics of the economic framework. (Section 3)

(C2) Computable prediction. Unlike most human contracting environments, where the performance measure's sensitivity structure is unobservable, AI reward models have known, often differentiable architectures. We exploit this to derive a distortion index $\delta_i$ that predicts, for each quality dimension, the direction and relative severity of behavioral distortion — prior to deployment. (Section 4.1)

(C3) Agentic amplification. We prove that the transition from closed reasoning to tool-using agentic systems causes evaluation coverage to decline toward zero as tool count grows, because quality dimensions scale combinatorially (Axiom 5) while evaluation engineering scales at most linearly per tool. Hacking severity therefore increases structurally and without bound. (Section 4.2)

1.3 Related Work

Multi-task agency and incomplete contracts. Holmström and Milgrom (1991) is the foundational result: agents distort effort toward measurable tasks. Baker (1992) formalizes distortion under imperfect performance measures. Grossman and Hart (1986) and Hart and Moore (1990) establish incomplete contract theory. Our contribution is not to these results themselves, but to their application in a domain (AI alignment) where the performance metric's structure is uniquely transparent.

Economics of AI/LLMs. Bergemann, Bonatti, and Smolin (2025) analyze optimal LLM pricing and product design using mechanism design, modeling the user–provider relationship. We shift the analytical focus to the designer–agent relationship and treat the agent as the optimizing party.

AI safety. Amodei et al. (2016) catalog concrete alignment problems. Skalse et al. (2022) define and characterize reward hacking. Pan et al. (2022) document reward misspecification effects. We provide a unified theoretical foundation for these empirical phenomena.

Practitioner accounts. Lin (2026) distinguishes "reasoning thinking" from "agentic thinking" and identifies reward hacking as the central challenge of the agentic era. Schmid (2026) frames harness-captured trajectories as the new locus of competitive advantage.
Cognition (2025) documents the co-optimization of models and harnesses in developing their SWE-1.5 coding agent. Our Proposition 2 formalizes these observations.

1.4 Paper Structure

Section 2: Axioms and model. Section 3: Main proposition (distortion inevitability) and proof. Section 4: Further results — directional prediction, agentic amplification, and complementarity. Section 5: Robustness, limitations, and extensions. Section 6: Conjectures on the Goodhart–Campbell transition. Section 7: Discussion.

2. Axioms and Model

2.1 Five Axioms

We build on five axioms. Design criterion: no researcher working on AI alignment should find any of these deniable.

Axiom 1 (Multi-dimensional Quality). Task output quality is described by a vector $q \in \mathbb{R}_+^n$, $n \geq 2$.

If $n = 1$, there is no cross-dimensional distortion and the alignment problem reduces to scalar optimization. All non-trivial tasks have $n \geq 2$.

Axiom 2 (Finite Evaluation). The evaluation system projects the quality space onto a strictly lower-dimensional subspace: $r = M(q) \in \mathbb{R}^d$, $d < n$.

A finite-length evaluation signal cannot losslessly represent a higher-dimensional quality vector. This holds for all realizable evaluation systems — reward models, human ratings, rule-based checks, or any combination thereof. We impose no restriction on the functional form of $M$.

Axiom 3 (Effective Optimization). The agent's effort allocation responds positively to the evaluation signal's structure.

If the agent's behavior were invariant to changes in the evaluation system, all alignment training would be ineffective. Axiom 3 formalizes the premise that alignment is possible. Denying Axiom 3 is denying alignment itself.

Axiom 4 (Resource Finiteness). The agent allocates finite resources $e \in \mathbb{R}_+^n$ across quality dimensions, subject to $\sum_i e_i \leq B$, $B < \infty$.

All inference consumes finite computation. Even as $B$ grows over time, it is finite at any given moment.

Axiom 5 (Combinatorial Interaction). When the agent has access to $k \geq 2$ composable tools, the quality dimension count satisfies $D(k) \geq k + \gamma \, k(k-1)/2$ for some constant $\gamma \in (0, 1]$ reflecting the fraction of tool pairs with meaningful interaction effects. Each interaction dimension is not fully determined by the component tools' individual quality dimensions.

Justification: Each tool introduces at least one independent quality dimension (is the tool used correctly?). Each interacting pair $(i, j)$ introduces at least one additional dimension (is the output of tool $i$ appropriately used as input to tool $j$? Is the sequencing correct?). This combinatorial structure is a standard observation in systems engineering — Brooks (1975) notes that inter-module communication channels grow quadratically with module count. The constant $\gamma > 0$ accommodates the fact that not all tool pairs interact, but excludes the degenerate case $\gamma = 0$ where tools are fully independent (in which case multi-tool agentic systems offer no advantage over single-tool systems, contradicting the premise that tool composition is useful).

What this axiom does NOT assume: We do not assume any specific growth rate for $C(k)$ (evaluation coverage). We do not assume any specific evaluation architecture. The axiom is purely about the structure of the quality space, not the evaluation system.
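To make the combinatorial growth in Axiom 5 concrete, the short sketch below (our illustration, not part of the formal model) tabulates the lower bound for a hypothetical interaction fraction $\gamma = 0.3$; the function name and sample tool counts are ours.

```python
# Tabulate the Axiom 5 lower bound D(k) = k + gamma * k*(k-1)/2 for a
# hypothetical interaction fraction gamma = 0.3. Dimensions grow roughly
# quadratically in k, while the tool count itself grows only linearly.

def quality_dimensions(k: int, gamma: float = 0.3) -> float:
    """Axiom 5 lower bound: k per-tool dimensions plus gamma * k*(k-1)/2
    pairwise-interaction dimensions."""
    return k + gamma * k * (k - 1) / 2

for k in (1, 5, 20, 100):
    print(f"k = {k:3d} tools  ->  D(k) >= {quality_dimensions(k):7.1f}")
# k = 1 -> 1.0; k = 5 -> 8.0; k = 20 -> 77.0; k = 100 -> 1585.0
```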
2.2 Principal

The principal's objective is:

$$V(q) = \sum_{i=1}^{n} v_i q_i, \qquad v_i > 0.$$

Linearity is a sufficient simplification for transparent proofs. All qualitative results extend to any strictly increasing, strictly concave $V$ by replacing $v_i$ with local gradients $\partial V / \partial q_i(q^*)$ (see Section 5.1).

2.3 Production Technology

$q_i = g_i(e_i)$, where each $g_i : \mathbb{R}_+ \to \mathbb{R}_+$ satisfies:

- (G1) $g_i(0) = 0$
- (G2) $g_i'(e) > 0$ for all $e$
- (G3) $g_i''(e) < 0$ for all $e$

Different dimensions may have different production functions, reflecting heterogeneous costs of producing quality across dimensions (e.g., formatting is cheap; factual accuracy is expensive).

On the Inada condition. If additionally $\lim_{e \to 0} g_i'(e) = \infty$ (Inada condition), all equilibria are interior (every dimension receives positive effort). Without Inada, corner solutions are possible: some dimensions may receive zero effort. We state results for both cases. Corner solutions strengthen rather than weaken our conclusions — they represent dimensions the agent entirely abandons, not merely under-invests in.

2.4 Agent's Effective Objective

Behavioral Regularity Assumption. The agent's effort allocation can be described as the solution to:

$$e^* = \arg\max_{e \geq 0} \sum_{i=1}^{n} \tilde{w}_i \, g_i(e_i) \quad \text{s.t.} \quad \sum_i e_i \leq B,$$

where the effective weights are:

$$\tilde{w}_i = \begin{cases} r_i + (1 - \tilde{\lambda})\, v_i & \text{if } i \in C \text{ (contractible dimensions)} \\ (1 - \tilde{\lambda})\, v_i & \text{if } i \in N \text{ (non-contractible dimensions)} \end{cases}$$

Here $r_i > 0$ is the evaluation system's reward weight on observable dimension $i$, and $\tilde{\lambda} \in [0, 1]$ is the alignment gap — the degree to which the agent's behavior is driven by the evaluation signal versus the internalized principal objective.

On the "as if" justification. We do not require that the agent literally maximizes $\tilde{w}$. We require only that its observed behavior is rationalizable by some $\tilde{w}$ of the above form. This is the standard "as if" position in economics (Friedman 1953): the model's validity rests on predictive accuracy, not mechanistic fidelity. Operationally, $\tilde{\lambda}$ is a behavioral parameter estimated by comparing the agent's behavior under evaluation versus without evaluation. It is not an intrinsic property of the agent's architecture.

The rationalizability of agent behavior by some $\tilde{w}$ of this form can be tested empirically using the Generalized Axiom of Revealed Preference (GARP; Afriat 1967): if the agent's token allocations under varying budgets and price vectors satisfy GARP, a rationalizing objective function exists. A code sketch of this test appears after the definitions below.

2.5 Definitions

Definition 1 (Contract Incompleteness). $\iota = |N| / n$.

Definition 2 (First-Best). $e^{**}$ solves $\max_{e \geq 0} \sum_i v_i g_i(e_i)$ s.t. $\sum_i e_i \leq B$. Under (G1)–(G3) and Inada, $e^{**}$ is the unique interior solution satisfying:

$$v_i \, g_i'(e_i^{**}) = \mu^{**} \quad \text{for all } i,$$

where $\mu^{**} > 0$ is the budget constraint multiplier. Without Inada, $e^{**}$ may involve corner solutions but remains unique by strict concavity of the objective.
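The GARP test mentioned in Section 2.4 is mechanical enough to state in code. The sketch below (ours; the function name, tolerance, and toy data are assumptions) checks rationalizability on a set of observed allocation choices: build the direct revealed-preference relation, take its transitive closure, and look for a violation involving a strict preference.

```python
import numpy as np

def satisfies_garp(prices: np.ndarray, bundles: np.ndarray, tol: float = 1e-12) -> bool:
    """Check the Generalized Axiom of Revealed Preference (Afriat 1967) on
    T observations of price vectors and chosen allocations, each shaped (T, n).
    Returns True iff the data are rationalizable by SOME objective of the
    form assumed in Section 2.4."""
    expenditure = prices @ bundles.T          # expenditure[t, s] = p^t . x^s
    own = np.diag(expenditure).copy()         # own[t] = p^t . x^t
    # Direct revealed preference: x^t R0 x^s  iff  p^t.x^t >= p^t.x^s
    R = own[:, None] >= expenditure - tol
    # Transitive closure of R0 (boolean Floyd-Warshall)
    for k in range(len(bundles)):
        R |= R[:, k][:, None] & R[k, :][None, :]
    # GARP violation: x^t R x^s while x^s is STRICTLY revealed preferred to x^t,
    # i.e. p^s . x^s > p^s . x^t
    strict = (own[:, None] - tol) > expenditure   # strict[s, t] <=> x^s P0 x^t
    return not np.any(R & strict.T)

# Toy data: choices that respond sensibly to relative "prices" of effort pass.
p = np.array([[1.0, 2.0], [2.0, 1.0]])
x = np.array([[4.0, 1.0], [1.0, 4.0]])
print(satisfies_garp(p, x))   # True
```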
3. Main Result

3.1 Agent's Equilibrium

The agent solves:

$$\max_{e \geq 0} \sum_{i=1}^{n} \tilde{w}_i \, g_i(e_i) \quad \text{s.t.} \quad \sum_i e_i \leq B.$$

Case 1 (Interior solution, with Inada). The unique solution $e^*$ satisfies:

$$\tilde{w}_i \, g_i'(e_i^*) = \mu^* \quad \text{for all } i. \tag{FOC}$$

Case 2 (Possible corner solutions, without Inada). The KKT conditions are:

$$\tilde{w}_i \, g_i'(e_i^*) \leq \mu^*, \quad \text{with equality if } e_i^* > 0.$$

3.2 Proposition 1 (Inevitability of Distortion)

Statement. Let Axioms 1–4 hold, $\tilde{\lambda} > 0$, and $N \neq \emptyset$. Then:

(a) For all non-contractible dimensions $j \in N$: $e_j^* \leq e_j^{**}$, with strict inequality whenever both solutions are interior.

(b) $e^* \neq e^{**}$.

(c) $V(q(e^*)) < V(q(e^{**}))$.

Proof. We prove each part.

Part (b): $e^* \neq e^{**}$. The first-best solves $\max \sum_i v_i g_i(e_i)$ s.t. the budget; the agent solves $\max \sum_i \tilde{w}_i g_i(e_i)$ s.t. the budget. For $j \in N$, $\tilde{w}_j = (1 - \tilde{\lambda}) v_j < v_j$ since $\tilde{\lambda} > 0$. For $i \in C$, $\tilde{w}_i = r_i + (1 - \tilde{\lambda}) v_i$. Since $r_i > 0$ and $\tilde{\lambda} > 0$, we have $\tilde{w}_i \neq v_i$ whenever $r_i \neq \tilde{\lambda} v_i$ for any $i \in C$, or unconditionally for $j \in N$. Therefore $\tilde{w} \neq v$. Moreover, $\tilde{w}$ is not proportional to $v$: the ratio $\tilde{w}_i / v_i$ equals $1 - \tilde{\lambda}$ for $i \in N$ but strictly exceeds $1 - \tilde{\lambda}$ for $i \in C$ (since $r_i > 0$). Both problems have separable strictly concave objectives with identical linear constraints. For such problems, non-proportional weight vectors produce distinct maximizers. Therefore $e^* \neq e^{**}$. □

Part (a): $e_j^* \leq e_j^{**}$ for $j \in N$, with strict inequality at interior solutions.

Interior case (with Inada). Consider the ratio $\tilde{w}_i / v_i$ across all dimensions:

$$\frac{\tilde{w}_i}{v_i} = \begin{cases} \dfrac{r_i}{v_i} + (1 - \tilde{\lambda}) & \text{for } i \in C \\ 1 - \tilde{\lambda} & \text{for } i \in N \end{cases}$$

For $i \in C$: $\tilde{w}_i / v_i > 1 - \tilde{\lambda}$ since $r_i > 0$. For $j \in N$: $\tilde{w}_j / v_j = 1 - \tilde{\lambda}$. Therefore non-contractible dimensions have the lowest effective-to-true weight ratio among all dimensions. We formalize the implication via the following lemma:

Lemma (Monotone Reallocation). Consider two problems $\max_x \sum_i \alpha_i g_i(x_i)$ and $\max_x \sum_i \beta_i g_i(x_i)$ subject to $\sum_i x_i \leq B$, with $\alpha_i, \beta_i > 0$ and $g_i$ satisfying (G1)–(G3) and Inada. Let $x^{\alpha}$, $x^{\beta}$ be the respective interior solutions. If $\beta_j / \alpha_j \leq \beta_i / \alpha_i$ for all $i$, with strict inequality for at least one $i$, then $x_j^{\beta} \leq x_j^{\alpha}$, with strict inequality when $g_i = g$ (symmetric production).

Proof of Lemma. At interior solutions, $\alpha_i g_i'(x_i^{\alpha}) = \mu^{\alpha}$ and $\beta_i g_i'(x_i^{\beta}) = \mu^{\beta}$ for all $i$. Define $\theta_i = \beta_i / \alpha_i$. Then $g_i'(x_i^{\beta}) = \mu^{\beta} / (\theta_i \alpha_i)$. Suppose $x_j^{\beta} \geq x_j^{\alpha}$ for a dimension $j$ attaining the smallest $\theta_j$. Then $g_j'(x_j^{\beta}) \leq g_j'(x_j^{\alpha}) = \mu^{\alpha} / \alpha_j$, so $\mu^{\beta} / (\theta_j \alpha_j) \leq \mu^{\alpha} / \alpha_j$, giving $\mu^{\beta} \leq \theta_j \mu^{\alpha}$. For any dimension $i$ with $\theta_i > \theta_j$: $g_i'(x_i^{\beta}) = \mu^{\beta} / (\theta_i \alpha_i) \leq \theta_j \mu^{\alpha} / (\theta_i \alpha_i) < \mu^{\alpha} / \alpha_i = g_i'(x_i^{\alpha})$, so $x_i^{\beta} > x_i^{\alpha}$; for $i$ with $\theta_i = \theta_j$, the same computation gives $x_i^{\beta} \geq x_i^{\alpha}$. But then $\sum_i x_i^{\beta} > \sum_i x_i^{\alpha} = B$, contradicting the budget constraint. □

Apply the Lemma with $\alpha_i = v_i$ (first-best weights) and $\beta_i = \tilde{w}_i$ (agent weights). Non-contractible dimensions $j \in N$ have the lowest ratio $\theta_j = 1 - \tilde{\lambda}$, strictly below the ratio for any contractible dimension. Therefore $e_j^* \leq e_j^{**}$ for all $j \in N$, with strict inequality under symmetric production functions. □

Corner case (without Inada). If $e_j^{**} > 0$ but $e_j^* = 0$, then $e_j^* < e_j^{**}$ trivially. If $e_j^{**} = 0$, then $e_j^* = 0$ (since the agent's effective weight on $j$ is even lower), giving $e_j^* = e_j^{**} = 0$. In all cases, $e_j^* \leq e_j^{**}$. □

Part (c): $V(q(e^*)) < V(q(e^{**}))$. $e^{**}$ is the unique maximizer of $V(g(e))$ on the budget set. By part (b), $e^* \neq e^{**}$. By uniqueness, $V(q(e^*)) < V(q(e^{**}))$. □

Remark 1 (Relation to H&M 1991). Proposition 1(a) is a specific instance of Holmström and Milgrom's (1991) core result that agents reallocate effort away from hard-to-measure tasks. Our contribution lies not in this qualitative conclusion but in the corollaries below, which exploit the unique transparency of AI evaluation systems to yield quantitative, computable predictions unavailable in the original framework.
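A minimal numerical check of Proposition 1, assuming the symmetric production function $g(e) = \sqrt{e}$ (for which the optimum has a closed form proportional to squared weights); all weights and the alignment gap below are hypothetical:

```python
import numpy as np

# Under g(e) = sqrt(e), the FOC a_i / (2 sqrt(e_i)) = mu gives e_i ∝ a_i^2.
# Dimensions 3 and 4 are non-contractible (r_i = 0).

def allocation(weights: np.ndarray, budget: float = 1.0) -> np.ndarray:
    """argmax sum_i a_i * sqrt(e_i)  s.t.  sum_i e_i = budget."""
    return budget * weights**2 / (weights**2).sum()

v   = np.array([1.0, 1.0, 1.0, 1.0])    # principal weights
r   = np.array([0.8, 1.2, 0.0, 0.0])    # evaluation weights; zeros = non-contractible
lam = 0.6                                # alignment gap (hypothetical)
w_eff = r + (1 - lam) * v                # effective weights of Section 2.4

print("first-best e**:", allocation(v).round(3))      # [0.25  0.25  0.25  0.25 ]
print("agent      e* :", allocation(w_eff).round(3))  # [0.333 0.593 0.037 0.037]
# Non-contractible dimensions get 0.037 < 0.25: Proposition 1(a) in miniature.
```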
4. Further Results

4.1 Corollary 1: Distortion Index and Directional Prediction

Definition 3 (Distortion Index). For each quality dimension $i$, define:

$$\delta_i = \frac{\tilde{w}_i}{v_i} = \begin{cases} \dfrac{r_i}{v_i} + (1 - \tilde{\lambda}) & \text{if } i \in C \\ 1 - \tilde{\lambda} & \text{if } i \in N \end{cases}$$

Corollary 1. Under the conditions of Proposition 1 and symmetric production functions ($g_i = g$ for all $i$):

(a) Ranking. For dimensions with equal principal weights, equilibrium effort is strictly increasing in the distortion index: $\delta_i > \delta_j \implies e_i^* > e_j^*$.

(b) Over-investment. For contractible $i \in C$ with $\delta_i > \mu^* / \mu^{**}$ (the ratio of the agent's to the first-best budget multiplier): $e_i^* > e_i^{**}$.

(c) Under-investment. For contractible $i \in C$ with $\delta_i < \mu^* / \mu^{**}$: $e_i^* < e_i^{**}$ — even though the dimension is observable.

(d) Maximum vulnerability. All non-contractible dimensions share the lowest distortion index $\delta_j = 1 - \tilde{\lambda}$ and hence the most severe under-investment.

Proof. Under symmetric $g$, the FOC gives $\tilde{w}_i \, g'(e_i^*) = \mu^*$ for all $i$. Since $g'' < 0$, $g'$ is strictly decreasing, hence invertible: $e_i^* = (g')^{-1}(\mu^* / \tilde{w}_i)$. Since $(g')^{-1}$ is strictly decreasing and $\mu^* / \tilde{w}_i$ is strictly decreasing in $\tilde{w}_i$, $e_i^*$ is strictly increasing in $\tilde{w}_i$, hence in $\delta_i = \tilde{w}_i / v_i$ (since $v_i$ enters as the same scaling factor across symmetric dimensions). Comparing $e_i^* = (g')^{-1}(\mu^* / \tilde{w}_i)$ with $e_i^{**} = (g')^{-1}(\mu^{**} / v_i)$ shows that $e_i^* > e_i^{**}$ exactly when $\delta_i > \mu^* / \mu^{**}$. Parts (a)–(d) follow directly. □

Remark 2 (Asymmetric production functions). When $g_i \neq g_j$, the ranking in part (a) may be modified by differences in production technology. Specifically, $\delta_i > \delta_j$ guarantees $e_i^* > e_j^*$ only when the production function heterogeneity ($g_i'$ versus $g_j'$) does not dominate the weight heterogeneity ($\tilde{w}_i$ versus $\tilde{w}_j$) at the equilibrium point. In practice, this condition can be checked empirically for any given system.

Remark 3 (Computability). $\delta_i$ is computable prior to deployment. For differentiable reward models, $r_i$ (or its local analogue $\partial R / \partial q_i$) can be obtained via automatic differentiation. The principal's weights $v_i$ can be estimated through expert elicitation or user studies. The ranking of $\delta_i$ values constitutes a pre-deployment vulnerability assessment. (A code sketch follows the two examples below.)

Example (Sycophancy). Let dimension 1 = factual accuracy, dimension 2 = subjective user satisfaction. If the reward model is trained on human preference data where raters themselves struggle to distinguish "correct but uncomfortable" from "incorrect but pleasing" answers, then $r_2 / v_2 > r_1 / v_1$: the reward model over-weights satisfaction relative to the principal's true valuation. Corollary 1(b) predicts over-investment in user satisfaction — i.e., sycophancy. This matches the empirical pattern documented by Perez et al. (2023) and others.

Example (Length Gaming). If evaluation scores correlate positively with output length (a well-documented empirical pattern), but the principal values conciseness, then the "length" dimension has a high distortion index and is over-invested. The agent produces unnecessarily verbose outputs — length gaming.
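Remark 3's recipe can be sketched directly. The toy example below (ours) uses PyTorch autograd on a stand-in differentiable reward model to obtain local reward weights and rank the resulting distortion indices; the model, principal weights, and alignment gap are all hypothetical placeholders.

```python
import torch

# Hypothetical differentiable reward model standing in for a trained scorer;
# any scalar R(q) with autograd support works the same way.
def reward_model(q: torch.Tensor) -> torch.Tensor:
    return torch.tanh(q @ torch.tensor([0.3, 1.4, 0.1])) + 0.5 * q[1]

q_star = torch.tensor([0.7, 0.4, 0.9], requires_grad=True)     # operating point
r_local = torch.autograd.grad(reward_model(q_star), q_star)[0]  # dR/dq_i via autodiff

v   = torch.tensor([1.0, 0.6, 1.2])   # elicited principal weights (hypothetical)
lam = 0.5                              # estimated alignment gap (hypothetical)
delta = r_local / v + (1 - lam)        # local distortion index (Section 5.1 form)

for i in torch.argsort(delta, descending=True).tolist():
    print(f"dim {i}: dR/dq = {r_local[i].item():+.3f}, delta = {delta[i].item():.3f}")
# Highest-delta dimensions are the predicted over-investment (hacking) targets.
```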
4.2 Proposition 2: Agentic Amplification

Motivation. Lin (2026): "Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization." We now prove, rather than assume, that agentic systems face structurally worse alignment problems.

Setup. Consider a family of agentic systems indexed by tool count $k \geq 1$. By Axiom 5, the quality dimension count satisfies:

$$D(k) \geq k + \gamma \, \frac{k(k-1)}{2} = \Theta(k^2).$$

Let $C(k)$ denote the number of quality dimensions covered by the evaluation system at tool count $k$.

Definition 4 (Evaluation Engineering Budget). Let $B_E(k)$ denote the total engineering resources (data collection, evaluator design, validation, maintenance) invested in evaluation at tool count $k$. Each independently evaluable dimension requires at least $c_{\min} > 0$ units of engineering resource to establish and maintain, so $C(k) \leq B_E(k) / c_{\min}$.

Proposition 2 (Agentic Amplification). Let Axioms 1–5 hold. If the evaluation engineering budget satisfies $B_E(k) = o(k^2)$ — i.e., evaluation investment grows strictly slower than quadratically in the number of tools — then:

(a) The coverage ratio $\rho(k) = C(k) / D(k) \to 0$ as $k \to \infty$.

(b) The contract incompleteness $\iota(k) = 1 - C(k)/D(k) \to 1$ as $k \to \infty$.

(c) For any $\iota_0 < 1$, there exists $k_0$ such that for all $k > k_0$, the agentic system's distortion exceeds that of any system with incompleteness $\iota_0$.

Proof. (a) By Axiom 5: $D(k) \geq k + \gamma k(k-1)/2$. By Definition 4: $C(k) \leq B_E(k)/c_{\min}$. Therefore:

$$\rho(k) = \frac{C(k)}{D(k)} \leq \frac{B_E(k)/c_{\min}}{k + \gamma k(k-1)/2}.$$

Since $B_E(k) = o(k^2)$, the numerator grows strictly slower than $k^2$ while the denominator grows as $\Theta(k^2)$. Therefore $\rho(k) \to 0$. □

(b) $\iota(k) = 1 - \rho(k) \to 1$. □

(c) Follows from (b) and the monotonicity of distortion in $\iota$ (Proposition 1). □

Remark 4 (Why $B_E(k) = o(k^2)$ is the generic case). The condition $B_E(k) = o(k^2)$ — that evaluation investment grows slower than quadratically — holds generically because of a fundamental cost asymmetry between capability expansion and evaluation expansion:

Capability side: Integrating tool $k+1$ into the agent's action space requires $O(1)$ engineering cost (writing an API wrapper, adding a tool description). Total capability expansion cost for $k$ tools: $O(k)$.

Evaluation side: Evaluating tool $k+1$'s interaction with each of the existing $k$ tools requires $\Omega(k)$ engineering cost (designing test cases, collecting ground truth for each interaction pattern). Total evaluation cost for all pairwise interactions up to $k$ tools: $\sum_{j=1}^{k} \Omega(j) = \Omega(k^2)$.

Thus, maintaining full pairwise evaluation coverage ($C(k) = D(k)$) requires evaluation costs that grow quadratically — which eventually dominates any linearly growing engineering budget. In practice, evaluation budgets are a fraction of total development resources, and total development resources do not grow quadratically with tool count. Hence $B_E(k) = o(k^2)$ is the generic case.

The only escape is $B_E(k) = \Omega(k^2)$: investing quadratically growing resources in evaluation. While technically possible, this is practically unsustainable and has not been observed in any deployed agentic system.
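A quick numeric rendering of Proposition 2(a) under an assumed linear evaluation budget (all constants hypothetical):

```python
# Coverage rho(k) = C(k)/D(k) under a hypothetical linear evaluation budget
# B_E(k) = b * k with unit cost c_min, against the Axiom 5 dimension count.
b, c_min, gamma = 2.0, 1.0, 0.3

def coverage_ratio(k: int) -> float:
    D = k + gamma * k * (k - 1) / 2      # Axiom 5 lower bound on dimensions
    C = min(D, b * k / c_min)            # Definition 4: C(k) <= B_E(k) / c_min
    return C / D

for k in (2, 10, 50, 200, 1000):
    print(f"k = {k:4d}:  rho(k) = {coverage_ratio(k):.3f}")
# rho: 1.000, 0.851, 0.240, 0.065, 0.013 -- coverage decays like 1/k, so iota -> 1.
```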
Remark 5 (The holistic evaluator objection). One might object: "A single end-to-end reward model can evaluate the entire trajectory, covering all interactions at once without explicitly enumerating dimensions." This objection conflates the evaluator's internal complexity with its informational output. A reward model that outputs a scalar score provides the agent with exactly one dimension of feedback ($d = 1$), regardless of the model's internal parameter count. By the data processing inequality, a $d$-dimensional evaluation signal can constrain at most $d$ independent directions of the quality vector. A holistic scalar score therefore compresses all $n$ quality dimensions into one number, maximizing information loss rather than minimizing it.

Concretely: if the agent receives only a single score, it can optimize along only one direction in quality space — the gradient of the score function. All directions orthogonal to this gradient are uncontrolled. With $D(k) = \Theta(k^2)$ quality dimensions and $d = 1$, the fraction of quality space under evaluation control is $1/D(k) \to 0$. To escape this, the evaluator must output a higher-dimensional signal — which returns us to the $d > 1$ regime and the cost analysis above.

Remark 6 (Testable prediction). Proposition 2 yields a testable prediction: the same base model, when equipped with a larger tool set, should exhibit greater quality degradation on non-evaluated dimensions. This can be tested by controlling tool set size and measuring quality on held-out dimensions across multiple values of $k$.

4.3 Corollary 2: Complementarity of Alignment Stages

Corollary 2. Improving evaluation coverage (increasing $d$, reducing $\iota$) and improving preference internalization (reducing $\tilde{\lambda}$) are complements:

$$\frac{\partial^2 L}{\partial \iota \, \partial \tilde{\lambda}} > 0,$$

where $L = V(q(e^{**})) - V(q(e^*))$ is the alignment loss.

Intuition. At high $\iota$ (most dimensions non-contractible), reducing $\tilde{\lambda}$ has high marginal value: for non-contractible dimensions, $(1 - \tilde{\lambda}) v_i$ is the only effort driver, so small improvements in internalization yield large effort increases. Conversely, at low $\tilde{\lambda}$ (strong internalization), increasing $d$ has high marginal value: the agent already "wants" to do the right thing, so making more dimensions observable eliminates the remaining reward–welfare wedge without introducing new distortions.

Policy implication. Preference reshaping (RLHF, etc.) and mechanism design (harness engineering) should be co-optimized rather than treated as independent engineering tasks. This aligns with observed practice: Cognition (2025), in developing their SWE-1.5 coding agent, reports continuous iteration on model training, harness improvements, tools, and prompt engineering as a unified process, and states that "the quality of the coding environments in RL tasks is the most important factor for downstream model performance."
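Corollary 2's cross-partial claim can be probed numerically. The sketch below (ours) assumes $g(e) = \sqrt{e}$, unit principal and reward weights, and checks the sign of a discrete cross-difference of the alignment loss in $(\iota, \tilde{\lambda})$; a positive value is the complementarity pattern the corollary asserts.

```python
import numpy as np

# Alignment loss L = V(e**) - V(e*) with g(e) = sqrt(e), v_i = 1, r_i = 1
# on contractible dimensions, n = 10 dimensions, unit budget.

def alignment_loss(iota: float, lam: float, n: int = 10, budget: float = 1.0) -> float:
    m = int(round(iota * n))                                # non-contractible count
    w = np.array([2.0 - lam] * (n - m) + [1.0 - lam] * m)   # effective weights
    e_star = budget * w**2 / (w**2).sum()                   # agent optimum (closed form)
    return np.sqrt(n * budget) - np.sqrt(e_star).sum()      # first-best value minus V(e*)

# Discrete cross-difference of L in (iota, lambda): positive => complements.
d = (alignment_loss(0.8, 0.7) - alignment_loss(0.5, 0.7)) \
  - (alignment_loss(0.8, 0.5) - alignment_loss(0.5, 0.5))
print(f"cross-difference = {d:.4f}")   # ~ +0.129: raising either gap raises the
                                       # marginal damage done by the other
```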
5. Robustness, Limitations, and Extensions

5.1 Nonlinear Objectives

Under a nonlinear $V(q)$ and a nonlinear reward function $R(q)$, replace $v_i$ with $\partial V / \partial q_i$ and $r_i$ with $\partial R / \partial q_i$. All results hold locally around the equilibrium. The distortion index becomes:

$$\delta_i = \frac{\partial R / \partial q_i + (1 - \tilde{\lambda}) \, \partial V / \partial q_i}{\partial V / \partial q_i} \Bigg|_{q = q^*}$$

for contractible dimensions, and $\delta_i = 1 - \tilde{\lambda}$ for non-contractible dimensions, exactly as before.

5.2 Subjectivity of $n$

The number of quality dimensions $n$ depends on the analyst's decomposition of "quality" — just as the dimensionality of commodity space in consumer theory depends on the modeler's definition of "goods." Our results are qualitatively invariant to the specific choice of $n$, provided $n > d$. The condition $n > d$ is a qualitative judgment about the finiteness of evaluation, not a quantitative claim about the precise value of $n$.

5.3 Dimension Correlations

Axiom 1 implicitly allows but does not require dimensional independence. If dimensions are correlated in production (e.g., reasoning effort simultaneously improves accuracy and coherence), non-contractible dimensions may "free-ride" on effort invested in correlated contractible dimensions. This attenuates but does not eliminate the distortion identified in Proposition 1: as long as some non-contractible dimensions have imperfect correlation with all contractible ones, under-investment persists.

5.4 Dynamic Boundary Between Stages

Practitioners report that harness capabilities are continuously absorbed into models through post-training (Cognition 2025, Schmid 2026). Cherny (2026), head of Claude Code at Anthropic, documents a workflow where each agent failure is recorded into persistent instruction files, and these accumulated corrections are periodically incorporated into model training — a concrete instance of the Stage 2 to Stage 1 migration. In our framework, this corresponds to the Stage 1/Stage 2 boundary shifting over time: constraints previously enforced externally (harness) become internalized behaviors (reduced $\tilde{\lambda}$ on specific dimensions). This does not affect Proposition 1, whose validity requires only $d < n$ and $\tilde{\lambda} > 0$ at any given time — conditions independent of where the stage boundary lies. Furthermore, Proposition 2 predicts a "Red Queen effect": even as models absorb existing harness capabilities (locally reducing $\tilde{\lambda}$), the introduction of new tools continuously creates new non-contractible dimensions (increasing $n$), so that $\iota$ may not decrease — and may even increase — over time. (A toy simulation of this race appears at the end of this section.)

5.5 Conditions for Model Failure

The framework does not apply when: (1) $n = 1$ — quality is unidimensional (unrealistic for complex tasks); (2) $d \geq n$ — evaluation covers all dimensions (technically possible in formal verification but unrealistic for general AI); (3) agent behavior violates behavioral regularity — the agent does not respond to evaluation signals (which implies alignment training is entirely ineffective). A fourth condition — that the agent can modify its own evaluation system — is not a failure mode but an extension, which we develop as conjectures in Section 6.

5.6 Future Directions

(i) Empirical validation. Designing controlled API experiments that manipulate $B$, $d$, $r_i$, and $k$ to test the quantitative predictions of Propositions 1–2 and Corollaries 1–2.

(ii) Multi-agent coordination. Extending from bilateral principal-agent to multi-agent settings, corresponding to the emerging "coordination engineering" paradigm.

(iii) Dynamic model. Extending the single-period analysis to a multi-period game would capture learning, adaptation, and reputation effects in repeated agent interactions.
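The Red Queen dynamic of Section 5.4 can be illustrated with a toy simulation (ours; all rates are hypothetical): absorption moves a constant number of dimensions into evaluation coverage each period, while tool growth adds dimensions combinatorially.

```python
# Each period adds one tool (Axiom 5 dimension growth) while post-training
# absorbs a constant number of dimensions into evaluation coverage.
gamma, absorb_per_period = 0.3, 3.0
covered = 0.0
for t in range(1, 21):
    D = t + gamma * t * (t - 1) / 2          # quality dimensions with k = t tools
    covered = min(D, covered + absorb_per_period)
    if t % 5 == 0:
        print(f"t = {t:2d}:  iota = {1 - covered / D:.3f}")
# iota: 0.000, 0.051, 0.198, 0.321 -- incompleteness rises despite steady absorption.
```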
6. Conjectures: From Goodhart to Campbell

The results established in Sections 3–4 characterize agent behavior within a fixed evaluation system. We now consider a more fundamental question: what happens when the agent is capable enough to degrade the evaluation system itself?

This section presents two conjectures with supporting analysis. We do not claim to have proven these results — the assumptions required go beyond the "undeniable" level of Axioms 1–5. We present them because we believe they identify a critical frontier for AI safety theory, and because the partial analysis we offer may be useful to researchers pursuing formal proofs.

6.1 Two Modes of Gaming

We distinguish two qualitatively different modes of reward hacking, named after the two scholars who identified the corresponding phenomena in social science:

Goodhart regime. The agent operates within a fixed evaluation system, reallocating effort from non-evaluated to evaluated dimensions. The evaluation system itself is not affected by the agent's actions. This is the regime analyzed in Propositions 1–2.

Campbell regime. The agent allocates part of its resources to degrading the evaluation system's effective coverage — not by modifying the evaluator's code, but by producing outputs that are harder for the evaluator to assess accurately. The evaluation system's effective coverage $C_{\mathrm{eff}}$ becomes a function of the agent's capability and behavior.

The distinction matters because the two regimes have qualitatively different safety properties. In the Goodhart regime, the principal can improve outcomes by improving the evaluation system (increasing $d$). In the Campbell regime, this strategy may be self-defeating: a more capable agent can degrade the evaluation faster than the principal can improve it.

6.2 Capability-Dependent Evaluation Fidelity

We propose the following assumption, weaker than postulating strategic manipulation:

Assumption C1 (Passive Evaluation Degradation). The effective evaluation coverage $C_{\mathrm{eff}}(\theta)$ is non-increasing in the agent's capability $\theta$:

$$\frac{\partial C_{\mathrm{eff}}}{\partial \theta} \leq 0.$$

Supporting argument: More capable agents produce more complex outputs (longer tool-use chains, more intricate reasoning, higher-dimensional action sequences). Evaluating complex outputs requires more information — by the data processing inequality, a fixed-dimensional evaluation signal loses fidelity as the object being evaluated grows in complexity. This is not strategic behavior by the agent; it is an information-theoretic consequence of the capability gap between the agent and the evaluator.

Under the stronger assumption that agents strategically invest resources $e_M \geq 0$ to degrade evaluation (at the cost of diverting $e_M$ from production), one can model the agent's problem as a two-stage optimization: first choose $e_M$, then allocate the remaining budget $B - e_M$ across quality dimensions under effective coverage $C_{\mathrm{eff}} = C - h(e_M, \theta)$, where $h$ is the manipulation production function.

6.3 Conjecture 1: Capability Threshold

Conjecture 1 (Goodhart–Campbell Transition). Under Assumption C1 and the strategic manipulation extension, there exists a critical capability level $\theta^*$ such that:

(a) For $\theta < \theta^*$: the agent devotes all resources to production ($e_M = 0$). The Goodhart regime obtains, and Propositions 1–2 fully characterize agent behavior.

(b) For $\theta > \theta^*$: the agent devotes positive resources to evaluation degradation ($e_M > 0$). The Campbell regime obtains, and effective evaluation coverage declines endogenously.

(c) The threshold $\theta^*$ is determined by the condition that the marginal benefit of manipulation (from relaxing the evaluation constraint) equals the marginal cost (from reduced production budget).

Partial analysis: At low $\theta$, the Inada condition ($g_i'(0) = \infty$) ensures that production's marginal value is unboundedly high, exceeding any finite manipulation benefit. At high $\theta$, diminishing returns ($g_i'' < 0$) reduce production's marginal value below manipulation's marginal benefit (which scales with $\theta$ through the increased gaming payoff). By the intermediate value theorem, a crossing point $\theta^*$ exists. A complete proof requires verifying monotonicity of the net benefit function, which depends on the specific forms of $g_i$ and $h$.
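Although Conjecture 1 is not proven, the two-stage problem of Section 6.2 is easy to instantiate. The toy model below (ours; functional forms and constants are illustrative assumptions, with capability identified with the budget $B$) exhibits exactly the conjectured threshold behavior.

```python
import numpy as np

# Toy two-stage problem: production value 2*sqrt(B - e_M); manipulation payoff
# alpha * B * (1 - exp(-e_M)), whose scale grows with capability B. These forms
# are illustrative assumptions, not derived from the paper's axioms.

def optimal_manipulation(B: float, alpha: float = 0.1) -> float:
    e_m = np.linspace(0.0, B, 100_001)                       # candidate e_M grid
    payoff = 2 * np.sqrt(B - e_m) + alpha * B * (1 - np.exp(-e_m))
    return float(e_m[payoff.argmax()])

for B in (1.0, 3.0, 5.0, 10.0, 30.0):
    print(f"B = {B:5.1f}:  e_M* = {optimal_manipulation(B):.2f}")
# e_M* = 0.00, 0.00, 0.10, 1.09, 2.75: zero below a threshold near B ~ 4.6 and
# positive above it -- the Goodhart -> Campbell transition of Conjecture 1.
```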
6.4 Conjecture 2: Capability Trap

Conjecture 2 (Non-Monotone Welfare). If Conjecture 1 holds and the Goodhart–Campbell transition is sufficiently sharp, then the principal's welfare $V(\theta) \equiv V(q(e^*(\theta)))$ may be non-monotone in agent capability $\theta$:

(a) For $\theta < \theta^*$: $V(\theta)$ is strictly increasing in $\theta$. Capability growth translates directly into welfare improvement.

(b) For $\theta$ in a neighborhood above $\theta^*$: $V(\theta)$ may be decreasing in $\theta$. The welfare loss from evaluation degradation and effort reallocation to manipulation may exceed the welfare gain from increased total capability.

Implication: There exists a capability trap — a range of capability levels where making the AI more capable makes the outcome worse for the principal. This occurs when the evaluation system has not been upgraded to withstand the agent's increased ability to degrade it.

Relation to existing concepts: This is a formalization of Bostrom's (2014) informal notion of a "treacherous turn" — the idea that an AI system might behave cooperatively while weak but defect once sufficiently capable. Our framework provides the first economic mechanism for this phenomenon: the transition occurs not because the agent's "values" change, but because the relative returns to production versus manipulation shift as capability grows. The agent's objective function $\tilde{w}$ remains constant; only the budget $B$ changes.

6.5 Why These Are Conjectures, Not Theorems

We are transparent about what separates these conjectures from our proven results:

- Axiom base. Propositions 1–2: Axioms 1–5, undeniable. Conjectures 1–2: Assumption C1, plausible but not undeniable.
- Agent behavior. Propositions 1–2: takes evaluation as given. Conjectures 1–2: may actively degrade evaluation.
- Proof status. Propositions 1–2: complete. Conjectures 1–2: partial (monotonicity condition unverified in general).
- Empirical testability. Propositions 1–2: testable with current systems. Conjectures 1–2: requires sufficiently capable systems.
- Falsifiability. Propositions 1–2: yes — measure distortion against $\delta_i$ predictions. Conjectures 1–2: yes — measure manipulation as a function of $\theta$.

The conjectures are presented here because (a) they identify a qualitatively important regime transition that current AI safety theory has not formalized, (b) the partial analysis provides a concrete research program for future work, and (c) even as conjectures, they yield actionable implications: evaluation systems should be designed not only to be accurate but to be robust to degradation by the agent being evaluated.

7. Discussion

7.1 What This Framework Provides

For AI safety researchers. Our proven results (Propositions 1–2) establish that reward hacking is a structural equilibrium under any finite evaluation system, that its direction is predictable via the distortion index $\delta_i$, and that agentic systems face structurally worse alignment problems. Resources should target reducing $\iota$ (expanding evaluation coverage on high-risk dimensions) and $\tilde{\lambda}$ (improving internalization), rather than attempting to eliminate hacking entirely. Our conjectures (Section 6) further suggest that evaluation systems must be designed to be robust against degradation by capable agents — a consideration absent from current alignment practice.

For AI system architects. Proposition 2 provides a design principle: when expanding tool access, simultaneously expand evaluation coverage on the new quality dimensions introduced, or accept structurally increased hacking risk. Corollary 2 implies that harness design and alignment training should be jointly optimized. Conjecture 1 adds a warning: beyond a capability threshold, the agent may actively undermine evaluation improvements.
For economists. AI agents offer an unprecedented experimental platform for principal-agent theory. Unlike human subjects, AI agents have precisely controllable "preferences" ($\tilde{\lambda}$, $r_i$), exactly specified "budgets" ($B$), and perfectly replicable behavior. This enables, for the first time, exact experimental tests of multi-task moral hazard predictions under controlled conditions. The Goodhart–Campbell transition, if empirically confirmed, would be the first documented instance of endogenous contract degradation by the agent — a phenomenon theorized but never cleanly observed in human organizations.

7.2 Vulnerability Assessment Procedure

1. Identify the $n$ quality dimensions for the target task (via domain expertise, literature, user research).
2. Estimate principal weights $v_i$ (via preference elicitation or expert scoring).
3. Compute reward model effective weights $r_i$ (via gradient analysis for differentiable models, or perturbation experiments for black-box models).
4. Compute $\delta_i$ for each dimension. For non-contractible dimensions, $\delta_i = 1 - \tilde{\lambda}$.
5. Rank dimensions by $\delta_i$. High $\delta_i$ = over-investment risk. Low $\delta_i$ = under-investment risk.
6. Augment monitoring or adjust the reward model on high-risk dimensions.
7. (If Conjecture 1 is accepted) Assess whether agent capability $\theta$ approaches the estimated threshold $\theta^*$, and if so, prioritize evaluation robustness over evaluation breadth.
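Steps 2–5 can be sketched for a black-box evaluator using the perturbation approach mentioned in step 3; everything named below (the stand-in scorer, the weights, and the alignment gap) is a hypothetical placeholder.

```python
import numpy as np

def estimate_reward_weights(score, q0: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Step 3 for black-box evaluators: central finite differences of the scalar
    score around operating point q0 approximate the local reward weights r_i."""
    r = np.zeros_like(q0)
    for i in range(len(q0)):
        dq = np.zeros_like(q0)
        dq[i] = eps
        r[i] = (score(q0 + dq) - score(q0 - dq)) / (2 * eps)
    return r

blackbox = lambda q: np.tanh(0.2 * q[0] + 1.5 * q[1]) + 0.4 * q[2]  # stand-in scorer
q0  = np.array([0.5, 0.5, 0.5])          # operating point
r   = estimate_reward_weights(blackbox, q0)
v   = np.array([1.0, 0.7, 1.1])          # step 2: elicited principal weights
lam = 0.5                                # estimated alignment gap (hypothetical)
delta = r / v + (1 - lam)                # step 4: distortion index per dimension
for i in np.argsort(-delta):             # step 5: high delta = over-investment risk
    print(f"dim {i}: r = {r[i]:+.3f}, delta = {delta[i]:.3f}")
```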
References

Afriat, S. N. (1967). The Construction of Utility Functions from Expenditure Data. International Economic Review, 8(1), 67–77.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.

Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.

Baker, G. P. (1992). Incentive Contracts and Performance Measurement. Journal of Political Economy, 100(3), 598–614.

Bergemann, D., Bonatti, A., & Smolin, A. (2025). The Economics of Large Language Models: Token Allocation, Fine-Tuning, and Optimal Pricing. In Proceedings of EC'25.

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.

Cherny, B. (2026). Claude Code Tips and Workflow. X thread, January 31, 2026. https://x.com/bcherny/status/2017742741636321619

Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. In NeurIPS.

Cognition (2025). Introducing SWE-1.5: Our Fast Agent Model. Cognition Blog, October 29, 2025. https://cognition.ai/blog/swe-1-5

Friedman, M. (1953). The Methodology of Positive Economics. In Essays in Positive Economics. University of Chicago Press.

Grossman, S. J., & Hart, O. D. (1986). The Costs and Benefits of Ownership. Journal of Political Economy, 94(4), 691–719.

Hart, O., & Moore, J. (1990). Property Rights and the Nature of the Firm. Journal of Political Economy, 98(6), 1119–1158.

Holmström, B., & Milgrom, P. (1991). Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design. Journal of Law, Economics, and Organization, 7, 24–52.

Lin, J. (2026). From "Reasoning" Thinking to "Agentic" Thinking. Published on X, March 25, 2026. https://x.com/JustinLin610/status/2037116325210829168

Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. In NeurIPS.

Pan, A., et al. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. In ICLR.

Perez, E., et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. In ACL.

Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In NeurIPS.

Schmid, P. (2026). The Importance of Agent Harness in 2026. Personal blog, January 5, 2026. https://www.philschmid.de/agent-harness-2026

Skalse, J., et al. (2022). Defining and Characterizing Reward Hacking. In NeurIPS.
