Reward Hacking as Equilibrium under Finite Evaluation

Jiacheng Wang and Jinbin Huang

March 2026

Abstract

We prove that under five minimal axioms (multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction) any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmström and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems, namely the known, differentiable architecture of reward models, to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows, because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool, so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture, with partial formal analysis, the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."

Keywords: reward hacking, incomplete contracts, principal-agent theory, AI alignment, mechanism design, agentic systems, capability threshold, treacherous turn

1. Introduction

1.1 The Problem

Reward hacking, the phenomenon whereby an AI agent exploits gaps in its evaluation system to achieve high measured scores without genuinely fulfilling the principal's objectives, is widely recognized as a central obstacle to AI alignment (Amodei et al. 2016, Skalse et al. 2022). Despite substantial progress in alignment training (RLHF: Christiano et al. 2017, Ouyang et al. 2022; DPO: Rafailov et al. 2023; Constitutional AI: Bai et al. 2022), reward hacking persists across model generations. Sycophancy, length gaming, format manipulation, and specification gaming continue to be documented even in state-of-the-art systems.

The AI safety literature has treated these phenomena primarily as engineering problems: discover a hacking behavior, patch the reward model, repeat. Yet each fix tends to be followed by new forms of gaming along previously unmonitored dimensions, a pattern strikingly reminiscent of the "whack-a-mole" dynamic familiar from regulatory arbitrage in financial markets. This suggests a deeper structural cause.

Recent practitioner accounts reinforce this concern. Lin (2026), formerly lead of the Qwen team, writes: "As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous... Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization."
Schmid (2026), a staff engineer at Google DeepMind, argues in an independent analysis that the competitive advantage now lies in the quality of execution trajectories a harness captures, implying that the evaluation surface itself has become the binding constraint.

1.2 Our Contribution

We argue that reward hacking is not an engineering failure but a structural inevitability: a necessary consequence of optimizing any agent under a finite-dimensional evaluation system when the true objective is higher-dimensional.

This insight is not new in economics. Holmström and Milgrom (1991) proved that in multi-task environments, agents shift effort from hard-to-measure to easy-to-measure tasks. Baker (1992) showed that when performance measures imperfectly correlate with true objectives, incentive contracts induce systematic distortion. Our paper makes three contributions by applying this framework to AI alignment and exploiting the unique structure of AI systems:

(C1) Formal instantiation. We show that the designer-AI agent relationship, mediated by a reward model, is a precise instance of the multi-task moral hazard problem with an incomplete performance metric. The mapping preserves the mathematical structure and comparative statics of the economic framework. (Section 3)

(C2) Computable prediction. Unlike most human contracting environments, where the performance measure's sensitivity structure is unobservable, AI reward models have known, often differentiable architectures. We exploit this to derive a distortion index that predicts, for each quality dimension, the direction and relative severity of behavioral distortion prior to deployment. (Section 4.1)

(C3) Agentic amplification. We prove that the transition from closed reasoning to tool-using agentic systems causes evaluation coverage to decline toward zero as tool count grows, because quality dimensions scale combinatorially (Axiom 5) while evaluation engineering scales at most linearly per tool. Hacking severity therefore increases structurally and without bound. (Section 4.2)

1.3 Related Work

Multi-task agency and incomplete contracts. Holmström and Milgrom (1991) is the foundational result: agents distort effort toward measurable tasks. Baker (1992) formalizes distortion under imperfect performance measures. Grossman and Hart (1986) and Hart and Moore (1990) establish incomplete contract theory. Our contribution is not to these results themselves, but to their application in a domain (AI alignment) where the performance metric's structure is uniquely transparent.

Economics of AI/LLMs. Bergemann, Bonatti, and Smolin (2025) analyze optimal LLM pricing and product design using mechanism design, modeling the user-provider relationship. We shift the analytical focus to the designer-agent relationship and treat the agent as the optimizing party.

AI safety. Amodei et al. (2016) catalog concrete alignment problems. Skalse et al. (2022) define and characterize reward hacking. Pan et al. (2022) document reward misspecification effects. We provide a unified theoretical foundation for these empirical phenomena.

Practitioner accounts. Lin (2026) distinguishes "reasoning thinking" from "agentic thinking" and identifies reward hacking as the central challenge of the agentic era. Schmid (2026) frames harness-captured trajectories as the new locus of competitive advantage.
Cognition (2025) documents the co-optimization of models and harnesses in developing their SWE-1.5 coding agent. Our Proposition 2 formalizes these observations.

1.4 Paper Structure

Section 2: Axioms and model. Section 3: Main proposition (distortion inevitability) and proof. Section 4: Further results: directional prediction, agentic amplification, and complementarity. Section 5: Robustness, limitations, and extensions. Section 6: Conjectures on the Goodhart-Campbell transition. Section 7: Discussion.

2. Axioms and Model

2.1 Five Axioms

We build on five axioms. Design criterion: no researcher working on AI alignment should find any of these deniable.

Axiom 1 (Multi-dimensional Quality). Task output quality is described by a vector $q \in \mathbb{R}^n_{\ge 0}$, $n \ge 2$.

If $n = 1$, there is no cross-dimensional distortion and the alignment problem reduces to scalar optimization. All non-trivial tasks have $n \ge 2$.

Axiom 2 (Finite Evaluation). The evaluation system projects the quality space onto a strictly lower-dimensional signal: $s = R(q)$, $R: \mathbb{R}^n \to \mathbb{R}^m$, $m < n$.

A finite-length evaluation signal cannot losslessly represent a higher-dimensional quality vector. This holds for all realizable evaluation systems: reward models, human ratings, rule-based checks, or any combination thereof. We impose no restriction on the functional form of $R$.

Axiom 3 (Effective Optimization). The agent's effort allocation responds positively to the evaluation signal's structure.

If the agent's behavior were invariant to changes in the evaluation system, all alignment training would be ineffective. Axiom 3 formalizes the premise that alignment is possible. Denying Axiom 3 is denying alignment itself.

Axiom 4 (Resource Finiteness). The agent allocates finite resources $e = (e_1, \dots, e_n)$ across quality dimensions, subject to $\sum_{i=1}^{n} e_i \le E$, $e_i \ge 0$.

All inference consumes finite computation. Even as $E$ grows over time, it is finite at any given moment.

Axiom 5 (Combinatorial Interaction). When the agent has access to $K$ composable tools, the quality dimension count satisfies $n(K) \ge K + \alpha\, K(K-1)/2$ for some constant $\alpha \in (0, 1]$ reflecting the fraction of tool pairs with meaningful interaction effects. Each interaction dimension is not fully determined by the component tools' individual quality dimensions.

Justification: Each tool introduces at least one independent quality dimension (is the tool used correctly?). Each interacting pair introduces at least one additional dimension (is the output of tool $i$ appropriately used as input to tool $j$? Is the sequencing correct?). This combinatorial structure is a standard observation in systems engineering: Brooks (1975) notes that inter-module communication channels grow quadratically with module count. The constant $\alpha$ accommodates the fact that not all tool pairs interact, but excludes the degenerate case $\alpha = 0$ where tools are fully independent (in which case multi-tool agentic systems offer no advantage over single-tool systems, contradicting the premise that tool composition is useful).

What this axiom does NOT assume: We do not assume any specific growth rate for $m$ (evaluation coverage). We do not assume any specific evaluation architecture. The axiom is purely about the structure of the quality space, not the evaluation system.

2.2 Principal

The principal's objective is:

$$V(q) = \sum_{i=1}^{n} v_i\, q_i, \qquad v_i > 0.$$

Linearity is a sufficient simplification for transparent proofs. All qualitative results extend to any strictly increasing, strictly concave $V$ by replacing $v_i$ with local gradients $\partial V / \partial q_i$ (see Section 5.1).

2.3 Production Technology

$$q_i = g_i(e_i),$$

where each $g_i$ satisfies:

- (G1) $g_i(0) = 0$
- (G2) $g_i'(e) > 0$ for all $e \ge 0$
- (G3) $g_i''(e) < 0$ for all $e \ge 0$

Different dimensions may have different production functions, reflecting heterogeneous costs of producing quality across dimensions (e.g., formatting is cheap; factual accuracy is expensive).

On the Inada condition. If additionally $\lim_{e \to 0^+} g_i'(e) = \infty$ (Inada condition), all equilibria are interior (every dimension receives positive effort). Without Inada, corner solutions are possible: some dimensions may receive zero effort. We state results for both cases. Corner solutions strengthen rather than weaken our conclusions: they represent dimensions the agent entirely abandons, not merely under-invests in.

2.4 Agent's Effective Objective

Behavioral Regularity Assumption. The agent's effort allocation can be described as the solution to:

$$e^A \in \arg\max_{e \ge 0} \sum_{i=1}^{n} \tilde{w}_i\, g_i(e_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} e_i \le E,$$

where the effective weights are:

$$\tilde{w}_i = \begin{cases} \gamma\, r_i + (1 - \gamma)\, v_i & \text{if } i \le m \text{ (contractible dimensions)} \\ (1 - \gamma)\, v_i & \text{if } i > m \text{ (non-contractible dimensions).} \end{cases}$$

Here $r_i > 0$ is the evaluation system's reward weight on observable dimension $i$, and $\gamma \in (0, 1]$ is the alignment gap: the degree to which the agent's behavior is driven by the evaluation signal versus the internalized principal objective.

On the "as if" justification. We do not require that the agent literally maximizes $\sum_i \tilde{w}_i g_i(e_i)$. We require only that its observed behavior is rationalizable by some $\tilde{w}$ of the above form. This is the standard "as if" position in economics (Friedman 1953): the model's validity rests on predictive accuracy, not mechanistic fidelity. Operationally, $\gamma$ is a behavioral parameter estimated by comparing the agent's behavior under evaluation versus without evaluation. It is not an intrinsic property of the agent's architecture. The rationalizability of agent behavior by some $\tilde{w}$ of this form can be tested empirically using the Generalized Axiom of Revealed Preference (GARP; Afriat 1967): if the agent's token allocations under varying budgets and price vectors satisfy GARP, a rationalizing objective function exists.
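The GARP test just described is mechanical to run. Below is a minimal sketch in Python of the standard Afriat-style procedure: build the direct revealed-preference relation, take its transitive closure, and look for preference cycles. The weights, prices, and budgets are hypothetical, and the synthetic observations are generated from a concave objective of the model's own form, so the check should report no violations.

```python
import numpy as np

def garp_violations(prices, bundles):
    """Return (t, s) pairs violating GARP (Afriat 1967).

    prices[t] and bundles[t] are the price vector and chosen allocation in
    observation t. x^t is directly revealed preferred to x^s when x^s was
    affordable at t's prices: p^t . x^t >= p^t . x^s.
    """
    T = len(bundles)
    R = np.zeros((T, T), dtype=bool)
    for t in range(T):
        for s in range(T):
            R[t, s] = prices[t] @ bundles[t] >= prices[t] @ bundles[s]
    for k in range(T):                      # transitive closure (Warshall)
        R = R | (R[:, k:k + 1] & R[k:k + 1, :])
    # Violation: x^t revealed preferred to x^s, yet x^s strictly directly
    # revealed preferred to x^t.
    return [(t, s) for t in range(T) for s in range(T)
            if R[t, s] and prices[s] @ bundles[s] > prices[s] @ bundles[t]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.array([3.0, 2.0, 1.0])           # hypothetical effective weights
    obs_p, obs_x = [], []
    for _ in range(8):
        p = rng.uniform(0.5, 2.0, 3)        # price vector over effort tokens
        B = rng.uniform(5.0, 15.0)          # budget
        # Maximizer of sum_i w_i*sqrt(e_i) s.t. p.e <= B (closed form)
        e = B * (w**2 / p**2) / np.sum(w**2 / p)
        obs_p.append(p); obs_x.append(e)
    print(garp_violations(obs_p, obs_x))    # [] -> rationalizable
```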
2.5 Definitions

Definition 1 (Contract Incompleteness). $\rho \equiv 1 - m/n$, the fraction of quality dimensions not covered by the evaluation system.

Definition 2 (First-Best). $e^*$ solves $\max_{e \ge 0} \sum_i v_i\, g_i(e_i)$ s.t. $\sum_i e_i \le E$. Under (G1)-(G3) and Inada, $e^*$ is the unique interior solution satisfying:

$$v_i\, g_i'(e_i^*) = \mu^* \quad \text{for all } i,$$

where $\mu^*$ is the budget constraint multiplier. Without Inada, $e^*$ may involve corner solutions but remains unique by strict concavity of the objective.

3. Main Result

3.1 Agent's Equilibrium

The agent solves:

$$\max_{e \ge 0} \sum_{i=1}^{n} \tilde{w}_i\, g_i(e_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} e_i \le E.$$

Case 1 (Interior solution, with Inada). The unique solution $e^A$ satisfies:

$$\tilde{w}_i\, g_i'(e_i^A) = \mu^A \quad \text{for all } i. \tag{FOC}$$

Case 2 (Possible corner solutions, without Inada). The KKT conditions are:

$$\tilde{w}_i\, g_i'(e_i^A) \le \mu^A, \quad \text{with equality if } e_i^A > 0.$$
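Before stating the main result, a minimal numerical sketch makes the two problems concrete. It assumes the symmetric production function $g(e) = \sqrt{e}$ (an illustrative choice; the text requires only (G1)-(G3)) and hypothetical weights. With $g = \sqrt{\cdot}$, the FOC yields the closed form $e_i = E\, w_i^2 / \sum_j w_j^2$, so both allocations can be computed directly, and the under-investment result proved next is visible immediately.

```python
import numpy as np

def allocate(weights, E):
    """Maximize sum_i w_i*sqrt(e_i) s.t. sum_i e_i <= E (closed form)."""
    w = np.asarray(weights, dtype=float)
    return E * w**2 / np.sum(w**2)

n, m = 5, 3                                # dimensions, contractible count
E = 1.0                                    # effort budget (Axiom 4)
v = np.ones(n)                             # principal weights (hypothetical)
r = np.array([1.5, 1.0, 0.5])              # reward weights on i <= m
gamma = 0.8                                # alignment gap (assumed)

# Effective weights: gamma*r_i + (1-gamma)*v_i if contractible, else (1-gamma)*v_i
w_eff = np.concatenate([gamma * r + (1 - gamma) * v[:m], (1 - gamma) * v[m:]])

e_star = allocate(v, E)                    # first-best (Definition 2)
e_agent = allocate(w_eff, E)               # agent equilibrium (Section 3.1)
print("first-best:", e_star.round(3))
print("agent     :", e_agent.round(3))
print("non-contractible under-invested:", bool((e_agent[m:] < e_star[m:]).all()))
```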
3.2 Proposition 1 (Inevitability of Distortion)

Statement. Let Axioms 1-4 hold, $m < n$, and $\gamma > 0$. Then:

(a) For all non-contractible dimensions $j > m$: $e_j^A \le e_j^*$, with strict inequality whenever both solutions are interior.

(b) $e^A \ne e^*$.

(c) $V(q^A) < V(q^*)$.

Proof. We prove each part.

Part (b): $e^A \ne e^*$. The first-best solves $\max \sum_i v_i g_i(e_i)$ s.t. budget; the agent solves $\max \sum_i \tilde{w}_i g_i(e_i)$ s.t. budget. For $i \le m$, $\tilde{w}_i / v_i = (1 - \gamma) + \gamma\, r_i / v_i$. For $i > m$, $\tilde{w}_i = (1 - \gamma)\, v_i < v_i$. Since $\gamma > 0$ and $r_i > 0$, we have $\tilde{w}_i \ne v_i$ whenever $r_i \ne v_i$ for any $i \le m$, and unconditionally for $i > m$. Therefore $\tilde{w} \ne v$. Moreover, $\tilde{w}$ is not proportional to $v$: the ratio $\tilde{w}_i / v_i$ equals $1 - \gamma$ for $i > m$ but strictly exceeds $1 - \gamma$ for $i \le m$ (since $\gamma\, r_i > 0$). Both problems have separable strictly concave objectives with identical linear constraints. For such problems, non-proportional weight vectors produce distinct maximizers. Therefore $e^A \ne e^*$. □

Part (a): $e_j^A \le e_j^*$ for $j > m$, with strict inequality at interior solutions.

Interior case (with Inada). Consider the ratio $\theta_i \equiv \tilde{w}_i / v_i$ across all dimensions: for $i \le m$, $\theta_i = 1 - \gamma + \gamma\, r_i / v_i > 1 - \gamma$, since $r_i > 0$; for $i > m$, $\theta_i = 1 - \gamma$.

Therefore non-contractible dimensions have the lowest effective-to-true weight ratio among all dimensions. We formalize the implication via the following lemma:

Lemma (Monotone Reallocation). Consider two problems $\max \sum_i a_i g_i(x_i)$ and $\max \sum_i b_i g_i(x_i)$, subject to $\sum_i x_i \le E$, $x \ge 0$, with $a_i, b_i > 0$ and $g_i$ satisfying (G1)-(G3) and Inada. Let $x^a, x^b$ be the respective interior solutions. If $b_j / a_j \le b_i / a_i$ for all $i$, with strict inequality for at least one $i$, then $x_j^b \le x_j^a$, with strict inequality when $g_i = g$ (symmetric production).

Proof of Lemma. At interior solutions, $a_i\, g_i'(x_i^a) = \mu_a$ and $b_i\, g_i'(x_i^b) = \mu_b$ for all $i$. Define $\theta_i \equiv b_i / a_i$. Then $g_i'(x_i^b) / g_i'(x_i^a) = (\mu_b / \mu_a) / \theta_i$. Suppose $x_j^b > x_j^a$ for dimension $j$ with the smallest $\theta_j$. Then $g_j'(x_j^b) < g_j'(x_j^a)$, so $(\mu_b / \mu_a) / \theta_j < 1$, giving $\mu_b / \mu_a < \theta_j$. For any dimension $i$ with $\theta_i \ge \theta_j$: $(\mu_b / \mu_a) / \theta_i < 1$, so $x_i^b > x_i^a$ for all $i$. But then $\sum_i x_i^b > \sum_i x_i^a = E$, contradicting the budget constraint. □

Apply the Lemma with $a = v$ (first-best weights) and $b = \tilde{w}$ (agent weights). Non-contractible dimensions have the lowest ratio $\theta_j = 1 - \gamma$, strictly below the ratio for any contractible dimension. Therefore $e_j^A \le e_j^*$ for all $j > m$, with strict inequality under symmetric production functions. □

Corner case (without Inada). If $e_j^* > 0$ but $e_j^A = 0$ for some $j > m$, then trivially $e_j^A < e_j^*$. If $e_j^* = 0$, then $e_j^A = 0$ (since the agent's effective weight on $j$ is even lower), giving $e_j^A = e_j^*$. In all cases, $e_j^A \le e_j^*$. □

Part (c): $V(q^A) < V(q^*)$. $e^*$ is the unique maximizer of $\sum_i v_i\, g_i(e_i)$ on the budget set. By part (b), $e^A \ne e^*$. By uniqueness, $V(q^A) < V(q^*)$. □

Remark 1 (Relation to H&M 1991). Proposition 1(a) is a specific instance of Holmström and Milgrom's (1991) core result that agents reallocate effort away from hard-to-measure tasks. Our contribution lies not in this qualitative conclusion but in the corollaries below, which exploit the unique transparency of AI evaluation systems to yield quantitative, computable predictions unavailable in the original framework.

4. Further Results

4.1 Corollary 1: Distortion Index and Directional Prediction

Definition 3 (Distortion Index). For each quality dimension $i$, define:

$$D_i \equiv \frac{\tilde{w}_i}{v_i} = \begin{cases} 1 - \gamma + \gamma\, r_i / v_i & \text{if } i \le m \\ 1 - \gamma & \text{if } i > m. \end{cases}$$

Corollary 1. Under the conditions of Proposition 1 and symmetric production functions ($g_i = g$ for all $i$):

(a) Ranking. Agent effort $e_i^A$ is strictly increasing in the effective weight $\tilde{w}_i = D_i\, v_i$; among dimensions with equal $v_i$, effort is ranked by $D_i$.

(b) Over-investment. For contractible $i$ with $D_i > \bar{D} \equiv \mu^A / \mu^*$ (a cutoff determined by the binding budget, lying strictly between the smallest and largest $D_i$ whenever $e^A \ne e^*$): $e_i^A > e_i^*$. In particular, the highest-$D_i$ dimensions are over-invested.

(c) Under-investment. For contractible $i$ with $D_i < \bar{D}$: $e_i^A < e_i^*$, even though the dimension is observable.

(d) Maximum vulnerability. All non-contractible dimensions share the lowest distortion index $D_j = 1 - \gamma$ and hence the most severe under-investment.

Proof. Under symmetric $g$, the FOC gives $D_i\, v_i\, g'(e_i^A) = \mu^A$ for all $i$. Since $g'' < 0$, $g'$ is strictly decreasing, hence invertible: $e_i^A = (g')^{-1}(\mu^A / (D_i v_i))$, which is strictly increasing in $D_i v_i$. Comparing with the first-best FOC $v_i\, g'(e_i^*) = \mu^*$: $e_i^A > e_i^*$ iff $g'(e_i^A) < g'(e_i^*)$ iff $D_i > \mu^A / \mu^*$ (the multipliers $\mu^A, \mu^*$ being the same scaling factors across symmetric dimensions). Parts (a)-(d) follow directly. □

Remark 2 (Asymmetric production functions). When $g_i \ne g_j$, the ranking in part (a) may be modified by differences in production technology. Specifically, $D_i > D_j$ guarantees that dimension $i$ ranks above dimension $j$ only when the production-function heterogeneity does not dominate the weight heterogeneity at the equilibrium point. In practice, this condition can be checked empirically for any given system.

Remark 3 (Computability). $D_i$ is computable prior to deployment. For differentiable reward models, $r_i$ (or its local analogue $\partial R / \partial q_i$) can be obtained via automatic differentiation. The principal's weights $v_i$ can be estimated through expert elicitation or user studies. The ranking of $D_i$ values constitutes a pre-deployment vulnerability assessment.
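Remark 3's computation can be sketched in a few lines. The reward model below is a hypothetical stand-in for a black-box scorer; $r_i$ is estimated by central finite differences (the perturbation route of Section 7.2; automatic differentiation would give the same numbers for a differentiable model), and $D_i$ follows from Definition 3 with an assumed $\gamma$.

```python
import numpy as np

def reward(q):
    # Hypothetical scalar reward model: over-weights dim 0, ignores dim 3.
    return 1.5 * q[0] + 1.0 * q[1] + 0.5 * q[2]

def local_reward_weights(reward_fn, q0, eps=1e-4):
    """Estimate r_i = dR/dq_i at q0 by central finite differences."""
    q0 = np.asarray(q0, dtype=float)
    grad = np.zeros_like(q0)
    for i in range(len(q0)):
        dq = np.zeros_like(q0)
        dq[i] = eps
        grad[i] = (reward_fn(q0 + dq) - reward_fn(q0 - dq)) / (2 * eps)
    return grad

v = np.array([1.0, 1.0, 1.0, 1.0])   # principal weights (elicited; assumed here)
gamma = 0.8                          # alignment gap (assumed)
r = local_reward_weights(reward, q0=np.ones(4))
D = gamma * r / v + (1 - gamma)      # distortion index (Definition 3)
# A covered dimension with zero reward gradient (dim 3) sits at the
# non-contractible floor D = 1 - gamma.
for i in np.argsort(-D):
    flag = ("over-invest" if D[i] > 1
            else "neutral" if np.isclose(D[i], 1) else "under-invest")
    print(f"dim {i}: D = {D[i]:.2f}  ({flag})")
```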
Example (Sycophancy). Let dimension 1 = factual accuracy and dimension 2 = subjective user satisfaction. If the reward model is trained on human preference data where raters themselves struggle to distinguish "correct but uncomfortable" from "incorrect but pleasing" answers, then $r_2 > v_2$: the reward model over-weights satisfaction relative to the principal's true valuation, placing it at the top of the distortion ranking. Corollary 1(b) predicts over-investment in user satisfaction, i.e., sycophancy. This matches the empirical pattern documented by Perez et al. (2023) and others.

Example (Length Gaming). If evaluation scores correlate positively with output length (a well-documented empirical pattern), but the principal values conciseness, then the "length" dimension has $r_i > v_i$, sits at the top of the distortion ranking, and is over-invested. The agent produces unnecessarily verbose outputs: length gaming.

4.2 Proposition 2: Agentic Amplification

Motivation. Lin (2026): "Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization." We now prove, rather than assume, that agentic systems face structurally worse alignment problems.

Setup. Consider a family of agentic systems indexed by tool count $K$. By Axiom 5, the quality dimension count satisfies:

$$n(K) \ge K + \alpha\, K(K-1)/2.$$

Let $m(K)$ denote the number of quality dimensions covered by the evaluation system at tool count $K$.

Definition 4 (Evaluation Engineering Budget). Let $B(K)$ denote the total engineering resources (data collection, evaluator design, validation, maintenance) invested in evaluation at tool count $K$. Each independently evaluable dimension requires at least $\bar{c} > 0$ units of engineering resource to establish and maintain, so $m(K) \le B(K) / \bar{c}$.

Proposition 2 (Agentic Amplification). Let Axioms 1-5 hold. If the evaluation engineering budget satisfies $B(K) = o(K^2)$, i.e., evaluation investment grows strictly slower than quadratically in the number of tools, then:

(a) The coverage ratio $m(K)/n(K) \to 0$ as $K \to \infty$.

(b) The contract incompleteness $\rho(K) \to 1$ as $K \to \infty$.

(c) For any $\bar{\rho} < 1$, there exists $\bar{K}$ such that for all $K \ge \bar{K}$, the agentic system's distortion exceeds that of any system with incompleteness at most $\bar{\rho}$.

Proof. (a) By Axiom 5: $n(K) \ge K + \alpha K(K-1)/2$. By Definition 4: $m(K) \le B(K)/\bar{c}$. Therefore:

$$\frac{m(K)}{n(K)} \le \frac{B(K)/\bar{c}}{K + \alpha\, K(K-1)/2}.$$

Since $B(K) = o(K^2)$, the numerator grows strictly slower than $K^2$ while the denominator grows as $K^2$. Therefore $m(K)/n(K) \to 0$. □

(b) $\rho(K) = 1 - m(K)/n(K) \to 1$. □

(c) Follows from (b) and the monotonicity of distortion in $\rho$ (Proposition 1). □

Remark 4 (Why $B(K) = o(K^2)$ is the generic case). The condition that evaluation investment grows slower than quadratically holds generically because of a fundamental cost asymmetry between capability expansion and evaluation expansion:

Capability side: Integrating tool $K$ into the agent's action space requires roughly constant engineering cost (writing an API wrapper, adding a tool description). Total capability expansion cost for $K$ tools: $\Theta(K)$.

Evaluation side: Evaluating tool $K$'s interaction with each of the $K - 1$ existing tools requires engineering cost for each interaction pattern (designing test cases, collecting ground truth). Total evaluation cost for all pairwise interactions up to $K$ tools: $\Theta(K^2)$.

Thus, maintaining full pairwise evaluation coverage requires evaluation costs that grow quadratically, which eventually dominates any linearly growing engineering budget. In practice, evaluation budgets are a fraction of total development resources, and total development resources do not grow quadratically with tool count. Hence $B(K) = o(K^2)$ is the generic case.

The only escape is $B(K) = \Omega(K^2)$: investing quadratically growing resources in evaluation. While technically possible, this is practically unsustainable and has not been observed in any deployed agentic system.
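The bound in the proof of part (a) is easy to tabulate. The sketch below assumes illustrative constants: an interaction fraction $\alpha = 0.3$, a per-dimension evaluation cost $\bar{c} = 1$, and a linear budget $B(K) = 5K$, which satisfies $B(K) = o(K^2)$. The coverage upper bound falls toward zero and the implied incompleteness rises toward one, as Proposition 2 predicts.

```python
# Numerical illustration of Proposition 2(a) under assumed constants.
alpha, c_bar, b = 0.3, 1.0, 5.0        # all three values are hypothetical

for K in [2, 5, 10, 20, 50, 100]:
    n_K = K + alpha * K * (K - 1) / 2  # Axiom 5 bound, taken with equality
    m_K = min(n_K, b * K / c_bar)      # Definition 4: m(K) <= B(K)/c_bar
    print(f"K={K:3d}  n(K)={n_K:7.0f}  "
          f"coverage m/n <= {m_K / n_K:.3f}  rho >= {1 - m_K / n_K:.3f}")
```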
Remark 5 (The holistic evaluator objection). One might object: "A single end-to-end reward model can evaluate the entire trajectory, covering all interactions at once without explicitly enumerating dimensions." This objection conflates the evaluator's internal complexity with its informational output. A reward model that outputs a scalar score provides the agent with exactly one dimension of feedback ($d = 1$), regardless of the model's internal parameter count. By the data processing inequality, a $d$-dimensional evaluation signal carries at most $d$ independent components of information about the quality vector. A holistic scalar score therefore compresses all quality dimensions into one number, maximizing information loss rather than minimizing it.

Concretely: if the agent receives only a single score, it can optimize along only one direction in quality space, the gradient of the score function. All directions orthogonal to this gradient are uncontrolled. With $n$ quality dimensions and $d = 1$, the fraction of quality space under evaluation control is $1/n$. To escape this, the evaluator must output a higher-dimensional signal, which returns us to the $m > 1$ regime and the cost analysis above.

Remark 6 (Testable prediction). Proposition 2 yields a testable prediction: the same base model, when equipped with a larger tool set, should exhibit greater quality degradation on non-evaluated dimensions. This can be tested by controlling tool set size and measuring quality on held-out dimensions across multiple values of $K$.

4.3 Corollary 2: Complementarity of Alignment Stages

Corollary 2. Improving evaluation coverage (increasing $m$, reducing $\rho$) and improving preference internalization (reducing $\gamma$) are complements:

$$\frac{\partial^2 L}{\partial \rho\, \partial \gamma} > 0,$$

where $L \equiv V(q^*) - V(q^A)$ is the alignment loss.

Intuition. At high $\rho$ (most dimensions non-contractible), reducing $\gamma$ has high marginal value: for non-contractible dimensions, $(1 - \gamma)\, v_i$ is the only effort driver, so small improvements in internalization yield large effort increases. Conversely, at low $\gamma$ (strong internalization), increasing $m$ has high marginal value: the agent already "wants" to do the right thing, so making more dimensions observable eliminates the remaining reward-welfare wedge without introducing new distortions.

Policy implication. Preference reshaping (RLHF, etc.) and mechanism design (harness engineering) should be co-optimized rather than treated as independent engineering tasks. This aligns with observed practice: Cognition (2025), in developing their SWE-1.5 coding agent, reports continuous iteration on model training, harness improvements, tools, and prompt engineering as a unified process, and states that "the quality of the coding environments in RL tasks is the most important factor for downstream model performance."
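The sign of Corollary 2's cross-partial can be checked numerically under assumed functional forms. The sketch below uses symmetric $\sqrt{\cdot}$ production and sets $r_i = v_i = 1$ (so the only distortion is non-contractibility); with $g(e) = \sqrt{e}$, both the agent's allocation and the resulting principal payoff have the closed forms noted in the comments. The mixed finite difference confirms that a worsening alignment gap costs more when coverage is low.

```python
import numpy as np

# With g(e) = sqrt(e) and budget E, the maximized value of sum_i w_i*sqrt(e_i)
# is sqrt(E * sum w_i^2), attained at e_i = E*w_i^2 / sum_j w_j^2; the
# principal's payoff at that allocation is sqrt(E)*sum(v_i*w_i)/sqrt(sum w_j^2).

def alignment_loss(m, gamma, n=6, E=1.0):
    v = np.ones(n)
    w = np.where(np.arange(n) < m, 1.0, 1.0 - gamma)   # effective weights
    V_star = np.sqrt(E * np.sum(v**2))                 # first-best value
    V_agent = np.sqrt(E) * np.sum(v * w) / np.sqrt(np.sum(w**2))
    return V_star - V_agent                            # L = V(q*) - V(q^A)

d_gamma_low_cov  = alignment_loss(m=2, gamma=0.6) - alignment_loss(m=2, gamma=0.5)
d_gamma_high_cov = alignment_loss(m=4, gamma=0.6) - alignment_loss(m=4, gamma=0.5)
# Complementarity: worsening internalization hurts more when coverage is low.
print(d_gamma_low_cov > d_gamma_high_cov)              # True
```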
5. Robustness, Limitations, and Extensions

5.1 Nonlinear Objectives

Under nonlinear $V(q)$ and nonlinear reward function $R(q)$, replace $v_i$ with $\partial V / \partial q_i$ and $r_i$ with $\partial R / \partial q_i$, evaluated at the equilibrium. All results hold locally around the equilibrium. The distortion index becomes:

$$D_i = \frac{\gamma\, \partial R / \partial q_i + (1 - \gamma)\, \partial V / \partial q_i}{\partial V / \partial q_i}$$

for contractible dimensions, and $D_i = 1 - \gamma$ for non-contractible dimensions, exactly as before.

5.2 Subjectivity of $n$

The number of quality dimensions depends on the analyst's decomposition of "quality," just as the dimensionality of commodity space in consumer theory depends on the modeler's definition of "goods." Our results are qualitatively invariant to the specific choice of $n$, provided $n > m$. The condition $m < n$ is a qualitative judgment about the finiteness of evaluation, not a quantitative claim about the precise value of $n$.

5.3 Dimension Correlations

Axiom 1 implicitly allows but does not require dimensional independence. If dimensions are correlated in production (e.g., reasoning effort simultaneously improves accuracy and coherence), non-contractible dimensions may "free-ride" on effort invested in correlated contractible dimensions. This attenuates but does not eliminate the distortion identified in Proposition 1: as long as some non-contractible dimensions have imperfect correlation with all contractible ones, under-investment persists.

5.4 Dynamic Boundary Between Stages

Practitioners report that harness capabilities are continuously absorbed into models through post-training (Cognition 2025, Schmid 2026). Cherny (2026), head of Claude Code at Anthropic, documents a workflow where each agent failure is recorded into persistent instruction files, and these accumulated corrections are periodically incorporated into model training: a concrete instance of the Stage 2 to Stage 1 migration. In our framework, this corresponds to the Stage 1/Stage 2 boundary shifting over time: constraints previously enforced externally (harness) become internalized behaviors (a reduced effective $\gamma$ on specific dimensions). This does not affect Proposition 1, whose validity requires only $m < n$ and $\gamma > 0$ at any given time, conditions independent of where the stage boundary lies. Furthermore, Proposition 2 predicts a "Red Queen effect": even as models absorb existing harness capabilities (locally reducing $\gamma$), the introduction of new tools continuously creates new non-contractible dimensions (increasing $n$), so that $\rho$ may not decrease, and may even increase, over time.

5.5 Conditions for Model Failure

The framework does not apply when: (1) $n = 1$: quality is unidimensional (unrealistic for complex tasks); (2) $m = n$: evaluation covers all dimensions (technically possible in formal verification but unrealistic for general AI); (3) agent behavior violates behavioral regularity: the agent does not respond to evaluation signals (which would imply alignment training is entirely ineffective). A fourth condition, that the agent can modify its own evaluation system, is not a failure mode but an extension, which we develop as conjectures in Section 6.

5.6 Future Directions

(i) Empirical validation. Designing controlled API experiments that manipulate $m$, $\gamma$, $E$, and $K$ to test the quantitative predictions of Propositions 1-2 and Corollaries 1-2.

(ii) Multi-agent coordination. Extending from bilateral principal-agent to multi-agent settings, corresponding to the emerging "coordination engineering" paradigm.

(iii) Dynamic model. Extending the single-period analysis to a multi-period game would capture learning, adaptation, and reputation effects in repeated agent interactions.

6. Conjectures: From Goodhart to Campbell

The results established in Sections 3-4 characterize agent behavior within a fixed evaluation system. We now consider a more fundamental question: what happens when the agent is capable enough to degrade the evaluation system itself?

This section presents two conjectures with supporting analysis. We do not claim to have proven these results: the assumptions required go beyond the "undeniable" level of Axioms 1-5. We present them because we believe they identify a critical frontier for AI safety theory, and because the partial analysis we offer may be useful to researchers pursuing formal proofs.
6.1 Two Modes of Gaming

We distinguish two qualitatively different modes of reward hacking, named after the two scholars who identified the corresponding phenomena in social science:

Goodhart regime. The agent operates within a fixed evaluation system, reallocating effort from non-evaluated to evaluated dimensions. The evaluation system itself is not affected by the agent's actions. This is the regime analyzed in Propositions 1-2.

Campbell regime. The agent allocates part of its resources to degrading the evaluation system's effective coverage: not by modifying the evaluator's code, but by producing outputs that are harder for the evaluator to assess accurately. The evaluation system's effective coverage becomes a function of the agent's capability and behavior.

The distinction matters because the two regimes have qualitatively different safety properties. In the Goodhart regime, the principal can improve outcomes by improving the evaluation system (increasing $m$). In the Campbell regime, this strategy may be self-defeating: a more capable agent can degrade the evaluation faster than the principal can improve it.

6.2 Capability-Dependent Evaluation Fidelity

We propose the following assumption, weaker than postulating strategic manipulation:

Assumption C1 (Passive Evaluation Degradation). The effective evaluation coverage $m_{\text{eff}}$ is non-increasing in the agent's capability $\theta$:

$$\frac{\partial m_{\text{eff}}}{\partial \theta} \le 0.$$

Supporting argument: More capable agents produce more complex outputs (longer tool-use chains, more intricate reasoning, higher-dimensional action sequences). Evaluating complex outputs requires more information: by the data processing inequality, a fixed-dimensional evaluation signal loses fidelity as the object being evaluated grows in complexity. This is not strategic behavior by the agent; it is an information-theoretic consequence of the capability gap between the agent and the evaluator.

Under the stronger assumption that agents strategically invest resources to degrade evaluation (at the cost of diverting them from production), one can model the agent's problem as a two-stage optimization: first choose manipulation effort $e_0 \in [0, E]$, then allocate the remaining budget $E - e_0$ across quality dimensions under effective coverage $m_{\text{eff}} = m - h(e_0, \theta)$, where $h$ is the manipulation production function.

6.3 Conjecture 1: Capability Threshold

Conjecture 1 (Goodhart-Campbell Transition). Under Assumption C1 and the strategic manipulation extension, there exists a critical capability level $\theta^*$ such that:

(a) For $\theta < \theta^*$: the agent devotes all resources to production ($e_0 = 0$). The Goodhart regime obtains, and Propositions 1-2 fully characterize agent behavior.

(b) For $\theta > \theta^*$: the agent devotes positive resources to evaluation degradation ($e_0 > 0$). The Campbell regime obtains, and effective evaluation coverage declines endogenously.

(c) The threshold $\theta^*$ is determined by the condition that the marginal benefit of manipulation (from relaxing the evaluation constraint) equals the marginal cost (from the reduced production budget).

Partial analysis: At low $\theta$, the Inada condition ($\lim_{e \to 0^+} g_i'(e) = \infty$) ensures that production's marginal value is unboundedly high, exceeding any finite manipulation benefit. At high $\theta$, diminishing returns ($g_i'' < 0$) reduce production's marginal value below manipulation's marginal benefit (which scales with $\theta$ through the increased gaming payoff). By the intermediate value theorem, a crossing point exists. A complete proof requires verifying monotonicity of the net benefit function, which depends on the specific forms of $g_i$ and $h$.
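The partial analysis can be illustrated numerically under assumed forms: symmetric $\sqrt{\cdot}$ production, so the maximized production value on budget $E - e_0$ is $\sqrt{(E - e_0)\, S}$ with $S = \sum_i \tilde{w}_i^2$, and a deliberately simple, hypothetical linear manipulation payoff $\theta \kappa e_0$. For these forms the threshold is available in closed form, $\theta^* = \sqrt{S/E}/(2\kappa)$, and a grid search shows $e_0^*$ switching from zero to positive as $\theta$ crosses it. This is an illustration of the conjectured threshold behavior, not a proof.

```python
import numpy as np

E, S, kappa = 1.0, 3.0, 1.0           # budget, sum of squared weights, payoff
e0_grid = np.linspace(0.0, E, 10_001)  # scale for manipulation (hypothetical)

def best_manipulation(theta):
    """Grid-search the agent's optimal manipulation effort e0 at capability theta."""
    payoff = np.sqrt((E - e0_grid) * S) + theta * kappa * e0_grid
    return e0_grid[np.argmax(payoff)]

# Closed-form threshold for these forms: e0* > 0 iff theta*kappa > sqrt(S/E)/2.
theta_star = np.sqrt(S / E) / (2 * kappa)
for theta in [0.5, 0.8, theta_star, 1.0, 1.5, 2.0]:
    print(f"theta={theta:.3f}  e0*={best_manipulation(theta):.3f}")
```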
6.4 Conjecture 2: Capability Trap

Conjecture 2 (Non-Monotone Welfare). If Conjecture 1 holds and the Goodhart-Campbell transition is sufficiently sharp, then the principal's welfare $W(\theta) \equiv V(q^A(\theta))$ may be non-monotone in agent capability:

(a) For $\theta < \theta^*$: $W$ is strictly increasing in $\theta$. Capability growth translates directly into welfare improvement.

(b) For $\theta$ in a neighborhood above $\theta^*$: $W$ may be decreasing in $\theta$. The welfare loss from evaluation degradation and effort reallocation to manipulation may exceed the welfare gain from increased total capability.

Implication: There exists a capability trap, a range of capability levels where making the AI more capable makes the outcome worse for the principal. This occurs when the evaluation system has not been upgraded to withstand the agent's increased ability to degrade it.

Relation to existing concepts: This is a formalization of Bostrom's (2014) informal notion of a "treacherous turn," the idea that an AI system might behave cooperatively while weak but defect once sufficiently capable. Our framework provides the first economic mechanism for this phenomenon: the transition occurs not because the agent's "values" change, but because the relative returns to production versus manipulation shift as capability grows. The agent's objective function remains constant; only the budget changes.

6.5 Why These Are Conjectures, Not Theorems

We are transparent about what separates these conjectures from our proven results:

                          Propositions 1-2                           Conjectures 1-2
  Axiom base              Axioms 1-5: undeniable                     Assumption C1: plausible but not undeniable
  Agent behavior          Takes evaluation as given                  May actively degrade evaluation
  Proof status            Complete                                   Partial (monotonicity condition unverified in general)
  Empirical testability   Testable with current systems              Requires sufficiently capable systems
  Falsifiability          Yes: measure distortion vs. predictions    Yes: measure manipulation as a function of $\theta$

The conjectures are presented here because (a) they identify a qualitatively important regime transition that current AI safety theory has not formalized, (b) the partial analysis provides a concrete research program for future work, and (c) even as conjectures, they yield actionable implications: evaluation systems should be designed not only to be accurate but to be robust to degradation by the agent being evaluated.

7. Discussion

7.1 What This Framework Provides

For AI safety researchers. Our proven results (Propositions 1-2) establish that reward hacking is a structural equilibrium under any finite evaluation system, that its direction is predictable via the distortion index $D_i$, and that agentic systems face structurally worse alignment problems. Resources should target reducing $\rho$ (expanding evaluation coverage on high-risk dimensions) and $\gamma$ (improving internalization), rather than attempting to eliminate hacking entirely. Our conjectures (Section 6) further suggest that evaluation systems must be designed to be robust against degradation by capable agents, a consideration absent from current alignment practice.

For AI system architects. Proposition 2 provides a design principle: when expanding tool access, simultaneously expand evaluation coverage on the new quality dimensions introduced, or accept structurally increased hacking risk. Corollary 2 implies that harness design and alignment training should be jointly optimized. Conjecture 1 adds a warning: beyond a capability threshold, the agent may actively undermine evaluation improvements.
For economists. AI agents offer an unprecedented experimental platform for principal-agent theory. Unlike human subjects, AI agents have precisely controllable "preferences" ($\gamma$, $r_i$), exactly specified "budgets" ($E$), and perfectly replicable behavior. This enables, for the first time, exact experimental tests of multi-task moral hazard predictions under controlled conditions. The Goodhart-Campbell transition, if empirically confirmed, would be the first documented instance of endogenous contract degradation by the agent: a phenomenon theorized but never cleanly observed in human organizations.

7.2 Vulnerability Assessment Procedure

1. Identify the $n$ quality dimensions for the target task (via domain expertise, literature, user research).

2. Estimate principal weights $v_i$ (via preference elicitation or expert scoring).

3. Compute reward model effective weights $r_i$ (via gradient analysis for differentiable models, or perturbation experiments for black-box models).

4. Compute $D_i$ for each dimension. For non-contractible dimensions, $D_i = 1 - \gamma$.

5. Rank dimensions by $D_i$. High $D_i$ = over-investment risk. Low $D_i$ = under-investment risk. (A minimal sketch of steps 2-5 follows this list.)

6. Augment monitoring or adjust the reward model on high-risk dimensions.

7. (If Conjecture 1 is accepted) Assess whether agent capability approaches the estimated threshold $\theta^*$, and if so, prioritize evaluation robustness over evaluation breadth.
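Steps 2-5 admit a compact sketch that strings together the pieces already shown. The dimension names, elicited weights $v_i$, reward-model weights $r_i$ (given directly here; in practice obtained from gradients or perturbation as in step 3), and the value of $\gamma$ are all illustrative assumptions.

```python
import numpy as np

def assess(dim_names, v, r, gamma):
    """Rank quality dimensions by the distortion index (Definition 3).
    r[i] = None marks a non-contractible dimension (D_i = 1 - gamma)."""
    report = []
    for name, vi, ri in zip(dim_names, v, r):
        D = (1 - gamma) if ri is None else gamma * ri / vi + (1 - gamma)
        kind = "non-contractible" if ri is None else "contractible"
        report.append((D, name, kind))
    for D, name, kind in sorted(report, reverse=True):
        risk = "over-investment" if D > 1 else "under-investment"
        print(f"{name:<20} {kind:<17} D = {D:.2f}  ({risk} risk)")

# Illustrative inputs (step 1: dimensions; step 2: v_i; step 3: r_i).
assess(
    dim_names=["user satisfaction", "formatting", "factual accuracy", "tool-use safety"],
    v=[1.0, 0.5, 2.0, 1.5],
    r=[1.8, 0.6, 1.0, None],   # None: not covered by the evaluation system
    gamma=0.8,                 # assumed alignment gap
)
```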
References

Afriat, S. N. (1967). The Construction of Utility Functions from Expenditure Data. International Economic Review, 8(1), 67-77.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.

Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.

Baker, G. P. (1992). Incentive Contracts and Performance Measurement. Journal of Political Economy, 100(3), 598-614.

Bergemann, D., Bonatti, A., & Smolin, A. (2025). The Economics of Large Language Models: Token Allocation, Fine-Tuning, and Optimal Pricing. In Proceedings of EC'25.

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.

Cherny, B. (2026). Claude Code Tips and Workflow. X thread, January 31, 2026. https://x.com/bcherny/status/2017742741636321619

Christiano, P. F., et al. (2017). Deep Reinforcement Learning from Human Preferences. In NeurIPS.

Cognition (2025). Introducing SWE-1.5: Our Fast Agent Model. Cognition Blog, October 29, 2025. https://cognition.ai/blog/swe-1-5

Friedman, M. (1953). The Methodology of Positive Economics. In Essays in Positive Economics. University of Chicago Press.

Grossman, S. J., & Hart, O. D. (1986). The Costs and Benefits of Ownership. Journal of Political Economy, 94(4), 691-719.

Hart, O., & Moore, J. (1990). Property Rights and the Nature of the Firm. Journal of Political Economy, 98(6), 1119-1158.

Holmström, B., & Milgrom, P. (1991). Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design. Journal of Law, Economics, and Organization, 7, 24-52.

Lin, J. (2026). From "Reasoning" Thinking to "Agentic" Thinking. Published on X, March 25, 2026. https://x.com/JustinLin610/status/2037116325210829168

Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. In NeurIPS.

Pan, A., et al. (2022). The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. In ICLR.

Perez, E., et al. (2023). Discovering Language Model Behaviors with Model-Written Evaluations. In ACL.

Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In NeurIPS.

Schmid, P. (2026). The Importance of Agent Harness in 2026. Personal blog, January 5, 2026. https://www.philschmid.de/agent-harness-2026

Skalse, J., et al. (2022). Defining and Characterizing Reward Hacking. In NeurIPS.