What does a system modify when it modifies itself?


Authors: Florentin Koch

Self-modification regimes and crossed opacities in cognitive systems

École Polytechnique, Palaiseau, France
florentin.koch@polytechnique.edu

Abstract

When a cognitive system modifies its own functioning, what exactly does it modify: a low-level rule, a control rule, or the norm that evaluates its own revisions? Cognitive science describes executive control, metacognition, and hierarchical learning with considerable precision, but lacks a formal framework distinguishing these targets of transformation and the conditions that separate them. Meanwhile, contemporary artificial intelligence realizes a growing spectrum of self-modification without common criteria for comparison with biological cognition.

We show that the question "what is a self-modifying system?" imposes, by logical derivation, a minimal structure: a rule hierarchy Φ_t = {R_0, …, R_{k_max}}, an unavoidable fixed core (R_{k_max}), and a distinction between effective rules (Φ_t), represented rules (Φ_t^R ⊆ Φ_t), and causally accessible rules (Φ_t^C ⊆ Φ_t). Four self-modification regimes are identified according to the level of the hierarchy at which modification operates: (1) action without modification, (2) low-level modification, (3) structural modification, (4) teleological revision. Each regime is anchored in a characterized cognitive phenomenon and a corresponding artificial system.

Application to the human case yields a central result: the crossing of opacities. Humans possess self-representation (Φ_t^R) and causal power (Φ_t^C) concentrated at the upper levels of their hierarchy (R_{k_max} and neighboring levels), while operational levels (R_0, R_1, …) remain largely opaque. Reflexive artificial systems exhibit the inverse profile: Φ_t^R and Φ_t^C rich at operational levels, empty at the level of R_{k_max}.
This crossed asymmetry constitutes the structural signature of the human/AI comparison. The framework further provides a structural protocol for the question of artificial consciousness, showing that higher-order theories (HOT) and Attention Schema Theory (AST) appear as special cases of the formalism. Four testable predictions and a novel experimental protocol are proposed. Four open problems emerge in cascade: the independence of transformativity and autonomy, the viability of self-modification, the teleological lock, and identity under transformation.

Keywords: cognitive architecture, self-modification, metacognition, reflexivity, hierarchical control, artificial consciousness, human/AI comparison

Notation table

Φ_t : Functional state of the system at time t: {R_0^(t), R_1^(t), …, R_{k_max}^(t)}
R_i^(t) : Level-i rule at time t in the functional hierarchy
R_0 : Lowest-level rule (observable behavior, weights, associative strength)
R_1 : Rule governing the modification of R_0 (learning rule)
R_{k_max} : Teleological norm: ultimate evaluation criterion
k_max : Hierarchy depth: maximum number of levels
E_t : Environment at time t; E_t ⇝ Φ_t denotes feedback (via F)

Representation and causality
Φ_t^R : Self-representation: subset of Φ_t that the system represents to itself (Φ_t^R ⊆ Φ_t)
Φ_t^C : Causally active subset: rules of Φ_t over which the system possesses effective causal power (Φ_t^C ⊆ Φ_t, not necessarily ⊆ Φ_t^R)
Φ_t^R ∩ Φ_t^C : Rules both represented and causally accessible (reflexive modification)
Φ_t^C \ Φ_t^R : Rules causally modified without representation (blind modification)
Φ_t^R \ Φ_t^C : Rules represented but without causal power (introspection without leverage)
A \ B : Set difference: elements of A not belonging to B

Local structure of an operation R_{i+1} → R_i
˜R_i : Compressed, partial, or inaccurate representation of R_i
ˆ(R_{i+1} → R_i) : Representation (possibly degraded) of the causal link between levels
R_{i+1} ⇝ R_i : Capacity (possibly degraded) of R_{i+1} to modify R_i
Reflexivity: Φ_t^R ⊇ {˜R_{i+1}, ˆ(R_{i+1} → R_i), ˜R_i}
Causality: Φ_t^C ⊇ {R_{i+1} ⇝ R_i}

Dynamics
F : Transformation dynamics: (x_{t+1}, p_{t+1}) = F(x_t, p_t); hierarchical form: R_i^(t+1) = R_{i+1}^(t)(R_i^(t); R_{k_max}^(t))
Π : Projection operator: produces Φ_t^R from Φ_t and state s_t

Systemic properties
T : Transformativity: maximum level at which the system can transform itself
A : Organizational autonomy: capacity to maintain its conditions of existence
S1, S2, S3 : Simulation modalities: logical, execution/sandbox, predictive

1 Introduction

1.1 Two literatures, one common question

In cognitive science, an empirical convergence establishes that human cognition is not reducible to local responses to stimuli but deploys hierarchical, stable, and revisable control structures. Miller & Cohen (2001) showed that the prefrontal cortex actively maintains goal representations that bias processing in posterior systems: the agent implements organizing structures, not merely responses. Koechlin et al. (2003) and Badre & Nee (2018) documented a cascading organization of executive processes along a rostro-caudal axis of the frontal cortex, where progressively more abstract forms of control govern more local operations (see also Botvinick et al., 2009). Nelson & Narens (1990) formalized a two-level architecture, object-level and meta-level, connected by monitoring and control flows; Fleming & Dolan (2012) showed that metacognitive accuracy is dissociated from first-order performance and associated with anterior prefrontal cortex. Computational cognitive architectures such as ACT-R (Anderson, 2007), SOAR (Laird, 2012), and CLARION (Sun, 2002) implement various levels of metacognitive control, but none formalizes the conditions under which cognitive modification changes regime.
In parallel, contemporary artificial intelligence realizes a concrete spectrum of self-modification levels. A classical Markov decision process operates with fixed rules; policy gradient learning adjusts low-level rules under an invariant reward function; meta-reinforcement learning (Wang et al., 2016) modifies the learning rules themselves; architecture search (Zoph & Le, 2017) rewrites the network structure. Upstream of these achievements, the computational reflection tradition formalized the conditions of possibility for self-modification: Smith (1984) showed that a system can contain an operative representation of its own interpreter and modify it; Maes (1987) clarified the minimal threshold by positing that reflexivity requires an internal representation that is causally active in determining behavior.

These two literatures address the same question: under what conditions can a cognitive system intervene on the rules that govern its own functioning? Yet they do not converge toward a common framework. The literature on cognitive control and metacognition distinguishes monitoring from regulation, confidence from performance, task switching from adaptation, but does not formalize the conditions under which these modifications change in nature. Conversely, the AI literature provides systems that modify their rules at various levels, but without criteria enabling comparison of these modifications with those of human cognition within the same conceptual space.

This gap is not merely terminological. A frontal patient who perseverates on the Wisconsin Card Sorting Test despite explicit negative feedback (Milner, 1963) has lost the ability to modify an active rule while retaining intact associative learning.
A patient in metacognitive therapy who learns to treat "worrying protects me" as a revisable rule (Wells, 2009) performs a transformation qualitatively distinct from an associative adjustment. These two cases, developed formally in §3, illustrate a gap that the literature does not bridge: between modifying low-level rules within a fixed framework and modifying the framework itself. Katyal & Fleming (2024) note that contemporary metacognition research must recover greater construct breadth, beyond its dominant core centered on confidence.

1.2 Contribution

This article proposes to bridge this gap through a minimal theory of cognitive modification regimes. Our question is twofold: (i) what are the formally distinct regimes of modification that a cognitive system can exert on its own rule hierarchy, and (ii) how does the extent of self-representation determine which regimes are accessible to it? We seek neither to propose a complete theory of mind nor to subsume all forms of cognitive regulation under a single principle, but to introduce a level of analysis that is missing between existing theories.

We specify at the outset what we mean by endogenous self-modification: a self-modifying system is one capable of producing, through its own functioning, a transformation of its functional structure, not merely of its states or outputs. This specification excludes purely exogenous modifications, while acknowledging that in systems coupled to their environment (Odling-Smee et al., 2003), the boundary between endogenous and exogenous is porous.

To address this question, we first show (§2) that the question "what is a self-modifying system?" imposes a minimal structure: a rule hierarchy, a fixed core, and a distinction between effective rules, represented rules, and causally accessible rules. From this formalism, we distinguish four regimes (§3), each anchored in a characterized cognitive phenomenon.
We apply the framework to human cognition (§4), identifying the crossing of opacities as a structural signature and positioning the framework relative to theories of consciousness. The taxonomy gives rise to four structural problems in cascade (§5): the independence of transformativity and autonomy, the viability of self-modification, the teleological lock, and identity under transformation, of which the first three receive partial resolutions within the framework. We derive four testable predictions and propose an experimental protocol (§6). The discussion (§7) synthesizes the contributions and identifies limitations and future directions. These results constitute a bridge between cognitive science, philosophy of mind, and AI safety.

2 Model

The framework rests on the idea that a self-modifying cognitive system can be described as a hierarchy of rules of which the system represents only a subset. This section justifies this structure (§2.1), derives it logically (§2.2), and then formalizes it (§2.3).

2.1 Why a hierarchy of rules?

The hypothesis of hierarchical organization is supported by three convergent arguments.

Empirical argument. Human behavior manifests stable, transferable, and revisable control structures. Miller & Cohen (2001) showed that the prefrontal cortex maintains goal representations that bias posterior processing. Conflict and adaptation paradigms show that these structures are modulable: subjects strategically adjust their selection priorities (Botvinick et al., 2001). Becker et al. (2023) showed experimentally that systematic metacognitive reflection leads subjects to adopt more far-sighted strategies: what is revised is not a punctual response but a decision structure. Koechlin et al. (2003) and Badre & Nee (2018) documented a cascading organization of executive processes along a rostro-caudal axis.

Structural argument.
Simon (1962) showed that stable complex systems are necessarily nearly decomposable. His argument rests on the parable of the watchmakers: between a watchmaker who assembles a thousand pieces in one go and one who assembles stable sub-modules of ten pieces each, only the second survives interruptions, because each perturbation destroys only a sub-module, not the entire assembly. For a self-modifying system, this constraint is reinforced: if any component could be modified without hierarchical organization, a local revision could propagate its effects through the entire system without an isolation mechanism, producing cascades of uncontrolled revisions. More formally, stability requires that interactions between components within a level be much more frequent and rapid than interactions between levels: the property of near-decomposability. Hierarchy is therefore not merely an empirical property of stable systems; it is a viability condition for self-modification.

Logical argument. As we show in §2.2, hierarchy is not merely an observation or a viability constraint: it derives from the very question "what is a self-modifying system?"

2.2 Derivation of the minimal structure

The proposed structure is not one modeling choice among others. It derives, step by step, from the question: what is a self-modifying system?

Step 1: an ordinary dynamical system does not suffice. A classical dynamical system evolves according to a fixed law: x_{t+1} = f(x_t). The state x_t changes, but the law f remains invariant; it belongs to the definition of the system, not to its content. Such a system adapts its states, never its processes.

Step 2: processes must be distinct internal objects. For a system to modify its own processes, these processes must be represented as entities distinct from the ordinary state. We therefore introduce two components: a state x_t and a process p_t.
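The contrast between step 1 and the state/process distinction can be made concrete in a short sketch. The maps f and F below are hypothetical toy laws, not part of the formalism; the point is only the shape of the two dynamics:

```python
# Step 1: an ordinary dynamical system. The law f is fixed; only the state changes.
def f(x):
    return 0.5 * x + 1.0  # hypothetical fixed law

x_fixed = 0.0
for _ in range(3):
    x_fixed = f(x_fixed)  # states evolve; the process never does

# Step 2 (anticipating step 4): the process p is part of the total state s = (x, p),
# and the dynamics F returns both the next state and the next process.
def F(x, p):
    x_next = p(x)            # the current process produces the next state

    def p_next(y):           # ...and the process itself is revised
        return p(y) + 0.1

    return x_next, p_next

x, p = 0.0, (lambda y: 0.5 * y + 1.0)
for _ in range(3):
    x, p = F(x, p)  # both components evolve: minimal endogenous self-modification
```

After three steps the fixed-law system sits at x = 1.75 and will track f forever, while in the second system both the state and the law governing its future evolution have changed.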
As soon as one requires that a process be modifiable, one imposes that it be identifiable, distinguishable, and replaceable, hence that it be a manipulable object. A manipulable object that determines the system's behavior is, in the minimal sense, an operative rule.

Step 3: rules must be part of the system's state. If rules remain external to the system, any modification can only come from outside: the system does not modify itself; it is modified. For self-modification to be endogenous, rules must be internalized. The total state becomes s_t = (x_t, p_t): not only what the system is, but also how it operates.

Step 4: the dynamics must produce both the next state and the next rule. The functioning at time t must be able to determine not only x_{t+1} but also p_{t+1}: (x_{t+1}, p_{t+1}) = F(x_t, p_t). If F produces only x_{t+1}, p_t remains fixed and one falls back into an ordinary dynamical system.

Step 5: the infinite regress imposes a fixed core. The system can modify p_t, but F remains fixed. One might want to make F modifiable by introducing a meta-rule, then a meta-meta-rule, and so on. This escalation is endless. Any attempt to eliminate the distinction between the level that is modified and the level that modifies leads to an infinite regress or circularity. The consequence is structural: any coherent self-modifying system necessarily possesses, at least at each time t, a minimal fixed core. This result is not a postulate: it is a logical constraint. It directly grounds the notion of the teleological norm R_{k_max} and Proposition 1 (causal closure) established in §3.3. The qualification "at each time t" is decisive: it leaves open the possibility that the level serving as the fixed core at one moment may itself become an object of revision at a later moment, a possibility explored in Regime 4 (§3.4).

Step 6: partial representation is unavoidable.
The dynamics F operates on the total state, but nothing guarantees that the system has complete access to its own rules. A biological system implements rules in neural networks of which it has no explicit representation. A computational system can access the source code to which it is given access, yet will discover emergent properties of its execution. Self-modification therefore does not operate on the complete set of effective rules, but on the system's representation of them, a representation that is generally partial, compressed, and potentially inaccurate. This observation requires distinguishing the hierarchy of effective rules from the subset that the system represents.

Synthesis. The structure Φ_t = {R_0, …, R_{k_max}}, a fixed core R_{k_max}, and a distinction Φ_t^R ⊆ Φ_t result from the six steps of the derivation, each necessary for the notion of self-modification to be well-defined: an ordinary fixed-law system does not suffice (step 1), processes must be internal objects (step 2), internalized in the state (step 3), produced by the dynamics (step 4), bounded by a fixed core (step 5), and accessible via partial representation (step 6). Each is satisfied by the cognitive systems, biological or artificial, that the framework seeks to describe.

2.3 Formalism

We denote by Φ_t = {R_0^(t), R_1^(t), …, R_{k_max}^(t)} the functional state of the system at time t, from the most concrete rules R_0 to the teleological norm R_{k_max}. The convention is one of an ascending hierarchical gradient: higher levels govern lower levels.

Terminological clarification. The term "rule" is used in a broad functional sense: it designates any relatively stable structure that constrains or organizes a class of cognitive operations and can become the object of revision. A rule may be instantiated as a goal, an action policy, a heuristic, a generative model, a probabilistic expectation, or a precision-weighting scheme.
The notion is neutral with respect to substrate: it is compatible with active inference (Friston, 2010), which models cognition in terms of generative models rather than symbolic rules. The shift from "process" to "rule" is not an added hypothesis but a consequence of the modifiability requirement (step 2). Every adaptive system changes state; but a self-modifying system modifies at least some of the rules that govern its future evolution.

This precision allows two frequent confusions to be set aside. First, a change in the system's outputs (a different response to a different stimulus) does not constitute self-modification in the sense of the framework: only a change in the hierarchy Φ_t constitutes a modification. A thermostat that adjusts room temperature changes the state of the world, but not its own control rules. Second, the fact that a system partially represents itself (Φ_t^R ≠ ∅) does not yet imply that it can transform the principles organizing its activity. Self-modification thus requires not only variability but a certain internalization of the system's operative structures, and, for Regime 4, effective causal power over the norm.

Representation and causal power. The system does not necessarily have complete access to Φ_t. We denote by Φ_t^R ⊆ Φ_t the set of rules that the system represents to itself, and by Φ_t^C ⊆ Φ_t the set of rules over which the system possesses effective causal power of modification. These two sets do not necessarily coincide, and neither is necessarily included in the other. Three configurations illustrate the distinction.

(i) Causal power without representation. A network trained by gradient descent modifies its weights (R_0) at each iteration under the action of R_1: θ_{t+1} = θ_t − η∇L(θ_t). This process is causally effective, transforming R_0, but the system represents neither R_1 nor the fact that it is learning.
R_1 is in Φ_t^C but not in Φ_t^R.

(ii) Representation without causal power. A human can, through brain imaging, represent the synaptic strength of certain circuits (R_0 in Φ_t^R). But this knowledge confers no causal power: knowing that a given synapse has a given strength does not imply being able to modify it. R_0 is in Φ_t^R but not in Φ_t^C.

(iii) Representation and causal power jointly. A human in metacognitive therapy represents their meta-beliefs (R_2 in Φ_t^R) and can revise them through deliberation (R_2 in Φ_t^C). This is the case where modification is both reflexive and causally effective.

Local structure of an operation R_{i+1} → R_i. Every downward arrow in the hierarchy, every act by which a higher level modifies a lower level, possesses the same internal structure with two components. The reflexive component, when non-empty, contains three elements: Φ_t^R ⊇ {˜R_{i+1}, ˆ(R_{i+1} → R_i), ˜R_i}, where the tilde (˜) denotes a possibly compressed, partial, or inaccurate representation: a representation of the higher rule, of the causal link between levels, and of the modified rule. The causal component contains the effective capacity of R_{i+1} to modify R_i: Φ_t^C ⊇ {R_{i+1} ⇝ R_i}, where ⇝ denotes a possibly degraded causal power. The independence of the two components is essential: a system can causally modify R_i without any representation of what it is doing (gradient descent: Φ_t^R = ∅, Φ_t^C ≠ ∅), or finely represent the link R_{i+1} → R_i without being able to act on it (introspection of a neural circuit: Φ_t^R ≠ ∅, Φ_t^C = ∅).

The central hypothesis is that self-modification operates via two pathways: a structural pathway (the mechanism F modifies Φ_t via Φ_t^C, whether or not there is representation) and a reflexive pathway (the system modifies Φ_t by passing through Φ_t^R ∩ Φ_t^C: representation serves as a causal lever).
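Since Φ_t^R and Φ_t^C are plain subsets of Φ_t, configurations (i)-(iii) reduce to set operations. A minimal sketch (the level labels and the particular membership choices are illustrative, echoing the three examples above):

```python
phi_t = {"R0", "R1", "R2", "Rkmax"}  # effective hierarchy
phi_R = {"R0", "R2"}                 # represented: imaged circuit (R0), meta-belief (R2)
phi_C = {"R1", "R2"}                 # causally accessible: learning rule (R1), meta-belief (R2)

reflexive     = phi_R & phi_C  # (iii) represented and modifiable: metacognitive therapy
blind         = phi_C - phi_R  # (i)   modified without representation: gradient descent
introspective = phi_R - phi_C  # (ii)  represented without leverage: brain imaging

assert reflexive == {"R2"} and blind == {"R1"} and introspective == {"R0"}
assert phi_R <= phi_t and phi_C <= phi_t  # both are subsets of the effective hierarchy
```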
Because Φ_t^R and Φ_t^C are both partial, self-modification is necessarily limited. The framework makes these limits formalizable. The gap between Φ_t^R and Φ_t, and more precisely the extent of Φ_t^R, thus constitutes the central variable of the framework, since it determines which forms of self-modification are accessible to the system.

Teleological norm. We call R_{k_max} the teleological norm: the ultimate evaluation criterion that orients the transformation dynamics. In a supervised learning system, R_{k_max} corresponds to the loss function; in reinforcement learning, to the reward signal; in a living organism, to viability constraints; in human cognition, to a more heterogeneous set of goals, preferences, values, or explicit commitments, a set that is not necessarily unitary or coherent, as value conflicts and moral dilemmas illustrate. The important point is not to impose a single substantive theory of finality, but to recognize that a non-trivial self-modifying system does not transform its rules indifferently: its revisions are oriented by higher-level constraints. Whether R_{k_max} is itself accessible to revision is what separates Regime 3 from Regime 4.

Modes of action. The system has two fundamental modes of action. In the first, a rule R_i acts on the world without modifying Φ_t. In the second, a higher-level rule R_{i+1}, acting on a portion of the architecture, modifies R_i. A third, intermediate case deserves mention: an action on the world can modify Φ_t through feedback; instrumental practice (playing the piano), for example, modifies operative rules through sensorimotor coupling. This case poses no difficulty for the framework: environmental feedback enters F as an input factor. The minimal dynamics can be summarized as:

R_i^(t+1) = R_{i+1}^(t)(R_i^(t); R_{k_max}^(t)).
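One synchronous pass of this dynamics can be sketched with numeric stand-ins for the rules. The particular update functions are hypothetical; only the governance pattern mirrors the equation: R_1 modifies R_0, R_2 governs R_1, and R_{k_max} stays fixed.

```python
R_KMAX = 1.0  # teleological norm: fixed evaluation target (cf. Proposition 1)

def R2(r1_strength, norm):
    """Meta-rule governing R_1; invariant here, so R_1 itself is never revised."""
    return r1_strength

def R1(r0, norm, strength):
    """Learning rule: revises the lowest-level rule R_0, oriented by the norm."""
    return r0 + strength * (norm - r0)

def step(r0, r1_strength):
    """One pass of R_i^(t+1) = R_{i+1}^(t)(R_i^(t); R_kmax^(t))."""
    return R1(r0, R_KMAX, r1_strength), R2(r1_strength, R_KMAX)

r0, s = 0.0, 0.5
for _ in range(3):
    r0, s = step(r0, s)
# r0 moves toward the norm (0.0 -> 0.5 -> 0.75 -> 0.875) while R_1 and R_kmax stay fixed
```

Making R2 return a revised strength instead of the identity would move this toy from Regime 2 into Regime 3: the higher level would then genuinely rewrite the rule below it.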
In full generality, the modification of R_i can depend on the entire represented hierarchy Φ_t^R and on environmental input. The hierarchical notation captures the minimal constraint: modification is principally governed by a higher level and oriented by the teleological norm, an idealization empirically motivated by the cascading organization of prefrontal control (Koechlin et al., 2003; Badre & Nee, 2018). It does not imply that every real architecture is strictly linear or that each level acts only on the immediately lower level.

Figure 1: Hierarchical architecture of the functional state Φ_t and local structure of an operation R_{i+1} → R_i. Left panel: the system Φ_t is organized as a hierarchy of rules from observable behaviors (R_0) to the teleological norm (R_{k_max}). Each level corresponds to a self-modification regime (1-4). The dashed red contour indicates Φ_t^R in the human profile (Φ_t^R,human ≫ Φ_t^C,human at upper levels); the dashed blue contour indicates Φ_t^R in the AI profile (Φ_t^R,AI ≈ Φ_t^C,AI at lower levels). Environmental feedback E_t ⇝ Φ_t enters the dynamics via F. Right panel: local view of an operation R_{i+1} → R_i, decomposed into a reflexive component (Φ_t^R ⊇ {˜R_{i+1}, ˆ(R_{i+1} → R_i), ˜R_i}) and a causal component (Φ_t^C ⊇ {R_{i+1} ⇝ R_i}), with the hierarchical dynamics equation.

3 Four self-modification regimes

We distinguish four regimes, each characterized by the level of Φ_t at which modification operates. Regimes are cumulative: each higher regime presupposes the capacities of the preceding one. We anchor each regime in a cognitive phenomenon and an artificial system. From Regime 2 onward, each distinction yields a theoretical result that the existing literature does not provide.

3.1 Regime 1: action without self-modification

Definition 1 (Fixed regime). A system is in the fixed regime if Φ_{t+1} = Φ_t for all t.
The system produces outputs but does not modify any component of its functional hierarchy.

Formal profile. Φ_t^R = ∅ (the system represents none of its rules); Φ_t^C reduces to the action of R_0 on the environment, a causal power directed outward, not toward the internal architecture. No rule is an object of modification.

Empirical anchor. The spinal reflex arc: the withdrawal of the hand from a nociceptive surface is mediated by a polysynaptic circuit fixed developmentally (Sherrington, 1906). The reflex latency (∼35 ms for the H-reflex) does not vary with experience (Pierrot-Deseilligny & Burke, 2012). In the formalism: R_0 = {stimulus → motor response}, invariant; Φ_t^R = ∅; Φ_t^C(internal) = ∅.

Artificial anchor. A finite automaton with a fixed transition table.

Boundary. Habituation, the progressive decline of the response to a repeated stimulus (Rankin et al., 2009), constitutes the borderline case: as soon as R_0 is modified, the system enters Regime 2.

3.2 Regime 2: low-level modification

Definition 2 (Local regime). A system is in the local regime if only the lowest-level rules are modified, under invariant governance. Formally:

R_0^(t+1) = R_1(R_0^(t); R_{k_max}),

where R_1, …, R_{k_max} are invariant.

Formal profile. Φ_t^R contains at most the current value of R_0, but not the description of R_1 as a rule; Φ_t^C contains R_0 (gradient descent or the Rescorla-Wagner rule effectively modifies R_0), but this causal power is exercised without representational mediation: it is a blind modification (R_1 ∈ Φ_t^C but R_1 ∉ Φ_t^R). The system does not "know" that it is learning.

Ontological clarification and theoretical contribution. Regime 2 clarifies a fundamental distinction: the formalism does not separate "parameters" and "rules"; it contains only rules ordered by degree of abstraction.
What the literature calls a "parameter" (the associative strength V, the weights of a network) is the lowest-level rule R_0. What the literature calls the "learning rule" is R_1. V determines the system's behavior at time t (it is R_0); ΔV = αβ(λ − V) prescribes how V changes (it is R_1). The relevant distinction is not the mathematical type of the object (the distinction between value and function exists) but its position in the governance hierarchy. In such a simple system, the learning rule is the evaluation criterion; this indifferentiation is the signature of minimal Regime 2: the system cannot intervene on its norm because the norm is not represented as distinct from the rule that implements it. The term "rule learning" is commonly used to describe phenomena that, in our formalism, amount to the modification of R_0 under fixed R_1. The distinction is testable: in Regime 2, transfer errors are errors of R_0 (maladapted associative strength); in Regime 3, they are errors of R_1 (learning strategy maladapted to the task structure). Blocking (Kamin, 1969) and long-term potentiation (Bliss & Lømo, 1973) confirm that modification, in classical conditioning, bears exclusively on R_0.

Empirical anchor. The Rescorla-Wagner model (Rescorla & Wagner, 1972) allows a complete instantiation. R_0 designates the system's input-output rule: the function that, for a given conditioned stimulus, produces a conditioned response of a given strength. R_1 designates the update rule ΔV = αβ(λ − V), which governs the modification of R_0. The prediction error (λ − V) modifies R_0, but R_1 is structurally invariant. The crucial point is that the system does not represent R_1 as a rule: it does not "know" that it is learning.

Artificial anchor. Gradient descent: the weights, which constitute R_0, change according to R_1: θ_{t+1} = θ_t − η∇L(θ_t). But R_1, η, and L (R_{k_max}) are fixed.
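The Rescorla-Wagner anchor can be written out directly. In the framework's terms, V is R_0 and the update function is the invariant R_1; the parameter values below are illustrative, not taken from any particular experiment:

```python
def rw_update(V, lam, alpha=0.3, beta=1.0):
    """R_1: the Rescorla-Wagner rule, delta-V = alpha * beta * (lambda - V).

    It modifies R_0 (the associative strength V) but is itself never modified:
    the signature of Regime 2."""
    return V + alpha * beta * (lam - V)

V = 0.0  # R_0: associative strength of the conditioned stimulus
for trial in range(20):
    V = rw_update(V, lam=1.0)  # reinforced trials (lambda = 1)
# V approaches the asymptote lambda, while R_1 (the function itself) never changed
```

Swapping rw_update for a Pearce-Hall-style rule mid-run would be a change of R_1, and hence, per the boundary below, a step into Regime 3.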
Boundary. If R_1 changes, as in the shift from Rescorla-Wagner to Pearce-Hall (Pearce & Hall, 1980), which modifies how attention modulates learning, the system operates in Regime 3.

3.3 Regime 3: rule modification

Definition 3 (Structural regime). A system is in the structural regime if modification can bear on rules R_i of arbitrary level, with the exception of R_{k_max}. Formally:

R_i^(t+1) = R_{i+1}(R_i^(t); R_{k_max}),

where R_{k_max} is invariant.

Formal profile. Φ_t^R includes R_1 as an explicit object: the system represents its own rule and can treat it as revisable. Φ_t^C includes R_1 via R_2 (the meta-rule of change). The transition from Regime 2 to Regime 3 corresponds formally to the entry of R_1 into Φ_t^R ∩ Φ_t^C: the rule is no longer merely applied but represented and causally accessible.

Empirical anchor. The Wisconsin Card Sorting Test (Milner, 1963; Monchi et al., 2001). R_0 = sorting action; R_1 = active sorting rule ("sort by color"); R_2 = meta-rule ("if persistent negative feedback, change sorting criterion"); R_{k_max} = implicit norm ("maximize correct responses"). R_{k_max} does not prescribe how to achieve the objective; R_2 is a particular strategy for satisfying R_{k_max}, and other R_2s would be possible under the same R_{k_max}.

Artificial anchor. Meta-reinforcement learning (Wang et al., 2016): a meta-RL agent learns a learning rule (R_1) under fixed R_{k_max}. Architecture search (Zoph & Le, 2017) and Gödel machines (Schmidhuber, 2007) push this principle to its limits.

Proposition 1 (Causal closure). The logical constraint established at step 5 (§2.2) translates as follows: in any system in Regime 3, there exists at each time t a level k_max such that R_{k_max} is fixed.
This constraint is recognized in the AI safety literature under the name of goal stability (Soares & Fallenstein, 2017), but had not been formulated in terms of hierarchical rule levels nor connected to the clinical dissociation between regimes.

Theoretical contribution: a predicted double dissociation. The formalism predicts that Regimes 2 and 3, having distinct targets in Φ_t, are dissociable by selective lesion. This is exactly what is observed: patients with dorsolateral prefrontal lesions exhibit perseverative errors on the WCST, continuing to apply R_1 ("sort by color") despite persistent negative feedback, yet retain intact associative conditioning (Monchi et al., 2001). The dissociation is not merely compatible with the framework: it is a direct consequence, because the two capacities operate on formally distinct targets in Φ_t, namely R_0 for conditioning and R_1 for rule change. The switch cost (Monsell, 2003) reflects the computational load of replacing R_1, a cost absent in Regime 2, where only R_0 changes under fixed governance.

Transition. Proposition 1 establishes that any system in Regime 3 possesses a fixed core R_{k_max}. But the healthy human seems to violate this closure: they can deliberate on their values, revise their moral criteria, choose experiences they know will transform their preferences. How can a fixed-core system revise the core itself?

3.4 Regime 4: teleological revision

Definition 4 (Reflexive regime). A system is in the teleological revision regime if R_{k_max} ∈ Φ_t^C and R_{k_max} ∈ Φ_t^R, that is, if R_{k_max} is both represented and accessible to the system's causal power of modification. The condition R_{k_max} ∈ Φ_t^C alone (without Φ_t^R) would correspond to a physical system that entirely reconfigures its architecture without representational mediation, a form of self-modification that is powerful but non-deliberative.
Regime 4 in the strong sense requires the conjunction: the system modifies R_{k_max} through the representation it has of it, which opens the possibility of evaluation—and hence of the teleological lock. The threshold separating Regime 3 from Regime 4 is precise: R_{k_max} enters Φ_t^R ∩ Φ_t^C. For this entry to be non-trivial, the representation must be causally active in the sense of Maes (1987).

Theoretical foundations. Regime 4 rests on two results that must be distinguished. The universal Turing machine (1936) showed that a system can simulate any machine, including itself—but simulation is not reflexivity: the universal machine does not represent itself to itself as an object of modification. Smith (1984) crossed an additional threshold in the 3-LISP architecture: the system's representation of its own interpreter is not an inert simulation but a causal lever—modifying it changes the system's behavior in real time. The system can reason about this representation, modify it, and then execute the modified version—realizing an interpretive reentry. This is the transition from self-simulation to reflexive self-modification. Maes (1987) then generalized this threshold independently of any architecture: a system is reflexive if and only if its internal representation of its own structure causally intervenes in its behavior. Description alone constitutes introspection; causally active description constitutes reflexivity.

Empirical anchor. Metacognitive therapy (MCT; Wells, 2009) provides the clearest clinical instantiation. Consider a patient with generalized anxiety disorder: R_0 designates the emotional and behavioral responses; R_1 the first-order beliefs ("this situation is dangerous"); R_2 the meta-beliefs governing R_1 ("worrying protects me"); R_{k_max} the ultimate evaluation criterion.
Classical CBT targets R_1; MCT targets R_2 and R_{k_max}: it leads the patient to treat "worrying protects me" not as a fact but as a revisable rule. Longitudinal studies confirm that changes in meta-beliefs predict symptomatic improvement (Solem et al., 2009), which is compatible with the thesis that the target is hierarchically higher—but strict temporal precedence remains to be demonstrated.

The transformative experience in the sense of Paul (2014) constitutes the framework's limiting case. An agent chooses an experience (parenthood, profound conversion) knowing that it will modify their preferences unpredictably. The evaluation of the modification of R_{k_max} requires R_{k_max} as a criterion—but R_{k_max} is what will change. This is the very structure of the teleological lock (§5.3).

Artificial anchor. Regime 4 remains largely programmatic in AI. Current reward shaping systems modify the reward function during training, but this modification is governed by a fixed meta-criterion. Work on goal stability (Everitt et al., 2021; Soares & Fallenstein, 2017) formalizes related questions in different frameworks. Our framework provides a complementary perspective by situating the problem within a hierarchy of cognitive rules and their representation.

Proposition 2 (Reflexive openness). A system is reflexively open if, for all k, it is in principle possible that R_k ∈ Φ_t^R at a later time.

The tension between Propositions 1 and 2 is apparent: the first imposes a fixed core at each moment, the second says that no level is definitively excluded. The resolution is that causal closure is local and temporal, not absolute. The human satisfies Proposition 1 at each moment while dynamically shifting the core. It is this mobility of the fixed point—not its absence—that characterizes Regime 4. The displacement of the core follows two pathways in humans.
(a) Through environmental coupling: external variables (events, losses, relationships) are internalized as internal states that modify how R_{k_max} evaluates situations (Schwabe & Wolf, 2013; Baumeister, 1990). The core shifts without being explicitly revised. A bereavement, for example, can contract the space of imaginable futures to the point where the decision criteria that oriented the subject's life lose their normative force and are replaced by criteria of immediate survival. R_{k_max} has changed, but not through an act of deliberate revision: it is the system's internal state, modified by coupling with the environment, that shifted the evaluative fixed point. Baumeister (1990) showed that this mechanism can lead to the extreme: when the space of perceived futures contracts to the point of becoming empty, the subject may come to revise R_{k_max} to the annulment of the viability constraint itself.

(b) Through prospective commitment: the system commits to an action it knows will modify R_{k_max}—the current core authorizes the leap, but does not control the result. An individual who decides to become a parent performs a prospective commitment: their current criteria authorize the leap, but do not control the result—parenthood transforms preferences, priorities, and evaluation criteria themselves in unpredictable ways (Paul, 2014). The R_{k_max} that will retrospectively evaluate the decision will no longer be the one that authorized it—the norm changed in the meantime, and it is from the new norm that the old decision is judged. Longitudinal data on the transition to parenthood confirm this pattern: values, temporal priorities, and risk thresholds reorganize in ways unforeseeable by the subject (Nelson et al., 2014).

For an AI, a third pathway would be possible: (c) through existential sandbox.
The system instantiates itself in a comparable or identical environment, applies the modification of R_{k_max} to the copy, observes what the copy does and becomes under the new R_{k_max}, then evaluates the result from its current R_{k_max}—and decides whether to adopt the modification. This possibility is structurally inaccessible to the human, who cannot duplicate themselves. It constitutes a fundamental asymmetry between the two types of systems in their relationship to normative revision. The framework does not provide a formal dynamics of this displacement—this is an acknowledged limitation (§7.2).

Table 1: Correspondences between regimes, targets, representation/causality profiles, cognitive phenomena, and artificial systems.

Regime | Target | Φ_t^R | Φ_t^C | Cognitive phenomenon | Artificial system
1: Fixed | None | ∅ | ∅ (internal) | Spinal reflex | Finite automaton
2: Local | R_0 | R_0 current | R_0 (via R_1, blind) | Conditioning | Gradient descent
3: Structural | R_i (i < k_max) | R_i represented | R_i (via R_{i+1}) | WCST | Meta-RL, AutoML
4: Reflexive | R_{k_max} | R_{k_max} ∈ Φ_t^R | R_{k_max} ∈ Φ_t^C | MCT, transf. exp. | (programmatic)

4 Application to the human case

4.1 The human profile: extent and opacity of Φ_t^R

The monitoring/control architecture of Nelson & Narens (1990) describes a loop in which the meta-level receives information from the object-level and modulates its operations in return: in our framework, certain components of Φ_t become objects of Φ_t^R, so that modification is mediated by self-modeling. Metacognitive accuracy dissociated from first-order performance (Fleming & Dolan, 2012) shows that Φ_t^R possesses its own dimensions of fidelity, independent of Φ_t. In humans, the profile of Φ_t^R is structurally asymmetric. Humans represent their teleological norms (R_{k_max}) relatively well: they can deliberate on their values, life goals, and moral criteria (Paul, 2014; Wells, 2009).
But they have virtually no access to their low-level operative rules (R_i for small i): how they recognize a face, produce a grammatical sentence, or adjust fine motor movements.

Developmental data show that Φ_t^R is constructed progressively. Zelazo (2004) showed that the capacity to represent rules of increasing level follows an ordered developmental trajectory: a 3-year-old can follow a single rule, a 5-year-old can maintain two rules in alternation, and it is only around 7–8 years that the capacity to represent the rule for selecting between rules emerges—a trajectory that corresponds, in the formalism, to a progressive expansion of Φ_t^R toward increasingly higher hierarchical levels. Karmiloff-Smith (1992) formalized the underlying mechanism: representational redescription, by which implicit procedures (R_i that are effective but not represented) become explicit objects of Φ_t^R—a necessary condition for an intervention by R_{i+1} to become possible.

But this process of constructing Φ_t^R is not faithful. Nisbett & Wilson (1977) showed that subjects do not have access to the real causes of their judgments and confabulate post-hoc reasons. This result does not contradict the existence of Φ_t^R—it qualifies it: the "reasons" reported are not high-level R_i to which the subject has access, but reconstructions within Φ_t^R that purport to describe Φ_t without corresponding to it.

Table 2: Compared profiles of representation (Φ_t^R) and causal power (Φ_t^C).

| Φ_t^R (representation) | Φ_t^C (causal power)
Human — R_{k_max} | Rich | Strong (deliberation)
Human — R_0 | Poor (opaque) | None
AI — R_0 ... R_i | Rich (transparent) | Strong (read/write)
AI — R_{k_max} | None | None

The true causes of the judgment (the low-level R_i that actually produce the decision) remain opaque; what the subject reports is a coherent confabulation. Johansson et al.
(2005) reinforced this thesis with the choice blindness paradigm: subjects accept and rationalize choices they did not make, confirming that Φ_t^R is constructed by interpretation, not by direct reading of Φ_t (Carruthers, 2011). The divergence Φ_t ≠ Φ_t^R is therefore not an empirical anomaly—it is the structural norm predicted by the framework.

Conversely, a Smithian system (3-LISP; Smith, 1984) has transparent access to its operative rules but no endogenous capacity for revision of R_{k_max}. This crossing of opacity profiles constitutes the framework's central result: Φ_t^R ⊇ {R_{k_max}} but Φ_t^R ⊉ {R_0, R_1, ...} in humans, whereas Φ_t^R ⊇ {R_0, ..., R_{k_max−1}} but Φ_t^R ⊉ {R_{k_max}} in Smith-type architectures.

Representation and causal power. The crossing concerns the extent of Φ_t^R. But representing a rule and being able to act on it are two distinct capacities. Φ_t^C ⊆ Φ_t designates the subset over which the system possesses effective causal power. In general, Φ_t^C ≠ Φ_t^R.

A thought experiment clarifies the distinction. A human who, through advanced imaging, managed to represent the entirety of their connectome would have a Φ_t^R covering Φ_t. But this knowledge would confer no causal power: knowing that a given synapse has a given strength does not imply being able to modify it. Φ_t^C would remain confined to the upper levels. The extension of Φ_t^R to lower levels would be accompanied by an extension of Φ_t^C only through an additional technique (neurostimulation, brain-computer interface) transforming knowledge into a causal lever.

For a Smithian system, Φ_t^C ≈ Φ_t^R at operational levels: what the system represents (its code, its weights), it can modify in the same act—reading and writing operate in the same formal space. But at upper levels, Φ_t^C is empty, because R_{k_max} is not in Φ_t^R.
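The crossed opacity profiles are just set relations over hierarchy levels, so they can be stated directly as code. The sketch below is illustrative only: the level sets follow Table 2, but the variable names and the helper are hypothetical.

```python
# Illustrative sketch: crossed opacities as set membership over hierarchy
# levels 0..KMAX. Profile contents follow the text; names are hypothetical.

KMAX = 3
LEVELS = set(range(KMAX + 1))   # Phi_t: the effective rules R_0..R_kmax

# Human profile: representation and causal power concentrated at the top.
human_R = {KMAX}                # Phi_t^R
human_C = {KMAX}                # Phi_t^C

# Smith-type reflexive system: rich at operational levels, empty at R_kmax.
smith_R = LEVELS - {KMAX}
smith_C = LEVELS - {KMAX}

def opaque(levels_R):
    """Levels that are effective in Phi_t but absent from Phi_t^R."""
    return LEVELS - levels_R

# The crossing: each system is opaque exactly where the other is transparent.
print(opaque(human_R))          # operational levels, opaque to the human
print(opaque(smith_R))          # the core, opaque to the Smithian system
assert opaque(human_R) & opaque(smith_R) == set()
```

The final assertion is the crossing itself: under these profiles, the two opacity regions are disjoint.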
The human singularity is that the human is the only known system possessing both Φ_t^R and Φ_t^C non-empty at the level of R_{k_max}—even if this causal power is imperfect, slow, and without guarantee. It is this conjunction that defines Regime 4.

Two additional properties reinforce the asymmetry. First, human normativity is not unitary. Schwabe & Wolf (2013) showed that stress triggers a switch from model-based control to model-free control—a reorganization of the governance hierarchy itself, without analogue in current artificial systems. Second, low-level opacity constrains modification to transit through the top of the hierarchy, conferring on human cognition a modification profile qualitatively distinct from that of computational systems, where modification transits through the bottom.

From a strictly structural standpoint, computational reflexivity can exceed human reflexivity on the transparency axis: a Smithian system sees what the human cannot see of themselves. But this transparency does not automatically imply an endogenous capacity for teleological revision.

4.2 Positioning relative to theories of consciousness

The Φ_t/Φ_t^R framework does not constitute a theory of phenomenal consciousness and takes no position on the hard problem (Chalmers, 1996). It does, however, formalize the functional structure of reflexivity—a territory that several theories of consciousness presuppose without fully formalizing. Two theoretical families are directly concerned; the others less so, for reasons we specify at the outset. Among contemporary theories, we retain higher-order theories (HOT) and Attention Schema Theory (AST) as primary interlocutors, because their central mechanism—the causally active metarepresentation—corresponds directly to Φ_t^R ∩ Φ_t^C.
Global Workspace Theory (Baars, 1988; Dehaene & Changeux, 2011) addresses an upstream question: how a content becomes globally accessible. This mechanism can be read as a condition of entry into Φ_t^R, but GWT does not theorize what the system does with this accessibility to reconfigure itself—which is precisely our object. Recurrent Processing Theory (Lamme, 2006) concerns perceptual stabilization in sensory loops—an implementation condition for certain R_i, not the representation of the system's own architecture. Integrated Information Theory (Tononi et al., 2016) concerns the integrated causal structure of the physical substrate; our framework is functional and algorithmic, hence orthogonal to IIT at the level of analysis. The COGITATE results (2025) confirm that the predictions of GWT and IIT are both empirically contested, which reinforces the interest of a distinct positioning.

Higher-order theories (Rosenthal, 2005; Lau & Rosenthal, 2011; Brown et al., 2019) hold that a mental state becomes conscious when it is the object of a higher-order representation. HOT asks: which states are represented by higher-order states? Our framework asks a broader question: which components of the functional architecture—low-level rules, meta-rules, norms—are in Φ_t^R, and does this representation have a causal power of revision (Φ_t^C)? HOT thus appears as the special case where Φ_t^R contains certain first-order representations. Our framework extends the analysis to control rules and teleological norms—a territory that HOT does not address.

AST (Graziano, 2013; Graziano & Webb, 2015) proposes that consciousness is a simplified internal model of attention serving attentional control—an instantiation of the Φ_t/Φ_t^R distinction restricted to the attentional subsystem. The decisive point of convergence is the shared thesis: it is the causal efficacy of self-representation, not its fidelity, that matters.
The framework advances on three problems that HOT and AST leave open or treat as anomalies. First, the formal separation of metarepresentation levels: the literature regularly conflates the presence of a content in processing, global accessibility, reportability, and causally active metarepresentation—a confusion noted by Katyal & Fleming (2024). The formalism separates them: a state can belong to Φ_t without belonging to Φ_t^R (non-represented processing); it can be represented in Φ_t^R without being faithfully so (introspective misrepresentation); it can be represented in Φ_t^R without belonging to Φ_t^C (introspection without reflexivity); it can be in Φ_t^C without being in Φ_t^R (blind causal modification). HOT and AST operate on special cases of these distinctions; the formalism unifies them into a grammar finer than the conscious/unconscious binary.

Second, introspective error as a structural property: HOT treats misrepresentation as a recurring theoretical problem (Block, 2011; Rosenthal, 2005). In our framework, the divergence Φ_t ≠ Φ_t^R is the structural norm—it is a direct consequence of the partiality of the projection operator Π. Data on confabulation, choice blindness, and self-interpretation are no longer theoretical embarrassments—they are direct predictions of the framework.

Third, a structural protocol for artificial consciousness: the notion of "representation of a representation" becomes trivial if it is not constrained (Butlin et al., 2025).
Our framework provides this constraint: the operationalizable question is not "does the system have a second-order representation?" but "which components of its architecture are in Φ_t^R and Φ_t^C, at which level, and does this representation serve to reconfigure functioning?" The crossing of opacities—humans represent their norms well but their operative rules poorly, reflexive systems exhibit the inverse profile—constitutes the most directly testable prediction.

The spectrum of artificial self-modification. Contemporary AI systems realize a spectrum of self-modification levels that confirms the crossing. At inference, an LLM modifies no component of Φ_t; in-context learning (Brown et al., 2020; Coda-Forno et al., 2023) mimics behavioral adaptation without modifying weights—a proto-regime. Fine-tuning and RLHF (Ouyang et al., 2022) constitute a standard Regime 2. This distinction between inference and training is decisive and often obscured. Meta-learning (Wang et al., 2016) and architecture search (Zoph & Le, 2017) reach Regime 3. The Darwin Gödel Machine (Zhang et al., 2025) pushes furthest: the system rewrites its own self-modification code, realizing an interpretive reentry in the sense of Smith (1984)—but R_{k_max} remains externally fixed. The common denominator is that R_{k_max} remains fixed and exogenous. Proposition 1 applies uniformly. The DGM's profile illustrates the inversion: Φ_t^R rich in operative R_i, empty at R_{k_max}—the inverse profile of the human.

5 Structural problems and partial resolutions

The development of the framework gives rise to four problems in logical cascade: each arises only if the preceding one is resolved or at least posed. The first three receive partial resolutions within the framework; the fourth remains open.
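Since each regime is defined by which levels lie in Φ_t^R ∩ Φ_t^C, regime membership can be written as a small decision function. The sketch below is illustrative, not from the paper; the function name, the set encoding of Φ_t^R and Φ_t^C, and the sample profiles are hypothetical simplifications of the taxonomy.

```python
# Illustrative sketch: classifying a system's self-modification regime
# from the hierarchy levels present in Phi_t^R and Phi_t^C.

def regime(represented, accessible, kmax):
    """Return the regime (1-4) given level sets Phi_t^R and Phi_t^C."""
    both = set(represented) & set(accessible)
    if kmax in both:
        return 4                  # teleological revision: the core itself
    if any(0 < i < kmax for i in both):
        return 3                  # structural: some R_i below the core
    if 0 in set(accessible):
        return 2                  # low-level: R_0 changes, possibly blindly
    return 1                      # action without modification

# Profiles from the text (kmax = 3 for illustration):
print(regime(set(), set(), 3))    # spinal reflex / finite automaton
print(regime(set(), {0}, 3))      # gradient descent: blind change of R_0
print(regime({1}, {1}, 3))        # meta-RL / WCST: R_1 enters both sets
print(regime({3}, {3}, 3))        # human profile: the core is in both
```

The ordering of the checks mirrors the cascade in the text: the highest level reached by the intersection Φ_t^R ∩ Φ_t^C fixes the regime.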
5.1 The independence of transformativity and autonomy

The human/AI comparison reveals that transformativity (T—the maximum level at which the system can apply an endogenous transformation) and organizational autonomy (A) are logically independent. A self-optimizing software can rewrite its rules but be stopped or deleted at any time (T high, A nil). A simple organism maintains its conditions of existence but does not modify its rules (A high, T low). The human combines both. It is only at their intersection that a self-modifying system in the strong sense—an independent one—would appear.

The literature has treated T and A separately. Autopoiesis theory (Maturana & Varela, 1980) formalizes organizational closure—the system's capacity to produce the components necessary for its own persistence—but does not address the transformation of rules: autopoiesis describes how a system persists, not how it modifies itself. Von Neumann (1966) formalized a second dimension of A—reproductive autonomy—showing for the first time that the capacity to generate new instances of oneself is a formalizable mechanism. On the T side, meta-learning and self-modification architectures (Wang et al., 2016; Zhang et al., 2025) describe transformation without persistence. Our framework contributes by making the independence of T and A visible and formalizable, which clarifies why a system can be highly advanced on one axis and nil on the other—and why the combination is rare.

5.2 The viability of self-modification

A transforming and autonomous system is not thereby viable. An agent capable of rewriting its own rules can produce irreversible transformations: infinite loops, functional collapse, loss of controllability. The question—can a self-modifying system revise itself without collapsing?—gives rise to an additional condition: the capacity to simulate before acting.

Table 3: Positioning in the (T, A) space.
System | T | A | Characterization
Thermostat | 1 | ∼0 | Regulation without self-modification
Simple organism | 1–2 | high | High autonomy, limited T
Contemporary LLM | 1–2 | 0 | Local, heteronomous
Meta-RL agent | 3 | 0 | Structural, heteronomous
DGM | 3+ | 0 | Self-referential, heteronomous
Human cognition | 4 | high | Reflexive, opaque, autonomous

We distinguish three simulation modalities, ordered by safety. (S1) Logical simulation—static analysis, model checking—allows properties to be established without real execution. (S2) Execution simulation—sandbox, reversible speculative execution—evaluates a modification candidate in an isolated environment. The DGM (Zhang et al., 2025) explicitly realizes this modality. (S3) Predictive simulation—internal world models (Ha & Schmidhuber, 2018)—anticipates consequences by projection.

This distinction appears technical, but it has a deep structural consequence. The crossing of opacities (§4.1) manifests along the safety axis. The human has only S3: episodic prospection (Schacter et al., 2012; Buckner & Carroll, 2007) allows evaluating hypothetical scenarios without implementing them. But architectural opacity precludes any equivalent of S1 or S2: human normative revision proceeds without a sandbox, through direct commitment—which explains why transformative experiences (Paul, 2014) are structurally risky. Artificial systems have S1 and S2 in principle but have nothing to test on R_{k_max}, since it is outside Φ_t^R. Each type of system is vulnerable where the other is protected. The literature on formal verification and sandboxing (cf. Amodei et al., 2016, for a synthesis of AI safety problems) has not connected these techniques to the question of reflexive self-modification—our framework does so by situating simulation as a viability condition, not merely as an engineering technique.

5.3 The teleological lock

Can a viable system rationally revise its own evaluation norm?
Let R_{k_max} be the active criterion. A modification Δ produces R′_{k_max} = R_{k_max} + Δ. Evaluating Δ requires a criterion E. If E = R_{k_max}, the judgment is conservative: any substantial modification would be judged negatively by the current criteria. If E ≠ R_{k_max}, the question shifts: where does E come from?

This problem is analogous to Neurath's boat (Quine, 1960) and has been posed in logical decision theory by Soares & Fallenstein (2017) and formalized by Everitt et al. (2021) via causal influence diagrams. Our framework adds a precise localization: the lock is inactive as long as R_{k_max} is outside Φ_t^C (Regimes 1–3); it activates exactly at the threshold of Regime 4.

Partial resolutions. The human solves this problem de facto. The framework developed in §3.4 shows that the lock is not a dead end: the human solves it de facto through two pathways—environmental coupling (the internal state, modified by experience, shifts the evaluative core without a deliberate act of revision) and prospective commitment (the current core authorizes the leap, but does not control the result). For AI, the existential sandbox (§3.4) opens a third pathway structurally inaccessible to the human. The human resolution proceeds without formal guarantee—the lock is circumvented, not suppressed. This observation constitutes a result of the framework: the teleological lock is not an absolute obstacle but a constraint whose modes of traversal differ according to the system's Φ_t^R/Φ_t^C profile. The human crosses it from above (coupling affecting R_{k_max} through non-deliberative pathways); AI could cross it through causal transparency (sandbox on copy).

Implications for AI design.
The observation that the human solves the lock through dynamic coupling rather than transparency suggests a design principle: not the elimination of the fixed core (which would lead to instability), but its softening through dependence on internal state—a functional analogue of the biological mechanism. The active inference framework (Friston, 2010)—where the agent minimizes variational free energy—corresponds in our formalism to an advanced Regime 3 under a fixed norm (free-energy minimization is R_{k_max}). The open question is whether active inference can be extended to Regime 4: could the system represent and revise the very principle of free-energy minimization, or is this principle by construction outside Φ_t^R?

General constraint. Any guarantee concerning a self-modification is relative to an evaluation rule that does not itself modify at the moment the modification is judged. This stability need not be absolute; it suffices that it hold at the relevant moment of evaluation.

5.4 Identity under transformation

If a system has effectively crossed the teleological lock—if it has revised its R_{k_max}—is the resulting system still "the same"? This problem presupposes the preceding one: it arises only for a system that has effectively realized an endogenous revision of R_{k_max}. What persists through self-modification is neither the operative rules (they change in Regime 2) nor the norm (it changes in Regime 4): it is, possibly, the meta-rule of transformation—the way the system modifies itself. But if this meta-rule is itself revisable (Regime 4 fully open), no structural invariant subsists by necessity. Identity becomes an open problem.
Philosophy has treated this question under the name of the Ship of Theseus; biology illustrates it through niche construction (Odling-Smee et al., 2003), where organisms modify their environment to the point of transforming the selection pressures that act on them—the "rules" of evolution become partially objects of action, and the lineage's identity redefines itself through transformation. But none of these discussions had formalized the problem in terms of hierarchical rule levels and self-representation. Our framework shows that the identity problem does not arise for systems in Regimes 1–3 (their R_{k_max} is fixed, and identity can be defined by this invariant core): it emerges exactly at Regime 4, where the core itself becomes revisable. The framework predicts that the first artificial system to realize an endogenous revision of R_{k_max} will also be the first to face this question in its acute form.

6 Testable predictions

6.1 Neurocognitive gradient and representational asymmetry (P1–P2)

P1: Regime gradient. Regimes 1–4 should correspond to a rostro-caudal gradient of the prefrontal cortex and a monotonically increasing cognitive cost with level k. Existing data are compatible: the WCST activates dorsolateral prefrontal cortex during rule change (Monchi et al., 2001), Badre & D'Esposito (2007) document a rostro-caudal hierarchy associated with the degree of control abstraction, and Koechlin et al. (2003) show progressive anterior recruitment with control level. The framework predicts that this neural hierarchy corresponds to the stratification of levels k in Φ_t—a correspondence that the literature suggests but that our taxonomy formalizes for the first time in terms of self-modification regimes. The prediction is falsifiable: if Regimes 2 and 3 recruited the same networks to the same degree, the formal distinction would lack a neural correlate.

P2: Asymmetry of Φ_t^R.
Humans represent R_{k_max} better than R_i for small i. This asymmetry should manifest in mechanisms of therapeutic change: interventions targeting R_{k_max} (metacognitive therapy) should show qualitatively distinct change patterns from those targeting R_1 (classical CBT). The framework's strong prediction is more precise: the opacity gradient should be monotone—the lower the level, the poorer the representation in Φ_t^R and the weaker the Φ_t^R → Φ_t^C coupling ratio—which is testable through introspection tasks stratified by hierarchical level. Such a gradient would constitute the first quantitative measure of the internal structure of Φ_t^R in humans.

6.2 Crossed opacities and hierarchical plasticity (P3–P4)

P3: Human/AI double dissociation. A meta-learning system should exhibit a Φ_t^R profile rich in operative R_i and poor in R_{k_max}, while the human exhibits the inverse profile. On rule-transfer tasks, error patterns should differ qualitatively: perseveration errors in humans (R_1 maintained incorrectly; Milner, 1963), exploration errors in the artificial agent (R_1 modified incorrectly when the task distribution changes radically). The two-step task literature (Daw et al., 2011) provides the methodological framework for distinguishing model-based from model-free control, but does not connect this distinction to the extent of Φ_t^R—this is what our framework adds.

P4: Hierarchical plasticity. Human singularity lies not only in hierarchical depth but in hierarchical plasticity: the capacity to reorder which level governs which other. Tasks requiring hierarchical reorganization should recruit prefrontal networks distinct from those involved in standard hierarchical control. Data from Schwabe & Wolf (2013) on priority reorganization under stress are compatible.
The transformative experience (Paul, 2014) constitutes the limiting case: the subject enters an experience knowing that their criteria will change—the hierarchical ceiling shifts.

6.3 Protocol: testing the causality of Φ_t^R

Motivation. Two recent results precisely delimit the gap this protocol fills. Becker et al. (2023) showed that systematic metacognitive reflection accelerates the adoption of far-sighted strategies in a planning task—showing that an intervention on strategy representation can reorganize the strategy itself. But their design does not dissociate the representation of the rule (Φ_t^R) from the effective rule (Φ_t): they show that reflection helps, not that it is the correction of self-representation that produces the effect. Conversely, Grinschgl et al. (2021) showed that metacognitive beliefs can be modified by fake feedback without changing the actual offloading strategy—showing that Φ_t^R can diverge from Φ_t without behavioral consequence. Taken together, these results indicate that the complete causal loop—correcting Φ_t^R and observing whether Φ_t changes—has not been directly tested.

Design. A multi-step planning task with structural shift (the optimal strategy changes mid-session) is used. Participants—humans and meta-RL agents—traverse five phases.

Phase 1: Learning. The system acquires an effective strategy R_1 in a stable-structure environment. R_1 is measured by process tracing—that is, observing information-consultation sequences and decision times, which allow the effectively used strategy to be inferred (Mouselab paradigm; Becker et al., 2023). In the agent, the effective strategy is identified by analyzing the internal states of the recurrent network—specifically, a linear classifier is trained to predict, from hidden activations, which strategy the network implements. This classifier identifies the direction in activation space corresponding to R_1.
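The linear-probe step of Phase 1 can be sketched on toy data. This is an illustrative stand-in, not the paper's implementation: the synthetic "hidden activations", the perceptron training rule, and all names are hypothetical, chosen only to show how a learned weight vector plays the role of the R_1 direction in activation space.

```python
# Illustrative sketch: a linear probe trained on toy "hidden activations"
# to predict which strategy a network implements. The learned weights are
# the direction in activation space identified with R_1.

import random

random.seed(0)

def make_activation(strategy):
    # Toy generator: strategy 1 shifts the first two dimensions.
    x = [random.gauss(0, 0.3) for _ in range(4)]
    if strategy == 1:
        x[0] += 1.0
        x[1] += 1.0
    return x

# Labeled activations: which strategy produced each hidden state.
data = [(make_activation(s), s) for s in [0, 1] * 200]

# Perceptron training: w encodes the probe's decision direction.
w, bias = [0.0] * 4, 0.0
for _ in range(20):
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
        if pred != y:
            sign = 1 if y == 1 else -1
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            bias += sign

accuracy = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0) == y
    for x, y in data
) / len(data)
print(round(accuracy, 2))   # the probe separates the two strategies
```

In the actual protocol the inputs would be the recurrent network's hidden states rather than synthetic vectors, and the probe's weight vector is what Phases 3 and 4 perturb and restore.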
Phase 2: Measuring Φ_t^R. The system's representation of its own strategy is elicited. In the human: structured self-report ("describe how you decide") and prediction of one's own behavior on a hypothetical scenario. In the agent: the linear probe trained in Phase 1 provides the encoding of R_1 in activation space.

Phase 3: Divergence induction. An experimental gap between what the system does (Φ_t) and what it believes it does (Φ_t^R) is created. In the human, a false but plausible description of the subject's strategy is provided (method of Grinschgl et al., 2021: the subject receives manipulated feedback on their own decision-making). In the agent, the activation vector identified in Phase 2 is perturbed, implanting an incorrect representation of the strategy in the internal states—a functional analogue of human fake feedback. In both cases, Φ_t^R is modified without directly touching Φ_t.

Phase 4: Representational correction only. Only Φ_t^R is corrected—the subject is provided with an exact description of their effective strategy—without modifying the reward structure, the task, or rule R_1. In the agent, the correct direction in the hidden states is restored.

Phase 5: Transfer test. The structural shift is introduced (the optimal strategy changes). The dependent variable is adaptation speed, measured by the number of trials needed to reach performance criterion in the new structure.

Conditions. Three between-subjects conditions: (a) correction of Φ_t^R (Phase 4 active), (b) fake feedback maintained (Φ_t^R degraded), (c) control (no intervention on Φ_t^R). The design is a 3 × 2 factorial (condition × system: human vs. meta-RL agent).

Predictions derived from the framework.
First prediction (causality of Φ_t^R): if Φ_t^R is causally active in the sense of Maes (1987), representational correction in Phase 4 should accelerate adaptation in Phase 5—condition (a) faster than (c), which is faster than (b).

Second prediction (crossed opacities): in the human, correction of Φ_t^R should have a strong effect at high levels (meta-strategy, planning criterion) but a weak effect at low levels (motor execution, perceptual parsing). In the meta-RL agent, the effect should be strong at operative levels (the learned policy) but nil on the evaluation criterion (externally fixed and inaccessible). Moreover, in the human, the Φ_t^R → Φ_t^C coupling is strong at upper levels and nearly nil at lower levels; in the agent, coupling is strong wherever Φ_t^R is non-empty, but Φ_t^R is empty at upper levels. This crossed profile constitutes a double dissociation directly derived from the Φ_t^R / Φ_t^C distinction.

Third prediction (Regime 2 / Regime 3 distinction): a "null" control condition can be added where the task structure is random but statistical difficulty is comparable. The framework predicts that correction of Φ_t^R accelerates adaptation only in the structured condition (Regime 3: the subject must change rules), not in the null condition (Regime 2: only R_0 changes). This directly tests the boundary between Regimes 2 and 3.

Implementability. The meta-RL agent version is simulable with existing tools (Wang et al., 2016; Duan et al., 2016): the recurrent network provides direct access to latent representations for Phases 2–4. The human version uses validated paradigms (Mouselab for process tracing, fake feedback for divergence induction). The complete design is pre-registrable and requires neither brain imaging nor specialized equipment.
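On the agent side, the Phase 3 perturbation and Phase 4 restoration described above amount to editing the component of a hidden state along the probe direction while leaving the orthogonal part untouched. A minimal sketch, under the assumption that a unit probe vector w_hat is already available from Phase 1 (all vectors illustrative):

```python
# Phases 3-4 sketch on the agent side. Assumption: w_hat is a unit probe
# direction for R_1 obtained in Phase 1; the vectors below are illustrative.

def project(h, w_hat):
    """Component of hidden state h along the probe direction."""
    return sum(a * b for a, b in zip(h, w_hat))

def set_component(h, w_hat, target):
    """Return h with its component along w_hat replaced by `target`,
    leaving the orthogonal part of the state untouched."""
    delta = target - project(h, w_hat)
    return [hi + delta * wi for hi, wi in zip(h, w_hat)]

w_hat = [1.0, 0.0, 0.0]   # illustrative unit probe direction
h = [2.0, 0.5, -1.0]      # hidden state currently encoding strategy R_1

# Phase 3: implant an incorrect strategy representation (divergence induction).
h_perturbed = set_component(h, w_hat, -2.0)
# Phase 4: restore the correct direction (representational correction only).
h_restored = set_component(h_perturbed, w_hat, 2.0)

print(h_perturbed)  # [-2.0, 0.5, -1.0]
print(h_restored)   # [2.0, 0.5, -1.0]
```

The design choice matters: because only the probe component is edited, the intervention targets Φ_t^R without directly rewriting the policy computation, which is the analogue of correcting a subject's self-description without changing the task or the reward.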
7 Discussion

7.1 Contributions

The framework provides five contributions: (1) a formal taxonomy of four self-modification regimes defined by the level of transformation in the hierarchy R_0 … R_kmax, each anchored in a characterized cognitive phenomenon and a corresponding artificial system; (2) a formalism (Φ_t, Φ_t^R, Φ_t^C) that renders commensurable phenomena typically treated in disjoint frameworks—conditioning, rule change, metacognition, meta-learning, value revision, AI architectures—by positioning them in a common space defined by the transformation target level and the extent of self-representation. This commensurability enables methodological transfers in both directions: clinical dissociations observed in humans (Regime 2 preserved / Regime 3 lost) can guide diagnosis of artificial architectures, and conversely, the structural transparency of computational systems can inform the understanding of human opacities; (3) the identification of crossed opacities—in its dual dimension of representation (Φ_t^R) / causal power (Φ_t^C)—as the structural signature of human/AI comparison; (4) four structural problems in cascade (§5), of which three receive partial resolutions—in particular the precise localization of the teleological lock at the threshold of Regime 4, the identification of three modes of traversal (coupling, prospective commitment, existential sandbox), and viability as an additional condition linking simulation and self-modification; (5) an experimental protocol directly testing the causality of Φ_t^R through independent manipulation of self-representation.
The most consequential contribution for the field may be commensurability itself: the framework provides a formal common space in which human cognition, classical cognitive architectures, and contemporary learning systems can be directly compared—not by verbal analogy, but by positioning in a common space of dimensions (extent of Φ_t^R, maximum transformation target level, status of R_kmax). This commensurability opens the possibility of systematic comparative studies.

7.2 Limitations and future directions

The framework raises four structural problems (§5) and partially resolves three. The identity problem and several empirical limitations structure future directions.

Regime boundaries: formal sharpness, empirical blur. The boundary between regimes is sharp in the formalism but may be blurred empirically. How does one distinguish a highly sophisticated Regime 2 from a minimal Regime 3? The framework proposes a formal criterion (does R_1 change or not?), but observing a behavioral change does not suffice to decide—one must determine whether it is R_0 or R_1 that changed. The protocol in §6.3 proposes a method for a specific case (independent manipulation of Φ_t^R), but general operationalization remains open and constitutes a methodological challenge for the framework.

Mapping Φ_t: the topology of the hierarchical network. The rostro-caudal hierarchy of prefrontal control is solidly established (Koechlin et al., 2003; Badre & Nee, 2018; Badre, 2025), but the fine topology of the network—its linearity, interactions between non-adjacent levels, possible partial orders—remains largely unmapped. The open question is whether the regime taxonomy survives in a partial order or requires a total order. The immediate direction is to systematically map governance relations between levels, distinguishing top-down (hierarchical) influences from lateral and bottom-up ones.
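The total-order question can be made operational: once governance relations between levels are mapped as directed pairs, asking whether the hierarchy is a chain reduces to checking whether the transitive closure of those pairs orders every pair of levels. A sketch with hypothetical relations (level names and edges are illustrative, not empirical claims):

```python
# Sketch: does a mapped governance relation form a total order (a chain)
# or only a partial order? `governs` is a set of (higher, lower) pairs.
# The example relations below are hypothetical illustrations.

def is_total_order(levels, governs):
    """True iff every pair of distinct levels is comparable under the
    transitive closure of `governs`."""
    closure = set(governs)
    changed = True
    while changed:  # naive transitive closure by repeated expansion
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return all(
        (a, b) in closure or (b, a) in closure
        for i, a in enumerate(levels)
        for b in levels[i + 1:]
    )

levels = ["R0", "R1", "R2", "Rkmax"]
chain = {("Rkmax", "R2"), ("R2", "R1"), ("R1", "R0")}
lattice = {("Rkmax", "R2"), ("Rkmax", "R1"), ("R2", "R0"), ("R1", "R0")}

print(is_total_order(levels, chain))    # True: a strict hierarchy
print(is_total_order(levels, lattice))  # False: R1 and R2 are incomparable
```

If empirically mapped relations came out as the second shape, the regime taxonomy would have to be restated over a partial order.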
Measuring Φ_t^R: the missing opacity gradient. Metacognitive accuracy is measurable at the perceptual level through Fleming's signal-theoretic framework (meta-d′; Fleming & Lau, 2014), and neural correlates are identified (anterior prefrontal cortex; Fleming et al., 2010, 2014; Lapate et al., 2020). But all of these measures operate at the same level—the metacognition of R_0 (perception, memory). No one has measured the complete profile of metacognitive accuracy across the levels of the hierarchy, from perceptual judgments (R_0) to strategy judgments (R_1) to value judgments (R_kmax). Prediction P2 (monotone opacity gradient) would constitute the first measure of the internal structure of Φ_t^R in humans. Introspection tasks stratified by hierarchical level would allow this profile to be traced.

Measuring Φ_t^C: causality remains to be established. The clinical MCT literature abundantly documents that changes in meta-beliefs predict symptomatic improvement (Solem et al., 2009; meta-analysis by Normann & Morina, 2018), and a recent systematic review confirms the promise of metacognitions as a transdiagnostic change mechanism while emphasizing that methodological rigor is lacking to establish strict causality. The protocol proposed in §6.3 fills precisely this gap: by manipulating Φ_t^R independently of Φ_t, it tests whether representational correction alone produces a causal effect—which has never been done directly.

Crossed opacities: a double dissociation to be realized. The elements on both sides of the crossing are documented separately—human low-level opacity (Nisbett & Wilson, 1977; Johansson et al., 2005) and the operative transparency of artificial systems (probing, interpretability). But the direct comparison of human vs. meta-RL agent on the same task with measurement of error patterns (P3) has never been carried out. This is the most immediately testable prediction of the framework.
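If introspection tasks stratified by hierarchical level existed, the monotone gradient claimed by prediction P2 would reduce to a rank-correlation check between level index and introspection accuracy. A minimal sketch with illustrative numbers (not empirical data; the accuracy values are assumptions chosen to match the predicted human profile):

```python
# Sketch of the P2 monotonicity test: introspection accuracy per hierarchical
# level (index 0 = R_0, highest index = R_kmax). Values are illustrative.

def spearman_rho(xs, ys):
    """Spearman rank correlation for sequences without ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

levels = [0, 1, 2, 3]
accuracy = [0.52, 0.61, 0.74, 0.86]  # illustrative: richer Φ_t^R higher up

rho = spearman_rho(levels, accuracy)
print(f"Spearman rho = {rho:.2f}")  # +1.00 here: a perfectly monotone gradient
```

A positive rho across subjects would support the monotone gradient; in a real study one would of course use a validated implementation with tie handling and a significance test rather than this toy.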
The question of surpassability. What would an artificial system possessing both Φ_t^R and Φ_t^C non-empty at the level of R_kmax, together with powerful Φ_t^R and Φ_t^C at operative levels, look like? Such a system would combine the operative transparency of AI and the normative reflexivity of the human, possessing a reflexivity structurally superior to that of either. This question is both the most speculative and the most consequential of the framework.

Identity under transformation: toward formalization. The identity problem (§5.4) could benefit from connections with computability theory and provability logic. Kleene's recursion theorem guarantees that any computable transformation of programs admits a fixed point—a behavioral invariant. Applied to self-modification, it suggests that certain classes of transformations necessarily preserve an invariant, even if the system modifies itself. Löb's theorem constrains what a formal system can prove about its own modifications—it is the most direct formalization of the teleological lock within mathematical logic. These connections remain to be developed.

Formalizing regime transitions. The framework classifies regimes but does not model how a system transitions from one to another—how Φ_t^R progressively expands. Developmental data (Zelazo, 2004; Karmiloff-Smith, 1992) describe the progressive expansion of Φ_t^R; a formal dynamics of this expansion would constitute a major contribution.

From classificatory framework to dynamical theory. The present formalism constitutes a classificatory framework—analogous to thermodynamic phases—and not yet a dynamical theory—analogous to equations of state. The classification is non-trivial: it produces the four regimes, the crossing of opacities, the teleological lock, and Propositions 1–2.
But to derive quantitative results—bounds on self-modification accuracy, regime transition rates, simulable developmental trajectories—three thresholds remain to be crossed. First, the projection operator Π is named but not modeled: a characterization of Π as a noisy information channel, with a specifiable capacity C_i at each level i, would allow deriving a no free lunch result—the accuracy of self-modification at level i would be bounded above by the quality of representation at level i + 1. Second, the meta-d′ measure (Fleming & Lau, 2014) rigorously quantifies Φ_t^R, but exclusively at the R_0 level (perceptual metacognition). The equivalent for R_1 (accuracy of strategic introspection: does the subject know which strategy they actually use?) and for R_kmax (accuracy of normative introspection) does not exist: without these multi-level measures, the opacity gradient cannot be empirically calibrated. Third, a formal dynamics of Φ_t^R—how self-representation expands, contracts, or deforms over time—is needed to simulate trajectories. Concretely, a simulation would require a parametric form for the rules, a model of Π as an adjustable lossy channel, a dynamics of Π's evolution, and multi-level empirical anchoring. The present article inaugurates the framework; the transition to a simulable dynamical theory constitutes the most ambitious horizon of work it opens.

Bidirectional design. For AI: softening R_kmax through dependence on internal state—architectures where the reward function depends on an aggregated internal state—and the existential sandbox. For human cognition: cognitive augmentation practices (education, psychotherapy, contemplative practices, brain-computer interfaces) reconceptualized as systematic interventions on Φ_t^R and Φ_t^C.

Cultural and social extension.
Φ_t^R is not only produced by the individual but shaped by cultural tools—writing, deliberative institutions, philosophical traditions constitute cultural extensions of Φ_t^R.

8 Conclusion

This article posed a twofold question: what are the formally distinct types of modification that a cognitive system can exert on its own rule hierarchy, and how does the extent of self-representation determine which types are accessible to it?

The derivation (§2) shows that the question "what is a self-modifying system?" imposes a minimal structure: a rule hierarchy, an unavoidable fixed core, and a threefold distinction between effective rules (Φ_t), represented rules (Φ_t^R), and causally accessible rules (Φ_t^C). The taxonomy (§3) distinguishes four regimes—fixed, local, structural, reflexive—each anchored in a characterized cognitive phenomenon. The comparative result (§4) identifies the crossing of opacities in its dual dimension (representation and causal power) and shows that the spectrum of artificial self-modification continually pushes the hierarchical level of transformation upward while encountering the impossibility of an endogenous revision of R_kmax. The positioning relative to theories of consciousness (§4.2) shows that the formalism unifies levels of metarepresentation that the literature conflates, treats introspective error as a structural property rather than an anomaly, and provides a structural protocol for the question of artificial consciousness. The four structural problems in cascade (§5)—three of which are partially resolved—show that the framework has deductive power beyond classification: it localizes the teleological lock at the threshold of Regime 4 and identifies distinct modes of traversal depending on the system's profile.
Four testable predictions and an experimental protocol allow the central thesis—the causality of Φ_t^R—to be submitted to a direct test (§6).

The framework does not propose a complete theory of mind. It introduces a level of analysis that was missing between existing theories of cognitive control, metacognition, and computational self-modification. The formalism renders commensurable phenomena that the literature treats in disjoint frameworks. The four open problems that emerge in cascade—independence of transformativity and autonomy, viability of self-modification, teleological lock, identity under transformation—are not residuals but structural consequences of the taxonomy. The question of how a system can rationally revise the very criterion that evaluates its revisions is, perhaps, the most important question that cognitive science and artificial intelligence share without having yet formalized it in a common language. This article proposes that language.

References

[1] Anderson, J. R. (2007). How can the human mind occur in the physical universe? Oxford University Press.
[2] Amodei, D., Olah, C., Steinhardt, J., et al. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
[3] Baars, B. J. (1988). A cognitive theory of consciousness. Cambridge University Press.
[4] Badre, D. (2025). Cognitive control. Annual Review of Psychology, 76, 167–195.
[5] Badre, D., & D'Esposito, M. (2007). Functional magnetic resonance imaging evidence for a hierarchical organization of the prefrontal cortex. Journal of Cognitive Neuroscience, 19(12), 2082–2099.
[6] Badre, D., & Nee, D. E. (2018). Frontal cortex and the hierarchical control of behavior. Trends in Cognitive Sciences, 22(2), 170–188.
[7] Baumeister, R. F. (1990). Suicide as escape from self. Psychological Review, 97(1), 90–113.
[8] Becker, F., Wirzberger, M., Pammer-Schindler, V., Srinivas, S., & Lieder, F. (2023).
Systematic metacognitive reflection helps people discover far-sighted decision strategies. Judgment and Decision Making, 18, e15.
[9] Bliss, T. V. P., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area. Journal of Physiology, 232(2), 331–356.
[10] Block, N. (2011). Perceptual consciousness overflows cognitive access. Trends in Cognitive Sciences, 15(12), 567–575.
[11] Botvinick, M. M., Braver, T. S., Barch, D. M., Carter, C. S., & Cohen, J. D. (2001). Conflict monitoring and cognitive control. Psychological Review, 108(3), 624–652.
[12] Botvinick, M. M., Niv, Y., & Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations. Cognition, 113(3), 262–280.
[13] Brown, R., Lau, H., & LeDoux, J. E. (2019). Understanding the higher-order approach to consciousness. Trends in Cognitive Sciences, 23(9), 754–768.
[14] Brown, T. B., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
[15] Buckner, R. L., & Carroll, D. C. (2007). Self-projection and the brain. Trends in Cognitive Sciences, 11(2), 49–57.
[16] Butlin, P., Long, R., Bayne, T., et al. (2025). Identifying indicators of consciousness in AI systems. Trends in Cognitive Sciences, 29(2), 106053.
[17] Carruthers, P. (2011). The opacity of mind. Oxford University Press.
[18] Chalmers, D. J. (1996). The conscious mind. Oxford University Press.
[19] Coda-Forno, J., et al. (2023). Meta-in-context learning in large language models. Advances in Neural Information Processing Systems, 36, 65189–65201.
[20] Cogitate Consortium et al. (2025). Adversarial testing of global neuronal workspace and integrated information theories of consciousness. Nature, 642(8066), 133–142.
[21] Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011).
Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6), 1204–1215.
[22] Dehaene, S., & Changeux, J.-P. (2011). Experimental and theoretical approaches to conscious processing. Neuron, 70(2), 200–227.
[23] Duan, Y., et al. (2016). RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
[24] Everitt, T., Lea, S., & Hutter, M. (2021). Agent incentives: A causal perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13), 11487–11495.
[25] Fleming, S. M., & Dolan, R. J. (2012). The neural basis of metacognitive ability. Philosophical Transactions of the Royal Society B, 367(1594), 1338–1349.
[26] Fleming, S. M., Ryu, J., Golfinos, J. G., & Blackmon, K. E. (2014). Domain-specific impairment in metacognitive accuracy following anterior prefrontal lesions. Brain, 137(10), 2811–2822.
[27] Fleming, S. M. (2024). Metacognition and confidence: A review and synthesis. Annual Review of Psychology, 75, 241–268.
[28] Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
[29] Graziano, M. S. A. (2013). Consciousness and the social brain. Oxford University Press.
[30] Graziano, M. S. A., & Webb, T. W. (2015). The attention schema theory. Frontiers in Psychology, 6, 500.
[31] Grinschgl, S., Meyerhoff, H. S., Schwan, S., & Papenmeier, F. (2021). From metacognitive beliefs to strategy selection. Psychological Research, 85, 2654–2666.
[32] Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122.
[33] Johansson, P., Hall, L., Sikström, S., & Olsson, A. (2005). Failure of introspective report. Science, 310(5745), 116–119.
[34] Kamin, L. J. (1969). Predictability, surprise, attention, and conditioning. In B. A. Campbell & R. M. Church (Eds.), Punishment and aversive behavior (pp. 279–296).
Appleton-Century-Crofts.
[35] Karmiloff-Smith, A. (1992). Beyond modularity. MIT Press.
[36] Katyal, S., & Fleming, S. M. (2024). The future of metacognition research. Cortex, 171, 223–234.
[37] Koechlin, E., Ody, C., & Kouneiher, F. (2003). The architecture of cognitive control in the human prefrontal cortex. Science, 302(5648), 1181–1185.
[38] Laird, J. E. (2012). The SOAR cognitive architecture. MIT Press.
[39] Lamme, V. A. F. (2006). Towards a true neural stance on consciousness. Trends in Cognitive Sciences, 10(11), 494–501.
[40] Lapate, R. C., et al. (2020). Perceptual metacognition of human faces is causally supported by function of the lateral prefrontal cortex. Communications Biology, 3, 360.
[41] Lau, H., & Rosenthal, D. (2011). Empirical support for higher-order theories of conscious awareness. Trends in Cognitive Sciences, 15(8), 365–373.
[42] Maes, P. (1987). Concepts and experiments in computational reflection. Proceedings of OOPSLA, 147–155.
[43] Maturana, H. R., & Varela, F. J. (1980). Autopoiesis and cognition. D. Reidel.
[44] Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24(1), 167–202.
[45] Milner, B. (1963). Effects of different brain lesions on card sorting. Archives of Neurology, 9(1), 90–100.
[46] Monchi, O., Petrides, M., Petre, V., Worsley, K., & Dagher, A. (2001). Wisconsin Card Sorting revisited. Journal of Neuroscience, 21(19), 7733–7741.
[47] Monsell, S. (2003). Task switching. Trends in Cognitive Sciences, 7(3), 134–140.
[48] Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. Psychology of Learning and Motivation, 26, 125–173.
[49] Nelson, S. K., Kushlev, K., English, T., Dunn, E. W., & Lyubomirsky, S. (2014). In defense of parenthood. Psychological Science, 25(1), 3–10.
[50] Nisbett, R. E., & Wilson, T. D. (1977).
Telling more than we can know. Psychological Review, 84(3), 231–259.
[51] Normann, N., & Morina, N. (2018). The efficacy of metacognitive therapy: A systematic review and meta-analysis. Frontiers in Psychology, 9, 2211.
[52] Odling-Smee, F. J., Laland, K. N., & Feldman, M. W. (2003). Niche construction. Princeton University Press.
[53] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
[54] Paul, L. A. (2014). Transformative experience. Oxford University Press.
[55] Pearce, J. M., & Hall, G. (1980). A model for Pavlovian learning. Psychological Review, 87(6), 532–552.
[56] Pierrot-Deseilligny, C., & Burke, D. (2012). The circuitry of the human spinal cord. Cambridge University Press.
[57] Quine, W. V. O. (1960). Word and object. MIT Press.
[58] Rankin, C. H., et al. (2009). Habituation revisited. Neurobiology of Learning and Memory, 92(2), 135–138.
[59] Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II (pp. 64–99). Appleton-Century-Crofts.
[60] Rosenthal, D. M. (2005). Consciousness and mind. Oxford University Press.
[61] Schacter, D. L., et al. (2012). The future of memory. Neuron, 76(4), 677–694.
[62] Schmidhuber, J. (2007). Gödel machines. In B. Goertzel & C. Pennachin (Eds.), Artificial General Intelligence (pp. 199–226). Springer.
[63] Schwabe, L., & Wolf, O. T. (2013). Stress and multiple memory systems. Trends in Cognitive Sciences, 17(2), 60–68.
[64] Sherrington, C. S. (1906). The integrative action of the nervous system. Scribner.
[65] Simon, H. A. (1962). The architecture of complexity. Proceedings of the American Philosophical Society, 106(6), 467–482.
[66] Smith, B. C. (1984). Reflection and semantics in LISP.
Proceedings of the 11th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 23–35.
[67] Soares, N., & Fallenstein, B. (2017). Agent foundations for aligning machine intelligence with human interests. In K. Callaghan et al. (Eds.), The technological singularity (pp. 103–125). Springer.
[68] Solem, S., Håland, Å. T., Vogel, P. A., Hansen, B., & Wells, A. (2009). Change in metacognitions predicts outcome in obsessive-compulsive disorder patients undergoing treatment with exposure and response prevention. Behaviour Research and Therapy, 47(4), 301–307.
[69] Sun, R. (2002). Duality of the mind. Lawrence Erlbaum.
[70] Tononi, G., Boly, M., Massimini, M., & Koch, C. (2016). Integrated information theory. Nature Reviews Neuroscience, 17(7), 450–461.
[71] von Neumann, J. (1966). Theory of self-reproducing automata (A. W. Burks, Ed.). University of Illinois Press.
[72] Wang, J. X., et al. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
[73] Wells, A. (2009). Metacognitive therapy for anxiety and depression. Guilford Press.
[74] Zelazo, P. D. (2004). The development of conscious control in childhood. Trends in Cognitive Sciences, 8(1), 12–17.
[75] Zhang, J., Hu, S., Lu, C., Lange, R., & Clune, J. (2025). Darwin Gödel Machine: Open-ended evolution of self-improving agents. arXiv preprint arXiv:2505.22954.
[76] Zoph, B., & Le, Q. V. (2017). Neural architecture search with reinforcement learning. Proceedings of ICLR.
