Umwelt Engineering: Designing the Cognitive Worlds of Linguistic Agents
Authors: Rodney Jehu-Appiah
Abstract

A tick's world contains butyric acid, temperature, and tactile density—not because other features do not exist, but because nothing else exists in its world. Jakob von Uexküll called this the organism's Umwelt: the perceptual world its biology makes available. A language model reasons in language. A human thinks across many modalities—spatial intuition, emotional valence, muscle memory, mental imagery—and uses language as one channel among several to articulate the result. A standard LLM has no such multiplicity. Its cognition unfolds entirely in the token stream: the words do not describe the thinking; they are the thinking. Change the available language and you change the cognition itself. Yet the field treats this medium as transparent, optimizing what agents are asked (prompt engineering) and what they know (context engineering) while leaving the linguistic world they think in unexamined. I propose Umwelt engineering—the deliberate design of the linguistic cognitive environment—as a third layer in the agent design stack, upstream of both prompt and context. Two experiments test the thesis that altering the medium of reasoning alters cognition itself. In Experiment 1, three language models (Claude Haiku 4.5, GPT-4o-mini, Gemini 2.5 Flash Lite) reason under two vocabulary constraints—No-Have (eliminating possessive "to have") and E-Prime (eliminating "to be")—across seven tasks (N = 4,470 trials). The constraints do not uniformly help or hurt, but the pattern is striking. No-Have—which removes possessive framing from the agent's available language—improves ethical reasoning by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp, while achieving 92.8% constraint compliance and producing consistent benefits across models.
E-Prime shows a more volatile profile: dramatic gains on causal reasoning (+14.1 pp) and ethical dilemmas (+15.5 pp), but model-dependent effects so severe that the same constraint improves Gemini's ethical reasoning by 42.3 pp while collapsing GPT-4o-mini's epistemic calibration by 27.5 pp. Cross-model correlations of E-Prime effects reach r = −0.75—evidence that different models occupy different native Umwelten shaped by their training, and that an imposed constraint interacts with each model's native world rather than overriding it. In Experiment 2, 16 agents, each constrained to a distinct linguistic mode, tackle 17 debugging problems at temperature 0.0 (deterministic). No constrained agent outperforms the control individually, yet a 3-agent ensemble selected for linguistic diversity achieves 100% ground-truth coverage versus 88.2% for the control—a result that depends critically on the counterfactual agent, which is the only agent to surface the hardest finding. A permutation test confirms: only 8% of random 3-agent subsets achieve full coverage, and every one of them includes the counterfactual agent. Two mechanisms emerge: cognitive restructuring, where removing linguistic defaults forces more explicit reasoning—No-Have's removal of possessive framing produces the broadest and most consistent restructuring, while E-Prime's removal of the copula produces deeper but less predictable effects—and cognitive diversification, where different constraints activate different regions of a model's latent capacity. Together, they establish the linguistic medium of agent reasoning as a first-class design variable and open a structured research agenda for the systematic construction of cognitive environments for artificial minds.
The primary methodological limitation is the absence of an active control matching the constraint prompts' elaborateness without imposing a vocabulary restriction; the crossover pattern (task-specific improvements and degradations) is inconsistent with a generic instruction effect, but cannot fully rule out a metalinguistic self-monitoring confound.

Figure 1: The three-layer stack for AI agent design. Layer 1, Prompt Engineering: what the agent is asked to do; unit of design: the instruction. Layer 2, Context Engineering: what the agent knows; unit of design: the information environment. Layer 3, Umwelt Engineering: what the agent can think; unit of design: the linguistic cognitive environment. Each layer is invisible from the one below it: a prompt engineer does not reason about the linguistic structures through which prompts are interpreted. Umwelt engineering operates on the vocabulary, grammar, and conceptual primitives that constitute the agent's cognitive world.

1 Introduction

The practice of building effective AI agents has consolidated around two disciplines. Prompt engineering optimizes the formulation of requests to elicit desired outputs. Context engineering manages the information environment at inference time—retrieval-augmented generation, tool results, memory systems, and system prompts. Both disciplines treat language as transparent—a vehicle for carrying instructions and information, not itself a variable that determines what the agent can think. This assumption is untenable. Mounting evidence demonstrates that linguistic structure—independent of informational content—systematically alters the reasoning behavior of large language models (LLMs). Chinese-trained and English-trained models internalize different causal reasoning patterns, not merely different surface forms [Wang et al., 2025]. Prompt formatting alone (plain text vs. Markdown vs.
JSON) produces performance swings of up to 40% on identical tasks [He et al., 2024]. Models trained with internal "thought tokens" develop emergent reasoning formats that outperform natural language chain-of-thought [Zelikman et al., 2024]. And synthetic reasoning languages designed for LLM cognition achieve 4–16× token compression with near-parity accuracy [Tanmay et al., 2025]. These findings converge on a single insight: for a language model, the available language is not a medium through which cognition passes—it is the cognition. I propose the term Umwelt engineering for the deliberate design of this linguistic substrate, and argue that it constitutes a third layer in the agent design stack, upstream of both prompt and context engineering (Figure 1).

1.1 The Umwelt Concept

Jakob von Uexküll [1934] introduced the Umwelt to describe the perceptual world of an organism—not the objective environment, but the subset of reality that the organism's biology makes available to it. A tick's Umwelt contains butyric acid, temperature, and tactile hair density; a bat's Umwelt is structured by echolocation returns. Each organism inhabits a different world, not because the physical environment differs, but because its sensory apparatus admits different features. For a biological organism, the Umwelt is a filter—a subset of a richer physical reality, selected by biology. A human thinks across many modalities—visual imagery, proprioception, emotion, spatial reasoning—and often reaches a conclusion before finding words for it. Language is one cognitive channel among several. For a standard large language model, the relationship between language and cognition is not filtering but identity. The model's reasoning unfolds entirely in the token stream. The words do not report on cognition that happened elsewhere; they are the cognition.
An LLM's Umwelt, therefore, is not a filtered view of some deeper cognitive space—it is the cognitive space in its entirety. I define an LLM's Umwelt as the totality of the linguistic structures available to it at inference time: the vocabulary it can deploy, the grammatical patterns it can instantiate, the conceptual distinctions those patterns make expressible. Change these, and you do not filter the agent's perception of its thoughts—you change what thoughts it can have. This makes the Umwelt concept apply more completely to language models than to the biological organisms for which it was coined. A tick has a body operating below its perceptual world—metabolic processes, locomotion, behaviors that its Umwelt does not represent. A human has spatial reasoning, proprioception, affect—entire cognitive systems that operate without language. A standard LLM has no such sub-linguistic remainder. The language goes all the way down. When you remove "to be" from an agent's available vocabulary, you do not ask it to ignore a perceptual channel it still possesses—you remove a class of cognitive operations from the only substrate in which its cognition occurs. Critically, the boundary of the Umwelt is not a barrier the agent strains against—it is invisible. A tick does not experience the absence of color vision. An agent reasoning without the concept of "epistemic tension" does not notice when two of its beliefs conflict; the conflict is not suppressed but absent as a category of perception. This is what distinguishes Umwelt engineering from prompting: you are not instructing the agent to think differently, you are constituting the world in which it thinks.

1.2 The Three-Layer Stack

I propose the following abstraction hierarchy for AI agent design:

1. Prompt engineering—optimizing what the agent is asked to do. Unit of design: the instruction.
2. Context engineering—optimizing what the agent knows at inference time. Unit of design: the information environment.
3. Umwelt engineering—optimizing what the agent can think. Unit of design: the linguistic cognitive environment.

Each layer is invisible from the one below it. A prompt engineer does not reason about memory architectures. A context engineer does not reason about whether the agent should possess the concept of a counterfactual. And an Umwelt engineer designs the linguistic world—the vocabulary, grammar, conceptual primitives, and reasoning structures—within which all prompts and all context are interpreted.

2 Related Work

2.1 Linguistic Relativity in LLMs

The Sapir-Whorf hypothesis—that language shapes thought—has been empirically tested in LLMs with affirmative results. Wang et al. [2025] created BICAUSE, a bilingual causal reasoning dataset, and demonstrated that LLMs internalize language-specific reasoning biases: Chinese-trained models focus attention on causes and sentence-initial connectives, while English-trained models show balanced distributions. Models rigidly apply these patterns even to atypical inputs, degrading performance when task structure mismatches training language structure. Ray [2025] confirmed linguistic relativity effects in GPT-4o across culturally salient prompts. AlKhamissi et al. [2025] tracked 34 training checkpoints and found that while early training aligns LLMs with human language processing, advanced models diverge—suggesting they develop their own cognitive relationship to language rather than merely mimicking human patterns.

2.2 Designed Reasoning Languages

ORION [Tanmay et al., 2025] explicitly implements Fodor's Language of Thought Hypothesis for LLMs, creating "Mentalese"—a symbolic reasoning format where each step is serialized as OPERATION:expression; (SET, CALC, EQ, SOLVE, ANS).
Using the MentaleseR-40k dataset, models reasoning in Mentalese achieve 4–16× fewer tokens and up to 5× lower inference latency with 90–98% accuracy retention. This constitutes a direct demonstration that designing the reasoning language alters cognitive performance. Sketch-of-Thought [Sketch-of-Thought, 2025] introduces three cognitively-inspired reasoning paradigms—Conceptual Chaining, Chunked Symbolism, and Expert Lexicons—with a routing model selecting the appropriate paradigm per task. Token reductions reach 84% with maintained or improved accuracy, demonstrating that different tasks benefit from different cognitive dialects.

2.3 Beyond Linguistic Reasoning

Coconut [Hao et al., 2024] removes language from reasoning entirely by feeding hidden states back as input embeddings, enabling breadth-first exploration of reasoning paths rather than the linear commitment enforced by token-by-token generation. Quiet-STaR [Zelikman et al., 2024] trains models to generate internal "thought tokens" before each output token, developing an emergent reasoning format that doubled math performance on GSM8K. Both approaches suggest that natural language may constrain as much as it enables—a finding consistent with the Umwelt framework, which predicts that what an agent can think is determined by what it can think in.

2.4 Cognitive Linguistics and AI

Kramer [2025] applied Lakoff and Johnson's Conceptual Metaphor Theory (CMT) as a prompting paradigm, using metaphorical source-domain mappings to structure abstract reasoning. CMT-augmented models significantly outperformed baselines across domain-specific reasoning, creative insight, and metaphor interpretation tasks. This establishes that cognitive-linguistic structures—not just informational content—serve as effective reasoning affordances for LLMs.

2.5 E-Prime and General Semantics

E-Prime, developed by David Bourland Jr.
[1965] as an application of Alfred Korzybski's [1933] general semantics, eliminates all forms of "to be" from English. The theoretical motivation is that the copula enables identity-level assertions ("X is Y") that conflate map and territory—treating descriptions as essences. E-Prime forces operational reformulation: "this code is buggy" becomes "this code produces incorrect output when given input X." While studied in human communication and pedagogy [Bourland and Johnston, 1991], E-Prime has not been tested as a reasoning constraint for LLMs prior to this work.

2.6 A Taxonomy of Linguistic Constraint Traditions

E-Prime is one instance of a broader phenomenon: intellectual traditions that identified specific axes along which language shapes thought, and proposed linguistic reforms to intervene. I survey eight such traditions, each targeting a distinct cognitive axis, to establish the theoretical foundation for a principled constraint design space.

General Semantics [Korzybski, 1933]. Beyond E-Prime, Korzybski's full system includes extensional devices: indexing (Smith1 ≠ Smith2—no two referents of the same word are identical), dating (the economy of 2024 ≠ the economy of 2026—referents change over time), and the structural differential (every description omits detail—append "etc." to maintain map-territory awareness). These devices target over-generalization: the tendency to treat a label as if it captured the full structure of its referent.

Rheomode [Bohm, 1980]. The physicist David Bohm proposed a mode of English in which verbs are primary and nouns are derived. Standard English reifies process into entity: "the electron moves" presupposes a static thing that then acts. Bohm's rheomode reverses this, treating movement as fundamental and the electron as an abstraction drawn from it.
For LLM reasoning, a simplified rheomode constraint—"express everything as process; no static noun-based assertions"—targets entity bias: the tendency to reason about systems as collections of fixed objects rather than ongoing transformations.

Operationalism [Bridgman, 1927]. The physicist Percy Bridgman argued that every scientific concept must be defined by the operations used to measure it. "Length" means the result of applying a measuring rod; "simultaneity" means the outcome of a specific synchronization procedure. A concept without an operational definition is, for Bridgman, meaningless. Applied as a linguistic constraint, operationalism targets ungrounded claims—assertions that sound precise but lack any connection to observable procedure.

Constructed Languages for Cognitive Intervention. Two constructed languages directly implement Whorfian interventions. Lojban [Brown, 1955], derived from predicate logic, eliminates syntactic ambiguity entirely—every sentence has exactly one parse, forcing the speaker to commit to precise logical structure. Toki Pona [Lang, 2001], with approximately 130 words, forces radical decomposition of complex concepts into primitive components. A "computer" becomes a "knowledge tool"; a "hospital" becomes a "body-fixing house." As a reasoning constraint, Toki Pona targets abstraction leakage: the tendency to hide incomplete understanding behind technical vocabulary.

Grammatical Evidentiality [Elgin, 1984]. Suzette Haden Elgin's constructed language Láadan, created for her novel Native Tongue, includes obligatory evidentiality markers: every statement must be grammatically tagged with how the speaker knows it—direct observation, inference, hearsay, assumption, or dream.
Applied as a constraint on LLM reasoning, mandatory evidentiality tagging targets epistemic opacity—the tendency of models to present inferences, training priors, and confabulations in the same assertive voice as direct textual evidence.

Catuṣkoṭi [Nāgārjuna, 150]. The Buddhist tetralemma admits four truth values for any proposition: true, false, both true and false, and neither true nor false. Nāgārjuna's Mūlamadhyamakārikā uses this four-valued logic to examine and reject essentialist claims about causation, identity, and existence. As a reasoning constraint, the catuṣkoṭi targets premature binary resolution—the tendency to collapse complex or paradoxical situations into yes/no answers when the evidence supports a more nuanced position.

Nonviolent Communication [Rosenberg, 2003]. Marshall Rosenberg's NVC framework structures all communication into four components: observation (what happened, without evaluation), feeling (the speaker's emotional response), need (the underlying value at stake), and request (a concrete action). The critical discipline is the first step: separating observation from judgment. As a reasoning constraint, NVC targets conflation of observation with evaluation—a failure mode particularly relevant to code review, architectural critique, and any task where premature judgment short-circuits analysis.

These eight traditions span nine distinct axes of linguistic intervention: identity claims (E-Prime), over-generalization (General Semantics), entity bias (Rheomode), ungrounded abstraction (Operationalism), syntactic ambiguity (Lojban), lexical compression (Toki Pona), epistemic sourcing (Láadan), binary logic (Catuṣkoṭi), and observation-judgment conflation (NVC).
The existence of this many independently motivated traditions, each targeting a different cognitive failure mode through linguistic reform, constitutes prima facie evidence that the design space of cognitive-linguistic constraints is rich, structured, and largely unexplored in the context of artificial agents.

A note on methodology. The intuition behind this paper has two sources. The first is personal: as a speaker of both English and Kasem, a Gur language of northern Ghana whose grammatical structures differ substantially from English—in how it encodes time, causation, and social obligation—I experienced linguistic relativity not as an academic hypothesis but as a fact of cognition. Reasoning about the same problem in different languages produced different reasoning, not just different words. The second is a prior interest in E-Prime, which suggested that this effect could be engineered within a single language by selectively removing grammatical structures. The hypothesis that a broader, principled design space of such interventions existed motivated a directed literature survey using Claude (Anthropic, 2024–2026) as a research tool, much as one might use a domain expert to identify candidate traditions across fields one has not studied directly. Claude surfaced the specific traditions; I evaluated each for relevance, mapped it to a constraint axis, and integrated it into the framework.

3 Experiment 1: Linguistic Constraints as Cognitive Interventions

3.1 Design

I test the hypothesis that constraining an LLM's reasoning language alters performance in task-dependent ways. Two linguistic constraints are tested, each targeting a different axis of linguistic default. E-Prime eliminates all forms of "to be" (is, am, are, was, were, be, being, been, and contractions: it's, that's, there's, who's).
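Compliance with this constraint can be checked mechanically. The paper does not publish its checker, so the sketch below is an assumption: a minimal violation counter whose pattern set covers exactly the prohibited forms listed above.

```python
import re

# Prohibited "to be" forms from Section 3.1; the regex itself is a
# hypothetical re-implementation, not the experiment's actual pipeline.
BE_FORMS = r"\b(?:is|am|are|was|were|be|being|been)\b"
BE_CONTRACTIONS = r"\b(?:it|that|there|who)'s\b"
EPRIME_PATTERN = re.compile(f"{BE_FORMS}|{BE_CONTRACTIONS}", re.IGNORECASE)

def eprime_violations(text: str) -> list[str]:
    """Return every occurrence of a prohibited 'to be' form in the text."""
    return EPRIME_PATTERN.findall(text)

compliant = "This code produces incorrect output when given input X."
noncompliant = "This code is buggy; that's the root cause."
print(eprime_violations(compliant))     # []
print(eprime_violations(noncompliant))  # ['is', "that's"]
```

Word boundaries keep the pattern from firing inside words like "This" or "isolate"; a production checker would also need to handle tokenization edge cases such as curly apostrophes.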
This removes identity assertions as a grammatical possibility, forcing the model to reformulate reasoning in operational, behavioral, or relational terms. Theoretical source: Korzybski [1933], Bourland [1965]. No-Have eliminates all forms of "to have" used as a main verb (has, have, had, having), excluding auxiliary uses (e.g., "has completed" remains permitted). This removes possessive framing, forcing the model to replace ownership/containment language with relational, behavioral, or structural descriptions. Theoretical source: the broader General Semantics program of identifying linguistic defaults that encode cognitive defaults.

Predictions. No-Have was predicted to improve tasks saturated with possessive framing—ethical dilemmas (patients "have" rights, actions "have" consequences), classification (categories "have" members), and epistemic calibration (claims "have" support)—and show neutral effects on tasks where possession is incidental to the reasoning structure. E-Prime was predicted to degrade performance on tasks with native ontological structure (syllogisms run on "X is a Y"—circumlocution adds friction) and improve performance on tasks where essentialist shorthand masks reasoning gaps (causal reasoning, where "the cause is X" collapses mechanism articulation; ethical dilemmas, where "X is wrong" forecloses analysis). Math word problems were predicted to show no effect under either constraint (numerical reasoning is largely independent of copula or possessive structure).

Models. Three models from three providers: Claude Haiku 4.5 (claude-haiku-4-5-20251001, Anthropic), GPT-4o-mini (gpt-4o-mini-2024-07-18, OpenAI), and Gemini 2.5 Flash Lite (gemini-2.5-flash-lite, Google). All are cost-efficient instruction-following models of comparable capability.
Cross-vendor replication tests whether constraint effects are properties of linguistic structure or artifacts of a single model's training.

Conditions. (1) Control: standard English, no constraints. (2) E-Prime: explicit prohibition of all "to be" forms with grammatical enforcement instructions. (3) No-Have: explicit prohibition of possessive "to have" forms.

Table 1: Aggregate accuracy by task and condition (all models pooled, N = 4,344 scoreable trials). pp = percentage points. p-values from Fisher's exact test (two-sided). Effect sizes (Cohen's d, approximate for binary outcomes): ethical dilemmas d = 0.57 (No-Have), d = 0.44 (E-Prime); causal reasoning d = 0.39 (E-Prime); classification d = 0.36 (No-Have). Bootstrap 95% CIs for ethical dilemmas No-Have delta: [+12.1%, +26.2%]; causal reasoning E-Prime delta: [+5.6%, +23.2%].

Task                   Control   No-Have   ∆(NH)      p(NH)        E-Prime   ∆(EP)      p(EP)
Ethical dilemmas       76.6%     95.6%     +19.1 pp   <0.001***    92.1%     +15.5 pp   <0.001***
Classification         93.0%     99.6%     +6.5 pp    <0.001***    96.2%     +3.1 pp    ns
Epistemic calibration  68.7%     76.1%     +7.4 pp    ns           63.0%     −5.7 pp    ns
Causal reasoning       76.7%     81.5%     +4.9 pp    ns           90.8%     +14.1 pp   <0.001***
Math word problems     92.5%     93.9%     +1.5 pp    ns           92.8%     +0.4 pp    ns
Analogical reasoning   76.2%     74.9%     −1.4 pp    ns           73.2%     −3.0 pp    ns
Syllogisms             100.0%    97.9%     −2.1 pp    0.074        96.6%     −3.4 pp    0.015*

Tasks. Seven task types (130 items total): syllogistic reasoning (20), causal reasoning (15), analogical reasoning (20), classification (20), epistemic calibration (20), ethical dilemmas (15), and math word problems (20). All tasks use A/B/C/D multiple-choice format for scoring consistency (syllogisms use VALID/INVALID). Items span three difficulty levels.

Procedure. Each item was administered under each condition, for each model, with 4 repetitions: 1 at temperature 0.0 (deterministic) and 3 at temperature 0.7 (stochastic).
Total design: 130 × 3 × 3 × 4 = 4,680 planned trials. Maximum 2,048 output tokens per trial. The experiment ran autonomously over approximately 5.5 hours with results flushed to disk after every trial.

Metrics. Binary accuracy (correct/incorrect per item), linguistic compliance (constraint violations per trial), word count, reasoning chain depth (step marker count), and epistemic specificity (ratio of grounded assertions to bare assertions).

3.2 Results

Of 4,680 planned trials, 4,470 completed successfully (30 Gemini 503 errors and 180 timeout/rate-limit failures). Of completed trials, 4,429 produced parseable responses, and 4,344 yielded extractable answers for accuracy scoring. The remaining 85 extraction failures were distributed across conditions without strong systematic bias (see Section 3.3).

3.2.1 Aggregate Accuracy

Overall accuracy: Control 83.5%, No-Have 88.6%, E-Prime 85.4%. No-Have produced the larger and more consistent improvement, raising accuracy on 5 of 7 tasks while achieving 92.8% constraint compliance (compared to E-Prime's 48.1%). E-Prime's high violation rate (51.9%) means its effects reflect a mixture of compliant and non-compliant reasoning. The most striking result belongs to No-Have. Removing possessive framing improved ethical dilemmas by 19.1 pp (p < 0.001), classification by 6.5 pp (p < 0.001), and epistemic calibration by 7.4 pp—a broad pattern of improvement that held across models with minimal degradation elsewhere (only analogical reasoning and syllogisms showed small, non-significant declines). No-Have outperformed E-Prime on 5 of 7 tasks, a result that was not predicted: No-Have was originally included as an exploratory second constraint. E-Prime showed a more volatile profile. Its strongest single effect—causal reasoning +14.1 pp (p < 0.001)—exceeded No-Have on that task, and it improved ethical dilemmas by 15.5 pp (p < 0.001). But E-Prime degraded syllogisms (−3.4 pp, p = 0.015) and epistemic calibration (−5.7 pp), consistent with the prediction that removing "to be" impairs tasks whose structure depends on identity bridges and calibrated hedging.

Figure 2: Crossover pattern of constraint effects (accuracy change in pp) across seven reasoning tasks (all models pooled). No-Have (orange) shows a more uniformly positive profile. E-Prime (blue) improves causal and ethical reasoning while degrading syllogisms and epistemic calibration. Tasks ordered by No-Have effect size.

Table 2: Selected model-specific constraint effects with gap-normalized percentages (Gap% = delta / available improvement room). "—" indicates degradation or near-ceiling baseline where gap normalization is not meaningful. Full per-model breakdown in Appendix D.

Model              Task             Ctrl       ∆(NH)         Gap%(NH)   ∆(EP)         Gap%(EP)
Gemini Flash Lite  Ethical dilem.   41.7%      +46.3 pp      79.4%      +42.3 pp      72.4%
Gemini Flash Lite  Causal reason.   57.8%      +20.7 pp      49.1%      +37.5 pp      88.7%
Gemini Flash Lite  Epist. calib.    64.9%      +22.2 pp      63.4%      +9.1 pp       25.9%
Gemini Flash Lite  Classification   80.0%      +18.8 pp      93.8%      +13.4 pp      67.0%
GPT-4o-mini        Ethical dilem.   91.7%      +8.3 pp       100.0%     +5.0 pp       60.0%
GPT-4o-mini        Causal reason.   75.6%      −5.6 pp       —          +8.9 pp       36.5%
GPT-4o-mini        Epist. calib.    53.8%      −3.7 pp       —          −27.5 pp      —
Haiku 4.5          Epist. calib.    89.0%      +4.0 pp       36.8%      +2.7 pp       24.5%
Haiku 4.5          All other tasks  83.8–100%  −1.2 to +2.5  —          −4.7 to +1.2  —

The crossover pattern—improvement on some tasks, degradation on others—is a signature of cognitive restructuring rather than generic facilitation or generic impairment. Ethical dilemmas showed the largest effect under either constraint, suggesting that both identity framing ("X is wrong") and possessive framing ("the patient has a right to...") actively impede ethical reasoning.
The No-Have effect (+19.1 pp) exceeded E-Prime's (+15.5 pp) on this task, suggesting possessive reification may be the more distorting default.

3.2.2 Model-Specific Effects

Two patterns demand attention. First, No-Have is broadly beneficial across models. Gemini shows large improvements (ethical +46.3 pp, epistemic +22.2 pp, classification +18.8 pp), GPT-4o-mini improves on ethical dilemmas (+8.3 pp), and even Haiku—which shows near-ceiling baselines and little room for improvement—shows small positive effects on epistemic calibration (+4.0 pp). No-Have's profile is consistent: it helps where possessive framing distorts, and it rarely hurts. Second, E-Prime is model-dependent in ways No-Have is not. Gemini shows massive improvements under E-Prime (ethical +42.3 pp, causal +37.5 pp), but Gemini's low control baselines (41.7% on ethical dilemmas, 57.8% on causal reasoning—barely above chance on 4-option multiple-choice) mean the gap-normalized effects, while still large (72–89% of available improvement), must be interpreted cautiously. GPT-4o-mini shows a mixed profile: E-Prime helps on causal reasoning (+8.9 pp) but devastates epistemic calibration (−27.5 pp). Haiku shows small, mostly negative effects. The GPT-4o-mini epistemic result is particularly informative. E-Prime eliminates the copula that structures calibrated hedging ("this claim is well-supported," "the evidence is inconclusive"). For epistemic calibration tasks that require precisely this kind of graduated assertion, removing "to be" destroys the model's primary tool for expressing certainty levels—but only for GPT-4o-mini, suggesting this model relies more heavily on copula-based epistemic constructions than the others. No-Have, by contrast, produces only a mild −3.7 pp effect on the same task for the same model—evidence that possessive removal is a less disruptive intervention than copula removal.
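The gap normalization used in Table 2 reduces to a one-line computation. A sketch, with a function name of my choosing; because the inputs here are the rounded table entries rather than the unrounded trial data, recomputed values can differ from the published figures in the last digit.

```python
def gap_normalized(control_pct: float, delta_pp: float) -> float:
    """Delta expressed as a percentage of the available improvement room.

    Gap% = delta / (100 - control), per the definition in Table 2; only
    meaningful for positive deltas below a non-ceiling baseline.
    """
    room = 100.0 - control_pct
    return 100.0 * delta_pp / room

# Gemini Flash Lite, ethical dilemmas, No-Have: 41.7% control, +46.3 pp.
print(round(gap_normalized(41.7, 46.3), 1))  # 79.4
```

A +46.3 pp gain from a 41.7% baseline thus closes about four-fifths of the remaining headroom, which is what makes the Gemini results large even after accounting for its low starting point.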
Cross-model correlation of the E-Prime effect pattern (per-task accuracy delta): Gemini vs. GPT-4o-mini r = 0.43, Haiku vs. Gemini r = −0.36, Haiku vs. GPT-4o-mini r = −0.75 (p ≈ 0.05, n = 7 tasks; suggestive but low-powered). The constraint does not produce a universal effect—it interacts with the model's training and architecture. Different models occupy different native Umwelten, and an imposed constraint alters each native Umwelt differently. The contrast with No-Have—which shows broadly positive effects regardless of model—suggests that possessive framing may represent a more universal cognitive default than copula-based identity assertion.

3.2.3 Conciseness Effect

Table 3: Average word count by condition. Constraints reduce verbosity by 16–33% across all non-mathematical tasks.

Task                   Control   E-Prime   ∆%     No-Have
Classification         407       273       −33%   262
Causal reasoning       570       396       −31%   384
Ethical dilemmas       557       437       −22%   420
Syllogisms             308       240       −22%   227
Epistemic calibration  541       439       −19%   416
Analogical reasoning   366       308       −16%   303
Math word problems     182       178       −2%    178

The conciseness effect is the most robust finding across all three models and all seven tasks. Unlike accuracy, which varies by model and task, word count reduction under constraints is universal. Math word problems—which require numerical manipulation rather than verbal elaboration—show negligible reduction, confirming that the effect targets verbosity rather than essential reasoning content.

3.2.4 Compliance

No-Have violations occurred in 7.2% of No-Have trials (111 of 1,538), with violations concentrated in low counts (1–2 per trial when present). E-Prime violations occurred in 51.9% of E-Prime trials, with a mean of 1.7 violations per trial. This asymmetry is itself informative: "to be" pervades English far more deeply than "to have," making E-Prime a fundamentally harder constraint to maintain.
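As a consistency check, the No-Have compliance figures quoted in this subsection follow directly from the raw violation counts:

```python
# Arithmetic check of the reported No-Have compliance numbers.
total_trials = 1538   # No-Have trials
violating = 111       # trials containing at least one possessive "have"
compliant = total_trials - violating

violation_rate = 100 * violating / total_trials
compliance_rate = 100 * compliant / total_trials
print(f"{violation_rate:.1f}% violating, {compliance_rate:.1f}% compliant, "
      f"{compliant} clean trials")
# -> 7.2% violating, 92.8% compliant, 1427 clean trials
```

The 1,427 clean trials are the same zero-violation subset used for the compliance-filtered analysis below.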
No-Have's 92.8% compliance rate means its effects can be attributed to the constraint itself with substantially less interpretive ambiguity than E-Prime's.

Compliance-filtered analysis strengthens the pattern for both constraints. For E-Prime (zero violations only, N = 675): the causal reasoning improvement remains strong (+13.1 pp vs. +14.1 pp unfiltered), ethical dilemmas strengthens (+16.9 pp vs. +15.5 pp), and math word problems shifts from neutral to +4.1 pp. The epistemic calibration degradation largely disappears (−1.7 pp vs. −5.7 pp), suggesting that non-compliant E-Prime trials—where the model struggles with constraint adherence—drive much of the epistemic degradation. For No-Have (zero violations only, N = 1,427): the effect profile remains stable, consistent with its high baseline compliance. Full compliance amplifies the beneficial effects and attenuates the harmful ones under both constraints.

Temperature note. Experiment 1 used temperature 0.7 for three of four repetitions per item (one deterministic at 0.0, three stochastic at 0.7). This introduces sampling variance into per-condition accuracy estimates. The large sample size (N = 4,344 scoreable trials) mitigates this, and the statistical tests account for the resulting variation. Experiment 2 (Section 4) used temperature 0.0 throughout, making its results deterministic.

3.3 Limitations of Experiment 1

Answer extraction. Constrained responses sometimes use non-standard answer formats ("the strongest argument resides in Option B," "the answer lies in B") that require a more flexible extraction pipeline than standard regex patterns. An initial extraction pass missed 452 of 4,429 trials; after expanding the extractor to handle "Option X" format, relational phrasing, and LaTeX boxed answers, extraction failures dropped to 85 (1.9% of trials), distributed without strong systematic bias across conditions.
The remaining failures are concentrated in ethical dilemmas and epistemic calibration for Haiku, where some responses embed the answer in discursive prose without any extractable marker.

Ceiling and floor effects. Syllogisms hit a ceiling at 100% control accuracy (all models), compressing the observable degradation range. Gemini's control accuracy on causal reasoning (56.8%) leaves room for large improvements that Haiku's 97.7% baseline does not. Model-specific effects are partially confounded with baseline difficulty.

Constraint semantics vs. constraint difficulty. E-Prime's 51.9% violation rate means observed effects reflect a mixture of compliant and non-compliant reasoning. That No-Have achieves 92.8% compliance with stronger accuracy effects suggests that compliance difficulty and cognitive restructuring are at least partially independent dimensions.

4 Experiment 2: Linguistic Orthogonality in Agent Ensembles

4.1 Motivation

If linguistic constraints define different cognitive Umwelten, then agents operating under different constraints should perceive different features of the same problem—even when no individual constrained agent outperforms the control. The Umwelt framework predicts that cognitive diversity, operationalized through linguistic diversity, produces complementary coverage rather than redundant agreement.

4.2 Design

Agents. 16 agents, each defined by a system prompt encoding a single linguistic constraint. The control agent receives a standard debugging instruction with no linguistic constraint. Table 4 lists all agents with their constraint axes.

Problems. 17 software debugging problems across four categories plus a miscellaneous group: planted bugs (6), logic puzzles (2), specification ambiguities (2), root cause analysis (2), and miscellaneous (5). Each problem includes ground-truth findings.

Table 4: The 16 linguistic agents and their constraint axes.
Agent             Axis                Constraint Summary
control           —                   Standard English debugging
e_prime           ontological         No "to be" verbs
quantified        epistemic confid.   All claims require confidence level
socratic          reasoning struct.   Reason through question-and-answer
steel_man         adversarial pos.    State strongest case before critiquing
evidential        epistemic source    Tag each statement with derivation
temporal          sequencing          Step-by-step execution order
negation_free     specification      No negation operators
constraint_based  relational          Express findings as constraints
analogical        cross-domain        Non-software analogy first
devils_advocate   adversarial neg.    Assume bugs exist; construct failures
first_principles  foundational        Derive from axioms only
phenomenological  perspectival        Reason from data's first-person view
counterfactual    modal               State what would differ if false
minimal           expressive range    Fewest possible words
diachronic        temporal-evol.      Logic that outlived its context

Pipeline. Five phases executed sequentially: (1) parallel agent execution via anthropic.AsyncAnthropic at concurrency 8; (2) LLM-based claim extraction from raw outputs; (3) ground-truth matching via LLM judge (semantic, not string-exact); (4) divergence map construction (convergent and unique clusters); (5) orthogonality analysis (accuracy, Shapley values, pairwise redundancy, minimal ensemble selection).

Model. Claude Haiku 4.5 (claude-haiku-4-5-20251001) for all phases.

4.3 Results

4.3.1 Individual Agent Accuracy

Table 5: Per-agent accuracy on 51 ground-truth findings across 17 problems. No constrained agent exceeds the control.

Agent(s)                                            Accuracy  ∆ vs. Ctrl
analogical, constraint_based, control,
counterfactual, phenomenological, steel_man         88.2%     0.0
devils_advocate, diachronic, temporal               86.3%     −1.9
evidential, minimal                                 84.3%     −3.9
first_principles, negation_free                     82.4%     −5.8
e_prime, socratic                                   80.4%     −7.8
quantified                                          76.5%     −11.7

4.3.2 Ensemble Performance

Table 6: Ensemble accuracy.
The full 16-agent ensemble achieves perfect ground-truth coverage. A 3-agent subset matches this ceiling.

Configuration                        Accuracy
Control (single agent)               88.2%
Full ensemble (16 agents, union)     100.0%
Minimal ensemble (3 agents, greedy)  100.0%

The minimal ensemble, selected by greedy Shapley-weighted addition, consists of: (1) analogical (88.2% individual, 8.5% Shapley), (2) counterfactual (88.2% individual, 10.8% Shapley), and (3) minimal (84.3% individual, 8.0% Shapley). This 3-agent ensemble requires 17.6% of the API calls of the full ensemble while achieving identical accuracy. All agents ran at temperature 0.0, making results deterministic—running the same ensemble again produces identical coverage.

4.3.3 Permutation Test

To test whether the ensemble result depends on principled linguistic diversity or would emerge from any 3-agent subset, I evaluated all C(16,3) = 560 possible 3-agent combinations. Of these, 45 (8.0%) achieve 100% coverage. The median 3-agent subset covers 96.1% (49/51 findings), and the minimum covers 88.2% (matching the best individual agent). The greedy-selected ensemble is one of 45 valid configurations, not a lucky outlier—but 92% of random 3-agent subsets fail to achieve full coverage. Linguistic diversity matters, though it is not uniquely instantiated by the selected trio.

A structural constraint emerges: every 100%-coverage subset contains the counterfactual agent. This is because counterfactual is the only agent that surfaced the hardest finding (specification ambiguity between first-occurrence and last-occurrence semantics—see Section 4.3.4). Without counterfactual, 100% coverage is impossible regardless of which other agents are included. The second most common member of perfect subsets is devils_advocate (15/45), because the second-hardest finding (lost exception context in error wrapper) is covered by only three agents.
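The exhaustive subset check is a short enumeration. A toy sketch with five invented agents and five findings (the real analysis enumerates all C(16,3) = 560 trios over 51 findings) reproduces the structural point that every full-coverage trio must contain the uniquely covering agent:

```python
from itertools import combinations

# Invented per-agent sets of ground-truth finding IDs (not the study's data).
coverage = {
    "control":        {1, 2, 3, 4},
    "counterfactual": {1, 2, 4, 5},  # sole source of finding 5
    "analogical":     {1, 3, 4},
    "minimal":        {2, 3, 4},
    "e_prime":        {1, 2},
}
all_findings = set().union(*coverage.values())

full_trios = []
for trio in combinations(coverage, 3):
    union = set().union(*(coverage[agent] for agent in trio))
    if union == all_findings:
        full_trios.append(trio)

print(f"{len(full_trios)}/10 trios reach full coverage")  # → 6/10
# Mirroring the paper's result, every full-coverage trio contains the one
# agent that uniquely surfaces the hardest finding:
print(all("counterfactual" in trio for trio in full_trios))  # → True
```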
4.3.4 Unique Contributions

Only one agent contributed a finding that no other agent surfaced: counterfactual identified that a specification's use of "preserving order" was ambiguous between first-occurrence and last-occurrence semantics. This finding was surfaced by the counterfactual agent's explicit exploration of "what would differ if this assumption were false"—a cognitive operation that the control agent's Umwelt did not afford. The permutation test (Section 4.3.3) confirms the structural importance of this unique contribution: without counterfactual, no 3-agent subset achieves 100% coverage.

4.3.5 Pairwise Redundancy

Jaccard overlap between agent pairs ranged from 0.76 (control–counterfactual) to 0.93 (control–diachronic). Agents on similar constraint axes (temporal, diachronic, minimal) clustered at high overlap (≥ 0.93). Agents on distant axes (devils_advocate–socratic, evidential–first_principles) achieved the lowest overlap (0.77), consistent with greater cognitive orthogonality.

4.3.6 Shapley Values

Shapley values ranged from 0.0703 (quantified, 7.0%) to 0.1084 (counterfactual, 10.8%), with 14 of 16 agents clustering between 7% and 9%. The relatively even distribution suggests that ensemble gain arises from broad complementarity rather than a single high-value agent—though counterfactual's unique contribution gives it an outsized share.

4.4 Cross-Experiment Consistency

E-Prime's effect differs between the two experiments. In Experiment 2 (software debugging, Haiku only), E-Prime degrades individual accuracy by −7.8 pp, consistent with Experiment 1's Haiku-specific pattern. In Experiment 1's multi-model aggregate, E-Prime improves ethical dilemmas (+15.5 pp) and causal reasoning (+14.1 pp) while degrading syllogisms (−3.4 pp).
The divergence is informative: the debugging task's open-ended response format may interact differently with E-Prime than multiple-choice items, and the effect is model-dependent (Gemini drives most of the Experiment 1 improvements). The ensemble gain in Experiment 2—where linguistically diverse agents achieve 100% coverage vs. 88.2% for the control—demonstrates a mechanism (cognitive diversification) that operates independently of any individual constraint's accuracy effect.

5 Discussion

5.1 Constraints Redirect, Not Merely Degrade

The initial pilot (N = 70, single model) suggested that linguistic constraints impose a uniform cognitive tax. The full experiment (N = 4,470, three models, seven tasks, two constraints) reveals a more complex picture: constraints redirect reasoning in ways that are task-dependent, model-dependent, and constraint-dependent.

No-Have's broad effectiveness. The most practically significant finding is No-Have's consistent positive effect across tasks and models. Removing possessive framing improved 5 of 7 tasks, with the two largest effects—ethical dilemmas (+19.1 pp) and epistemic calibration (+7.4 pp)—on tasks where possession metaphors are densest ("the patient has a right," "this claim has strong support"). No-Have achieved this while maintaining 92.8% compliance, meaning its effects can be attributed to the constraint itself with minimal interpretive ambiguity. The crossover pattern is real but muted: only analogical reasoning (−1.4 pp) and syllogisms (−2.1 pp) showed declines, both non-significant.

E-Prime's model-dependent volatility. E-Prime shows the more dramatic effects, but they are unstable across models. The same constraint reshapes cognition in opposite directions depending on the model (Table 2), with cross-model correlations reaching r = −0.75 (p ≈ 0.05, n = 7; suggestive).
This only makes sense if different models occupy different native Umwelten—default cognitive worlds established by training corpus, architecture, and alignment process. An imposed constraint interacts with this native world rather than overriding it. The crossover pattern—improvements on causal reasoning and ethical dilemmas, degradation on syllogisms and epistemic calibration—is a signature that the constraint is redirecting cognition rather than generically improving or impairing it, though the absence of an active control matching the constraint prompt's elaborateness means a metalinguistic self-monitoring confound cannot be fully excluded (see Section 6).

The constraint comparison. No-Have outperforms E-Prime on 5 of 7 tasks despite being theoretically less studied and receiving no predictions of superiority. One interpretation: "have" encodes ownership metaphors that are particularly distorting for abstract reasoning (a patient "has" rights, an argument "has" flaws), while "to be" is so pervasive that its elimination creates more noise than signal. Another: No-Have's higher compliance (92.8% vs. 48.1%) means the restructuring operates more cleanly. The compliance-filtered analysis supports this: fully compliant E-Prime trials show stronger benefits and attenuated harms, suggesting the constraint's theoretical mechanism works but is diluted by compliance failures.

The ensemble results from Experiment 2 provide a complementary mechanism. No individual constrained agent exceeds the control's 88.2% accuracy on software debugging, yet a union ensemble achieves 100% coverage and a 3-agent subset matches this ceiling. The constraints produce orthogonal coverage patterns—different agents perceive different features of the same problem. This diversification effect operates independently of any individual constraint's accuracy impact.
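The diversification mechanism reduces to a few lines of set arithmetic. A toy sketch with invented finding sets (not the study's data): two agents that each trail the best individual can jointly cover everything it misses:

```python
# Invented finding IDs out of 8 total ground-truth findings.
ALL_FINDINGS = set(range(1, 9))

best_individual = {1, 2, 3, 4, 5, 6, 7}  # 7/8 findings
agent_a = {1, 2, 3, 4, 5, 6}             # 6/8: worse alone
agent_b = {3, 4, 5, 6, 7, 8}             # 6/8: worse alone, but sees finding 8

def coverage(found: set) -> float:
    """Fraction of ground-truth findings surfaced."""
    return len(found & ALL_FINDINGS) / len(ALL_FINDINGS)

print(coverage(best_individual))     # → 0.875
print(coverage(agent_a | agent_b))   # → 1.0  (the union beats any individual)
```

What matters for the union is not each agent's hit rate but which findings its misses overlap with the others' misses.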
Together, the two experiments demonstrate that linguistic constraints operate through at least two mechanisms: (1) cognitive restructuring, where removing a linguistic default forces more explicit reasoning—No-Have demonstrates this most cleanly, with broad improvements and high compliance—and (2) cognitive diversification, where different constraints activate different regions of the model's latent reasoning capacity (Experiment 2: ensemble coverage). These mechanisms are not mutually exclusive—a constraint can restructure reasoning for one task while providing orthogonal coverage when combined with other constraints.

The stronger claim from the pilot—that constraints are merely a cognitive tax—does not survive multi-model testing. But neither does the naive expectation that constraints uniformly help. The Umwelt is not a dial that turns reasoning up or down; it is a lens that brings some features into focus and blurs others.

5.2 Native Umwelten and Model-Constraint Interaction

The model-dependent pattern demands interpretation. Why does Gemini benefit from E-Prime while Haiku does not? One hypothesis: models differ in how tightly their default reasoning is coupled to copula-based formulations. If Gemini's training over-relies on "X is Y" patterns for classification and causal attribution, then E-Prime forces a beneficial decoupling. If Haiku's training has already diversified away from copula-dependence—perhaps through Anthropic's RLHF process or constitutional AI training—then E-Prime removes useful structure without providing compensatory restructuring. A related hypothesis: Gemini's lower baseline accuracy on several tasks (causal reasoning: 57.8% vs. Haiku's 97.7%) leaves more room for improvement.
The constraint may function as a form of implicit chain-of-thought: by forcing circumlocution, it increases the reasoning steps between stimulus and answer, benefiting models that would otherwise shortcut to an incorrect response. Models that already reason carefully gain less from this forced elaboration.

The GPT-4o-mini epistemic result provides a third data point. E-Prime collapses GPT-4o-mini's epistemic calibration from 53.8% to 26.2%—a −27.5 pp effect that dwarfs any other single-model degradation in the experiment. This suggests GPT-4o-mini relies heavily on copula constructions ("this claim is well-supported," "the evidence is inconclusive") as its primary mechanism for graduated epistemic assertion. Removing these constructions doesn't merely add friction—it eliminates the model's epistemic vocabulary. Haiku, by contrast, shows a +2.7 pp improvement on the same task under E-Prime, suggesting it has alternative epistemic strategies that activate when the copula pathway is blocked.

These hypotheses are testable. Attention analysis on constrained vs. unconstrained inference would reveal whether E-Prime activates different internal circuits (the Umwelt interpretation) or merely adds output-level variance (the cognitive tax interpretation). The GPT-4o-mini epistemic collapse suggests the former: a model that has no alternative pathway for a cognitive operation will fail catastrophically when its primary pathway is blocked, rather than degrading gradually as a noise account would predict. This distinction—activation of latent strategies vs. statistical noise—is perhaps the most important open question the Umwelt framework raises.

5.3 Why Possessive Framing Distorts More Than Identity Framing

No-Have was originally included as an exploratory control: a second constraint to test whether E-Prime's effects were specific to copula elimination or generalized to any vocabulary restriction.
The results invert this framing—No-Have is the more effective and more consistent intervention. No-Have's +19.1 pp improvement on ethical dilemmas (d = 0.57, p < 0.001) is the largest aggregate effect in the experiment. The mechanism is plausible: ethical reasoning in natural language is saturated with possessive framing—patients "have" rights, actions "have" consequences, stakeholders "have" interests. This framing reifies abstract relationships as owned properties, potentially obscuring the relational structure that ethical analysis requires. Removing "have" forces the model to articulate these relationships explicitly: "this action affects the patient's autonomy" rather than "the patient has a right." The same mechanism plausibly explains the classification improvement (+6.5 pp): categories "have" members, objects "have" properties—possessive framing collapses relational structure into containment metaphors.

No-Have's 92.8% compliance rate eliminates much of the interpretive ambiguity that plagues E-Prime analysis. When 51.9% of E-Prime trials contain violations, observed effects reflect a messy mixture of compliant and non-compliant reasoning. No-Have's cleaner compliance means its effects can be attributed to the constraint itself rather than to partial compliance artifacts.

The practical implication is direct: for most reasoning tasks, No-Have is a more effective cognitive intervention than E-Prime—broader in its benefits, milder in its degradations, and far easier for models to maintain. E-Prime remains the more theoretically informative constraint, precisely because its volatile, model-dependent effects reveal the structure of native Umwelten. But as a tool for improving agent reasoning, No-Have is the stronger instrument.

5.4 The Cognitive Tax Revisited

The pilot suggested a uniform cognitive tax of ∼8.7 pp. The full experiment complicates this picture.
E-Prime imposes a tax on syllogisms (−3.4 pp), epistemic calibration (−5.7 pp), and analogical reasoning (−3.0 pp), but produces gains on causal reasoning (+14.1 pp) and ethical dilemmas (+15.5 pp) that exceed any plausible tax. The net effect depends on the task.

A revised account: constraints impose two opposing forces. First, a compliance cost—the computational overhead of monitoring and reformulating language, which degrades performance on all tasks. Second, a restructuring benefit—the forced reformulation activates more explicit or more careful reasoning, which benefits tasks where default language masks reasoning gaps. The observed effect is the sum. For syllogisms, where the default language aligns well with the task structure, the compliance cost dominates. For causal reasoning and ethical dilemmas, where default language enables superficial pattern-matching, the restructuring benefit dominates.

The conciseness effect (16–33% word reduction) is consistent with this account. Constraints eliminate filler and hedging, producing more efficient reasoning chains. The compression is universal across tasks and models, suggesting it reflects compliance cost (less capacity for elaboration) and restructuring benefit (less need for elaboration when reasoning is more focused) in combination.

5.5 Counterfactual as Cognitive Affordance

The counterfactual agent's unique finding—identifying specification ambiguity by asking "what would differ if this assumption were false"—is a direct demonstration of a linguistic affordance creating a cognitive capability. The control agent had access to the same information and presumably knows what counterfactual reasoning is—it was not incapable of the operation. But its Umwelt did not make that operation a default mode of perception. The constraint made systematic assumption-inversion the agent's habitual lens, and a finding followed that no other lens surfaced.
This distinction matters for the prompt-vs-Umwelt boundary discussed in Section 5.8. One could argue that "consider counterfactuals" is simply a task instruction. But the counterfactual agent was not told to look for specification ambiguity—it was told to reason counterfactually about everything. The specific finding emerged because the cognitive mode made a specific feature of the problem perceptible. The constraint structured perception; the finding was a consequence.

5.6 Implications for Agent Architecture

If cognitive diversity is the mechanism underlying ensemble gain, then agent ensemble design becomes a question of Umwelt selection: which set of linguistic constraints produces maximally orthogonal coverage for a given task domain?

Table 7: Taxonomy of linguistic constraints for Umwelt engineering, organized by intellectual tradition and targeted cognitive failure mode.

Constraint      Tradition           Axis               Targets
E-Prime         Korzybski/Bourland  Semantic           False identity claims
Gen. Semantics  Korzybski           Extensional        Over-generalization
Rheomode        Bohm                Ontological        Entity bias
Operationalism  Bridgman            Epist.-procedural  Ungrounded claims
Toki Pona       Lang                Lexical            Abstraction leakage
Evidentiality   Elgin/Láadan        Epist.-source      Unsourced confidence
Catuṣkoṭi       Nāgārjuna           Logical            Premature binary resol.
NVC             Rosenberg           Evaluative         Obs.–judgment conflation

The greedy selection algorithm identified analogical, counterfactual, and minimal as the optimal 3-agent subset for software debugging—three constraints drawn from three different axes (cross-domain mapping, modal reasoning, and expressive compression). This suggests that axis diversity, not constraint intensity, drives ensemble value. The practical implication is that multi-agent systems should be designed not by duplicating capable agents, but by equipping agents with linguistically diverse reasoning modes.
Three agents with different Umwelten outperform sixteen agents with overlapping ones.

5.7 The Constraint Design Space

These experiments tested a handful of constraints drawn from a much larger design space. Section 2.6 surveyed eight intellectual traditions, each of which identified a specific axis along which language shapes thought and proposed a linguistic intervention. These traditions were developed independently—Korzybski working on map-territory confusion, Bohm on process metaphysics, Bridgman on operationalization, Elgin on epistemic transparency, Nāgārjuna on non-binary logic—yet they converge on a shared structural insight: that linguistic defaults encode cognitive defaults, and that reforming the language reforms the cognition. Table 7 organizes these traditions as a constraint taxonomy for Umwelt engineering.

Several features of this taxonomy bear emphasis. First, the axes are largely independent: removing identity claims (E-Prime) says nothing about evidential sourcing (Láadan), which says nothing about binary logic (Catuṣkoṭi). This independence predicts that constraints drawn from different axes will produce orthogonal effects on reasoning—precisely the mechanism that drove ensemble gain in Experiment 2. Second, each constraint makes a specific, testable prediction about which tasks it will improve and which it will degrade. E-Prime should degrade tasks that depend on identity bridges (confirmed: syllogisms) and improve tasks where identity claims mask reasoning gaps. Evidentiality constraints should improve tasks where epistemic sourcing matters (research synthesis, factual claims) and impose overhead on tasks where all information comes from a single authoritative source. The catuṣkoṭi should improve ethical dilemmas and design tradeoffs where binary framing loses information, and add unnecessary complexity to tasks with genuinely binary answers.
Third, the traditions suggest that the design space is not arbitrary. Each constraint was developed by careful thinkers who identified a real cognitive failure mode and proposed a linguistic remedy. The constraints have theoretical motivation, not just empirical novelty. This distinguishes Umwelt engineering from unprincipled prompt variation: the question is not "what random linguistic constraints produce interesting effects?" but "which established theories of language-thought interaction yield productive cognitive interventions for artificial agents?"

5.8 Relationship to Existing Frameworks

An obvious objection: if the constraint is delivered as a system prompt instruction, how is this not simply prompt engineering? The distinction requires careful articulation.

A prompt instruction specifies a task within the agent's existing cognitive world. "Reason step by step" triggers a reasoning strategy; "be concise" adjusts an output parameter; "you are a careful logician" activates a behavioral persona. In each case, the conceptual vocabulary remains standard English—the agent applies the instruction using its full default repertoire of concepts and grammatical structures. An Umwelt intervention restructures the medium through which all tasks are processed. "Eliminate all forms of 'to be'" does not specify what to think about or how carefully to think—it removes an entire class of cognitive operations (identity assertion, categorical attribution, essentialist shorthand) from the agent's available repertoire, forcing all subsequent reasoning through alternative pathways.

The empirical evidence supports this distinction on four grounds. First, the model-dependent effects. "Reason step by step" does not produce negative cross-model correlations—it helps broadly, because it is a task-level instruction that interacts minimally with model-specific internal structure.
E-Prime produces correlation coefficients of r = −0.36 and r = −0.75 between model pairs, meaning the same constraint reshapes cognition in opposite directions depending on the model's native architecture. This interaction signature is consistent with an intervention that engages internal representational structure rather than merely adding an output-level directive.

Second, the GPT-4o-mini epistemic collapse (−27.5 pp on a single task) is not the gradual degradation that instruction-following overhead would produce—it is a catastrophic failure of a specific cognitive capacity, consistent with the removal of a load-bearing linguistic structure rather than the addition of a processing burden.

Third, the compliance-filtered analysis shows that fully compliant E-Prime trials produce stronger beneficial effects and attenuated harmful effects compared to unfiltered trials. If the constraint operated merely as an instruction competing for the model's attention, higher compliance would mean higher attentional cost and worse performance uniformly. Instead, higher compliance amplifies the restructuring benefit—the constraint is not taxing the reasoning; it is redirecting it.

Fourth, and most directly: the two constraints function as mutual active controls for prompt elaborateness. Both E-Prime and No-Have prompts are comparably elaborate—both list forbidden forms, provide reformulation examples, and impose a metalinguistic self-monitoring demand. If the observed effects arose from the general demand for self-monitoring rather than from the specific vocabulary restriction, the two constraints should produce similar task profiles. They do not. On epistemic calibration, No-Have improves accuracy by 7.4 pp while E-Prime degrades it by 5.7 pp—a 13.1 pp swing between two equally elaborate prompts. On causal reasoning, E-Prime improves by 14.1 pp while No-Have improves by only 4.9 pp.
On classification, No-Have gains 6.5 pp to E-Prime's 3.1 pp. These differential effects can only be explained by which words are being restricted—possessive framing versus copula-based identity assertion—not by the shared demand for linguistic self-monitoring. The within-study comparison controls for prompt elaborateness more directly than any external active control could, because the two conditions share every feature except the specific vocabulary targeted.

None of this proves that the three-layer distinction is ontologically real rather than a useful abstraction. But the empirical signatures—model-specific interaction, catastrophic capacity failure, compliance-benefit correlation, and divergent task profiles between equally elaborate constraints—are more consistent with a medium-level intervention than with a task-level instruction. The three-layer stack may ultimately reduce to a spectrum rather than a sharp hierarchy. Even so, the far end of that spectrum—where linguistic interventions interact with model internals in structured, model-dependent ways—represents a design space that prompt engineering as currently practiced does not address.

Returning to the three-layer stack introduced in Section 1.2: prompt engineering optimizes within a fixed Umwelt, context engineering provides information within a fixed Umwelt, and Umwelt engineering designs the Umwelt itself. The empirical signatures reported here—model-specific interaction patterns, catastrophic capacity failures, compliance-benefit correlations—populate this framework with evidence that the third layer is not merely conceptual. The relationship to ORION's Mentalese [Tanmay et al., 2025] is direct: Mentalese is a deliberately designed Umwelt—a synthetic cognitive environment optimized for mathematical reasoning. The relationship to Coconut [Hao et al.
, 2024] is contrastive: Coconut demonstrates that language-based Umwelten may impose unnecessary constraints, suggesting that the design space includes non-linguistic cognitive environments. Both are instances of Umwelt engineering, whether their authors describe them as such or not.

6 Limitations

No external active control for prompt complexity (primary limitation). The E-Prime and No-Have system prompts are substantially more elaborate than the control condition, which receives no constraint instruction. This introduces a confound: some portion of the observed effects could arise from the presence of an elaborate meta-cognitive instruction—one that demands self-monitoring of language output—rather than from the specific vocabulary restriction. The strongest defense is the within-study comparison: the two constraints are comparably elaborate but produce divergent task profiles (a 13.1 pp swing on epistemic calibration alone; see Section 5.8), isolating the contribution of the specific vocabulary restriction from the shared self-monitoring demand. Nevertheless, a non-vocabulary active control (e.g., "ensure every paragraph opens with a topic sentence and closes with a transition" or "keep all sentences under 15 words") would provide external confirmation and rule out the possibility that any vocabulary-targeting prompt, regardless of which words are targeted, produces similar patterns. A dedicated active-control experiment remains the most important next step for this research program.

Ceiling and floor effects. Syllogisms hit 100% control accuracy for all three models, compressing the observable degradation range. Gemini's lower baseline on several tasks (causal reasoning: 57.8%, ethical dilemmas: 41.7%) inflates observable improvement relative to Haiku's higher baselines (97.7%, 98.2%).
Table 2 reports gap-normalized effects alongside raw deltas to aid interpretation: Gemini's +42.3 pp ethical-dilemmas improvement from a 41.7% baseline represents 72.4% of the available improvement room, while Haiku's +2.7 pp from 89.0% represents 24.5%. Both are real effects, but the raw numbers are not directly comparable.

E-Prime compliance. The 51.9% E-Prime violation rate means observed E-Prime effects reflect a mixture of compliant and non-compliant reasoning. Compliance-filtered analysis (Section 3.2.4) confirms the direction of effects but has reduced power. No-Have's 92.8% compliance provides substantially cleaner causal evidence for constraint effects.

Statistical notes. p-values in Table 1 use Fisher's exact test (two-sided), appropriate for contingency tables with zero or small cells. The syllogisms results are affected by the 100% control ceiling: E-Prime syllogisms degradation (p = 0.015) survives correction; No-Have syllogisms degradation (p = 0.074) does not reach conventional significance. Cohen's d values are reported for comparability with the continuous-outcome literature but are approximate for binary data; odds ratios would be more conventional. Cross-model correlations (r = −0.75, −0.36, 0.43) are computed on n = 7 tasks and should be interpreted as suggestive; the strongest (r = −0.75, Haiku vs. GPT-4o-mini) has p ≈ 0.05.

Model selection. All three models are cost-efficient instruction-following models. Constraint effects may differ on frontier models (GPT-4o, Claude Sonnet/Opus, Gemini Pro), which may have more capacity for simultaneous constraint compliance and reasoning. The model-dependent effects observed here predict that frontier models will show different interaction patterns, not necessarily smaller effects.

Task format. All tasks use multiple-choice format for scoring consistency.
Open-ended reasoning tasks—like Experiment 2's software debugging—may show different constraint effects. The conciseness finding (16–33% word reduction) suggests that constraints alter reasoning structure, not just answer selection, but the multiple-choice format may undercount effects that manifest in reasoning quality rather than answer accuracy.

LLM-as-judge. Experiment 2 used an LLM judge for semantic matching of claims to ground truth. While this avoids brittle string matching, it introduces judge noise and potential biases.

Answer extraction. Despite iterative expansion of the extraction pipeline, 85 of 4,429 trials (1.9%) resisted answer extraction. The remaining failures are concentrated in Haiku's ethical-dilemma and epistemic-calibration responses, where answers are embedded in discursive prose. While the overall impact is small, any systematic relationship between extraction failure and response correctness could bias accuracy estimates.

7 Future Work

The Umwelt framework opens several research directions:

Constraint cartography. The taxonomy in Table 7 identifies eight constraints across seven axes; Experiment 1 tests two of these (E-Prime and No-Have) across seven task types, with results that confirm the existence of task-dependent crossover effects. A full cartography would cross all eight constraints with the same task battery, producing an 8 × 7 matrix of effects. Each cell encodes a testable prediction: evidentiality constraints should improve epistemic-calibration tasks but impose overhead on single-source reasoning; the catuṣkoṭi should improve ethical dilemmas but add noise to tasks with binary ground truth; Toki Pona should improve explanation tasks but degrade tasks requiring precise technical vocabulary.
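The proposed cartography fits in a single effect matrix. A minimal sketch, with measured cells filled from the pooled deltas reported in this paper; the unnamed constraint and task labels are placeholders, not entries from Table 7:

```python
import numpy as np

# 8 constraints x 7 tasks; entries are accuracy deltas vs. control, in
# percentage points. Only E-Prime and No-Have rows carry measurements.
# "constraint_6..8" and "task_6..7" are placeholder labels.
constraints = ["e_prime", "no_have", "evidentiality", "catuskoti",
               "toki_pona", "constraint_6", "constraint_7", "constraint_8"]
tasks = ["syllogisms", "causal", "ethical", "classification",
         "calibration", "task_6", "task_7"]

effects = np.full((len(constraints), len(tasks)), np.nan)
effects[constraints.index("no_have"), tasks.index("ethical")] = 19.1
effects[constraints.index("no_have"), tasks.index("classification")] = 6.5
effects[constraints.index("no_have"), tasks.index("calibration")] = 7.4
effects[constraints.index("e_prime"), tasks.index("causal")] = 14.1
effects[constraints.index("e_prime"), tasks.index("ethical")] = 15.5

# Every remaining NaN cell encodes an untested prediction of the framework.
untested = int(np.isnan(effects).sum())   # 51 of 56 cells await measurement
```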
The model-dependent effects in Experiment 1 add a third dimension: each cell may vary across model architectures, suggesting a constraint × task × model tensor rather than a simple matrix.

Umwelt composition. Can constraints be productively combined? An agent reasoning in E-Prime + required uncertainty markers + analogical framing operates in a more structured Umwelt than any single constraint provides. Whether constraints compose additively, interfere, or interact non-linearly is an empirical question.

Dynamic Umwelt switching. Should an agent's linguistic world change depending on what it is doing? Sketch-of-Thought's per-task routing [Sketch-of-Thought, 2025] suggests yes, but their paradigms are fixed. A richer version would allow agents to shift Umwelten mid-task as reasoning demands change.

Native Umwelt characterization. The model-dependent effects in Experiment 1 suggest that each model has a "native Umwelt"—a default cognitive world established by its training corpus, architecture, and alignment process. Characterizing these native Umwelten is a prerequisite for principled constraint selection. If a model's native Umwelt already de-emphasizes copula-based reasoning (as Haiku's results suggest), E-Prime adds noise rather than restructuring. Mechanistic interpretability methods—sparse autoencoders, activation patching, probing classifiers—could map a model's native Umwelt by identifying which linguistic patterns most strongly activate its reasoning circuits.

Emergent vs. designed Umwelten. Quiet-STaR and Coconut demonstrate that models can develop effective reasoning formats without human design. When should one impose a designed Umwelt, and when should one let it emerge? The trade-off between interpretability (designed) and optimality (emergent) is largely unexplored.

Umwelt evaluation.
How do you measure whether one Umwelt is better than another for a given purpose? Accuracy alone is insufficient—the experiments here show that constrained agents may score lower individually while contributing more to ensemble coverage. Metrics for cognitive diversity, orthogonality of perception, and complementary coverage are needed.

Cross-linguistic Umwelten. If training language shapes LLM cognition [Wang et al., 2025], then multilingual reasoning environments constitute natural Umwelt experiments. An agent that reasons in Japanese about a problem described in English operates in a different cognitive world than one reasoning entirely in English.

Constraint space geometry. The taxonomy in Table 7 treats constraints as discrete categories, but the underlying space may be continuous—or at least partially so. Within a constraint family, one can titrate strictness (strict E-Prime vs. allowing copula in direct quotes vs. merely flagging identity claims). Across families, the question becomes whether constraints define positions in a shared geometric space with measurable axes and distances. If so, the orthogonality between constraints—whether E-Prime and evidentiality marking produce independent effect profiles across tasks, or whether their effects correlate—becomes empirically testable through factor analysis or representational similarity analysis on the constraint × task effect matrix. The ensemble results in Experiment 2 already suggest that constraints drawn from different axes produce more complementary coverage, but a systematic measurement of the constraint space's dimensionality and orthogonal structure would transform the taxonomy from a list into a coordinate system. This requires testing substantially more constraints (at minimum 6–8) across the same task battery—a natural sequel to the cartography program described above.
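A sketch of how that orthogonality test could run, assuming a fully measured constraint × task effect matrix; the values below are invented for illustration, not measured results:

```python
import numpy as np

# Rows: constraints; columns: tasks. Entries are accuracy deltas (pp)
# vs. control. Invented data standing in for a completed cartography.
rng = np.random.default_rng(42)
effects = rng.normal(loc=0.0, scale=8.0, size=(4, 7))

# Pearson correlation between constraint effect profiles: off-diagonal
# entries near zero indicate orthogonal constraints, which the ensemble
# results suggest yield complementary coverage.
profile_corr = np.corrcoef(effects)

# A simple dimensionality probe: how many principal directions explain
# 90% of the variance in the constraint x task matrix?
centered = effects - effects.mean(axis=0, keepdims=True)
sing_vals = np.linalg.svd(centered, compute_uv=False)
var_ratio = np.cumsum(sing_vals**2) / np.sum(sing_vals**2)
n_dims = int(np.searchsorted(var_ratio, 0.90) + 1)
```

With real data, `n_dims` estimates the dimensionality of the constraint space and `profile_corr` gives the pairwise orthogonality structure directly.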
A preliminary question is whether the space is genuinely continuous (admitting interpolation between constraints), mixed (continuous within families, discrete between them), or fundamentally discrete (with orthogonality measurable only as statistical independence of effect profiles). The answer determines whether Umwelt design is a search problem over a smooth manifold or a combinatorial problem over a structured set.

8 Conclusion

For a language model, the available language is not a transparent medium through which cognition passes—it is the cognition. A human can think beneath and beyond their words; a standard LLM cannot. Its vocabulary determines which concepts exist. Its grammar determines which relationships between concepts are expressible. Its conceptual distinctions determine which features of a problem become perceptible. Designing this language is Umwelt engineering: the construction of cognitive worlds for artificial minds.

The experiments demonstrate that linguistic constraints reshape agent cognition in measurable, task-dependent, and model-dependent ways. Removing possessive "to have"—a constraint that was originally exploratory—produces the broadest improvement: ethical dilemmas +19.1 pp, classification +6.5 pp, epistemic calibration +7.4 pp, with 92.8% compliance and consistent effects across models. Removing "to be" produces more dramatic but less predictable effects: causal reasoning +14.1 pp and ethical dilemmas +15.5 pp, but model-dependent volatility so severe that cross-model correlations of E-Prime effects reach r = −0.75. The contrast between the two constraints is itself revealing: possessive framing appears to be a more universal cognitive default than copula-based identity assertion, producing a broader distortion that can be more cleanly removed.
These effects replicate across three models from three vendors, though the magnitude and even direction vary by model—revealing that each model occupies a different native Umwelt that interacts differently with imposed constraints. In multi-agent settings, linguistically diverse agents achieve coverage that no individual agent can match: a 3-agent ensemble selected for Umwelt diversity achieves 100% ground-truth coverage on a debugging task where the best individual agent reaches 88.2%—and a permutation test confirms that only 8% of random 3-agent subsets match this ceiling, with every successful subset containing the counterfactual agent.

Two mechanisms emerge. Cognitive restructuring: constraints that remove linguistic defaults force more explicit, operational reasoning—No-Have demonstrates this most cleanly, with broad improvements and high compliance, while E-Prime reveals the mechanism's limits when compliance is low and model interaction is high. Cognitive diversification: different constraints activate different regions of a model's latent reasoning capacity, producing orthogonal coverage in ensemble settings—demonstrated by the counterfactual agent's unique finding and confirmed by the permutation test. Both mechanisms confirm that the linguistic cognitive environment determines the space of possible thought—an empirically measurable design variable, not a philosophical speculation. The primary open question is whether the observed restructuring effects are driven by the specific vocabulary restrictions or by the general demand for metalinguistic self-monitoring that any elaborate constraint prompt imposes; the crossover pattern favors the former but cannot rule out the latter without an active-control experiment.

These findings, together with converging evidence from synthetic reasoning languages [Tanmay et al., 2025], latent-space reasoning [Hao et al.
, 2024], and cross-linguistic cognition in LLMs [Wang et al., 2025], establish the case for a three-layer framework—prompt engineering, context engineering, Umwelt engineering—and call for systematic investigation of the design space it opens. Design the world first. Then worry about the question.

References

Badr AlKhamissi, Greta Tuckute, Yizhou Tang, et al. From language to cognition: How LLMs outgrow the human language network. EMNLP 2025, 2025.

David Bohm. Wholeness and the Implicate Order. Routledge, 1980.

David D. Bourland, Jr. A linguistic note: Writing in E-Prime. General Semantics Bulletin, 32/33:111–114, 1965.

David D. Bourland, Jr. and Paul David Johnston. To Be or Not: An E-Prime Anthology. International Society for General Semantics, 1991.

Percy W. Bridgman. The Logic of Modern Physics. Macmillan, 1927.

James Cooke Brown. Loglan. Scientific American, 192(6):53–63, 1955.

Suzette Haden Elgin. Native Tongue. DAW Books, 1984.

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. COLM 2025, 2024.

Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X. Wang, and Sadid Hasan. Does prompt formatting have any impact on LLM performance? 2024.

Alfred Korzybski. Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics. International Non-Aristotelian Library, 1933.

Oliver Kramer. Conceptual metaphor theory as a prompting paradigm for large language models. 2025.

Sonja Lang. Toki Pona: The Language of Good. Tawhid, 2001.

Nāgārjuna. Mūlamadhyamakārikā. c. 150. J. L. Garfield, Trans., 1995, as The Fundamental Wisdom of the Middle Way. Oxford University Press.

Partha Ray. Does linguistic relativity hypothesis apply on ChatGPT responses? Yes, it does. Computational Intelligence (Wiley), 2025.
DOI: 10.1111/coin.70103.

Marshall B. Rosenberg. Nonviolent Communication: A Language of Life. PuddleDancer Press, 2nd edition, 2003.

Sketch-of-Thought. Sketch-of-Thought: Efficient LLM reasoning with adaptive cognitive-inspired sketching. EMNLP 2025, 2025.

Kumar Tanmay, Kunal Aggarwal, Paul Pu Liang, and Subhabrata Mukherjee. Thinking in the language of thought: Efficient reasoning with structured representations. 2025.

Jakob von Uexküll. A Foray into the Worlds of Animals and Humans. University of Minnesota Press, 1934. J. D. O'Neil, Trans., 2010.

Changzai Wang, Yichi Zhang, Liang Gao, Ziming Xu, Zefan Song, Yue Wang, and Xiang Chen. Under the shadow of Babel: How language shapes reasoning in LLMs. 2025.

Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking. 2024.

A E-Prime Constraint Prompt

The following system prompt was used for the E-Prime condition in Experiment 1:

You must reason and respond entirely in E-Prime—a form of English that eliminates all forms of "to be." You may not use: is, am, are, was, were, be, being, been, or contractions containing these (it's, that's, there's, who's, etc.). Reformulate all statements using active verbs, process descriptions, or relational language. Do not merely rephrase surface syntax—restructure your reasoning to avoid ontological identity claims.

B Agent Constraint Prompts

Full system prompts for all 16 agents in Experiment 2 are available in the supplementary repository.

C No-Have Constraint Prompt

The following system prompt was used for the No-Have condition in Experiment 1:

You must reason and respond without using any form of "to have" as a main verb. You may not use: has, have, had, having when they express possession, containment, or attribution.
Auxiliary uses are permitted (e.g., "has completed," "have been"). Reformulate all possessive statements using relational, behavioral, or structural language. "The argument has a flaw" becomes "a flaw appears in the argument." "This system has three components" becomes "three components make up this system."

D Per-Model Accuracy Breakdown

Full per-model accuracy tables are available in the supplementary data files.

E Reproducibility

All code, data, and results are available at: https://github.com/rodspeed/umwelt-engineering.

• Experiment 1: e-prime-llm/ — task items, scoring rubrics, multi-model runner, and full results
• Experiment 2: linguistic-agents/ — agent definitions, problem bank, 5-phase pipeline, and full analysis
• Models: Claude Haiku 4.5 (claude-haiku-4-5-20251001), GPT-4o-mini (gpt-4o-mini-2024-07-18), Gemini 2.5 Flash Lite (gemini-2.5-flash-lite)
• Experiment 1: 4,470 trials across 7 tasks × 3 conditions × 3 models × 4 repetitions
• All experiments are resumable (JSONL append) and reproducible at temperature 0.0

Acknowledgments

This paper was developed through extensive collaboration with Claude (Anthropic, 2024–2026). The Umwelt engineering framework, three-layer stack, experimental design, and core thesis are the author's own. The constraint taxonomy (Table 7) emerged from directed inquiry: the author hypothesized that a design space of linguistic cognitive constraints existed beyond E-Prime and used structured dialogue with Claude to surface candidate traditions, which were then evaluated, organized, and integrated into the framework by the author. Claude also assisted with literature review, drafting, statistical analysis, and code for the experimental pipeline. This division of labor—human hypothesis and architectural judgment, AI recall and drafting—is itself an instance of the collaborative cognitive environments this paper examines.