The Coordinate System Problem in Persistent Structural Memory for Neural Architectures


Authors: Abhinaba Basu

Abstract

We introduce the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions, and use it to discover two independent requirements for persistent structural memory in neural networks. Through five progressively refined experiments using up to 10 seeds per condition across 5 model variants and 4 transfer targets, we identify a core principle: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. We characterize three obstacles (pheromone saturation, surface-structure entanglement, and coordinate incompatibility) and show that neither contrastive updates, multi-source distillation, Hungarian alignment, nor semantic decomposition resolves the instability when embeddings are learned from scratch. Fixed random Fourier features, derived from the Johnson-Lindenstrauss lemma and Bochner's theorem, provide extrinsic coordinates that are stable, structure-blind, and yet informative; but with 10 seeds, position-only Fourier coordinates show no significant transfer advantage with routing-bias pheromone (p > 0.05 for all conditions), revealing that coordinate stability is necessary but not sufficient. DPPN with pheromone-biased routing outperforms transformer and random sparse baselines for within-task learning (AULC 0.700 ± 0.010 vs. 0.680 ± 0.010 vs. 0.670 ± 0.010), confirming the architecture's value. Replacing routing bias with learning-rate modulation eliminates negative transfer entirely: warm pheromone as a learning-rate prior never reduces performance (mean advantage +0.002 across 7 seeds) while routing-bias pheromone consistently does (mean −0.002 across 10 seeds).
The diagnostic cascade methodology, in which each experiment's finding motivates the next experiment's design, may itself serve as a template for systematic architectural investigation. When both coordinates and signal computation are fully extrinsic, the first positive structural transfer emerges: a structure completion function over stable coordinates produces a +0.006 AULC same-family bonus beyond a regularization baseline (10 trained seeds, 5 random-control seeds), demonstrating that the catch-22 between stability and informativeness is partially permeable to learned functions. The contribution is the identification of two independent requirements for persistent structural memory: (a) coordinate stability: the coordinate system must be fixed before statistics are accumulated; and (b) a graceful transfer mechanism: learned functions or learning-rate modulation, not routing bias, because wrong priors must degrade gracefully rather than actively interfere.

∗ National Institute of Electronics and Information Technology (NIELIT). Correspondence to: mail@abhinaba.com

1 Introduction

A chess player who masters the principle of central control does not relearn it for each new opening. The structural insight, that occupying the center enables tactical flexibility, transfers across games with entirely different surface moves. An ant colony that discovers a short path between nest and food source does not recompute this route from scratch each morning; pheromone trails persist overnight, biasing the next day's foragers toward previously successful paths [2]. Both are instances of persistent structural memory: knowledge about which pathways work, accumulated over experience and reused when the same structural patterns recur under different surface conditions.

Deep learning architectures lack this capacity. A transformer [1] recomputes every attention pattern from scratch on each forward pass.
When a model trained on Task A encounters Task B, which shares the same structural dependencies but uses different tokens, features, or modalities, it must rediscover the shared structure from scratch. There is no memory of which computational routes proved useful that could transfer between tasks.

We set out to build one. Inspired by ant colony optimization, we constructed a pheromone field over the latent routing space of a transformer-like architecture: a persistent, non-gradient statistic that accumulates evidence about which structural pathways lead to correct predictions, survives when model weights are reset, and biases future routing toward historically productive patterns. If it worked, it would be the first mechanism that enables structural transfer without any shared parameters between source and target tasks.

The path from concept to working transfer revealed five distinct obstacles, each illuminating a requirement that any persistent memory system must satisfy. Across five experiments, each fixing the previous obstacle and revealing the next, we traced every obstacle to one root cause: persistent memory requires a stable coordinate system, and any coordinate system learned jointly with the model is inherently unstable. The pheromone field records which slot transitions are useful, but the slots themselves are defined by learned projections that change unpredictably across training runs and weight resets. Two independently trained models assign the same structural patterns to different slots; their pheromone fields are maps drawn in different coordinate systems, and no amount of post-hoc alignment recovers the correspondence (3.5% correlation vs. 3.1% chance).
In a fifth experiment, we test a solution suggested by the Johnson-Lindenstrauss lemma: fixed random Fourier features that operate on raw positional inputs, providing coordinates that are stable across runs, structure-blind (containing no task information), and yet distance-preserving. With 3 seeds, the position-only variant appeared to show a directionally correct transfer pattern, but with 10 seeds the effect washed out to a uniform −0.002 (not significant). Coordinate stability is necessary but not sufficient: a further experiment shows that the transfer mechanism matters independently. Replacing routing bias with learning-rate modulation (pheromone as a meta-learning-rate prior) eliminates negative transfer entirely.

We make three contributions:

1. We introduce the Dual-View Pheromone Pathway Network (DPPN), an architecture that routes sparse attention through a persistent pheromone field over latent slot transitions (Section 3).
2. We conduct five experiments (Sections 5–9), each diagnosing a distinct obstacle. Together they constitute a diagnostic cascade: pheromone saturation → surface-structure entanglement → coordinate mismatch → embedding instability → the coordinate system problem.
3. We identify the coordinate system problem as the fundamental obstacle (Section 10) and show that coordinate stability is necessary but not sufficient: even with stable coordinates, routing-bias pheromone does not transfer (10 seeds, p > 0.05). We provide evidence that the transfer mechanism matters independently: learning-rate modulation eliminates negative transfer while routing bias does not (Section 12).

2 Research Gap and Evolution of Ideas

The coordinate system problem we identify in this paper did not emerge in a vacuum.
It sits at the intersection of four research threads (memory-augmented networks, persistent state models, transfer learning, and random feature theory), none of which, individually, confronted the specific obstacle we characterize. This section traces the evolution of ideas that converges on the coordinate stability requirement, and identifies the gaps between existing lines of work that our contribution fills.

2.1 Phase 1: External Memory (2014–2016)

The Neural Turing Machine [8] demonstrated that neural networks can learn to read from and write to external memory using differentiable content-based addressing. The Differentiable Neural Computer [9] extended this with temporal linking, allowing the network to traverse memory in the order it was written, and dynamic allocation, preventing overwriting. Memory Networks [20] introduced multi-hop attention over an external memory bank with learned addressing. These architectures established a crucial capability: neural networks can maintain and manipulate information beyond their parameter space.

However, the memory in all three cases is episodic: it stores specific content (input patterns, intermediate computations) rather than structural knowledge (which computational routes are useful). The NTM's memory matrix records what was written; it does not record which read-write patterns proved effective across many inputs. The DNC's temporal linking provides structural memory in a limited sense (it remembers the order of writes), but this structure is specific to one episode and is not persistent across tasks. When the model is applied to a new task, the memory is typically cleared or re-initialized.

The gap: these architectures demonstrated that external memory is useful, but none separated structural memory (which patterns of memory access are effective) from content memory (what information is stored).
Our pheromone field is precisely this separation: it records which slot-to-slot transitions are useful, independent of what information flows along those transitions.

2.2 Phase 2: Persistent State (2019–2024)

A second wave of architectures introduced persistent state that carries information across segments or time steps within a task. Transformer-XL [10] caches hidden states from previous segments, enabling the model to attend beyond its context window. The Compressive Transformer [11] compresses old hidden states rather than discarding them, extending the effective memory horizon. RWKV [23] combines the parallelizability of transformers with a recurrent state that accumulates information across time steps, using linear attention with exponential decay.

The most relevant development in this phase is the Structured State Space sequence model (S4) [21] and its selective variant Mamba [22]. S4 parameterizes its state transition using the HiPPO (High-order Polynomial Projection Operators) matrix, a mathematically derived, fixed basis for representing functions of time. The HiPPO matrix is not learned from data; it is derived from the requirement that the state optimally approximates the history of the input signal under a specific measure (e.g., the Legendre measure for uniform weighting of history). This is, in a precise sense, a fixed coordinate system for temporal memory.

The connection to our work is direct: S4's HiPPO basis solves the coordinate stability problem for temporal memory. The basis is defined prior to training (extrinsic), shared across tasks by construction (it depends only on the temporal measure, not the data), and equipped with a metric (the Legendre polynomial basis provides an orthogonal decomposition of temporal history). These are exactly the three properties we identify in Section 10 as necessary for persistent memory.
The critical distinction is that HiPPO provides coordinates for temporal memory (approximating what happened in the recent past), whereas our pheromone requires coordinates for structural memory (recording which computational routes are effective). HiPPO's success for temporal memory, and the obstacle we encountered for structural memory with learned coordinates, are two instances of the same principle: persistent memory requires a fixed basis.

The gap: persistent state models operate within a single task. Transformer-XL's cache is cleared between tasks. S4's state is reset. Mamba's selective state space is task-specific. None of these architectures address the question of whether persistent state can transfer structural knowledge from one task to another. Our work asks this question directly and finds that the answer depends entirely on whether the coordinate system is stable.

2.3 Phase 3: The Coordinate System Gap

Transfer learning and domain adaptation provide a third thread. Invariant Risk Minimization [4] seeks representations that are invariant across environments, but requires access to multiple training environments and assumes the representation space is adequate. Domain-adversarial training [24] learns representations that cannot distinguish source from target domain, but the adversarial training modifies the representation itself. Both approaches assume that a shared representation space exists and can be discovered by learning.

Our work reveals a more fundamental problem. The issue is not what is represented but where it is represented: the coordinate system of the representation. Even if the structural content of two pheromone fields is identical (because the tasks share the same structural family), the fields are defined over different coordinate systems (because the soft groupers converged to different projections).
Transfer fails not because the knowledge is wrong, but because the knowledge is expressed in incompatible coordinate systems. This is the gap between existing approaches: the literature on transfer learning and domain adaptation assumes a shared representation space and focuses on learning invariant content within that space. Nobody previously identified the coordinate system of the representation as the bottleneck. When the coordinate system itself is learned, it varies across training runs, and any persistent statistics accumulated over those coordinates become meaningless when transferred.

2.4 Phase 4: The Solution Path

The final thread provides the tools for a solution, though the connection has not previously been made. Random Kitchen Sinks [25] showed that random Fourier features approximate shift-invariant kernels, establishing that fixed random projections are statistically informative without any learning. Extreme Learning Machines [26] demonstrated that networks with fixed random hidden layers and only a trained output layer achieve competitive performance, proving that learned intermediate representations are not always necessary. Echo State Networks and reservoir computing [27] showed that fixed random recurrent dynamics, with only a trained readout, can model complex temporal patterns. The Johnson-Lindenstrauss lemma [32] provides the theoretical foundation: random projections preserve pairwise distances with high probability, guaranteeing that geometric relationships in the original space are maintained in the projected space. From biology, grid cells [28] provide an innate hexagonal coordinate system for spatial memory that is present before any environmental experience; the animal does not need to learn the coordinate system for its cognitive map.
The fly olfactory circuit [29] uses sparse random expansion (from 50 olfactory receptor types to 2,000 Kenyon cells via random projections) for similarity-preserving hashing, enabling rapid odor classification without learned feature extraction.

The synthesis, which our work makes explicit, is: random projections provide the fixed, structure-blind, yet geometrically informative coordinates that persistent structural memory requires. Random features satisfy all three properties of the coordinate stability requirement:

(a) Extrinsic definition: random projections are drawn before seeing any data.
(b) Cross-task sharing: the same random projection matrix is used across tasks (shared by construction, not by alignment).
(c) Structural metric: the JL lemma guarantees that distances, and therefore structural relationships, are approximately preserved.

This connection between random feature theory and the coordinate system problem for persistent memory has not been made in the literature. The random features community established that fixed random projections are informative; the persistent state community established that persistent memory improves sequence modeling; the transfer learning community established that shared representations enable knowledge transfer. Our contribution is the identification of the specific gap between these threads: persistent structural memory that transfers across tasks requires a fixed coordinate system, and random features are the natural candidate to provide it.

With these principles as context (fixed bases for persistent memory, random features for stable coordinates, and the gap between temporal and structural memory), we now describe the DPPN architecture.

3 Architecture: Dual-View Pheromone Pathway Networks

The DPPN architecture routes sparse attention through a persistent pheromone field defined over latent slot transitions (Figure 1).
The computational path is: tokens → embedding → dual soft grouping → slot-level agreement → pheromone-biased routing → token-space sparse mask → sparse attention → fast/slow gate fusion → output.

3.1 Token Embeddings

Given input tokens x = (x_1, ..., x_N) with x_i ∈ {1, ..., V}, we compute:

    h_i = E_tok[x_i] + E_pos[i],    h ∈ R^{B×N×d}    (1)

where E_tok ∈ R^{V×d} and E_pos ∈ R^{N_max×d} are learned embedding matrices.

3.2 Dual Soft Groupers

Two independent soft groupers project tokens into m latent role slots:

    Q_v = softmax((W_v · h) / T + γ · ε),    v ∈ {1, 2}    (2)

where W_v ∈ R^{d×m} are learned projection matrices (applied per-token as h_i^⊤ W_v ∈ R^m for each position i), T is a temperature parameter, ε is Gumbel noise sampled during training (ε_ij = −log(−log(u_ij)), u_ij ∼ Uniform(0, 1)), and γ = 0.5 modulates the noise magnitude. Each Q_v ∈ R^{B×N×m} is a soft assignment matrix whose rows sum to 1: Q_v(i, a) represents the degree to which token i is assigned to slot a under view v. The Gumbel noise ensures that the two groupers produce stochastically different views, providing the diversity needed for agreement-based routing.

3.3 Slot-Level Support and Agreement

For each view, we compute a slot-to-slot interaction support matrix:

    S^(v) = Q_v^⊤ · K_x · Q_v ∈ R^{B×m×m}    (3)

where K_x = W_K h is a learned compatibility kernel. In practice, we compute s = Q_v^⊤ h and k = Q_v^⊤ K_x, and form S^(v) = s · k^⊤, followed by ReLU activation and row-wise normalization. The entry S^(v)_ab measures how strongly slots a and b interact under view v. The two views are then combined via a confidence-aware agreement signal:

    A = μ · sqrt((S^(1) + ε)(S^(2) + ε)) + (1 − μ) · (S^(1) + S^(2)) / 2    (4)

where μ = σ(w^⊤ h̄) ∈ [0, 1] is a learned confidence gate computed from the mean-pooled input h̄ = (1/N) Σ_i h_i.
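A minimal NumPy sketch of Eqs. (2)–(4) for a single sequence (no batch dimension) may clarify the shapes involved. Hyperparameters follow the text; the helper names (`soft_group`, `slot_support`, `agreement`) are ours, not the paper's:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_group(h, W, T=1.0, gamma=0.5, rng=None):
    """Eq. (2): project token embeddings h (N, d) into m slots; Gumbel noise at train time."""
    logits = (h @ W) / T                                    # (N, m)
    if rng is not None:                                     # training: stochastic views
        u = rng.uniform(1e-9, 1.0, size=logits.shape)
        logits = logits + gamma * (-np.log(-np.log(u)))     # Gumbel(0, 1) samples
    return softmax(logits, axis=-1)                         # rows sum to 1

def slot_support(Q, h, Kx, eps=1e-6):
    """Eq. (3) as implemented: S = ReLU((Q^T h)(Q^T Kx)^T), row-normalized."""
    s, k = Q.T @ h, Q.T @ Kx                                # (m, d) each
    S = np.maximum(s @ k.T, 0.0)                            # (m, m) slot interactions
    return S / (S.sum(axis=-1, keepdims=True) + eps)

def agreement(S1, S2, mu, eps=1e-6):
    """Eq. (4): confidence-gated blend of geometric (strict) and arithmetic (permissive) means."""
    return mu * np.sqrt((S1 + eps) * (S2 + eps)) + (1 - mu) * 0.5 * (S1 + S2)
```

The two views would use independent projection matrices W_1, W_2 and independent Gumbel draws, so that `agreement` rewards transitions supported under both stochastic groupings.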
When the model is confident (μ → 1), the geometric mean (strict consensus) dominates; when uncertain (μ → 0), the arithmetic mean (permissive) takes over, preventing overconfident routing early in training.

3.4 Pheromone-Biased Routing

The routing probability from slot a to slot b given input x is:

    p(a → b | x) = softmax_b(α · log(τ_ab) + β · log(A_ab + ε))    (5)

where τ ∈ R^{m×m} is the pheromone field (Section 3.5), α and β control the relative influence of pheromone memory vs. current-input evidence, and the softmax is taken over target slots b.

3.5 Pheromone Memory

The pheromone field τ ∈ R^{m×m} is the central novel component. It is not a gradient-trained parameter. It is an exponential moving average (EMA) running statistic, updated after loss.backward() based on prediction correctness. It is stored in float32 (never cast to bfloat16), saved and loaded separately from model weights, and, critically, persists when weights are reset. The update rule, applied after each training step, consists of four stages:

Evaporation. All transitions decay toward a minimum τ_min:

    τ ← ρ · τ + (1 − ρ) · τ_min    (6)

with evaporation rate ρ = 0.8. Unused transitions gradually fade.

Signed deposit. Correct predictions reinforce active transitions; incorrect predictions weaken them:

    Δτ = (1/B) Σ_{b=1}^{B} s_b · p^(b),    s_b = +1 if prediction correct, −1 if prediction incorrect    (7)

where p^(b) ∈ R^{m×m} is the route preference matrix from the forward pass (Eq. 5) for sample b.

Sparse update. Only the top-k transitions by |Δτ| receive updates:

    τ_ab ← τ_ab + δ · Δτ_ab · (τ_max − τ_min) · 1[(a, b) ∈ top-k]    (8)

where δ = 0.3 is the deposit rate and Δτ is normalized by its maximum absolute value.

Clamping. Pheromone values are clamped to [τ_min, τ_max] = [0.1, 2.0], following the MAX-MIN Ant System (MMAS) convention [3].
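The routing bias (Eq. 5) and the four-stage update (Eqs. 6–8 plus clamping) can be sketched as follows. This is a simplified single-matrix version, assuming the signed deposit Δτ has already been averaged over the batch per Eq. (7); function names are ours:

```python
import numpy as np

def route_probs(tau, A, alpha=1.0, beta=1.0, eps=1e-6):
    """Eq. (5): softmax over target slots b of pheromone- and agreement-biased logits."""
    logits = alpha * np.log(tau) + beta * np.log(A + eps)
    logits = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def pheromone_update(tau, delta_tau, rho=0.8, delta=0.3, k=128,
                     tau_min=0.1, tau_max=2.0):
    """Eqs. (6)-(8) + MMAS clamping; delta_tau is the batch-mean signed deposit (Eq. 7)."""
    tau = rho * tau + (1.0 - rho) * tau_min                 # evaporation toward tau_min
    d = delta_tau / (np.abs(delta_tau).max() + 1e-9)        # normalize by max |dtau|
    flat = np.abs(d).ravel()
    k = min(k, flat.size)
    thresh = np.partition(flat, -k)[-k]                     # top-k magnitude threshold
    mask = np.abs(d) >= thresh                              # sparse update support
    tau = tau + delta * d * (tau_max - tau_min) * mask
    return np.clip(tau, tau_min, tau_max)                   # MMAS clamping
```

Because the update runs outside autograd, no gradients flow into τ; it behaves as a running statistic that can be saved, restored, and kept across weight resets.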
3.6 Token-Space Mask Projection

The slot-space routing is projected back to token space:

    M(i, j) = Σ_{a,b} Q_1(i, a) · p(a → b) · Q_2(j, b)    (9)

or in matrix form, M = Q_1 · p · Q_2^⊤ ∈ R^{B×N×N}. The top-k entries per row of M define a sparse attention mask.

3.7 Sparse Attention and Fast/Slow Gate Fusion

Standard scaled dot-product attention is computed through the sparse mask (fast lane). A local window attention with window size w serves as a slow-lane fallback:

    h_out = g · h_fast + (1 − g) · h_slow    (10)

where g = σ(W_g [h_fast; h_slow; Ā]) is a learned gate that incorporates the mean agreement strength Ā. The gate is biased toward the slow lane early in training via a phase-dependent factor: g ← g · max(0, (φ − 0.2) / 0.8), where φ ∈ [0, 1] tracks training progress.

3.8 Transfer Mechanism

The transfer protocol exploits the separation between model weights and pheromone memory:

1. Train on source task T_A for E_source epochs; pheromone τ accumulates structural memory.
2. Save the pheromone state τ_saved.
3. Reset all model weights (embeddings, attention projections, classifier) to fresh random initialization.
4. Restore only the pheromone: τ ← τ_saved.
5. Train on target task T_B for E_transfer epochs.

If the pheromone captured structural patterns that generalize, the warm pheromone should bias routing toward useful pathways from the start, accelerating learning on T_B relative to a cold (uniform pheromone) start.

3.9 Worked Example

Consider a concrete forward pass to illustrate the computational flow. A sequence of length N = 128 contains motif tokens planted at positions 5, 60, and 120 (with the remaining positions filled with noise tokens). The soft grouper (Eq. 2) projects all 128 token embeddings into m = 32 slot assignments. Position 5 is assigned primarily to slot 7 (Q(5, 7) = 0.34), position 60 to slot 15 (Q(60, 15) = 0.28), and position 120 to slot 23 (Q(120, 23) = 0.41). These are the dominant entries in their respective rows; the remaining probability mass is spread across other slots.

The slot support matrix (Eq. 3) aggregates pairwise interactions: S(7, 23) = 0.85, reflecting that tokens assigned to slots 7 and 23 have high compatibility. The pheromone field, having been trained on similar structural patterns, has τ(7, 23) = 1.80 (near the maximum τ_max = 2.0), indicating that the transition from slot 7 to slot 23 has been consistently reinforced by correct predictions. The routing probability (Eq. 5) combines pheromone and agreement: p(7 → 23 | x) = softmax_23(1.0 · log(1.80) + 1.0 · log(0.85 + ε)) = 0.42. This is substantially higher than the uniform baseline of 1/32 = 0.031.

Projecting back to token space (Eq. 9): M(5, 120) = Q_1(5, 7) · p(7 → 23) · Q_2(120, 23) = 0.34 × 0.42 × 0.41 = 0.059. After top-k selection, position 5 attends to position 120 despite their distance of 115 tokens, because the pheromone field has learned that this structural connection is useful. Under cold (uniform) pheromone, τ(7, 23) = 1.05, yielding p(7 → 23) = 0.034 and M(5, 120) = 0.005, likely below the top-k threshold, so the connection would not be made.

4 Experimental Protocol

4.1 Task Design

We construct synthetic sequence classification tasks organized into structural families. Each family defines a set of structural motifs (specific patterns of token co-occurrence and ordering) that determine the classification label. Tasks within the same family share identical structural motifs but use different surface token mappings. Specifically, we define three families (A, B, C), each with a distinct set of motifs planted in sequences of length 128 over a vocabulary of size 32.
Each sample contains 2–3 motifs from its family, with the label determined by a deterministic lookup table over motif configurations. The noise level is set to 0.02 to maintain a productive learning regime. This design yields six task configurations: source tasks A1 and A1′ (used jointly for multi-source distillation), and four transfer targets (A2, B1, A3, C1). The full task stream is:

• Source: A1 and A1′ (family A, two different surface mappings, trained jointly)
• Transfer targets: A2 (family A, new surface), B1 (family B), A3 (family A), C1 (family C)

The structural Jaccard similarity between same-family tasks is 1.0 (identical motifs); between different-family tasks it is 0.0.

4.2 Models

We compare three architectures, matched for parameter count:

1. DPPN (d = 64, 4 heads, 3 layers, m = 32 slots, top-k = 32): the full architecture described in Section 3, with pheromone-biased routing.
2. Transformer baseline (d = 64, 4 heads, 3 layers): standard transformer encoder with dense attention. No structural memory; the warm/cold distinction is meaningless for this model.
3. Random Sparse baseline (d = 64, 4 heads, 3 layers, top-k = 32): transformer with random sparse attention at the same sparsity budget as DPPN. Controls for whether sparsity itself (rather than learned routing) drives any observed effects.

4.3 Transfer Protocol

Phase 1 (Source Training): Train on source tasks A1 and A1′ jointly for 80 epochs. For DPPN, pheromone accumulates over both tasks; the multi-source training is designed to expose the model to the same structural patterns under different surface tokens.

Phase 2 (Transfer): For each of the four transfer targets (A2, B1, A3, C1):

• Cold condition: Reset all model weights; reset pheromone to uniform τ_ab = (τ_min + τ_max)/2. Train for 50 epochs.
• Warm (distilled) condition: Reset all model weights; load pheromone distilled from A1 and A1′ via element-wise minimum (after Hungarian alignment). Train for 50 epochs.
• Warm (rank-reduced) condition: Same as warm distilled, but the pheromone is additionally rank-reduced via SVD (rank 4) to compress surface-entangled components.

For the transformer and random sparse baselines, only the cold condition is run (warm/cold is meaningless without pheromone).

4.4 Metrics

• AULC (Area Under Learning Curve): AULC = (1/E) ∫_0^E a(t) dt, where a(t) is validation accuracy at epoch t and E = 50 is the number of transfer epochs. Higher AULC indicates faster learning.
• Epochs to 70%: First epoch at which validation accuracy reaches 0.70.
• Transfer advantage: ΔAULC = AULC_warm − AULC_cold. Positive values indicate that warm pheromone accelerates learning.

Experiments in Sections 5–8 are run with 3 seeds (42, 137, 256). The position-only Fourier experiment (Section 9) is extended to 10 seeds, and the meta-learning-rate extension (Section 12) uses 7 seeds. All experiments use a single NVIDIA H100 GPU with bfloat16 precision for model parameters and float32 for pheromone.

5 Experiment 1: Discovering Pheromone Saturation and the Contrastive Fix

5.1 The Problem: Non-Contrastive Updates

Our initial DPPN implementation used non-contrastive pheromone updates: all active transitions received positive reinforcement regardless of prediction correctness. Additionally, the evaporation formula contained a bug: τ ← (1 − ρ)τ + ρτ, which is the identity operation and provides no evaporation.

Table 1: Pheromone evolution during source training (DPPN, seed 42). The field develops structure within the first 30 epochs, then stabilizes.

    Epoch   Val. Acc.   τ̄ ± σ_τ          Entropy
    0       0.476       0.331 ± 0.227    6.76
    5       0.638       0.358 ± 0.637    5.97
    10      0.708       0.356 ± 0.645    5.94
    20      0.722       0.338 ± 0.629    5.93
    30      0.734       0.338 ± 0.629    5.93
    50      0.734       0.338 ± 0.629    5.93
    79      0.794       0.338 ± 0.629    5.93

5.2 Results: Round 1 (Original Task Difficulty)

With the original task design (high noise, large vocabulary), all models reached only 55–63% validation accuracy on the source task, with heavy overfitting. Transfer advantages were within noise (±0.005 AULC across all conditions). Pheromone diagnostics revealed the core issue: τ values saturated to uniformly high values (0.93 ± 0.05). With near-uniform pheromone, the warm and cold conditions were effectively identical.

5.3 Results: Round 2 (Task Rebalancing)

We reduced noise to 0.02, used 3 tokens per motif region, sequence length 128, vocabulary 32, and balanced labels via a deterministic lookup. Source validation accuracy jumped to ∼98%. However, transfer advantages remained zero: both warm and cold conditions converged to ceiling accuracy within the first few transfer epochs, leaving no window for pheromone to provide an advantage.

5.4 Diagnosis and Fix

The task difficulty needed to be in a "Goldilocks zone": hard enough that pheromone could help, but not so hard that pheromone never develops structure, and not so easy that the advantage window vanishes. We calibrated to: vocabulary 32, n_train = 2000, motifs per sample ∈ {2, 3}, yielding source validation accuracy of ∼75%. Simultaneously, we fixed the pheromone dynamics:

• Contrastive signed updates: correct predictions → positive deposit; incorrect → negative deposit (Eq. 7).
• Sparse top-k updates: only the top-128 transitions (of 64² = 4096) receive updates per step.
• Proper evaporation: ρ = 0.8, decaying toward τ_min = 0.1 (Eq. 6).
• MMAS clamping: τ ∈ [0.1, 2.0].

After the fix, pheromone exhibited genuine structure: τ = 0.34 ± 0.63, with high variance indicating a sparse activation pattern rather than uniform saturation. This confirmed that the pheromone dynamics were now functional.
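The identity bug from Section 5.1 and its fix can be checked directly. A short sketch (function names ours), with ρ = 0.8 and τ_min = 0.1 as in the text:

```python
import numpy as np

def evaporate_buggy(tau, rho=0.8):
    # Section 5.1 bug: (1 - rho)*tau + rho*tau simplifies to tau, so nothing ever fades.
    return (1 - rho) * tau + rho * tau

def evaporate_fixed(tau, rho=0.8, tau_min=0.1):
    # Eq. (6): exponential decay toward the floor tau_min.
    return rho * tau + (1 - rho) * tau_min

tau_b = np.full((4, 4), 2.0)   # saturated field under the buggy rule
tau_f = np.full((4, 4), 2.0)   # same field under the corrected rule
for _ in range(10):
    tau_b = evaporate_buggy(tau_b)
    tau_f = evaporate_fixed(tau_f)
# tau_b is still saturated at 2.0; tau_f has decayed to 0.1 + 1.9 * 0.8**10 ≈ 0.304
```

Without working evaporation, every deposited transition accumulates monotonically toward the ceiling, which is exactly the uniform saturation observed in Round 1.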
Pheromone entropy dropped from 6.76 (near-uniform, epoch 0) to 5.93 (structured, epoch 30), where it stabilized for the remaining training (Table 1).

Table 2: Transfer results (AULC), mean ± std over 3 seeds. "Distilled" and "Rank-reduced" are two warm pheromone conditions for DPPN. Transformer and Random Sparse have only cold (no pheromone to transfer). Δ = warm − cold.

                                   Same Family (A)             Different Family
    Model          Condition       A2            A3            B1            C1
    DPPN           Cold            .699 ± .014   .702 ± .012   .715 ± .017   .682 ± .016
                   Distilled       .700 ± .017   .701 ± .012   .711 ± .012   .678 ± .016
                   Rank-reduced    .700 ± .015   .700 ± .011   .714 ± .015   .683 ± .016
                   Δ distilled     +.001         −.001         −.004         −.004
    Transformer    Cold            .687 ± .012   .679 ± .013   .690 ± .017   .665 ± .020
    Random Sparse  Cold            .678 ± .015   .673 ± .012   .686 ± .013   .650 ± .014

6 Experiment 2: Surface-Structure Entanglement in Single-Source Training

With structured pheromone in hand, we tested the core hypothesis: does warm pheromone from source task A1 accelerate learning on same-family target A2?

6.1 Results

Table 2 presents the full transfer results across all conditions and seeds. The critical finding is in the DPPN rows: warm pheromone provides no reliable transfer advantage, and on some tasks it actively reduces performance. The transfer advantage for DPPN on A2 (same family as the source) is +0.001: not statistically significant and far below the +0.02 to +0.05 effect size we hypothesized. On A3 (also same family), the advantage is −0.001. On the different-family tasks B1 and C1, warm pheromone slightly reduces performance (−0.004).

6.2 Diagnosis: Surface-Structure Entanglement

The pheromone field learned on A1 captures not just which structural patterns connect, but also which specific tokens activated which specific slots. The soft grouper assignments Q depend on the token embeddings, which are surface-specific.
When the model is transferred to A2 (same structure, different surface tokens), the pheromone biases routing toward pathways tuned to A1’s surface features, actively interfering with adaptation.

This diagnosis is consistent with established results in invariant risk minimization (IRM) [4]: structural invariants cannot be identified from a single training environment. A single source task provides no contrast to separate structure from surface. Gentner’s progressive alignment theory from developmental psychology [5] makes the same prediction: abstraction requires comparison across multiple instances that share structure but differ in surface features.

7 Experiment 3: The Coordinate Mismatch Between Independent Training Runs

7.1 Motivation: Gentner’s Progressive Alignment

If single-source pheromone entangles surface and structure, perhaps training on two sources with identical structure but different surface tokens can resolve the entanglement. We train on A1 and A1′ (same structural family A, different surface token mappings) and distill their pheromone fields by taking the element-wise minimum:

    τ_distilled(a, b) = min( τ_A1(a, b), τ_A1′(a, b) )    (11)

The intuition is that a transition reinforced under both surface mappings must be structural, not surface-specific.

7.2 Results: Unaligned Distillation

Only 10 of 1024 (m² = 32² for the test configuration) slot transitions survived the distillation with appreciable magnitude. The distillation destroyed nearly all pheromone structure.

7.3 Root Cause: Coordinate Mismatch

Independently trained models develop incompatible slot orderings. The soft groupers (Eq. 2) converge to different arbitrary projections of token space onto slot space. Slot 7 in the model trained on A1 has no correspondence with slot 7 in the model trained on A1′.
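The mismatch, and the alignment fix described in Section 7.4, can both be reproduced on synthetic fields. In the sketch below (illustrative magnitudes; scipy solves the assignment problem) a second run is simulated as a random slot relabeling of the first:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
m = 32

# A "structured" field: a few strong transitions over a weak baseline.
tau_a = np.full((m, m), 0.1)
strong = rng.choice(m * m, size=40, replace=False)
tau_a.ravel()[strong] = 2.0

# An independent run learns the same structure under an arbitrary
# relabeling of slots: slot i of run B plays the role of slot perm[i].
perm = rng.permutation(m)
tau_b = tau_a[np.ix_(perm, perm)]

# Eq. 11 without alignment: the element-wise minimum destroys structure.
n_unaligned = int((np.minimum(tau_a, tau_b) > 1.0).sum())

# Hungarian fix (Section 7.4), idealized with one-hot slot assignments;
# real soft-grouper assignments are far more diffuse, which is why the
# fix recovers almost nothing in practice (3.5% vs. 3.1% chance).
q_a = np.eye(m)                       # sample n -> slot n in run A
q_b = np.eye(m)[np.argsort(perm)]     # the same samples under B's labels
C = q_a.T @ q_b                       # cross-correlation matrix
_, col = linear_sum_assignment(-C)    # col[i] = B slot matched to A slot i
tau_b_aligned = tau_b[np.ix_(col, col)]
n_aligned = int((np.minimum(tau_a, tau_b_aligned) > 1.0).sum())
```

With perfectly discrete assignments the alignment is exact: `n_aligned` recovers all strong transitions, while `n_unaligned` keeps only chance-level overlap. The paper's diffuse soft assignments are precisely what breaks this idealization.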
The element-wise minimum of two randomly permuted structured matrices approximates the global minimum value: it discards structure rather than extracting it.

7.4 Fix: Hungarian Slot Alignment

We applied the Hungarian algorithm [15] to find the optimal permutation aligning slot assignments between the two training runs before distilling. The procedure is:

1. Run both models on the same data and collect slot assignments Q^(A) and Q^(B).
2. Compute the cross-correlation matrix C ∈ R^{m×m}, where C_ij = Σ_n Q^(A)_ni · Q^(B)_nj.
3. Solve the linear assignment problem: π* = argmax_π Σ_j C_{π(j), j}.
4. Permute τ_A1′ according to π* before distilling.

7.5 Results: Aligned Distillation

The alignment correlation was 3.5%, compared to 3.1% expected by random chance with 32 slots: barely above chance. After aligned distillation, 21 of 1024 transitions had high magnitude (up from 10 unaligned), and the A2 transfer advantage flipped from −0.006 to +0.002, the first positive transfer observed, though not statistically significant.

7.6 Why Alignment Fails

The Hungarian algorithm assumes a bijection between discrete entities. Soft groupers spread each token across all slots with continuous weights, and the resulting assignment matrices are too diffuse for combinatorial alignment to recover meaningful correspondence. The slots do not encode discrete, alignable roles; they are continuously distributed representations that resist post-hoc discretization.

8 Experiment 4: Why Learned Embeddings Undermine Any Coordinate System

8.1 Architecture: DecompositionPheromoneModel

If learned slots provide unstable coordinates, perhaps we can replace them with a fixed spatial decomposition. We designed a new architecture (Section C) that:

1. Decomposes the input into R fixed spatial regions (replacing learned soft groupers with deterministic segmentation).
2.
Embeds each region via a small encoder network into a shared embedding space.
3. Matches the regional embedding profile against K cluster centroids (maintained via online k-means with EMA momentum 0.99).
4. Routes attention priorities via pheromone defined over (cluster, strategy) pairs rather than (slot, slot) transitions.

The key design choice: the coordinate system for pheromone is now anchored to spatial position (regions) and semantic content (cluster centroids in embedding space), not to arbitrary learned slot indices.

8.2 Coordinate Stability Test

We trained two instances of the decomposition model on different surface features (A1 vs. A1′, same structural family) and measured the Pearson correlation between their pheromone fields without any alignment step. Result: the pheromone correlation was −10.4%. This is worse than DPPN’s 3.5%.

8.3 Root Cause: Embedding Instability

The spatial decomposition is stable: region boundaries are fixed. But the embeddings through which content is represented are not. The regional embeddings are averages of learned token embeddings, which are initialized randomly and trained end-to-end. Two models trained with different seeds learn different embedding geometries. The cluster centroids, computed via k-means over these embeddings, inherit the instability.

This reveals that the coordinate problem is not about the abstraction level (slots vs. regions vs. clusters). It is about whether the coordinate system is learned or extrinsic. Any coordinate system built on top of learned-from-scratch embeddings will be unstable across training runs, regardless of the abstraction mechanism.

9 Experiment 5: Random Projections Provide Stable Coordinates

The cross-domain analysis (Section 10) points to a solution: fixed random projections provide coordinates that are stable, structure-blind, and yet informative.
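Section 9.1 makes this concrete with a fixed random Fourier grouper (Eq. 12). A minimal position-only sketch, where the dimensions, σ, temperature, and seed are illustrative:

```python
import numpy as np

def make_fourier_grouper(d_in, m, sigma=1.0, seed=1234, temperature=1.0):
    """Fixed random Fourier grouper (Eq. 12): drawn once, never trained."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma / np.sqrt(d_in), size=(d_in, m))  # N(0, sigma^2/D_in)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)                   # Uniform(0, 2*pi)

    def assign(x_raw):                                          # x_raw: (n, d_in)
        logits = np.cos(x_raw @ W + b) / (np.sqrt(m) * temperature)
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)                 # softmax over slots
    return assign

# Position-only variant: the input is onehot(i), so slot assignments
# depend only on sequence position, never on token identity.
n_max, m = 128, 64
Q = make_fourier_grouper(d_in=n_max, m=m)(np.eye(n_max))        # (n_max, m)

# Stability across "runs": rebuilding from the same seed reproduces the
# identical coordinate system, with no Hungarian alignment needed.
Q_again = make_fourier_grouper(d_in=n_max, m=m)(np.eye(n_max))
```

The design choice to illustrate: because W and b come from a seeded generator and are never updated, two independently constructed models share coordinates by construction, which is exactly the property learned groupers lack.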
The Johnson-Lindenstrauss lemma [32] guarantees that random projections preserve pairwise distances; Bochner’s theorem (via random Fourier features [25]) extends this to kernel similarity preservation. Crucially, a fixed random projection is not a frozen pretrained encoder. A pretrained encoder has seen training data and may already encode task structure, contaminating the experiment (see Section 13.6). A random matrix drawn from a fixed seed has seen nothing. Transfer credit belongs purely to the pheromone.

9.1 Fixed Fourier Grouper

We replace the learned soft grouper (Eq. 2) with a fixed random Fourier grouper:

    Q_v(i, a) = softmax_a( cos(x_raw^(i) · W_fixed + b_fixed) / (√m · T) )    (12)

where W_fixed ∈ R^{D_in × m} and b_fixed ∈ R^m are drawn once from N(0, σ²/D_in) and Uniform(0, 2π) respectively, using a deterministic seed. They are never updated during training.

The critical design choice is the input x_raw. We test two variants:

• Token+Position: x_raw^(i) = [onehot(x_i); onehot(i)], concatenating token-ID and position one-hot vectors. D_in = V + N_max.
• Position-Only: x_raw^(i) = onehot(i), using only the position. D_in = N_max. This makes slot assignments purely structural: the same position always maps to the same slot, regardless of token content.

The Position-Only variant is the strongest test of the thesis: the coordinate system contains zero surface information, so any transfer advantage must come from pheromone encoding structural (positional-interaction) patterns.

9.2 Results

We run both variants alongside the original DPPN with learned groupers (with Hungarian alignment), using the same multi-source distillation protocol and 3 seeds per condition.

Table 3: Transfer advantage (∆AULC: warm distilled − cold, mean ± std). Positive values indicate that pheromone transfer helps.
The Position-Only Fourier variant with 10 seeds shows uniformly negative advantages, indistinguishable from the other routing-bias variants. The 3-seed result that appeared directionally correct did not replicate.

                                    Same Family                        Different Family
Model (distilled)           Seeds   A2               A3                B1               C1
DPPN (learned + Hungarian)  3       +0.002 ± 0.007   −0.001 ± 0.001    −0.004 ± 0.005   −0.003 ± 0.002
Fourier (token+position)    3       −0.001 ± 0.004   −0.008 ± 0.004    −0.005 ± 0.003   −0.006 ± 0.004
Fourier (position-only)     3       +0.003 ± 0.004   −0.006 ± 0.006    −0.007 ± 0.003   +0.001 ± 0.005
Fourier (position-only)     10      −0.001 ± 0.005   −0.002 ± 0.006    −0.002 ± 0.006   −0.002 ± 0.006

9.3 Analysis

Four findings emerge:

1. The 3-seed directional pattern does not replicate. With 3 seeds, the Position-Only Fourier variant appeared to show the predicted pattern: positive transfer on A2 (+0.003), negative on B1 (−0.007). With 10 seeds, the A2 advantage is −0.001 ± 0.005 and all four tasks show uniformly negative advantages (mean −0.002). The 3-seed result was noise. Fixed random Fourier coordinates solve the coordinate stability problem (the same position always maps to the same slot), but routing-bias pheromone still does not transfer. The coordinate system was necessary but not sufficient.

2. Token identity in coordinates impedes transfer. The Token+Position Fourier variant performs worse than learned groupers, with uniformly negative advantages across all tasks. Including token identity in the fixed coordinate system reintroduces surface dependence: Task A1 and Task A2 use different tokens at the same positions, so the same structural pattern maps to different coordinates. That the token+position variant is worse than position-only confirms that surface information in the coordinate system actively interferes with structural transfer.

3.
Coordinate stability is necessary but not sufficient. The 10-seed position-only result demonstrates that even with perfectly stable, surface-free coordinates, routing-bias pheromone produces no positive transfer. Two obstacles must be overcome: (a) the coordinate system must be stable (this section), and (b) the transfer mechanism must degrade gracefully when the pheromone is wrong (Section 12). Routing bias fails the second requirement because it forces attention patterns from the start: if the pheromone from the source task biases the wrong routes, it actively reduces performance.

4. Routing bias is the wrong transfer mechanism. Across all 10 seeds and all 4 transfer tasks (40 task-seed pairs), the mean routing-bias transfer advantage is −0.002. Not a single task shows significant positive transfer. The problem is not the coordinate system but the mechanism: biasing the forward pass with information from a previous task amounts to transferring the solution, which is task-specific.

Distillation quality. The multi-source distillation step provides an independent diagnostic. Position-Only Fourier preserves 22/1024 high-magnitude transitions (2.1%), versus 16/1024 (1.6%) for Token+Position Fourier and 10/1024 (1.0%) for unaligned learned slots. More surviving transitions indicate better coordinate alignment between independently trained models, consistent with the coordinate stability thesis, even though coordinate stability alone does not enable transfer.

10 The Convergent Diagnosis

10.1 Summary of Obstacles

Table 4 summarizes the five experiments and their diagnoses. Each experiment fixed one problem but revealed the next, and all point to the same root cause.

Table 4: Progressive diagnosis across five experiments. Each resolved one obstacle but exposed the next.
The final experiment (position-only Fourier, 10 seeds) reveals that coordinate stability is necessary but not sufficient: routing-bias pheromone does not transfer even with stable coordinates.

Experiment                    What it fixed                        Next obstacle identified                  Insight
1: Contrastive updates        Pheromone saturation (τ → uniform)   Surface entanglement                      Structured pheromone ≠ transferable pheromone
2: Single-source transfer     Task difficulty calibration          Surface-structure coupling                Cannot disentangle with one instance
3: Multi-source distillation  Surface contrast (two instances)     Coordinate mismatch (3.5% vs. 3.1%)       Learned slots are unaligned across runs
4: Decomposition coordinates  Slot arbitrariness (fixed regions)   Embedding arbitrariness (−10.4%)          Any learned coordinate system is unstable
5: Position-only Fourier      Extrinsic, surface-free coordinates  Routing-bias mechanism itself: even       Coordinate stability is necessary but not
                                                                   with stable coordinates, biasing          sufficient; the transfer mechanism (routing
                                                                   forward-pass routing from a previous      bias vs. learning rate) matters independently
                                                                   task imposes a cost

10.2 The Coordinate System Problem

We state the core result informally as a necessary condition:

Coordinate Stability Requirement. Persistent structural memory that transfers across tasks requires a coordinate system satisfying three properties:

(a) Extrinsic definition: the coordinates must be defined prior to task-specific training.
(b) Cross-task sharing: the coordinates must be shared across tasks by construction, not by post-hoc alignment.
(c) Structural metric: nearby coordinates must correspond to related structural roles, so that pheromone deposited at one coordinate generalizes to structurally similar inputs.

No end-to-end learned coordinate system satisfies all three properties simultaneously, in our experimental setting.
The argument for why learned coordinates fail:

• Property (a) fails because learned embeddings are initialized randomly and change throughout training. The coordinate system at epoch 0 bears no relation to the coordinate system at epoch 80, and coordinates from two different training runs are in incompatible spaces.
• Property (b) fails because independently trained models converge to different local minima of the loss landscape, producing different embedding geometries. Post-hoc alignment (Hungarian matching) recovers only 3.5% correlation for 32-slot systems, versus 3.1% expected by chance.
• Property (c) is partially satisfied by learned embeddings within a single training run, but violated across runs because the metric structure of the embedding space is not preserved.

The pheromone field is defined over coordinates that are themselves learned, unstable, and arbitrary. Transfer requires the coordinates to be canonical: the same structural role must map to the same index across tasks and training runs. This is a representational problem, not a learning problem.

Our 10-seed experiment with position-only Fourier coordinates (Section 9) demonstrates that stable coordinates alone do not enable transfer when the transfer mechanism is routing bias. The coordinate system problem is the first of two obstacles; the second, the choice of transfer mechanism, is addressed in Section 12.

10.3 Connection to Hippocampal Memory Systems

The Tolman-Eichenbaum Machine (TEM) [7] factorizes sensory representation from structural (graph) representation by construction, using separate neural populations. Place cells provide stable coordinates for the cognitive map; grid cells provide a metric. DPPN’s soft groupers conflate both into a single continuous projection: the equivalent of trying to build a cognitive map without place cells.
The coordinate system problem in DPPN is a computational analog of what would happen if hippocampal place fields were randomly reassigned after each learning episode.

11 Positive Findings: Within-Task Performance

Beyond the cross-task transfer question, the experiments reveal that DPPN’s pheromone-biased routing is genuinely useful for within-task learning.

11.1 DPPN Outperforms Baselines

Across all transfer targets and seeds, DPPN cold consistently outperforms both the transformer and random sparse baselines (Table 2):

• DPPN cold: AULC 0.699 ± 0.014 (A2), 0.702 ± 0.012 (A3), 0.715 ± 0.017 (B1), 0.682 ± 0.016 (C1)
• Transformer: AULC 0.687 ± 0.012 (A2), 0.679 ± 0.013 (A3), 0.690 ± 0.017 (B1), 0.665 ± 0.020 (C1)
• Random Sparse: AULC 0.678 ± 0.015 (A2), 0.673 ± 0.012 (A3), 0.686 ± 0.013 (B1), 0.650 ± 0.014 (C1)

The DPPN advantage over the transformer baseline is approximately +0.012 to +0.025 AULC, and over random sparse it is +0.016 to +0.032 AULC. This confirms that pheromone-biased routing provides a genuine inductive bias for learning; the problem is specifically cross-task transfer, not within-task utility.

11.2 Source Task Performance

DPPN achieved 0.78 ± 0.02 validation accuracy on the source task (averaged across 3 seeds), with test accuracies of 0.749 (A1) and 0.709 (A1′) for seed 42. The model learns effectively; the limitation is in what its pheromone remembers versus what would need to transfer.

12 Experiment 6: Pheromone as Learning-Rate Prior

The inability of routing-bias pheromone to transfer even with stable coordinates (Section 9) raises a precise question: is the problem the content of the pheromone (what it records) or the mechanism through which it acts (how it influences the model)? Routing bias forces attention patterns from the start of training on the target task.
If the pheromone from the source task encodes wrong routes, which it will, because different tasks require different attention patterns even when they share structural families, it actively reduces performance. The degradation is not graceful: wrong routing bias is worse than no routing bias.

12.1 Architecture: Learning-Rate Modulation

We replace the routing-bias mechanism (Eq. 5) with learning-rate modulation. The key change: pheromone has no effect during the forward pass. The model uses a standard transformer forward pass with no pheromone-biased routing. After loss.backward(), gradient magnitudes per slot pair are scaled by pheromone-derived learning-rate multipliers:

    η_ab = η_base · (1 + λ · (τ_ab − τ̄))    (13)

where η_base is the base learning rate, λ controls the modulation strength, τ_ab is the pheromone value for the slot pair (a, b), and τ̄ is the mean pheromone. High pheromone accelerates a connection’s learning; low pheromone deprioritizes it relative to the base rate.

For instance, with λ = 1/τ̄ the multiplier reduces to τ_ab/τ̄. If the pheromone field has τ(3, 7) = 1.8 (above the mean of τ̄ = 1.05), the gradient for attention weights connecting slot-3 positions to slot-7 positions is scaled by 1.8/1.05 ≈ 1.71×, accelerating learning of this particular structural connection. Conversely, a slot pair with τ(12, 20) = 0.3 (below the mean) receives a 0.29× scaling, effectively deprioritizing that connection early in transfer training.

The critical prediction: bad routing bias hurts (it forces wrong attention patterns), but bad learning-rate bias simply does not help (it accelerates learning of the wrong connections, but the model can still learn the right connections at the base rate). The outcome is graceful degradation, not negative transfer.

12.2 Connection to Meta-Learning and Synaptic Metaplasticity

This mechanism has two natural antecedents.
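Before turning to those antecedents, Eq. 13 can be sketched directly. The worked example (τ̄ = 1.05, λ = 1/τ̄) is reproduced below; the function name and the gradient-hook framing are illustrative, not the exact training loop:

```python
import numpy as np

def modulated_lr(tau, eta_base, lam):
    """Per-slot-pair learning-rate multipliers from pheromone (Eq. 13)."""
    return eta_base * (1.0 + lam * (tau - tau.mean()))

# Worked example from the text: mean pheromone 1.05, lambda = 1/1.05,
# so the multiplier reduces to tau_ab / tau_bar.
tau = np.full((64, 64), 1.05)
tau[3, 7] = 1.8        # reinforced transition: learn ~1.71x faster
tau[12, 20] = 0.3      # weak transition: ~0.29x, deprioritized early on
eta = modulated_lr(tau, eta_base=1.0, lam=1.0 / 1.05)

# After loss.backward(), each slot pair's attention gradient would be
# scaled by eta[a, b]; the forward pass itself is untouched, so a wrong
# prior slows learning of some connections but never blocks any route.
```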
In meta-learning, MAML [39] learns initial weights such that a few gradient steps on a new task produce good performance. Our approach is cheaper: rather than learning initial weights (which requires gradient-through-gradient computation), we learn learning rates via a simple EMA statistic (the pheromone). The pheromone field acts as a per-connection learning-rate prior, analogous to MAML’s learned initialization but without the computational overhead.

In neuroscience, the BCM rule [38] describes synaptic metaplasticity: the history of a synapse’s activation modulates its plasticity (how easily it changes), not its weight (its current strength). A synapse that has been frequently active develops a higher modification threshold, making it harder to potentiate further. Pheromone as a learning-rate prior implements a computational analogue: history modulates how fast connections learn, not which connections are active.

12.3 Results

We run the meta-learning-rate variant with position-only Fourier coordinates and 7 seeds, comparing against the routing-bias variant with 10 seeds.

Table 5: Transfer advantage (∆AULC: warm − cold) for meta-learning-rate pheromone vs. routing-bias pheromone, both with position-only Fourier coordinates. Neither achieves statistically significant positive transfer, but they differ qualitatively: routing bias is uniformly negative, learning-rate modulation is uniformly non-negative.

                        Same Family                        Different Family
Mechanism      Seeds    A2               A3                B1               C1
Meta-LR        7        +0.003 ± 0.006   +0.000 ± 0.005    +0.002 ± 0.004   +0.002 ± 0.004
Routing bias   10       −0.001 ± 0.005   −0.002 ± 0.006    −0.002 ± 0.006   −0.002 ± 0.006

12.4 Analysis

Neither approach achieves statistically significant positive transfer. However, they differ qualitatively in an important way:

• Routing-bias pheromone produces uniformly negative advantages (mean −0.002 across 10 seeds and 4 tasks).
In all 40 task-seed pairs, the transfer advantage is consistently negative.
• Learning-rate pheromone produces uniformly non-negative advantages (mean +0.002 across 7 seeds and 4 tasks). Across 28 task-seed pairs, none showed the negative transfer that characterized the routing-bias experiments.

The difference in sign is the key result. Routing bias transfers the solution (which attention patterns to use), and the solution is task-specific, so transfer imposes a cost. Learning-rate modulation transfers the curriculum (which connections to learn first), and wrong learning priorities degrade gracefully because the model can still learn any connection at the base rate.

This identifies the second independent requirement for persistent structural memory, beyond coordinate stability: the transfer mechanism must degrade gracefully. Learning-rate modulation satisfies this requirement; routing bias does not.

12.5 Connection to Structure Completion Functions

We further tested whether replacing pheromone statistics with a learned structure completion function can improve transfer. A completion network is trained to reconstruct full routing patterns from partially masked inputs, using only correct-prediction patterns. During transfer, the completion prior is alpha-blended with the current agreement signal (α = 0.3, decaying to 0 over 20 epochs), ensuring graceful degradation.

The critical design choice is the input to the completion network. We test two variants: (1) completion over the learned agreement signal A (which depends on trained weights and changes on reset), and (2) completion over the extrinsic co-occurrence matrix P = Q_fourier⊤ · onehot(x) · onehot(x)⊤ · Q_fourier (which has zero dependence on learned weights).

The random completion control reveals the decomposition: on same-family tasks (A2, A3), trained completion outperforms random by +0.006 to +0.013; on different-family tasks (B1, C1), they are indistinguishable.

Table 6: Structure completion transfer advantage (∆AULC, 10 seeds each). Completion over the learned A is the worst mechanism (negative everywhere). Completion over the extrinsic P shows the largest positive signal, but a random completion control (5 seeds) decomposes it into regularization (∼+0.010) and a differential same-family bonus (∼+0.006).

                                Same Family            Different Family
Mechanism                       A2         A3          B1          C1
Completion (learned A)          −0.004     −0.007**    −0.004      −0.003
Completion (extrinsic P)        +0.007     +0.016***   +0.015***   +0.016***
Random completion control       +0.005     +0.012      +0.016      +0.009
Trained − Random (same fam.)    +0.013     +0.004      −0.002      +0.000

The ∼+0.010 regularization component (from the alpha-blended prior) benefits all tasks equally; the ∼+0.006 structural component benefits only same-family tasks. This demonstrates that a trained function over stable coordinates can extract higher-order structural information (conditional co-occurrence, variance patterns) even when the first-order statistics of P are identical across all families (cosine similarity 1.0000). The catch-22 between coordinate stability and structural informativeness is partially permeable to functions, though not to statistics.

13 Implications and Solution Path

13.1 Frozen Pretrained Encoders as Canonical Coordinates

The coordinate stability requirement (Section 10) is naturally satisfied by frozen pretrained encoders. A frozen BERT [16] or GPT [17] embedding layer provides:

(a) Extrinsic definition: the embedding geometry is fixed before task-specific training begins.
(b) Cross-task sharing: the same encoder produces the same embeddings for the same inputs, regardless of which downstream task is being learned.
(c) Structural metric: pretrained embeddings exhibit semantic structure in which proximity reflects meaning: “if P then Q” and “P implies Q” map to nearby points regardless of what P and Q are.

Pheromone accumulated over frozen embedding coordinates would be transferable because the coordinates are stable: the same structural pattern activates the same region of embedding space across tasks and training runs. This eliminates the entanglement problem by construction.

13.2 Adaptive Granularity via Dirichlet Processes

Our experiments used fixed numbers of slots (m = 64) and clusters (K = 32, 64). The Chinese Restaurant Process and Dirichlet process mixture models [18] suggest that the granularity of structural decomposition should be adaptive: determined by the data rather than fixed a priori. A concentration parameter α that controls the expected number of clusters, growing logarithmically with the data, would allow the coordinate system to refine itself without committing to a fixed resolution.

13.3 Multi-Resolution Pheromone

Structural patterns exist at multiple scales: token-level co-occurrence, phrase-level motifs, and document-level compositional structure. A wavelet-inspired multi-resolution pheromone field, with separate τ matrices at different abstraction levels, could capture structure at the appropriate scale without collapsing everything into a single resolution.

13.4 Spectral Representations for Permutation Invariance

The coordinate mismatch problem (Section 7) arises because pheromone over raw slot indices is maximally permutation-sensitive. Spectral graph methods [19] offer permutation-invariant structural descriptors: the eigenvalues of the Laplacian of the slot-slot interaction graph characterize structure independent of node labeling.
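The invariance claim is easy to check numerically; a small sketch with a synthetic symmetric interaction graph (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32

# Symmetric slot-slot interaction graph (e.g., a symmetrized pheromone field).
A = rng.random((m, m))
A = 0.5 * (A + A.T)
np.fill_diagonal(A, 0.0)

L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

# Relabel the slots with an arbitrary permutation, as two independent
# training runs effectively do.
p = rng.permutation(m)
L_perm = L[np.ix_(p, p)]

# The spectra coincide: eigenvalues ignore node labels entirely.
ev = np.linalg.eigvalsh(L)
ev_perm = np.linalg.eigvalsh(L_perm)
```

The same relabeling that reduced raw-index pheromone overlap to chance leaves the Laplacian spectrum untouched, which is what makes spectral descriptors candidates for permutation-invariant coordinates.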
Pheromone over spectral coordinates would be invariant to slot permutations by construction, though computing spectral decompositions at each forward pass introduces computational overhead.

13.5 Connection to State Space Models

The Structured State Space model S4 [21] and its selective variant Mamba [22] use the HiPPO (High-order Polynomial Projection Operators) matrix as the state-transition kernel. The HiPPO matrix is a fixed, mathematically derived basis for temporal memory: it defines a coordinate system in which the model’s recurrent state optimally approximates the history of the input signal under a Legendre polynomial basis. Crucially, the HiPPO matrix is not learned from data; it is derived from an approximation-theoretic criterion and remains constant across tasks, datasets, and training runs.

This is conceptually identical to our proposed solution: a fixed coordinate system that is defined before training begins and does not change during learning. The HiPPO basis is optimized for temporal approximation (how to represent what happened recently using Legendre polynomials); random Fourier features [25] are optimized for similarity preservation (how to maintain geometric relationships under projection, via Bochner’s theorem). Both solve the coordinate stability problem, but for different types of memory:

• HiPPO / S4: fixed coordinates for temporal memory. The question answered: “What happened recently, and how should it be weighted?” The Legendre polynomial basis provides an optimal trade-off between recency and fidelity.
• Random Fourier features + pheromone: fixed coordinates for structural memory. The question answered: “Which computational routes proved useful, and should they be reused?” Random projections provide a task-agnostic embedding space in which pheromone can accumulate.
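The similarity-preservation half of this pairing can be verified in a few lines: with z(x) = √(2/D)·cos(Wᵀx + b), the inner product z(x)·z(y) approximates the Gaussian kernel. This is a standard random-Fourier-features check; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 4096                  # input dimension, number of random features
sigma = 2.0                      # Gaussian kernel bandwidth

# Bochner: frequencies sampled from the kernel's spectral density N(0, I/sigma^2).
W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)           # a nearby point
approx = float(z(x) @ z(y))
exact = float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
# approx tracks exact to within O(1/sqrt(D)), without W or b ever seeing data.
```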
This parallel suggests a unified principle: any persistent memory system requires a fixed basis, and the choice of basis (HiPPO vs. Fourier vs. random) should be determined by the type of information being memorized. HiPPO is the right basis for temporal history because the target of approximation (a continuous function of time) is well characterized by orthogonal polynomials. Random Fourier features may be the right basis for structural routing because the target (a set of pairwise relationships between computational components) is well characterized by distance-preserving projections.

The success of S4 and Mamba provides indirect evidence for our thesis. These models achieve strong performance precisely because their state evolves over a fixed coordinate system. If the HiPPO matrix were learned from scratch and varied across training runs, the recurrent state would suffer the same instability we observe in pheromone over learned slot coordinates. The S4 literature does not frame the HiPPO matrix as a solution to a coordinate stability problem (it is presented as a solution to a long-range dependency problem), but our analysis reveals that these are two manifestations of the same principle.

13.6 The Contamination Problem

A natural first response to the coordinate stability requirement is to use a frozen pretrained encoder, as suggested in Section 13.1. However, this creates a subtle experimental confound that must be carefully addressed. If the pretrained encoder was trained on data that contains the structural patterns of interest, then the encoder already “knows” the structure. Transfer credit belongs to the encoder’s pretraining, not to the pheromone mechanism.
Consider a concrete example: if we use a frozen BERT encoder [16] and train pheromone on tasks involving logical implication patterns, BERT’s pretraining on natural language has already exposed it to implication structures (“if P then Q” appears frequently in text). The pheromone field would accumulate over an embedding space that already separates structural patterns: the pheromone is not discovering structure, it is indexing structure that the encoder has already identified. This is analogous to building a retrieval-augmented system where the answers are already in the index: the system works, but the credit belongs to the index builder (pretraining), not to the retrieval mechanism (pheromone).

The coordinate system must be structure-blind: it must not encode task-specific structural patterns, while still preserving the geometric relationships that allow pheromone to discover and transfer structure. This requirement rules out pretrained encoders for rigorous evaluation and points specifically to random projections, which are provably structure-blind (drawn before seeing any data, independent of any task distribution) yet geometrically informative (the Johnson-Lindenstrauss lemma [32] guarantees that pairwise distances are preserved up to (1 ± ε) multiplicative distortion with high probability when projecting to O(ε⁻² log n) dimensions). The distinction is subtle but critical:

• Frozen pretrained encoder: stable coordinates that encode structure. Transfer works, but the credit is ambiguous.
• Random projections: stable coordinates that are structure-blind. If transfer works, the credit belongs unambiguously to the pheromone mechanism.
• Learned from scratch: unstable coordinates. Transfer fails regardless of pheromone quality.

For practical deployment, frozen pretrained encoders are the pragmatic choice: they provide stable coordinates with rich geometric structure.
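“Structure-blind yet geometrically informative” can be demonstrated directly: a projection drawn from a fixed seed, before seeing any data, still preserves pairwise distances. A minimal Johnson-Lindenstrauss sketch (sizes illustrative):

```python
import numpy as np

# Data the projection never sees at construction time.
X = np.random.default_rng(7).normal(size=(100, 512))

# Projection drawn from a fixed seed, independent of any data or task.
k = 256
P = np.random.default_rng(0).normal(size=(512, k)) / np.sqrt(k)
Y = X @ P

def pairwise(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

iu = np.triu_indices(len(X), k=1)
ratios = pairwise(Y)[iu] / pairwise(X)[iu]
# Johnson-Lindenstrauss: every projected/original distance ratio
# concentrates near 1, even though P encodes nothing about X.
```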
For scientific evaluation of whether pheromone-based structural memory works as a mechanism, random projections are the rigorous choice, because they control for the possibility that the coordinate system itself is doing the structural reasoning.

14 Related Work

14.1 Memory-Augmented Neural Networks

The Neural Turing Machine (NTM) [8] introduced differentiable external memory with content-based addressing, where the controller generates a key vector and reads from memory locations whose content is similar to the key. The Differentiable Neural Computer (DNC) [9] extended this with temporal linking (recording the order of writes to enable sequential traversal) and dynamic memory allocation (preventing overwriting of recently written locations). Memory Networks [20] provided a simpler formulation with multi-hop attention over an external memory bank, demonstrating that external memory enables multi-step reasoning that is difficult for feedforward architectures.

These architectures share a critical property: memory stores content, not structure. The NTM's memory matrix records specific patterns that were written during processing; it does not record which read-write patterns proved effective across many inputs. The DNC's temporal linking is the closest to structural memory—it records the order of writes, which is a form of structural information—but this structure is specific to one processing episode and does not persist across tasks. When the model encounters a new task, the memory bank is re-initialized. DPPN's pheromone field occupies a different niche: it records which slot-to-slot transitions proved useful (structural memory), not what information flowed along those transitions (content memory). This distinction is analogous to the difference between a road map (which routes exist and which are well-traveled) and a delivery manifest (what cargo was carried on each route).
Content memory records the cargo; structural memory records the road network. Our contribution is showing that this structural memory requires stable coordinates—the "road names" must be consistent across maps for the accumulated traffic statistics to transfer.

A further distinction concerns addressing. The NTM uses content-based addressing (similarity to a key) and location-based addressing (shifting from the current position). Both provide stable coordinates in a sense: content-based addressing maps to the same memory location for the same key, and location-based addressing uses integer indices. However, these are coordinates for content retrieval, not for structural routing. The coordinate system problem we identify is specific to structural memory: the coordinates must correspond to computational roles (slots), not to memory positions.

14.2 Persistent State in Sequence Models

Transformer-XL [10] caches hidden states from previous segments, enabling the model to attend beyond its fixed context window. The Compressive Transformer [11] extends this by compressing old hidden states into a secondary memory rather than discarding them, further increasing the effective memory horizon. Both maintain state across segments within a single task; neither provides a mechanism for state to persist across tasks. RWKV [23] reformulates transformer attention as a linear recurrence with exponential decay, maintaining a persistent state vector that accumulates information across time steps. The state is task-specific and reset between tasks, so it does not address cross-task transfer.

The Structured State Space model S4 [21] is the most relevant work in this category. S4 parameterizes its state transition using the HiPPO matrix, a fixed basis derived from approximation theory (specifically, from the requirement that the state optimally approximates the input history under the Legendre measure).
The HiPPO matrix is not learned—it is a fixed, data-independent coordinate system for temporal memory. Mamba [22] extends S4 with input-dependent selection, allowing the model to dynamically filter its state, while retaining the fixed state transition structure. As we discuss in Section 13, the HiPPO matrix is a fixed coordinate system for temporal memory in exactly the sense that our coordinate stability requirement demands for structural memory. The key difference is scope: HiPPO coordinates are for within-sequence temporal dynamics; our coordinate system problem concerns cross-task structural transfer. S4's success with fixed temporal coordinates, contrasted with the obstacle we identify for learned structural coordinates, provides convergent evidence for the principle that persistent memory requires a fixed basis.

14.3 Transfer Learning and Domain Adaptation

Invariant Risk Minimization (IRM) [4] seeks representations where the optimal classifier is the same across all training environments, thereby identifying invariant features. However, IRM requires access to multiple training environments and assumes that the representation space is adequate for expressing invariances. Domain-adversarial training [24] learns representations that fool a domain discriminator, forcing the encoder to discard domain-specific information. Both approaches modify the content of the representation to achieve invariance while leaving the representation space (coordinate system) to be learned.

Our work identifies a more fundamental issue: even if the content of two representations is structurally identical, the representations may be expressed in incompatible coordinate systems. This is not a problem for standard transfer learning, where the model parameters (including the representation mapping) transfer together.
It becomes a problem specifically for persistent structural memory, where the memory (pheromone) must transfer without the parameters (which are reset). The coordinate system must be stable independently of the parameters, which no standard transfer learning method guarantees.

14.4 Random Features and Fixed Representations

Random Kitchen Sinks [25] showed that random Fourier features approximate shift-invariant kernels, enabling kernel methods to scale to large datasets without computing the full kernel matrix. This established a surprising principle: fixed random projections, drawn from an appropriate distribution, are statistically informative without any learning. Extreme Learning Machines [26] applied this principle to neural networks, demonstrating that a single hidden layer with fixed random weights and a trained output layer achieves competitive classification performance. Echo State Networks [27] and the broader reservoir computing framework extended this to recurrent networks, showing that fixed random recurrent dynamics with a trained linear readout can model complex temporal patterns.

The theoretical foundation for these results is the Johnson-Lindenstrauss lemma [32]: a random projection from R^d to R^k with k = O(ϵ^{-2} log n) preserves all pairwise distances among n points up to a (1 ± ϵ) multiplicative factor with high probability. This guarantees that the geometric structure of the original space—which is precisely what pheromone needs to leverage—is maintained in the projected space.

The gap between this literature and our work is the application context. Random features were developed for kernel approximation and efficient classification, not for providing coordinate systems for persistent memory.
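The Rahimi-Recht construction is short enough to reproduce. This sketch (an illustration, not the paper's implementation) approximates the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / 2) with random Fourier features, following Bochner's theorem: sample frequencies from the kernel's spectral measure (a Gaussian), and the inner product of cosine features approximates the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 8, 4096                    # input dim, number of random features

# Bochner: the RBF kernel's spectral measure is Gaussian, so draw
# frequencies W ~ N(0, I) and phases b ~ Uniform[0, 2*pi).
W = rng.standard_normal((d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(x):
    """Random Fourier feature map: phi(x) . phi(y) ~ exp(-||x-y||^2 / 2)."""
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
approx = phi(x) @ phi(y)
print(f"exact {exact:.4f}  approx {approx:.4f}")
```

The approximation error shrinks as O(1/sqrt(D)); the map itself is fixed after the single random draw, which is the property the paper leans on.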
The insight that random projections are the natural solution to the coordinate stability problem—because they are extrinsic, shared by construction, and geometrically informative—has not been made in the random features literature, which focuses on approximation quality rather than coordinate stability.

14.5 Biological Coordinate Systems

Grid cells in the mammalian entorhinal cortex [28] provide an innate hexagonal lattice that serves as a spatial coordinate system for navigation and memory. Crucially, the grid cell pattern is present before any spatial experience—it is a coordinate system that exists prior to the content (spatial memories) that will be defined over it. The Tolman-Eichenbaum Machine (TEM) [7] formalizes this as a computational model that factorizes structural representation (the graph of relationships between locations) from sensory representation (what is present at each location), using separate neural populations. Place cells provide stable location identifiers; grid cells provide the metric structure. The TEM demonstrates that factorizing coordinates from content is sufficient for structural generalization across environments.

The fly olfactory circuit [29] implements a form of similarity-preserving hashing via sparse random expansion: 50 olfactory receptor types project to approximately 2,000 Kenyon cells via random, sparse connections. This random expansion preserves similarity relationships (similar odors activate similar Kenyon cell patterns) while dramatically increasing dimensionality, enabling rapid learning of odor associations. The random projection is genetically determined, not learned from odor experience—it is a fixed, structure-blind coordinate system for olfactory memory.

Epigenetic memory provides a further biological parallel: persistent chemical marks (methylation, histone modification) accumulate over a fixed genomic coordinate system (the DNA sequence).
The coordinate system (genome) is stable across cell divisions; the marks (epigenetic state) persist and transfer structural information about gene expression patterns. The marks are meaningful precisely because the coordinates are fixed—a methyl mark at genomic position X means the same thing in every cell.

The common principle across these biological systems is that the coordinate system is defined prior to and independently of the content that will be associated with it. Grid cells exist before spatial memories; the fly's random projection exists before odor learning; genomic coordinates exist before epigenetic marks. DPPN's learned soft groupers violate this principle: the coordinates are learned simultaneously with the content, and change when the content changes.

14.6 Fourier Features in Neural Networks

Tancik et al. [30] demonstrated that passing input coordinates through a random Fourier feature mapping enables neural networks to learn high-frequency functions, overcoming the spectral bias of standard MLPs toward low-frequency components. FourierFormer [31] replaces the softmax kernel in transformer attention with a Fourier-feature-based approximation, achieving competitive performance with improved theoretical properties.

The connection to our proposed solution is direct: random Fourier features provide a fixed, structure-blind mapping from input space to a feature space where similarity is preserved (by Bochner's theorem, a shift-invariant kernel can be expressed as the inner product of random Fourier features). If the soft groupers in DPPN were replaced by a fixed random Fourier feature mapping, the resulting slot assignments would be stable across training runs and tasks, while still preserving the geometric relationships that pheromone needs to discover and transfer structural patterns.
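To make the proposed replacement concrete, here is a minimal sketch of how fixed Fourier features could assign tokens to slots. The shapes (model dimension d = 64, m = 32 slots) follow Table 7, but the assignment rule itself is our illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)     # one fixed draw, shared by every run and task
d, m = 64, 32                      # model dimension and slot count (Table 7)

# Frozen random Fourier parameters: drawn at initialization, never trained.
W = rng.standard_normal((d, m))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def slot_assignment(h, temperature=1.0):
    """Soft slot assignment from frozen Fourier features (softmax over slots)."""
    feats = np.sqrt(2.0 / m) * np.cos(h @ W + b)
    z = feats / temperature
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Because W and b are fixed before training, the same token state maps to the
# same slot distribution in every training run, giving pheromone stable axes.
h = rng.standard_normal((128, d))  # a sequence of token states
A = slot_assignment(h)
print(A.shape)
```

The point of the sketch is the invariance property: the slot coordinate system exists prior to, and independently of, anything the model learns.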
This specific application of Fourier features—as coordinates for persistent structural memory rather than as function approximators or attention kernels—has not been explored.

14.7 Sparse Attention

Longformer [12], BigBird [13], and related architectures use fixed or learned sparse attention patterns primarily for computational efficiency—reducing the O(N^2) cost of dense attention to O(N) or O(N log N). DPPN's sparsity serves a fundamentally different purpose: the sparse mask is determined by pheromone-biased agreement between dual views, encoding accumulated knowledge about useful computational pathways. Sparsity in DPPN is a consequence of structural routing, not a design choice for efficiency.

14.8 Continual Learning

Continual learning methods such as Elastic Weight Consolidation (EWC) [35], PackNet [36], and Progressive Neural Networks [37] protect previously learned parameters during training on new tasks. These address catastrophic forgetting—the loss of old task performance—rather than structural transfer—the acceleration of new task learning via persistent routing memory. In our protocol, all parameters are explicitly reset between tasks; the only transfer channel is the pheromone field. This design choice isolates the question of whether structural routing memory, independent of parameter sharing, can provide transfer. EWC regularizes weight changes to preserve old task performance; PackNet freezes subsets of weights for each task; Progressive Nets add new columns while retaining old ones. None of these mechanisms provide a persistent structural memory that is decoupled from model parameters and could transfer routing knowledge to a fresh model.

14.9 Ant Colony Optimization

The pheromone mechanism in DPPN draws directly from ACO [2, 3]. In ACO applied to dynamic optimization problems, pheromone fields must adapt when the environment (problem instance) changes [14].
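For readers unfamiliar with ACO, the evaporate-and-deposit cycle that DPPN inherits can be sketched as follows. The constants (ρ = 0.8, δ = 0.3, clamping to [τ_min, τ_max] = [0.1, 2.0]) come from Table 7; the exact update form is an assumption here (ρ is treated as a retention factor and updates are clamped, following the MAX-MIN Ant System convention [3]), and the credit signal is a placeholder:

```python
import numpy as np

RHO, DELTA = 0.8, 0.3              # evaporation and deposit rates (Table 7)
TAU_MIN, TAU_MAX = 0.1, 2.0        # pheromone bounds (Table 7)

def pheromone_step(tau, usefulness):
    """One evaporate-and-deposit update over slot-to-slot transitions.

    tau:        (m, m) pheromone field over slot transitions
    usefulness: (m, m) nonnegative credit for transitions that helped on this
                batch (how credit is computed is model-specific; placeholder here)
    """
    # RHO used as the retention factor after evaporation (an interpretation;
    # Table 7 lists it as the evaporation rate), then MAX-MIN style clamping.
    tau = RHO * tau + DELTA * usefulness
    return np.clip(tau, TAU_MIN, TAU_MAX)

m = 32
tau = np.ones((m, m))              # near-uniform initialization (cf. Figure 2a)
rng = np.random.default_rng(0)
for _ in range(200):
    credit = np.zeros((m, m))
    credit[rng.integers(m, size=4), rng.integers(m, size=4)] = 1.0
    tau = pheromone_step(tau, credit)
# Unused transitions decay to TAU_MIN; repeatedly credited ones stay elevated,
# matching the sparse, structured field described in Figure 2b.
print(tau.min(), tau.max())
```

The clamping is what prevents runaway saturation at the top and complete extinction at the bottom, which is why the trained field in Figure 2 sits between τ_min and τ_max.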
The transfer problem we study is analogous: the "environment" (task) changes, and the question is whether pheromone from one environment helps in another. Our results align with a known constraint from the ACO literature: pheromone transfer requires the new environment to share the same graph as the old environment. In our setting, the "graph" is defined by the soft grouper assignments, and two tasks with independently learned groupers define different graphs—even if the tasks share the same structural family. The coordinate system problem is, in ACO terms, the problem of ensuring that the graph over which pheromone is defined is the same graph across tasks.

14.10 Structure Mapping and Analogical Reasoning

Gentner's structure-mapping theory [5] and progressive alignment hypothesis [6] predict that structural abstraction requires comparison across instances with shared relational structure but different surface features. Rosch's basic level theory of categorization [33] similarly identifies that structural invariants emerge through comparison across instances. Anderson's rational model of categorization [34], which uses a Chinese Restaurant Process prior for category formation, suggests that the number of structural categories should be adaptive rather than fixed, connecting to our discussion of Dirichlet process models for adaptive granularity. Our Experiment 3 (Section 7) implements progressive alignment via multi-source distillation, and the result reveals that comparison requires commensurable representations—a requirement that learned slot systems do not satisfy. The coordinate system problem is, in Gentner's framework, the problem of ensuring that the two analogs are represented in the same vocabulary so that their structural correspondence can be detected.
14.11 Invariant Risk Minimization

IRM [4] establishes that invariant features cannot be identified from a single training environment—multiple environments with different spurious correlations are needed. Our Experiment 2 (Section 6) is a concrete instantiation: pheromone from a single source task entangles structural invariants with surface-specific features. Our Experiment 3 extends this to multiple environments (A1 and A1′), but the IRM framework assumes that the representation space is shared across environments, which is precisely the condition that fails when coordinates are learned independently.

15 Conclusion

We set out to build persistent structural memory for neural sequence models and discovered that cross-task transfer requires two conditions that are not met when the coordinate system over which memory is defined is learned jointly with the model. Five rounds of experiments—spanning contrastive pheromone updates, task difficulty calibration, multi-source distillation with Hungarian alignment, and semantic decomposition—each revealed a new obstacle, all tracing to one root cause: persistent memory requires stable coordinates, and learned coordinates are inherently unstable. The three obstacles we characterize—pheromone saturation (resolved by contrastive updates), surface-structure entanglement (unresolvable with a single source), and coordinate incompatibility (unresolvable by post-hoc alignment of learned representations)—form a hierarchy. Each can only be diagnosed after the previous one is resolved, and the final obstacle (coordinate instability) is fundamental rather than incidental.

A 10-seed replication of the position-only Fourier experiment reveals that coordinate stability, while necessary, is not sufficient. Even with perfectly stable, surface-free coordinates, routing-bias pheromone produces uniformly negative transfer (mean −0.002).
The transfer mechanism matters independently of the coordinate system:

• Routing bias = transfer the solution (which attention patterns to use). Fails because the solution is task-specific, and wrong solutions actively interfere.
• Learning-rate modulation = transfer the curriculum (which connections to prioritize). Does not fail because wrong priorities degrade gracefully—the model can still learn any connection at the base rate.

The paper thus identifies two independent requirements for persistent structural memory:

(a) Stable coordinates: The coordinate system must be fixed before statistics are accumulated. Learned coordinates are inherently unstable (the main paper's diagnostic cascade).
(b) Graceful transfer mechanism: The mechanism through which persistent memory influences learning must degrade gracefully when the memory is wrong. Learning-rate modulation satisfies this; routing bias does not (Section 12).

The positive finding is that the DPPN architecture is effective for within-task learning: pheromone-biased routing consistently outperforms transformer and random sparse baselines by +0.010 to +0.030 AULC. The architecture learns useful structural patterns; the remaining challenge is enabling those patterns to transfer across tasks.

The coordinate system problem is not unique to DPPN or to pheromone-based memory. Any architecture that attempts to accumulate persistent statistics over learned latent representations—whether in the form of running means, prototype memories, or learned routing tables—faces the same challenge. When the representation changes, the accumulated statistics become meaningless. This suggests a general principle for the design of persistent memory in neural networks: the coordinate system must be fixed before the statistics are accumulated, and the statistics must influence learning gracefully.
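The asymmetry between the two transfer mechanisms can be made concrete. In this sketch (an illustration of the failure mode, not the paper's training code), routing bias shifts attention logits by the log of warm pheromone, so a wrong prior suppresses the correct pathway, while learning-rate modulation only scales per-connection step sizes, never below the base rate, so a wrong prior merely fails to accelerate:

```python
import numpy as np

def routing_bias(logits, tau):
    """Warm pheromone folded into attention logits (log tau added, i.e. the
    unnormalized weights are multiplied by tau): a wrong tau can drive the
    correct transition's probability toward zero."""
    z = logits + np.log(tau)
    e = np.exp(z - z.max())
    return e / e.sum()

def lr_modulation(base_lr, tau):
    """Warm pheromone scales learning rates, floored at base_lr: a wrong tau
    leaves every connection learnable at the base rate."""
    return base_lr * np.maximum(1.0, tau)

logits = np.array([2.0, 0.0, 0.0])      # transition 0 is correct for the new task
wrong_tau = np.array([0.1, 2.0, 2.0])   # warm pheromone prefers transitions 1, 2

p = routing_bias(logits, wrong_tau)
lrs = lr_modulation(3e-4, wrong_tau)
print("routing-bias prob of correct transition:", round(p[0], 3))
print("per-connection learning rates:", lrs)
```

With a flat prior the correct transition dominates the softmax; under the wrong warm pheromone its probability collapses, while the learning-rate path keeps it trainable at 3 × 10⁻⁴ (the base rate from Table 7). This is the graceful-degradation property in miniature.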
The catch-22 between coordinate stability and structural informativeness is real but not absolute: even when the extrinsic co-occurrence signal is undetectable at the mean level (cosine similarity 1.0000 across all task families), a completion function trained on correct-prediction patterns over stable coordinates captures higher-order structural information sufficient for a small but differential same-family advantage (+0.006 AULC beyond the regularization baseline, Section 12). The catch-22 constrains statistics but is partially permeable to learned functions.

Our experiments use synthetic tasks by design—to isolate the coordinate system variable from confounds present in real-world settings. Whether the two requirements identified here—coordinate stability and graceful transfer mechanisms—constrain practical systems that use pretrained encoders remains an open question. More broadly, the diagnostic cascade methodology demonstrated here—iteratively resolving one obstacle to expose the next—offers a template for principled investigation of persistent memory in neural architectures, and the two requirements we identify provide concrete design criteria for future systems that aim to accumulate and transfer structural knowledge.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[2] M. Dorigo, V. Maniezzo, and A. Colorni. Ant system: Optimization by a colony of cooperating agents. IEEE Trans. Systems, Man, and Cybernetics—Part B, 26(1):29–41, 1996.
[3] T. Stützle and H. H. Hoos. MAX-MIN Ant System. Future Generation Computer Systems, 16(8):889–914, 2000.
[4] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[5] D. Gentner.
Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170, 1983.
[6] D. Gentner. Bootstrapping the mind: Analogical processes and symbol systems. Cognitive Science, 34(5):752–775, 2010.
[7] J. C. R. Whittington, T. H. Muller, S. Mark, G. Chen, C. Barry, N. Burgess, and T. E. J. Behrens. The Tolman-Eichenbaum Machine: Unifying space and relational memory through generalization in the hippocampal formation. Cell, 183(5):1249–1263, 2020.
[8] A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.
[9] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
[10] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proc. ACL, 2019.
[11] J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap. Compressive Transformers for long-range sequence modelling. In Proc. ICLR, 2020.
[12] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[13] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. BigBird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[14] M. Guntsch and M. Middendorf. Pheromone modification strategies for ant algorithms applied to dynamic TSP. In Proc. Applications of Evolutionary Computing (EvoWorkshops), pages 213–222, 2001.
[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.
[16] J. Devlin, M.-W. Chang, K. Lee, and K.
Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT, 2019.
[17] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[19] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[20] J. Weston, S. Chopra, and A. Bordes. Memory Networks. In Proc. ICLR, 2015.
[21] A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. In Proc. ICLR, 2022.
[22] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2024.
[23] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP, 2023.
[24] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
[25] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NeurIPS), 2007.
[26] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70(1–3):489–501, 2006.
[27] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks. GMD Technical Report 148, German National Research Center for Information Technology, 2001.
[28] T. Hafting, M. Fyhn, S. Molden, M.-B. Moser, and E. I. Moser. Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052):801–806, 2005.
[29] S.
Dasgupta, C. F. Stevens, and S. Navlakha. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796, 2017.
[30] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[31] T. Nguyen, M. Pham, T. Nguyen, K. Nguyen, S. Osher, and N. Ho. FourierFormer: Transformer meets generalized Fourier integral theorem. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[32] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
[33] E. Rosch. Principles of categorization. In E. Rosch and B. B. Lloyd, editors, Cognition and Categorization, pages 27–48. Lawrence Erlbaum, 1978.
[34] J. R. Anderson. The adaptive nature of human categorization. Psychological Review, 98(3):409–429, 1991.
[35] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[36] A. Mallya and S. Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proc. CVPR, 2018.
[37] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Veness, K. Kavukcuoglu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[38] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex.
Journal of Neuroscience, 2(1):32–48, 1982.
[39] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc. ICML, 2017.

A Architecture Hyperparameters

Table 7: DPPN hyperparameters used in all experiments. Note: early experimental rounds (Sections 5–6) used d = 128, 4 heads, 4 layers, m = 64; after task recalibration, the values below were adopted for all reported results.

Model dimension (d): 64
Number of heads: 4
Number of layers: 3
Number of slots (m): 32
Top-k (sparse mask): 32
Slow-lane window (w): 16
Grouper temperature (T): 1.0
Gumbel noise scale (γ): 0.5
Pheromone α: 1.0
Pheromone β: 1.0
Evaporation rate (ρ): 0.8
τ_min: 0.1
τ_max: 2.0
Deposit rate (δ): 0.3
Sparse update top-k: 128
Dropout: 0.1
Learning rate: 3 × 10⁻⁴
Weight decay: 0.01
Optimizer: AdamW
Source epochs: 80
Transfer epochs: 50

B Task Design Details

C Decomposition Architecture Details

The DecompositionPheromoneModel replaces DPPN's learned soft groupers with fixed spatial segmentation. Key components:

• Regional decomposition: The sequence of length N is divided into R = 8 contiguous regions of size N/R. Each region's representation is the average of its token embeddings.
• Sub-claim encoder: A two-layer MLP (d → d/2 → d) projects regional embeddings into the pheromone coordinate space.
• Pattern matching: The average of encoded regional embeddings is compared to K cluster centroids via cosine similarity with temperature-scaled softmax (T = 0.1).
• Pheromone field: τ ∈ R^{K×S}, where K is the number of pattern clusters and S is the number of evaluation strategies. Entry τ_cs records how effective strategy s has been for inputs matching cluster c.
• Priority-weighted attention: Each token's contribution to attention is modulated by its region's priority score, which is derived from the pheromone-advised strategy selection.
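The first three components can be sketched end to end. Shapes follow the stated test configuration (d = 64, R = 8, K = 32); the MLP weights and centroids are random placeholders, since the trained parameters are not given here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, K = 64, 8, 32                # model dim, regions, pattern clusters
N = 128                            # sequence length (divisible by R)

# Placeholder parameters, randomly initialized purely for illustration.
W1, W2 = rng.standard_normal((d, d // 2)), rng.standard_normal((d // 2, d))
centroids = rng.standard_normal((K, d))

def match_pattern(tokens, temperature=0.1):
    """Regional decomposition -> sub-claim encoder -> pattern matching."""
    regions = tokens.reshape(R, N // R, d).mean(axis=1)   # (R, d) region averages
    encoded = np.maximum(regions @ W1, 0.0) @ W2          # two-layer MLP d -> d/2 -> d
    pooled = encoded.mean(axis=0)                         # average over regions
    sims = (centroids @ pooled) / (                       # cosine similarity to centroids
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(pooled))
    z = sims / temperature                                # temperature-scaled softmax
    e = np.exp(z - z.max())
    return e / e.sum()                                    # (K,) cluster probabilities

probs = match_pattern(rng.standard_normal((N, d)))
print(probs.shape, probs.argmax())
```

The resulting cluster distribution is what indexes the τ ∈ R^{K×S} field: the matched cluster c selects the row of strategy effectiveness scores.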
Test configuration: d = 64, R = 8 regions, K = 32 clusters, S = 8 strategies.

D Full Per-Seed Results

Table 8: Synthetic task parameters.

Vocabulary size: 32
Sequence length: 128
Training samples: 2000
Motifs per sample: 2–3
Noise level: 0.02
Number of classes: 2
Structural families: 3 (A, B, C)
Tasks per family: 2–3 (different surface mappings)

Table 9: DPPN transfer results (AULC) by seed and condition.

Seed | Condition | A2   | A3   | B1   | C1
42   | Cold      | .683 | .689 | .693 | .676
42   | Distilled | .677 | .690 | .695 | .675
42   | Rank-red. | .678 | .691 | .695 | .678
137  | Cold      | .696 | .697 | .719 | .665
137  | Distilled | .706 | .696 | .715 | .661
137  | Rank-red. | .707 | .693 | .715 | .666
256  | Cold      | .718 | .718 | .733 | .704
256  | Distilled | .718 | .717 | .724 | .699
256  | Rank-red. | .714 | .715 | .732 | .705

Table 10: Baseline results (AULC) by seed (cold condition only).

Model         | Seed | A2   | A3   | B1   | C1
Transformer   | 42   | .671 | .661 | .669 | .653
Transformer   | 137  | .688 | .686 | .689 | .648
Transformer   | 256  | .700 | .691 | .711 | .692
Random Sparse | 42   | .658 | .657 | .671 | .642
Random Sparse | 137  | .680 | .675 | .703 | .638
Random Sparse | 256  | .696 | .687 | .684 | .669

Figure 1: (a) DPPN architecture schematic. Tokens are embedded and passed through dual soft groupers to produce slot assignments. Slot-level support and agreement are combined with the persistent pheromone field (highlighted) to produce pheromone-biased routing, which generates a sparse attention mask. A fast/slow gate fusion produces the final output. (b) Transfer protocol.
In Phase 1, the model is trained on a source task (80 epochs) and pheromone accumulates structural memory. In Phase 2 (50 epochs), all model weights are reset but pheromone is either kept (warm) or reset (cold), and the model is trained on a target task.

Figure 2: Pheromone field τ ∈ R^{32×32} before and after source training (DPPN, seed 42). (a) At initialization, pheromone is near-uniform (τ̄ ≈ 1.05). (b) After 80 epochs, the field is sparse and structured: most transitions have decayed to τ_min = 0.1 (blue), with a small number of high-pheromone transitions at τ_max = 2.0 (red).

Figure 3: Transfer advantage (ΔAULC: warm distilled − cold) across model variants (DPPN learned, Fourier tok+pos, Fourier pos-only) and transfer targets. Same-family tasks (A2, A3) are shaded green; different-family tasks (B1, C1) are shaded red. With 3 seeds, the Position-Only Fourier variant appeared to show positive transfer on A2 (+0.003), but with 10 seeds the advantage is −0.001 ± 0.005 (not significant). All routing-bias variants show uniformly negative or zero transfer advantages. Error bars: std over seeds.

[Figure 4 summarizes the cascade as a table of what was fixed, the key result, and what remained broken: Exp 1, contrastive updates, τ from 0.93 uniform to 0.34 ± 0.63, leaving surface-structure entanglement; Exp 2, task difficulty calibration, ΔAULC = +0.001 (not significant), leaving a single source unable to disentangle; Exp 3, multi-source distillation, alignment 3.5% vs. 3.1% chance, leaving coordinate mismatch across runs; Exp 4, semantic decomposition, correlation −10.4% (worse than DPPN), leaving embedding instability for any learned coordinates; Exp 5, position-only Fourier coordinates, A2 +0.003 (same family) and B1 −0.007 (different family), with effect size not significant at 3 seeds.]

Figure 4: The diagnostic cascade. Five experiments, each resolving one obstacle (left, green) while revealing the next (right, red). The center column shows the key metric from each experiment. All obstacles trace to the same root cause: persistent memory requires stable coordinates, and learned coordinates are inherently unstable.

Figure 5: Coordinate stability diagnostics. (a) Slot alignment correlation between independently trained DPPN models: 3.5%, barely above the 3.1% expected by random chance with 32 slots. (b) Distillation survival: the number of high-magnitude transitions (out of 1024) surviving element-wise minimum distillation: unaligned learned 10 (1.0%), aligned learned 21 (2.1%), Fourier tok+pos 16 (1.6%), Fourier pos-only 22 (2.1%). Position-Only Fourier preserves the most transitions (22), consistent with better coordinate stability.
