Vision Hopfield Memory Networks


Authors: Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Mykyta Smyrnov, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

Jianfeng Wang†, Amine M'Charrak¹, Luk Koska², Xiangtao Wang², Daniel Petriceanu², Mykyta Smyrnov², Ruizhi Wang², Michael Bumbar², Luca Pinchetti¹, Thomas Lukasiewicz²,¹

† Work done while at the University of Oxford. ¹ Department of Computer Science, University of Oxford, United Kingdom. ² Institute of Logic and Computation, Vienna University of Technology, Austria. Correspondence to: Jianfeng Wang <jianfengwang1991@gmail.com>, Thomas Lukasiewicz <thomas.lukasiewicz@tuwien.ac.at>.

Preprint. March 27, 2026.

Abstract

Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.

1. Introduction

Foundation models in computer vision have experienced significant changes in recent years. Starting with AlexNet (Krizhevsky et al., 2012) and its revolutionary performance, convolutional neural networks (CNNs) attracted the attention of researchers, leading to the design of advanced architectures such as VGG (Simonyan & Zisserman, 2015) and ResNet (He et al., 2016). Subsequently, the evolution of network architectures in natural language processing, particularly the Transformer (Vaswani et al., 2017), gave rise to its vision counterpart, the Vision Transformer (ViT) (Dosovitskiy et al., 2021), which achieved promising results on computer vision benchmarks. More importantly, the Transformer family established a strong connection between natural language processing and computer vision, unifying the modeling paradigm for both vision and language. Building on this trend, a variety of alternative architectures have been proposed, such as MLP-Mixer (Tolstikhin et al., 2021), MetaFormer (Yu et al., 2024), and, more recently, state-space models like Vision Mamba (Vim) (Zhu et al., 2024), further enriching the landscape of foundation models for vision. However, these models do not fundamentally address some of the long-standing challenges in deep learning.
Specifically, they are not data-efficient and usually require large-scale datasets for training. Moreover, they lack biological plausibility, as their learning mechanisms differ substantially from how the human brain operates. In terms of data efficiency, current models rely heavily on extensive supervised training and large annotated datasets, which limits their applicability in domains where data collection is time-consuming or even infeasible. In contrast, humans are able to learn robust concepts from very limited examples, pinpointing the gap between artificial and natural learning. As for biological plausibility, deep learning architectures and optimization methods are largely engineered for computational convenience rather than grounded in neuroscience. For example, the conventional feedforward architecture overlooks key properties of the human brain, such as associative memory retrieval (Ramsauer et al., 2020) and predictive error correction (Rao & Ballard, 1999; Friston, 2005).

To deal with these challenges, we propose a new vision foundation model, named the Vision Hopfield Memory Network (V-HMN). V-HMN departs from conventional feedforward or self-attention-only designs by augmenting each block with content-addressable associative memory [1]. Concretely, V-HMN employs two complementary memory paths: (i) a local window memory that collects k × k neighborhoods and performs Hopfield-style retrieval to denoise and complete local patterns; and (ii) a global template path that forms a scene-level query via global pooling, retrieves a global prototype from memory, and injects it back into all tokens as a context prior. Both memory paths update features through an iterative refinement step with a learnable strength parameter.
This mechanism can be viewed as a lightweight form of predictive-coding dynamics, where representations are gradually corrected toward memory-predicted patterns. In this way, the network gains an error-corrective feedback process that is absent in conventional feedforward models.

2. Related Work

We now briefly review related work, including existing vision foundation models, associative memory and modern Hopfield networks, and brain-inspired predictive coding frameworks.

2.1. Vision foundation backbones

Early advances in vision backbones were driven by convolutional neural networks (CNNs), such as AlexNet, VGG, and ResNet. While these models achieved remarkable progress, recent research has shifted toward alternative token-mixing paradigms. The ViT showed that a pure Transformer on image patches can rival CNNs at scale (Dosovitskiy et al., 2021), and hierarchical designs like the Swin Transformer (Swin-ViT) (Liu et al., 2021) improved efficiency through shifted local windows. In addition, a line of hybrid architectures attempts to combine the complementary strengths of CNNs and Transformers. Representative examples include ConViT (d'Ascoli et al., 2021), which introduces soft convolutional inductive biases into attention layers, and CoaT (Xu et al., 2021), which integrates co-scale convolution with multi-head attention for better local-global trade-offs. Such hybrids highlight the ongoing interest in balancing locality and global context within a unified backbone. Beyond attention, MLP-based models (e.g., MLP-Mixer (Tolstikhin et al., 2021)) and MetaFormer [2] frameworks (Yu et al., 2022; 2024) demonstrated that different mixers can operate within a similar architectural scaffold.

[1] In Hopfield networks, content-addressable memory refers to the ability to retrieve a stored pattern by filling in missing or noisy parts of the input.
Most recently, state-space models (SSMs) have emerged as competitive backbones. S4 introduced structured SSMs for long sequences (Gu et al., 2022), and Mamba extended this idea with selective input-dependent dynamics (Gu & Dao, 2024). Vision-specific variants such as Vim (Zhu et al., 2024) and VMamba (Liu et al., 2024) show promising results with linear-time complexity. While these advances broaden the landscape of vision backbones, they typically lack explicit, interpretable memory mechanisms.

A recent attempt to address this limitation is the Associative Transformer (AiT) (Sun et al., 2025), which introduces a global workspace layer where memory slots are written via bottleneck attention and retrieved through Hopfield-style dynamics to refine token embeddings. While AiT, the work most closely related to ours, shows the potential of integrating associative memory into Transformers, it remains fundamentally Transformer-based: memory serves as an auxiliary component, and heavy self-attention is still required for token interactions. By contrast, V-HMN is a memory-centric backbone in which local and global Hopfield modules fully replace self-attention as the token-mixing mechanism. This design makes V-HMN lighter, more interpretable, and closer to biologically inspired refinement principles, positioning memory not as an add-on but as the core of the backbone itself.

2.2. Associative memory and modern Hopfield networks

Modern Hopfield Networks (MHNs) revisit content-addressable memory with continuous states and an energy function that, in theory, yields exponentially large storage and single-step retrieval (Ramsauer et al., 2020). This line has been integrated into practical deep architectures via a differentiable Hopfield layer and applied beyond vision (e.g., retrieval, pooling, representation learning) (Ramsauer et al., 2020; Fürst et al., 2022).
Recent works refine robustness and capacity, and study retrieval dynamics under modern settings (Wu et al., 2024; Hu et al., 2024). Compared to self-attention, Hopfield-style modules maintain a persistent memory bank with explicit slots (prototypes), enabling interpretable slot activations and prototype-token alignments. Our V-HMN leverages this by combining a local window memory (prototype completion/denoising) and a global template path (scene-level prior), both trained end-to-end.

Beyond Hopfield-style associative retrieval, a broader line of work has explored prototype memories and external memory banks as mechanisms for improving data efficiency. Early metric-based few-shot learning methods such as matching networks (Vinyals et al., 2016) and prototypical networks (Snell et al., 2017) explicitly store class-level prototypes in an embedding space and perform inference via metric retrieval, showing that maintaining persistent prototypes can substantially improve generalization under limited supervision. Subsequent extensions, including relation networks (Sung et al., 2018) and MetaOptNet (Lee et al., 2019), further refine prototype-based retrieval by learning similarity functions or optimizing embedding geometry for more reliable few-shot generalization. Prototype memory has also been explored in generative modeling. The approach of Li et al. (2022) maintains a learned bank of visual prototypes and retrieves them via attention to guide synthesis from only a handful of examples. Similarly, Li et al. (2022) introduce a prototype-conditioned generative mechanism in which retrieved prototypes act as structural priors that stabilize low-data generation.

[2] Throughout our experiments, we follow common practice and instantiate MetaFormer using PoolFormer, which is the default implementation adopted in prior work.
Both lines of work demonstrate that reusing persistent prototypes can effectively expand the statistical support of limited datasets and improve sample efficiency in generative scenarios.

2.3. Predictive-coding-inspired iterative refinement

Predictive coding (PC) is a long-standing theory in neuroscience that frames perception as iterative error minimization between predictions and sensory input (Rao & Ballard, 1999; Friston, 2005). In computational neuroscience, PC networks (PCNs) have been studied as a biologically plausible alternative to backpropagation (Whittington & Bogacz, 2019), while early deep learning variants such as PredNet applied the idea to video prediction with hierarchical recurrent modules (Lotter et al., 2017). More recent works have attempted to formalize PCNs in machine learning, drawing connections to variational inference, energy-based models, and equilibrium propagation (Millidge et al., 2022; van Zwol et al., 2024). Despite their theoretical appeal, PC-inspired models remain limited to small-scale or domain-specific settings, in part due to optimization difficulties and inefficiency in large-scale vision tasks. In this work, we do not implement full PC inference; instead, our blocks perform a lightweight, learnable refinement toward memory-predicted prototypes. This provides an interpretable, error-corrective step that is inspired by PC principles, connecting HMN-style memory refinement with a brain-inspired narrative, while keeping the backbone simple and scalable.

3. Methodology

In this section, we give an overview of the overall architecture of V-HMN, outlining how images are processed from patch embeddings to memory-based refinement blocks and finally to classification.

3.1. Overall Architecture

The V-HMN is designed as a memory-centric vision backbone. An input image is first projected into a sequence of image patch tokens.
These tokens are then processed by a stack of HMN blocks, each integrating local and global Hopfield memory modules [3]. Finally, the sequence is aggregated by attention pooling and fed into a linear classifier. In contrast to convolutional backbones (ResNet (He et al., 2016)), self-attention-based designs (ViT (Dosovitskiy et al., 2021)), or state-space models (Vim (Zhu et al., 2024)), V-HMN replaces the underlying token-mixing operation with explicit associative memory retrieval. This makes memory refinement, rather than convolution, self-attention, or recurrence, the core building block of the backbone.

3.2. Local and Global Hopfield Memory Modules

Each HMN block contains two complementary modules that together capture fine-grained local structure and holistic global context.

Local memory. For each token, its k × k spatial neighborhood is unfolded with padding, flattened, and linearly projected to form a latent query for that token. Hopfield retrieval is then performed against a class-balanced memory bank that stores prototype features. The retrieved prototype refines the local representation by stabilizing noisy features and completing partial patterns. The initial and refined representations are concatenated, projected back to the embedding dimension, and subsequently fused with the global branch before being added to the input representation via a skip connection. This mechanism parallels the role of convolutions or windowed attention, but operates through explicit prototype-based priors.

Global memory. To provide scene-level context, the global branch first mean-pools all tokens to form a query vector. This query interacts with a global memory bank through Hopfield retrieval, producing a prototype that captures global semantics. The result is broadcast to all tokens and integrated with the local path.

The overall workflow of a V-HMN block and its integration into the classification backbone are illustrated in Figure 1.
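To make the local path concrete, the following is a minimal NumPy sketch of the neighborhood unfolding and query projection; the grid size, k, and the random matrix standing in for the learned linear projection are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unfold_local(tokens, k):
    """Collect the k x k spatial neighborhood of every token (zero padding).

    tokens: (H, W, D) grid of patch embeddings.
    Returns: (H*W, k*k*D) flattened neighborhoods, one row per token.
    """
    H, W, D = tokens.shape
    p = k // 2
    padded = np.pad(tokens, ((p, p), (p, p), (0, 0)))  # zero-pad the borders
    out = np.empty((H * W, k * k * D))
    for i in range(H):
        for j in range(W):
            window = padded[i:i + k, j:j + k, :]        # k x k x D neighborhood
            out[i * W + j] = window.reshape(-1)
    return out

rng = np.random.default_rng(0)
H, W, D, k = 4, 4, 8, 3
tokens = rng.normal(size=(H, W, D))
neigh = unfold_local(tokens, k)                         # shape (16, 72)
# Hypothetical stand-in for the learned linear projection to dimension D:
W_proj = rng.normal(size=(k * k * D, D)) / np.sqrt(k * k * D)
queries = neigh @ W_proj                                # per-token latent queries, shape (16, 8)
```

In the full model, each query row would then be matched against the class-balanced local memory bank via Hopfield retrieval, and the retrieved prototype fused back as described above.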
[Figure 1. Overview of V-HMN. (a) Each HMN block refines features through local and global Hopfield memory retrieval, rather than convolution or self-attention. (b) A deep backbone is constructed by stacking HMN blocks, with attention pooling and a linear head for image classification.]

Figure 1(a) details a single block: the input token sequence first branches into (i) a local memory path that aggregates k × k spatial neighborhoods, performs associative retrieval with iterative refinement, and concatenates the refined and initial representations; and (ii) a global memory path that mean-pools all tokens to form a scene-level query, applies Hopfield-based retrieval and iterative refinement, and concatenates the refined and initial queries. The two paths are then fused by summation, and the fused representation is passed through a lightweight two-layer MLP, forming the output of the HMN block.

In particular, the associative memory retrieval can be formalized as follows. Given a representation z ∈ R^D and a memory bank M ∈ R^{K×D} with K prototype slots, we first ℓ2-normalize both z and the memory slots to obtain cosine similarities:

    \hat{z} = z / \|z\|_2, \quad \hat{M}_j = M_j / \|M_j\|_2, \quad j = 1, \ldots, K.   (1)

Retrieval weights and the retrieved prototype are then computed by

    \alpha = \mathrm{softmax}(\sqrt{D} \, \hat{z} \hat{M}^\top) \in R^K, \quad m = \sum_{j=1}^{K} \alpha_j M_j \in R^D,   (2)

where M_j is the j-th unnormalized memory slot, α are normalized weights (\sum_j \alpha_j = 1), and m is the retrieved prototype.

[3] Appendix A.1 provides a qualitative discussion clarifying which invariances arise from the hierarchical local-global memory design and which are attributable to standard components such as augmentation and pooling.
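The retrieval of Eqs. (1)-(2), followed by one refinement step of the form used in Section 3.3 (Eq. 4), can be sketched in NumPy as follows; the bank contents, the dimensions, and the fixed beta are illustrative stand-ins for the learned quantities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hopfield_retrieve(z, M):
    """Eqs. (1)-(2): scaled cosine-similarity retrieval from a K x D memory bank."""
    D = z.shape[0]
    z_hat = z / np.linalg.norm(z)
    M_hat = M / np.linalg.norm(M, axis=1, keepdims=True)
    alpha = softmax(np.sqrt(D) * (M_hat @ z_hat))   # retrieval weights, sum to 1
    m = alpha @ M                                   # convex combination of raw slots
    return m, alpha

rng = np.random.default_rng(0)
K, D = 16, 64
M = rng.normal(size=(K, D))          # stand-in for a learned/written memory bank
z = M[3] + 0.1 * rng.normal(size=D)  # query: a noisy copy of slot 3
m, alpha = hopfield_retrieve(z, M)   # alpha concentrates on slot 3

beta = 0.5                           # stand-in for the learnable update strength
z_refined = z + beta * (m - z)       # one refinement step in the style of Eq. (4)
```

Because the query is a noisy copy of slot 3, the softmax weight concentrates on that slot and the update pulls the representation toward the stored prototype, illustrating the denoising/completion behavior that the local and global modules rely on.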
The additional scaling factor √D is introduced because the cosine similarities \hat{z} \hat{M}_j^\top have variance on the order of 1/D: multiplying by √D restores the logits to approximately unit variance [4], preventing the softmax distribution from becoming overly flat and yielding sharper, more discriminative retrieval weights.

[4] A detailed proof is provided in Appendix A.2.

Figure 1(b) shows the full classification backbone formed by stacking multiple HMN blocks in depth, followed by attention pooling and a linear classifier. Specifically, attention pooling performs a weighted combination over all tokens to produce a single representation, which can be defined as

    \alpha = \mathrm{softmax}(H W_{att}) \in R^N, \quad v = \sum_{i=1}^{N} \alpha_i H_i \in R^D,   (3)

where H ∈ R^{N×D} denotes the token representations after the final block, W_{att} ∈ R^{D×1} is a learnable scoring vector that assigns importance weights to tokens, α are the normalized attention weights (\sum_i \alpha_i = 1), and v is the pooled representation fed into the classifier.

During training, both local and global modules maintain their own class-balanced memory banks. Unlike parametric weights, these banks are explicitly written with real sample embeddings at each block: the local bank stores projected patch-neighborhood features, while the global bank stores pooled scene-level features. Each bank is organized as a per-class ring buffer with fixed capacity, ensuring that all classes are allocated equal slots. As training proceeds, new embeddings replace the oldest ones within each class, yielding a continually refreshed and balanced set of prototypes. During inference, the banks are frozen and no longer updated, so retrieval always operates on stable prototypes that persist across tasks.

3.3. Iterative Refinement

The central operation in both local and global modules is an iterative refinement rule.
Given a current representation z and a retrieved prototype m, the update is

    z^{(t+1)} = z^{(t)} + \beta (m - z^{(t)}),   (4)

where β is a learnable update strength and t denotes the refinement step. This mechanism can be viewed as a predictive-coding-inspired update [5]: the prototype m provides a memory-based prediction, while the residual (m - z^{(t)}) acts as a prediction error that gradually corrects the current representation, in line with the associative memory's role of filling in missing or noisy information. In contrast to full predictive coding networks that maintain explicit error units and multi-layer recurrent inference, V-HMN adopts a lightweight refinement loop where only a few steps are sufficient in practice. This refinement design leverages persistent, content-addressable prototypes that are shared across samples, and it yields two key benefits: (i) improved data efficiency, as stored prototypes provide reusable priors; and (ii) enhanced interpretability, since prototype activations directly expose the memory patterns supporting each decision.

4. Experiments

We now report on our experiments on four public image classification benchmarks. To save space, implementation details are given in the appendix.

4.1. Datasets

We evaluated our model on four widely used image classification benchmarks: CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), SVHN (Netzer et al., 2011), and Fashion-MNIST (Xiao et al., 2017). CIFAR-10 and CIFAR-100 each consist of 60,000 color images with a resolution of 32 × 32. Each dataset is split into 50,000 training samples and 10,000 test samples, covering 10 and 100 object categories, respectively. SVHN is a large-scale street-view digit recognition dataset with more than 600,000 labeled images at 32 × 32 resolution. Compared to CIFAR, it is more challenging due to cluttered backgrounds, overlapping digits, and significant intra-class variation.
Fashion-MNIST provides 70,000 grayscale images of size 28 × 28 from 10 categories of clothing and accessories. It was introduced as a modern alternative to the classic MNIST digits, with richer texture and shape variations. For large-scale evaluation, we additionally conduct experiments on ImageNet-1k (Deng et al., 2009), which contains 1.28M training images and 50K validation images across 1,000 classes.

[5] A more detailed discussion of the connection between our refinement rule and predictive-coding (PC) dynamics is provided in Appendix A.3. Appendix A.4 further discusses the biological plausibility and cortical inspiration of the overall V-HMN design.

Table 1. Data efficiency of V-HMN on CIFAR-10, CIFAR-100, and Fashion-MNIST with different fractions of labeled training samples. Reported values are top-1 test accuracy (%).

Fraction of training data | CIFAR-10      | CIFAR-100     | Fashion-MNIST
10%                       | 80.22 ± 0.29  | 43.21 ± 1.07  | 89.18 ± 0.16
30%                       | 88.67 ± 0.21  | 62.42 ± 0.29  | 91.04 ± 0.22
50%                       | 91.19 ± 0.38  | 68.93 ± 0.35  | 91.53 ± 0.12

Table 2. Comparison of data efficiency across models with 10% and 30% labeled training data on CIFAR-10, CIFAR-100, and Fashion-MNIST. Top-1 test accuracy (%) is reported as mean ± standard deviation over 3 seeds. Baselines include ViT (Dosovitskiy et al., 2021), Swin-ViT (Liu et al., 2021), MLP-Mixer (Tolstikhin et al., 2021), MetaFormer (Yu et al., 2024), Vim (Zhu et al., 2024), and AiT (Sun et al., 2025).

Model         | CIFAR-10      | CIFAR-100     | Fashion-MNIST
10% training data
ViT           | 72.73 ± 0.42  | 40.48 ± 0.80  | 87.17 ± 0.40
Swin-ViT      | 70.37 ± 0.94  | 35.25 ± 0.59  | 88.42 ± 0.48
MLP-Mixer     | 76.14 ± 0.16  | 41.94 ± 0.98  | 87.16 ± 0.63
MetaFormer    | 49.92 ± 4.08  | 19.39 ± 1.26  | 85.02 ± 1.22
Vim           | 69.02 ± 2.13  | 36.61 ± 1.25  | 86.09 ± 1.03
AiT           | 67.89 ± 0.55  | 35.84 ± 0.67  | 84.33 ± 0.38
V-HMN (ours)  | 80.22 ± 0.29  | 43.21 ± 1.07  | 89.18 ± 0.16
30% training data
ViT           | 83.94 ± 0.33  | 57.40 ± 0.79  | 89.71 ± 0.45
Swin-ViT      | 80.89 ± 0.24  | 51.77 ± 0.60  | 90.34 ± 0.15
MLP-Mixer     | 85.53 ± 0.40  | 56.22 ± 0.75  | 89.41 ± 0.13
MetaFormer    | 60.86 ± 6.05  | 37.47 ± 3.19  | 88.78 ± 1.03
Vim           | 79.58 ± 0.37  | 49.67 ± 0.79  | 88.18 ± 2.43
AiT           | 80.69 ± 1.36  | 54.59 ± 0.87  | 87.62 ± 0.32
V-HMN (ours)  | 88.67 ± 0.21  | 62.42 ± 0.29  | 91.04 ± 0.22

4.2. Ablation Studies

Data Efficiency. We study the data efficiency of V-HMN under varying fractions of labeled training data. Table 1 shows that accuracy improves steadily as the proportion of labeled data increases: even with only 10% of the training set, V-HMN achieves competitive performance, while scaling to 30% and 50% further closes the gap to the full-data regime. This indicates that the associative memory mechanism provides strong inductive biases that reduce dependence on large-scale annotation.

We further benchmark V-HMN against standard vision backbones under 10% and 30% labeled data (Table 2). Across CIFAR-10, CIFAR-100, and Fashion-MNIST, V-HMN consistently outperforms widely used architectures such as ViT, Swin-ViT, MLP-Mixer, and MetaFormer, as well as more recent models like Vim. Importantly, V-HMN also surpasses AiT, which integrates memory slots into a Transformer backbone. This highlights the benefit of our design: rather than appending memory to an existing architecture, V-HMN makes associative memory the core computational primitive. The resulting prototype-based refinement provides stronger gains, especially in low-data settings, where stored prototypes act as reusable priors and compensate for scarce supervision [6].

Iterative Refinement. Table 3 summarizes the effect of varying the number of refinement iterations t.
Here, t = 0 disables the associative refinement loop entirely: the local and global branches still compute their feedforward projections, but no Hopfield retrieval or error-correction update (Eq. 4) is applied. The memory banks remain allocated during training to keep the parameter count identical, but they are not read during inference. Across all datasets, introducing even a single refinement step (t = 1) yields consistent improvements (e.g., CIFAR-10: 93.56% → 93.94%; CIFAR-100: 75.84% → 76.58%). This confirms that the associative update provides meaningful benefits, although the underlying feedforward pathway is already strong. Performance peaks at t = 2, while deeper unrolling offers no additional gain and can slightly degrade accuracy due to over-correction. This trend is consistent with predictive-coding models of cortical processing, where a small number of recurrent error-correction steps typically suffices to explain the input, and further unrolling mainly increases computational cost (Rao & Ballard, 1999; Friston, 2005).

In addition, we conduct robustness experiments to assess whether additional refinement iterations offer benefits beyond a single update. We evaluate CIFAR-10 models trained with different numbers of refinement steps under several corruptions: Gaussian noise (standard deviations 0.05, 0.10, 0.20, 0.30), random square occlusion (areas 0.05, 0.10, 0.20), and contrast scaling (factors 0.5, 0.75, 1.25, 1.5). As shown in Figure 2, we report both the mean top-1 accuracy across all corruptions and the per-corruption accuracy for each refinement depth. Averaged over all corruptions, accuracy increases from 71.65% at t = 0 to 73.79% at t = 1 and 74.35% at t = 2. The gains are most pronounced for occlusion and contrast: occlusion accuracy improves from 87.24% (t = 0) to 89.08% (t = 1) and 89.79% (t = 2), with similar improvements for contrast scaling.
Overall, these results demonstrate that the predictive-coding-inspired refinement yields measurable and consistent robustness gains. Unless otherwise specified, we fix the number of iterations to t = 1 in all other experiments for a balanced trade-off between accuracy and efficiency.

[6] We provide an analysis in Appendix A.5 explaining why V-HMN exhibits data efficiency.

4.3. Main Results

Table 4 reports the comparison of V-HMN with a wide range of mainstream vision foundation backbones, including Transformer-based models (ViT, Swin-ViT), MLP-based architectures (MLP-Mixer, MetaFormer), state-space models (Vim), and the recently proposed AiT. Despite having a comparable parameter scale, V-HMN consistently achieves the best performance across all benchmarks, reaching 93.94% on CIFAR-10, 76.58% on CIFAR-100, 97.16% on SVHN, and 92.27% on Fashion-MNIST.

We attribute these improvements to two key design choices. First, the incorporation of local and global associative memories enables the model to retrieve and integrate prototypical patterns, effectively supplementing limited supervision with reusable priors. Second, the iterative refinement mechanism provides a lightweight error-corrective process that gradually aligns representations with memory-predicted prototypes, thereby enhancing robustness. Compared with standard feedforward Transformers or purely sequential state-space models, V-HMN benefits from persistent, content-addressable prototypes that capture both local details and global context, yielding stronger generalization under comparable model capacity.

A key observation is that V-HMN surpasses AiT (Sun et al., 2025) across all datasets, despite having nearly identical parameter counts. While AiT integrates associative memory within a Transformer layer, its token mixing remains attention-centric.
In contrast, V-HMN is memory-centric: local and global Hopfield modules replace self-attention as the mixing mechanism, memories are persistent and class-balanced (written with real sample embeddings during training and frozen at inference), and representations are updated through a predictive-coding-inspired refinement loop. These results suggest that making associative memory the primary computational primitive, rather than an auxiliary component of a Transformer, provides better data efficiency and accuracy under comparable model capacity.

To further evaluate the scalability of V-HMN, we conduct experiments on ImageNet-1k and report results in Table 5. Without any architectural specialization for large-scale training, V-HMN achieves accuracy comparable to widely used CNN and Transformer backbones of similar parameter counts. This finding is encouraging given that V-HMN is a newly proposed architecture: unlike mature baselines that have benefited from years of iterative optimization and carefully engineered design heuristics, V-HMN has not undergone extensive hyperparameter tuning or additional engineering effort. Although V-HMN demonstrates strong data efficiency on smaller benchmarks, our ImageNet experiment serves a different purpose. Rather than competing with heavily optimized architectures such as RegNetY (Radosavovic et al., 2020), DeiT (Touvron et al., 2021), Swin-ViT (Liu et al., 2021), or Vim (Gu & Dao, 2024), the goal is to assess whether the model's core inductive bias remains viable at large scale.

Table 3. Ablation study on the number of refinement iterations in V-HMN. Top-1 test accuracy (%) is reported as mean ± standard deviation over 3 seeds.

Iterations | CIFAR-10      | CIFAR-100     | Fashion-MNIST
0          | 93.56 ± 0.10  | 75.84 ± 0.14  | 92.05 ± 0.08
1          | 93.94 ± 0.11  | 76.58 ± 0.09  | 92.27 ± 0.06
2          | 94.28 ± 0.13  | 76.59 ± 0.16  | 92.48 ± 0.06
3          | 93.99 ± 0.07  | 76.41 ± 0.09  | 92.40 ± 0.05

[Figure 2. Effect of refinement iteration on CIFAR-10 robustness across Gaussian noise, occlusion, and contrast corruptions. The plot shows top-1 accuracy against the number of refinement iterations t, with curves for the mean over all corruptions, Gaussian noise, occlusion, and contrast.]

Table 4. Comparison of V-HMN with baseline models on CIFAR-10, CIFAR-100, SVHN, and Fashion-MNIST. Top-1 test accuracy (%) is reported as mean ± standard deviation over 3 seeds. Baselines include ViT (Dosovitskiy et al., 2021), Swin-ViT (Liu et al., 2021), MLP-Mixer (Tolstikhin et al., 2021), MetaFormer (Yu et al., 2024), Vim (Zhu et al., 2024), and AiT (Sun et al., 2025).

Model         | CIFAR-10      | CIFAR-100     | SVHN          | Fashion-MNIST | Params (M)
ViT           | 91.66 ± 0.08  | 72.56 ± 0.01  | 96.11 ± 0.29  | 91.83 ± 0.15  | 7.16
Swin-ViT      | 87.94 ± 0.10  | 66.58 ± 0.37  | 95.89 ± 0.17  | 91.79 ± 0.17  | 6.92
MLP-Mixer     | 92.65 ± 0.26  | 73.35 ± 0.39  | 96.91 ± 0.06  | 91.46 ± 0.32  | 8.71
MetaFormer    | 75.40 ± 1.56  | 48.76 ± 0.75  | 86.54 ± 2.71  | 91.42 ± 0.21  | 6.99
Vim           | 82.83 ± 3.87  | 63.29 ± 2.83  | 86.63 ± 1.51  | 88.63 ± 1.49  | 7.12
AiT           | 92.97 ± 0.30  | 72.91 ± 0.17  | 95.98 ± 0.06  | 91.51 ± 0.04  | 7.15
V-HMN (ours)  | 93.94 ± 0.05  | 76.58 ± 0.09  | 97.16 ± 0.04  | 92.27 ± 0.06  | 7.12

Table 5. Comparison on ImageNet-1k. All results are top-1 accuracy (%).

Method                                   | Image Size | #Params | Top-1 Acc.
ResNet-18 (He et al., 2016)              | 224        | 12M     | 70.6
ResNet-34 (He et al., 2016)              | 224        | 22M     | 75.5
ResNet-50 (He et al., 2016)              | 224        | 26M     | 79.8
ViT-B/16 (Dosovitskiy et al., 2021)      | 384        | 86M     | 77.9
ViT-L/16 (Dosovitskiy et al., 2021)      | 384        | 307M    | 76.5
MLP-Mixer-B/16 (Tolstikhin et al., 2021) | 224        | 59M     | 76.4
Swin-Mixer-T/D24 (Liu et al., 2021)      | 256        | 20M     | 79.4
PVT-Small (Wang et al., 2021)            | 224        | 25M     | 79.8
V-HMN (ours)                             | 224        | 88M     | 80.3
Despite minimal tuning, V-HMN remains competitive on ImageNet in terms of accuracy, indicating that the framework is not only biologically inspired but also inherently interpretable and sufficiently general to operate effectively in high-resolution, large-dataset regimes. Moreover, V-HMN preserves a transparent computational structure, namely explicit associative-memory retrieval coupled with predictive-coding-inspired iterative refinement, that is difficult to maintain in more complex or heavily engineered architectures. This transparency further enables interpretability of the decision process, since predictions can be explained by identifying which stored patterns are retrieved and refined.

4.4. Visualization

To better understand the behavior of V-HMN, we visualize the retrieved prototypes from both local and global memory banks. For each test image, we first identify the most informative patch based on the attention pooling weights (highlighted by the red box), and then retrieve its nearest prototypes from the corresponding memory banks. To clearly illustrate the retrieved regions, we adopt a patch size of 8 × 8 during visualization.

Figure 3 shows representative results across four categories: automobile, ship, dog, and deer. Several interesting observations can be made. First, the local memory retrieves highly similar local structures from other images of the same class. For example, in the automobile and ship cases, the retrieved prototypes consistently align to similar positions, highlighting that local memory captures part-level correspondences across samples. This demonstrates that local memory is capable of associating fine-grained visual parts across different samples, thereby enhancing spatial interpretability. Second, the global memory retrieves holistic prototypes that provide complementary, scene-level priors.
For instance, as shown in the dog and deer cases, the global retrieval captures diverse poses and viewpoints of the same class, which supply additional information to complete or refine the local representation. This behavior is consistent with the role of associative memory, which not only recalls similar exemplars but also provides missing context to stabilize predictions.

Figure 3. Visualization of retrieved prototypes from local and global memory.

Together, these results highlight that the two memory paths confer complementary benefits: local memory aligns semantically corresponding parts across samples, while global memory furnishes broader priors that help disambiguate incomplete or noisy inputs. Such explicit retrieval and refinement significantly improve the interpretability of V-HMN, as one can directly inspect which prototypes drive the model's decision process.

5. Summary and Outlook

In this work, we introduced V-HMN, a brain-inspired vision backbone that augments each block with local and global Hopfield-style memory. Through associative retrieval and predictive-coding-inspired refinement, V-HMN moves beyond purely feedforward or self-attention architectures and places memory at the center of feature integration. This yields two key benefits: data efficiency, by reusing stored prototypes as inductive priors, and interpretability, as retrieved prototypes explicitly reveal the patterns influencing each prediction. V-HMN outperforms vision backbones of comparable scale across CIFAR, SVHN, and Fashion-MNIST, and furthermore demonstrates that its inductive bias scales reliably to ImageNet, achieving competitive performance without large-scale architectural tuning and hyperparameter search.

Our study highlights the promise of memory-centric backbones as a competitive and scalable alternative to mainstream vision foundation models.
By grounding representation learning in explicit associative memory and lightweight error-corrective refinement, V-HMN bridges biologically inspired computation with large-scale machine learning. This memory-driven principle is not limited to classification; rather, it suggests broader applicability across a range of vision tasks, such as retrieval and metric learning, few-shot adaptation, and dense prediction problems including segmentation and detection, where such a principle may provide beneficial inductive biases. We believe these directions open the door to more interpretable, data-efficient, and biologically inspired architectures.

Acknowledgments

This research was funded in part by the Austrian Science Fund (FWF) 10.55776/COE12 and the AXA Research Fund. Amine M'Charrak gratefully acknowledges support from the Evangelisches Studienwerk e.V. Villigst through a doctoral fellowship.

Impact Statement

This paper presents work whose primary goal is to advance fundamental research in machine learning, particularly in representation learning and memory-based model architectures. The methods proposed are general-purpose and are not designed for deployment in any specific high-risk or safety-critical application. As such, while there may be broader societal implications associated with progress in machine learning more broadly, we do not foresee any direct or immediate ethical concerns arising uniquely from this work.

References

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32, 2019.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123, 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

d'Ascoli, S., Touvron, H., Leavitt, M. L., Morcos, A. S., Biroli, G., and Sagun, L. ConViT: Improving vision transformers with soft convolutional inductive biases. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 2286–2296, 2021.

Friston, K. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836, 2005.

Fürst, A., Rumetshofer, E., Lehner, J., Tran, V. T., Tang, F., Ramsauer, H., Kreil, D., Kopp, M., Klambauer, G., Bitto, A., et al. CLOOB: Modern Hopfield networks with InfoLOOB outperform CLIP. In Advances in Neural Information Processing Systems, 2022.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hu, J. Y.-C., Chang, P.-H., Luo, R., Chen, H.-Y., Li, W., Wang, W.-P., and Liu, H. Outlier-efficient Hopfield layers for large transformer-based models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, 2024.

Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., and Kalantidis, Y.
Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations, 2020.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25, 2012.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019.

Li, T., Li, Z., Luo, A., Rockwell, H., Farimani, A. B., and Lee, T. S. Prototype memory and attention mechanisms for few shot image generation. In International Conference on Learning Representations, 2022.

Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., and Liu, Y. VMamba: Visual state space model. In Advances in Neural Information Processing Systems, 2024.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

Lotter, W., Kreiman, G., and Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, 2017.

Millidge, B., Salvatori, T., Song, Y., Bogacz, R., and Lukasiewicz, T. Predictive coding: Towards a future of deep learning beyond backpropagation? In Proceedings of the 31st International Joint Conference on Artificial Intelligence, pp. 5538–5545, 2022.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems Workshops, 2011.

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436, 2020.

Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.

Rao, R. P. N. and Ballard, D. H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

Sun, Y., Ochiai, H., Wu, Z., Lin, S., and Kanai, R. Associative transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4518–4527, 2025.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. S., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., et al. MLP-Mixer: An all-MLP architecture for vision. In Advances in Neural Information Processing Systems, 2021.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp.
10347–10357. PMLR, 2021.

van Zwol, B., Jefferson, R., and van den Broek, E. L. Predictive coding networks and inference learning: Tutorial and survey. arXiv preprint arXiv:2407.04117, 2024.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

Vinyals, O., Blundell, C., Lillicrap, T., and Wierstra, D. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.

Wang, W., Xie, E., Li, X., Fan, D., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 568–578, 2021.

Whittington, J. C. R. and Bogacz, R. Theories of error backpropagation in the brain. Trends in Cognitive Sciences, 23(3):235–250, 2019.

Wu, D., Hu, J. Y.-C., Hsiao, T.-Y., and Liu, H. Uniform memory retrieval with larger capacity for modern Hopfield models. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 53471–53514, 2024.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint, 2017.

Xu, W., Xu, Y., Chang, T., and Tu, Z. Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9981–9990, 2021.

Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829, 2022.

Yu, W., Si, C., Zhou, P., Luo, M., Zhou, Y., Feng, J., Yan, S., and Wang, X. MetaFormer baselines for vision.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2):896–912, 2024.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

Zhou, B., Cui, Q., Wei, X.-S., and Chen, Z.-M. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719–9728, 2020.

Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, 2024.

A. Appendix

A.1. Hierarchical Local–Global Memory and Invariances

V-HMN is organized as a stack of HMN blocks, each equipped with its own local and global memory banks. The hierarchical structure arises because each block processes and stores representations at its own depth: the prototypes learned at lower layers are grounded in early, fine-grained features, while deeper layers operate on progressively transformed and more semantically organized representations produced by the preceding blocks. As a result, lower layers learn prototypes of local edge- and texture-like patterns, whereas higher layers learn more abstract object- and scene-level prototypes.

Layerwise interaction of local and global memories.
At every layer, the local HMN operates on overlapping neighborhoods of the token grid to retrieve and refine local patterns, while the global HMN pools information across all tokens to retrieve a scene-level prototype that is broadcast back to the entire layer. Importantly, each layer's memory banks only interact with the representations produced at that same depth; deeper layers never directly access shallow-layer features. This architectural separation induces a genuine hierarchy of prototypes. Lower layers store and reinstate fine-grained edge-, texture-, and patch-level patterns, whereas higher layers operate on increasingly abstract representations passed up from previous blocks. The retrieved global prototype at each layer provides a contextual prior that guides local refinement, enabling ambiguous or partially occluded patches to be interpreted in a way consistent with the overall scene. Through this layerwise progression, V-HMN builds increasingly holistic and context-aware representations without relying on explicit spatial downsampling.

What invariances the hierarchy provides. The hierarchical local–global design induces several useful invariances, but not all invariances observed in practice come from the architecture alone.

• Local tolerance to small perturbations. Because local neighborhoods are unfolded with overlap, a small translation or deformation of a feature (e.g., shifting an edge by one pixel) changes which tokens contribute to a neighborhood but often leaves its nearest local prototype unchanged. The local Hopfield retrieval therefore tends to map slightly perturbed patches back to the same or a nearby prototype, providing robustness to small local jitters and noise.

• Contextual invariance via global prototypes. The global HMN sees a pooled summary of the entire token grid.
As a result, global prototypes are largely insensitive to where an object appears within the image, acting more like a translation-tolerant scene or object code. When the global prototype is broadcast back to all tokens, it stabilizes local representations against clutter or partial occlusion: different arrangements of the same global configuration are attracted to similar global memory slots.

• Increasing invariance with depth. As representations become progressively more abstract across successive HMN blocks, prototypes in deeper local and global memory banks become less sensitive to fine-grained pixel-level details and more sensitive to object- and class-level structure. This yields a degree of scale and translation tolerance at higher layers, analogous to the progression observed along the ventral visual stream.

What is handled by augmentation, preprocessing, and pooling. Several important invariances are primarily provided by standard deep-learning components rather than by the memory hierarchy itself. Random crops, horizontal flips, and optional AutoAugment are the main sources of robustness to large translations, flips, and complex photometric distortions. The patch embedding and final attention pooling contribute additional translation invariance by making the classifier depend mostly on aggregated token statistics rather than exact pixel coordinates. We do not build in explicit rotation- or scale-equivariant structure; robustness to such transformations arises empirically from data augmentation and from the generic effects of depth and pooling, not from a specialised architectural mechanism.

In summary, the hierarchical arrangement of local and global memory banks provides a structured inductive bias: lower layers learn local prototypes that are robust to small perturbations; higher layers and global memories learn more holistic prototypes that are tolerant to object location and clutter.
This interacts synergistically with standard augmentation and pooling to produce the overall invariance profile observed in our experiments.

A.2. Variance of cosine similarity

Let $\hat{q}, \hat{k} \in \mathbb{R}^D$ be independent random unit vectors. We assume the random unit vectors are drawn from a distribution that is invariant under coordinate permutations and sign flips (e.g., the uniform distribution on the unit sphere).

We first compute the second moment of $\hat{q}$. Write $\hat{q} = (\hat{q}_1, \ldots, \hat{q}_D)$. By coordinate symmetry, all coordinates satisfy $\mathbb{E}[\hat{q}_i^2] = v$. Since $\sum_{i=1}^{D} \hat{q}_i^2 = 1$, taking expectations gives
$$D v = 1 \;\Rightarrow\; v = \frac{1}{D}.$$
For $i \neq j$, flipping the sign of the $i$-th coordinate (which preserves the distribution of a random unit vector) implies
$$\mathbb{E}[\hat{q}_i \hat{q}_j] = -\mathbb{E}[\hat{q}_i \hat{q}_j] \;\Rightarrow\; \mathbb{E}[\hat{q}_i \hat{q}_j] = 0.$$
Hence the off-diagonal entries vanish and the diagonal entries are all $1/D$, i.e.,
$$\mathbb{E}[\hat{q}\hat{q}^\top] = \frac{1}{D} I.$$
It follows that
$$\mathbb{E}[\hat{q}^\top \hat{k}] = \mathbb{E}_{\hat{q}}\big[\hat{q}^\top\, \mathbb{E}_{\hat{k}}[\hat{k}]\big] = 0.$$
For the second moment,
$$\mathbb{E}\big[(\hat{q}^\top \hat{k})^2\big] = \mathbb{E}_{\hat{k}}\big[\hat{k}^\top\, \mathbb{E}_{\hat{q}}[\hat{q}\hat{q}^\top]\, \hat{k}\big] = \mathbb{E}_{\hat{k}}\Big[\hat{k}^\top \big(\tfrac{1}{D} I\big) \hat{k}\Big] = \frac{1}{D}\, \mathbb{E}_{\hat{k}}\big[\|\hat{k}\|_2^2\big] = \frac{1}{D}.$$
Therefore,
$$\mathrm{Var}(\hat{q}^\top \hat{k}) = \mathbb{E}\big[(\hat{q}^\top \hat{k})^2\big] - \big(\mathbb{E}[\hat{q}^\top \hat{k}]\big)^2 = \frac{1}{D}.$$
Concluding, the variance of the cosine similarity between two random unit vectors decays as $1/D$. Multiplying the similarity by $\sqrt{D}$ rescales the logits to approximately unit variance before the softmax.

A.3. Iterative refinement as predictive-coding (PC) dynamics

The central operation in both local and global modules is an iterative refinement rule.
Given a current representation $z^{(t)} \in \mathbb{R}^D$ and a memory bank $M \in \mathbb{R}^{K \times D}$ with rows $M_j \in \mathbb{R}^D$, we first perform Hopfield-style retrieval as described in Section 3.2:
$$\hat{z}^{(t)} = \frac{z^{(t)}}{\|z^{(t)}\|_2}, \qquad \hat{M}_j = \frac{M_j}{\|M_j\|_2}, \qquad \alpha^{(t)}_j = \mathrm{softmax}_j\big(\sqrt{D}\, \hat{z}^{(t)} \hat{M}_j^{\top}\big), \qquad m^{(t)} = \sum_{j=1}^{K} \alpha^{(t)}_j M_j, \tag{5}$$
where $M_j$ denotes the $j$-th memory slot, and $m^{(t)}$ is the retrieved memory. We then update $z^{(t)}$ via
$$z^{(t+1)} = z^{(t)} + \beta\big(m^{(t)} - z^{(t)}\big), \tag{6}$$
where $\beta$ is a learnable update strength.

This rule can be interpreted as a simple predictive-coding (PC) dynamics. Define the local prediction error
$$\varepsilon^{(t)} := m^{(t)} - z^{(t)}.$$
Then, Eq. 6 becomes
$$z^{(t+1)} = z^{(t)} + \beta\, \varepsilon^{(t)},$$
which is a discrete gradient-descent step on the squared prediction-error energy
$$F(z) = \tfrac{1}{2}\, \|z - m(z)\|_2^2.$$
If we treat $m(z)$ as fixed with respect to $z$ during one update step, then
$$\nabla_z F(z) \approx z - m(z) = -\varepsilon^{(t)},$$
and Eq. 6 coincides with
$$z^{(t+1)} \approx z^{(t)} - \beta\, \nabla_z F(z^{(t)}).$$
That is, the Hopfield module produces a memory-based prediction $m^{(t)}$ of the code $z^{(t)}$, the residual $\varepsilon^{(t)}$ acts as a prediction-error signal, and the representation is iteratively corrected so as to minimize a local prediction-error energy. This mirrors the core mechanism of hierarchical predictive coding (Rao & Ballard, 1999; Friston, 2005; Whittington & Bogacz, 2019): higher-level causes generate a prediction of lower-level activity, explicit error units encode their difference, and latent states are updated by gradient descent on a prediction-error or free-energy objective. In V-HMN, the role of the generative model is played by the associative memory: memory slots $M_j$ act as prototypical causes, the similarity scores $\alpha^{(t)}_j$ approximate a posterior over these causes given $z^{(t)}$, and the retrieved prototype $m^{(t)} = \mathbb{E}[M_j \mid z^{(t)}]$ provides the top-down prediction.
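The retrieval rule of Eq. 5 and the update of Eq. 6 can be written out in a few lines. The sketch below is a plain-Python illustration rather than our PyTorch implementation: it operates on a single vector instead of batched tensors, and the memory contents are toy values.

```python
import math

def _normalize(v):
    """L2-normalize a vector (the hats in Eq. 5)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def retrieve(z, memory):
    """Hopfield-style retrieval (Eq. 5): softmax over sqrt(D)-scaled cosine
    similarities, then a convex combination of the (unnormalized) slots."""
    D = len(z)
    z_hat = _normalize(z)
    # sqrt(D) scaling compensates for the 1/D variance of cosine similarity
    # between random unit vectors (Appendix A.2).
    sims = [math.sqrt(D) * sum(a * b for a, b in zip(z_hat, _normalize(m)))
            for m in memory]
    mx = max(sims)
    exps = [math.exp(s - mx) for s in sims]
    total = sum(exps)
    alpha = [e / total for e in exps]
    return [sum(a * m[d] for a, m in zip(alpha, memory)) for d in range(D)]

def refine(z, memory, beta=0.2, iters=2):
    """Iterative refinement (Eq. 6): z <- z + beta * (m - z), i.e. a gradient
    step on the prediction-error energy with m treated as fixed."""
    for _ in range(iters):
        m = retrieve(z, memory)
        z = [zi + beta * (mi - zi) for zi, mi in zip(z, m)]
    return z
```

With a query close to one stored pattern, the softmax weights concentrate on that slot, and each refinement step moves the representation toward the retrieved prototype.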
The refinement dynamics of Eq. 6 thus implements a lightweight PC update in which (i) the feedforward path computes similarities and posterior weights from the current features to the memory slots, and (ii) the feedback path injects the memory-based prediction back into the features through the prediction-error signal $(m^{(t)} - z^{(t)})$. In practice, we find that one or two iterations are sufficient to obtain robust improvements while keeping computation efficient.

A.4. Biological plausibility and relation to cortical circuits

V-HMN is not intended as a faithful anatomical model of the primate visual system, but it is explicitly guided by two computational motifs that are widely discussed in systems neuroscience: (i) associative memory implemented by recurrently connected populations, and (ii) predictive-coding-like error-corrective refinement in cortical microcircuits. We now clarify where our design aligns with known hippocampal and cortical circuitry and where the connection is only analogical.

Associative memory beyond the hippocampus. Associative dynamics are often introduced via models of the hippocampal formation, where pattern separation in the dentate gyrus and pattern completion in CA3 are thought to support episodic memory and spatial navigation. However, a large body of theoretical and experimental work suggests that related forms of autoassociative computation are also implemented in neocortical microcircuits, where recurrent collateral connections between pyramidal neurons can sustain stable activity patterns that function as long-term and short-term memories, perceptual representations, and decision states. In this broader view, associative memory is a general cortical computational motif, not something confined to the hippocampus.
Our local Hopfield modules are designed to echo this idea: they implement content-addressable retrieval over local patch-level embeddings using modern Hopfield dynamics, such that frequently co-occurring visual features (e.g., edges, corners, textures) form structured prototype-like representations stored in the memory bank. These modules enable V-HMN to integrate local contextual information in a manner that is both interpretable and robust, supporting refinement steps that adjust features toward semantically consistent local patterns, even under perturbations.

Global memory and hippocampal/entorhinal inspiration. At the same time, the global Hopfield module and its class-balanced episodic memory bank are more directly inspired by hippocampal and entorhinal circuitry. The global query aggregates information across the entire image and retrieves a scene-level prototype from a separate memory bank, loosely analogous to how the hippocampus and entorhinal cortex integrate inputs from many cortical areas into a sparse episodic code that can later be reinstated to bias cortical activity. The broadcast of the retrieved global prototype back to all tokens in a block is therefore reminiscent of hippocampo–cortical feedback that reinstates context or episodes, but at a highly abstract level and without any claim of anatomical fidelity.

Relation to ventral visual stream processing. In the brain, visual object recognition emerges along the ventral visual stream (V1, V2, V4, IT), with rich recurrent connectivity and local microcircuits, before hippocampal structures become involved in binding objects into episodic memories.
V-HMN compresses some of these ideas into a single backbone: local HMN blocks can be viewed as abstracted cortical microcircuits that combine feedforward feature extraction with local associative retrieval and predictive-coding-style refinement, and the global HMN adds an additional, hippocampus-inspired contextual signal. We do not claim a one-to-one mapping between layers of V-HMN and specific cortical areas. Rather, our aim is to capture, within a practical vision model, the core computational principles of (i) distributed pattern reinstatement through associative memory and (ii) iterative interaction between top-down predictions and bottom-up features.

Brain-inspired, not anatomically faithful. Taken together, these design choices place V-HMN somewhere between conventional deep vision backbones and detailed biophysical models. Compared to standard feedforward or self-attention-only architectures, V-HMN moves closer to known cortical and hippocampal computation by making associative memory and error-corrective dynamics first-class citizens in the backbone: memory banks correspond to explicit, inspectable prototypes rather than hidden weights; Hopfield retrieval implements pattern completion; and iterative refinement reduces local prediction errors. At the same time, we abstract away from many anatomical details (layer-specific connectivity, cell types, precise ventral-stream staging), and we do not claim that V-HMN is a literal model of the primate visual system. Instead, our goal is to move towards more biologically grounded computation than in current deep learning models, while remaining competitive and scalable on standard machine-learning benchmarks.

A.5. Why V-HMN is data-efficient

The empirical results in Tables 1 and 2 show that V-HMN maintains strong performance even when only a small fraction of the training labels is available. For instance, with just 10% of the labeled data, V-HMN achieves 80.22% on CIFAR-10, 43.21% on CIFAR-100, and 89.18% on Fashion-MNIST (Table 1), and it consistently outperforms ViT, Swin-ViT, MLP-Mixer, MetaFormer, Vim, and AiT under both 10% and 30% training data across all three benchmarks (Table 2). We now briefly analyze why V-HMN is more data-efficient than these baselines.

Prototype-based nonparametric prior. Both the local and global Hopfield modules retrieve representations from explicit memory banks whose slots are populated with real latent embeddings during training. These slots act as prototypes that approximate class-conditional manifolds in feature space. Once a prototype is stored, future inputs can leverage it via retrieval even if supervision is limited, providing a nonparametric prior that complements the parametric backbone. In low-data regimes, this prototype reuse compensates for limited gradient-based fitting and reduces overfitting to the small labeled set. The monotonic improvements from 10% to 30% to 50% labeled data in Table 1 reflect this behavior: as more labeled examples are observed, memory banks become better populated and the same prototypes can be reused across many inputs.

Local–global inductive bias. V-HMN enforces a structured inductive bias by combining (i) local Hopfield dynamics on unfolded k × k neighborhoods and (ii) a global Hopfield module operating on scene-level aggregates. Local memories specialize to recurring edge- and texture-like patches, while global memories capture higher-level scene and object prototypes. This hierarchical organization constrains the effective hypothesis class: new images are explained in terms of reusing and recombining a finite set of learned local and global prototypes, rather than learning fresh features from scratch for every configuration. This reduces the amount of labeled data needed to reach a given accuracy, which is consistent with the stronger gains of V-HMN in the 10% and 30% settings compared to the full-data regime in Table 1.
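A minimal sketch of such a class-balanced memory bank is given below. The slot-assignment policy shown here (a fixed per-class quota, filled first-come first-served during training and frozen afterwards) is one simple possibility chosen for illustration, not necessarily the exact policy of our implementation.

```python
class ClassBalancedMemory:
    """Illustrative class-balanced memory bank: each class owns an equal
    quota of slots, filled with real sample embeddings during training and
    frozen at inference. The filling policy is a simplification."""

    def __init__(self, num_slots, num_classes):
        assert num_slots % num_classes == 0, "slots must divide evenly"
        self.quota = num_slots // num_classes
        self.slots = {c: [] for c in range(num_classes)}
        self.frozen = False

    def write(self, embedding, label):
        """Store a sample embedding under its class quota (training only).
        Returns True if the embedding was stored."""
        if self.frozen or len(self.slots[label]) >= self.quota:
            return False
        self.slots[label].append(list(embedding))
        return True

    def freeze(self):
        """Called once training ends; memories stay fixed at inference."""
        self.frozen = True

    def bank(self):
        """Flattened list of stored prototypes, as consumed by retrieval."""
        return [e for c in sorted(self.slots) for e in self.slots[c]]
```

Because every class is capped at the same quota, frequent classes cannot crowd out rare ones, which is what makes the stored prototypes usable as a balanced nonparametric prior in low-label regimes.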
Iterative refinement focuses capacity on hard examples. As discussed in Section A.3, the refinement rule implements a predictive-coding (PC) update, where representations are iteratively corrected to reduce local prediction-error energy. In practice, this means that model capacity is concentrated on those regions of feature space where the memory-based predictions and current representations disagree. When data are scarce, this mechanism helps stabilize learning: easy examples quickly align with their nearest prototypes and require little further adjustment, while scarce or atypical examples receive larger refinement updates (proportional to prediction error). The ablation in Table 3 shows that removing refinement (t = 0) substantially hurts performance and that one or two refinement iterations yield the best trade-off between accuracy and computation. Together with the memory-based priors above, this targeted refinement explains why V-HMN achieves higher accuracies than standard backbones and AiT under limited supervision (Table 2), despite using a comparable number of parameters.

A.6. Implementation Details

V-HMN is implemented in PyTorch. For the experiments on small datasets (CIFAR, Fashion-MNIST, and SVHN), images are divided into non-overlapping 4 × 4 patches and embedded into tokens with a learnable positional encoding. The backbone consists of six V-HMN blocks, each equipped with a local Hopfield module operating on 3 × 3 spatial neighborhoods and a global Hopfield module operating on the mean-pooled query. The local and global modules maintain class-balanced memory banks with 2500 and 1000 slots, respectively. The embedding and latent dimensions are both set to 300, and the MLP expansion ratio is set to 2. Refinement updates are controlled by a learnable parameter β initialized to 0.2, and unless otherwise specified, a single refinement iteration is used.
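For concreteness, the small-dataset configuration above can be collected in one place. The field names below are our own shorthand for this sketch, not identifiers from any released code.

```python
from dataclasses import dataclass

@dataclass
class SmallDatasetConfig:
    """Hyperparameters for CIFAR / Fashion-MNIST / SVHN as listed in A.6.
    Field names are illustrative shorthand, not code identifiers."""
    patch_size: int = 4             # non-overlapping 4x4 patches
    num_blocks: int = 6             # six V-HMN blocks
    local_window: int = 3           # local Hopfield on 3x3 neighborhoods
    local_memory_slots: int = 2500  # class-balanced local bank
    global_memory_slots: int = 1000 # class-balanced global bank
    embed_dim: int = 300            # embedding and latent dimensions
    mlp_ratio: int = 2
    beta_init: float = 0.2          # learnable refinement strength
    refinement_iters: int = 1       # unless otherwise specified
```

Collecting the settings this way makes the ablations below easy to reproduce conceptually: each ablation varies exactly one field (e.g., `refinement_iters`, `local_window`, or the memory sizes) while the rest stay fixed.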
We use the Adam optimizer (Kingma & Ba, 2015) with cosine learning-rate decay. The initial learning rate is 0.001, linearly warmed up during the first five epochs. Training is conducted for 400 epochs with a batch size of 256 and a weight decay of 5 × 10⁻⁵. For data augmentation, we adopt a strong recipe including random cropping, horizontal flipping, AutoAugment (Cubuk et al., 2019), MixUp (Zhang et al., 2018), and CutMix (Yun et al., 2019). All reported results are based on this training setup unless otherwise noted.

For the ImageNet experiments, we use a batch size of 1000 and a patch size of 14 × 14. The local and global memory banks contain 5000 and 3000 slots, respectively. We train the model for 310 epochs using the AdamW optimizer with a cosine learning-rate schedule. The embedding and latent dimensions are both set to 576, and the MLP expansion ratio is set to 4. The initial learning rate is 0.001 with linear warmup at the beginning of training. A weight decay of 0.03 is applied, and gradient clipping with a maximum norm of 0.1 is used. For data augmentation, we use RandAugment, MixUp, CutMix, and random erasing. We do not use stochastic depth, repeated augmentation, or Exponential Moving Average (EMA).

A.7. Explorations on Spatial Window Size

Table 6. Ablation on spatial window size k. Accuracy (%) is reported on CIFAR-10, CIFAR-100, and Fashion-MNIST. Results are reported as mean ± standard deviation over 3 seeds.

Window size k | CIFAR-10     | CIFAR-100    | Fashion-MNIST
3             | 93.94 ± 0.11 | 76.58 ± 0.09 | 92.27 ± 0.06
5             | 93.22 ± 0.07 | 74.80 ± 0.07 | 91.94 ± 0.14
7             | 92.60 ± 0.67 | 70.67 ± 0.08 | 91.23 ± 0.35

Table 6 reports the effect of varying the spatial window size k for local memory retrieval. We observe that a smaller window (k = 3) yields the best results across datasets, while larger windows (k = 5, 7) lead to slightly lower accuracy.
This suggests that restricting memory matching to a compact neighborhood is beneficial, as it enforces stronger locality priors and avoids interference from irrelevant patches. At the same time, the overall performance difference remains small, indicating that the model is robust to the choice of k.

A.8. Explorations on Size of Memory

Table 7. Ablation study on the effect of local and global memory sizes in V-HMN. Top-1 test accuracy (%) is reported as mean ± standard deviation over 3 seeds.

Local memory size:
Size | CIFAR-10     | CIFAR-100    | Fashion-MNIST
1500 | 93.93 ± 0.09 | 76.47 ± 0.29 | 92.32 ± 0.07
2500 | 93.94 ± 0.11 | 76.58 ± 0.09 | 92.27 ± 0.06
3500 | 94.14 ± 0.01 | 76.42 ± 0.14 | 92.36 ± 0.14
4500 | 94.04 ± 0.05 | 76.59 ± 0.13 | 92.44 ± 0.03

Global memory size:
Size | CIFAR-10     | CIFAR-100    | Fashion-MNIST
500  | 93.91 ± 0.10 | 75.93 ± 0.12 | 92.23 ± 0.10
1000 | 93.94 ± 0.11 | 76.58 ± 0.09 | 92.27 ± 0.06
1500 | 94.15 ± 0.17 | 76.41 ± 0.03 | 92.12 ± 0.04
2000 | 93.94 ± 0.14 | 76.19 ± 0.04 | 92.09 ± 0.17

Table 7 summarizes the effect of varying the sizes of the local and global memory banks. Performance does not grow monotonically with larger memories; instead, the best results arise from moderate capacities, approximately 2500 slots for local memory and 1000 slots for global memory. The same qualitative trend holds across all three datasets, suggesting that the effect is systematic rather than random. This pattern complements the iteration ablation: because the model performs only a small number of refinement steps, what matters most is having a well-curated, high-quality prototype set that can provide targeted corrections, rather than an excessively large bank. When the memories are too small, they under-cover the feature space and limit the corrective power of each refinement step. When they are excessively large, the bank becomes redundant and introduces retrieval noise, slightly reducing accuracy and weakening the influence of each update. Together with the iteration ablation in Table 3, these results indicate that associative retrieval functions as a lightweight, prototype-based prior on top of a strong feedforward backbone. While not required for basic recognition (as seen from the small drop at t = 0), the learned memories and 1–2 refinement iterations consistently improve robustness and data efficiency once an appropriate prototype set is established.

Figure 4. Effect of β initialization on model accuracy with the refinement iteration fixed to t = 1.

A.9. Explorations on β Initialization

In this section, we study how the initialization of β affects performance when the number of refinement steps is fixed to t = 1. Figure 4 reports results on CIFAR-10 and CIFAR-100 for β ∈ {0.2, 0.4, 0.6, 0.8, 1.0} (learned thereafter during training). Performance is relatively stable for small-to-moderate values, with the best accuracy obtained when initializing β around 0.2. Setting β = 1.0 consistently degrades accuracy on both datasets. We hypothesize that the effect primarily reflects the strength of the refinement update. Very large initial values make the update overly aggressive, effectively pushing the representation too close to the retrieved prototype, while very small values under-utilize the memory-based correction. Intermediate initialization therefore provides a balanced step size that avoids both under-correction and over-writing, which explains the better performance observed for moderate β values. This effect matters most in the early stages of training, when the memory banks are still noisy and prototypes are less well formed; overly aggressive updates at this stage can amplify noise and hurt generalization. Unless otherwise noted, we initialize β to 0.2 (and keep it learnable), which provides a robust starting point and yields the best or near-best results across datasets in this single-iteration setting.

Table 8. Performance under class imbalance on CIFAR-10 and CIFAR-100. All results are top-1 accuracy (%). Baselines include ViT (Dosovitskiy et al., 2021), Swin-ViT (Liu et al., 2021), MLP-Mixer (Tolstikhin et al., 2021), MetaFormer (Yu et al., 2024), Vim (Zhu et al., 2024), and AiT (Sun et al., 2025). Results are reported as mean ± standard deviation over 3 seeds.

Method     | CIFAR-10 (r=50) | CIFAR-10 (r=100) | CIFAR-100 (r=50) | CIFAR-100 (r=100)
ViT        | 71.47 ± 0.34    | 64.61 ± 0.98     | 42.91 ± 0.29     | 37.45 ± 0.16
Swin-ViT   | 65.66 ± 0.13    | 34.13 ± 0.15     | 38.37 ± 0.08     | 34.13 ± 0.15
MLP-Mixer  | 72.61 ± 0.67    | 66.89 ± 0.14     | 41.90 ± 0.21     | 37.47 ± 0.07
Vim        | 62.19 ± 0.70    | 57.41 ± 0.24     | 38.69 ± 0.12     | 34.55 ± 0.11
AiT        | 63.47 ± 0.13    | 56.89 ± 0.19     | 38.12 ± 0.18     | 33.49 ± 0.22
MetaFormer | 55.63 ± 0.91    | 49.32 ± 0.23     | 29.47 ± 0.14     | 26.39 ± 0.16
V-HMN      | 77.09 ± 0.22    | 70.43 ± 0.42     | 47.82 ± 0.11     | 42.16 ± 0.25

A.10. Imbalanced Setting

To further evaluate the data efficiency of V-HMN, we additionally assess the model under class-imbalanced CIFAR-10/100, following the long-tailed evaluation protocol of prior work such as Cao et al. (2019), Zhou et al. (2020), and Kang et al. (2020). We construct imbalance ratios of 50 and 100, where an imbalance ratio of r means that the number of samples in the most frequent class is r times larger than that in the least frequent class. This setting not only makes the training data distribution highly skewed, but also induces imbalance in the learned prototype memories themselves, since minority classes contribute far fewer instances to the memory-update process. Table 8 reports the results, in which r = 50 and r = 100 denote the imbalance ratios.
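As a concrete illustration, the long-tailed protocol of Cao et al. (2019) is commonly implemented with an exponential decay of per-class sample counts; the sketch below follows that convention (the exact subsampling used in our experiments may differ in minor details):

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts with an exponential decay profile, so the
    most frequent class has imbalance_ratio times as many samples as the
    rarest class."""
    counts = []
    for k in range(num_classes):
        # Class 0 keeps all n_max samples; class K-1 keeps n_max / ratio.
        frac = imbalance_ratio ** (-k / (num_classes - 1))
        counts.append(max(1, int(n_max * frac)))
    return counts

# CIFAR-10 with 5000 images per class and imbalance ratio 100:
counts = long_tailed_counts(5000, 10, 100)
```

Under this profile the head class retains 5000 images while the tail class retains only 50, which is exactly the r = 100 setting evaluated in Table 8.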
Across all imbalance ratios and datasets, V-HMN achieves the strongest robustness, substantially outperforming ViT, Swin-ViT, MLP-Mixer, Vim, AiT, and MetaFormer. We attribute this behavior to the associative-memory mechanism: unlike purely feedforward models, V-HMN maintains a set of learned prototypes that aggregate information across the entire training distribution. During inference, the refinement step retrieves class-relevant prototypes and corrects the latent representation toward them. Even under severe imbalance, minority-class prototypes remain available in the memory bank and continue to provide stabilizing signals, mitigating representation drift and reducing majority-class dominance. In summary, these results suggest that the memory-based associative refinement acts as an inherent regularizer in long-tailed regimes, allowing V-HMN to maintain accuracy where other architectures degrade more severely.

Figure 5. Visualization of the memory-slot weightings in the last layer of V-HMN: (a) CIFAR-10 local memory weighting, (b) CIFAR-10 global memory weighting, (c) CIFAR-100 local memory weighting, (d) CIFAR-100 global memory weighting, (e) Fashion-MNIST local memory weighting, (f) Fashion-MNIST global memory weighting.

A.11. Visualization of Memory Weights

We visualize how memory slots are weighted in the last layer of the model when inputs pass through it. We perform this visualization for CIFAR-10, CIFAR-100, and Fashion-MNIST. For each dataset, we pass all inputs from the dataset's first class through the model and average over the resulting memory-slot weightings. The results are shown in Figure 5. For CIFAR-10 and Fashion-MNIST, the weights are highest on average in the first 10% of the memory slots, corresponding to the first dataset class. This shows that the refinement step refines the latent vectors towards prototypes of the same class.
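The per-class averaging behind these plots can be sketched as follows, assuming the model exposes a per-input vector of memory-slot weightings (the accessor and names are hypothetical):

```python
def average_slot_weights(weightings):
    """Average per-input memory-slot weightings over a set of inputs from
    one class, yielding that class's mean activation per slot."""
    n = len(weightings)
    num_slots = len(weightings[0])
    return [sum(w[s] for w in weightings) / n for s in range(num_slots)]

# Two inputs of the same class, four memory slots: the class's own slot
# block (slots 0-1 here) receives most of the weight on average.
avg = average_slot_weights([[0.6, 0.3, 0.1, 0.0],
                            [0.4, 0.4, 0.1, 0.1]])
```

Plotting such averaged vectors against the (class-ordered) slot index is what reveals the block structure visible for CIFAR-10 and Fashion-MNIST.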
For CIFAR-100, no such pattern can be distinguished in the visualization, as the first class is represented by only the first 1% of memory slots, which is too few to be clearly visible.

Table 9. Top-1 and top-5 retrieval hit rates, in percent (%), measured on CIFAR-10, CIFAR-100, and Fashion-MNIST.

                      | CIFAR-10 | CIFAR-100 | Fashion-MNIST
Local top-1 hit rate  | 30.87    | 1.92      | 35.73
Local top-5 hit rate  | 61.38    | 6.73      | 73.44
Global top-1 hit rate | 36.32    | 7.82      | 68.80
Global top-5 hit rate | 81.43    | 24.19     | 96.24

A.12. Retrieval

We analyze how the model interacts with the memories by applying it to samples from CIFAR-10, CIFAR-100, and Fashion-MNIST and measuring how frequently the memory slots weighted highest in the refinement steps come from the same class as the input sample. We consider both the top-1 hit rate, which measures how frequently the single highest-weighted memory slot is from the same class as the input, and the top-5 hit rate, which measures how frequently at least one of the five highest-weighted memory slots is from the same class as the input. We measure these values in the last layer of the model. The results are shown in Table 9. If retrieval were completely random, the expected top-1 and top-5 hit rates would be 1% and 5%, respectively, for CIFAR-100, and 10% and 50%, respectively, for CIFAR-10 and Fashion-MNIST. In all cases, the measured hit rate is significantly higher than these values. This shows that the refinement process works as expected: the latent vectors are refined towards stored memories of the same class.

A.13. Associative Memory Dynamics in V-HMN: Robustness, Interpretability, and Data Efficiency

To evaluate the robustness of V-HMN's memory module as a prototype memory, we apply controlled perturbations to CIFAR-10 test images and examine how the retrieved prototypes change. We consider two perturbation types: (1) additive Gaussian noise with standard deviations σ ∈ [0.01, 0.30], and (2) block occlusions with side lengths from 4 to 20 pixels (covering up to 39% of the input). To quantify how retrieval behaves under corruption, we evaluate two complementary metrics:

• Top-5 Consistency (blue). This metric evaluates the stability of discrete prototype indexing under perturbations. For each clean test image, we first record the set of its top-5 retrieved prototypes. After applying a perturbation, we obtain the corrupted image's top-1 retrieved prototype and check whether this prototype is contained in the clean image's top-5 set. We repeat this for all test images and report the percentage of cases in which the corrupted top-1 prototype remains within the clean top-5 set. Higher values therefore indicate that the model maintains stable prototype choices even when the input is perturbed.

• Prototype Cosine Similarity (orange). This metric evaluates the semantic stability of prototype retrieval. For each test image, we compute the cosine similarity between (i) the prototype vector retrieved as top-1 for the clean image and (ii) the prototype vector retrieved as top-1 for its corrupted version. We average this cosine similarity across all test samples. High similarity indicates that even when the discrete index changes, the retrieved prototypes remain semantically similar (e.g., neighbors in prototype space).

Figure 6. Robustness analysis of V-HMN's associative memory under Gaussian noise and occlusion.

In the occlusion experiments, the blue curve drops linearly and rapidly (falling below 20% at a 16 × 16 occlusion), indicating frequent index switching. In contrast, the orange curve remains highly robust, maintaining over 90% similarity even with 20 × 20 occlusions. This shows that V-HMN often switches to semantically adjacent "neighbor" prototypes, effectively performing pattern completion. The noise experiments exhibit a similar trend: while the blue metric is highly sensitive (dropping to ∼20% at σ = 0.05), the orange metric retains approximately 90% similarity at σ = 0.07 and around 70% even under extreme noise (σ = 0.30). These results confirm that the associative retrieval mechanism provides strong denoising capabilities, mapping corrupted signals back toward the correct prototype manifold.

These findings demonstrate that V-HMN possesses strong semantic robustness, interpretability, and data efficiency. Across both noise and occlusion perturbations, the cosine similarity between pre- and post-perturbation prototypes remains remarkably stable, indicating that the model consistently maps corrupted inputs back toward nearby semantic centers in memory. This behavior reflects an effective many-to-one compression from high-dimensional pixel variations to a low-dimensional, structured prototype space, providing a principled explanation for V-HMN's data efficiency: diverse corrupted variants are absorbed into a small set of meaningful prototypes, reducing the need to observe every possible input configuration during training. Moreover, the model's ability to maintain high semantic similarity even when the discrete index changes underscores its interpretability: prototype switching occurs primarily among semantically adjacent neighbors, revealing a clear structure in the associative-memory dynamics. Finally, the smooth and predictable changes observed under increasing perturbation levels indicate stable and reliable retrieval behavior, without abrupt or erratic transitions. Together, these properties highlight the role of associative-memory refinement not merely as an architectural component, but as a robust inductive bias that enhances stability, generalization, and interpretability across challenging input conditions.
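The two robustness metrics can be sketched per sample as follows, assuming access to the clean image's top-5 prototype index set and the top-1 prototype vectors before and after corruption (all names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two prototype vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def robustness_metrics(clean_top5, corrupt_top1, clean_vec, corrupt_vec):
    """Per-sample robustness metrics: top-5 consistency (does the corrupted
    top-1 index stay in the clean top-5 set?) and prototype cosine similarity
    (between the clean and corrupted top-1 prototype vectors)."""
    consistency = corrupt_top1 in clean_top5
    similarity = cosine(clean_vec, corrupt_vec)
    return consistency, similarity

# The index switches (slot 7 is outside the clean top-5 set), yet the
# retrieved prototype stays semantically close to the clean one.
ok, sim = robustness_metrics({0, 1, 2, 3, 4}, 7, [1.0, 0.0], [0.9, 0.1])
```

Averaging these per-sample values over the test set yields the blue (consistency) and orange (similarity) curves: the example above is exactly the "index switches but similarity stays high" case that dominates under occlusion.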