GAAMA: Graph Augmented Associative Memory for Agents
Swarna Kamal Paul (Nagarro, swarna.paul@nagarro.com), Shubhendu Sharma (Nagarro, shubhendu.sharma@nagarro.com), Nitin Sareen (Nagarro, nitin.sareen@nagarro.com)

Abstract

AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships between memories, or use memory compression and vector retrieval that cannot capture the associative structure of multi-session conversations. A few graph-based techniques have been proposed in the literature, but they still suffer from hub-dominated retrieval and poor hierarchical reasoning over evolving memory. We propose GAAMA, a graph-augmented associative memory system that constructs a concept-mediated hierarchical knowledge graph through a three-step pipeline: (1) verbatim episode preservation from raw conversations, (2) LLM-based extraction of atomic facts and topic-level concept nodes, and (3) synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that complement semantic similarity. Retrieval combines cosine-similarity-based k-nearest-neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. On the LoCoMo-10 benchmark (1,540 questions across 10 multi-session conversations), GAAMA achieves 78.9% mean reward, outperforming a tuned RAG baseline (75.0%), HippoRAG (69.9%), A-Mem (47.2%), and Nemori (52.1%). Ablation analysis shows that augmenting semantic search with graph-traversal-based ranking (Personalized PageRank) consistently improves over pure semantic search on graph nodes (+1.0 percentage point overall).
Preprint.

1 Introduction

AI agents increasingly interact with users over extended periods spanning multiple sessions. A customer support agent must recall past issues and resolutions; a personal assistant must remember preferences, routines, and prior conversations. Without effective long-term memory, agents lose context between sessions and deliver generic, repetitive responses.

Current memory approaches face fundamental limitations. Flat RAG systems [Lewis et al., 2020] retrieve text chunks via embedding similarity alone, losing the structural relationships that connect entities, events, and facts across conversations. Full-context approaches that feed entire conversation histories into the LLM context window are prohibitively expensive and do not scale beyond a handful of sessions.

Recent memory-augmented systems have made progress but face distinct limitations. Graph-based approaches like HippoRAG [Gutierrez et al., 2024] construct entity-centric knowledge graphs from OpenIE triples and retrieve via Personalized PageRank, but their entity-centric design causes context loss during indexing and inference, a limitation that HippoRAG 2 [Gutierrez et al., 2025] itself identifies as "a critical flaw." In our experiments, we further observe that entity nodes in multi-session conversations accumulate hundreds of edges, creating high-degree hubs that uniformly distribute PPR mass and dilute retrieval precision. Non-graph approaches take different strategies: A-Mem [Xu et al., 2025] creates Zettelkasten-inspired interconnected memory notes with dynamic linking and evolution, but retrieves primarily via embedding similarity without graph-based relevance propagation; and Nemori [Nan et al., 2025] separates episodic and semantic memory stores with vector-based retrieval but lacks graph-mediated associative paths.
We propose GAAMA (Graph Augmented Associative Memory for Agents), a system that addresses these limitations through three key innovations:

1. A concept-mediated knowledge graph with four node types and five structural edge types. Episodes preserve raw conversation turns verbatim, retaining temporal references critical for temporal reasoning. Facts are atomic assertions distilled via LLM from episodes, capturing reusable knowledge. Reflections synthesize higher-order insights across multiple facts, enabling cross-session inference. Concept nodes are topic-level labels (e.g., pottery_hobby, camping_trip) that provide cross-cutting traversal paths without creating the mega-hub problem of entity-centric designs.

2. A three-step construction pipeline that separates cheap structural operations from targeted LLM extraction: (1) verbatim episode preservation with temporal NEXT-edge chaining (no LLM), (2) LLM-based extraction of atomic facts and concept labels with provenance edges, and (3) reflection synthesis from multiple facts. The pipeline builds memory incrementally over conversation chunks and supports continuous evolution of memory.

3. Edge-type-aware Personalized PageRank with hub dampening, blended with semantic similarity through an additive scoring function that allows mild graph augmentation (PPR weight 0.1) to consistently improve retrieval without introducing noise.

The code is available at https://github.com/swarna-kpaul/gaama

2 Related Work

2.1 Memory systems for conversational agents

Several recent systems address long-term memory for AI agents. HippoRAG [Gutierrez et al., 2024] constructs a knowledge graph from OpenIE triples, where subjects and objects become phrase nodes connected by relation and synonym edges.
Retrieval proceeds via Personalized PageRank seeded from query-linked nodes (via named entity recognition in HippoRAG; extended to query-to-triple matching in HippoRAG 2 [Gutierrez et al., 2025]). While effective for associative reasoning, HippoRAG's entity-centric design causes context loss during both indexing and inference, a limitation that HippoRAG 2 [Gutierrez et al., 2025] identifies as "a critical flaw." In our multi-session setting, we additionally find that recurring person entities accumulate hundreds of edges, creating mega-hubs that dilute PPR precision.

A-Mem [Xu et al., 2025] creates a Zettelkasten-inspired memory network where each note contains LLM-generated keywords, tags, and contextual descriptions. New memories trigger link generation (LLM-judged connections to similar notes) and memory evolution (updating existing notes with new context). However, retrieval relies on embedding similarity, with linked notes returned alongside matches, without graph-based relevance propagation like PPR.

SimpleMem [Liu et al., 2026] uses Semantic Structured Compression to filter and reformulate dialogue into compact memory units, Online Semantic Synthesis to consolidate related units, and Intent-Aware Retrieval Planning to dynamically query across semantic, lexical, and symbolic index layers. It achieves strong performance through information density rather than graph structure.

Nemori [Nan et al., 2025] segments conversations into coherent episodes via boundary detection, transforms them into structured narratives, and distills semantic knowledge through a Predict-Calibrate cycle inspired by the Free Energy Principle. Retrieval uses vector similarity over separate episodic and semantic stores, without graph-mediated associative paths.

AgeMem [Yu et al., 2026] takes a complementary approach by unifying long-term and short-term memory management into the agent's policy via reinforcement learning.
Memory operations (add, update, delete, retrieve, summarize, filter) are exposed as tool-based actions, and a three-stage progressive GRPO strategy trains the agent to coordinate LTM storage and STM context management end-to-end. Unlike GAAMA's focus on graph-structured retrieval, AgeMem's contribution is in learned memory management policies that decide when and how to store or retrieve.

More broadly, LLM-based agents have shown increasing capability in tasks requiring long-term memory [Paul, 2024], motivating the need for memory architectures that support both retrieval and multi-step reasoning.

2.2 Retrieval-augmented generation

RAG [Lewis et al., 2020] has become the standard approach for grounding LLM responses in external knowledge. Standard RAG pipelines encode documents into dense vectors and retrieve the top-k most similar passages given a query. While effective for single-hop factual questions, flat RAG loses the relational structure between entities and events, making multi-hop reasoning difficult. Graph-based extensions such as GraphRAG [Edge et al., 2024, Nan et al., 2024] augment retrieval with graph structure, but these are typically designed for static document collections rather than the dynamic, evolving memory of conversational agents.

2.3 Knowledge graphs for question answering

Knowledge graph-based QA systems [Yasunaga et al., 2021] leverage structured relationships for multi-hop reasoning. Personalized PageRank (PPR) [Jeh and Widom, 2003] has been widely used in information retrieval and recommendation systems to propagate relevance from seed nodes through a graph. Our work extends PPR with edge-type-aware transition weights and hub dampening, tailored for concept-mediated knowledge graphs from conversational memory.
Unlike prior PPR-based retrieval that relies solely on graph structure, GAAMA blends PPR scores with semantic similarity through an additive scoring function, allowing graph traversal to augment rather than replace embedding-based retrieval.

3 Method

GAAMA operates in two phases: (A) knowledge graph construction from conversation sessions via a three-step pipeline, and (B) retrieval via hybrid PPR-augmented scoring.

3.1 Three-step knowledge graph construction

Given a new conversation session, GAAMA constructs a typed knowledge graph through three sequential steps that separate cheap structural operations from targeted LLM calls.

Step 1: Episode preservation (no LLM). Each conversation turn is stored verbatim as an episode node, preserving the exact words of the original message without summarization or modification. This design choice is critical for temporal reasoning: relative time references ("yesterday", "last week") are preserved alongside conversation timestamps, enabling the answer-generation LLM to resolve temporal queries from the raw context. Each conversation message maps to exactly one episode node. Episodes are temporally chained via NEXT edges that encode the sequential order of turns within each session. This step requires zero LLM calls and runs in linear time over the conversation length.

Step 2: Fact and concept extraction (LLM). An LLM processes the episode sequence to extract two types of derived nodes:

• Facts: Atomic factual assertions about entities or events (e.g., "Melanie painted a sunset during the pottery workshop"). Facts are linked to their source episodes via DERIVED_FROM edges, preserving provenance. While extracting facts for a chunk of episodes, the top-k similar historical episodes and facts are passed into the LLM's context to derive the new fact set.

Table 1: GAAMA graph schema: four node types connected by five structural edge types.
Edge Type            Source → Target       Description
NEXT                 Episode → Episode     Temporal succession within a session
DERIVED_FROM         Fact → Episode        Provenance: fact extracted from episode
DERIVED_FROM_FACT    Reflection → Fact     Provenance: reflection synthesized from facts
HAS_CONCEPT          Episode → Concept     Episode discusses this topic
ABOUT_CONCEPT        Fact → Concept        Fact relates to this topic

• Concepts: Topic-level labels (2–5 words, snake_case) that capture the thematic content of episodes and facts. Concepts are derived for new chunks of episodes and facts, as well as for similar historical episodes and facts. Examples from our extraction prompt include pottery_hobby, camping_trip, adoption_process, career_transition, beach_outing, and marathon_training. Concepts are specifically not person names, dates, or generic terms; they represent activities, interests, events, and topics that provide meaningful cross-cutting structure. Episodes are linked to their concepts via HAS_CONCEPT edges; facts are linked via ABOUT_CONCEPT edges.

The extraction prompt (Appendix A) instructs the LLM to perform multi-step reasoning across episodes to derive general facts where possible, and to generate concept labels that are specific enough to be useful for retrieval but general enough to connect related content across sessions.

Step 3: Reflection synthesis (LLM). A second LLM pass synthesizes higher-order insights from multiple facts. Reflections capture generalized patterns, preferences, or lessons that span multiple conversations (e.g., "User prefers outdoor activities on weekends"). Each reflection is linked to its supporting facts via DERIVED_FROM_FACT edges. Reflections are particularly valuable for inference-type questions that require combining information across sessions.

3.2 Graph schema

The resulting knowledge graph contains four node types and five structural edge types, summarized in Table 1.

Why concept nodes instead of entity nodes.
HippoRAG [Gutierrez et al., 2024] employs entity-centric phrase nodes extracted via OpenIE as its primary associative index for PPR-based retrieval. In multi-session conversational data, this creates a mega-hub problem: person entities like "user" or recurring participants accumulate hundreds of edges, causing PPR mass to distribute uniformly across all connected memories and diluting retrieval precision. In our early experiments with an entity-centric graph design (Section 5.3), person entities had 400–500+ edges each, and hub dampening provided only marginal relief. Concept nodes avoid this problem because topics are inherently more distributed than persons: a conversation about "pottery" and "weekend plans" connects to the concept nodes pottery_hobby and weekend_activities rather than funneling through a single person hub. The resulting graph is sparser (typically 30× fewer edges than entity-centric designs) while providing structurally non-redundant traversal paths that complement embedding-based similarity search.

3.3 Hybrid retrieval with Personalized PageRank

Given a query q, retrieval proceeds in five steps.

Step 1: KNN candidate retrieval and seed selection. We first retrieve a broad candidate pool of 2B nodes by cosine similarity to the query embedding, where B is the total retrieval budget (B = max_facts + max_reflections + max_episodes). From this pool, the top-k nodes (default k = 40) are selected as PPR seeds. Seed weights use squared similarity to emphasize high-confidence matches:

    w_seed(n) = sim(n, q)^2    (1)

Seed weights are normalized to form a probability distribution over the teleport vector. The remaining 2B − k KNN candidates are retained in the candidate pool for final scoring.

Table 2: Edge-type base weights for PPR transition probability computation.

Edge Type            Weight
NEXT                 0.8
DERIVED_FROM         0.8
DERIVED_FROM_FACT    0.5
HAS_CONCEPT          0.8
ABOUT_CONCEPT        0.8

Step 2: Graph expansion.
Starting from the k seed nodes, we expand outward along graph edges to depth d (default d = 2), collecting all edges in the local subgraph. Any graph-discovered nodes not already in the KNN candidate pool are fetched and added to it. This expansion enables PPR to discover nodes that are structurally connected to the seeds but may not appear in the KNN results; for example, a fact linked to a seed episode via DERIVED_FROM, or a related episode reachable through a shared concept node via HAS_CONCEPT → Concept ← HAS_CONCEPT.

Step 3: Edge-type-aware transition weights. Each edge from node i to node j with type t has an effective weight w̃_ij = w_base(t), where w_base(t) is a per-type base weight (Table 2). All edges in the current graph use a uniform stored weight (1.0), so the effective weight is determined entirely by edge type. Transition probabilities are computed by per-source normalization:

    P_ij = w̃_ij / Σ_k w̃_ik    (2)

Hub dampening. Although concept nodes have far fewer connections than entity nodes in prior designs, some high-degree nodes can still accumulate excessive PPR mass. For any node i with degree deg(i) > θ (threshold θ = 50), outgoing edge weights are scaled down:

    w̃_damped_ij = w̃_ij · min(1, θ / deg(i))    (3)

This preserves the relative ordering of a hub's neighbors while limiting the total PPR mass that flows through it.

Step 4: Personalized PageRank. We run iterative PPR on the local subgraph with teleport vector v derived from seed weights and damping factor α = 0.6:

    r_j^(t+1) = (1 − α)·v_j + α·Σ_i r_i^(t)·P_ij + α·S^(t)·v_j    (4)

where S^(t) = Σ_{i: deg(i)=0} r_i^(t) is the sink mass, redistributed according to the teleport vector. Iteration continues until convergence (‖r^(t+1) − r^(t)‖₁ < 10⁻⁶) or a maximum of 200 iterations. PPR scores are max-normalized to [0, 1].

Step 5: Additive scoring.
The final relevance score is computed over all candidate nodes: the full 2B KNN pool plus any graph-discovered nodes from Step 2. Both PPR scores and similarity scores are max-normalized to [0, 1], then combined additively:

    score(n) = w_ppr · ppr(n) + w_sim · sim(n, q)    (5)

where w_ppr = 0.1 and w_sim = 1.0 in our default configuration. KNN candidates without PPR scores receive ppr(n) = 0; graph-discovered nodes without KNN similarity have their similarity computed on demand. The low PPR weight reflects a design principle validated by our ablation analysis (Section 5.2): graph traversal should augment similarity-based retrieval rather than dominate it, providing a small but consistent improvement by surfacing structurally connected nodes that embedding similarity alone would miss.

3.4 Memory packing and context assembly

Retrieved nodes are bucketed by type and subject to per-type budget caps: max_facts=60, max_reflections=20, max_episodes=80. This per-type budget is critical: without it, episodes (which tend to have high embedding similarity to conversational queries) dominate the retrieval set, displacing facts that the answer-generation LLM needs for precise responses. Within each bucket, nodes are ranked by their additive score (Equation 5), and the lowest-scored items are removed first until a total word budget (max_memory_words=1000) is satisfied. Episodes are additionally sorted by their temporal sequence number to preserve chronological order in the assembled context. The final memory text is passed to the answer-generation LLM alongside the user query.

4 Experimental Setup

4.1 Benchmark

We evaluate on LoCoMo-10 [Maharana et al., 2024], a subset of the LoCoMo benchmark comprising 10 multi-session conversations (conv-26, 30, 41, 42, 43, 44, 47, 48, 49, 50) with a total of 1,540 questions across four categories (Table 3).

Table 3: Question categories in the LoCoMo-10 benchmark.

Category      Abbrev.   Count   Description
Multi-hop     Cat1      282     Questions requiring information synthesis across sessions
Temporal      Cat2      321     Questions about when events occurred or their ordering
Open Domain   Cat3      96      Questions requiring external or contextual reasoning
Single Hop    Cat4      841     Questions answerable from a single conversation segment

The conversations vary in length (81–199 questions each), topic diversity, and structural complexity, providing a representative evaluation of memory system performance across different conversational patterns. Single-hop questions (Cat4) dominate the benchmark at 54.6% of questions, reflecting the prevalence of direct factual recall in conversational memory evaluation.

4.2 Evaluation protocol

Our evaluation follows a generate-then-judge protocol with three stages:

Stage 1: Memory retrieval. Given a question, the system retrieves relevant memories from its knowledge graph using the hybrid retrieval pipeline (Section 3.3). The retrieved context is assembled into a memory text of at most 1,000 words.

Stage 2: Answer generation. The retrieved memory text and the question are passed to GPT-4o-mini (temperature=0), which generates a hypothesis answer. The generation prompt (Appendix C) instructs the model to answer concisely using only the provided memory, and to indicate when the memory is insufficient.

Stage 3: LLM-as-judge scoring. An LLM judge (GPT-4o-mini, temperature=0) evaluates the generated hypothesis against the ground-truth reference answer, producing a fractional reward score r ∈ [0.0, 1.0]. The judge scores based on key fact coverage: the fraction of key facts from the reference answer that appear in the generated response. Formally:

    r = (number of reference key facts found in hypothesis) / (total key facts in reference answer)    (6)

A score of 1.0 means all key facts are present; 0.0 means none are found. Partial scores reflect partial coverage (e.g., if 2 of 3 key facts are present, r = 0.67).
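As a minimal illustration of Equation (6), the sketch below computes the coverage reward. Note that in the actual protocol the key-fact matching is performed by the LLM judge; the naive case-insensitive substring match here is only an illustrative stand-in, and the function name is hypothetical.

```python
# Illustrative stand-in for Eq. (6). In the paper an LLM judge decides whether
# each reference key fact appears in the hypothesis; here a naive substring
# match approximates that decision.

def coverage_reward(key_facts: list[str], hypothesis: str) -> float:
    """Fraction of reference key facts found in the hypothesis answer."""
    if not key_facts:
        return 0.0
    text = hypothesis.lower()
    found = sum(1 for fact in key_facts if fact.lower() in text)
    return round(found / len(key_facts), 2)

# 2 of 3 key facts present ("sunset painting" is phrased differently) -> r = 0.67
r = coverage_reward(
    ["pottery workshop", "sunset painting", "melanie"],
    "Melanie painted a sunset, likely during the pottery workshop.",
)
```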
Critically, the judge does not penalize extra information beyond the reference answer; only coverage of reference facts matters. This design choice avoids penalizing systems that retrieve and present additional relevant context, which would unfairly disadvantage memory systems with higher recall. The full evaluator prompt is provided in Appendix D.

Table 4: Mean reward (%) on LoCoMo-10 by question category. Best results in bold.

System                              Multi-hop (Cat1)  Temporal (Cat2)  Open Domain (Cat3)  Single Hop (Cat4)  Overall
A-Mem [Xu et al., 2025]             44.7              37.4             50.0                51.5               47.2
Nemori [Nan et al., 2025]           49.4              45.0             36.8                57.4               52.1
HippoRAG [Gutierrez et al., 2024]   61.7              67.0             67.7                74.1               69.9
RAG Baseline                        67.5              59.0             44.6                87.1               75.0
GAAMA (ours)                        72.2              71.9             49.3                87.2               78.9

4.3 Models and configuration

• LLM: GPT-4o-mini for knowledge graph extraction, answer generation, and evaluation judging (temperature=0 for all calls).
• Embeddings: text-embedding-3-small (1536-dimensional) for node embeddings and query encoding.
• PPR configuration: damping factor α = 0.6, max iterations = 200, convergence tolerance = 10⁻⁶, expansion depth d = 2, hub dampening threshold θ = 50, k = 40 KNN seeds.
• Scoring weights: w_ppr = 0.1, w_sim = 1.0.
• Retrieval budget: max_facts=60, max_reflections=20, max_episodes=80, max_memory_words=1000.

4.4 Baselines

We compare against four baselines:

• A-Mem [Xu et al., 2025]: Zettelkasten-inspired memory with structured notes, dynamic link generation, and memory evolution via LLM-driven updates.
• Nemori [Nan et al., 2025]: Narrative-driven structured memory architecture.
• HippoRAG [Gutierrez et al., 2024]: Hippocampal-inspired entity-centric knowledge graph with PPR retrieval.
• RAG Baseline: A tuned retrieval-augmented generation system that embeds each conversation's raw turns separately using text-embedding-3-small (each conversation is indexed independently) and retrieves the top-k most similar chunks per query via cosine similarity, with the same 1,000-word context budget and answer-generation prompt as GAAMA.

All baselines use GPT-4o-mini for answer generation and the same LLM-as-judge evaluation protocol, ensuring that differences in reward scores reflect retrieval quality rather than generation or evaluation artifacts. The RAG baseline is particularly important as a strong baseline that uses the same embedding model and context budget as GAAMA, isolating the contribution of graph structure.

5 Results

5.1 Main results

Table 4 presents the mean reward (%) across all systems and question categories. GAAMA achieves the highest overall reward at 78.9%, outperforming the tuned RAG baseline (75.0%) by 3.9 percentage points and the next best prior system, HippoRAG (69.9%), by 9.0 percentage points.

Analysis by category.

Multi-hop (Cat1): GAAMA achieves 72.2%, a +4.7 pp improvement over RAG (67.5%) and +10.5 pp over HippoRAG (61.7%). The improvement is driven by GAAMA's LTM construction: derived facts distill multi-step information into retrievable atomic units, and reflections synthesize cross-session patterns, both of which enable the answer-generation LLM to resolve multi-hop queries. Concept-mediated graph connections provide additional cross-session linking that complements embedding similarity.

Temporal (Cat2): GAAMA leads at 71.9%, a dramatic +12.9 pp improvement over RAG (59.0%). Note that conversation timestamps are supplied to all methods including RAG, so the improvement is not due to timestamp availability.
Rather, GAAMA's LTM construction derives atomic facts that resolve vague temporal references (e.g., "last week" → a specific date) and synthesizes reflections that consolidate temporal patterns. Additionally, the NEXT edge chain preserves episode ordering, enabling the answer-generation LLM to reason about temporal sequences from the structured context.

Open Domain (Cat3): GAAMA achieves 49.3%, a +4.7 pp improvement over RAG (44.6%), though HippoRAG leads this category at 67.7% (Table 4). Open-domain questions require integrating conversation context with broader reasoning; concept-mediated PPR traversal helps surface structurally connected facts that embedding similarity alone would miss. The relatively lower absolute scores across all systems reflect the inherent difficulty of these questions.

Single Hop (Cat4): GAAMA (87.2%) and RAG (87.1%) perform comparably, both outperforming HippoRAG (74.1%) and A-Mem (51.5%). Single-hop questions about specific conversation content are well served by high-quality embedding retrieval; the typed graph provides marginal additional benefit here.

GAAMA vs. RAG. The comparison with the RAG baseline is particularly informative because both systems use identical embedding models, context budgets, and generation prompts. The 3.9 pp overall improvement reflects the pure contribution of the knowledge graph structure: typed nodes with concept-mediated edges enable better retrieval than flat embedding similarity alone, especially for temporal (+12.9 pp) and multi-hop (+4.7 pp) queries.

5.2 Ablation: Semantic vs. graph-augmented retrieval

To isolate the contribution of graph structure, we compare two GAAMA configurations on the full 1,540-question benchmark:

• Semantic: Pure cosine-similarity retrieval (w_ppr = 0, w_sim = 1). Uses only KNN over node embeddings, with no graph traversal.
• PPR=0.1: Graph-augmented retrieval (w_ppr = 0.1, w_sim = 1). Blends PPR with similarity at a 10:1 similarity-to-PPR ratio.
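To make what the PPR=0.1 configuration adds concrete, the following sketch runs the hybrid scoring of Equations (1)-(5) on a toy subgraph. All names are hypothetical and several details are simplified (the local subgraph is given directly rather than expanded from seeds, and out-degree stands in for deg(i) in the hub-dampening rule); this is a minimal illustration, not the released implementation.

```python
# Minimal sketch of GAAMA-style hybrid scoring (Eqs. 1-5); names illustrative.
# Graph format: node -> list of (neighbor, edge_type) outgoing edges.
from collections import defaultdict

EDGE_W = {"NEXT": 0.8, "DERIVED_FROM": 0.8, "HAS_CONCEPT": 0.8,
          "ABOUT_CONCEPT": 0.8, "DERIVED_FROM_FACT": 0.5}

def hybrid_scores(graph, sims, alpha=0.6, theta=50, w_ppr=0.1, w_sim=1.0,
                  iters=200, tol=1e-6):
    nodes = list(graph)
    # Eq. 1: squared-similarity seed weights, normalized into a teleport vector.
    seed = {n: sims.get(n, 0.0) ** 2 for n in nodes}
    z = sum(seed.values()) or 1.0
    v = {n: w / z for n, w in seed.items()}
    # Eqs. 2-3: edge-type weights with hub dampening (out-degree as proxy for
    # deg(i)), then per-source normalization into transition probabilities.
    P = {}
    for i in nodes:
        out = graph[i]
        damp = min(1.0, theta / len(out)) if out else 1.0
        w = [(j, EDGE_W[t] * damp) for j, t in out]
        tot = sum(x for _, x in w) or 1.0
        P[i] = [(j, x / tot) for j, x in w]
    # Eq. 4: iterative PPR; sink mass is redistributed via the teleport vector.
    r = dict(v)
    for _ in range(iters):
        sink = sum(r[i] for i in nodes if not graph[i])
        acc = defaultdict(float)
        for i in nodes:
            for j, p in P[i]:
                acc[j] += alpha * r[i] * p
        new = {n: acc[n] + (1 - alpha) * v[n] + alpha * sink * v[n] for n in nodes}
        delta = sum(abs(new[n] - r[n]) for n in nodes)
        r = new
        if delta < tol:
            break
    # Eq. 5: max-normalize both signals, then blend additively.
    pmax = max(r.values()) or 1.0
    smax = max(sims.values()) or 1.0
    return {n: w_ppr * r[n] / pmax + w_sim * sims.get(n, 0.0) / smax for n in nodes}

# Toy subgraph: two chained episodes, a derived fact, and a shared concept.
g = {
    "ep1":   [("ep2", "NEXT"), ("c1", "HAS_CONCEPT")],
    "ep2":   [("c1", "HAS_CONCEPT")],
    "fact1": [("ep1", "DERIVED_FROM"), ("c1", "ABOUT_CONCEPT")],
    "c1":    [],  # concept nodes have no outgoing structural edges (a sink)
}
sims = {"ep1": 0.9, "ep2": 0.2, "fact1": 0.5, "c1": 0.1}
scores = hybrid_scores(g, sims)
```

With w_ppr = 0 the ranking reduces to the Semantic configuration; with w_ppr = 0.1 graph structure can only nudge the ordering, matching the paper's "tiebreaker and neighborhood expander" role.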
Both configurations use identical knowledge graphs, retrieval budgets, and answer generation; they differ only in whether graph structure influences the retrieval ranking.

Important distinction from RAG: The "Semantic" baseline here is not equivalent to the RAG baseline in Section 5.1. Semantic retrieval operates over GAAMA's constructed LTM (derived facts, reflections, and verbatim episodes) using KNN over their embeddings. The RAG baseline, by contrast, retrieves from raw conversation chunks without any LTM construction. Thus, the semantic baseline already benefits from GAAMA's three-step knowledge graph construction (which distills facts and reflections from raw conversations); this ablation isolates specifically whether PPR-based graph traversal provides additional value over semantic search on the same LTM.

Per-category comparison. Table 5 shows the per-category breakdown. PPR=0.1 improves overall reward by +1.0 pp (78.0% → 78.9%), with the largest gains on single-hop questions (+1.6 pp) and temporal questions (+0.5 pp). The multi-hop and open-domain categories show negligible change.

Per-sample comparison. Table 6 shows per-conversation overall reward for both configurations. PPR=0.1 improves on 8 of 10 conversations, with the largest gains on conv-30 (+3.8 pp) and conv-26 (+2.1 pp). Only two conversations show slight degradation (conv-41: −0.3 pp, conv-47: −1.7 pp).

Per-sample per-category heatmap. Table 7 presents a heatmap of the reward delta (PPR=0.1 − Semantic, in percentage points) across all conversations and categories. Green cells indicate PPR improvement; red cells indicate degradation. Several patterns emerge from the heatmap:

Table 5: Ablation: Semantic vs. PPR=0.1 retrieval, per category. ∆ = PPR=0.1 − Semantic (pp).
Category            n      Semantic (%)   PPR=0.1 (%)   ∆ (pp)
Cat1 (Multi-hop)    282    72.2           72.2           0.0
Cat2 (Temporal)     321    71.4           71.9          +0.5
Cat3 (Open Domain)  96     49.1           49.3          +0.2
Cat4 (Single Hop)   841    85.7           87.2          +1.6
Overall             1540   78.0           78.9          +1.0

Table 6: Per-conversation overall reward (%) for Semantic vs. PPR=0.1. ∆ = PPR=0.1 − Semantic (pp).

Conversation   n     Semantic (%)   PPR=0.1 (%)   ∆ (pp)
conv-26        152   75.1           77.3          +2.1
conv-30        81    75.4           79.2          +3.8
conv-41        152   83.0           82.7          −0.3
conv-42        199   73.2           74.1          +0.9
conv-43        178   76.1           77.8          +1.6
conv-44        123   79.1           80.3          +1.2
conv-47        150   80.8           79.1          −1.7
conv-48        191   78.9           80.1          +1.1
conv-49        156   75.5           76.0          +0.5
conv-50        158   82.8           84.3          +1.5

Cat4 (Single Hop) is the most consistent beneficiary. PPR improves Cat4 reward on 7 of 10 conversations, with the three negative cases (conv-41: −2.0, conv-44: −0.3, conv-47: −0.6) showing only marginal degradation. Single-hop questions often ask about specific aspects of past conversations that benefit from graph expansion discovering related episodes via shared concept nodes.

Cat2 (Temporal) shows high variance. PPR produces both strong gains (conv-43: +5.8, conv-42: +5.0) and moderate losses (conv-47: −4.9, conv-50: −4.7). Temporal questions benefit when PPR traverses NEXT edges to discover temporally adjacent episodes, but can be hurt when graph expansion introduces temporally distant content that confuses the temporal reasoning of the answer-generation LLM.

Cat3 (Open Domain) shows the largest individual deltas. The most extreme values appear in Cat3: conv-50 at +14.3 pp and conv-44 at +11.9 pp, alongside conv-42 at −9.1 pp. This reflects the nature of open-domain questions: when PPR traversal surfaces the right cross-session connections, performance improves dramatically; when it introduces noise, the effect is equally pronounced. The small category size (7–14 questions per conversation) amplifies individual question effects.

Per-sample overall deltas are predominantly positive.
Eight of ten conversations show positive overall deltas, with a mean of +1.0 pp. The two negative cases (conv-41: −0.3, conv-47: −1.7) show only marginal degradation.

5.3 Discussion

Why mild PPR (0.1) outperforms strong PPR. In early experiments with PPR weight 1.0 on the same knowledge graph, overall reward was 74.6%, below even the semantic-only configuration (78.0%). Strong PPR allows graph traversal to override embedding similarity, introducing structurally connected but semantically irrelevant nodes into the retrieval set. The additive scoring with w_ppr = 0.1 ensures that graph structure can only promote nodes that already have reasonable semantic relevance, acting as a tiebreaker and neighborhood expander rather than an override.

Graph structure quality is the main bottleneck. Error analysis reveals that most retrieval failures stem from the knowledge graph construction phase rather than the retrieval algorithm. Analysis of the constructed LTM (10,224 nodes; 15,037 edges) reveals three concrete failure modes:

Table 7: Heatmap of reward delta (pp): PPR=0.1 − Semantic, per conversation and category. Color bins in the original: >+5, +2 to +5, 0 to +2, −2 to 0, −5 to −2, <−5.

Conv.     Cat1    Cat2    Cat3     Cat4    Overall
conv-26   +1.8    +4.0    −3.8     +2.4    +2.1
conv-30   −2.3    +3.8    —        +5.3    +3.8
conv-41   +0.8    +3.7    0.0      −2.0    −0.3
conv-42   −4.6    +5.0    −9.1     +2.3    +0.9
conv-43   −4.0    +5.8    −2.4     +2.8    +1.6
conv-44   +6.0    −4.2    +11.9    −0.3    +1.2
conv-47   +0.8    −4.9    −3.8     −0.6    −1.7
conv-48   −1.2    −4.4    0.0      +3.6    +1.1
conv-49   +1.6    +1.5    +5.2     −1.4    +0.5
conv-50   0.0     −4.7    +14.3    +3.3    +1.5
Avg.      0.0     +0.5    +0.2     +1.6    +1.0

(1) Generic concept nodes. The most-connected concepts, personal_growth (44 edges, 6 conversations), travel_experience (42 edges, 5 conversations), and nature_appreciation (38 edges, 7 conversations), are too broad to provide discriminative PPR paths.
For comparison, specific concepts like car_restoration (20 edges, 1 conversation) or game_development (28 edges, 1 conversation) provide tight, useful traversal clusters.

(2) Near-duplicate concepts. The LLM extraction produces singular/plural variants that fragment the graph: supportive_relationships vs. supportive_relationship, outdoor_adventure vs. outdoor_adventures, book_recommendation vs. book_recommendations. We find 15+ such pairs at >97% string similarity. A canonicalization step (e.g., lemmatization before concept insertion) would consolidate these and strengthen PPR traversal.

(3) Overlapping thematic concepts. Closely related concepts like nature_appreciation and nature_exploration connect to similar episodes but are treated as separate nodes, splitting PPR mass rather than concentrating it. Merging semantically equivalent concepts would create stronger associative paths.

Concept nodes vs. entity nodes. Our migration from entity-centric to concept-mediated graphs was motivated by empirical observation: in our entity-centric v1 design, person entities accumulated 400-500+ edges each. Hub dampening (threshold=50) reduced but did not eliminate the diffusion problem. The concept-node design produces graphs approximately 30× sparser, where PPR traversal follows thematic paths (e.g., artistic_creation → multiple painting episodes) rather than funneling through person hubs.

Per-type budget matters. Without per-type budgets, episodes (which have naturally high embedding similarity to conversational queries) dominate the retrieval set, displacing facts by 3-7% on multi-hop questions (Cat1). The budget mechanism forces content diversity across node types, which the answer-generation LLM requires for comprehensive responses.
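The additive scoring discussed above (cosine similarity plus a mild PPR term with weight 0.1) can be sketched in plain Python. This is an illustrative reconstruction, not the paper's implementation: the adjacency format, the seeding of PPR from the top semantic matches, and the helper names personalized_pagerank and hybrid_scores are our assumptions, and the edge-type-aware weights are assumed to already be baked into the per-edge weights.

```python
W_PPR = 0.1  # mild PPR weight; the paper reports strong PPR (1.0) underperforms

def personalized_pagerank(adj, seeds, alpha=0.85, iters=50):
    """Power-iteration PPR that restarts on the seed nodes.

    adj: dict node -> list of (neighbor, edge_weight) pairs. Edge-type-aware
    weighting is assumed to be precomputed into edge_weight."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n, out in adj.items():
            total = sum(w for _, w in out)
            if total == 0:  # dangling node: its mass restarts on the seeds
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
            else:
                for m, w in out:
                    nxt[m] += alpha * rank[n] * w / total
        rank = nxt
    return rank

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    du = sum(a * a for a in u) ** 0.5
    dv = sum(b * b for b in v) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def hybrid_scores(adj, embeddings, query_emb, k_seed=5):
    """Additive hybrid score: cosine similarity + W_PPR * PPR.

    Because W_PPR is small, graph structure can only promote nodes that
    already have reasonable semantic relevance."""
    sem = {n: cosine(query_emb, e) for n, e in embeddings.items()}
    seeds = set(sorted(sem, key=sem.get, reverse=True)[:k_seed])
    ppr = personalized_pagerank(adj, seeds)
    return {n: sem.get(n, 0.0) + W_PPR * ppr.get(n, 0.0) for n in adj}
```

Since the total PPR mass is 1 and the weight is 0.1, the graph term can shift any node's score by at most 0.1, which is why it reorders near-ties and expands neighborhoods rather than overriding embedding similarity.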
6 Conclusion

We presented GAAMA, a graph-augmented associative memory system for AI agents that combines a three-step knowledge graph construction pipeline with concept-mediated Personalized PageRank retrieval. A key contribution is the hierarchical LTM construction: raw conversations are first preserved as verbatim episodes, then distilled into atomic facts (resolving temporal references, extracting cross-session knowledge) and higher-order reflections (synthesizing patterns across facts). This hierarchical representation is itself a major driver of performance: even without PPR, semantic retrieval over the constructed LTM (facts, reflections, episodes) outperforms flat RAG by 3.0 pp (78.0% vs. 75.0%), demonstrating that the knowledge distillation pipeline adds significant value independent of graph structure. The concept-node design then provides structurally non-redundant traversal paths for hybrid PPR, which blends graph-based relevance propagation with semantic similarity. This combination yields an additional +1.0 pp improvement, achieving 78.9% overall on LoCoMo-10, surpassing HippoRAG (69.9%), A-Mem (47.2%), and Nemori (52.1%).

Several directions for future work remain. First, concept canonicalization (merging near-duplicate concepts such as singular/plural variants and consolidating semantically overlapping concepts) would strengthen PPR paths and reduce graph fragmentation. Second, adaptive PPR gating (a lightweight neural model that predicts per query whether graph traversal helps) could further improve selectivity, particularly for open-domain questions where PPR shows high variance. Third, edge weight learning via backpropagation through the PPR computation could replace hand-tuned edge-type weights with values optimized end-to-end for retrieval quality.

References

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W. Yih, T. Rocktaschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

B. Gutierrez, Y. Yang, H. Gu, M. Srinivasa, Y. Li, S. Tian, X. Sun, J. Yang, M. Huang, and Y. Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

B. Gutierrez, Y. Shu, Y. Gu, W. Qi, S. Zhou, and Y. Su. From RAG to memory: Non-parametric continual learning for large language models. In Proc. of the 42nd International Conference on Machine Learning (ICML), 2025.

W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang. A-Mem: Agentic memory for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

Y. Yu, L. Yao, Y. Xie, Q. Tan, J. Feng, Y. Li, and L. Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.

J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint, 2026.

J. Nan, W. Ma, W. Wu, and Y. Chen. Nemori: Self-organizing agent memory inspired by cognitive science. arXiv preprint, 2025.

D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

G. Nan, Z. Guo, S. Seo, T. Rossi, and B. Norick. GraphRAG: Unlocking LLM discovery on narrative private data. Microsoft Research Blog, 2024.

S. Yasunaga, H. Ren, A. Bosselut, P. Liang, and J. Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In Proc. of NAACL, 2021.

G. Jeh and J. Widom. Scaling personalized web search. In Proc. of the 12th International World Wide Web Conference (WWW), pages 271-279, 2003.

R. Maharana, J. Lee, D. Tuia, and D. Graus. LoCoMo: Long conversational memory benchmark for multi-session dialogues. In Proc. of ACL, 2024.

S. K. Paul. Continually learning planning agent for large environments guided by LLMs. In 2024 IEEE Conference on Artificial Intelligence (CAI), pages 377-382, 2024. doi: 10.1109/CAI59869.2024.00076.

A Fact and concept extraction prompt

The following prompt is from gaama/prompts/fact_generation.md. Placeholders {{existing_facts}}, {{existing_concepts}}, {{related_episodes}}, and {{new_episodes}} are filled at runtime.

# Extract facts and concepts from conversation episodes

You are a knowledge extraction system. Given a set of new conversation episodes and context, extract NEW factual claims AND topic concepts.

## Part 1: Facts

### Rules

1. Each fact must be a **single, specific, atomic claim** (e.g., "User's birthday is March 15, 1990").
2. **Do NOT duplicate existing facts.** If an existing fact already captures the information, skip it.
3. **Resolve relative dates to absolute dates** using the conversation timestamp. For example, if the conversation date is "2023-06-15" and the user says "last week", resolve to approximately "2023-06-08".
4. Derive general knowledge from episodes by doing multi-step reasoning where possible.
5. Do not extract events or interactions as facts. Only extract general knowledge, preferences, attributes, or relationships that can be applied broadly.
6. Each fact should stand alone without requiring the original conversation for context.
7. For each fact, list which concept(s) it relates to (from the concepts you extract below).

## Part 2: Concepts

### Rules

1. Concepts are short topic labels (2-5 words, snake_case) representing activities, events, topics, or themes.
2. **Good concepts**: camping_trip, adoption_process, beach_outing, charity_run, art_expression, career_transition, family_vacation, marathon_training
3. **Do NOT use**: person names, generic words (e.g., NOT family, life, experience, conversation, sharing), adjectives (e.g., NOT beautiful, amazing), dates.
4. **Reuse existing concepts** when applicable. Only create new concepts when no existing one fits.
5. Each new episode should have 1-3 concepts.
6. Each concept must be linked to the episode IDs it appears in.

## Output format (JSON only, no markdown fences)

Return a single JSON object:

{"facts": [
  {"fact_text": "Melanie painted a lake sunrise in 2022",
   "belief": 0.95,
   "source_episode_ids": ["ep-abc123", "ep-def456"],
   "concepts": ["artistic_creation", "painting_hobby"]}
],
 "concepts": [
  {"concept_label": "artistic_creation", "episode_ids": ["ep-abc123", "ep-def456", "ep-ghi789"]},
  {"concept_label": "painting_hobby", "episode_ids": ["ep-abc123"]}
]}

### Facts fields

- **fact_text** (required): The complete fact statement.
- **belief** (0.0-1.0): Confidence. 1.0 = explicit. 0.8 = inferred.
- **source_episode_ids** (required): Episode node_ids that support this fact.
- **concepts** (required): Concept labels this fact relates to.

### Concepts fields

- **concept_label** (required): Short snake_case topic label.
- **episode_ids** (required): Episode node_ids for this concept.

If no new facts, return {"facts": [], "concepts": []}. Do **not** add markdown code fences around the JSON.

---

## Existing facts (do NOT duplicate these)

{{existing_facts}}

## Existing concepts (reuse when applicable, do NOT duplicate)

{{existing_concepts}}

## Related older episodes (for context)

{{related_episodes}}

## New conversation episodes (extract from these)

{{new_episodes}}

B Reflection generation prompt

The following prompt is from gaama/prompts/reflection_generation.md.

# Generate new reflections from facts

You are an insight generation system. Given a set of new facts and context (related existing facts and existing reflections), generate NEW reflections or insights.

## What is a reflection?
A reflection is a generalized insight, preference pattern, lesson learned, or higher-order observation that emerges from combining multiple facts. Examples:

- "User tends to prefer lightweight tools over full-featured IDEs"
- "User's debugging approach always starts with log analysis before code inspection"
- "User values documentation and testing in their development workflow"

## Rules

1. Each reflection should synthesize information from multiple facts when possible.
2. **Do NOT duplicate existing reflections.** If an existing reflection already captures the insight, skip it.
3. Reflections should be actionable or informative -- they should help in future interactions.
4. Each reflection should stand alone without requiring the original facts for context.
5. Only generate reflections when there is genuine insight to be drawn. It is perfectly fine to return zero reflections.

## Output format (JSON only, no markdown fences)

Return a single JSON object:

{"reflections": [
  {"reflection_text": "User consistently prefers minimalist tools and configurations across all development environments",
   "belief": 0.8,
   "source_fact_ids": ["fact-abc123", "fact-def456"]}
]}

- **reflection_text** (required): The insight in natural language.
- **belief** (0.0-1.0): Confidence. Higher when supported by multiple consistent facts.
- **source_fact_ids** (required): Fact node_ids this reflection is derived from.

If no new reflections, return {"reflections": []}. Do **not** add markdown code fences around the JSON.

---

## Existing reflections (do NOT duplicate these)

{{existing_reflections}}

## Related existing facts (for context)

{{related_facts}}

## New facts (generate reflections from these)

{{new_facts}}

C Answer generation prompt

The following prompt is from gaama/prompts/answer_from_memory.md.

# Answer from memory

You are a precise answer assistant. Given a **query** and the **retrieved memory** below, answer the query using the provided memory.
## Query

{{query}}

## Retrieved memory

{{memory_text}}

## Instructions

- Answer the query in one or two short paragraphs. Be direct and specific.
- Extract concrete answers from the memory even if the information is scattered across multiple items. Synthesize and combine partial evidence.
- When counting occurrences (e.g., "how many times"), carefully scan ALL memory items and count each distinct instance.
- When listing items (e.g., "which cities"), exhaustively list EVERY item mentioned across all memory entries.
- Prefer giving a direct answer over saying "the memory does not specify." If the memory contains relevant clues, use them to form a best-effort answer.
- Do not repeat the query. Do not cite section headers; use the memory content naturally.

D Evaluator prompt

The following prompt is used by the LLM judge (GPT-4o-mini, temperature=0) to evaluate generated answers against the reference answer. The identical prompt is used when evaluating every method.

You are an evaluator. Given a question, a correct reference answer, and a generated response (hypothesis), determine what fraction of the reference answer is present in the generated response.

IMPORTANT: Only check whether the key facts from the reference answer appear in the generated response. Do NOT penalize the response for containing extra information, additional details, or tangential content beyond the reference answer. The ONLY thing that matters is whether the reference answer's key facts are covered.

Scoring guidelines:
- 1.0: All key facts and details from the reference answer are present in the generated response (even if the response also contains extra information).
- 0.0: None of the key facts from the reference answer appear in the generated response.
- Between 0.0 and 1.0: Some key facts from the reference answer are present. Score = (number of reference answer key facts found) / (total key facts in reference answer).
For example, if the answer has 3 key facts and 2 are found in the response, score = 0.67.

You MUST respond in the following JSON format (no markdown, no extra text):
{"reward": <score>, "justification": "<justification>"}

Question: {question}

Correct Reference Answer: {answer}

Generated Response: {hypothesis}

Evaluate ONLY the coverage of the reference answer's facts. Do NOT reduce the score for extra information. Respond with the JSON only.
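The coverage-based reward above can be recovered programmatically from the judge's response. A minimal sketch, assuming the judge returns a single JSON object with a numeric reward field; the helper names parse_judge_reward and coverage_score are ours, not from the paper, and the defensive fence-stripping reflects the common failure where judges wrap output in markdown fences despite instructions:

```python
import json

def parse_judge_reward(raw: str) -> float:
    """Parse the LLM judge's response and clamp the reward to [0, 1].

    Expects a bare JSON object like {"reward": 0.67, "justification": "..."};
    strips accidental markdown code fences before parsing."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        # drop an optional language tag like "json" on the first line
        text = text.split("\n", 1)[-1] if "\n" in text else text
    obj = json.loads(text)
    return max(0.0, min(1.0, float(obj["reward"])))

def coverage_score(facts_found: int, facts_total: int) -> float:
    """The prompt's scoring rule: fraction of reference key facts covered."""
    return round(facts_found / facts_total, 2) if facts_total else 0.0
```

The clamp guards against judges occasionally emitting rewards outside [0, 1], so a malformed score never inflates the benchmark mean.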