Airavat: An Agentic Framework for Internet Measurement
Alagappan Ramanathan, University of California, Irvine
Eunju Kang, University of California, Irvine
Dongsu Han, KAIST
Sangeetha Abdu Jyothi, University of California, Irvine

ABSTRACT
Internet measurement faces twin challenges: complex analyses require expert-level orchestration of tools, yet even syntactically correct implementations can have methodological flaws and can be difficult to verify. Democratizing measurement capabilities thus demands automating both workflow generation and verification against methodological standards established through decades of research.

We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation. Airavat coordinates a set of agents mirroring expert reasoning: three agents handle problem decomposition, solution design, and code implementation, with assistance from a registry of existing tools. Two specialized engines ensure methodological correctness: a Verification Engine evaluates workflows against a knowledge graph encoding five decades of measurement research, while a Validation Engine identifies appropriate validation techniques grounded in established methodologies. Through four Internet measurement case studies, we demonstrate that Airavat (i) generates workflows matching expert-level solutions, (ii) makes sound architectural decisions, (iii) addresses novel problems without ground truth, and (iv) identifies methodological flaws missed by standard execution-based testing.

1 INTRODUCTION
Agentic AI systems leveraging Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, from code generation to scientific reasoning [49, 53, 55]. Their ability to decompose complex problems, explore solution spaces, and synthesize executable implementations positions them as powerful tools for automating sophisticated analytical workflows.
However, applying agentic systems to Internet measurement research faces two fundamental challenges that limit their practical deployment.

First, Internet measurement analyses require expert-level orchestration of multiple specialized tools—BGP analyzers [4, 7, 8, 14, 15, 26, 41, 46], traceroute processors [19, 39, 50, 51], topology mappers [36, 43, 60], and performance monitors [5, 6, 10, 20, 45]—each with unique interfaces, data formats, and domain knowledge requirements. When researchers need to understand routing behavior, infrastructure dependencies, or performance anomalies, they must manually integrate different measurement systems through custom solutions.

∗Sangeetha Abdu Jyothi holds concurrent appointments at UC Irvine and Amazon. This publication describes work performed at UC Irvine and is not associated with Amazon.

Recent events highlight this challenge's practical impact. The AAE-1 cable cuts [2] and FALCON cable failure [3] caused widespread outages, requiring rapid development of workflows integrating cable mapping, BGP analysis, and traffic flow assessment. Similar challenges arise regularly: understanding CDN performance degradation requires correlating traceroute data with BGP changes [22, 37, 62]; investigating security incidents demands integrating multiple measurement perspectives. While recent measurement frameworks [43, 45] offer powerful capabilities, these tools operate in isolation and require specialized knowledge. Experts must spend days developing measurement workflows before analysis can begin, creating a substantial barrier: the ability to compose advanced measurement workflows requires specialized domain experience, limiting such capabilities to a small community of experts.

Second, unlike agentic systems for system optimization whose evolution is guided by well-defined performance goals, verifying and validating measurement workflows is inherently difficult.
Agentic systems may generate executable code for Internet measurement with methodological flaws that corrupt analytical results while appearing to function properly. Traditional software testing approaches—checking code syntax, validating output formats, and ensuring execution completes—prove insufficient for measurement workflows, where correctness depends on tacit domain knowledge of data artifacts, appropriate preprocessing steps, and methodological precedents established over decades of research. For instance, an agent might correctly implement longest-prefix-match operations yet fail to filter routing table artifacts, producing meaningless results despite successful execution.

We take an alternative view. What if network operators could ask high-level questions in natural language and receive executable measurement workflows in minutes? What if researchers could compose Internet measurement tools without specialized training in each framework while ensuring generated solutions meet established quality standards?

We present Airavat, the first domain-specific agentic framework for Internet measurement workflow generation with systematic verification and validation capabilities. Airavat addresses the dual challenges by automating workflow generation and verification, while also supporting human-in-the-loop oversight to ensure methodological rigor. Our key insight is that measurement workflow development follows predictable patterns decomposed into distinct phases: problem analysis, solution design, and implementation. Airavat executes these phases through three specialized agents—QueryMind, WorkflowScout, and SolutionWeaver—operating on a curated Registry of measurement tools.
To improve trust, Airavat adds two specialized engines that systematically verify and validate generated workflows against established measurement methodologies, ensuring methodological correctness before deployment. Users express measurement goals in natural language, and the system generates executable measurement solutions that provide complete workflows or serve as foundations for expert refinement.

To demonstrate Airavat's capabilities, we present four distinct Internet measurement case studies. Our evaluation demonstrates Airavat's ability to (i) independently generate workflows that produce analytical outputs similar to expert-designed solutions, with the relevant solutions removed from the Registry and the Knowledge Graph (§6.1, §6.2), (ii) orchestrate complex analysis across multiple measurement frameworks with significant integration complexity (§6.3), (iii) identify critical methodological flaws missed by standard execution-based testing through systematic verification (§7), and (iv) synthesize validation methods for generated workflows guided by prior literature (§8).

In summary, we make the following contributions.
• We present Airavat, the first domain-specific agentic framework that translates high-level Internet measurement queries into executable workflows.
• We introduce a verification engine in Airavat that evaluates generated workflows against five decades of Internet measurement research encoded in a structured knowledge graph. Unlike execution-based testing, our approach detects methodological flaws that silently corrupt results.
• We design a validation engine that discovers, adapts, and implements appropriate validation techniques from prior literature, enabling measurement workflows to be checked using alternative methods.
• Through extensive Internet measurement case studies, including cross-layer infrastructure resilience and IP allocation analysis, we demonstrate the capabilities of Airavat.
2 AIRAVAT DESIGN OVERVIEW
Building on our vision for an agentic framework for Internet measurement [44], we present Airavat, which tackles the fundamental challenge of generating trustworthy measurement workflows from natural language queries. While our prior work demonstrated workflow generation capabilities, Airavat addresses additional core problems essential for trustworthiness. First, workflow design requires systematically decomposing complex measurement tasks into executable steps. Second, workflow verification and validation must ensure that generated workflows are scientifically rigorous and methodologically sound.

To address these challenges, Airavat integrates specialized subsystems shown in Figure 1. The Multi-Agent Workflow Generation pipeline emulates expert reasoning, with specialized agents handling problem decomposition, solution design, and implementation. The Registry, a catalog of measurement tools, assists this process. The Registry Curator enables evolution of the registry by extracting proven patterns from generated workflows and codifying them as reusable capabilities. To enable workflow verification, we construct the Knowledge Graph, which encodes measurement literature as semantic relationships and validation patterns that enable literature-grounded quality assessment. The Verification Engine evaluates workflow quality through a literature-grounded assessment and addresses identified deficiencies. The Validation Engine generates executable validation code by identifying applicable approaches from the literature and adapting them to specific problems.

Scope. Airavat focuses on workflow composition and code generation for measurement analysis—weaving together existing measurement tools and data sources to generate executable measurement solutions that provide complete workflows or serve as foundations for expert refinement.
Out of scope are distributed measurement collection itself, novel data acquisition, and improvements to individual measurement tools.

Figure 1: Airavat's architecture, comprising a multi-agent workflow generation pipeline, verification engine, validation engine, knowledge graph, and registry.

3 WORKFLOW GENERATION
In this section, we detail the key components enabling workflow generation.

3.1 Registry: Tool Base
To reason about workflow composition, agents need structured knowledge about available measurement capabilities. The Registry is a manually curated catalog describing what measurement tools can do, not how they do it. This abstraction emerged from early experiments where exposing entire codebases overwhelmed agents, causing them to miss key capabilities. Each registry entry specifies an open-source tool's capabilities, its required inputs, expected outputs, and operational constraints in a standardized format. This abstraction scales linearly with available tools rather than with codebase complexity—a framework with 10,000 lines of code contributes a single registry entry. The standardized specification enables agents to evaluate tool applicability, understand integration requirements, and compose multi-tool workflows without framework-specific knowledge. We bootstrap the Registry with a set of manually curated tools. The Registry evolves organically as the Registry Curator (§3.5) identifies reusable patterns from successful workflows and proposes new entries, though all additions undergo manual validation to maintain quality standards.

3.2 QueryMind: Problem Decomposition
The QueryMind Agent transforms user queries into structured problem representations by decomposing them into sub-problems, dependencies, and constraints. This agent solves a specific problem: natural language queries contain hidden complexity and implicit assumptions that must be made explicit before solution design can proceed.
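As a concrete illustration of the Registry abstraction in §3.1, a minimal entry might look like the following sketch. The schema, field names, and the example tool description are our illustrative assumptions, not Airavat's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class RegistryEntry:
    """Illustrative schema for one Registry entry: it records what a tool
    can do at the interface level, not how the tool does it."""
    name: str                       # tool identifier
    capabilities: list[str]         # high-level tasks the tool supports
    required_inputs: list[str]      # data the tool consumes
    expected_outputs: list[str]     # artifacts the tool produces
    constraints: list[str] = field(default_factory=list)  # operational limits

# A hypothetical entry for a cross-layer submarine-cable mapping tool:
nautilus = RegistryEntry(
    name="nautilus",
    capabilities=["map submarine cables to IP links", "identify landing points"],
    required_inputs=["cable name", "traceroute/BGP data"],
    expected_outputs=["cable-to-IP-link mapping", "landing point list"],
    constraints=["requires pre-collected traceroute corpus"],
)

def matches(entry: RegistryEntry, needed_capability: str) -> bool:
    """Agents evaluate tool applicability by scanning capability text."""
    return any(needed_capability in c for c in entry.capabilities)
```

Because entries describe capabilities at the interface level only, the Registry grows by one entry per tool regardless of the tool's code size, which is the scaling property noted above.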
Decomposing "measure CDN performance" into latency analysis across regions, cache behavior evaluation, and temporal consistency checking enables subsequent agents to design targeted solutions rather than over-generalized workflows.

We use a task-agnostic prompt in QueryMind to systematically decompose measurement queries into manageable subproblems by examining five dimensions of complexity: temporal (evolution over time), spatial (geographic/network regions), causal (primary/secondary/tertiary effects), stakeholders (multiple perspectives), and data (complementary sources). It assigns a complexity score (0-5) that guides WorkflowScout's exploration strategy. The agent prioritizes early constraint evaluation: even a theoretically excellent workflow is infeasible if the required data or technical infrastructure is unavailable. It then defines success criteria to prevent under-analysis and over-engineering, and maps sub-problems to relevant Registry functions, providing the WorkflowScout Agent with focused guidance (detailed output schema in Appendix Figure 6).

3.3 WorkflowScout: Workflow Design
The WorkflowScout Agent converts structured sub-problems into concrete solution architectures. This separation from implementation is essential: solution design requires exploring multiple competing approaches and evaluating trade-offs, a fundamentally different task from writing executable code. A monolithic approach either skips exploration to focus on implementation, or produces verbose code exploration that obscures the actual solution architecture.

WorkflowScout employs an adaptive exploration strategy that scales with problem complexity. Simple queries (complexity score 0-1, as generated by QueryMind) receive direct solutions, since alternatives provide minimal benefit. Moderate complexity (2-3) triggers a primary approach plus 1-2 complementary alternatives targeting missed insights.
Complex problems (4-5) require three or more approaches from different analytical perspectives. This strategy reflects measurement research practice: stronger conclusions come from multiple independent methods rather than from optimizing a single approach. For high-risk core components (steps that address primary success criteria or critical data transformations), the agent designs alternative approaches under different assumptions and failure modes, ensuring backup options if any approach fails. For method selection, the agent favors ensemble approaches over parameter tuning, combining multiple complementary methods using four composition patterns (detailed in Appendix E). WorkflowScout produces a comprehensive design specification for the SolutionWeaver Agent.

3.4 SolutionWeaver: Code Implementation
The SolutionWeaver Agent converts workflow designs into executable code that integrates heterogeneous measurement tools. Since independently designed frameworks may use diverse data representations, the agent implements format translation using registry specifications to ensure seamless data flow. A critical challenge is that LLMs produce plausible but non-executable code. Airavat addresses this through strict execution requirements, ensuring generated code processes real measurement data with complete implementations. Early experiments revealed systematic failure modes, such as synthetic data generation, placeholder functions, and oversimplified operations, prompting quality assurance through validation checkpoints embedded throughout generation. This produces code requiring minimal manual correction. The agent also documents functions with reusability potential for the Registry Curator Agent.
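Taken together, QueryMind's complexity score (§3.2) steers how broadly WorkflowScout explores (§3.3). The thresholds below come from the text; the helper function itself is a hypothetical sketch, since the real routing happens inside agent prompts rather than in code:

```python
def exploration_breadth(complexity: int) -> int:
    """Minimum number of candidate workflow designs WorkflowScout
    explores for a given QueryMind complexity score (0-5)."""
    if not 0 <= complexity <= 5:
        raise ValueError("complexity score must be in 0-5")
    if complexity <= 1:   # simple: direct solution, alternatives add little
        return 1
    if complexity <= 3:   # moderate: primary approach + 1-2 alternatives
        return 2
    return 3              # complex: three or more independent perspectives
```

For example, `exploration_breadth(4)` returns 3, matching the rule that complex problems require at least three approaches from different analytical perspectives.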
3.5 Registry Curator: Registry Evolution
The Registry Curator Agent ensures Airavat's capabilities grow organically by identifying reusable patterns from successful workflows and proposing them for registry inclusion. Manual curation does not scale as the system generates more workflows, so the agent analyzes successful implementations to identify data processing utilities, analysis algorithms, and integration functions demonstrating utility beyond the original query context.

Critically, not all patterns merit generalization. The agent employs a validation-first strategy requiring proposed functions to pass four tests: (1) cross-scenario utility (value in 2-3 different use cases), (2) accuracy assessment (correct outputs on test cases), (3) integration compatibility (proper interaction with existing registry functions), and (4) edge-case handling (appropriate behavior with invalid/boundary inputs). The agent must generate complete, executable validation code for every proposed function. This strict validation prevents registry bloat, which would increase token costs and degrade agent performance by overwhelming agents with excessive choices. Validation code runs manually to ensure quality control.

4 WORKFLOW VERIFICATION
In this section, we detail the key components of Airavat that enable workflow verification.

4.1 Knowledge Graph
The Knowledge Graph condenses measurement research literature into a queryable resource for the Verification and Validation Engines. The construction pipeline addresses three challenges: preventing LLM hallucination during extraction, supporting both semantic search and relationship traversal, and processing thousands of papers cost-effectively.

Corpus Collection and Classification. Airavat targets four ACM conferences (SIGCOMM, SIGMETRICS, IMC, CoNEXT) and associated journals (POMACS, PACMNET) that capture foundational measurement methodologies and validation techniques from the past five decades.
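The corpus collection stage of §4.1 amounts to a metadata filter over venue and paper length. A hedged sketch, assuming Crossref-style metadata flattened into a dict (the field names are ours, and we assume the 6-page threshold is inclusive):

```python
VENUES = {"SIGCOMM", "SIGMETRICS", "IMC", "CoNEXT", "POMACS", "PACMNET"}
MIN_PAGES = 6  # papers below this threshold lack methodological detail

def keep_paper(meta: dict) -> bool:
    """Corpus filter: target-venue membership plus page-count threshold.
    `meta` mirrors scraped Crossref metadata (title, venue, page count)."""
    return meta["venue"] in VENUES and meta["pages"] >= MIN_PAGES

papers = [
    {"title": "A", "venue": "IMC", "pages": 14},
    {"title": "B", "venue": "IMC", "pages": 2},    # short paper / poster
    {"title": "C", "venue": "NSDI", "pages": 16},  # out-of-corpus venue
]
print([p["title"] for p in papers if keep_paper(p)])  # → ['A']
```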
An automated scraper extracts DOIs and metadata (title, authors, page count, abstract) via the Crossref API, applying a 6-page threshold to exclude papers lacking sufficient methodological detail. To focus on measurement research, a two-stage classification pipeline categorizes papers into 24 measurement research areas using a sentence transformer, with low-confidence papers undergoing secondary classification via a zero-shot classifier. This design prioritizes efficiency—the sentence transformer processes the entire corpus while the more expensive zero-shot classifier handles only uncertain cases—achieving accuracy comparable to cloud-based LLMs at no cost by running local models.

Content Extraction. Extracting structured information requires parsing PDFs and identifying content across five predefined categories: problem statements, methodologies, data sources, baseline comparisons, and validations (example in Appendix Figures 4 and 5). Airavat uses GROBID to parse PDFs into structured TEI format. Initial single-model extraction encountered hallucination and category confusion. A two-stage architecture addresses this: LLama-3.1-8B first classifies which categories are present in each section, then LLama-3.3-70B extracts content from the tagged sections, reducing confusion by restricting extraction to appropriate sections.

Graph Construction and Representation. With the structured information, Airavat constructs a Neo4j knowledge graph encoding measurement domain knowledge through semantic embeddings and typed relationships. The entity model includes ten entity types: Papers, Problems, ResearchGaps, Approaches, PipelineSteps, Algorithms, Metrics, Parameters, Datasets, and Validations. Entities connect through typed relationships (detailed in Appendix B).

Benefits and Transferability.
The graph representation enables semantic search, efficient multi-hop traversal, and pattern discovery, serving as the foundation for the Verification and Validation Engines to query precedents and assess methodological soundness. The system is domain-agnostic—expertise emerges from the graph rather than from hardcoded logic, enabling extensibility by adding relevant papers. Additionally, our graph construction pipeline allows "incremental updates," enabling researchers to incorporate the latest conference proceedings with minimal manual effort.

4.2 Verification Engine
Airavat's Verification Engine assesses the generated workflows, identifies critical gaps, and produces modifications grounded in established techniques. Assessing workflow quality is challenging: unlike algorithm optimization with automated verifiers or performance-based system optimizations, measurement workflows span diverse domains with varying requirements and typically cannot be evaluated by execution, due to long-term data collection requirements, large-scale infrastructure needs, and an often-absent set of well-defined success metrics.

The Verification Engine addresses these challenges with a three-stage pipeline. The Evaluator performs a systematic assessment of the workflow using the Knowledge Graph. The Selector determines the optimal verification strategy based on the Evaluator's scores, selecting whether a given workflow merits direct use, requires enhancement, or benefits from combining with complementary workflows. Finally, the Synthesizer generates refined workflows: either by improving individual workflows in enhancement mode or by combining multiple workflows in hybrid mode. Note that verification here assesses methodological alignment with prior work rather than guaranteeing the correctness of conclusions.

Evaluator: Multi-Dimensional Assessment. The Evaluator assesses workflows across five dimensions: literature alignment, novelty, feasibility, simplicity, and robustness.
These complementary dimensions serve dual purposes: computing overall quality scores for comparison and revealing specific improvement opportunities for synthesis. The Evaluator employs a three-stage pipeline.

The first stage involves structural validation. This stage validates JSON Schema compliance for intermediate outputs (e.g., the QueryMind output schema, Fig. 6 in the Appendix), confirms that all sub-problems are addressed, verifies that the proposed registry functions exist, and ensures that workflow complexity matches requirements. Workflows that fail structural validation are rejected.

The second stage focuses on literature-grounded scoring using the Knowledge Graph. Three complementary assessment approaches provide different perspectives: (a) Problem-centric assessment searches for similar problems via vector similarity, validating whether methods, steps, datasets, and targets align with approaches used for similar problems (contributes to literature alignment and robustness); (b) Approach-centric assessment searches for similar methods across the literature without problem constraints, detecting transferable cross-domain patterns (contributes to literature alignment, novelty, feasibility, simplicity, and robustness); (c) Collective assessment analyzes patterns frequently appearing in the literature but missing from all workflows, identifying systematic gaps and generating system warnings rather than dimension scores. This stage also validates feasibility through sub-dimension checks (scale handling, complexity, edge-case coverage, error handling), rejecting workflows that fail minimum thresholds (Appendix Table 4).

Finally, the Evaluator assigns the final score through adaptive weighting, which dynamically adjusts dimension priorities based on problem characteristics.
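This final adaptive-weighting step can be pictured as a reweighted average over the five dimensions. The sketch below is illustrative: the paper states which problem traits boost which dimensions, but the base weights, boost size, and trait labels are our assumptions:

```python
def adaptive_score(dims: dict[str, float], problem_traits: set[str]) -> float:
    """Weighted average over the Evaluator's five dimensions.
    Base weights and the +1.0 boost are illustrative, not Airavat's values."""
    weights = {d: 1.0 for d in
               ("literature", "novelty", "feasibility", "simplicity", "robustness")}
    if "novel" in problem_traits:            # novel problems upweight novelty
        weights["novelty"] += 1.0
    if "well_studied" in problem_traits:     # well-studied upweight literature
        weights["literature"] += 1.0
    if "scale_intensive" in problem_traits:  # scale-heavy upweight feasibility
        weights["feasibility"] += 1.0
    total = sum(weights.values())
    return sum(weights[d] * dims[d] for d in weights) / total

dims = {"literature": 0.9, "novelty": 0.4, "feasibility": 0.8,
        "simplicity": 0.7, "robustness": 0.6}
print(round(adaptive_score(dims, {"well_studied"}), 3))  # → 0.717
```

A well-studied problem thus pulls the overall score toward literature alignment (0.717 here versus 0.68 under equal weights).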
Novel problems receive increased novelty weight, well-studied problems receive increased literature weight, and scale-intensive problems receive increased feasibility weight. Final scoring produces rankings with comprehensive justifications documenting dimension breakdowns, strengths/weaknesses, and workflow-specific limitations. System warnings capturing collective gaps appear separately from workflow-specific issues.

Selector: Verification Strategy Selection. The Selector determines the optimal verification strategy based on Evaluator scores. Workflows exceeding excellence thresholds (Appendix Table 4) receive immediate approval. Workflows within the good range trigger an enhancement evaluation, which examines dimension weaknesses and workflow-specific limitations. When it identifies actionable issues, synthesis proceeds in "enhancement" mode. Workflows below the good range always trigger synthesis. When multiple proposals exist with structural diversity exceeding thresholds, the Selector performs complementarity analysis. If workflows propose fundamentally different approaches with complementary strengths, synthesis proceeds in "hybrid" mode to combine elements. If workflows are very similar, enhancement mode applies to the top-scoring workflow. The Selector prepares structured input for the Synthesizer, including problem context, workflows to improve or combine, complete evaluation insights, and improvement guidance.

Synthesizer: Generating Improved Workflows. The Synthesizer converts structured input into improved workflows using the top-performing model identified by the Selector. In enhancement mode, prompts include problem context, dimension weaknesses with computation transparency, knowledge graph hints with advisory markers, and domain considerations encouraging practical reasoning. In hybrid mode, prompts emphasize coherent combination guided by complementarity analysis, with explicit attention to common weaknesses.
After generation, the Synthesizer performs response parsing with schema validation, then generates change documentation via a separate call that compares the original and synthesized workflows. This separation ensures both synthesis quality and documentation comprehensiveness. Synthesized workflows maintain schema compatibility with WorkflowScout outputs, enabling seamless handoff to SolutionWeaver.

4.3 Validation Engine
Validation using alternative techniques is critical in measurement research to ensure that collected data accurately reflect reality and that measurement tools produce reliable results. Airavat's Validation Engine generates executable validation code for network measurement solutions by discovering applicable validation approaches from the research literature and adapting them to specific problems. The engine operates through a pipeline grouped into three functional components: InsightEngine, Strategizer, and CodeGenerator.

This decomposition enables three key capabilities. First, different validation tasks demand different computational approaches. Second, the pipeline gracefully adapts to varying literature coverage, emphasizing creative synthesis when few existing studies directly address a problem. Third, producing inspectable outputs at each stage allows researchers to understand how validation plans were derived rather than receiving unexplained recommendations.

InsightEngine: Problem Analysis and Knowledge Discovery. InsightEngine performs three integrated functions to enable the synthesis of validation strategies. First, in problem characterization, a local LLM (LLama-3.1-8B) semantically classifies problem characteristics (prediction, detection, temporal/spatial patterns, or causal relationships) without relying on keyword matching.
This enables recognition of problem types across different terminologies and guides subsequent knowledge graph queries and validation-type selection. InsightEngine also identifies high-risk components flagged in WorkflowScout's analysis that require additional validation. Rather than executing potentially unsafe code, InsightEngine uses static analysis to extract implementation details from the generated workflow. Specifically, it parses the Python code into an Abstract Syntax Tree (AST) to identify key elements such as data sources, analytical methods, pipeline steps, and output structures. This extracted information then informs the queries sent to the Knowledge Graph to retrieve relevant validation strategies.

Second, to balance query specificity and coverage, InsightEngine uses multidimensional querying across seven dimensions: similar problems, methods, data sources, analysis pipelines, validation keywords, suitable ground-truth datasets, and domain-specific metrics. The local LLM first extracts technical terms and validation keywords from the problem description and implementation. Each query returns papers with their validation metadata (methodologies, ground truth sources, metrics, limitations), enabling the discovery of applicable approaches even when no single paper directly addresses the problem.

The third function assesses problem novelty. InsightEngine computes semantic similarity between the current problem and retrieved validation approaches using embedding models, capturing relationships that keyword matching would miss. High similarity indicates well-studied problems amenable to adapting proven methods. Low similarity signals novel problems requiring creative synthesis. For borderline cases, the LLM assesses whether semantically similar papers address comparable problems or differ in critical assumptions.

Strategizer: Filtering and Adaptation.
The Strategizer employs an LLM to critically evaluate retrieved validation approaches, determine their applicability, adapt them to the current problem, and select complementary strategies. This requires sophisticated reasoning because literature approaches rarely apply directly—papers may assume different data availability, constraints, or measurement capabilities.

The Strategizer receives extensive context from InsightEngine: problem characteristics, SolutionWeaver's implementation details, knowledge graph results with relevance scores, high-risk components, available tools and datasets from the Registry, and the novelty assessment. The synthesis process emphasizes critical filtering, identifying which approaches are actually applicable and explicitly documenting why others are unsuitable, preventing acceptance of semantically similar but practically inapplicable validations.

Three filtering rules guide this process. (i) Ground-truth comparison is only recommended when ground truth demonstrably exists in the registry and matches the problem requirements; recommending unavailable or mismatched ground truth wastes effort and provides no validation value. (ii) Alternative-method validation is only suggested when the alternative method has proven reliability documented in the literature; comparing against an unverified alternative provides no confidence gain. (iii) Validation strategies must provide complementary rather than redundant perspectives, examining different aspects such as end-to-end correctness, component-level accuracy, and internal consistency rather than repeatedly checking the same properties.

For each recommended strategy, the Strategizer specifies its applicability with supporting literature, its adaptation from the original approach, its feasibility given available data and tools, metrics with interpretation guidance, and how it complements other strategies.
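The Strategizer's three filtering rules act as a conjunction of checks over each candidate strategy. A minimal sketch (the strategy fields, predicate logic, and example dataset name are hypothetical illustrations of the rules, not Airavat's implementation):

```python
def passes_filters(strategy: dict, registry_ground_truth: set[str],
                   accepted: list[dict]) -> bool:
    """Apply the three filtering rules to one candidate validation strategy."""
    # (i) Ground-truth comparison only if matching ground truth exists.
    if strategy["kind"] == "ground_truth":
        if strategy["dataset"] not in registry_ground_truth:
            return False
    # (ii) Alternative-method validation only against methods with
    # documented reliability in the literature.
    if strategy["kind"] == "alternative_method" and not strategy.get("proven"):
        return False
    # (iii) Must add a complementary perspective, not repeat one.
    if strategy["aspect"] in {s["aspect"] for s in accepted}:
        return False
    return True

accepted: list[dict] = []
candidates = [
    {"kind": "ground_truth", "dataset": "ripe_atlas", "aspect": "end_to_end"},
    {"kind": "alternative_method", "proven": False, "aspect": "component"},
    {"kind": "consistency", "aspect": "end_to_end"},  # redundant perspective
]
for c in candidates:
    if passes_filters(c, {"ripe_atlas"}, accepted):
        accepted.append(c)
print([c["kind"] for c in accepted])  # → ['ground_truth']
```

Rule (iii) is stateful: whether a candidate is redundant depends on the perspectives already accepted, so candidates are filtered in sequence.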
Validation strategies must trace to specific papers from the database and align with verified data availability rather than fabricating approaches or references. The grounding in knowledge graph results, combined with explicit feasibility checking, reduces hallucination by constraining generation through factual anchors.

CodeGenerator: Producing Executable Validation Code. CodeGenerator translates the validation plan into executable code using an LLM. Validation strategies typically span diverse types requiring different implementation approaches: ground truth comparison, alternative method validation, consistency checking, component testing, and sample verification each have distinct implementation requirements. Templates would impose a rigid structure and fail to support creative strategies tailored to novel problems. The LLM observes SolutionWeaver's code to match its style, including documentation standards and coding patterns, while adapting to diverse validation approaches.

5 IMPLEMENTATION
The prompts instruct all agents to generate Python code.

Workflow Generation: Airavat employs Claude Opus 4.5 [17] and Claude Sonnet 4.5 [18] for the core agents in workflow generation. State-of-the-art cloud models provide the best available performance for the core functionality. The prompts are model-agnostic and do not rely on vendor-specific capabilities. However, empirical evaluation shows that Claude variants perform most consistently (§7.2).

Workflow Verification: The Knowledge Graph construction pipeline uses the mpnet-base-v2 [47] sentence transformer for categorizing papers and BART-MNLI [13] as the zero-shot classifier. GROBID [1] is used to parse the papers, followed by LLama-3.1-8B for tagging categories in each section and LLama-3.3-70B for content extraction. The construction itself relies on Neo4j [40].
The resulting graph contains 35,719 nodes across 2,021 papers, spanning over 65,000 relationships across eight entity types (detailed statistics in Appendix D). The Verification and Validation Engines rely on four cloud-based LLMs (Claude Opus 4.5 [17], Claude Sonnet 4.5 [18], Gemini-3-Pro [31], Gemini-3-Flash [30]) for generating multiple variants in LLM-assisted stages and BAAI BGE-M3 [23] for embeddings.
Cost: Generation agents use Claude models (Sonnet 4.5 and Opus 4.5). Standard workflow generation costs $0.80-$1.70 per run (Agents 1-3), while comprehensive evaluation with 12 workflow variants, verification, and validation costs $4.50-$7.00. Total expenditure for all case studies is $21.50, demonstrating cost-effectiveness for research-scale evaluation (detailed cost breakdown in Appendix A). Generation time for all workflows was under 10 minutes, significantly shorter than the days or weeks required by human experts.
6 EVALUATION: GENERATION
We demonstrate Airavat's workflow generation capabilities using four case studies spanning infrastructure resilience and IP allocation analysis. All evaluations use Claude Opus 4.5 unless mentioned otherwise. We organize the subsections based on demonstrated capabilities: expert solution replication, judicious tool selection, novel problem solving, and domain transferability. Table 1 summarizes our results. We evaluate workflows along three axes: architectural coherence, correctness relative to expert baselines, and robustness to known data artifacts.
6.1 Expert Solution Replication
Capability Under Test: Can Airavat derive workflows functionally equivalent to expert-designed solutions without domain-specific architectural guidance and without the target solution in the Registry/Knowledge Graph?
Airavat Query: "Identify the impact of SeaMeWe-5 cable failure at a country level" (Cable Impact Analysis Case Study)
Background: Xaminer [45] is an open-source cross-layer Internet resilience analysis framework that addresses such queries by aggregating metrics across the cable, IP, and AS layers.
Setup: We provide Airavat with the standard registry, including core measurement functions from Nautilus [43], an open-source cross-layer submarine cable mapping framework, but deliberately excluding Xaminer. This setup ensures the system relies on analytical reasoning rather than following the pre-existing solution. The query requires understanding cable dependencies, extracting affected IP addresses, performing geographic mapping, and aggregating country-level impacts, an analysis that traditionally requires domain expertise and manual framework integration.
Workflow Comparison. Both systems perform physical topology analysis (landing point identification) and traffic topology analysis (IP link processing). Xaminer employs embedding modules that pre-aggregate cross-layer metrics at country and AS-level abstractions. Airavat instead uses multi-source evidence fusion, assigning confidence scores based on source agreement (landing points, IP geolocation, AS registration) and separately tracking direct versus indirect landing station impacts.
Results. Airavat successfully replicates Xaminer's analysis, producing perfectly matching results across all impact metrics (Figure 2). Airavat generates 850 lines of Python code implementing the complete analysis pipeline using 4 registry functions (listed in Table 3). At the IP level, the most impacted countries are India (8,754 IPs), Malaysia (7,170), and Singapore (7,082). At the link level, the countries most affected by raw counts are Singapore (58.1K links), Germany (34.6K), and France (22.5K). At the AS level, the most impacted countries are Indonesia (869 ASes), Bangladesh (731), and Singapore (634).
When examining normalized IP link impact (called risk factor by Xaminer), both systems identify Afghanistan, Eritrea, and Djibouti as showing the highest normalized impact, followed by Nepal, Bhutan, and Bangladesh (Fig. 2). The perfect agreement with Xaminer is expected: both systems ultimately invoke the same component functions and operate on identical input datasets. Additionally, the generated code includes comprehensive error handling, intermediate result validations, and clear documentation. This case study did not require any manual modifications.
Table 1: Case Study Summary
Case Study      | Problem               | LoC                           | Key Capability                                   | Results
1: Cable Impact | SeaMeWe-5 failure     | ~850                          | Expert Solution Replication                      | Matches the expert solution
2: Disaster     | Earthquake/hurricane  | ~700                          | Workflow Simplicity                              | Appropriate expert-level minimal solution
3: Cascading    | Europe-Asia cables    | ~1600                         | 9-function orchestration                         | Multi-layer integration framework
4: Prefix2Org   | Prefix-to-org mapping | 1500 → 1700 (w/ verification) | Domain Transferability + Verification capability | 0% → 70-75% (vague query); 0% → 90.9% tags (improved query)
Figure 2: Country-level impact from SeaMeWe-5 cable failure. The heatmap shows the risk factor, computed as the fraction of IP addresses affected in a country relative to the total number of IP addresses mapped to that country. The results match the findings of Xaminer [45].
Key Insight: Airavat replicates expert-level analytical workflows through systematic reasoning about evidence fusion and confidence scoring, demonstrating that measurement expertise follows capturable compositional patterns.
6.2 Judicious Tool Selection
Capability Under Test: When multiple tools are available, will Airavat select minimal sufficient functionality or over-engineer solutions with unnecessary complexity?
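The risk factor defined in the Figure 2 caption reduces to a per-country normalization. The input shapes below (an affected-IP list and an IP-to-country map) are illustrative stand-ins for the registry datasets the generated workflow actually consumes:

```python
from collections import Counter

def risk_factor(affected_ips, country_of_ip):
    """Per-country risk factor: fraction of a country's mapped IPs that are affected.

    affected_ips: iterable of IP addresses impacted by the cable failure.
    country_of_ip: dict mapping every known IP address to its geolocated country.
    """
    total = Counter(country_of_ip.values())              # IPs mapped per country
    hit = Counter(country_of_ip[ip] for ip in affected_ips
                  if ip in country_of_ip)                # affected IPs per country
    return {c: hit[c] / total[c] for c in total if total[c]}
```

The same normalization, applied to IP links instead of addresses, yields the Figure 3 metric in the next case study.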
Airavat Query: "Identify the impact of severe earthquakes and hurricanes globally assuming a 10% infrastructure failure probability" (Natural Disaster Analysis Case Study)
Setup: We provide registry functions from multiple measurement frameworks, including Xaminer, to evaluate WorkflowScout's architectural decision-making. The challenge tests whether the system recognizes that Xaminer's event-processing capability can support multi-disaster analysis by systematically applying it to each disaster type, thereby avoiding unnecessary cross-framework orchestration.
Figure 3: Global cable infrastructure impact from earthquakes and hurricanes. The heatmap shows the risk factor, computed as the fraction of IP links affected in a country relative to the total number of IP links in that country. The results match the findings of Xaminer [45].
Workflow Comparison. Airavat demonstrates appropriate architectural restraint. Rather than orchestrating multiple specialized frameworks, the system identifies that Xaminer's event-processing function can handle both disaster types when applied systematically. The workflow processes earthquakes (MMI ≥ 3) and hurricanes (wind speed ≥ 75 knots) separately with 10% failure probabilities, then merges results through union-based aggregation. While Xaminer makes a single call to generate combined results for both disasters, Airavat's separation enables disaster-specific impact statistics while maintaining architectural simplicity. Importantly, while the Xaminer implementation uses 7 registry functions (Table 3) to establish the analytical foundation, Airavat recognizes that one function can handle both types of disaster without requiring separate frameworks for each scenario.
Results. Airavat generates approximately 700 lines of Python code to implement multi-disaster analysis.
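The per-type thresholding and union-based merge described above can be sketched as follows. The event dictionaries are hypothetical input shapes (the real workflow calls Xaminer's event-processing registry function), and the 10% per-asset failure probability, which would be applied by sampling within each event, is omitted for brevity:

```python
def _affected(events, kind, threshold):
    """Union of links from events of one disaster type meeting its severity threshold."""
    out = set()
    for e in events:
        if e["type"] == kind and e["severity"] >= threshold:
            out |= set(e["links"])
    return out

def multi_disaster_impact(events, mmi_min=3, wind_min_kt=75):
    """Process each disaster type separately, then merge results via set union.

    events: illustrative list of dicts like
        {"type": "earthquake", "severity": 5, "links": {"link-a", ...}}
    where severity is MMI for earthquakes and wind speed (knots) for hurricanes.
    """
    quakes = _affected(events, "earthquake", mmi_min)
    canes = _affected(events, "hurricane", wind_min_kt)
    return {"earthquake": quakes, "hurricane": canes, "combined": quakes | canes}
```

Keeping the per-type sets alongside the union is what enables the disaster-specific impact statistics the text mentions.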
Execution against real disaster data produces results perfectly matching Xaminer's multi-disaster impact analysis (Figure 3).
Key Insight: Airavat demonstrates appropriate architectural judgment by avoiding unnecessary complexity and selecting solutions that meet requirements without over-engineering.
6.3 Novel Problem Solving
Capability Under Test: Can Airavat tackle research problems without established solutions, enabling analyses previously impractical due to integration complexity?
Airavat Query: "Analyze the cascading effects of submarine cable failures between Europe and Asia" (Cascading Failure Analysis Case Study)
Challenge: Unlike previous case studies with expert-designed solutions for validation, cascading failure analysis across continents has no established implementation. Manual development would require: (i) expertise across infrastructure mapping, impact analysis, and AS dependency tracking frameworks; (ii) days of integration engineering; and (iii) specialized knowledge of cross-layer synthesis techniques. This barrier makes such analysis impractical for most researchers.
Setup: We evaluate whether Airavat can orchestrate complex multi-framework workflows for exploratory research. The analysis requires integration across infrastructure mapping, AS-level dependency tracking, and cross-layer synthesis, traditionally requiring substantial manual engineering.
Workflow Structure. Airavat produces a 1,600-line implementation orchestrating 9 registry functions across three analytical layers (Table 3):
Physical Infrastructure Analysis. Identifies submarine cables connecting Europe and Asia through geographic filtering of landing points. Using country-to-cable graph data and landing point mappings, it employs a MERGE strategy combining geographic filtering with cable-name-based identification.
AS-Level Dependency Analysis. Captures cascading effects through autonomous system relationships.
Extracts AS numbers from affected cables, loads AS dependency graphs, and implements graph traversal to trace secondary impacts, distinguishing primary from secondary failures.
Country-Level Impact Assessment. Integrating physical and logical layers, it consolidates cable segments with IP-to-ASN mappings and geolocation data, creates indexed embeddings for country/AS queries, and simulates multiple failure scenarios producing quantitative impacts across cable segments, IP links, IPs, AS links, and ASes.
Results. Without ground truth, we assess workflow quality through architectural coherence, tool integration appropriateness, and analytical reasoning. The generated solution demonstrates: (1) correct understanding of measurement tool capabilities across frameworks, (2) appropriate cross-layer integration, and (3) specialized domain reasoning for cascade analysis. Execution against real infrastructure and routing data produces comprehensive vulnerability assessments that provide actionable starting points for further investigation. Figures 7, 8, 9, and 10 visually show the workflow layers generated by Airavat.
Key Insight: Airavat handles complex multi-framework integration for never-solved-before problems, lowering barriers to sophisticated exploratory analysis while maintaining research-quality reasoning across network layers.
6.4 Domain Transferability
Capability Under Test: Can Airavat adapt to different measurement problems?
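The primary-versus-secondary distinction in the dependency analysis amounts to a reachability traversal over the AS dependency graph. The graph shape and function name below are hypothetical; Airavat's generated code operates on registry-provided dependency data:

```python
from collections import deque

def cascade_impacts(dep_graph, primary_ases):
    """Classify ASes as primary (directly on affected cables) or secondary
    (reachable via dependency edges from a primary AS).

    dep_graph: dict mapping an AS number to the set of ASes that depend on it.
    primary_ases: iterable of ASes whose cables were directly affected.
    """
    primary = set(primary_ases)
    secondary, queue = set(), deque(primary)
    while queue:  # breadth-first traversal of dependents
        asn = queue.popleft()
        for dependent in dep_graph.get(asn, ()):
            if dependent not in primary and dependent not in secondary:
                secondary.add(dependent)
                queue.append(dependent)
    return primary, secondary
```

Keeping the two sets separate is what lets the workflow report direct cable impacts and downstream cascades as distinct metrics.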
Airavat Query: "Map BGP-routed prefixes to owner organizations, identifying Direct-Owner allocations, Delegated Customer chains, and organizational consolidation for address blocks scattered across WHOIS records" (Prefix-to-Organization Mapping Case Study)
Background: Prefix2Org [33] maps BGP prefixes to organizations, distinguishing Direct Owners (provider-independent allocations) from Delegated Customers (sub-delegations) through WHOIS parsing, longest-prefix-match using radix tries, and organizational clustering across heterogeneous WHOIS formats from all RIRs.
Setup and Challenge: This case study focuses on IP allocation analysis (a significantly different measurement problem from prior case studies on infrastructure resilience) under deliberately more stringent constraints. Registry functions provide raw BGP/WHOIS data with minimal preprocessing (Table 3), unlike previous case studies where substantial processing occurs upstream. We do not define key terms in the query, such as "Direct Owner" and "Delegated Customer". This tests whether LLM reasoning can infer semantic distinctions from context alone. We generate 12 workflow variants using four models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini-3-Pro, Gemini-3-Flash) at three temperature settings (0.0, 0.5, 1.0) and evaluate against Prefix2Org monthly ground truth data covering over 1 million BGP-routed prefixes.
Domain Transfer Results. WorkflowScout successfully captures high-level workflow methodology for this new domain, generating specifications that incorporate domain-appropriate techniques including radix trie construction for longest-prefix-match operations, WHOIS format parsing across heterogeneous RIR schemas, and organizational name matching for entity consolidation.
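The core longest-prefix-match step, mapping each BGP prefix to its most-specific covering WHOIS allocation, can be illustrated with the standard library. This linear scan shows only the semantics; the radix trie the workflows build replaces it for performance, and the allocation dictionary is an assumed input shape:

```python
import ipaddress

def longest_prefix_match(bgp_prefix, whois_allocations):
    """Return the organization of the most-specific WHOIS allocation
    covering a BGP prefix, or None if no allocation covers it.

    whois_allocations: dict of {prefix string: organization} (illustrative).
    """
    target = ipaddress.ip_network(bgp_prefix)
    best, best_len = None, -1
    for alloc, org in whois_allocations.items():
        net = ipaddress.ip_network(alloc)
        # subnet_of() raises across address families, so check version first.
        if net.version == target.version and target.subnet_of(net) \
                and net.prefixlen > best_len:
            best, best_len = org, net.prefixlen
    return best
```

Note how an unfiltered 0.0.0.0/0 entry covers every IPv4 prefix: any address block with no more specific allocation silently maps to the catch-all, which is exactly the failure mode discussed below.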
The generated workflows process completely different data sources (bulk WHOIS records, BGP dumps, RPKI certificates) using methodology distinct from cable-based analysis, demonstrating that Airavat adapts to new measurement domains rather than simply templating from previous solutions.
All 12 generated workflows execute successfully and produce output in expected formats, validating domain transfer at the architectural and implementation levels. However, comparison against Prefix2Org ground truth reveals 0% correct mappings. Every workflow contains an identical critical bug: failing to filter 0.0.0.0/0 default routes from BGP dumps before hierarchical tree traversal operations, causing all prefixes to incorrectly map to catch-all allocations. This demonstrates that subtle domain-specific data quality requirements, which exist in tacit expert knowledge rather than formal specifications, require verification mechanisms to detect and address them. We discuss how the Verification Engine helps improve the result next.
7 EVALUATION: VERIFICATION
Using the Prefix2Org case study [33], which demonstrated that all generated workflows failed despite syntactically correct code (§6.4), we demonstrate the verification capabilities of Airavat. Specifically, we illustrate the following capabilities: bug detection, comparative model evaluation (including the filtering of faulty workflows), and the impact of query clarity on performance. Verification success is measured by the detection of known methodological flaws and improvement in downstream accuracy.
7.1 Literature-Grounded Bug Detection
Goal: Identify subtle methodological flaws that execution testing cannot detect: bugs that produce syntactically correct code executing successfully yet yielding incorrect measurement results.
Experimental Setup: We generate workflows using four models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini-3-Pro, Gemini-3-Flash) at three temperatures (0.0, 0.5, 1.0), producing 12 configurations. Each configuration is run once, yielding a total of 12 generated workflows.
The Bug: Without verification, all 12 baseline workflows contained an identical critical bug: failing to filter 0.0.0.0/0 default routes before hierarchical tree traversal operations, causing every prefix to match the catch-all route. Additionally, workflows failed to filter overly broad WHOIS allocations (IPv4 < /8, IPv6 < /16) representing RIR administrative records rather than operational allocations. This preprocessing requirement exists in tacit expert knowledge but rarely appears in published algorithmic descriptions, making it invisible to LLMs despite correct high-level methodology.
Bug Detection: The Verification Engine detects and repairs bugs in three stages. The evaluator scored workflows against the quality dimensions, assigning scores below the excellence threshold and triggering enhancement mode. It then compared the workflows against best practices embedded in the Knowledge Graph, identified systematic data quality issues, and generated two warnings. The first warning concerns BGP data quality and the need to remove unwanted prefixes, which correctly identifies the bug under consideration. The second is about inferred topology validation, which proved less relevant. These warnings are derived from literature patterns: papers in the knowledge graph discuss default route filtering, establishing it as a domain best practice. Synthesizer uses these warnings to generate enhanced workflows.
Results: Workflows improved from 0% accuracy to operational correctness with one targeted fix guided by automated detection. Synthesizer correctly fixed the 0.0.0.0/0 bug for BGP prefixes, but issued a warning for the same issue for WHOIS records.
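The repair the Verification Engine targets amounts to a small preprocessing pass before any tree traversal. A sketch under assumed input shapes (lists of prefix strings), interpreting "IPv4 < /8, IPv6 < /16" as dropping allocations strictly broader than those lengths:

```python
import ipaddress

def clean_bgp_prefixes(prefixes):
    """Drop IPv4/IPv6 default routes (0.0.0.0/0, ::/0) from a BGP dump."""
    return [p for p in prefixes if ipaddress.ip_network(p).prefixlen > 0]

def clean_whois_allocations(allocs):
    """Drop RIR administrative records: allocations broader than IPv4 /8 or IPv6 /16."""
    kept = []
    for p in allocs:
        net = ipaddress.ip_network(p)
        floor = 8 if net.version == 4 else 16
        if net.prefixlen >= floor:
            kept.append(p)
    return kept
```

Without the first filter, every lookup in a hierarchical structure falls through to the 0.0.0.0/0 catch-all, which is precisely how 12 syntactically correct workflows produced 0% correct mappings.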
The synthesized workflow also automatically implemented bogon prefix filtering and invalid AS record elimination, with additional warnings flagging suspicious patterns (e.g., large numbers of ASNs assigned to the U.S. DoD). Because the WHOIS issue produced only a warning rather than an updated workflow, as in the BGP case, LLM-guided manual correction of the WHOIS 0.0.0.0/0 record was required. Note that this correction is also fully automated, without manual intervention, when the query has better clarity (§7.3).
Key Insight: Literature-grounded verification detects bugs missed by standard execution-based testing by comparing generated workflows against best practices extracted from published research. When full automatic repair isn't possible, the system provides graceful degradation: explicit warnings with context enable quick manual correction rather than leaving bugs undetected.
7.2 Comparative Model Evaluation
Goal: Systematically evaluate model-specific failure patterns in the Verification Engine.
Experimental Design: We consider the same 12 configurations as in §7.1, with three runs per configuration, generating a total of 36 workflows.
The Bug: We consider the same bug as in §7.1.
Model-Specific Failure Detection: The Verification Engine's three-stage pipeline detected systematic model-specific patterns. Structural validation automatically filtered either Gemini-3-Pro or Gemini-3-Flash at a temperature of 0.0 in all three runs due to malformed outputs (incomplete schema compliance, missing fields, invalid JSON). This aligns with Google's warning against very low temperatures for Gemini-3 models [28]. Dimension scoring revealed robustness failures across the remaining Gemini configurations. Most were tagged infeasible due to missing edge case handling and error scenarios documented in prior IP allocation research. This matches Google's documentation indicating Gemini's preference for direct answers over comprehensive robustness [29].
In 1 of 3 runs, Opus 4.5 (temperature 1.0) was tagged infeasible for excessive complexity exceeding literature norms. Quality ranking consistently placed Sonnet 4.5 variants highest across all runs (scores 68-72), validating our design choice to employ Claude variants for the agent pipeline.
Key Insight: The Verification Engine identifies model-specific failure patterns through empirical assessment rather than requiring manual expertise about LLM-specific behaviors and limitations. Systematic evaluation across architectures and temperatures reveals that model selection significantly impacts workflow quality, with structural validation and robustness scoring providing objective selection criteria.
7.3 Query Clarity Impact Assessment
Goal: Evaluate how query precision affects LLM reasoning quality during verification by comparing workflows generated from vague versus refined queries.
Experimental Design: We compare two query variants for the same Prefix2Org task. (i) Vague query (baseline): the Airavat Query in §6.4. This query deliberately omits definitions for "Direct Owner" and "Delegated Customer". (ii) Refined query: We add targeted hints without complete definitions for Direct Owners and Delegated Customers.
Performance Impact of Query Clarity: With the vague query (no Direct/Delegated definitions), the corrected workflow generated by the Verification Engine achieved 97.6% overlap for origin AS identification and 60-65% for Direct Owner/Delegated Customer (DO/DC) WHOIS tag classification, improving to 70-75% after excluding incomplete LACNIC data. This represents a substantial improvement from 0% correctness without verification.
With the refined query, all 12 workflows still contained the identical 0.0.0.0/0 bug before verification. After verification, the overlap for origin AS identification remained the same, and the DO/DC WHOIS tag identification improved to 90.9% (correctly identifying 20 of 22 tags, with only two LACNIC tags misclassified).
The workflow also included RPKI validation as a fallback (similar to the paper [33]) for low-confidence tags. For this improved query, both the BGP and WHOIS 0.0.0.0/0 issues were automatically detected and fixed during synthesis, requiring no manual corrections.
Key Insight: Query precision dramatically impacts LLM semantic reasoning quality for problems requiring domain-specific interpretation beyond technical implementation. Even partial clarification (hints rather than complete definitions) improves ownership classification accuracy from 60-65% to 90.9%, demonstrating that query engineering complements verification in achieving high-quality workflows.
8 EVALUATION: VALIDATION
We demonstrate the Validation Engine capabilities with the Cable Impact Analysis Case Study (§6.1). The workflow analyzes country-level impact from the SeaMeWe-5 cable failure, producing metrics spanning cable segments, IP links, ASes, and geographic distribution. For this evaluation, we generate a new knowledge graph with the Xaminer paper [45] removed, to prevent the Validation Engine from directly identifying techniques from the paper.
The Validation Engine generated nine validation strategies spanning multiple validation types: three system-level validations, three component validations, one consistency validation, and two synthesized validations. We show how the engine identifies validation strategies in the existing literature and develops new ones when precedents are absent.
[V1] System-Level Validation: Landing Point Country Coverage. InsightEngine's problem-centric knowledge graph queries identified POPsicle's [25] validation approach comparing network topology predictions against authoritative sources. The Strategizer adapted this to submarine cable validation, proposing comparison of the workflow's country list against publicly documented SeaMeWe-5 landing points from TeleGeography and SubmarineCableMap [9].
[V2] Component Validation: Cable Segment Identification.
InsightEngine flagged the Cable Segment Identification component as high-risk based on WorkflowScout's analysis, triggering targeted component-level validation. The Strategizer adapted SyslogDigest's template validation approach [42] (comparing identified patterns against ground truth) to cable identification, proposing dual verification through string matching across naming variants (SeaMeWe-5, SMW5, SMW-5) and landing point pair matching. The strategy validates that identified segments connect known landing point pairs and form geographically coherent routes, isolating cable identification accuracy from downstream aggregation errors.
[V3] Consistency Validation: Multi-Source Impact Metric Consistency. InsightEngine's approach-centric queries identified Tessellation's multi-source traffic attribution validation [58]. The Strategizer adapted this to country attribution consistency checking across three independent sources: IP geolocation, ASN mappings, and landing point data. The strategy computes consistency scores (the proportion of sources agreeing) and identifies countries with high versus low source agreement, providing confidence indicators for impact assessments when single ground truth is unavailable.
[V4] Synthesized Validation: Historical Cable Outage Comparison. InsightEngine's novelty assessment indicated no direct precedent for submarine cable failure impact validation, triggering creative synthesis. The Strategizer synthesized this approach by adapting temporal validation patterns from network disruption detection literature, proposing application of the workflow to documented historical outages (2008 Mediterranean cuts, 2020 AAE-1 issues) and comparison of predicted impacts against reported impacts.
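The V3 consistency score, the proportion of independent sources agreeing on a country attribution, can be sketched per IP or link as follows (the source names and input shape are illustrative):

```python
from collections import Counter

def consistency_score(attributions):
    """Agreement on the modal country attribution across independent sources.

    attributions: country labels from independent sources, e.g.
        {"geolocation": "SG", "asn_registry": "SG", "landing_point": "MY"}.
    Returns (modal country, agreement fraction in [0, 1]).
    """
    counts = Counter(c for c in attributions.values() if c is not None)
    if not counts:
        return None, 0.0
    country, votes = counts.most_common(1)[0]
    return country, votes / sum(counts.values())
```

Aggregating these fractions per country is what separates high-agreement attributions from those needing manual scrutiny when no single ground truth exists.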
Critically, the strategy leverages IODA traffic measurements and RIPE Atlas data during historical cable failures, matching the validation methodology employed in the original Xaminer paper [45] for infrastructure resilience analysis. This demonstrates the Validation Engine's ability to discover and propose validation approaches that align with established measurement research practices.
Methodological Quality Assessment. The generated strategies exhibit multiple quality indicators. First, literature grounding ensures strategies adapt proven validation patterns rather than fabricating approaches. Second, complementarity ensures strategies examine different correctness dimensions: V1 validates geographic scope, V2 validates foundational component accuracy, V3 validates data integration quality, and V4 provides external validation through historical comparison. Third, feasibility assessment ensures all strategies use data available in the registry or publicly accessible sources. Fourth, explicit adaptation justifications document how literature approaches transfer to the specific problem context.
9 DISCUSSION AND RELATED WORK
Generalization. We developed agent prompts through iterative refinement, encoding domain-agnostic reasoning patterns (e.g., problem decomposition, constraint evaluation) while deliberately excluding measurement-specific heuristics. Domain expertise resides in the Knowledge Graph and Registry rather than agent logic, enabling transferability through knowledge graph population.
Within the Internet measurement domain, Airavat's architecture generalizes across sub-domains, as demonstrated by our case study (§6.4). We expect Airavat's capabilities to readily generalize across a broader range of sub-problems, including network security analysis and performance debugging. The knowledge graph and agent reasoning patterns transfer directly to these domains by ingesting relevant research papers and domain-specific tools.
Beyond Internet measurement, generalization to fundamentally different domains requires domain-specific knowledge graph construction and registry curation. Additionally, measurement-specific components, such as the five complexity dimensions used for query decomposition, may require adaptation for other fields.
Emerging Standards and Scalability. The emergence of agent communication protocols such as the Model Context Protocol (MCP) [16] and the Agent-to-Agent (A2A) protocol [27] presents opportunities for standardizing AI agent interactions with measurement tools. MCP's server-client design could provide unified interfaces for tool interaction, dramatically simplifying registry maintenance through automatic capability discovery and standardized interaction patterns. A2A protocols could formalize communication between Airavat's specialized agents, enabling more robust task delegation and state management. However, realizing these benefits requires widespread protocol adoption across both the measurement tool and AI agent ecosystems. A related challenge is maintaining registry accuracy as tools evolve. Future work could employ specialized LLM agents to automatically analyze codebases and monitor tool repositories, reducing manual maintenance overhead while improving scalability.
Limitations. Despite demonstrated capabilities, Airavat exhibits fundamental limitations. While it can be fully automated for well-known problems, it can only serve as a co-pilot for never-seen-before problems. The Registry requires manual curation; while the RegistryCurator Agent identifies reusable patterns, all additions require expert validation to maintain quality standards. While the knowledge graph captures a broad slice of measurement literature, it may miss informal best practices. The verification engine can only detect methodological flaws documented in existing literature.
The system cannot generate validation strategies for workflows requiring long-term longitudinal studies, large-scale infrastructure deployment, or proprietary datasets unavailable in public repositories. Finally, query precision can dramatically impact accuracy, as demonstrated in §7.3, requiring both experts and non-experts to be precise in their interactions.
Multi-Agent LLM Systems. Multi-agent LLM systems decompose complex problems into specialized subtasks. Prominent frameworks include AutoGen [57], MetaGPT [35], LangGraph [12], and CrewAI [11]. However, multi-agent systems exhibit notable pitfalls including role violations, input conflicts, and incomplete verification [21]. Airavat addresses these challenges through systematic quality assurance mechanisms.
Agentic Workflows in Networked Systems Research. Recent work applies LLMs and agentic AI to networked systems: NetLLM [56] for networking tasks, NetConfEval [52] for configuration, Confucius [54] for network management, and system optimization [24, 34, 48]. However, none address measurement research's unique challenges of workflow generation with systematic verification, which Airavat provides.
AI for Science. AI's role in scientific discovery spans multiple fields and autonomy levels [32, 38, 59, 61], from passive assistance to fully autonomous research. Achieving autonomous scientific discovery requires advances across problem identification, hypothesis formulation, experiment design, execution, analysis, and iterative refinement. Our work contributes by developing agentic workflows for network measurement research that demonstrate how multi-LLM collaboration can address domain-specific scientific challenges while maintaining interpretability and human oversight.
10 CONCLUSION
Internet measurement research requires sophisticated tool integration and rigorous validation.
Airavat demonstrates how agentic AI systems can automate both workflow generation and literature-grounded verification through specialized agents and engines operating on a knowledge graph. Our evaluation shows that agentic systems generate expert-level workflows and identify methodological flaws missed by standard execution-based testing. We view Airavat as a force multiplier for experts and a scaffolding tool for non-specialists. Looking forward, we envision Airavat enabling a new mode of measurement research, where hypotheses, workflows, verification strategies, and validation plans co-evolve interactively, lowering the barrier to rigorous Internet measurement while preserving the methodological discipline built over decades of community effort.
REFERENCES
[1] 2008–2026. GROBID. https://github.com/kermitt2/grobid. (2008–2026).
[2] 2022. AAE-1 cable cut causes widespread outages in Europe, East Africa, Middle East, and South Asia - DCD. https://www.datacenterdynamics.com/en/news/aae-1-cable-cut-causes-widespread-outages-in-europe-east-africa-middle-east-and-south-asia/. (2022).
[3] 2022. Falcon Cable Fault Believed To Be From Air Strike. https://subtelforum.com/falcon-cable-fault-believed-to-be-from-air-strike/. (2022).
[4] 2025. BGP.Tools. https://bgp.tools. (2025).
[5] 2025. NetBlocks. https://netblocks.org. (2025).
[6] 2025. IODA. https://ioda.inetintel.cc.gatech.edu. (2025).
[7] 2025. University of Oregon Route Views Project. https://www.routeviews.org/routeviews/. (2025).
[8] 2025. Routing Information Service (RIS). https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/. (2025).
[9] 2025. Submarine Cable Map. https://www.submarinecablemap.com/. (2025).
[10] 2025. Worldwide Overview | Cloudflare Radar. https://radar.cloudflare.com.
(2025).
[11] 2026. CrewAI. https://www.crewai.com/. (2026).
[12] 2026. LangGraph: Agent Orchestration Framework. https://www.langchain.com/langgraph. (2026).
[13] Facebook AI. 2020. bart-large-mnli. https://huggingface.co/facebook/bart-large-mnli. (2020).
[14] Bahaa Al-Musawi, Philip Branch, and Grenville Armitage. 2016. BGP Anomaly Detection Techniques: A Survey. IEEE Communications Surveys & Tutorials 19, 1 (2016), 377–396.
[15] Thomas Alfroy, Thomas Holterbach, Thomas Krenc, KC Claffy, and Cristel Pelsser. 2024. The Next Generation of BGP Data Collection Platforms. In Proceedings of the ACM SIGCOMM 2024 Conference. 794–812.
[16] Anthropic. 2024. Model Context Protocol (MCP). https://modelcontextprotocol.io/docs/getting-started/intro. (2024).
[17] Anthropic. 2025. Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5. (2025).
[18] Anthropic. 2025. Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. (2025).
[19] Brice Augustin, Xavier Cuvellier, Benjamin Orgogozo, Fabien Viger, Timur Friedman, Matthieu Latapy, Clémence Magnien, and Renata Teixeira. 2006. Avoiding traceroute anomalies with Paris traceroute. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement. 153–158.
[20] Vaibhav Bajpai and Jürgen Schönwälder. 2015. A survey on internet performance measurement platforms and related standardization efforts. IEEE Communications Surveys & Tutorials 17, 3 (2015), 1313–1341.
[21] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? (2025). arXiv:cs.AI/2503.13657 https://arxiv.org/abs/2503.13657
[22] Balakrishnan Chandrasekaran, Georgios Smaragdakis, Arthur Berger, Matthew Luckie, and Keung-Chi Ng. 2015. A server-to-server view of the Internet.
In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies. 1–13.
[23] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2025. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. (2025). arXiv:cs.CL/2402.03216 https://arxiv.org/abs/2402.03216
[24] Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. 2025. Barbarians at the Gate: How AI is Upending Systems Research. (2025). arXiv:cs.AI/2510.06189 https://arxiv.org/abs/2510.06189
[25] Ramakrishnan Durairajan, Joel Sommers, and Paul Barford. 2014. Layer 1-Informed Internet Topology Measurement (IMC '14). Association for Computing Machinery, New York, NY, USA, 381–394. https://doi.org/10.1145/2663716.2663737
[26] Nick Feamster and Hari Balakrishnan. 2005. Detecting BGP configuration faults with static analysis. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2. 43–56.
[27] Google. 2024. Agent-to-Agent (A2A) Protocol. https://github.com/google/A2A. (2024).
[28] Google. 2025. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3#temperature. (2025).
[29] Google. 2025. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3#prompting_best_practices. (2025).
[30] Google DeepMind. 2025. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/. (2025).
[31] Google DeepMind. 2025. Gemini 3 Pro. https://deepmind.google/models/gemini/pro/. (2025).
[32] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, and Vivek Natarajan. 2025. Towards an AI co-scientist. (2025). arXiv:cs.AI/2502.18864 https://arxiv.org/abs/2502.18864
[33] Deepak Gouda, Alberto Dainotti, and Cecilia Testart. 2025. Prefix2Org: Mapping BGP Prefixes to Organizations (IMC '25). Association for Computing Machinery, New York, NY, USA, 397–414. https://doi.org/10.1145/3730567.3764485
[34] Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. 2025. Glia: A Human-Inspired AI for Automated Systems Design and Optimization. (2025). arXiv:cs.AI/2510.27176 https://arxiv.org/abs/2510.27176
[35] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. (2024). arXiv:cs.AI/2308.00352 https://arxiv.org/abs/2308.00352
[36] Simon Knight, Hung X Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew Roughan. 2011. The internet topology zoo. IEEE Journal on Selected Areas in Communications 29, 9 (2011), 1765–1775.
[37] Rupa Krishnan, Harsha V Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, and Jie Gao. 2009.
Moving beyond end-to-end path information to optimize CDN performance. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. 190–201.
[38] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. (2024). arXiv:cs.AI/2408.06292 https://arxiv.org/abs/2408.06292
[39] Zhuoqing Morley Mao, Jennifer Rexford, Jia Wang, and Randy H Katz. 2003. Towards an accurate AS-level traceroute tool. In Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 365–378.
[40] Neo4j, Inc. 2026. Neo4j Graph Database & Analytics Platform. https://neo4j.com/. (2026).
[41] Chiara Orsini, Alistair King, Danilo Giordano, Vasileios Giotsas, and Alberto Dainotti. 2016. BGPStream: A Software Framework for Live and Historical BGP Data Analysis. In Proceedings of the 2016 Internet Measurement Conference (IMC '16). Association for Computing Machinery, New York, NY, USA, 429–444. https://doi.org/10.1145/2987443.2987482
[42] Tongqing Qiu, Zihui Ge, Dan Pei, Jia Wang, and Jun Xu. 2010. What happened in my network: mining network events from router syslogs (IMC '10). Association for Computing Machinery, New York, NY, USA, 472–484. https://doi.org/10.1145/1879141.1879202
[43] Alagappan Ramanathan and Sangeetha Abdu Jyothi. 2023. Nautilus: A Framework for Cross-Layer Cartography of Submarine Cables and IP Links. Proc. ACM Meas. Anal. Comput. Syst. 7, 3, Article 46 (Dec. 2023), 34 pages. https://doi.org/10.1145/3626777
[44] Alagappan Ramanathan, Eunju Kang, Dongsu Han, and Sangeetha Abdu Jyothi. 2025. Towards an Agentic Workflow for Internet Measurement Research (HotNets '25). Association for Computing Machinery, New York, NY, USA, 61–68. https://doi.org/10.1145/3772356.3772409
[45] Alagappan Ramanathan, Rishika Sankaran, and Sangeetha Abdu Jyothi. 2024.
Xaminer: An Internet Cross-Layer Resilience Analysis Tool. Proc. ACM Meas. Anal. Comput. Syst. 8, 1, Article 16 (Feb. 2024), 37 pages. https://doi.org/10.1145/3639042
[46] Justin Raynor, Tarik Crnovrsanin, Sara Di Bartolomeo, Laura South, David Saffo, and Cody Dunne. 2022. The state of the art in BGP visualization tools: A mapping of visualization techniques to cyberattack types. IEEE Transactions on Visualization and Computer Graphics 29, 1 (2022), 1059–1069.
[47] Nils Reimers and Iryna Gurevych. 2021. all-mpnet-base-v2. https://huggingface.co/sentence-transformers/all-mpnet-base-v2. (2021).
[48] Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bianchini, Aditya Akella, Zhangyang Wang, Mattan Erez, and Esha Choukse. 2025. Sherlock: Reliable and Efficient Agentic Workflow Execution. (2025). arXiv:cs.MA/2511.00330 https://arxiv.org/abs/2511.00330
[49] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. (2024). arXiv:cs.CL/2308.12950 https://arxiv.org/abs/2308.12950
[50] Kevin Vermeulen, Ege Gurmericliler, Italo Cunha, David Choffnes, and Ethan Katz-Bassett. 2022. Internet scale reverse traceroute. In Proceedings of the 22nd ACM Internet Measurement Conference. 694–715.
[51] Kevin Vermeulen, Stephen D Strowes, Olivier Fourmaux, and Timur Friedman. 2018. Multilevel MDA-lite Paris traceroute. In Proceedings of the Internet Measurement Conference 2018. 29–42.
[52] Changjie Wang, Mariano Scazzariello, Alireza Farshin, Simone Ferlin, Dejan Kostić, and Marco Chiesa. 2024.
NetConfEval: Can LLMs Facilitate Network Configuration? Proc. ACM Netw. 2, CoNEXT2, Article 7 (June 2024), 25 pages. https://doi.org/10.1145/3656296
[53] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (March 2024). https://doi.org/10.1007/s11704-024-40231-1
[54] Zhaodong Wang, Samuel Lin, Guanqing Yan, Soudeh Ghorbani, Minlan Yu, Jiawei Zhou, Nathan Hu, Lopa Baruah, Sam Peters, Srikanth Kamath, Jerry Yang, and Ying Zhang. 2025. Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework. In Proceedings of the ACM SIGCOMM 2025 Conference (SIGCOMM '25). Association for Computing Machinery, New York, NY, USA, 347–362. https://doi.org/10.1145/3718958.3750537
[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (2023). arXiv:cs.CL/2201.11903 https://arxiv.org/abs/2201.11903
[56] Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. 2024. NetLLM: Adapting Large Language Models for Networking. In Proceedings of the ACM SIGCOMM 2024 Conference (ACM SIGCOMM '24). Association for Computing Machinery, New York, NY, USA, 661–678. https://doi.org/10.1145/3651890.3672268
[57] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. (2023). arXiv:cs.AI/2308.08155
[58] Ning Xia, Han Hee Song, Yong Liao, Marios Iliofotou, Antonio Nucci, Zhi-Li Zhang, and Aleksandar Kuzmanovic. 2013.
Mosaic: quantifying privacy leakage in mobile networks. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 279–290. https://doi.org/10.1145/2534169.2486008
[59] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. 2025. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. (2025). arXiv:cs.AI/2504.08066 https://arxiv.org/abs/2504.08066
[60] Beichuan Zhang, Raymond Liu, Daniel Massey, and Lixia Zhang. 2005. Collecting the Internet AS-level topology. ACM SIGCOMM Computer Communication Review 35, 1 (2005), 53–61.
[61] Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. 2025. From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery. (2025). arXiv:cs.CL/2505.13259 https://arxiv.org/abs/2505.13259
[62] Yaping Zhu, Benjamin Helsley, Jennifer Rexford, Aspi Siganporia, and Sridhar Srinivasan. 2012. LatLong: Diagnosing wide-area latency changes for CDNs. IEEE Transactions on Network and Service Management 9, 3 (2012), 333–345.

A COST ANALYSIS
All agents in Airavat's experiments use Claude models (Sonnet 4.5 and Opus 4.5); Table 2 summarizes the estimated cost per execution. The Multi-Agent Workflow Generation pipeline comprises four agents with varying computational requirements. Agent 1 (QueryMind) costs $0.06-$0.15 per run for problem characterization and knowledge graph query formulation. Agent 2 (WorkflowScout) costs $0.25-$0.60 per run for workflow design generation, reflecting the more complex reasoning required to synthesize measurement literature into concrete specifications. Agent 3 (SolutionWeaver) costs $0.35-$0.60 per run for code generation from workflow specifications.
Agent 4 (RegistryCurator) costs $0.20-$0.40 per run when evaluating generated workflows against literature-derived quality criteria.

The Verification Engine adds substantial computational overhead for quality assurance. The Synthesizer agent costs $0.50-$0.70 per run in standard mode, or $1.00-$1.50 when performing best-approach comparison with the synthesized workflow. Generating 12 workflow variants (4 models × 3 temperatures) for comprehensive model comparison costs approximately $2.50-$3.00 per run. The Validation Engine comprises two agents: the Strategizer costs $0.16-$0.30 per run for generating validation strategies from measurement literature, while the CodeGenerator costs $0.40-$0.60 per run for translating strategies into executable validation code.

Estimating total costs for all case studies presented in the paper yields a modest overall expenditure. Case Studies 1-3 each required one standard workflow generation run, totaling approximately $3.75 (3 × $1.25 midpoint). Case Study 4 employed comprehensive verification with 3 independent runs of 12 workflow variants each, costing approximately $17.25 (3 × $5.75 midpoint). The Validation Engine demonstration required strategy generation and code generation, adding approximately $0.45. The Knowledge Graph extraction evaluation used only local LLMs (Llama-3.1-8B and Llama-3.3-70B), incurring negligible cloud API costs. Total estimated expenditure for all experiments presented is approximately $21.50, demonstrating the cost-effectiveness of the approach for research-scale evaluation.

Cost optimization represents a significant opportunity for future work. Very little optimization was performed in the current implementation: intermediate outputs from earlier pipeline stages were often passed in their entirety to subsequent agents rather than extracting only relevant content. For example, Agent 2 (WorkflowScout) receives complete Agent 1 (QueryMind) results rather than filtered excerpts.
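The expenditure figures above follow from simple midpoint arithmetic over the per-run cost ranges; a minimal sketch reproducing them (all ranges come from the text, no Airavat internals are assumed):

```python
# Reproducing the Appendix A expenditure estimate with midpoint arithmetic.

def midpoint(lo: float, hi: float) -> float:
    """Midpoint of a per-run cost range, in dollars."""
    return (lo + hi) / 2.0

standard_runs = 3 * midpoint(0.80, 1.70)   # Case Studies 1-3: 3 x $1.25 = $3.75
full_eval_runs = 3 * midpoint(4.50, 7.00)  # Case Study 4: 3 x $5.75 = $17.25
validation_demo = 0.45                     # Validation Engine demonstration

total = standard_runs + full_eval_runs + validation_demo
print(f"${total:.2f}")                     # $21.45, reported as ~$21.50
```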
Selective content extraction could reduce token consumption by an estimated 30-40%, lowering per-run costs to $0.50-$1.00 for standard workflows and $3.00-$4.50 for full evaluation runs. Additional optimizations include caching repeated knowledge graph queries, compressing intermediate representations, and employing smaller models for routine validation checks while reserving powerful models for complex reasoning tasks. These improvements would enhance system affordability while maintaining generation quality, making Airavat more accessible for broader research community adoption.

B KNOWLEDGE GRAPH EXTRACTION DETAILS AND QUALITY
Airavat constructs a Neo4j knowledge graph encoding measurement domain knowledge through semantic embeddings and typed relationships. The entity model includes ten entity types: Papers, Problems, ResearchGaps, Approaches, PipelineSteps, Algorithms, Metrics, Parameters, Datasets, and Validations. Entities connect through typed relationships: Papers PROPOSE Approaches that SOLVE Problems; Approaches USE_DATASET and are VALIDATED_BY validation methodologies; Approaches contain ordered PIPELINE_STEP sequences that USE_ALGORITHM; IMPROVES_UPON relationships capture methodological evolution. This schema enables similarity-based retrieval, relationship traversal for methodology evolution, and constraint-based filtering. Airavat uses MD5 hashing for deduplication and BGE-M3 embeddings for semantic search.

To evaluate the extraction quality of our local LLM approach (Llama-3.1-8B and Llama-3.3-70B), we conducted a validation study using Claude Sonnet 4.5 as an evaluator. We selected seven representative papers covering various Internet measurement topics (submarine cable mapping, network resilience analysis, routing communities, broadband availability, third-party dependencies, and IPv6 allocation) and compared the extraction outputs from our local LLM pipeline against what Sonnet itself would have generated.
We evaluated overlap across the same five key extraction categories: problem statement, methodology, datasets, baselines, and validations. Across the seven papers, the local LLM extraction achieved an average aggregate overlap of 87.3% with Sonnet's extraction, with individual paper scores ranging from 72% to 94%. This demonstrates that cost-effective local LLM extraction achieves strong performance comparable to expensive cloud-based models for populating the knowledge graph, validating our design choice to use local models for large-scale paper processing while reserving cloud LLMs for the agent pipeline's reasoning tasks.

C KNOWLEDGE GRAPH EXTRACTION EXAMPLE
We demonstrate the extraction pipeline's output using the Nautilus paper [43] as a representative example (Figures 4, 5).

Table 2: Cost Analysis Summary
Component | Cost per Run | Notes
Agent 1 (QueryMind) | $0.06 - $0.15 | Query decomposition
Agent 2 (WorkflowScout) | $0.25 - $0.60 | Workflow specification
Agent 3 (SolutionWeaver) | $0.35 - $0.60 | Code generation
Agent 4 (RegistryCurator) | $0.20 - $0.40 | Quality assessment
Standard Workflow | $0.80 - $1.70 | Agents 1-3 only
Verification Synthesizer | $0.50 - $0.70 | Standard mode
Verification Synthesizer | $1.00 - $1.50 | With comparison
12-Variant Generation | $2.50 - $3.00 | Model comparison
Validation Strategizer | $0.16 - $0.30 | Strategy generation
Validation CodeGenerator | $0.40 - $0.60 | Validation code
Full Evaluation | $4.50 - $7.00 | 12 variants + verification + validation

The extraction captures five key categories from measurement papers. Note that the following is a condensed version for illustration purposes; the actual extraction contains more comprehensive details for each section.

Extraction Structure. The knowledge graph extraction system processes papers to extract structured information across five primary categories: Problem Statement, Methodology, Datasets, Baseline Comparisons, and Validations.
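As a small illustration of how these category records and the MD5-based deduplication from Appendix B fit together, a hedged sketch (the case/whitespace normalization before hashing is our assumption, not a stated detail):

```python
import hashlib

# The five extraction categories named in Appendix C.
CATEGORIES = ["Problem Statement", "Methodology", "Datasets",
              "Baseline Comparisons", "Validations"]

def dedup_key(entity_name: str) -> str:
    # Airavat deduplicates entities with MD5 hashing (Appendix B);
    # normalizing case and whitespace first is an illustrative choice.
    return hashlib.md5(entity_name.strip().lower().encode("utf-8")).hexdigest()

record = {category: {} for category in CATEGORIES}  # skeleton extraction record

# Name variants that differ only in case/whitespace collapse to one entity.
print(dedup_key("DBSCAN") == dedup_key(" dbscan "))  # True
```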
Extraction Characteristics. This condensed example in Figures 4, 5 illustrates the structured extraction approach applied across all papers in the knowledge graph. The full extraction contains comprehensive details, including all 10 methodology steps, 19 data sources with detailed characteristics, complete baseline comparisons, and extensive validation experiments. The extraction enables the knowledge graph to answer queries like "What clustering algorithms are used for geolocation?" (DBSCAN with ε = 20 km), "What validation strategies were employed?" (cable failures, targeted measurements, operator maps), and "What were the key parameters?" (SoL threshold = 0.05, radius = 500 km, weights 5:4:1).

D KNOWLEDGE GRAPH CHARACTERISTICS
The knowledge graph constructed for Airavat aggregates structured information from Internet measurement research papers. The graph represents a comprehensive corpus of measurement methodologies, techniques, and validation approaches extracted from published literature. Below, we provide the scale and composition of the knowledge graph to demonstrate its breadth and depth for supporting workflow generation.

The knowledge graph contains 2,021 research papers processed through the extraction pipeline. From these papers, the system extracted 1,944 distinct problem statements characterizing measurement challenges and 3,767 unique approaches describing solution methodologies. The methodology extraction identified 13,813 pipeline steps representing the granular procedural decomposition of measurement workflows. Technical components include 2,956 algorithms (clustering methods, graph traversal techniques, optimization procedures), 7,018 datasets and data sources (measurement platforms, geolocation services, routing databases), and 4,483 parameters (thresholds, weights, configuration values). Validation information comprises 722 validation strategies and 1,016 evaluation metrics extracted from experimental sections.
The knowledge graph maintains 65,250 relationships connecting these entities, enabling traversal between related concepts. These relationships link problems to applicable approaches, approaches to constituent pipeline steps, steps to required algorithms and datasets, and methodologies to validation strategies. The relationship structure enables queries like "What validation strategies were used for geolocation-based approaches?" or "What datasets are required for cross-layer mapping problems?" to return contextually relevant results grounded in measurement literature. This scale demonstrates that the knowledge graph provides substantial coverage of Internet measurement research, capturing diverse problem domains (infrastructure resilience, prefix mapping, network topology inference), methodological techniques (geolocation clustering, BGP analysis, traceroute processing), and validation practices (cable failure analysis, targeted measurements, ground truth comparison). This comprehensive representation enables Airavat agents to reason about measurement problems through literature-grounded context rather than relying solely on LLM parametric knowledge.
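The traversal behind such queries can be illustrated with a toy in-memory graph using the schema's relationship types from Appendix B (the concrete triples here are invented for illustration; the real graph lives in Neo4j):

```python
# Toy triple store with the schema's SOLVE / VALIDATED_BY relationship types.
# Triples are illustrative, not drawn from the actual knowledge graph.
edges = [
    ("PaperA", "PROPOSE", "GeoApproach"),
    ("GeoApproach", "SOLVE", "geolocation-based mapping"),
    ("GeoApproach", "VALIDATED_BY", "targeted measurements"),
    ("GeoApproach", "VALIDATED_BY", "ground truth comparison"),
    ("BgpApproach", "SOLVE", "route leak detection"),
    ("BgpApproach", "VALIDATED_BY", "historical incident replay"),
]

def validations_for(problem: str) -> set:
    """Find approaches that SOLVE the problem, then their VALIDATED_BY targets."""
    approaches = {s for s, rel, t in edges if rel == "SOLVE" and t == problem}
    return {t for s, rel, t in edges if rel == "VALIDATED_BY" and s in approaches}

print(sorted(validations_for("geolocation-based mapping")))
# ['ground truth comparison', 'targeted measurements']
```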
"paper_title": "Nautilus: Framework for Cross-Layer Cartography of Submarine Cables and IP Links",
"extractions": {
  "Problem Statement": {
    "problem_statement": "Mapping IP links to submarine cables accurately",
    "problem_details": {
      "what_is_lacking": "Existing approaches lack accuracy and are coarse-grained",
      "why_its_challenging": "Geolocation inaccuracies, incomplete ownership information, complex topology",
      "scope": "At Internet scale, for critical infrastructure"
    },
    "research_gaps": [
      "Accurate cross-layer mapping",
      "Handling geolocation uncertainties and incomplete data"
    ]
  },
  "Methodology": {
    "approach_overview": "Nautilus uses publicly available datasets and techniques to generate IP link to submarine cable mapping with confidence scores.",
    "pipeline": [
      ...
      {
        "step_number": 2,
        "step_name": "Geolocation Module",
        "description": "Collect and aggregate geolocation from eleven services, classify IP links",
        "algorithms_used": ["DBSCAN"],
        "parameters": {"minPoints": "1", "epsilon": "20km"}
      },
      ...
      {
        "step_number": 10,
        "step_name": "Aggregation & Final Mapping",
        "description": "Combine geolocation and cable owner outputs",
        "parameters": {"weightage": "0.5 geolocation, 0.4 distance, 0.1 ownership"}
      }
    ],
    "algorithms": [
      {
        "name": "DBSCAN",
        "purpose": "Clustering geolocations based on density",
        "parameters": {"minPoints": "1", "epsilon": "20km"}
      }
    ]
  },

Figure 4: A representative subset of the extraction for the Nautilus paper on submarine cable mapping. Continued in Figure 5.

  "Datasets or Data Sources": {
    "data_sources": [
      {
        "name": "RIPE Atlas",
        "what_it_contains": "Traceroute data",
        "how_used": "Collecting traceroutes for IP link extraction",
        "characteristics": {"size": "~120M traceroutes"}
      },
      {
        "name": "Telegeography map",
        "what_it_contains": "Submarine cable information",
        "how_used": "Mapping IP links to submarine cables",
        "characteristics": {"size": "~480 submarine cables"}
      }
    ],
    ...
  },
  "Baseline Comparisons": {
    "baselines": [
      {
        "baseline_name": "SCN-Crit",
        "baseline_approach": "Uses drivability metric, maps cables at country level",
        "comparison_metrics": [
          {
            "metric": "Number of cables predicted per link",
            "improvement": "Nautilus predicts 35% fewer cables per link"
          }
        ],
        "baseline_limitations": [
          "Maps cables at country level",
          "Conservative approach misses potential submarine paths"
        ]
      }
    ]
  },
  "Validations": {
    "validations": [
      ...
      {
        "validation_name": "Submarine Cable Failures",
        "methodology": "Analyze IP link disappearance during documented cable failures",
        "specific_examples": [
          {
            "example_name": "Yemen Outage Analysis",
            "results": {"disappearance_of_links": "106 links disappeared"}
          }
        ]
      },
      {
        "validation_name": "Targeted Traceroutes",
        "methodology": "Targeted measurements between RIPE probes near cable landing points",
        "results": {"match_rate": "77% top prediction match"}
      },
      ...
    ]
  }

Figure 5: Continued from Figure 4 – the representative subset of the extraction for the Nautilus paper on submarine cable mapping.
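Once parsed, a record shaped like the excerpt in Figures 4 and 5 can be queried directly. A minimal sketch, with the dict literal mirroring the figure and a hypothetical helper (in practice the record would be loaded with json.load()):

```python
# Pulling algorithm parameters out of an extraction record shaped like
# the Figures 4-5 excerpt. The helper name is illustrative.
extraction = {
    "Methodology": {
        "algorithms": [
            {"name": "DBSCAN",
             "purpose": "Clustering geolocations based on density",
             "parameters": {"minPoints": "1", "epsilon": "20km"}},
        ],
    },
}

def algorithm_parameters(record: dict, name: str) -> dict:
    """Return the parameter dict for a named algorithm, or {} if absent."""
    for algo in record.get("Methodology", {}).get("algorithms", []):
        if algo.get("name") == name:
            return algo.get("parameters", {})
    return {}

print(algorithm_parameters(extraction, "DBSCAN"))
# {'minPoints': '1', 'epsilon': '20km'}
```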
"query_summary": "Concise 1-2 sentence summary of what user is asking",
"complexity_assessment": {
  "temporal": {"answer": "yes/no", "reasoning": "Explanation"},
  "spatial": {"answer": "yes/no", "reasoning": "Explanation"},
  "causal": {"answer": "yes/no", "reasoning": "Explanation"},
  "stakeholder": {"answer": "yes/no", "reasoning": "Explanation"},
  "data": {"answer": "yes/no", "reasoning": "Explanation"},
  "score": 0,
  "tier": "simple/moderate/complex",
  "rationale": "Overall complexity justification",
  "implications_for_agent2": "What this complexity means for solution design."
},
"sub_problems": [
  {
    "id": "SP1",
    "description": "Detailed sub-problem description",
    "dependencies": ["SP2", "SP3"],
    "priority": "high/medium/low",
    "estimated_difficulty": "Description of difficulty"
  }
],
"constraints": {
  "technical": ["constraint1", "constraint2"],
  "data": ["constraint1", "constraint2"],
  "methodological": ["constraint1", "constraint2"],
  "temporal": ["constraint1", "constraint2"]
},
"success_criteria": {
  "primary": "Main success criterion",
  "secondary": ["criterion1", "criterion2"],
  "validation_approach": "How to validate success"
},
"risks": [
  {
    "risk": "Description of risk",
    "likelihood": "high/medium/low",
    "severity": "high/medium/low",
    "mitigation": "How to mitigate"
  }
],
"registry_mapping": {
  "relevant_functions": [
    {
      "function_name": "name_from_registry",
      "purpose": "What it provides for this query",
      "sub_problems_addressed": ["SP1", "SP2"]
    }
  ],
  "integration_points": ["How functions connect"],
  "gaps": ["What registry doesn't provide"]
},
"recommendations_for_designer": [
  "Recommendation 1 for Agent 2",
  "Recommendation 2 for Agent 2"
]

Figure 6: Output of QueryMind follows this schema to examine complexities of subproblems, evaluate constraints, define success criteria, and map to the relevant registry to provide guidance to WorkflowScout.

Case Study | Registry Functions by System
Cable Impact | nautilus_system: (1) get_nautilus_link_to_cable_mapping, (2) get_lp_id_to_country_dict, (3) get_ip_to_geolocation_mappings, (4) get_ip_to_asn_mappings
Disaster Impact | nautilus_system: (1) get_nautilus_link_to_cable_mapping, (2) get_lp_id_to_country_dict, (3) get_ip_to_geolocation_mappings, (4) get_ip_to_asn_mappings; xaminer_system: (5) generate_cable_segments_to_all_info_map, (6) generate_cable_segment_to_country_as_maps, (7) process_single_event
Cascading Effects | nautilus_system: (1) get_nautilus_link_to_cable_mapping, (2) get_lp_id_to_country_dict, (3) get_ip_to_geolocation_mappings, (4) get_ip_to_asn_mappings; xaminer_system: (5) generate_cable_segments_to_all_info_map, (6) generate_cable_segment_to_country_as_maps, (7) process_single_event; submarine_system: (8) get_country_to_cable_graph; as_dependency_system: (9) get_as_dependency_graph
Prefix2Org | bgp_system: (1) download_bgp_dumps; whois_system: (2) parse_whois_dump; rpki_system: (3) get_rpki_snapshot; as2org_system: (4) get_as2org_mappings
Table 3: Registry functions utilized across case studies.

E WORKFLOW COMPOSITION PATTERNS
WorkflowScout employs four method composition patterns to eliminate the need for manual parameter calibration while maintaining workflow robustness. The MERGE pattern combines multiple measurement methods algorithmically: for example, running two geolocation APIs and merging results through intersection (for high confidence) or union (for coverage), thereby achieving robustness through methodological diversity. The AUTO_CALIBRATE pattern derives parameters automatically from dataset distribution statistics rather than requiring manual tuning, such as calculating thresholds based on the data's statistical properties.
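The MERGE and AUTO_CALIBRATE patterns just described might look like the following sketch (the function names and the percentile rule are illustrative assumptions, not Airavat's actual implementation):

```python
import statistics

def merge(results_a: set, results_b: set, mode: str = "intersection") -> set:
    """MERGE: combine two methods' outputs, intersection for high
    confidence, union for coverage."""
    return results_a & results_b if mode == "intersection" else results_a | results_b

def auto_calibrate(samples, quantile: int = 95) -> float:
    """AUTO_CALIBRATE: derive a threshold from the data's own distribution.
    A 95th-percentile cut is an illustrative choice."""
    return statistics.quantiles(samples, n=100)[quantile - 1]

geo_api_a = {"1.2.3.4", "5.6.7.8"}
geo_api_b = {"5.6.7.8", "9.9.9.9"}
print(merge(geo_api_a, geo_api_b))           # high confidence: {'5.6.7.8'}

latencies = [10.0, 12.0, 11.0, 13.0, 250.0]  # ms, with one outlier
print(auto_calibrate(latencies) > 13.0)      # True: threshold comes from data
```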
The DERIVED pattern extracts parameter values from registry specifications or problem requirements (e.g., timeout values from registry specs, scale from problem constraints), ensuring compatibility and feasibility. Finally, the CONSERVATIVE_DEFAULT pattern applies loose bounds with explicitly documented trade-offs, such as using wide thresholds while documenting their precision-recall implications, which prevents premature data filtering while making the trade-offs transparent. Together, these patterns enable Airavat to generate measurement workflows that avoid the brittle manual parameter tuning typical of traditional approaches while maintaining methodological soundness.

Figure 7: Europe-Asia Cable Failure Cascade Analysis Workflow (Data Input Layer)

F SAMPLE INPUTS AND OUTPUTS

• Figures 4 and 5 show a sample knowledge graph output for a single paper.
• Figure 6 shows the QueryMind schema.
• Table 3 shows the registry functions used by Airavat's automated workflows across the various case studies.
• Figures 7, 8, 9, and 10 show the output workflow generated for the cascading failure analysis.

G ARCHITECTURE DETAILS

Table 4 shows the Verification Engine parameters.

Table 4: Verification Engine parameters.

Parameter | Value | Component | Purpose
Novelty threshold (high) | 0.7 | Evaluator - Stage 2 | Problems above classified as novel
Novelty threshold (low) | 0.4 | Evaluator - Stage 2 | Problems below classified as well-studied
Infeasibility threshold | 60 | Evaluator - Stage 2 | Sub-dimension scores below this trigger rejection
Excellence threshold | 85 | Selector | Workflows above approved directly
Good range lower bound | 80 | Selector | Workflows below always require synthesis
Structural diversity threshold | 0.4 | Selector | Triggers complementarity analysis
Weight adjustment bound | ±0.2 | Evaluator - Stage 3 | Maximum adjustment per dimension
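To make the Selector's thresholds from Table 4 concrete, here is a minimal sketch of how they might gate the decision between direct approval, complementarity analysis, and synthesis. The control flow and decision labels are our illustrative reading of the table, not Airavat's actual implementation.

```python
# Illustrative Selector logic built from the Table 4 thresholds; the helper
# signature and decision labels are assumptions, not Airavat's actual code.

EXCELLENCE_THRESHOLD = 85       # workflows scoring above are approved directly
GOOD_RANGE_LOWER_BOUND = 80     # workflows below always require synthesis
STRUCTURAL_DIVERSITY = 0.4      # above this, trigger complementarity analysis


def select(workflow_scores, diversity):
    """Return one decision per candidate workflow score."""
    decisions = []
    for score in workflow_scores:
        if score >= EXCELLENCE_THRESHOLD:
            decisions.append("approve")
        elif score >= GOOD_RANGE_LOWER_BOUND and diversity > STRUCTURAL_DIVERSITY:
            # good-range workflows that differ structurally: check whether
            # they complement each other before falling back to synthesis
            decisions.append("complementarity_analysis")
        else:
            decisions.append("synthesize")
    return decisions


print(select([92, 83, 70], diversity=0.55))
# -> ['approve', 'complementarity_analysis', 'synthesize']
```

Note how the two Selector bounds partition scores into three bands (above 85, 80 to 85, below 80), with the diversity threshold only mattering in the middle band.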
Figure 8: Europe-Asia Cable Failure Cascade Analysis Workflow (Core Processing Layer)

Figure 9: Europe-Asia Cable Failure Cascade Analysis Workflow (Integration & Cascade Layer)

Figure 10: Europe-Asia Cable Failure Cascade Analysis Workflow (Output Reporting Layer)