Airavat: An Agentic Framework for Internet Measurement


Authors: Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi

Alagappan Ramanathan, University of California, Irvine
Eunju Kang, University of California, Irvine
Dongsu Han, KAIST
Sangeetha Abdu Jyothi, University of California, Irvine

ABSTRACT
Internet measurement faces twin challenges: complex analyses require expert-level orchestration of tools, yet even syntactically correct implementations can have methodological flaws and can be difficult to verify. Democratizing measurement capabilities thus demands automating both workflow generation and verification against methodological standards established through decades of research.

We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation. Airavat coordinates a set of agents mirroring expert reasoning: three agents handle problem decomposition, solution design, and code implementation, with assistance from a registry of existing tools. Two specialized engines ensure methodological correctness: a Verification Engine evaluates workflows against a knowledge graph encoding five decades of measurement research, while a Validation Engine identifies appropriate validation techniques grounded in established methodologies. Through four Internet measurement case studies, we demonstrate that Airavat (i) generates workflows matching expert-level solutions, (ii) makes sound architectural decisions, (iii) addresses novel problems without ground truth, and (iv) identifies methodological flaws missed by standard execution-based testing.

1 INTRODUCTION
Agentic AI systems leveraging Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, from code generation to scientific reasoning [49, 53, 55]. Their ability to decompose complex problems, explore solution spaces, and synthesize executable implementations positions them as powerful tools for automating sophisticated analytical workflows.
However, applying agentic systems to Internet measurement research faces two fundamental challenges that limit their practical deployment.

First, Internet measurement analyses require expert-level orchestration of multiple specialized tools (BGP analyzers [4, 7, 8, 14, 15, 26, 41, 46], traceroute processors [19, 39, 50, 51], topology mappers [36, 43, 60], and performance monitors [5, 6, 10, 20, 45]), each with unique interfaces, data formats, and domain knowledge requirements. When researchers need to understand routing behavior, infrastructure dependencies, or performance anomalies, they must manually integrate different measurement systems through custom solutions.

(Footnote: Sangeetha Abdu Jyothi holds concurrent appointments at UC Irvine and Amazon. This publication describes work performed at UC Irvine and is not associated with Amazon.)

Recent events highlight this challenge's practical impact. The AAE-1 cable cuts [2] and FALCON cable failure [3] caused widespread outages, requiring rapid development of workflows integrating cable mapping, BGP analysis, and traffic flow assessment. Similar challenges arise regularly: understanding CDN performance degradation requires correlating traceroute data with BGP changes [22, 37, 62]; investigating security incidents demands integrating multiple measurement perspectives. While recent measurement frameworks [43, 45] offer powerful capabilities, these tools operate in isolation and require specialized knowledge. Experts must spend days developing measurement workflows before analysis can begin, creating a substantial barrier: the ability to compose advanced measurement workflows requires specialized domain experience, limiting such capabilities to a small community of experts.

Second, unlike agentic systems for system optimization whose evolution is guided by well-defined performance goals, verifying and validating measurement workflows is inherently difficult.
Agentic systems may generate executable code for Internet measurement with methodological flaws that corrupt analytical results while appearing to function properly. Traditional software testing approaches (checking code syntax, validating output formats, and ensuring execution completes) prove insufficient for measurement workflows, where correctness depends on tacit domain knowledge of data artifacts, appropriate preprocessing steps, and methodological precedents established over decades of research. For instance, an agent might correctly implement longest-prefix-match operations yet fail to filter routing table artifacts, producing meaningless results despite successful execution.

We take an alternative view. What if network operators could ask high-level questions in natural language and receive executable measurement workflows in minutes? What if researchers could compose Internet measurement tools without specialized training in each framework while ensuring generated solutions meet established quality standards?

We present Airavat, the first domain-specific agentic framework for Internet measurement workflow generation with systematic verification and validation capabilities. Airavat addresses the dual challenges by automating workflow generation and verification, while also supporting human-in-the-loop oversight to ensure methodological rigor. Our key insight is that measurement workflow development follows predictable patterns decomposed into distinct phases: problem analysis, solution design, and implementation. Airavat executes these phases through three specialized agents (QueryMind, WorkflowScout, and SolutionWeaver) operating on a curated Registry of measurement tools.
To improve trust, Airavat adds two specialized engines that systematically verify and validate generated workflows against established measurement methodologies, ensuring methodological correctness before deployment. Users express measurement goals in natural language, and the system generates executable measurement solutions that provide complete workflows or serve as foundations for expert refinement.

To demonstrate Airavat's capabilities, we present four distinct Internet measurement case studies. Our evaluation demonstrates Airavat's ability to (i) independently generate workflows that produce analytical outputs similar to expert-designed solutions with the relevant solutions removed from the Registry and the Knowledge Graph (§6.1, 6.2), (ii) orchestrate complex analysis across multiple measurement frameworks with significant integration complexity (§6.3), (iii) identify critical methodological flaws missed by standard execution-based testing through systematic verification (§7), and (iv) synthesize validation methods for generated workflows guided by prior literature (§8).

In summary, we make the following contributions.
• We present Airavat, the first domain-specific agentic framework that translates high-level Internet measurement queries into executable workflows.
• We introduce a verification engine in Airavat that evaluates generated workflows against five decades of Internet measurement research encoded in a structured knowledge graph. Unlike execution-based testing, our approach detects methodological flaws that silently corrupt results.
• We design a validation engine that discovers, adapts, and implements appropriate validation techniques from prior literature, enabling measurement workflows to be checked using alternative methods.
• Through extensive Internet measurement case studies, including cross-layer infrastructure resilience and IP allocation analysis, we demonstrate the capabilities of Airavat.
2 AIRAVAT DESIGN OVERVIEW
Building on our vision for an agentic framework for Internet measurement [44], we present Airavat, which tackles the fundamental challenge of generating trustworthy measurement workflows from natural language queries. While our prior work demonstrated workflow generation capabilities, Airavat addresses additional core problems essential for trustworthiness. First, workflow design requires systematically decomposing complex measurement tasks into executable steps. Second, workflow verification and validation must ensure that generated workflows are scientifically rigorous and methodologically sound.

To address these challenges, Airavat integrates specialized subsystems shown in Figure 1. The Multi-Agent Workflow Generation pipeline emulates expert reasoning, with specialized agents handling problem decomposition, solution design, and implementation. The Registry, a catalog of measurement tools, assists this process. The Registry Curator enables evolution of the registry by extracting proven patterns from generated workflows and codifying them as reusable capabilities. To enable workflow verification, we construct the Knowledge Graph, which encodes measurement literature as semantic relationships and validation patterns that enable literature-grounded quality assessment. The Verification Engine evaluates workflow quality through a literature-grounded assessment and addresses identified deficiencies. The Validation Engine generates executable validation code by identifying applicable approaches from the literature and adapting them to specific problems.

Scope. Airavat focuses on workflow composition and code generation for measurement analysis: weaving together existing measurement tools and data sources to generate executable measurement solutions that provide complete workflows or serve as foundations for expert refinement.
Out of scope are distributed measurement collection itself, tool implementation improvements, novel data acquisition, and improving individual measurement tools.

[Figure 1: Airavat's architecture comprising a multi-agent workflow generation pipeline, verification engine, validation engine, knowledge graph, and registry.]

3 WORKFLOW GENERATION
In this section, we detail the key components enabling workflow generation.

3.1 Registry: Tool Base
To reason about workflow composition, agents need structured knowledge about available measurement capabilities. The Registry is a manually curated catalog describing what measurement tools can do, not how they do it. This abstraction emerged from early experiments where exposing entire codebases overwhelmed agents, causing them to miss key capabilities. Each registry entry specifies an open-source tool's capabilities, its required inputs, expected outputs, and operational constraints in a standardized format. This abstraction scales linearly with available tools rather than with codebase complexity: a framework with 10,000 lines of code contributes a single registry entry. The standardized specification enables agents to evaluate tool applicability, understand integration requirements, and compose multi-tool workflows without framework-specific knowledge. We bootstrap the Registry with a set of manually curated tools. The Registry evolves organically as the RegistryCurator (§3.5) identifies reusable patterns from successful workflows and proposes new entries, though all additions undergo manual validation to maintain quality standards.

3.2 QueryMind: Problem Decomposition
The QueryMind Agent transforms user queries into structured problem representations by decomposing them into sub-problems, dependencies, and constraints.
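A structured decomposition of this kind might be represented as follows; this is an illustrative sketch, with field names and values that are assumptions on our part, not Airavat's actual output schema (which appears in Appendix Figure 6):

```python
# Illustrative sketch of a QueryMind-style decomposition. Field names and
# the dependency layout are assumptions, not Airavat's actual schema.
decomposition = {
    "query": "measure CDN performance",
    "sub_problems": [
        {"id": "SP1", "goal": "latency analysis across regions",
         "depends_on": []},
        {"id": "SP2", "goal": "cache behavior evaluation",
         "depends_on": ["SP1"]},
        {"id": "SP3", "goal": "temporal consistency checking",
         "depends_on": ["SP1", "SP2"]},
    ],
    "constraints": ["vantage-point availability", "data-source access"],
    "complexity_score": 3,  # the 0-5 scale described in the text
}

def execution_order(dec):
    """Return sub-problem ids in dependency order (Kahn's algorithm)."""
    pending = {sp["id"]: set(sp["depends_on"]) for sp in dec["sub_problems"]}
    order = []
    while pending:
        # Sub-problems whose dependencies are all satisfied are ready.
        ready = sorted(pid for pid, deps in pending.items() if not deps)
        if not ready:
            raise ValueError("cyclic dependencies")
        for pid in ready:
            order.append(pid)
            del pending[pid]
        for deps in pending.values():
            deps.difference_update(ready)
    return order

print(execution_order(decomposition))  # ['SP1', 'SP2', 'SP3']
```

One benefit of such an explicit dependency structure is that downstream agents can schedule or parallelize sub-problems mechanically rather than re-deriving the ordering from prose.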
This agent solves a specific problem: natural language queries contain hidden complexity and implicit assumptions that must be made explicit before solution design can proceed. Decomposing "measure CDN performance" into latency analysis across regions, cache behavior evaluation, and temporal consistency checking enables subsequent agents to design targeted solutions rather than over-generalized workflows.

We use a task-agnostic prompt in QueryMind to systematically decompose measurement queries into manageable subproblems by examining five dimensions of complexity: temporal (evolution over time), spatial (geographic/network regions), causal (primary/secondary/tertiary effects), stakeholders (multiple perspectives), and data (complementary sources). It assigns a complexity score (0-5) that guides WorkflowScout's exploration strategy. The agent prioritizes early constraint evaluation: even a theoretically excellent workflow is infeasible if the required data or technical infrastructure is unavailable. It then defines success criteria to prevent under-analysis and over-engineering, and maps sub-problems to relevant Registry functions, providing the WorkflowScout Agent with focused guidance (detailed output schema in Appendix Figure 6).

3.3 WorkflowScout: Workflow Design
The WorkflowScout Agent converts structured sub-problems into concrete solution architectures. This separation from implementation is essential: solution design requires exploring multiple competing approaches and evaluating trade-offs, a fundamentally different task from writing executable code. A monolithic approach either skips exploration to focus on implementation, or produces verbose code exploration that obscures the actual solution architecture.

WorkflowScout employs an adaptive exploration strategy that scales to problem complexity. Simple queries (complexity score 0-1 generated by QueryMind) receive direct solutions since alternatives provide minimal benefit.
Moderate complexity (2-3) triggers a primary approach plus 1-2 complementary alternatives targeting missed insights. Complex problems (4-5) require three or more approaches from different analytical perspectives. This strategy reflects measurement research practice: stronger conclusions come from multiple independent methods rather than optimizing a single approach. For high-risk core components (steps that address primary success criteria or critical data transformations), the agent designs alternative approaches under different assumptions and failure modes, ensuring backup options if any approach fails. For method selection, the agent employs ensemble approaches over parameter tuning, combining multiple complementary methods using four composition patterns (detailed in Appendix E). WorkflowScout produces a comprehensive design specification for the SolutionWeaver Agent.

3.4 SolutionWeaver: Code Implementation
The SolutionWeaver converts workflow designs into executable code that integrates heterogeneous measurement tools. Since independently designed frameworks may use diverse data representations, the agent implements format translation using registry specifications to ensure seamless data flow. A critical challenge is that LLMs produce plausible but non-executable code. Airavat addresses this through strict execution requirements, ensuring generated code processes real measurement data with complete implementations. Early experiments revealed systematic failure modes, such as synthetic data generation, placeholder functions, and oversimplified operations, prompting quality assurance through validation checkpoints embedded throughout generation. This produces code requiring minimal manual correction. The agent also documents functions with reusability potential for the Registry Curator Agent.
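The complexity-tiered exploration strategy of §3.3 can be sketched as a small dispatch function; the score thresholds follow the text, while the returned structure is an illustrative assumption:

```python
def exploration_plan(complexity_score: int) -> dict:
    """Map a QueryMind complexity score (0-5) to an exploration strategy,
    following the tiers described in Section 3.3. The returned dict layout
    is an illustrative assumption, not Airavat's actual interface."""
    if not 0 <= complexity_score <= 5:
        raise ValueError("complexity score must be in [0, 5]")
    if complexity_score <= 1:
        # Simple queries: a direct solution; alternatives add little value.
        return {"primary": 1, "alternatives": 0, "perspectives": 1}
    if complexity_score <= 3:
        # Moderate: primary approach plus 1-2 complementary alternatives.
        return {"primary": 1, "alternatives": 2, "perspectives": 2}
    # Complex: three or more approaches from different analytical angles.
    return {"primary": 1, "alternatives": 2, "perspectives": 3}

print(exploration_plan(0))  # {'primary': 1, 'alternatives': 0, 'perspectives': 1}
print(exploration_plan(4))  # {'primary': 1, 'alternatives': 2, 'perspectives': 3}
```

The point of the tiering is economic: exploration effort is spent only where the decomposition indicates genuine ambiguity.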
3.5 Registry Curator: Registry Evolution
The Registry Curator Agent ensures Airavat's capabilities grow organically by identifying reusable patterns from successful workflows and proposing them for registry inclusion. Manual curation does not scale as the system generates more workflows, so the agent analyzes successful implementations to identify data processing utilities, analysis algorithms, and integration functions demonstrating utility beyond the original query context.

Critically, not all patterns merit generalization. The agent employs a validation-first strategy requiring proposed functions to pass four tests: (1) cross-scenario utility (value in 2-3 different use cases), (2) accuracy assessment (correct outputs on test cases), (3) integration compatibility (proper interaction with existing registry functions), and (4) edge case handling (appropriate behavior with invalid/boundary inputs). The agent must generate complete, executable validation code for every proposed function. This strict validation prevents registry bloat, which would increase token costs and degrade agent performance by overwhelming agents with excessive choices. Validation code is run manually to ensure quality control.

4 WORKFLOW VERIFICATION
In this section, we detail the key components of Airavat that enable workflow verification.

4.1 Knowledge Graph
The Knowledge Graph condenses measurement research literature into a queryable resource for the Verification and Validation Engines. The construction pipeline addresses three challenges: preventing LLM hallucination during extraction, supporting both semantic search and relationship traversal, and processing thousands of papers cost-effectively.

Corpus Collection and Classification. Airavat targets four ACM conferences (SIGCOMM, SIGMETRICS, IMC, CoNEXT) and associated journals (POMACS, PACMNET) that capture foundational measurement methodologies and validation techniques from the past five decades.
An automated scraper extracts DOIs and metadata (title, authors, page count, abstract) via the Crossref API, with a 6-page threshold to exclude papers lacking sufficient methodological detail. To focus on measurement research, a two-stage classification pipeline categorizes papers into 24 measurement research areas using a sentence transformer, with low-confidence papers undergoing secondary classification via a zero-shot classifier. This prioritizes efficiency (sentence transformers process the entire corpus while the expensive zero-shot classifier handles uncertain cases), achieving accuracy comparable to cloud-based LLMs at no cost with local LLMs.

Content Extraction. Extracting structured information requires parsing PDFs and identifying content across five predefined categories: problem statements, methodologies, data sources, baseline comparisons, and validations (example in Appendix Figures 4, 5). Airavat uses GROBID to parse PDFs into structured TEI format. Initial single-model extraction encountered hallucination and category confusion. A two-stage architecture addresses this: LLama-3.1-8B first classifies which categories are present in each section, then LLama-3.3-70B extracts content from tagged sections, reducing confusion by restricting extraction to appropriate sections.

Graph Construction and Representation. With the structured information, Airavat constructs a Neo4j knowledge graph encoding measurement domain knowledge through semantic embeddings and typed relationships. The entity model includes ten entity types: Papers, Problems, ResearchGaps, Approaches, PipelineSteps, Algorithms, Metrics, Parameters, Datasets, and Validations. Entities connect through typed relationships (detailed in Appendix B).

Benefits and Transferability.
The graph representation enables semantic search, efficient multi-hop traversal, and pattern discovery, serving as the foundation for the Verification and Validation Engines to query precedents and assess methodological soundness. The system is domain-agnostic: expertise emerges from the graph rather than from hardcoded logic, enabling extensibility by adding relevant papers. Additionally, our graph construction pipeline allows "incremental updates," enabling researchers to incorporate the latest conference proceedings with minimal manual effort.

4.2 Verification Engine
Airavat's Verification Engine assesses the generated workflows, identifies critical gaps, and produces modifications grounded in established techniques. Assessing workflow quality is challenging: unlike algorithm optimization with automated verifiers or performance-based system optimizations, measurement workflows span diverse domains with varying requirements and typically cannot be evaluated by execution due to long-term data collection requirements, large-scale infrastructure needs, and often a lack of well-defined success metrics.

The Verification Engine addresses these challenges with a three-stage pipeline. The Evaluator performs a systematic assessment of the workflow using the Knowledge Graph. The Selector determines the optimal verification strategy based on the Evaluator's scores, selecting whether a given workflow merits direct use, requires enhancement, or benefits from combining with complementary workflows. Finally, the Synthesizer generates refined workflows: either by improving individual workflows in enhancement mode or by combining multiple workflows in hybrid mode. Note that the verification here assesses methodological alignment with prior work rather than guaranteeing the correctness of conclusions.

Evaluator: Multi-Dimensional Assessment.
The Evaluator assesses workflows across five dimensions: literature alignment, novelty, feasibility, simplicity, and robustness. These complementary dimensions serve dual purposes: computing overall quality scores for comparison and revealing specific improvement opportunities for synthesis. The Evaluator employs a three-stage pipeline.

The first stage involves structural validation. This stage validates JSON Schema compliance for intermediate outputs (e.g., the QueryMind output schema, Fig. 6 in the Appendix), confirms that all sub-problems are addressed, verifies that the proposed registry functions exist, and ensures that workflow complexity matches requirements. Workflows that fail structural validation are rejected.

The second stage focuses on literature-grounded scoring using the Knowledge Graph. Three complementary assessment approaches provide different perspectives: (a) Problem-centric assessment searches for similar problems via vector similarity, validating whether methods, steps, datasets, and targets align with approaches used for similar problems (contributes to literature alignment and robustness); (b) Approach-centric assessment searches for similar methods across the literature without problem constraints, detecting transferable cross-domain patterns (contributes to literature alignment, novelty, feasibility, simplicity, and robustness); (c) Collective assessment analyzes patterns frequently appearing in the literature but missing from all workflows, identifying systematic gaps and generating system warnings rather than dimension scores. This stage also validates feasibility through sub-dimension checks (scale handling, complexity, edge-case coverage, error handling), rejecting workflows that fail minimum thresholds (Appendix Table 4).

Finally, the Evaluator assigns the final score through adaptive weighting, which dynamically adjusts dimension priorities based on problem characteristics.
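The adaptive weighting step can be sketched as follows; the base weights and adjustment magnitudes here are illustrative assumptions, and only the adjustment directions follow the text:

```python
def adaptive_weights(is_novel: bool, well_studied: bool,
                     scale_intensive: bool) -> dict:
    """Adjust dimension weights by problem characteristics (Section 4.2).
    Base weights and deltas are illustrative assumptions; only the
    directions of the adjustments come from the paper's description."""
    w = {"literature": 0.25, "novelty": 0.20, "feasibility": 0.20,
         "simplicity": 0.15, "robustness": 0.20}
    if is_novel:
        w["novelty"] += 0.10       # novel problems: weight novelty more
    if well_studied:
        w["literature"] += 0.10    # well-studied: weight literature more
    if scale_intensive:
        w["feasibility"] += 0.10   # scale-intensive: weight feasibility more
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}  # renormalize to sum to 1

def overall_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted sum of per-dimension scores."""
    return sum(dimension_scores[d] * weights[d] for d in weights)

w = adaptive_weights(is_novel=True, well_studied=False, scale_intensive=False)
assert abs(sum(w.values()) - 1.0) < 1e-9
```

Renormalizing after the adjustments keeps scores comparable across problems with different characteristic profiles.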
Novel problems receive increased novelty weight, well-studied problems receive increased literature weight, and scale-intensive problems receive increased feasibility weight. Final scoring produces rankings with comprehensive justifications documenting dimension breakdowns, strengths/weaknesses, and workflow-specific limitations. System warnings capturing collective gaps appear separately from workflow-specific issues.

Selector: Verification Strategy Selection. The Selector determines the optimal verification strategy based on Evaluator scores. Workflows exceeding excellence thresholds (Appendix Table 4) receive immediate approval. Workflows within the good range trigger an enhancement evaluation, which examines dimension weaknesses and workflow-specific limitations. When it identifies actionable issues, synthesis proceeds in "enhancement" mode. Workflows below the good range always trigger synthesis. When multiple proposals exist with structural diversity exceeding thresholds, the Selector performs complementarity analysis. If workflows propose fundamentally different approaches with complementary strengths, synthesis proceeds in "hybrid" mode to combine elements. If workflows are very similar, enhancement mode applies to the top-scoring workflow. The Selector prepares structured input for the Synthesizer, including problem context, workflows to improve or combine, complete evaluation insights, and improvement guidance.

Synthesizer: Generating Improved Workflows. The Synthesizer converts structured input into improved workflows using the top-performing model identified by the Selector. In enhancement mode, prompts include problem context, dimension weaknesses with computation transparency, knowledge graph hints with advisory markers, and domain considerations encouraging practical reasoning. In hybrid mode, prompts emphasize coherent combination guided by complementarity analysis, with explicit attention to common weaknesses.
After generation, the Synthesizer performs response parsing with schema validation, then generates change documentation via a separate call that compares the original and synthesized workflows. This separation ensures both synthesis quality and documentation comprehensiveness. Synthesized workflows maintain schema compatibility with WorkflowScout outputs, enabling seamless handoff to SolutionWeaver.

4.3 Validation Engine
Validation using alternative techniques is critical in measurement research to ensure that collected data accurately reflect reality and measurement tools produce reliable results. Airavat's Validation Engine generates executable validation code for network measurement solutions by discovering applicable validation approaches from the research literature and adapting them to specific problems. The engine operates through a pipeline grouped into three functional components: InsightEngine, Strategizer, and CodeGenerator.

This decomposition enables three key capabilities. First, different validation tasks demand different computational approaches. Second, the pipeline gracefully adapts to varying literature coverage, emphasizing creative synthesis when few existing studies directly address a problem. Third, producing inspectable outputs at each stage allows researchers to understand how validation plans were derived rather than receiving unexplained recommendations.

InsightEngine: Problem Analysis and Knowledge Discovery. InsightEngine performs three integrated functions to enable the synthesis of validation strategies. First, in problem characterization, a local LLM (LLama-3.1-8B) semantically classifies problem characteristics (prediction, detection, temporal/spatial patterns, or causal relationships) without relying on keyword matching.
This enables recognition of problem types across different terminologies and guides subsequent knowledge graph queries and validation-type selection. InsightEngine also identifies high-risk components flagged in WorkflowScout's analysis that require additional validation. Rather than executing potentially unsafe code, InsightEngine uses static analysis to extract implementation details from the generated workflow. Specifically, it parses the Python code into an Abstract Syntax Tree (AST) to identify key elements such as data sources, analytical methods, pipeline steps, and output structures. This extracted information then informs the queries sent to the Knowledge Graph to retrieve relevant validation strategies.

Second, to balance query specificity and coverage, InsightEngine uses multidimensional querying across seven dimensions: similar problems, methods, data sources, analysis pipelines, validation keywords, suitable ground-truth datasets, and domain-specific metrics. The local LLM first extracts technical terms and validation keywords from the problem description and implementation. Each query returns papers with their validation metadata (methodologies, ground truth sources, metrics, limitations), enabling the discovery of applicable approaches even when no single paper directly addresses the problem.

The third function assesses problem novelty. InsightEngine computes semantic similarity between the current problem and retrieved validation approaches using embedding models, capturing relationships that keyword matching would miss. High similarity indicates well-studied problems amenable to adapting proven methods. Low similarity signals novel problems requiring creative synthesis. For borderline cases, the LLM assesses whether semantically similar papers address comparable problems or differ in critical assumptions.

Strategizer: Filtering and Adaptation.
Strategizer employs an LLM to critically evaluate retrieved validation approaches, determine applicability, adapt them to the current problem, and select complementary strategies. This requires sophisticated reasoning because literature approaches rarely apply directly: papers may assume different data availability, constraints, or measurement capabilities.

Strategizer receives extensive context from InsightEngine: problem characteristics, SolutionWeaver's implementation details, knowledge graph results with relevance scores, high-risk components, available tools and datasets from the Registry, and the novelty assessment. The synthesis process emphasizes critical filtering, identifying which approaches are actually applicable and explicitly documenting why others are unsuitable, preventing acceptance of semantically similar but practically inapplicable validations.

Three filtering rules guide this process. (i) Ground truth comparison is only recommended when ground truth demonstrably exists in the registry and matches the problem requirements. Recommending unavailable or mismatched ground truth wastes effort and provides no validation value. (ii) Alternative method validation is only suggested when the alternative method has proven reliability documented in the literature. Comparing against an unverified alternative provides no confidence gain. (iii) Validation strategies must provide complementary rather than redundant perspectives, examining different aspects such as end-to-end correctness, component-level accuracy, and internal consistency rather than repeatedly checking the same properties.

For each recommended strategy, Strategizer specifies its applicability with supporting literature, adaptation from the original approach, feasibility given available data and tools, metrics with interpretation guidance, and how it complements other strategies.
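The three filtering rules can be expressed as simple predicates over a candidate strategy; this is a minimal sketch, with all field names and example values invented for illustration:

```python
def passes_filters(candidate: dict, registry_ground_truth: set,
                   proven_methods: set, chosen_aspects: set) -> bool:
    """Apply the three Strategizer filtering rules (Section 4.3) to a
    candidate validation strategy. All field names are illustrative
    assumptions, not Airavat's actual representation."""
    # (i) Ground-truth comparison only if matching ground truth exists
    #     in the registry.
    if candidate["type"] == "ground_truth":
        if candidate["dataset"] not in registry_ground_truth:
            return False
    # (ii) Alternative-method validation only against methods with
    #      documented, proven reliability.
    if candidate["type"] == "alternative_method":
        if candidate["method"] not in proven_methods:
            return False
    # (iii) The strategy must add a complementary perspective, not
    #       re-check an aspect already covered.
    return candidate["aspect"] not in chosen_aspects

# Toy example: a ground-truth strategy whose dataset is available.
gt_datasets = {"caida-itdk"}
ok = passes_filters({"type": "ground_truth", "dataset": "caida-itdk",
                     "aspect": "end-to-end"},
                    gt_datasets, {"rtt-triangulation"}, set())
print(ok)  # True
```

In the real system an LLM performs this reasoning with documented justifications; the predicates above only capture the hard accept/reject structure of the rules.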
Validation strategies must trace to specific papers from the database and align with verified data availability rather than fabricating approaches or references. The grounding in knowledge graph results, combined with explicit feasibility checking, reduces hallucination by constraining generation through factual anchors.

CodeGenerator: Producing Executable Validation Code. CodeGenerator translates the validation plan into executable code using an LLM. Validation strategies typically span diverse types requiring different implementation approaches: ground truth comparison, alternative method validation, consistency checking, component testing, and sample verification each have distinct implementation requirements. Templates would impose a rigid structure and fail to support creative strategies tailored to novel problems. The LLM observes SolutionWeaver's code to match its style, including documentation standards and coding patterns, while adapting to diverse validation approaches.

5 IMPLEMENTATION
The prompts instruct all agents to generate Python code.

Workflow Generation: Airavat employs Claude Opus 4.5 [17] and Claude Sonnet 4.5 [18] for the core agents in workflow generation. State-of-the-art cloud models provide the best available performance for the core functionality. The prompts are model-agnostic and do not rely on vendor-specific capabilities. However, empirical evaluation shows that Claude variants perform most consistently (§7.2).

Workflow Verification: The Knowledge Graph construction pipeline uses the mpnet-base-v2 [47] sentence transformer for categorizing papers and BART-MNLI [13] as the zero-shot classifier. GROBID [1] is used to parse the papers, followed by LLama-3.1-8B for tagging categories in each section, and LLama-3.3-70B for content extraction. The construction itself relies on Neo4j [40].
The resulting graph contains 35,719 nodes across 2,021 papers, spanning over 65,000 relationships across eight entity types (detailed statistics in Appendix D). The Verification and Validation Engines rely on four cloud-based LLMs (Claude Opus 4.5 [17], Claude Sonnet 4.5 [18], Gemini-3-Pro [31], Gemini-3-Flash [30]) for generating multiple variants in LLM-assisted stages and BAAI BGE-M3 [23] for embeddings.

Cost: Generation agents use Claude models (Sonnet 4.5 and Opus 4.5). Standard workflow generation costs $0.80-$1.70 per run (Agents 1-3), while comprehensive evaluation with 12 workflow variants, verification, and validation costs $4.50-$7.00. Total expenditure for all case studies is $21.50, demonstrating cost-effectiveness for research-scale evaluation (detailed cost breakdown in Appendix A). Generation time for all workflows was under 10 minutes, significantly shorter than the days or weeks required by human experts.

6 EVALUATION: GENERATION

We demonstrate Airavat's workflow generation capabilities using four case studies spanning infrastructure resilience and IP allocation analysis. All evaluations use Claude Opus 4.5 unless mentioned otherwise. We organize the subsections based on demonstrated capabilities: expert solution replication, judicious tool selection, novel problem solving, and domain transferability. Table 1 summarizes our results. We evaluate workflows along three axes: architectural coherence, correctness relative to expert baselines, and robustness to known data artifacts.

6.1 Expert Solution Replication

Capability Under Test: Can Airavat derive workflows functionally equivalent to expert-designed solutions without domain-specific architectural guidance and without the target solution in the Registry/Knowledge Graph?
Airavat Query: "Identify the impact of SeaMeWe-5 cable failure at a country level" (Cable Impact Analysis Case Study)

Background: Xaminer [45] is an open-source cross-layer Internet resilience analysis framework that addresses such queries by aggregating metrics across the cable, IP, and AS layers.

Setup: We provide Airavat with the standard registry, including core measurement functions from Nautilus [43], an open-source cross-layer submarine cable mapping framework, but deliberately excluding Xaminer. This setup ensures the system relies on analytical reasoning rather than following the pre-existing solution. The query requires understanding cable dependencies, extracting affected IP addresses, performing geographic mapping, and aggregating country-level impacts; such analysis traditionally requires domain expertise and manual framework integration.

Workflow Comparison. Both systems perform physical topology analysis (landing point identification) and traffic topology analysis (IP link processing). Xaminer employs embedding modules that pre-aggregate cross-layer metrics at country and AS-level abstractions. Airavat instead uses multi-source evidence fusion, assigning confidence scores based on source agreement (landing points, IP geolocation, AS registration) and separately tracking direct versus indirect landing station impacts.

Results. Airavat successfully replicates Xaminer's analysis, producing perfectly matching results across all impact metrics (Figure 2). Airavat generates 850 lines of Python code implementing the complete analysis pipeline using 4 registry functions (listed in Table 3). At the IP level, the most impacted countries are India (8,754 IPs), Malaysia (7,170), and Singapore (7,082). At the link level, the countries most affected by raw counts are Singapore (58.1K links), Germany (34.6K), and France (22.5K). At the AS level, the most impacted countries are Indonesia (869 ASes), Bangladesh (731), and Singapore (634).
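To make the two aggregation ideas concrete, multi-source country attribution with agreement-based confidence, and the per-country risk factor reported in Figure 2, can be sketched as follows. This is our illustrative sketch with hypothetical data layouts, not the generated 850-line workflow.

```python
from collections import Counter

def fuse_country(evidence):
    """Confidence-weighted country attribution for one IP.

    `evidence` maps source name -> country inferred by that source
    (landing points, IP geolocation, AS registration). Confidence is
    the fraction of sources agreeing on the majority answer."""
    votes = Counter(evidence.values())
    country, agree = votes.most_common(1)[0]
    return country, agree / len(evidence)

def risk_factor(affected_ips, total_ips):
    """Fraction of a country's mapped IPs affected by the failure
    (the normalized impact metric shown in the heatmap)."""
    return {c: affected_ips.get(c, 0) / total_ips[c] for c in total_ips}

# Two of three sources agree on India for this hypothetical IP.
ip_evidence = {"landing_point": "IN", "geolocation": "IN", "as_registration": "SG"}
print(fuse_country(ip_evidence))  # majority country, agreement 2/3

# Affected count for India from the text; totals are made up.
print(risk_factor({"IN": 8754}, {"IN": 100000, "SG": 50000}))
```

Tracking agreement per source, rather than pre-aggregating as Xaminer does, is what lets the workflow separate high-confidence from contested country attributions.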
When examining normalized IP link impact (called risk factor by Xaminer), both systems identify Afghanistan, Eritrea, and Djibouti as showing the highest normalized impact, followed by Nepal, Bhutan, and Bangladesh (Fig. 2). The perfect agreement with Xaminer is expected: both systems ultimately invoke the same component functions and operate on identical input datasets. Additionally, the generated code includes comprehensive error handling, intermediate result validations, and clear documentation. This case study did not require any manual modifications.

Table 1: Case Study Summary

Case Study      | Problem               | LoC       | Key Capability                        | Results
1: Cable Impact | SeaMeWe-5 failure     | ~850      | Expert solution replication           | Matches the expert solution
2: Disaster     | Earthquake/hurricane  | ~700      | Workflow simplicity                   | Appropriate expert-level minimal solution
3: Cascading    | Europe-Asia cables    | ~1600     | 9-function orchestration              | Multi-layer integration framework
4: Prefix2Org   | Prefix-to-org mapping | 1500→1700 | Domain transferability + verification | 0% → 70-75% (vague query); 0% → 90.9% tags (improved query, w/ verification)

Figure 2: Country-level impact from SeaMeWe-5 cable failure. The heatmap shows the risk factor, computed as the fraction of IP addresses affected in a country relative to the total number of IP addresses mapped to that country. The results match the findings of Xaminer [45].

Key Insight: Airavat replicates expert-level analytical workflows through systematic reasoning about evidence fusion and confidence scoring, demonstrating that measurement expertise follows capturable compositional patterns.

6.2 Judicious Tool Selection

Capability Under Test: When multiple tools are available, will Airavat select minimal sufficient functionality or over-engineer solutions with unnecessary complexity?
Airavat Query: "Identify the impact of severe earthquakes and hurricanes globally assuming a 10% infrastructure failure probability" (Natural Disaster Analysis Case Study)

Setup: We provide registry functions from multiple measurement frameworks, including Xaminer, to evaluate WorkflowScout's architectural decision-making. The challenge tests whether the system recognizes that Xaminer's event-processing capability can support multi-disaster analysis by systematically applying it to each disaster type, thereby avoiding unnecessary cross-framework orchestration.

Figure 3: Global cable infrastructure impact from earthquakes and hurricanes. The heatmap shows the risk factor, computed as the fraction of IP links affected in a country relative to the total number of IP links in that country. The results match the findings of Xaminer [45].

Workflow Comparison. Airavat demonstrates appropriate architectural restraint. Rather than orchestrating multiple specialized frameworks, the system identifies that Xaminer's event-processing function can handle both disaster types when applied systematically. The workflow processes earthquakes (MMI ≥ 3) and hurricanes (wind speed ≥ 75 knots) separately with 10% failure probabilities, then merges results through union-based aggregation. While Xaminer makes a single call to generate combined results for both disasters, Airavat's separation enables disaster-specific impact statistics while maintaining architectural simplicity. Importantly, while the Xaminer implementation uses 7 registry functions (Table 3) to establish the analytical foundation, Airavat recognizes that one function can handle both types of disaster without requiring separate frameworks for each scenario.

Results. Airavat generates approximately 700 lines of Python code to implement the multi-disaster analysis.
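The per-disaster filtering and union-based merge described above can be sketched as follows. The event records are hypothetical, and sampling individual asset failures at the stated 10% probability is one plausible reading of how the failure probability is applied; only the severity thresholds come from the text.

```python
import random

# Hypothetical disaster events; the MMI >= 3 and wind >= 75 kt severity
# thresholds and the 10% failure probability come from the workflow text.
events = [
    {"type": "earthquake", "mmi": 5.2, "assets": ["cable-A", "cable-B"]},
    {"type": "earthquake", "mmi": 2.1, "assets": ["cable-C"]},    # below MMI 3
    {"type": "hurricane", "wind_kt": 90, "assets": ["cable-B", "cable-D"]},
    {"type": "hurricane", "wind_kt": 60, "assets": ["cable-E"]},  # below 75 kt
]

def impacted(events, keep, fail_prob=0.10, seed=0):
    """Assets of severity-filtered events that fail under sampled failures."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {a for ev in events if keep(ev)
            for a in ev["assets"] if rng.random() < fail_prob}

# Process each disaster type separately, then merge by set union.
quake_hit = impacted(events, lambda e: e["type"] == "earthquake" and e["mmi"] >= 3)
storm_hit = impacted(events, lambda e: e["type"] == "hurricane" and e["wind_kt"] >= 75)
combined = quake_hit | storm_hit  # union-based aggregation
```

Keeping the two sampled sets separate before the union is what enables the disaster-specific impact statistics the text mentions, while the union reproduces the single combined result.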
Execution against real disaster data produces results perfectly matching Xaminer's multi-disaster impact analysis (Figure 3).

Key Insight: Airavat demonstrates appropriate architectural judgment by avoiding unnecessary complexity and selecting solutions that meet requirements without over-engineering.

6.3 Novel Problem Solving

Capability Under Test: Can Airavat tackle research problems without established solutions, enabling analyses previously impractical due to integration complexity?

Airavat Query: "Analyze the cascading effects of submarine cable failures between Europe and Asia" (Cascading Failure Analysis Case Study)

Challenge: Unlike previous case studies with expert-designed solutions for validation, cascading failure analysis across continents has no established implementation. Manual development would require: (i) expertise across infrastructure mapping, impact analysis, and AS dependency tracking frameworks; (ii) days of integration engineering; and (iii) specialized knowledge of cross-layer synthesis techniques. This barrier makes such analysis impractical for most researchers.

Setup: We evaluate whether Airavat can orchestrate complex multi-framework workflows for exploratory research. The analysis requires integration across infrastructure mapping, AS-level dependency tracking, and cross-layer synthesis, traditionally requiring substantial manual engineering.

Workflow Structure. Airavat produces a 1,600-line implementation orchestrating 9 registry functions across three analytical layers (Table 3):

Physical Infrastructure Analysis. Identifies submarine cables connecting Europe and Asia through geographic filtering of landing points. Using country-to-cable graph data and landing point mappings, it employs a MERGE strategy combining geographic filtering with cable-name-based identification.

AS-Level Dependency Analysis. Captures cascading effects through autonomous system relationships.
Extracts AS numbers from affected cables, loads AS dependency graphs, and implements graph traversal to trace secondary impacts, distinguishing primary from secondary failures.

Country-Level Impact Assessment. Integrating physical and logical layers, it consolidates cable segments with IP-to-ASN mappings and geolocation data, creates indexed embeddings for country/AS queries, and simulates multiple failure scenarios producing quantitative impacts across cable segments, IP links, IPs, AS links, and ASes.

Results. Without ground truth, we assess workflow quality through architectural coherence, tool integration appropriateness, and analytical reasoning. The generated solution demonstrates: (1) correct understanding of measurement tool capabilities across frameworks, (2) appropriate cross-layer integration, and (3) specialized domain reasoning for cascade analysis. Execution against real infrastructure and routing data produces comprehensive vulnerability assessments that provide actionable starting points for further investigation. Figures 7, 8, 9, and 10 visually show the workflow layers generated by Airavat.

Key Insight: Airavat handles complex multi-framework integration for never-solved-before problems, lowering barriers to sophisticated exploratory analysis while maintaining research-quality reasoning across network layers.

6.4 Domain Transferability

Capability Under Test: Can Airavat adapt to different measurement problems?
Airavat Query: "Map BGP-routed prefixes to owner organizations, identifying Direct-Owner allocations, Delegated Customer chains, and organizational consolidation for address blocks scattered across WHOIS records" (Prefix-to-Organization Mapping Case Study)

Background: Prefix2Org [33] maps BGP prefixes to organizations, distinguishing Direct Owners (provider-independent allocations) from Delegated Customers (sub-delegations) through WHOIS parsing, longest-prefix-match using radix tries, and organizational clustering across heterogeneous WHOIS formats from all RIRs.

Setup and Challenge: This case study focuses on IP allocation analysis (a significantly different measurement problem from prior case studies on infrastructure resilience) under deliberately more stringent constraints. Registry functions provide raw BGP/WHOIS data with minimal preprocessing (Table 3), unlike previous case studies where substantial processing occurs upstream. We do not define key terms in the query, such as "Direct Owner" and "Delegated Customer". This tests whether LLM reasoning can infer semantic distinctions from context alone. We generate 12 workflow variants using four models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini-3-Pro, Gemini-3-Flash) at three temperature settings (0.0, 0.5, 1.0) and evaluate against Prefix2Org monthly ground truth data covering over 1 million BGP-routed prefixes.

Domain Transfer Results. WorkflowScout successfully captures high-level workflow methodology for this new domain, generating specifications that incorporate domain-appropriate techniques including radix trie construction for longest-prefix-match operations, WHOIS format parsing across heterogeneous RIR schemas, and organizational name matching for entity consolidation.
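The longest-prefix-match step at the core of this methodology can be illustrated with a minimal standard-library sketch. This is a linear scan for clarity; Prefix2Org itself uses radix tries for efficiency, and the example allocations are hypothetical.

```python
import ipaddress

def longest_prefix_match(addr, allocations):
    """Return (network, org) for the most specific allocation covering
    `addr`, or None. Linear scan; a radix trie gives the same answer
    in O(prefix length) per lookup."""
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, org in allocations.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, org)
    return best

# Hypothetical allocations: a direct allocation with a sub-delegation.
allocations = {
    "203.0.113.0/24": "ExampleNet (Direct Owner)",
    "203.0.113.128/25": "CustomerCo (Delegated Customer)",
}
print(longest_prefix_match("203.0.113.200", allocations))
# matches the more specific /25, i.e. the delegated customer
```

This is also where the 0.0.0.0/0 bug discussed next bites: an unfiltered default route has prefix length 0 and covers every address, so it becomes the match of last resort for prefixes with no other covering allocation.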
The generated workflows process completely different data sources (bulk WHOIS records, BGP dumps, RPKI certificates) using methodology distinct from cable-based analysis, demonstrating that Airavat adapts to new measurement domains rather than simply templating from previous solutions.

All 12 generated workflows execute successfully and produce output in expected formats, validating domain transfer at the architectural and implementation levels. However, comparison against Prefix2Org ground truth reveals 0% correct mappings. Every workflow contains an identical critical bug: failing to filter 0.0.0.0/0 default routes from BGP dumps before hierarchical tree traversal operations, causing all prefixes to incorrectly map to catch-all allocations. This demonstrates that subtle domain-specific data quality requirements, which exist in tacit expert knowledge rather than formal specifications, require verification mechanisms to detect and address them. We discuss how the Verification Engine helps improve the result next.

7 EVALUATION: VERIFICATION

Using the Prefix2Org case study [33], which demonstrated that all generated workflows failed despite syntactically correct code (§6.4), we demonstrate the verification capabilities of Airavat. Specifically, we illustrate the following capabilities: bug detection, comparative model evaluation (including the filtering of faulty workflows), and the impact of query clarity on performance. Verification success is measured by the detection of known methodological flaws and improvement in downstream accuracy.

7.1 Literature-Grounded Bug Detection

Goal: Identify subtle methodological flaws that execution testing cannot detect: bugs that produce syntactically correct code executing successfully yet yielding incorrect measurement results.
Experimental Setup: We generate workflows using four models (Claude Sonnet 4.5, Claude Opus 4.5, Gemini-3-Pro, Gemini-3-Flash) at three temperatures (0.0, 0.5, 1.0), producing 12 configurations. Each configuration is run once, yielding a total of 12 generated workflows.

The Bug: Without verification, all 12 baseline workflows contained an identical critical bug: failing to filter 0.0.0.0/0 default routes before hierarchical tree traversal operations, causing every prefix to match the catch-all route. Additionally, workflows failed to filter overly broad WHOIS allocations (IPv4 broader than /8, IPv6 broader than /16) representing RIR administrative records rather than operational allocations. This preprocessing requirement exists in tacit expert knowledge but rarely appears in published algorithmic descriptions, making it invisible to LLMs despite correct high-level methodology.

Bug Detection: The Verification Engine detects and repairs bugs in three stages. The evaluator scored workflows against the quality dimensions, assigning scores below the excellence threshold and triggering enhancement mode. It then compared the workflows against best practices embedded in the Knowledge Graph, identified systematic data quality issues, and generated two warnings. The first warning concerns BGP data quality and the need to remove unwanted prefixes, which correctly identifies the bug under consideration. The second is about inferred topology validation, which proved less relevant. These warnings are derived from literature patterns: papers in the knowledge graph discuss default route filtering, establishing it as a domain best practice. Synthesizer uses these warnings to generate enhanced workflows.

Results: Workflows improved from 0% accuracy to operational correctness with one targeted fix guided by automated detection. Synthesizer correctly fixed the 0.0.0.0/0 bug for BGP prefixes, but issued a warning for the same issue for WHOIS records.
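The targeted fix amounts to a small preprocessing pass before any hierarchical traversal. A sketch of the filters described in this section follows; the bogon list is abbreviated and illustrative, not the one the synthesized workflow uses.

```python
import ipaddress

# Abbreviated, illustrative bogon list (IPv4 examples only).
BOGONS = [ipaddress.ip_network(p) for p in
          ("0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "192.168.0.0/16")]

def filter_prefixes(prefixes):
    """Drop prefixes that must not enter the prefix tree."""
    kept = []
    for p in prefixes:
        net = ipaddress.ip_network(p)
        # Drop default routes (0.0.0.0/0 and ::/0): prefix length 0
        # matches every address during tree traversal.
        if net.prefixlen == 0:
            continue
        # Drop RIR administrative super-allocations: IPv4 broader than /8,
        # IPv6 broader than /16.
        if (net.version == 4 and net.prefixlen < 8) or \
           (net.version == 6 and net.prefixlen < 16):
            continue
        # Drop bogon space (v4 example list only).
        if net.version == 4 and any(net.subnet_of(b) for b in BOGONS):
            continue
        kept.append(p)
    return kept

print(filter_prefixes(["0.0.0.0/0", "4.0.0.0/6", "10.1.0.0/16", "203.0.113.0/24"]))
# ['203.0.113.0/24']
```

Each rule is a one-line check, which is precisely why the requirement lives in tacit expert knowledge: the fix is trivial once known, but nothing in the execution output of an unfiltered workflow signals that it is missing.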
The synthesized workflow also automatically implemented bogon prefix filtering and invalid AS record elimination, with additional warnings flagging suspicious patterns (e.g., large numbers of ASNs assigned to the U.S. DoD). Because the WHOIS issue triggered only a warning, unlike the automatically updated workflow in the BGP case, LLM-guided manual correction of the WHOIS 0.0.0.0/0 record was required. Note that this correction is also fully automated, without manual intervention, when the query has better clarity (§7.3).

Key Insight. Literature-grounded verification detects bugs missed by standard execution-based testing by comparing generated workflows against best practices extracted from published research. When full automatic repair isn't possible, the system provides graceful degradation: explicit warnings with context enable quick manual correction rather than leaving bugs undetected.

7.2 Comparative Model Evaluation

Goal: Systematically evaluate model-specific failure patterns in the Verification Engine.

Experimental Design: We consider the same 12 configurations as in §7.1, with three runs per configuration generating a total of 36 workflows.

The Bug: We consider the same bug as in §7.1.

Model-Specific Failure Detection: The Verification Engine's three-stage pipeline detected systematic model-specific patterns. Structural validation automatically filtered either Gemini-3-Pro or Gemini-3-Flash at a temperature of 0.0 in all three runs due to malformed outputs (incomplete schema compliance, missing fields, invalid JSON). This aligns with Google's warning against very low temperatures for Gemini-3 models [28]. Dimension scoring revealed robustness failures across the remaining Gemini configurations. Most were tagged infeasible due to missing edge case handling and error scenarios documented in prior IP allocation research. This matches Google's documentation indicating Gemini's preference for direct answers over comprehensive robustness [29].
In 1 of 3 runs, Opus 4.5 (temperature 1.0) was tagged infeasible for excessive complexity exceeding literature norms. Quality ranking consistently placed Sonnet 4.5 variants highest across all runs (scores 68-72), validating our design choice to employ Claude variants for the agent pipeline.

Key Insight: The Verification Engine identifies model-specific failure patterns through empirical assessment rather than requiring manual expertise about LLM-specific behaviors and limitations. Systematic evaluation across architectures and temperatures reveals that model selection significantly impacts workflow quality, with structural validation and robustness scoring providing objective selection criteria.

7.3 Query Clarity Impact Assessment

Goal: Evaluate how query precision affects LLM reasoning quality during verification by comparing workflows generated from vague versus refined queries.

Experimental Design: We compare two query variants for the same Prefix2Org task. (i) Vague query (baseline): the Airavat Query in §6.4, which deliberately omits definitions for "Direct Owner" and "Delegated Customer". (ii) Refined query: we add targeted hints, without complete definitions, for Direct Owners and Delegated Customers.

Performance Impact of Query Clarity: With the vague query (no Direct/Delegated definitions), the corrected workflow generated by the Verification Engine achieved 97.6% overlap for origin AS identification and 60-65% for Direct Owner/Delegated Customer (DO/DC) WHOIS tag classification, improving to 70-75% after excluding incomplete LACNIC data. This represents substantial improvement from 0% correctness without verification. With the refined query, before verification, all 12 workflows still contained the identical 0.0.0.0/0 bug.
After veri- cation, the overlap for origin AS identication remained the same, and the DO/DC WHOIS tag identication improved to 90.9% (correctly identied 20 of 22 tags, with only two LA C- NIC tags misclassied). The workow also include d RPKI validation as a fallback (similar to the paper [ 33 ]) for low- condence tags. For this improved query , both the BGP and WHOIS 0.0.0.0/0 issue was automatically detecte d and xed during synthesis, requiring no manual corrections. Ke y Insight: Query precision dramatically impacts LLM semantic reasoning quality for problems requiring domain- specic interpretation b eyond technical implementation. Even partial clarication (hints rather than complete denitions) improves o wnership classication accuracy from 60-65% to 90.9%, demonstrating that query engineering complements verication in achieving high-quality worko ws. 8 EV ALU A TION: V ALID A TION W e demonstrate the V alidation Engine capabilities with the Cable Impact Analysis Case Study (§ 6.1). The worko w analyzes country-level impact from SeaMe W e-5 cable failure, producing metrics spanning cable segments, IP links, ASes, and geographic distribution. For this e valuation, we generate a new knowledge graph by removing Xaminer pap er [ 45 ] to prevent the validation engine from directly identifying techniques from the paper . The V alidation Engine generated nine validation strate- gies spanning multiple validation types: three system-level validations, three component validations, one consistency validation, and two synthesized validations. W e show ho w the engine identies validation strategies in the existing lit- erature and develops new ones when pr ece dents are absent. [V1] System-Level V alidation: Landing Point Coun- try Coverage. InsightEngine’s problem-centric knowledge graph queries identied POPsicle’s [ 25 ] validation approach comparing network topology predictions against authorita- tive sour ces. 
The Strategizer adapted this to submarine cable validation, proposing comparison of the workflow's country list against publicly documented SeaMeWe-5 landing points from TeleGeography and SubmarineCableMap [9].

[V2] Component Validation: Cable Segment Identification. InsightEngine flagged the Cable Segment Identification component as high-risk based on WorkflowScout's analysis, triggering targeted component-level validation. The Strategizer adapted SyslogDigest's template validation approach [42] (comparing identified patterns against ground truth) to cable identification, proposing dual verification through string matching across naming variants (SeaMeWe-5, SMW5, SMW-5) and landing point pair matching. The strategy validates that identified segments connect known landing point pairs and form geographically coherent routes, isolating cable identification accuracy from downstream aggregation errors.

[V3] Consistency Validation: Multi-Source Impact Metric Consistency. InsightEngine's approach-centric queries identified Tessellation's multi-source traffic attribution validation [58]. The Strategizer adapted this to country attribution consistency checking across three independent sources: IP geolocation, ASN mappings, and landing point data. The strategy computes consistency scores (the proportion of sources agreeing) and identifies countries with high versus low source agreement, providing confidence indicators for impact assessments when single ground truth is unavailable.

[V4] Synthesized Validation: Historical Cable Outage Comparison. InsightEngine's novelty assessment indicated no direct precedent for submarine cable failure impact validation, triggering creative synthesis.
The Strategizer synthesized this approach by adapting temporal validation patterns from the network disruption detection literature, proposing application of the workflow to documented historical outages (the 2008 Mediterranean cuts and the 2020 AAE-1 issues) and comparison of predicted impacts against reported impacts. Critically, the strategy leverages IODA traffic measurements and RIPE Atlas data during historical cable failures, matching the validation methodology employed in the original Xaminer paper [45] for infrastructure resilience analysis. This demonstrates the Validation Engine's ability to discover and propose validation approaches that align with established measurement research practices.

Methodological Quality Assessment. The generated strategies exhibit multiple quality indicators. First, literature grounding ensures strategies adapt proven validation patterns rather than fabricating approaches. Second, complementarity ensures strategies examine different correctness dimensions: V1 validates geographic scope, V2 validates foundational component accuracy, V3 validates data integration quality, and V4 provides external validation through historical comparison. Third, feasibility assessment ensures all strategies use data available in the registry or publicly accessible sources. Fourth, explicit adaptation justifications document how literature approaches transfer to the specific problem context.

9 DISCUSSION AND RELATED WORK

Generalization. We developed agent prompts through iterative refinement, encoding domain-agnostic reasoning patterns (e.g., problem decomposition, constraint evaluation) while deliberately excluding measurement-specific heuristics. Domain expertise resides in the Knowledge Graph and Registry rather than agent logic, enabling transferability through knowledge graph population.
Within the Internet measurement domain, Airavat's architecture generalizes across sub-domains, as demonstrated by our case study (§6.4). We expect Airavat's capabilities to readily generalize across a broader range of sub-problems, including network security analysis and performance debugging. The knowledge graph and agent reasoning patterns transfer directly to these domains by ingesting relevant research papers and domain-specific tools. Beyond Internet measurement, generalization to fundamentally different domains requires domain-specific knowledge graph construction and registry curation. Additionally, measurement-specific components, such as the five complexity dimensions used for query decomposition, may require adaptation for other fields.

Emerging Standards and Scalability. The emergence of agent communication protocols such as the Model Context Protocol (MCP) [16] and the Agent-to-Agent (A2A) protocol [27] presents opportunities for standardizing AI agent interactions with measurement tools. MCP's server-client design could provide unified interfaces for tool interaction, dramatically simplifying registry maintenance through automatic capability discovery and standardized interaction patterns. A2A protocols could formalize communication between Airavat's specialized agents, enabling more robust task delegation and state management. However, realizing these benefits requires widespread protocol adoption across both the measurement tool and AI agent ecosystems. A related challenge is maintaining registry accuracy as tools evolve. Future work could employ specialized LLM agents to automatically analyze codebases and monitor tool repositories, reducing manual maintenance overhead while improving scalability.

Limitations. Despite demonstrated capabilities, Airavat exhibits fundamental limitations. While it can be fully automated for well-known problems, it can only serve as a co-pilot for never-seen-before problems.
The Registry requires manual curation; while the RegistryCurator Agent identifies reusable patterns, all additions require expert validation to maintain quality standards. While the knowledge graph captures a broad slice of measurement literature, it may miss informal best practices. The Verification Engine can only detect methodological flaws documented in existing literature. The system cannot generate validation strategies for workflows requiring long-term longitudinal studies, large-scale infrastructure deployment, or proprietary datasets unavailable in public repositories. Finally, query precision can dramatically impact accuracy, as demonstrated in §7.3, requiring both experts and non-experts to be precise in their interactions.

Multi-agent LLM systems decompose complex problems into specialized subtasks. Prominent frameworks include AutoGen [57], MetaGPT [35], LangGraph [12], and CrewAI [11]. However, multi-agent systems exhibit notable pitfalls including role violations, input conflicts, and incomplete verification [21]. Airavat addresses these challenges through systematic quality assurance mechanisms.

Agentic Workflows in Networked Systems Research. Recent work applies LLMs and agentic AI to networked systems: NetLLM [56] for networking tasks, NetConfEval [52] for configuration, Confucius [54] for network management, and system optimization [24, 34, 48]. However, none address measurement research's unique challenges of workflow generation with systematic verification, which Airavat provides.

AI for Science. AI's role in scientific discovery spans multiple fields and autonomy levels [32, 38, 59, 61], from passive assistance to fully autonomous research. Achieving autonomous scientific discovery requires advances across problem identification, hypothesis formulation, experiment design, execution, analysis, and iterative refinement.
Our work contributes by developing agentic workflows for network measurement research that demonstrate how multi-LLM collaboration can address domain-specific scientific challenges while maintaining interpretability and human oversight.

10 CONCLUSION

Internet measurement research requires sophisticated tool integration and rigorous validation. Airavat demonstrates how agentic AI systems can automate both workflow generation and literature-grounded verification through specialized agents and engines operating on a knowledge graph. Our evaluation shows that agentic systems generate expert-level workflows and identify methodological flaws missed by standard execution-based testing. We view Airavat as a force multiplier for experts and a scaffolding tool for non-specialists. Looking forward, we envision Airavat enabling a new mode of measurement research, where hypotheses, workflows, verification strategies, and validation plans co-evolve interactively, lowering the barrier to rigorous Internet measurement while preserving the methodological discipline built over decades of community effort.

REFERENCES

[1] 2008-2026. GROBID. https://github.com/kermitt2/grobid. (2008-2026).
[2] 2022. AAE-1 cable cut causes widespread outages in Europe, East Africa, Middle East, and South Asia - DCD. https://www.datacenterdynamics.com/en/news/aae-1-cable-cut-causes-widespread-outages-in-europe-east-africa-middle-east-and-south-asia/. (2022).
[3] 2022. Falcon Cable Fault Believed To Be From Air Strike. https://subtelforum.com/falcon-cable-fault-believed-to-be-from-air-strike/. (2022).
[4] 2025. BGP.Tools. https://bgp.tools. (2025).
[5] 2025. NetBlocks. https://netblocks.org. (2025).
[6] 2025. IODA. https://ioda.inetintel.cc.gatech.edu. (2025).
[7] 2025. Route Views: University of Oregon Route Views Project.
https://www.routeviews.org/routeviews/. (2025).
[8] 2025. Routing Information Service (RIS) — ripe.net. https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/. (2025).
[9] 2025. Submarine Cable Map. https://www.submarinecablemap.com/. (2025).
[10] 2025. Worldwide Overview | Cloudflare Radar — radar.cloudflare.com. https://radar.cloudflare.com. (2025).
[11] 2026. CrewAI. https://www.crewai.com/. (2026).
[12] 2026. LangGraph: Agent Orchestration Framework. https://www.langchain.com/langgraph. (2026).
[13] Facebook AI. 2020. bart-large-mnli. https://huggingface.co/facebook/bart-large-mnli. (2020).
[14] Bahaa Al-Musawi, Philip Branch, and Grenville Armitage. 2016. BGP Anomaly Detection Techniques: A Survey. IEEE Communications Surveys & Tutorials 19, 1 (2016), 377–396.
[15] Thomas Alfroy, Thomas Holterbach, Thomas Krenc, KC Claffy, and Cristel Pelsser. 2024. The Next Generation of BGP Data Collection Platforms. In Proceedings of the ACM SIGCOMM 2024 Conference. 794–812.
[16] Anthropic. 2024. Model Context Protocol (MCP). https://modelcontextprotocol.io/docs/getting-started/intro. (2024).
[17] Anthropic. 2025. Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5. (2025).
[18] Anthropic. 2025. Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5. (2025).
[19] Brice Augustin, Xavier Cuvellier, Benjamin Orgogozo, Fabien Viger, Timur Friedman, Matthieu Latapy, Clémence Magnien, and Renata Teixeira. 2006. Avoiding traceroute anomalies with Paris traceroute. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement. 153–158.
[20] Vaibhav Bajpai and Jürgen Schönwälder. 2015. A survey on internet performance measurement platforms and related standardization efforts. IEEE Communications Surveys & Tutorials 17, 3 (2015), 1313–1341.
[21] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A.
Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. 2025. Why Do Multi-Agent LLM Systems Fail? (2025). arXiv:cs.AI/2503.13657 https://arxiv.org/abs/2503.13657
[22] Balakrishnan Chandrasekaran, Georgios Smaragdakis, Arthur Berger, Matthew Luckie, and Keung-Chi Ng. 2015. A server-to-server view of the Internet. In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies. 1–13.
[23] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2025. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. (2025). arXiv:cs.CL/2402.03216 https://arxiv.org/abs/2402.03216
[24] Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica. 2025. Barbarians at the Gate: How AI is Upending Systems Research. (2025). arXiv:cs.AI/2510.06189 https://arxiv.org/abs/2510.06189
[25] Ramakrishnan Durairajan, Joel Sommers, and Paul Barford. 2014. Layer 1-informed Internet Topology Measurement (IMC '14). Association for Computing Machinery, New York, NY, USA, 381–394. https://doi.org/10.1145/2663716.2663737
[26] Nick Feamster and Hari Balakrishnan. 2005. Detecting BGP configuration faults with static analysis. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2. 43–56.
[27] Google. 2024. Agent-to-Agent (A2A) Protocol. https://github.com/google/A2A. (2024).
[28] Google. 2025. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3#temperature. (2025).
[29] Google. 2025. Gemini 3 Developer Guide. https://ai.google.dev/gemini-api/docs/gemini-3#prompting_best_practices. (2025).
[30] Google DeepMind.
2025. Gemini 3 Flash. https://deepmind.google/models/gemini/flash/. (2025).
[31] Google DeepMind. 2025. Gemini 3 Pro. https://deepmind.google/models/gemini/pro/. (2025).
[32] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, and Vivek Natarajan. 2025. Towards an AI co-scientist. (2025). arXiv:cs.AI/2502.18864 https://arxiv.org/abs/2502.18864
[33] Deepak Gouda, Alberto Dainotti, and Cecilia Testart. 2025. Prefix2Org: Mapping BGP Prefixes to Organizations (IMC '25). Association for Computing Machinery, New York, NY, USA, 397–414. https://doi.org/10.1145/3730567.3764485
[34] Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, and Hari Balakrishnan. 2025. Glia: A Human-Inspired AI for Automated Systems Design and Optimization. (2025). arXiv:cs.AI/2510.27176 https://arxiv.org/abs/2510.27176
[35] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. (2024). arXiv:cs.AI/2308.00352 https://arxiv.org/abs/2308.00352
[36] Simon Knight, Hung X Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew Roughan. 2011. The internet topology zoo. IEEE Journal on Selected Areas in Communications 29, 9 (2011), 1765–1775.
[37] Rupa Krishnan, Harsha V Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, and Jie Gao. 2009. Moving beyond end-to-end path information to optimize CDN performance. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement. 190–201.
[38] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. (2024). arXiv:cs.AI/2408.06292 https://arxiv.org/abs/2408.06292
[39] Zhuoqing Morley Mao, Jennifer Rexford, Jia Wang, and Randy H Katz. 2003. Towards an accurate AS-level traceroute tool. In Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications. 365–378.
[40] Neo4j, Inc. 2026. Neo4j Graph Database & Analytics Platform. https://neo4j.com/. (2026).
[41] Chiara Orsini, Alistair King, Danilo Giordano, Vasileios Giotsas, and Alberto Dainotti. 2016. BGPStream: A Software Framework for Live and Historical BGP Data Analysis. In Proceedings of the 2016 Internet Measurement Conference (IMC '16). Association for Computing Machinery, New York, NY, USA, 429–444. https://doi.org/10.1145/2987443.2987482
[42] Tongqing Qiu, Zihui Ge, Dan Pei, Jia Wang, and Jun Xu. 2010. What happened in my network: mining network events from router syslogs (IMC '10). Association for Computing Machinery, New York, NY, USA, 472–484. https://doi.org/10.1145/1879141.1879202
[43] Alagappan Ramanathan and Sangeetha Abdu Jyothi. 2023. Nautilus: A Framework for Cross-Layer Cartography of Submarine Cables and IP Links. Proc. ACM Meas. Anal. Comput. Syst. 7, 3, Article 46 (Dec. 2023), 34 pages. https://doi.org/10.1145/3626777
[44] Alagappan Ramanathan, Eunju Kang, Dongsu Han, and Sangeetha Abdu Jyothi. 2025. Towards an Agentic Workflow for Internet Measurement Research (HotNets '25). Association for Computing Machinery, New York, NY, USA, 61–68.
https://doi.org/10.1145/3772356.3772409
[45] Alagappan Ramanathan, Rishika Sankaran, and Sangeetha Abdu Jyothi. 2024. Xaminer: An Internet Cross-Layer Resilience Analysis Tool. Proc. ACM Meas. Anal. Comput. Syst. 8, 1, Article 16 (Feb. 2024), 37 pages. https://doi.org/10.1145/3639042
[46] Justin Raynor, Tarik Crnovrsanin, Sara Di Bartolomeo, Laura South, David Saffo, and Cody Dunne. 2022. The state of the art in BGP visualization tools: A mapping of visualization techniques to cyberattack types. IEEE Transactions on Visualization and Computer Graphics 29, 1 (2022), 1059–1069.
[47] Nils Reimers and Iryna Gurevych. 2021. all-mpnet-base-v2. https://huggingface.co/sentence-transformers/all-mpnet-base-v2. (2021).
[48] Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bianchini, Aditya Akella, Zhangyang Wang, Mattan Erez, and Esha Choukse. 2025. Sherlock: Reliable and Efficient Agentic Workflow Execution. (2025). arXiv:cs.MA/2511.00330 https://arxiv.org/abs/2511.00330
[49] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2024. Code Llama: Open Foundation Models for Code. (2024). arXiv:cs.CL/2308.12950 https://arxiv.org/abs/2308.12950
[50] Kevin Vermeulen, Ege Gurmericliler, Italo Cunha, David Choffnes, and Ethan Katz-Bassett. 2022. Internet scale reverse traceroute. In Proceedings of the 22nd ACM Internet Measurement Conference. 694–715.
[51] Kevin Vermeulen, Stephen D Strowes, Olivier Fourmaux, and Timur Friedman. 2018. Multilevel MDA-lite Paris traceroute. In Proceedings of the Internet Measurement Conference 2018. 29–42.
[52] Changjie Wang, Mariano Scazzariello, Alireza Farshin, Simone Ferlin, Dejan Kostić, and Marco Chiesa. 2024. NetConfEval: Can LLMs Facilitate Network Configuration? Proc. ACM Netw. 2, CoNEXT2, Article 7 (June 2024), 25 pages. https://doi.org/10.1145/3656296
[53] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (March 2024). https://doi.org/10.1007/s11704-024-40231-1
[54] Zhaodong Wang, Samuel Lin, Guanqing Yan, Soudeh Ghorbani, Minlan Yu, Jiawei Zhou, Nathan Hu, Lopa Baruah, Sam Peters, Srikanth Kamath, Jerry Yang, and Ying Zhang. 2025. Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework. In Proceedings of the ACM SIGCOMM 2025 Conference (SIGCOMM '25). Association for Computing Machinery, New York, NY, USA, 347–362. https://doi.org/10.1145/3718958.3750537
[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (2023). arXiv:cs.CL/2201.11903 https://arxiv.org/abs/2201.11903
[56] Duo Wu, Xianda Wang, Yaqi Qiao, Zhi Wang, Junchen Jiang, Shuguang Cui, and Fangxin Wang. 2024. NetLLM: Adapting Large Language Models for Networking. In Proceedings of the ACM SIGCOMM 2024 Conference (ACM SIGCOMM '24). Association for Computing Machinery, New York, NY, USA, 661–678. https://doi.org/10.1145/3651890.3672268
[57] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. (2023).
arXiv:cs.AI/2308.08155
[58] Ning Xia, Han Hee Song, Yong Liao, Marios Iliofotou, Antonio Nucci, Zhi-Li Zhang, and Aleksandar Kuzmanovic. 2013. Mosaic: quantifying privacy leakage in mobile networks. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 279–290. https://doi.org/10.1145/2534169.2486008
[59] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. 2025. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. (2025). arXiv:cs.AI/2504.08066 https://arxiv.org/abs/2504.08066
[60] Beichuan Zhang, Raymond Liu, Daniel Massey, and Lixia Zhang. 2005. Collecting the Internet AS-level topology. ACM SIGCOMM Computer Communication Review 35, 1 (2005), 53–61.
[61] Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. 2025. From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery. (2025). arXiv:cs.CL/2505.13259 https://arxiv.org/abs/2505.13259
[62] Yaping Zhu, Benjamin Helsley, Jennifer Rexford, Aspi Siganporia, and Sridhar Srinivasan. 2012. LatLong: Diagnosing wide-area latency changes for CDNs. IEEE Transactions on Network and Service Management 9, 3 (2012), 333–345.

A COST ANALYSIS
All agents in Airavat's experiments use Claude models (Sonnet 4.5 and Opus 4.5), with costs estimated per execution in Table 2. The Multi-Agent Workflow Generation pipeline comprises four agents with varying computational requirements. Agent 1 (QueryMind) costs $0.06-$0.15 per run for problem characterization and knowledge graph query formulation. Agent 2 (WorkflowScout) costs $0.25-$0.60 per run for workflow design generation, reflecting the more complex reasoning required to synthesize measurement literature into concrete specifications.
Agent 3 (SolutionWeaver) costs $0.35-$0.60 per run for code generation from workflow specifications. Agent 4 (RegistryCurator) costs $0.20-$0.40 per run when evaluating generated workflows against literature-derived quality criteria.

The Verification Engine adds substantial computational overhead for quality assurance. The Synthesizer agent costs $0.50-$0.70 per run in standard mode, or $1.00-$1.50 when performing best-approach comparison with the synthesized workflow. Generating 12 workflow variants (4 models × 3 temperatures) for comprehensive model comparison costs approximately $2.50-$3.00 per run. The Validation Engine comprises two agents: the Strategizer costs $0.16-$0.30 per run for generating validation strategies from measurement literature, while the CodeGenerator costs $0.40-$0.60 per run for translating strategies into executable validation code.

Estimating total costs for all case studies presented in the paper yields a modest overall expenditure. Case Studies 1-3 each required one standard workflow generation run, totaling approximately $3.75 (3 × $1.25 midpoint). Case Study 4 employed comprehensive verification with 3 independent runs of 12 workflow variants each, costing approximately $17.25 (3 × $5.75 midpoint). The Validation Engine demonstration required strategy generation and code generation, adding approximately $0.45. The Knowledge Graph extraction evaluation used only local LLMs (Llama-3.1-8B and Llama-3.3-70B), incurring negligible cloud API costs. Total estimated expenditure for all experiments presented is approximately $21.50, demonstrating the cost-effectiveness of the approach for research-scale evaluation.

Cost optimization represents a significant opportunity for future work. Very little optimization was performed in the current implementation—intermediate outputs from earlier pipeline stages were often passed in their entirety to subsequent agents rather than extracting only relevant content.
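The case-study totals above follow from simple midpoint arithmetic over the ranges in Table 2; as a quick sanity check (all figures are taken directly from the text, not from new measurements):

```python
# Reproduce the total-cost estimate quoted above from Table 2 midpoints.
standard_workflow_mid = (0.80 + 1.70) / 2  # $1.25 per standard run (Agents 1-3)
full_evaluation_mid = (4.50 + 7.00) / 2    # $5.75 per comprehensive run

case_studies_1_to_3 = 3 * standard_workflow_mid  # one standard run each -> $3.75
case_study_4 = 3 * full_evaluation_mid           # three comprehensive runs -> $17.25
validation_demo = 0.45                           # strategy + validation code, as reported

total = case_studies_1_to_3 + case_study_4 + validation_demo
print(f"${total:.2f}")  # $21.45, i.e. the ~$21.50 quoted in the text
```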
For example, Agent 2 (WorkflowScout) receives complete Agent 1 (QueryMind) results rather than filtered excerpts. Selective content extraction could reduce token consumption by an estimated 30-40%, lowering per-run costs to $0.50-$1.00 for standard workflows and $3.00-$4.50 for full evaluation runs. Additional optimizations include caching repeated knowledge graph queries, compressing intermediate representations, and employing smaller models for routine validation checks while reserving powerful models for complex reasoning tasks. These improvements would enhance system affordability while maintaining generation quality, making Airavat more accessible for broader research community adoption.

B KNOWLEDGE GRAPH EXTRACTION DETAILS AND QUALITY
Airavat constructs a Neo4j knowledge graph encoding measurement domain knowledge through semantic embeddings and typed relationships. The entity model includes ten entity types: Papers, Problems, ResearchGaps, Approaches, PipelineSteps, Algorithms, Metrics, Parameters, Datasets, and Validations. Entities connect through typed relationships: Papers PROPOSE Approaches that SOLVE Problems; Approaches USE_DATASET and are VALIDATED_BY validation methodologies; Approaches contain ordered PIPELINE_STEP sequences that USE_ALGORITHM; IMPROVES_UPON relationships capture methodological evolution. This schema enables similarity-based retrieval, relationship traversal for methodology evolution, and constraint-based filtering. Airavat uses MD5 hashing for deduplication and BGE-M3 embeddings for semantic search.

To evaluate the extraction quality of our local LLM approach (Llama-3.1-8B and Llama-3.3-70B), we conducted a validation study using Claude Sonnet 4.5 as an evaluator.
We selected seven representative papers covering various Internet measurement topics (submarine cable mapping, network resilience analysis, routing communities, broadband availability, third-party dependencies, and IPv6 allocation) and compared the extraction outputs from our local LLM pipeline against what Sonnet itself would have generated. We evaluated overlap across the same five key extraction categories: problem statement, methodology, datasets, baselines, and validations. Across the seven papers, the local LLM extraction achieved an average aggregate overlap of 87.3% with Sonnet's extraction, with individual paper scores ranging from 72% to 94%. This demonstrates that cost-effective local LLM extraction achieves performance comparable to expensive cloud-based models for populating the knowledge graph, validating our design choice to use local models for large-scale paper processing while reserving cloud LLMs for the agent pipeline's reasoning tasks.

C KNOWLEDGE GRAPH EXTRACTION EXAMPLE
We demonstrate the extraction pipeline's output using the Nautilus paper [43] as a representative example (Figures 4, 5).

Component                 | Cost per Run  | Notes
Agent 1 (QueryMind)       | $0.06 - $0.15 | Query decomposition
Agent 2 (WorkflowScout)   | $0.25 - $0.60 | Workflow specification
Agent 3 (SolutionWeaver)  | $0.35 - $0.60 | Code generation
Agent 4 (RegistryCurator) | $0.20 - $0.40 | Quality assessment
Standard Workflow         | $0.80 - $1.70 | Agents 1-3 only
Verification Synthesizer  | $0.50 - $0.70 | Standard mode
Verification Synthesizer  | $1.00 - $1.50 | With comparison
12-Variant Generation     | $2.50 - $3.00 | Model comparison
Validation Strategizer    | $0.16 - $0.30 | Strategy generation
Validation CodeGenerator  | $0.40 - $0.60 | Validation code
Full Evaluation           | $4.50 - $7.00 | 12 variants + verification + validation
Table 2: Cost Analysis Summary

The extraction captures five key categories from measurement papers.
Note that the following is a condensed version for illustration purposes—the actual extraction contains more comprehensive details for each section.

Extraction Structure. The knowledge graph extraction system processes papers to extract structured information across five primary categories: Problem Statement, Methodology, Datasets, Baseline Comparisons, and Validations.

Extraction Characteristics. The condensed example in Figures 4, 5 illustrates the structured extraction approach applied across all papers in the knowledge graph. The full extraction contains comprehensive details, including all 10 methodology steps, 19 data sources with detailed characteristics, complete baseline comparisons, and extensive validation experiments. The extraction enables the knowledge graph to answer queries like "What clustering algorithms are used for geolocation?" (DBSCAN with ε=20km), "What validation strategies were employed?" (cable failures, targeted measurements, operator maps), and "What were the key parameters?" (SoL threshold=0.05, radius=500km, weights 5:4:1).

D KNOWLEDGE GRAPH CHARACTERISTICS
The knowledge graph constructed for Airavat aggregates structured information from Internet measurement research papers. The graph represents a comprehensive corpus of measurement methodologies, techniques, and validation approaches extracted from published literature. Below, we provide the scale and composition of the knowledge graph to demonstrate its breadth and depth for supporting workflow generation.

The knowledge graph contains 2,021 research papers processed through the extraction pipeline. From these papers, the system extracted 1,944 distinct problem statements characterizing measurement challenges, and 3,767 unique approaches describing solution methodologies. The methodology extraction identified 13,813 pipeline steps representing the granular procedural decomposition of measurement workflows.
Technical components include 2,956 algorithms (clustering methods, graph traversal techniques, optimization procedures), 7,018 datasets and data sources (measurement platforms, geolocation services, routing databases), and 4,483 parameters (thresholds, weights, configuration values). Validation information comprises 722 validation strategies and 1,016 evaluation metrics extracted from experimental sections.

The knowledge graph maintains 65,250 relationships connecting these entities, enabling traversal between related concepts. These relationships link problems to applicable approaches, approaches to constituent pipeline steps, steps to required algorithms and datasets, and methodologies to validation strategies. The relationship structure enables queries like "What validation strategies were used for geolocation-based approaches?" or "What datasets are required for cross-layer mapping problems?" to return contextually relevant results grounded in measurement literature. This scale demonstrates that the knowledge graph provides substantial coverage of Internet measurement research, capturing diverse problem domains (infrastructure resilience, prefix mapping, network topology inference), methodological techniques (geolocation clustering, BGP analysis, traceroute processing), and validation practices (cable failure analysis, targeted measurements, ground truth comparison). The comprehensive representation enables Airavat agents to reason about measurement problems through literature-grounded context rather than relying solely on LLM parametric knowledge.
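The relationship traversal described above can be illustrated with a minimal in-memory sketch. The edge labels follow the Appendix B schema (PROPOSE, SOLVE, USE_ALGORITHM, VALIDATED_BY); the concrete triples are illustrative placeholders drawn from the Nautilus example, not rows from Airavat's actual Neo4j graph:

```python
# Toy triple store mirroring the knowledge graph's typed relationships.
triples = [
    ("Nautilus", "PROPOSE", "geolocation-based cable mapping"),
    ("geolocation-based cable mapping", "SOLVE", "IP-link-to-cable mapping"),
    ("geolocation-based cable mapping", "USE_ALGORITHM", "DBSCAN"),
    ("geolocation-based cable mapping", "VALIDATED_BY", "cable failure analysis"),
    ("geolocation-based cable mapping", "VALIDATED_BY", "targeted traceroutes"),
]

def traverse(subject, relation):
    """Return all objects reachable from `subject` via `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "What validation strategies were used for geolocation-based approaches?"
print(traverse("geolocation-based cable mapping", "VALIDATED_BY"))
# -> ['cable failure analysis', 'targeted traceroutes']
```

In the real system the same query runs against Neo4j, but the traversal semantics are the same: follow one typed edge from a matched entity.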
"paper_title": "Nautilus: Framework for Cross-Layer Cartography of Submarine Cables and IP Links",
"extractions": {
  "Problem Statement": {
    "problem_statement": "Mapping IP links to submarine cables accurately",
    "problem_details": {
      "what_is_lacking": "Existing approaches lack accuracy and are coarse-grained",
      "why_its_challenging": "Geolocation inaccuracies, incomplete ownership information, complex topology",
      "scope": "At Internet scale, for critical infrastructure"
    },
    "research_gaps": ["Accurate cross-layer mapping", "Handling geolocation uncertainties and incomplete data"]
  },
  "Methodology": {
    "approach_overview": "Nautilus uses publicly available datasets and techniques to generate IP link to submarine cable mapping with confidence scores.",
    "pipeline": [
      ...
      {
        "step_number": 2,
        "step_name": "Geolocation Module",
        "description": "Collect and aggregate geolocation from eleven services, classify IP links",
        "algorithms_used": ["DBSCAN"],
        "parameters": {"minPoints": "1", "epsilon": "20km"}
      },
      ...
      {
        "step_number": 10,
        "step_name": "Aggregation & Final Mapping",
        "description": "Combine geolocation and cable owner outputs",
        "parameters": {"weightage": "0.5 geolocation, 0.4 distance, 0.1 ownership"}
      }
    ],
    "algorithms": [
      {
        "name": "DBSCAN",
        "purpose": "Clustering geolocations based on density",
        "parameters": {"minPoints": "1", "epsilon": "20km"}
      }
    ]
  },

Figure 4: A representative subset of the extraction for the Nautilus paper on submarine cable mapping. Continued in Figure 5.

  "Datasets or Data Sources": {
    "data_sources": [
      {
        "name": "RIPE Atlas",
        "what_it_contains": "Traceroute data",
        "how_used": "Collecting traceroutes for IP link extraction",
        "characteristics": {"size": "~120M traceroutes"}
      },
      {
        "name": "Telegeography map",
        "what_it_contains": "Submarine cable information",
        "how_used": "Mapping IP links to submarine cables",
        "characteristics": {"size": "~480 submarine cables"}
      }
    ],
    ...
  },
  "Baseline Comparisons": {
    "baselines": [
      {
        "baseline_name": "SCN-Crit",
        "baseline_approach": "Uses drivability metric, maps cables at country level",
        "comparison_metrics": [
          {"metric": "Number of cables predicted per link",
           "improvement": "Nautilus predicts 35% fewer cables per link"}
        ],
        "baseline_limitations": ["Maps cables at country level", "Conservative approach misses potential submarine paths"]
      }
    ]
  },
  "Validations": {
    "validations": [
      ...
      {
        "validation_name": "Submarine Cable Failures",
        "methodology": "Analyze IP link disappearance during documented cable failures",
        "specific_examples": [
          {"example_name": "Yemen Outage Analysis",
           "results": {"disappearance_of_links": "106 links disappeared"}}
        ]
      },
      {
        "validation_name": "Targeted Traceroutes",
        "methodology": "Targeted measurements between RIPE probes near cable landing points",
        "results": {"match_rate": "77% top prediction match"}
      },
      ...
    ]
  }
}

Figure 5: Continued from Figure 4 – The representative subset of the extraction for the Nautilus paper on submarine cable mapping.
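Records like the one in Figures 4 and 5 are plain structured data, so the queries listed in Appendix C reduce to simple lookups. A minimal sketch over a hand-copied fragment of the Nautilus extraction (field names follow the figures; the loading code is illustrative, not Airavat's ingestion pipeline):

```python
import json

# A fragment of the Figure 4 extraction, re-typed as JSON.
record = json.loads("""
{
  "Methodology": {
    "algorithms": [
      {"name": "DBSCAN",
       "purpose": "Clustering geolocations based on density",
       "parameters": {"minPoints": "1", "epsilon": "20km"}}
    ]
  }
}
""")

# "What clustering algorithms are used, and with what parameters?"
for algo in record["Methodology"]["algorithms"]:
    print(algo["name"], algo["parameters"])
# -> DBSCAN {'minPoints': '1', 'epsilon': '20km'}
```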
"query_summary": "Concise 1-2 sentence summary of what user is asking",
"complexity_assessment": {{
  "temporal": {{"answer": "yes/no", "reasoning": "Explanation"}},
  "spatial": {{"answer": "yes/no", "reasoning": "Explanation"}},
  "causal": {{"answer": "yes/no", "reasoning": "Explanation"}},
  "stakeholder": {{"answer": "yes/no", "reasoning": "Explanation"}},
  "data": {{"answer": "yes/no", "reasoning": "Explanation"}},
  "score": 0,
  "tier": "simple/moderate/complex",
  "rationale": "Overall complexity justification",
  "implications_for_agent2": "What this complexity means for solution design."
}},
"sub_problems": [{{
  "id": "SP1",
  "description": "Detailed sub-problem description",
  "dependencies": ["SP2", "SP3"],
  "priority": "high/medium/low",
  "estimated_difficulty": "Description of difficulty"
}}],
"constraints": {{
  "technical": ["constraint1", "constraint2"],
  "data": ["constraint1", "constraint2"],
  "methodological": ["constraint1", "constraint2"],
  "temporal": ["constraint1", "constraint2"]
}},
"success_criteria": {{
  "primary": "Main success criterion",
  "secondary": ["criterion1", "criterion2"],
  "validation_approach": "How to validate success"
}},
"risks": [{{
  "risk": "Description of risk",
  "likelihood": "high/medium/low",
  "severity": "high/medium/low",
  "mitigation": "How to mitigate"
}}],
"registry_mapping": {{
  "relevant_functions": [{{
    "function_name": "name_from_registry",
    "purpose": "What it provides for this query",
    "sub_problems_addressed": ["SP1", "SP2"]
  }}],
  "integration_points": ["How functions connect"],
  "gaps": ["What registry doesn't provide"]
}},
"recommendations_for_designer": ["Recommendation 1 for Agent 2", "Recommendation 2 for Agent 2"]

Figure 6: Output of QueryMind follows this schema to examine complexities of subproblems, evaluate constraints, define success
criteria, and map to the relevant registry to provide guidance to WorkflowScout.

Case Study        | Registry Functions by System
Cable Impact      | nautilus_system: (1) get_nautilus_link_to_cable_mapping (2) get_lp_id_to_country_dict (3) get_ip_to_geolocation_mappings (4) get_ip_to_asn_mappings
Disaster Impact   | nautilus_system: (1) get_nautilus_link_to_cable_mapping (2) get_lp_id_to_country_dict (3) get_ip_to_geolocation_mappings (4) get_ip_to_asn_mappings; xaminer_system: (5) generate_cable_segments_to_all_info_map (6) generate_cable_segment_to_country_as_maps (7) process_single_event
Cascading Effects | nautilus_system: (1) get_nautilus_link_to_cable_mapping (2) get_lp_id_to_country_dict (3) get_ip_to_geolocation_mappings (4) get_ip_to_asn_mappings; xaminer_system: (5) generate_cable_segments_to_all_info_map (6) generate_cable_segment_to_country_as_maps (7) process_single_event; submarine_system: (8) get_country_to_cable_graph; as_dependency_system: (9) get_as_dependency_graph
Prefix2Org        | bgp_system: (1) download_bgp_dumps; whois_system: (2) parse_whois_dump; rpki_system: (3) get_rpki_snapshot; as2org_system: (4) get_as2org_mappings
Table 3: Registry functions utilized across case studies.

E WORKFLOW COMPOSITION PATTERNS
WorkflowScout employs four method composition patterns to eliminate the need for manual parameter calibration while maintaining workflow robustness. The MERGE pattern combines multiple measurement methods algorithmically—for example, running two geolocation APIs and merging results through intersection (for high confidence) or union (for coverage)—thereby achieving robustness through methodological diversity. The AUTO_CALIBRATE pattern derives parameters automatically from dataset distribution statistics rather than requiring manual tuning, such as calculating thresholds based on the data's statistical properties.
The DERIVED pattern extracts parameter values from registry specifications or problem requirements (e.g., timeout values from registry specs, scale from problem constraints), ensuring compatibility and feasibility. Finally, the CONSERVATIVE_DEFAULT pattern applies loose bounds with explicitly documented trade-offs, such as using wide thresholds while documenting their precision-recall implications, which prevents premature data filtering while making the trade-offs transparent. Together, these patterns enable Airavat to generate measurement workflows that avoid the brittle manual parameter tuning typical of traditional approaches while maintaining methodological soundness.

Parameter                        Value  Component            Purpose
Novelty threshold (high)         0.7    Evaluator - Stage 2  Problems above are classified as novel
Novelty threshold (low)          0.4    Evaluator - Stage 2  Problems below are classified as well-studied
Infeasibility threshold          60     Evaluator - Stage 2  Sub-dimension scores below this trigger rejection
Excellence threshold             85     Selector             Workflows above are approved directly
Good range lower bound           80     Selector             Workflows below always require synthesis
Structural diversity threshold   0.4    Selector             Triggers complementarity analysis
Weight adjustment bound          ±0.2   Evaluator - Stage 3  Maximum adjustment per dimension

Table 4: Verification Engine parameters.

Figure 7: Europe-Asia Cable Failure Cascade Analysis Workflow (Data Input Layer)

F SAMPLE INPUTS AND OUTPUTS

• Figures 4 and 5 show a sample knowledge graph output for a single paper.
• Figure 6 shows the QueryMind schema.
• Table 3 shows the registry functions used by Airavat's automated workflows across the various case studies.
• Figures 7, 8, 9, and 10 show the output workflow generated for the cascading failure analysis.

G ARCHITECTURE DETAILS

Table 4 shows the Verification Engine parameters.
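To make these parameters concrete, the following hypothetical sketch (invented for illustration; not Airavat's actual code) shows how the Evaluator and Selector thresholds from Table 4 could gate decisions:

```python
# Threshold values taken from Table 4; the surrounding control flow
# is illustrative only.
NOVELTY_HIGH = 0.7          # Evaluator Stage 2: above this, problem is novel
NOVELTY_LOW = 0.4           # Evaluator Stage 2: below this, well-studied
INFEASIBILITY_CUTOFF = 60   # Evaluator Stage 2: any sub-score below rejects
EXCELLENCE = 85             # Selector: at or above, approve directly
GOOD_LOWER_BOUND = 80       # Selector: below, always synthesize

def classify_novelty(score):
    # Scores between the two novelty thresholds fall in an
    # intermediate band that needs further evaluation.
    if score > NOVELTY_HIGH:
        return "novel"
    if score < NOVELTY_LOW:
        return "well-studied"
    return "intermediate"

def select_workflow(overall, sub_scores):
    if any(s < INFEASIBILITY_CUTOFF for s in sub_scores):
        return "reject"      # an infeasible sub-dimension triggers rejection
    if overall >= EXCELLENCE:
        return "approve"     # excellent workflows are approved directly
    if overall < GOOD_LOWER_BOUND:
        return "synthesize"  # below the good range: requires synthesis
    return "analyze"         # good range (80-85): needs further comparison
```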
Figure 8: Europe-Asia Cable Failure Cascade Analysis Workflow (Core Processing Layer)

Figure 9: Europe-Asia Cable Failure Cascade Analysis Workflow (Integration & Cascade Layer)

Figure 10: Europe-Asia Cable Failure Cascade Analysis Workflow (Output Reporting Layer)
