SAGAI-MID: A Generative AI-Driven Middleware for Dynamic Runtime Interoperability
Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam
Oliver Aleksander Larsen
SDU Software Engineering, University of Southern Denmark, Odense, Denmark
olar@mmmi.sdu.dk

Mahyar T. Moghaddam
SDU Software Engineering, University of Southern Denmark, Odense, Denmark
mtmo@mmmi.sdu.dk

Abstract—Modern distributed systems integrate heterogeneous services: REST APIs with different schema versions, GraphQL endpoints, and IoT devices with proprietary payloads that suffer from persistent schema mismatches. Traditional static adapters require manual coding for every schema pair and cannot handle novel combinations at runtime. We present SAGAI-MID, a FastAPI-based middleware that uses large language models (LLMs) to dynamically detect and resolve schema mismatches at runtime. The system employs a five-layer pipeline: hybrid detection (structural diff plus LLM semantic analysis), dual resolution strategies (per-request LLM transformation and LLM-generated reusable adapter code), and a three-tier safeguard stack (validation, ensemble voting, rule-based fallback). We frame the architecture through Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities. We evaluate SAGAI-MID on 10 interoperability scenarios spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion across six LLMs from two providers. The best-performing configuration achieves 0.90 pass@1 accuracy. The CODEGEN strategy consistently outperforms DIRECT (0.83 vs. 0.77 mean pass@1), while cost varies by over 30× across models with no proportional accuracy gain; the most accurate model is also the cheapest. We discuss implications for software architects adopting LLMs as runtime architectural components.
Index Terms—software architecture, interoperability, large language models, generative AI, middleware, schema matching, runtime adaptation

© 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Accepted at SAGAI 2026, co-located with IEEE ICSA 2026.

I. Introduction

Modern distributed systems increasingly integrate heterogeneous services: REST APIs with diverging schema versions, GraphQL endpoints, and IoT/MQTT devices with proprietary payloads. Schema mismatches—differences in field naming conventions (e.g., camelCase vs. snake_case), data types (ISO 8601 timestamps vs. Unix epochs), measurement units (Celsius vs. Fahrenheit), and nesting structures (flat vs. nested JSON)—are a persistent interoperability challenge. Lercher et al. [9] find that microservice API evolution is a major maintenance burden, with loose coupling providing no immediate feedback on breaking changes.

Traditional approaches rely on static adapters, XSLT transforms, or configuration-based mapping engines. These solutions require manual coding for every new schema pair and cannot handle novel combinations at runtime. Bass et al.'s interoperability tactics [1] provide a well-established vocabulary for such adaptation, but remain traditionally bound to design-time implementation.

Recent work demonstrates that LLMs can generate schema mappings and transformation code with promising accuracy. Falcao et al. [2] show that LLM-based DIRECT and CODEGEN strategies achieve up to 93% pass@1 on interoperability tasks, while Parciak et al.
[4] find that majority voting reduces LLM hallucination in schema matching. However, no prior work 1) frames LLM middleware through established architectural tactics, 2) treats safeguards against LLM non-determinism as first-class architectural concerns, 3) evaluates across multiple protocols (REST, GraphQL, IoT), or 4) provides a pluggable middleware integration pattern.

This paper presents SAGAI-MID, a generative AI-driven middleware for dynamic runtime interoperability. Our contributions are:

1) A five-layer middleware architecture that connects LLM-based schema resolution to Bass et al.'s interoperability tactics, transforming them from design-time artifacts into runtime capabilities.
2) A hybrid detection approach combining deterministic structural analysis with LLM-powered semantic analysis for naming and unit mismatches.
3) Two resolution strategies, DIRECT (per-request LLM transformation) and CODEGEN (LLM-generated reusable Python adapters), with hash-keyed caching for deterministic replay.
4) A three-tier safeguard stack (Pydantic validation, ensemble voting, rule-based fallback) addressing LLM non-determinism.
5) Empirical evaluation on 10 diverse interoperability scenarios across six LLMs from two providers, measuring correctness, performance, cost, and safeguard impact.

The implementation and all evaluation artifacts are available as open-source software.¹

The remainder of this paper is structured as follows. Section II discusses related work. Section III presents the SAGAI-MID architecture. Section IV describes the implementation. Section V reports our evaluation results. Section VI discusses findings and limitations. Section VII concludes with future directions.

II. Background & Related Work

A. Interoperability Tactics

Bass, Clements, and Kazman define interoperability as the degree to which two or more systems can usefully exchange meaningful information [1].
Their interoperability tactics provide a vocabulary for architectural decisions: Discover, Tailor Interface, Convert Data, Manage Resources, and Orchestrate. Complementary frameworks include Tolk and Muguira's Levels of Conceptual Interoperability Model (LCIM) [10], which distinguishes seven levels from technical to conceptual interoperability, and Hohpe and Woolf's enterprise integration patterns [11], which catalog messaging-based mediation strategies. The ISO/IEC 25010 quality model identifies interoperability as a key product quality characteristic [7].

These tactics are traditionally implemented as design-time artifacts: coded once, tested, deployed, and brittle to schema evolution. Our key insight is that LLMs can transform these tactics into runtime capabilities, generating the adaptation logic on demand for novel schema combinations without manual coding or redeployment.

B. LLM-Based Schema Matching

Recent work applies LLMs to schema matching and data transformation. Parciak et al. [4] evaluate four prompting strategies for schema matching with GPT-4, achieving F1 = 0.58, and find that majority voting reduces hallucination from 24% to 8%, directly inspiring our ensemble safeguard (Tier 2). Seedat and van der Schaar [5] propose Matchmaker, a self-improving LLM program for schema matching via zero-shot self-improvement. Narayan et al. [12] demonstrate that foundation models can perform data integration tasks (schema matching, entity resolution, data imputation) competitively with task-specific approaches. Peeters et al. [13] show that LLMs achieve state-of-the-art entity matching accuracy with zero-shot prompting.

These works focus on offline matching for dataset integration. We apply similar techniques at runtime within a middleware context, requiring both caching (to amortize LLM costs) and a deterministic fallback path (to guarantee forward progress).
¹https://github.com/Oliver1703dk/sagai-mid

C. LLM Middleware for Interoperability

Most directly related to our work, Falcao et al. [2] evaluate LLM-based interoperability using DIRECT (per-request LLM transformation) and CODEGEN (LLM-generated adapter code) strategies across 13 open-source LLMs, achieving up to 93% pass@1. They show that CODEGEN outperforms DIRECT on average and that model selection matters more than prompting strategy. Their evaluation framework and strategy definitions form the foundation that our resolution engine builds upon.

Guran et al. [3] propose a two-tier LLM middleware architecture with two deployment patterns: LLM-as-Service (the LLM is the primary service endpoint) and LLM-as-Gateway (the LLM mediates between client and backend). SAGAI-MID falls into the LLM-as-Gateway pattern, where the middleware intercepts traffic and applies LLM-driven transformations transparently. However, Guran et al. do not address schema matching specifically, nor do they provide empirical evaluation. Esposito et al. [6] survey generative AI for software architecture, identifying runtime interoperability as an open direction with "high potential but limited empirical evidence." Self-adaptive architectures using the MAPE-K loop [14] have explored runtime adaptation through monitor-analyze-plan-execute cycles, but not LLM-driven schema transformation specifically.

SAGAI-MID builds on Falcao et al.'s strategies but makes three novel contributions: 1) embedding the strategies in a five-layer middleware architecture, 2) adding safeguards against LLM non-determinism as a first-class concern, and 3) connecting to established interoperability tactics. We additionally evaluate across multiple protocols (REST, GraphQL, IoT) and commercial LLM providers, whereas prior work evaluates one protocol with open-source models.

III.
SAGAI-MID Architecture

SAGAI-MID is an ASGI middleware that intercepts HTTP traffic between clients and backend services, dynamically detecting and resolving schema mismatches using LLMs. It is implemented as a five-layer pipeline (Fig. 1) and integrates with existing service architectures via FastAPI's BaseHTTPMiddleware pattern. The middleware is transparent to both clients and backend services: clients send requests in the source schema format, and the middleware transforms them to the target schema before forwarding. We describe each layer and then map the architecture to Bass et al.'s interoperability tactics.

Fig. 1: SAGAI-MID five-layer pipeline architecture (1. Input: route matching & schema lookup; 2. Detection: structural + semantic; 3. Resolution: DIRECT/CODEGEN; 4. Safeguards: validate → ensemble → fallback; 5. Monitoring: latency, cost, accuracy) between the client (source schema) and the backend (target schema). Solid arrows show the data flow; dashed lines indicate LLM calls (layers 2–3) and cache lookups (layer 3). Layers 4–5 are fully deterministic.

A. Input Layer

The input layer intercepts HTTP requests with JSON bodies (POST, PUT, PATCH). A SchemaRegistry maps routes to (source_schema, target_schema) pairs via exact path matching or longest-prefix matching. Each RouteConfig specifies source and target JSON Schemas, service identifiers, and the resolution strategy to use. Non-registered routes pass through unmodified.

B. Detection Module

The detection module uses a hybrid approach combining deterministic structural analysis with LLM-powered semantic analysis.

Structural Detector (deterministic, no LLM call): performs a recursive property walk comparing source and target JSON Schemas, detecting missing/extra fields, type mismatches, nesting mismatches, and cardinality mismatches. It classifies severity (low/medium/high) based on type distance.
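The recursive property walk can be illustrated with a minimal sketch. Function names, mismatch labels, and the severity rule below are our own illustrative assumptions, not the SAGAI-MID implementation:

```python
# Minimal sketch of a deterministic structural diff over two JSON Schemas.
# Names and the severity heuristic are illustrative, not SAGAI-MID's code.

def structural_diff(source: dict, target: dict, path: str = "") -> list:
    """Recursively compare the `properties` of two JSON Schemas."""
    mismatches = []
    src_props = source.get("properties", {})
    tgt_props = target.get("properties", {})
    for name in src_props.keys() - tgt_props.keys():
        mismatches.append({"kind": "field_missing_in_target",
                           "path": f"{path}{name}", "severity": "high"})
    for name in tgt_props.keys() - src_props.keys():
        mismatches.append({"kind": "field_extra_in_target",
                           "path": f"{path}{name}", "severity": "medium"})
    for name in src_props.keys() & tgt_props.keys():
        s, t = src_props[name], tgt_props[name]
        if s.get("type") != t.get("type"):
            # e.g. string ISO timestamp in v1 vs. integer epoch in v2
            sev = "high" if "object" in (s.get("type"), t.get("type")) else "medium"
            mismatches.append({"kind": "type_mismatch",
                               "path": f"{path}{name}", "severity": sev})
        if s.get("type") == "object" and t.get("type") == "object":
            mismatches.extend(structural_diff(s, t, f"{path}{name}."))
    return mismatches

v1 = {"type": "object", "properties": {
    "city": {"type": "string"}, "timestamp": {"type": "string"}}}
v2 = {"type": "object", "properties": {
    "city": {"type": "string"}, "timestamp": {"type": "integer"}}}
print(structural_diff(v1, v2))  # one type_mismatch entry for "timestamp"
```

Because the walk is a pure schema comparison with no LLM call, it is cheap enough to run on every request before any semantic analysis.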
This runs in under 1 ms and always executes first.

Semantic Detector (LLM-powered): sends both schemas to the LLM with a structured output prompt, detecting naming mismatches (e.g., camelCase ↔ snake_case, abbreviations) and unit mismatches (e.g., Celsius ↔ Fahrenheit). It returns a MismatchReport via OpenAI structured outputs (.parse()) with Pydantic response models. If the LLM call fails, structural results remain valid (graceful degradation).

Semantic findings supersede structural findings for overlapping (source_path, target_path) pairs, as they carry richer information. The final merged MismatchReport is deduplicated.

C. Resolution Engine

The resolution engine implements two strategies, following Falcao et al. [2], and embeds them in the middleware pipeline.

TABLE I: Comparison of Resolution Strategies

Property           | DIRECT        | CODEGEN
LLM calls (cold)   | 2/request     | 3 (then 0)
LLM calls (cached) | 1/request     | 0
Deterministic      | No            | Yes (after compile)
Cacheable          | Mapping only  | Mapping + code

DIRECT Strategy: a two-step LLM approach where 1) the LLM generates a SchemaMapping (field-level mappings with confidence scores), and 2) the LLM transforms the source data per-request using the mapping. This is simple and handles arbitrary transformations, but is non-deterministic across calls, slow (two LLM round-trips per request), and expensive.

CODEGEN Strategy: a three-step approach where 1) the LLM generates a SchemaMapping, 2) the LLM generates a Python adapter function (AdapterCode), and 3) the system compiles and validates the generated code via exec(), caching the compiled function. After the first invocation, subsequent requests execute native Python with zero LLM calls. Table I summarizes the key differences.

Caching: the mapping cache uses (SHA-256(source_schema), SHA-256(target_schema)) as key, storing both SchemaMapping and compiled AdapterCode.
A cache hit for CODEGEN skips all LLM calls; for DIRECT, it skips the mapping step but still requires per-request transformation.

D. Safeguard Layer

The safeguard layer is a three-tier pipeline addressing LLM non-determinism:

Tier 1 – Validation: JSON Schema validation (Draft 2020-12) of the transformed output against the target schema, plus Pydantic model validation of the LLM response structure and confidence threshold checking. If valid, the result passes through without further safeguards.

Tier 2 – Ensemble Voting: triggered when validation fails. Runs N parallel LLM calls (default N = 3) for mapping generation, taking a majority vote on (source_field, target_field) pairs. This follows self-consistency decoding [15], where sampling multiple reasoning paths and taking the majority vote improves accuracy, and Parciak et al.'s finding that voting reduces hallucination in schema matching [4].

Tier 3 – Rule-Based Fallback: triggered when the ensemble also fails. A fully deterministic layer with no LLM dependency: difflib similarity matching for field renaming, a registry of known unit conversions (C ↔ F, km/h ↔ mph, meters ↔ feet), type coercion (string → int, ISO 8601 → epoch), and cardinality handling (array ↔ single value). This tier never fails, guaranteeing forward progress with best-effort output.
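The Tier 3 ideas can be sketched entirely with the standard library. This is a best-effort illustration under our own naming; the conversion registry shows only the examples named in the text, not the full SAGAI-MID rule set:

```python
# Sketch of a deterministic rule-based fallback: difflib similarity
# matching for field renaming plus a small unit-conversion registry.
# Function and registry names are illustrative assumptions.
import difflib

UNIT_CONVERSIONS = {
    ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
    ("kmh", "mph"): lambda v: v * 0.621371,
    ("meters", "feet"): lambda v: v * 3.28084,
}

def fallback_transform(data: dict, target_fields: list) -> dict:
    """Map each target field to the closest-named source field,
    applying a known unit conversion when the names imply one."""
    out = {}
    for tgt in target_fields:
        match = difflib.get_close_matches(tgt, data.keys(), n=1, cutoff=0.4)
        if not match:
            continue  # best-effort: leave unmappable fields out
        src = match[0]
        value = data[src]
        for (from_unit, to_unit), convert in UNIT_CONVERSIONS.items():
            if from_unit in src.lower() and to_unit in tgt.lower():
                value = convert(value)
        out[tgt] = value
    return out

print(fallback_transform(
    {"temperature_celsius": 18.5, "humidity_percent": 72},
    ["temperature_fahrenheit", "humidity"]))
```

Since every step is deterministic string and arithmetic work, this tier always terminates with some output, which is exactly the forward-progress guarantee the architecture requires of it.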
A key architectural decision is that safeguards are not optional add-ons but first-class pipeline stages with their own metrics (ensemble trigger rate, fallback trigger rate, safeguard lift). This distinguishes SAGAI-MID from prior work that evaluates LLM accuracy without providing reliability mechanisms.

TABLE II: Mapping of SAGAI-MID to Interoperability Tactics [1]

Tactic           | Traditional                        | SAGAI-MID
Discover         | Service registry (DNS, Consul)     | Schema registry + LLM semantic matching
Tailor Interface | Static adapter / decorator         | LLM-generated translation layer
Convert Data     | Pre-built serializers, XSLT        | LLM-generated conversion code
Manage Resources | Connection pooling, rate limiting  | Token budgets, model routing, semantic caching

E. Mapping to Interoperability Tactics

Table II maps SAGAI-MID's components to Bass et al.'s interoperability tactics, showing how LLMs transform each tactic from a static, design-time artifact into a dynamic, runtime capability.

F. Walkthrough: Weather API Migration

To illustrate the pipeline, consider a weather service migrating from REST v1 (flat structure, Celsius, ISO 8601 timestamps) to v2 (nested structure, Fahrenheit, Unix epochs). A client sends:

{
  "city": "Amsterdam",
  "temperature_celsius": 18.5,
  "humidity_percent": 72,
  "wind_speed_kmh": 15.3,
  "timestamp": "2026-06-23T14:30:00Z"
}

Layer 1 (Input): The SchemaRegistry matches the route to a RouteConfig specifying source (v1) and target (v2) JSON Schemas.

Layer 2 (Detection): The structural detector identifies type mismatches (timestamp: string → integer) and nesting mismatches (city → location.name). The semantic detector identifies unit mismatches (temperature_celsius → temp_f: Celsius to Fahrenheit) and naming mismatches (humidity_percent → humidity).
Layer 3 (Resolution): Using the CODEGEN strategy, the LLM generates a Python adapter function:

from datetime import datetime

def transform(data):
    return {
        "location": {"name": data["city"]},
        "measurements": {
            "temp_f": data["temperature_celsius"] * 9 / 5 + 32,
            "humidity": float(data["humidity_percent"]),
            "wind_mph": data["wind_speed_kmh"] * 0.621371
        },
        "recorded_at": int(datetime.fromisoformat(
            data["timestamp"].replace("Z", "+00:00")).timestamp())
    }

This function is compiled via exec(), validated against sample input, and cached under the (SHA-256(v1_schema), SHA-256(v2_schema)) key.

Layer 4 (Safeguards): The output is validated against the v2 target schema using jsonschema. All fields are present and correctly typed; Tier 1 passes, and no ensemble or fallback is needed.

Layer 5 (Monitoring): Records latency, token count, and pass@1 result. Subsequent requests execute the cached function with zero LLM calls (< 1 ms).

The output matches the v2 target schema:

{
  "location": {"name": "Amsterdam"},
  "measurements": {"temp_f": 65.3, "humidity": 72.0, "wind_mph": 9.51},
  "recorded_at": 1782225000
}

IV. Implementation

A. Technology Stack

SAGAI-MID is implemented in Python 3.13 using FastAPI with Uvicorn for async-native ASGI middleware support. The middleware extends BaseHTTPMiddleware, which intercepts every HTTP request and response. For registered routes, the middleware reads the request body, runs the detect → resolve → safeguard pipeline, and replaces the body before forwarding; for unregistered routes, the request passes through unmodified with zero overhead.
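The compile-and-cache step behind CODEGEN can be sketched as follows. Hash keying and exec() compilation follow the text; the helper names are ours, and a production deployment would sandbox the exec() call rather than run it directly:

```python
# Sketch of CODEGEN adapter compilation with a schema-hash-keyed cache.
# Helper names are illustrative; real code must sandbox exec().
import hashlib
import json

_adapter_cache: dict = {}

def schema_key(source_schema: dict, target_schema: dict) -> tuple:
    """(SHA-256(source), SHA-256(target)) over canonical JSON."""
    def digest(schema: dict) -> str:
        canonical = json.dumps(schema, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()
    return digest(source_schema), digest(target_schema)

def compile_adapter(adapter_source: str, sample_input: dict):
    """exec() the LLM-generated source and smoke-test it once."""
    namespace: dict = {}
    exec(adapter_source, namespace)   # UNSAFE without sandboxing
    transform = namespace["transform"]
    transform(sample_input)           # validate before caching
    return transform

def get_adapter(src_schema, tgt_schema, adapter_source, sample_input):
    key = schema_key(src_schema, tgt_schema)
    if key not in _adapter_cache:     # cold path: compile once
        _adapter_cache[key] = compile_adapter(adapter_source, sample_input)
    return _adapter_cache[key]        # warm path: zero LLM calls

code = "def transform(data):\n    return {'name': data['city']}"
fn = get_adapter({"v": 1}, {"v": 2}, code, {"city": "Odense"})
print(fn({"city": "Amsterdam"}))  # {'name': 'Amsterdam'}
```

Canonical JSON (sorted keys) before hashing makes the cache key stable across semantically identical schema documents, which is what allows deterministic replay after the first invocation.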
LLM calls use the OpenAI API (compatible with both OpenAI and xAI endpoints) with structured outputs [8]: the .parse() method with Pydantic response models guarantees valid JSON conforming to the expected schema, functioning as runtime contract enforcement [19]. Five Pydantic models define the LLM response contracts: MismatchReport, SchemaMapping, FieldMapping, AdapterCode, and TransformedData. Validation uses both Pydantic v2 (LLM response structure) and jsonschema Draft 2020-12 (target schema conformance). All LLM calls are asynchronous (await on AsyncOpenAI).

B. Prompt Engineering

Four prompt files are stored as plain text files in a dedicated prompts/ directory, editable without code changes: 1) detect_mismatch.txt instructs the LLM to compare source and target schemas and report mismatches with type classification; 2) generate_mapping.txt requests field-level mappings with confidence scores and transformation expressions; 3) generate_adapter.txt asks for a complete Python function that transforms source data to match the target schema, including import statements and type conversions; and 4) transform_data.txt provides the mapping and source data for per-request transformation.

TABLE III: Evaluation Scenarios and Mismatch Types

#  | Scenario                  | Mismatch Types
1  | Weather v1 → v2           | Nesting, unit, name, type
2  | IoT sensor → analytics    | Unit (C → F), name
3  | Stock REST → GraphQL      | Name (camelCase)
4  | Multi-sensor aggregation  | Cardinality, aggregate
5  | Date format bridging      | Type (ISO → epoch)
6  | Nested → flat device log  | Nesting (flatten)
7  | Metric normalization      | Name (OpenTelemetry)
8  | Missing optional fields   | Field missing/extra
9  | Array ↔ single value      | Cardinality
10 | Combined complex          | All types combined

Structured outputs guarantee valid JSON conforming to the expected Pydantic model, so prompts contain no output format instructions, only the task description and schema context [8].
This separation simplifies prompt engineering [16] and reduces token consumption. For non-reasoning models, we set temperature = 0.0 for maximum determinism. Reasoning models (GPT-5, GPT-5-nano, Grok 4.1 Fast R) do not accept the temperature parameter and require longer timeouts.

C. Model Adaptation Layer

The LLM client adapts to behavioral differences across model families. GPT-5 reasoning models (GPT-5, GPT-5-nano) use reasoning_effort instead of temperature; Grok reasoning models (Grok 4.1 Fast R) accept neither parameter. GPT-5.2 supports temperature only with reasoning_effort="none". All reasoning models require 120+ s timeouts (vs. 60 s for standard models). This adaptation is transparent to the rest of the pipeline.

D. Evaluation Scenarios

We design 10 interoperability scenarios covering diverse mismatch types across three protocols (Table III). Each scenario is a self-contained JSON fixture with source/target JSON Schemas, sample input data, golden reference output, and expected mismatches for detection evaluation. Three mock services, Weather REST (v1/v2), IoT Sensor REST bridge, and Stock GraphQL (Strawberry), cover REST, GraphQL, and IoT/MQTT protocols.

V. Evaluation

A. Experimental Setup

We evaluate all 10 scenarios with both DIRECT and CODEGEN strategies, with and without safeguards (4 combinations), for 3 runs per combination (40 scenario-strategy-mode runs per model). We test six LLMs from two providers: GPT-4o, GPT-5, GPT-5.2, and GPT-5-nano from OpenAI; and Grok 4.1 Fast in both non-reasoning and reasoning modes from xAI. The mapping cache is disabled during benchmarking to measure raw LLM performance. Each run records per-stage latency, token consumption, and the complete transformed output.

B. Metrics

Correctness. pass@1 [17]: binary exact match against the golden reference (using DeepDiff [18] with float tolerance ε = 0.01 and numeric type equivalence).
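The pass@1 check can be illustrated with a dependency-free stand-in for the DeepDiff comparison, using the same float tolerance and numeric type equivalence. This is our sketch of the comparison semantics, not the evaluation harness itself:

```python
# Sketch of the pass@1 equality check: exact match with float tolerance
# eps = 0.01 and numeric type equivalence (int 72 matches float 72.0).
# A stand-in for the DeepDiff-based harness, not the actual code.

def matches(actual, expected, eps: float = 0.01) -> bool:
    if isinstance(actual, bool) or isinstance(expected, bool):
        return actual is expected                 # bools are not numbers here
    if isinstance(actual, (int, float)) and isinstance(expected, (int, float)):
        return abs(actual - expected) <= eps      # tolerance + type equivalence
    if isinstance(actual, dict) and isinstance(expected, dict):
        return (actual.keys() == expected.keys() and
                all(matches(actual[k], expected[k], eps) for k in expected))
    if isinstance(actual, list) and isinstance(expected, list):
        return (len(actual) == len(expected) and
                all(matches(a, e, eps) for a, e in zip(actual, expected)))
    return actual == expected

golden = {"measurements": {"temp_f": 65.3, "humidity": 72.0}}
output = {"measurements": {"temp_f": 65.2941, "humidity": 72}}
print(matches(output, golden))  # True: within eps, int/float equivalent
```

A run scores pass@1 = 1 only when this whole-document comparison succeeds, which is why near-correct outputs with a single wrong value score zero.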
Field F1: leaf-key overlap between actual and expected outputs. Value accuracy: fraction of shared keys where values match. Detection precision/recall: mismatch pair overlap vs. expected.

Performance. End-to-end latency (ms) across detect, resolve, and safeguard stages. We report mean and P95 across runs.

Cost. Total token consumption and estimated USD cost per model (token count × per-model pricing) across all 40 combinations.

C. Results

Table IV presents the main cross-model results. Across all six models, CODEGEN achieves equal or higher mean pass@1 than DIRECT. The best-performing configuration, Grok 4.1 Fast (reasoning) with CODEGEN, achieves 0.90 pass@1 at the lowest total cost ($0.18). Field F1 is universally ≥ 0.98, indicating that all models correctly identify which fields to map; the differentiator is value accuracy, whether transformed values are correct.

Table V shows per-scenario pass@1 averaged across all models, revealing the difficulty spectrum. Four scenarios (stock casing, nested-to-flat, missing fields, array-single) achieve perfect 1.00 pass@1 across all models. Sensor analytics (unit conversion) and metric normalization (OpenTelemetry renaming) are the hardest, achieving only 0.50–0.56 mean pass@1, with significant model-dependent variation.

Table VI presents latency and cost metrics. Latency varies by over 10× across models: Grok 4.1 Fast (non-reasoning) is the fastest at 9–11 s mean, while GPT-5 is the slowest at 75–104 s. Cost varies by over 30×: from $0.18 (Grok 4.1 Fast reasoning) to $6.25 (GPT-5). Notably, the most accurate model (Grok 4.1 R, 0.90 pass@1) is also the cheapest, while the most expensive model (GPT-5, $6.25) achieves only 0.80 pass@1.

D. Analysis

CODEGEN outperforms DIRECT. Across all six models, CODEGEN achieves equal or higher mean pass@1 than DIRECT (0.83 vs. 0.77 overall).
The advantage is most pronounced on complex scenarios: for weather version migration (scenario 1), CODEGEN achieves 0.78 vs. 0.56 DIRECT; for date bridging (scenario 5), 0.95 vs. 0.56. The CODEGEN strategy benefits from code validation: the generated adapter function is tested against sample input before caching, catching errors that would propagate silently in DIRECT mode.

TABLE IV: Cross-Model Results (Mean over 10 Scenarios, 3 Runs per Combination, 40 Combinations per Model)

                                pass@1      Value Acc.  Mean Latency (s)
Model               Provider    D     C     D     C     D      C       Total Cost
GPT-4o              OpenAI      0.70  0.83  0.92  0.97  11.8   16.1    $2.43
GPT-5               OpenAI      0.80  0.80  0.95  0.95  75.4   103.8   $6.25
GPT-5.2             OpenAI      0.83  0.87  0.97  0.97  31.7   36.7    $4.85
GPT-5-nano          OpenAI      0.73  0.73  0.95  0.98  50.6   55.1    $0.27
Grok 4.1 Fast (NR)  xAI         0.70  0.87  0.92  0.98  9.2    10.5    $0.19
Grok 4.1 Fast (R)   xAI         0.87  0.90  0.97  0.98  69.4   70.8    $0.18
Mean                            0.77  0.83  0.95  0.97  41.3   48.8    $2.36

D = DIRECT, C = CODEGEN, NR = non-reasoning, R = reasoning. Total cost covers all 40 scenario-strategy-mode combinations × 3 runs. Bold indicates best value per column.

TABLE V: Per-Scenario pass@1 (Mean across All 6 Models)

#  | Scenario              | D    | C
1  | Weather version       | 0.56 | 0.78
2  | Sensor analytics      | 0.56 | 0.50
3  | Stock casing          | 1.00 | 1.00
4  | Multi-sensor          | 1.00 | 0.95
5  | Date bridging         | 0.56 | 0.95
6  | Nested to flat        | 1.00 | 1.00
7  | Metric normalization  | 0.50 | 0.56
8  | Missing fields        | 1.00 | 1.00
9  | Array single          | 1.00 | 1.00
10 | Combined complex      | 0.56 | 0.61
   | Mean                  | 0.77 | 0.84

TABLE VI: Performance and Cost Comparison

Model          | P95 Lat. (s) | Tokens (total) | Cost ($)
GPT-4o         | 14–19        | 615K           | 2.43
GPT-5          | 88–117       | 1,180K         | 6.25
GPT-5.2        | 36–40        | 943K           | 4.85
GPT-5-nano     | 57–64        | 1,317K         | 0.27
Grok 4.1 (NR)  | 10–11        | 721K           | 0.19
Grok 4.1 (R)   | 80–81        | 1,535K         | 0.18

P95 latency range shows DIRECT–CODEGEN. Tokens and cost are totals across all 40 combinations × 3 runs.

Model-dependent difficulty patterns. We observe a striking pattern: models exhibit complementary strengths across scenario types.
Reasoning models (GPT-5, Grok 4.1 R) excel at complex combined scenarios (scenario 10: both achieve 1.00 pass@1) but struggle with sensor analytics unit conversion (scenario 2: both achieve 0.00 pass@1). Conversely, non-reasoning models (GPT-4o, Grok 4.1 NR) handle unit conversions well (1.00 DIRECT pass@1 on scenario 2) but struggle with the combined scenario (GPT-4o achieves 0.00 pass@1 on scenario 10 with CODEGEN).

TABLE VII: Cost-Accuracy-Latency Comparison (CODEGEN Strategy)

Model       | pass@1 | Latency  | Cost
Grok R      | 0.90   | 70.8 s   | $0.18
Grok NR     | 0.87   | 10.5 s   | $0.19
GPT-5.2     | 0.87   | 36.7 s   | $4.85
GPT-4o      | 0.83   | 16.1 s   | $2.43
GPT-5       | 0.80   | 103.8 s  | $6.25
GPT-5-nano  | 0.73   | 55.1 s   | $0.27

Bold indicates best value per column.

Similarly, metric normalization (scenario 7) is consistently solved by GPT-4o and both Grok models (all 1.00 pass@1) but fails across all three OpenAI reasoning models: GPT-5 (0.00), GPT-5.2 (0.00–0.33), and GPT-5-nano (0.00). This suggests that reasoning models may overthink straightforward mapping tasks that require simple pattern matching rather than deep reasoning.

Cost-accuracy frontier. The relationship between cost and accuracy is non-monotonic (Table VII). The cheapest model (Grok 4.1 R at $0.18) achieves the highest accuracy (0.90 pass@1 CODEGEN), while the most expensive model (GPT-5 at $6.25) achieves only 0.80. This finding challenges the assumption that larger, more expensive models necessarily produce better results for structured transformation tasks. The best cost-performance ratio for latency-sensitive applications is Grok 4.1 Fast (NR) at 10 s mean latency, 0.87 CODEGEN pass@1, and $0.19 total.

Safeguard impact. The safeguard layer shows modest aggregate impact on pass@1 across models that already achieve high baseline accuracy. Ensemble voting activates on 0–100% of runs depending on model and scenario.
For GPT-5.2, safeguards provide measurable improvement on specific scenarios (sensor analytics DIRECT: +0.33, metric normalization CODEGEN: +0.33). However, we also observe cases where safeguards marginally reduce accuracy (e.g., GPT-5 sensor analytics CODEGEN: −0.33), likely due to ensemble voting selecting a less accurate consensus mapping. The primary value of the safeguard architecture is as a safety net rather than an accuracy booster: it prevents catastrophic failures by providing a deterministic fallback when LLM outputs fail validation, ensuring forward progress on every request. The rule-based fallback (Tier 3) handles known unit conversions and type coercions deterministically, providing a guaranteed minimum quality floor.

Detection quality. Detection recall is consistently high across all models (≥ 0.83), meaning the system identifies the vast majority of schema mismatches. Detection precision varies more widely (0.46–1.00). GPT-4o achieves the highest detection precision (0.95) with perfect recall (1.00), while reasoning models tend to over-report mismatches (lower precision). Importantly, lower detection precision does not impair resolution quality; extra reported mismatches are benign, as the resolution engine only acts on relevant ones.

E. Threats to Validity

Internal: LLM outputs exhibit non-determinism even at temperature = 0 (documented OpenAI behavior). Golden outputs are hand-crafted, introducing potential human error. With N = 3 runs per combination, variance estimates have wide confidence intervals.

External: 10 scenarios may not cover all interoperability patterns. We evaluate JSON-only payloads (no XML, Protobuf, Avro) over HTTP-only transport. A single middleware deployment (not distributed or multi-hop) is tested.

Construct: pass@1 is a strict binary metric that may undercount near-correct outputs.
Token cost depends on pricing at the time of evaluation (February 2026), which may change.

VI. Discussion

LLMs as runtime architectural components. Our results demonstrate that LLMs can effectively serve as runtime interoperability components, achieving 0.73–0.90 pass@1 on diverse schema transformation tasks. This enables a paradigm shift: the interoperability tactics from Bass et al., traditionally implemented as static, design-time adapters, become dynamic runtime capabilities. The practical implication is that new service integrations can be handled without manual adapter development or deployment cycles, at the cost of LLM latency and API expenses. This is particularly relevant in microservice ecosystems, where Lercher et al. [9] report that API evolution is a persistent maintenance burden.

Model selection guidance. Our cross-model evaluation reveals that model selection for structured transformation tasks differs from general-purpose LLM benchmarks. The most expensive model (GPT-5 at $6.25) is not the most accurate (0.80 pass@1), while the cheapest (Grok 4.1 R at $0.18) achieves the highest accuracy (0.90 pass@1). We recommend: (1) Grok 4.1 Fast (NR) for latency-critical applications (10 s mean, 0.87 pass@1); (2) Grok 4.1 Fast (R) for accuracy-critical applications (0.90 pass@1 at minimal cost); and (3) GPT-4o as a balanced option with the best detection precision.

When to use DIRECT vs. CODEGEN. CODEGEN should be preferred for production workloads: it achieves higher pass@1, is deterministic after first invocation, and amortizes LLM cost over repeated requests. DIRECT is appropriate for prototyping, one-off transformations, or scenarios where schema pairs change frequently. A hybrid approach (starting with DIRECT for new schema pairs, graduating to CODEGEN after stabilization) optimizes for both agility and reliability.
The CODEGEN advantage is most pronounced on complex scenarios involving multiple mismatch types simultaneously (e.g., weather version migration: 0.78 vs. 0.56 pass@1).

Safeguard architecture as defense-in-depth. While safeguards show modest aggregate impact on pass@1, their value is architectural: they provide defense-in-depth against LLM non-determinism. The three-tier design ensures that every request receives a response, degrading gracefully from LLM-generated to ensemble-voted to rule-based output. This is critical for production middleware, where a failed transformation is worse than a best-effort one. The safeguard architecture directly addresses the SAGAI workshop's topic of "tactics to deal with the non-deterministic character of generative AI."

Quality attribute tradeoffs. Introducing LLMs as runtime architectural components creates fundamental tradeoffs across quality attributes. We analyze these through the lens of ISO/IEC 25010 [7]:

Performance efficiency vs. modifiability: Static adapters execute in < 1 ms but require manual coding for every schema change. SAGAI-MID incurs 10–104 s latency on cold starts but handles novel schema combinations without code changes; CODEGEN caching converges toward static adapter performance on subsequent requests.

Reliability vs. flexibility: LLM non-determinism introduces a new failure mode. The three-tier safeguard architecture provides progressively more deterministic fallbacks: ensemble voting reduces variance [15], and rule-based fallback eliminates non-determinism entirely.

Cost vs. autonomy: LLM API costs ($0.18–$6.25) are a new operational expense, but they replace developer time for manual adapter development. CODEGEN caching amortizes costs, and model selection can reduce them by over 30×.

Reasoning model paradox.
Reasoning models (GPT-5, GPT-5-nano, Grok 4.1 R) struggle with seemingly simple tasks: unit conversions (scenario 2) and metric normalization (scenario 7) yield 0.00 pass@1 where non-reasoning models achieve 1.00. Conversely, reasoning models excel at complex combined transformations (scenario 10). This parallels Liu et al.'s [20] finding that chain-of-thought prompting can reduce performance on tasks where deliberation hurts human accuracy: extended inference chains introduce error accumulation on mechanistic tasks (e.g., F = C × 9/5 + 32). This suggests that model routing based on mismatch complexity could further improve system-level accuracy.

Limitations. Key limitations include: (1) LLM latency (10–104 s on cold starts) compared to static adapters (< 1 ms), making the approach unsuitable for ultra-low-latency requirements without caching; (2) cloud API dependency for LLM calls (cost, network availability, vendor lock-in); (3) security considerations around exec() of LLM-generated code in CODEGEN, which requires sandboxing in production; (4) JSON-only payload support (no XML, Protobuf, or binary formats); (5) prompt sensitivity: performance depends on prompt engineering quality and model-specific tuning; and (6) evaluation scale: 10 scenarios with 3 runs each provide limited statistical power for per-scenario claims.

VII. Conclusion

We presented SAGAI-MID, a generative AI-driven middleware that uses LLMs to dynamically detect and resolve schema mismatches at runtime. The five-layer architecture connects LLM-based middleware to established interoperability tactics from Bass et al., transforming them from static design-time artifacts into dynamic runtime capabilities. A hybrid detection module combines deterministic structural analysis with LLM semantic analysis for comprehensive mismatch identification.
Dual resolution strategies (DIRECT and CODEGEN) offer flexibility-determinism tradeoffs with hash-keyed caching, and a three-tier safeguard stack provides defense-in-depth against LLM non-determinism.

Our evaluation across 10 interoperability scenarios (spanning REST version migration, IoT-to-analytics bridging, and GraphQL protocol conversion) and six LLMs from two providers reveals four key findings: (1) CODEGEN consistently outperforms DIRECT (0.83 vs. 0.77 mean pass@1), with the advantage most pronounced on complex multi-mismatch scenarios; (2) the best model achieves 0.90 pass@1 at the lowest cost ($0.18), demonstrating a non-monotonic cost-accuracy relationship; (3) reasoning and non-reasoning models exhibit complementary strengths, suggesting that model routing could further improve accuracy; and (4) safeguards provide architectural value as a defense-in-depth mechanism rather than a consistent accuracy booster, with the rule-based fallback guaranteeing forward progress on every request.

These findings have practical implications for software architects: LLMs can serve as runtime interoperability components for systems where schema evolution outpaces manual adapter development, provided that appropriate safeguards, caching, and model selection strategies are in place.

Future work includes: (1) local/on-premise models to eliminate cloud dependency and address data privacy; Falcao et al. show open-source models achieving competitive results [2]; (2) model routing based on mismatch complexity to exploit the complementary strengths of reasoning and non-reasoning models; (3) service mesh integration (e.g., as an Envoy or Istio filter) for transparent deployment in microservice architectures; (4) schema evolution tracking for proactive adapter regeneration when backend schemas change; and (5) broader protocol support including XML, Protobuf, gRPC, and binary formats.

References

[1] L. Bass, P. Clements, and R. Kazman, Software Architecture in Practice, 4th ed. Addison-Wesley, 2021.
[2] R. Falcao, S. Schweitzer, J. Siebert, E. Calvet, and F. Elberzhager, "Evaluating the effectiveness of LLM-based interoperability," in Proc. ICSE, 2026, to appear.
[3] N. Guran, F. Knauf, M. Ngo, S. Petrescu, and J. S. Rellermeyer, "Towards a middleware for large language models," arXiv preprint arXiv:2411.14513, 2024.
[4] M. Parciak, B. Vandevoort, F. Neven, L. M. Peeters, and S. Vansummeren, "Schema matching with large language models: an experimental study," in Proc. VLDB TaDA Workshop, 2024.
[5] N. Seedat and M. van der Schaar, "Matchmaker: Self-improving large language model programs for schema matching," in Proc. NeurIPS TRL Workshop, 2024.
[6] M. Esposito, X. Li, S. Moreschini, N. Ahmad, T. Cerny, K. Vaidhyanathan, V. Lenarduzzi, and D. Taibi, "Generative AI for software architecture: Applications, challenges, and future directions," J. Syst. Softw., vol. 231, art. 112607, 2026.
[7] ISO/IEC 25010:2023, Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Product quality model, International Organization for Standardization, 2023.
[8] OpenAI, "Structured outputs in the API," 2024. [Online]. Available: https://platform.openai.com/docs/guides/structured-outputs. Accessed: Feb. 2026.
[9] A. Lercher, J. Glock, C. Macho, and M. Pinzger, "Microservice API evolution in practice: A study on strategies and challenges," J. Syst. Softw., vol. 215, p. 112110, 2024.
[10] A. Tolk and J. A. Muguira, "The Levels of Conceptual Interoperability Model," in Proc. Fall Simulation Interoperability Workshop, 2003.
[11] G. Hohpe and B. Woolf, Enterprise Integration Patterns. Addison-Wesley, 2003.
[12] A. Narayan, I. Chami, L. Orr, S. Arora, and C. Ré, "Can foundation models wrangle your data?," Proc. VLDB Endow., vol. 16, no. 4, pp. 738–746, 2022.
[13] R. Peeters, A. Steiner, and C. Bizer, "Entity matching using large language models," in Proc. EDBT, 2025.
[14] J. O. Kephart and D. M. Chess, "The vision of autonomic computing," Computer, vol. 36, no. 1, pp. 41–50, 2003.
[15] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," in Proc. ICLR, 2023.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," in Proc. NeurIPS, 2022.
[17] M. Chen et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[18] S. Dehpour, "DeepDiff: Deep difference and search of any Python object/data," 2024. [Online]. Available: https://github.com/seperman/deepdiff. Accessed: Feb. 2026.
[19] B. Meyer, "Applying 'Design by Contract'," Computer, vol. 25, no. 10, pp. 40–51, 1992.
[20] R. Liu, J. Geng, A. J. Wu, I. Sucholutsky, T. Lombrozo, and T. L. Griffiths, "Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse," in Proc. ICML, 2025.