An Agentic System for Schema Aware NL2SQL Generation

Notice: This research summary and analysis were automatically generated with AI. For authoritative details, please refer to the [Original Paper Viewer] below or the original arXiv source.

The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, with over 90% cost reduction compared to LLM-centric baselines, as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of $0.0085, compared to $0.094 for LLM-only systems, with near-zero operational costs for locally executed queries. [GitHub repository: https://github.com/mindslab25/CESMA]


💡 Research Summary

The paper addresses the growing demand for accessible data querying by proposing a cost‑effective, schema‑aware, agentic NL2SQL system that primarily relies on Small Language Models (SLMs) and invokes a Large Language Model (LLM) only when necessary. Recognizing that recent NL2SQL advances have been dominated by expensive LLMs, which raise concerns about computational overhead, data privacy, and deployment feasibility in resource‑constrained settings, the authors design a four‑stage pipeline—Schema Extraction, Query Decomposition, SQL Generation, and Validation/Execution—each handled by a dedicated agent.

The Extractor Agent combines three sources of schema information: (1) database metadata, (2) retrieval-augmented generation (RAG) contexts extracted from documentation, and (3) evidence mappings from historical query-schema alignments. Using a compact embedding model (all-MiniLM-L6-v2) and a ChromaDB vector store, it maps user questions and schema documents into a shared semantic space and performs approximate nearest-neighbor search to retrieve the top-10 relevant evidence snippets in under a second. This enriched, schema-aware context reduces hallucination and improves constraint handling.
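The retrieval step can be sketched as follows. As a stand-in for the MiniLM embeddings and ChromaDB store (neither is required here), this toy bag-of-words similarity illustrates only the mechanics of ranking schema evidence against a question; the snippets and question are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector standing in for dense sentence embeddings.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, snippets: list, k: int = 10) -> list:
    # Rank evidence snippets by similarity and keep the top-k,
    # mirroring the Extractor's top-10 nearest-neighbor search.
    q = embed(question)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]

snippets = [
    "table customers: columns id, name, country",
    "table orders: columns id, customer_id, total",
    "table products: columns id, price",
]
top = retrieve("total value of orders per customer", snippets, k=2)
```

A production version would replace `embed` with a sentence-transformer model and delegate the nearest-neighbor search to the vector store.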

The Decomposer Agent employs a 7‑billion‑parameter SLM (Mistral‑7B) to break the natural‑language question into a structured plan: entity identification, condition extraction, execution ordering, and output formatting. By leveraging the schema‑aware context from the Extractor, it aligns linguistic tokens with the correct tables, columns, and relationships, enabling reliable handling of multi‑step queries, joins, and nested operations.
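The four-part plan described above can be represented as a simple structured object. The field names below are illustrative; the paper does not specify the exact plan format the SLM emits.

```python
from dataclasses import dataclass

@dataclass
class DecompositionPlan:
    # Hypothetical plan schema covering the four decomposition steps.
    entities: list          # tables/columns the question refers to
    conditions: list        # filters extracted from the question
    execution_order: list   # ordered sub-steps for multi-step queries
    output_format: str      # e.g. "scalar", "list", "table"

# Example plan for "total order value for German customers" (invented).
plan = DecompositionPlan(
    entities=["orders", "customers"],
    conditions=["customers.country = 'DE'"],
    execution_order=["join orders to customers",
                     "filter by country",
                     "aggregate SUM(orders.total)"],
    output_format="scalar",
)
```

Keeping the plan structured (rather than free text) lets the Generator consume it deterministically.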

The Generator Agent uses Llama‑3.1‑8B as the primary model to produce an initial SQL statement from the decomposition plan and the enriched schema context. The generated query is immediately passed to the Validator/Executor Agent. If execution fails, a fallback mechanism triggers GPT‑4o (an LLM) to regenerate the SQL using the original natural language query, the failed SQL, and the error message. Up to three regeneration attempts are allowed before reporting a failure.
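The generate-validate-fallback loop can be sketched as below. The agent functions are stubs (the real system calls Llama-3.1-8B and GPT-4o); only the control flow, including the three-attempt cap, follows the description above.

```python
def generate_with_fallback(question, context, slm_generate, llm_generate,
                           execute, max_llm_attempts=3):
    """Generate SQL with the SLM; on execution failure, fall back to the LLM,
    feeding it the question, the failed SQL, and the error, up to 3 retries."""
    sql = slm_generate(question, context)
    error = execute(sql)            # returns None on success, message on failure
    if error is None:
        return sql
    for _ in range(max_llm_attempts):
        sql = llm_generate(question, sql, error)   # LLM repairs using the error
        error = execute(sql)
        if error is None:
            return sql
    raise RuntimeError(f"failed after {max_llm_attempts} LLM attempts: {error}")

# Stub agents simulating one SLM failure followed by an LLM repair.
slm = lambda q, c: "SELEC 1"                 # malformed SQL from the SLM
llm = lambda q, bad_sql, err: "SELECT 1"     # regenerated SQL from the LLM
execute = lambda sql: None if sql == "SELECT 1" else "syntax error near 'SELEC'"
result = generate_with_fallback("how many rows?", "schema ctx", slm, llm, execute)
```

Because the fallback is gated on an actual execution error, the LLM cost is incurred only for the minority of queries the SLM cannot handle.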

The Validator and Executor Agent implements a four‑stage validation process: (1) evidence‑based value validation, (2) syntax and schema conformity checking, (3) execution validation within a secure transaction (detecting missing columns, type mismatches, constraint violations), and (4) semantic validation of the result set against the original intent (checking aggregations, grouping, empty results). This layered validation not only catches syntactic errors but also enforces semantic alignment with the database, substantially mitigating LLM “hallucination.”
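A minimal sketch of stages (2)-(4) against SQLite is shown below; the staging and return conventions are assumptions, and the evidence-based value checks of stage (1) and the fuller semantic checks are omitted for brevity.

```python
import sqlite3

def validate_and_execute(sql: str, conn: sqlite3.Connection):
    """Return (rows, error). Errors are reported as strings so a caller
    (e.g. the LLM fallback) can feed them back into regeneration."""
    # Stage 2: compile the statement (EXPLAIN) to catch syntax errors,
    # missing tables/columns, and other schema mismatches without running it.
    try:
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as e:
        return None, f"syntax/schema error: {e}"
    # Stage 3: execution validation; roll back so failures leave no trace.
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as e:
        conn.rollback()
        return None, f"execution error: {e}"
    # Stage 4: a token semantic check - flag empty result sets for review.
    if not rows:
        return rows, "warning: empty result set"
    return rows, None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
ok_rows, ok_err = validate_and_execute("SELECT x FROM t", conn)
bad_rows, bad_err = validate_and_execute("SELECT y FROM t", conn)
```

Separating compilation from execution lets cheap failures short-circuit before any data is touched.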

Implementation leverages LangGraph for workflow orchestration and LangChain for stateful memory and context routing. The system’s modularity allows each agent to be swapped or upgraded independently.
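The modularity claim can be illustrated without LangGraph itself: if each agent is a function over a shared state, any stage can be swapped independently. This plain-Python sketch is an analogy to, not a reproduction of, the paper's LangGraph workflow; all agent bodies are placeholders.

```python
def run_pipeline(state: dict, agents: list) -> dict:
    # Each agent reads the shared state and returns an enriched copy,
    # loosely mirroring a linear LangGraph workflow.
    for agent in agents:
        state = agent(state)
    return state

def extractor(state):  return {**state, "context": f"schema for: {state['question']}"}
def decomposer(state): return {**state, "plan": ["identify entities", "extract conditions"]}
def generator(state):  return {**state, "sql": "SELECT name FROM customers"}
def validator(state):  return {**state, "valid": state["sql"].startswith("SELECT")}

result = run_pipeline({"question": "list all customers"},
                      [extractor, decomposer, generator, validator])
```

Swapping, say, the generator for a different model is a one-line change to the agent list, which is the upgrade path the authors describe.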

Experimental evaluation uses the BIRD benchmark (12,751 NL-SQL pairs across 95 databases and 37 professional domains). The proposed system resolves approximately 67% of queries using only the SLM pipeline, achieving an average cost per query of $0.0085, compared to $0.094 for LLM-only baselines, a reduction exceeding 90%. Overall execution accuracy reaches 47.78%, and the validation efficiency score is 51.05%, outperforming prior LLM-centric multi-agent approaches in cost while delivering competitive accuracy.
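The headline cost claim follows directly from the paper's own per-query figures:

```python
# Reproduce the reported cost reduction from the figures quoted above.
llm_only_cost = 0.094    # average $/query, LLM-only baseline
system_cost   = 0.0085   # average $/query, proposed system
reduction = (llm_only_cost - system_cost) / llm_only_cost
print(f"cost reduction: {reduction:.1%}")  # about 91%, i.e. "exceeding 90%"
```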

The authors acknowledge limitations: (1) execution accuracy remains lower than state‑of‑the‑art LLM‑only models, especially on complex multi‑join and aggregation queries; (2) the retrieval of schema evidence may become a bottleneck for extremely large enterprise schemas; (3) reliance on a fixed fallback LLM (GPT‑4o) could introduce latency spikes. Future work is outlined to improve evidence retrieval efficiency, expand SLM fine‑tuning on domain‑specific corpora, enhance inter‑agent collaboration (e.g., shared memory of execution feedback), and explore continual learning from user corrections.

In summary, this work demonstrates that a carefully orchestrated ensemble of small, specialized models can dramatically cut operational costs while maintaining acceptable performance for NL2SQL tasks, paving the way for on‑premise, privacy‑preserving deployments in real‑world enterprise environments.

