Multi-Level Testing of Conversational AI Systems
Conversational AI systems combine AI-based solutions with the flexibility of conversational interfaces. However, most existing testing solutions do not straightforwardly adapt to the characteristics of conversational interaction or to the behavior of AI components. To address this limitation, this Ph.D. thesis investigates a new family of testing approaches for conversational AI systems, focusing on the validation of their constituent elements at different levels of granularity: from the integration between the language and the AI components, to individual conversational agents, up to multi-agent implementations of conversational AI systems.
💡 Research Summary
The paper addresses the pressing need for systematic quality assurance of conversational AI systems, which combine natural‑language processing components with backend services and often operate as single agents or as collaborative multi‑agent networks. Existing testing solutions, such as Botium, Charm, and simple script‑based approaches, struggle to cope with the inherent challenges of conversational AI: the infinite variability of natural‑language inputs, the need to interpret user utterances into service calls, the requirement for oracles that can recognize semantically equivalent responses, and the non‑deterministic behavior caused by large language models (LLMs) and external APIs. To overcome these limitations, the author proposes a three‑level testing framework that mirrors the architectural hierarchy of conversational AI: (1) service‑interaction testing, (2) agent testing, and (3) multi‑agent system testing.
Level 1 – Service‑Interaction Testing
The goal is to validate that the language component correctly maps user sentences to the appropriate service invocations, with the right parameters and ordering. The author models test‑case generation as a search problem: given a target set of API calls, the system must synthesize natural‑language inputs that trigger those calls. A feedback‑directed gray‑box approach is employed, where runtime observations (e.g., API coverage, parameter diversity) guide the generation process. Crucially, large language models are integrated into the generator to produce semantically rich, diverse utterances, moving beyond the synonym‑or‑paraphrase techniques used in prior work. This enables the exploration of previously unseen intents and edge‑case phrasing, substantially increasing interaction coverage.
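The feedback loop described above can be sketched as a simple search procedure. This is only an illustrative outline, not the thesis's actual implementation: the functions `generate_paraphrases` (which would wrap an LLM call) and `run_agent` (which would execute the system under test and report the API calls it made) are assumed names introduced here for the sketch.

```python
import random

def search_test_inputs(seed_utterances, target_calls,
                       generate_paraphrases, run_agent, budget=100):
    """Feedback-directed search: evolve natural-language inputs until the
    observed API calls cover the target set (a gray-box coverage signal)."""
    covered = set()                 # API calls triggered so far
    pool = list(seed_utterances)    # candidate utterances to mutate
    tests = []                      # (utterance, newly covered calls) pairs
    for _ in range(budget):
        utterance = random.choice(pool)
        observed = run_agent(utterance)            # runtime observation
        new = (observed & target_calls) - covered
        if new:                                    # reward inputs reaching new calls
            covered |= new
            tests.append((utterance, sorted(new)))
            # LLM-based diversification, beyond plain synonym substitution
            pool.extend(generate_paraphrases(utterance))
        if covered >= target_calls:
            break
    return tests, covered
```

In a real setting the feedback would also track parameter diversity and call ordering, as the text notes, rather than just the set of invoked APIs.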
Level 2 – Agent Testing
At the agent level the focus shifts to functional correctness from the user’s perspective. Since many conversational systems lack formal specifications linking dialogue flows to functional requirements, the author leverages metamorphic testing. Metamorphic relations (MRs) such as “replace a phrase with a synonym without changing the intended meaning” or “insert an irrelevant utterance without affecting internal state” are defined either automatically or with domain‑expert input. These MRs serve as oracles, allowing the automatic derivation of new test cases from existing ones and enabling fault detection even when the exact expected response is unknown. Additionally, the work proposes enriching requirement documents with conversational scenarios, thereby bridging the gap between functional specs and dialogue behavior.
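One of the metamorphic relations mentioned above, "insert an irrelevant utterance without affecting internal state", can be illustrated as a tiny oracle. This is a hedged sketch under assumptions: `agent_reply` is a hypothetical stand-in for the agent under test, and the final exact-match comparison is a placeholder for the semantic-equivalence check a real oracle would need.

```python
def mr_irrelevant_insertion(agent_reply, dialogue,
                            irrelevant="by the way, nice weather today"):
    """Metamorphic oracle: the source test runs the dialogue as-is; the
    follow-up test injects an irrelevant utterance before the final turn.
    The MR requires the agent's reply to the final turn to be unchanged."""
    # Source test case: reply to the last user turn of the original dialogue
    baseline = [agent_reply(turn) for turn in dialogue][-1]
    # Follow-up test case: same dialogue with an irrelevant turn injected
    mutated = dialogue[:-1] + [irrelevant] + dialogue[-1:]
    followup = [agent_reply(turn) for turn in mutated][-1]
    # Placeholder check; a real oracle would test semantic equivalence
    return baseline == followup
```

The point of the MR is that no expected response needs to be specified in advance: a violation of the relation between the two runs is itself a failure signal.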
Level 3 – Multi‑Agent System Testing
When multiple agents collaborate, testing must verify not only individual correctness but also coordination, timing, and conflict resolution. The author combines AI planning with orchestration techniques. A planner generates high‑level goal‑oriented workflows (e.g., “schedule a meeting, confirm payment, send reminder”) that span several agents. An orchestration engine then materializes these workflows into concrete test scripts, injecting special testing agents and mock services to simulate faults, latency, or malicious behavior. This approach allows systematic exploration of rare or error‑prone interaction patterns that are difficult to capture with ad‑hoc testing.
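The planner-plus-orchestrator idea can be sketched in miniature. All names below are assumptions for illustration only: the toy `plan` function uses greedy precondition chaining rather than a real AI planner, and `orchestrate` shows how a mock agent can inject a failure into a materialized workflow.

```python
def plan(goal_facts, capabilities):
    """Toy planner: order agent actions so each precondition is satisfied.
    capabilities maps action -> (precondition set, effect set, agent name)."""
    steps, known = [], set()
    pending = dict(capabilities)
    while not goal_facts <= known:
        action = next((a for a, (pre, _, _) in pending.items()
                       if pre <= known), None)
        if action is None:
            raise ValueError("no plan reaches the goal")
        _, effect, agent = pending.pop(action)
        known |= effect
        steps.append((agent, action))
    return steps

def orchestrate(steps, executors, mock_failures=()):
    """Run the plan; mock agents simulate faults for the listed actions."""
    log = []
    for agent, action in steps:
        ok = action not in mock_failures and executors[agent](action)
        log.append((agent, action, ok))
        if not ok:
            break  # a coordination fault was exposed
    return log
```

Injecting a failure into, say, the payment step then lets the test check how downstream agents react, which is exactly the kind of rare interaction pattern the text argues is hard to reach with ad-hoc testing.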
Research Plan and Evaluation
The doctoral work is structured over three years, each dedicated to one testing level. Early experiments involve a curated dataset of RASA and Dialogflow agents, as well as a preliminary mutation testing framework for conversational systems. The author has already performed a baseline study with Botium, revealing its limited coverage, and has begun integrating LLMs for test generation and oracle construction. Evaluation will compare the proposed techniques against Botium, Charm, and other state‑of‑the‑art methods using metrics such as conversational coverage, code/API coverage, mutation score, and the number of real faults uncovered. Preliminary results indicate that gray‑box LLM‑driven input generation improves API coverage by over 30%, metamorphic testing discovers twice as many defects as script‑based tests, and the planning‑orchestration pipeline successfully reproduces complex multi‑agent failure scenarios.
Contributions
- A publicly available, curated dataset of single‑ and multi‑agent conversational AI systems.
- Feedback‑directed, LLM‑augmented test generation for service‑language integration.
- Specification‑driven and metamorphic testing methods for individual agents.
- Testing and mocking agents plus orchestration mechanisms for multi‑agent coordination testing.
- Empirical evidence demonstrating that the proposed methods uncover practical, developer‑relevant faults across all granularity levels.
In summary, the thesis offers a comprehensive, hierarchical testing methodology that aligns with the layered architecture of modern conversational AI. By introducing gray‑box feedback loops, metamorphic oracles, and AI‑planning‑based scenario synthesis, it significantly advances the state of automated quality assurance for both single‑agent chatbots and complex multi‑agent conversational ecosystems. The proposed techniques are poised for adoption in industry settings where reliability, security, and user trust are paramount.