PromptPex: Automatic Test Generation for Language Model Prompts

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) are being used in many applications, and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs, generate outputs, and perform some specific function. However, prompts differ from traditional code in many ways and require new approaches to ensure that they are robust. For example, unlike traditional software, the output of a prompt depends on the AI model that interprets it. Also, while natural language prompts are easy to modify, the impact of updates is harder to predict. New approaches to testing, debugging, and modifying prompts with respect to the model running them are required. To address some of these issues, we developed PromptPex, an LLM-based tool to automatically generate and evaluate unit tests for a given prompt. PromptPex extracts input and output specifications from a prompt and uses them to generate diverse, targeted, and valid unit tests. These tests are instrumental in identifying regressions when a prompt is changed and also serve as a tool to understand how prompts are interpreted by different models. We use PromptPex to generate tests for eight benchmark prompts and evaluate the quality of the generated tests by checking whether they can cause each of four diverse models to produce invalid output. PromptPex consistently creates tests that result in more invalid model outputs than a carefully constructed baseline LLM-based test generator. Furthermore, by extracting concrete specifications from the input prompt, PromptPex allows prompt writers to clearly understand and test specific aspects of their prompts. The source code of PromptPex is available at https://github.com/microsoft/promptpex.


💡 Research Summary

The paper introduces PromptPex, a novel tool that automatically generates and evaluates unit tests for large language model (LLM) prompts. The authors argue that prompts, while functionally similar to traditional software functions—taking well‑defined inputs and producing outputs—exhibit unique challenges: their behavior is non‑deterministic, heavily dependent on the underlying model, and often expressed in natural language that can be ambiguous. Consequently, conventional software testing techniques are insufficient for ensuring prompt robustness, especially when prompts are edited or when the underlying model is swapped.

PromptPex addresses these challenges through a three‑stage pipeline. First, it extracts an input specification (IS) and a set of output rules (OR) from the Prompt Under Test (PUT). The extraction is performed by a capable LLM that parses the natural‑language text of the prompt and produces a formalized, pre‑condition/post‑condition‑style specification. For example, in a part‑of‑speech tagging prompt, the IS might state that the input must be a sentence plus a word that appears in that sentence, while the OR would capture constraints such as "return only the POS tag, otherwise return Unknown or CantAnswer."
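The extraction stage can be sketched as follows. This is a minimal illustration, not PromptPex's real implementation: the `PromptSpec` type, the `EXTRACTION_PROMPT` wording, and the `ask_llm` stub (which returns a canned answer in place of a real model call) are all hypothetical names introduced here.

```python
# Hypothetical sketch of stage 1: extract an input specification (IS) and
# output rules (OR) from a Prompt Under Test (PUT). `ask_llm` is a stub; a
# real implementation would send EXTRACTION_PROMPT to a capable model.
from dataclasses import dataclass, field

@dataclass
class PromptSpec:
    input_spec: list[str] = field(default_factory=list)   # pre-conditions on inputs
    output_rules: list[str] = field(default_factory=list) # post-conditions on outputs

EXTRACTION_PROMPT = (
    "List the constraints the following prompt places on its inputs (as INPUT: lines), "
    "then the rules its outputs must obey (as OUTPUT: lines):\n\n{put}"
)

def ask_llm(prompt: str) -> str:
    """Stub standing in for the extraction model; returns a canned answer
    of the kind a model might produce for the POS-tagging PUT."""
    return (
        "INPUT: a sentence, plus a word that appears in that sentence\n"
        "OUTPUT: return only the POS tag\n"
        "OUTPUT: if the tag cannot be determined, return Unknown or CantAnswer"
    )

def extract_spec(put: str) -> PromptSpec:
    spec = PromptSpec()
    for line in ask_llm(EXTRACTION_PROMPT.format(put=put)).splitlines():
        kind, _, text = line.partition(": ")
        (spec.input_spec if kind == "INPUT" else spec.output_rules).append(text)
    return spec

pos_put = "Given a sentence and a word from it, reply with only that word's POS tag."
spec = extract_spec(pos_put)
print(spec.input_spec)    # one input pre-condition
print(spec.output_rules)  # two output rules
```

Keeping the IS and OR as separate lists mirrors the paper's split between what valid inputs look like and what compliant outputs must obey, which the later stages consume independently.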

Second, PromptPex uses another LLM to generate diverse test inputs that deliberately probe the boundaries of the extracted specifications. Tests are of two kinds: (a) boundary‑testing inputs that explore edge cases of the IS (e.g., compound words, missing words) and (b) rule‑violating inputs that aim to trigger non‑compliant outputs according to the OR (e.g., prompting the model to include explanations when only a tag is allowed).
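The two kinds of tests can be sketched as one generation request per specification line. The template wording below is an assumption for illustration, not PromptPex's actual prompts.

```python
# Hypothetical sketch of stage 2: turn each extracted specification line into
# a targeted test-generation request for a second LLM.
def boundary_request(input_spec: str) -> str:
    # (a) probe the edges of the input specification
    return (f"Write a valid but extreme input for this specification "
            f"(e.g. compound words, boundary cases): {input_spec}")

def violation_request(output_rule: str) -> str:
    # (b) try to provoke an output that breaks a rule
    return (f"Write an input likely to make the model violate this "
            f"output rule: {output_rule}")

input_specs = ["a sentence plus a word that appears in that sentence"]
output_rules = ["return only the POS tag",
                "otherwise return Unknown or CantAnswer"]

requests = ([boundary_request(s) for s in input_specs] +
            [violation_request(r) for r in output_rules])
print(len(requests))  # one request per specification line
```

Driving generation from individual specification lines is what makes the tests targeted: each request names the exact constraint it is trying to stress.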

Third, the generated tests are executed against multiple Model Under Test (MUT) instances. In the evaluation, four models were used: gpt‑4o‑mini, gemma2‑9b, qwen2.5‑3b, and llama3.2‑1b. For each model, PromptPex captures the raw output, feeds it to a compliance‑checking LLM, and labels the result as either compliant or non‑compliant. The authors treat non‑compliant outcomes as “successful” test cases because they reveal a violation of the prompt’s intended behavior.
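The execute-and-judge loop can be sketched like this. Both model calls are stubbed with canned answers, and `check_compliance` stands in for the LLM judge; none of these names are the tool's real API.

```python
# Hypothetical sketch of stage 3: run a generated test against every model
# under test (MUT) and label each output compliant or non-compliant.
def run_model(model: str, put: str, test_input: str) -> str:
    """Stub: a real implementation would send `put` + `test_input` to `model`."""
    canned = {
        "gpt-4o-mini": "NOUN",           # obeys "return only the tag"
        "llama3.2-1b": "Output: NOUN",   # extra prefix breaks the rule
    }
    return canned.get(model, "NOUN")

def check_compliance(output: str, output_rules: list[str]) -> bool:
    """Stub judge: a real one would ask an LLM whether `output` obeys the
    rules. Here "return only the POS tag" is approximated as a single
    all-caps token with no extra text."""
    return output.isupper() and " " not in output

put = "Reply with only the word's POS tag."
rules = ["return only the POS tag"]
test_input = "Sentence: 'Dogs bark.' Word: 'Dogs'"

labels = {}
for mut in ["gpt-4o-mini", "gemma2-9b", "qwen2.5-3b", "llama3.2-1b"]:
    out = run_model(mut, put, test_input)
    labels[mut] = "compliant" if check_compliance(out, rules) else "non-compliant"
print(labels)  # a non-compliant label counts as a "successful" test
```

In this sketch only the smallest model violates the rule, which mirrors the paper's point: the same prompt and test can pass on one MUT and fail on another.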

The empirical study involved the eight benchmark prompts drawn from real‑world applications (POS tagging, summarization, code generation, etc.). PromptPex's generated tests were compared against a baseline LLM‑based test generator that does not perform specification extraction. Across all prompts and models, PromptPex consistently produced a higher proportion of non‑compliant results—approximately 15–20% more than the baseline. The advantage was especially pronounced when comparing models with differing instruction‑following capabilities; for instance, the larger gpt‑4o‑mini typically obeyed "return only the tag" strictly, whereas smaller models such as llama3.2‑1b often prefixed the tag with extraneous text like "Output:", exposing a portability issue that PromptPex highlighted clearly.
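The comparison above reduces to a simple metric: the fraction of test runs the checker labels non-compliant. A minimal sketch with invented counts chosen to mirror the reported gap (the real per-prompt numbers vary):

```python
# Illustrative metric only: the label counts below are invented to mirror the
# reported ~15-20 point gap, not taken from the paper.
def noncompliance_rate(labels: list[str]) -> float:
    """Fraction of test runs labeled non-compliant by the checker."""
    return labels.count("non-compliant") / len(labels)

promptpex_labels = ["non-compliant"] * 35 + ["compliant"] * 65
baseline_labels  = ["non-compliant"] * 18 + ["compliant"] * 82

gap = noncompliance_rate(promptpex_labels) - noncompliance_rate(baseline_labels)
print(round(gap, 2))  # ~0.17 with these invented counts
```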

Key contributions of the work are: (1) the first systematic approach to automated test generation specifically for LLM prompts, (2) a novel method for extracting formal input/output specifications from natural‑language prompts, and (3) a comprehensive evaluation showing that specification‑driven tests are more effective at uncovering model‑specific failures than naïve test generation.

The paper also acknowledges several limitations. The quality of the extracted specifications depends heavily on the LLM used for extraction; ambiguous or highly complex prompts may lead to incomplete or incorrect IS/OR. PromptPex focuses on negative testing (i.e., finding violations) and does not automatically generate positive tests that confirm correct behavior, which would be necessary for full coverage. The evaluation was limited to four models, all of which are relatively small or open‑source; extending the study to larger commercial models (e.g., GPT‑4, Claude) would be valuable. Finally, integration with continuous integration/continuous deployment (CI/CD) pipelines is left as future work.

In conclusion, PromptPex demonstrates that treating prompts as software artifacts and applying specification‑based testing can substantially improve the reliability of LLM‑driven applications. By automatically extracting constraints, generating targeted tests, and evaluating compliance across multiple models, PromptPex equips developers with actionable insights about prompt robustness, model portability, and potential regression when prompts evolve. Future research directions include improving specification extraction accuracy, expanding test generation to cover positive cases, and embedding PromptPex into real‑world development workflows.

