Evaluating Large Language Models for Detecting Architectural Decision Violations
Architectural Decision Records (ADRs) play a central role in maintaining software architecture quality, yet many decision violations go unnoticed because projects lack both systematic documentation and automated detection mechanisms. Recent advances in Large Language Models (LLMs) open up new possibilities for automating architectural reasoning at scale. We investigated how effectively LLMs can identify decision violations in open-source systems by examining their agreement, accuracy, and inherent limitations. Our study analyzed 980 ADRs across 109 GitHub repositories using a multi-model pipeline in which one LLM performs a primary screening of potential decision violations and three additional LLMs independently validate its reasoning. We assessed agreement, accuracy, precision, and recall, and complemented the quantitative findings with expert evaluation. The models achieved substantial agreement and strong accuracy for explicit, code-inferable decisions, but accuracy drops for implicit or deployment-oriented decisions that depend on runtime configuration or organizational knowledge. LLMs can therefore meaningfully support validation of architectural decision compliance, but they cannot yet replace human expertise for decisions that are not directly reflected in code.
💡 Research Summary
This paper investigates the feasibility of using large language models (LLMs) to automatically detect violations of Architectural Decision Records (ADRs) in open‑source software projects. The authors assembled a dataset of 980 ADRs from 109 GitHub repositories, selecting projects with substantial code bases (top quartile in lines of code) and sufficient commit history. A multi‑model pipeline was designed: a “Large Reasoning Model” (LRM) based on Marco‑o1 extracts decision information, examines relevant source code snippets, and outputs a JSON‑structured verdict on whether the decision is violated, together with a rationale. Three independent validation models—Mistral‑NeMo‑Instruct‑2407, Qwen3‑14B‑Base, and Llama‑3.1‑8B‑Instruct—receive the same inputs and independently assess the LRM’s judgment, providing a measure of inter‑model agreement.
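The screen-then-validate structure described above can be sketched as a small orchestration function. This is an illustrative skeleton, not the authors' implementation: `Verdict`, `run_pipeline`, and the callable signatures are hypothetical names standing in for the actual LRM (Marco‑o1) and validator model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    violated: bool
    rationale: str


def run_pipeline(
    adr_text: str,
    code_excerpt: str,
    primary: Callable[[str, str], Verdict],
    validators: List[Callable[[str, str, Verdict], bool]],
) -> dict:
    """Screen one ADR with the primary model, then have each validator
    model independently confirm or reject the primary verdict."""
    verdict = primary(adr_text, code_excerpt)
    votes = [validate(adr_text, code_excerpt, verdict) for validate in validators]
    return {
        "violated": verdict.violated,
        "rationale": verdict.rationale,
        # Fraction of validators that agree with the primary verdict,
        # giving a per-decision inter-model agreement signal.
        "validator_agreement": sum(votes) / len(votes),
    }
```

In the paper's setup the primary callable would wrap the Marco‑o1 LRM and the three validator callables would wrap Mistral‑NeMo, Qwen3‑14B‑Base, and Llama‑3.1‑8B‑Instruct, each receiving the same ADR and code inputs.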
Prompt engineering employed system messages to define the assistant’s role and required output format, while user messages supplied the ADR content (Decision, Context, Consequence) and code excerpts. Chain‑of‑Thought (CoT) prompting and few‑shot examples guided the models toward step‑by‑step reasoning. The pipeline was executed on the Mahti supercomputer using vLLM and up to four NVIDIA A100 GPUs, enabling large‑scale inference.
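The message layout described above (role-defining system prompt, few-shot examples, ADR fields plus code in the user turn) might be assembled roughly as follows. The exact prompt wording and few-shot pairs are not published with this summary, so the strings below are placeholder assumptions in the common chat-message format.

```python
SYSTEM_PROMPT = (
    "You are a software-architecture reviewer. Given an Architectural "
    "Decision Record and code excerpts, decide whether the code violates "
    'the decision. Answer only with JSON: {"violated": true|false, '
    '"rationale": "..."}.'
)

# One illustrative few-shot pair (hypothetical content).
FEW_SHOT = [
    {"role": "user", "content": "Decision: Use PostgreSQL for persistence.\n"
                                "Code:\nimport sqlite3"},
    {"role": "assistant", "content": '{"violated": true, "rationale": '
                                     '"Code imports sqlite3 instead of a PostgreSQL driver."}'},
]


def build_messages(decision: str, context: str, consequence: str,
                   code_excerpt: str) -> list[dict]:
    """Assemble a chat prompt: system role, few-shot examples, then the
    ADR fields and code with a Chain-of-Thought instruction."""
    user = (
        f"Decision: {decision}\n"
        f"Context: {context}\n"
        f"Consequence: {consequence}\n"
        f"Code:\n{code_excerpt}\n"
        "Think step by step, then give the JSON verdict."
    )
    return [{"role": "system", "content": SYSTEM_PROMPT},
            *FEW_SHOT,
            {"role": "user", "content": user}]
```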
Human validation involved three software‑architecture experts. Each expert reviewed the LRM's output and the three validators' responses for a stratified sample, indicating agreement or disagreement with the LRM's verdict and noting the majority among the validators. A majority vote among the three experts resolved any disagreements, establishing a ground‑truth label set.
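Resolving the three expert labels into one ground-truth label is a plain majority vote; a minimal sketch (function name assumed, not from the paper):

```python
from collections import Counter


def majority_vote(labels: list[str]) -> str:
    """Return the label chosen by most raters.

    With three raters and two possible labels there is always a strict
    majority; Counter breaks any hypothetical tie by first-seen order."""
    winner, _ = Counter(labels).most_common(1)[0]
    return winner
```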
Quantitative results show that the LRM achieved an overall accuracy of 92 % (precision ≈ 90 %, recall ≈ 91 %). Agreement between the LRM and the validators was high, with Cohen’s κ = 0.78 and an 88 % pairwise match rate, indicating substantial consistency across models. Performance varied by decision type: decisions that are explicitly reflected in code (e.g., “do not use library X”, “implement interface Y”) were detected with >95 % accuracy, while decisions that depend on deployment configuration, organizational policies, or cross‑module interactions suffered a sharp drop (accuracy 60–65 %). The taxonomy of failure cases highlights three main categories: (1) implicit decisions requiring external environment knowledge, (2) organization‑specific policies not encoded in source files, and (3) multi‑module architectural constraints that exceed the local code context.
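The reported metrics are standard and easy to verify independently. The sketch below computes accuracy, precision, recall, and Cohen's κ from raw binary labels (1 = violation flagged); it is a generic reimplementation for illustration, not the authors' evaluation code.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Accuracy, precision, and recall for binary labels (1 = violation)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n)      # expected by chance
              for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)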
The authors discuss several threats to validity: reliance on standardized ADR templates limits generalizability to non‑template records; some repositories were not up‑to‑date, potentially misaligning code with recorded decisions; and LLMs lack direct access to runtime or infrastructure metadata, which hampers detection of deployment‑oriented violations.
Key contributions are: (1) a large‑scale empirical evaluation of four state‑of‑the‑art LLMs for ADR violation detection, (2) a novel multi‑model validation pipeline that combines retrieval‑augmented detection with independent LLM validators and expert review, and (3) a taxonomy of decision types and failure modes that informs where current LLMs excel and where they fall short.
The paper concludes that LLMs can meaningfully support architects by automatically flagging code‑visible ADR violations, especially when decisions are clearly expressed and directly observable in source code. However, for decisions that rely on deployment settings, organizational conventions, or broader system context, human expertise remains indispensable. Future work should explore integrating external configuration data, fine‑tuning models on domain‑specific ADR corpora, and hybrid approaches that combine LLM reasoning with traditional static analysis tools to bridge the identified gaps.