Leveraging Large Language Models for Trustworthiness Assessment of Web Applications
The widespread adoption of web applications has made their security a critical concern and has increased the need for systematic ways to assess whether they can be considered trustworthy. Trust assessment, however, remains an open problem: existing techniques primarily focus on detecting known vulnerabilities or depend on manual evaluation, which limits their scalability. Evaluating adherence to secure coding practices therefore offers a complementary, pragmatic perspective centered on observable development behaviors. In practice, identifying and verifying secure coding practices is still done predominantly by hand, relying on expert knowledge and code reviews, which is time-consuming, subjective, and difficult to scale. This study presents an empirical methodology to automate the trustworthiness assessment of web applications by leveraging Large Language Models (LLMs) to verify adherence to secure coding practices. We conduct a comparative analysis of prompt engineering techniques across five state-of-the-art LLMs, ranging from baseline zero-shot classification to prompts enriched with semantic definitions, structural context derived from call graphs, and explicit instructional guidance. Furthermore, we propose an extension of a hierarchical Quality Model (QM) based on the Logic Score of Preference (LSP), in which LLM outputs populate the model’s quality attributes and are aggregated into a holistic trustworthiness score. Experimental results indicate that excessive structural context can introduce noise, whereas rule-based instructional prompting improves assessment reliability. The resulting trustworthiness score discriminates between secure and vulnerable implementations, supporting the feasibility of using LLMs for scalable, context-aware trust assessment.
💡 Research Summary
The paper addresses the challenge of scaling security assessments for web applications by automating the evaluation of secure coding practices using large language models (LLMs). While traditional security tools focus on detecting known vulnerabilities, they often miss the preventive dimension of ensuring that developers follow established secure coding guidelines. Building on the OWASP Input Validation practices and the Logic Score of Preference (LSP) quality model introduced by Lemes et al., the authors propose a systematic five‑step methodology that (1) prepares a benchmark dataset, (2) designs multiple prompt strategies, (3) selects a diverse set of LLMs, (4) evaluates model outputs against manually curated ground truth, and (5) aggregates the results into a holistic trustworthiness score.
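The five steps can be pictured as a simple driver loop. The paper does not publish an implementation, so all names and signatures below are illustrative; the helpers for prompt building, LLM querying, and aggregation are passed in as callables.

```python
def assess_trustworthiness(functions, practices, model,
                           build_prompt, query_llm, aggregate):
    """Sketch of the five-step pipeline (names are hypothetical).

    functions: list of (name, source_code) tuples from the benchmark.
    """
    decisions = {}
    for name, code in functions:                  # step 1: benchmark functions
        for practice in practices:                # step 2: one prompt per function-practice pair
            prompt = build_prompt(code, practice)
            decisions[(name, practice)] = query_llm(model, prompt)  # step 3: chosen LLM
    # Step 4 (comparison against the manually curated ground truth) happens
    # offline; step 5 folds the per-practice decisions into one trust score.
    return aggregate(decisions)
```

Keeping the LLM call behind a callable makes it easy to swap models (step 3) without touching the rest of the loop.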
The experimental platform uses the public WSVD‑Bench dataset, which contains 42 Java service functions annotated with 16 OWASP input‑validation practices. For each function‑practice pair the authors assess applicability and adherence. Four prompt variants are explored: (P1) a minimal zero‑shot query, (P2) a query enriched with CWE identifiers and negative examples, (P3) a query that adds structural context derived from the function’s call graph, and (P4) a rule‑based prompt that explicitly instructs the model to answer “NA” if the practice does not apply, otherwise “1” for compliance or “0” for violation.
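The constrained-label rules of P4 (answer "NA", "1", or "0") are the only part of the prompt the summary spells out; the surrounding wording in this sketch is an assumption, not the authors' template.

```python
def build_p4_prompt(function_code: str, practice: str) -> str:
    """Compose a rule-based (P4-style) prompt with a constrained label set.

    Only the NA/1/0 rule comes from the paper; the framing text is made up.
    """
    return (
        "You are a secure-coding auditor.\n"
        f"Practice under review: {practice}\n"
        "Rules:\n"
        "- Answer 'NA' if the practice does not apply to this function.\n"
        "- Answer '1' if the function complies with the practice.\n"
        "- Answer '0' if the function violates the practice.\n"
        "Respond with exactly one of: NA, 1, 0.\n\n"
        f"Function:\n{function_code}\n"
    )

prompt = build_p4_prompt('String q = req.getParameter("id");',
                         "Validate input length before use")
```

P1-P3 would reuse the same skeleton, dropping the rules (P1) or prepending CWE definitions (P2) and call-graph context (P3).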
Five state‑of‑the‑art LLMs are evaluated under deterministic settings (temperature = 0.2): OpenAI’s gpt‑3.5‑turbo, gpt‑4o‑mini, gpt‑4.1‑mini, gpt‑4.1, and Google’s gemini‑2.5‑flash. Performance is measured with binary classification metrics (precision, recall, F1‑score, accuracy) and regression error (MAE) for the trustworthiness score.
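The reported metrics are standard and can be computed from scratch; the label and score values below are illustrative, not taken from the benchmark.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for 0/1 adherence labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

def mae(scores_true, scores_pred):
    """Mean absolute error between true and predicted trust scores."""
    return (sum(abs(t - p) for t, p in zip(scores_true, scores_pred))
            / len(scores_true))
```

In practice one would use scikit-learn's equivalents; the hand-rolled versions just make the definitions explicit.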
Key findings include:
- Baseline prompts (P1) achieve modest performance (average F1 ≈ 0.62), confirming that raw LLM knowledge is insufficient for precise practice detection.
- Adding semantic cues (P2) improves F1 to ≈ 0.71, indicating that CWE references help the model focus on security‑relevant concepts.
- Supplying call‑graph context (P3) unexpectedly degrades performance (F1 ≈ 0.68) because excessive structural information introduces noise and distracts the model from the core semantic task.
- The rule‑based prompt (P4) yields the best results across all models (average F1 ≈ 0.87, MAE ≈ 0.12). Explicit instruction to output a constrained set of labels reduces hallucination and improves consistency.
- Among the models, the high‑capacity gpt‑4.1 consistently outperforms the others, but the cost‑effective gpt‑3.5‑turbo still reaches acceptable performance (F1 ≈ 0.80) when paired with P4, suggesting practical viability for CI/CD pipelines.
For the trustworthiness assessment, the authors extend the hierarchical quality model of Lemes et al. Each leaf node corresponds to one of the 16 practices, weighted by empirical vulnerability frequencies. LLM‑derived adherence decisions are fed into the LSP aggregation operators, producing a normalized score between 0 and 1. In the benchmark, secure variants (Vx0) receive an average score of 0.92, while fully vulnerable variants (VxA) score 0.41, achieving an ROC‑AUC of 0.94. This demonstrates that the LLM‑driven pipeline can reliably discriminate between trustworthy and risky implementations.
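The LSP operators mentioned above are built from weighted power means. The sketch below shows that basic building block with made-up weights and adherence values; the actual operator choices and vulnerability-frequency weights of the Lemes et al. model are not reproduced here, and dropping "NA" practices before aggregation is an assumption of this sketch.

```python
import math

def weighted_power_mean(values, weights, r):
    """Weighted power mean, the core LSP aggregator.

    r < 1 behaves conjunctively (and-like), r > 1 disjunctively (or-like);
    r -> 0 recovers the weighted geometric mean.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    if r == 0:
        return math.exp(sum(w * math.log(v) for w, v in zip(weights, values)))
    return sum(w * v ** r for w, v in zip(weights, values)) ** (1.0 / r)

# Leaf scores: LLM adherence decisions mapped to [0, 1]; 'NA' practices
# are dropped before aggregation. Weights are hypothetical.
adherence = [1.0, 1.0, 0.0, 1.0]   # 1 = compliant, 0 = violated
weights = [0.4, 0.3, 0.2, 0.1]     # illustrative vulnerability-frequency weights
score = weighted_power_mean(adherence, weights, r=0.5)  # mildly conjunctive
```

With a conjunctive exponent, a single violated high-weight practice pulls the score down more than a plain weighted average would, which matches the intuition that trustworthiness requires simultaneous adherence.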
The paper discusses several practical considerations. Prompt engineering emerges as the most critical factor; concise, rule‑based prompts outperform richer but noisier context. Model selection balances accuracy against latency and API cost. Limitations include the inherent stochasticity of LLMs (risk of hallucinations), dependence on the quality of the underlying dataset, and the need for manual labeling when extending beyond input validation to other OWASP categories such as authentication or cryptography. Threats to validity are acknowledged (dataset bias, prompt sensitivity, model version drift).
In conclusion, the study validates that LLMs can be harnessed to automate secure‑coding‑practice assessment and to compute a quantitative trustworthiness score for web applications. The approach bridges the gap between vulnerability detection and preventive coding quality assurance, offering a scalable solution for modern DevSecOps environments. Future work will explore broader OWASP domains, automated prompt optimization, and integration of the pipeline into continuous integration workflows for real‑time security feedback.