Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection
The Uniform Resource Locator (URL), introduced in a connectivity-first era to define access and locate resources, remains historically limited: it lacks future-proof mechanisms for security, trust, or resilience against fraud and abuse, despite reactive protections such as HTTPS introduced during the cybersecurity era. In the current AI-first threatscape, deceptive URLs have reached unprecedented sophistication, driven by cybercriminals' widespread use of generative AI and an AI-vs-AI arms race that produces context-aware phishing websites and URLs virtually indistinguishable to both users and traditional detection tools. Although AI-generated phishing accounted for a small fraction of filter-bypassing attacks in 2024, phishing volume has escalated over 4,000% since 2022, with nearly 50% more attacks evading detection. With the threatscape escalating and phishing tactics emerging faster than labeled data can be produced, zero-shot and few-shot learning with large language models (LLMs) offers a timely and adaptable solution, enabling generalization with minimal supervision. Given the critical importance of phishing URL detection in large-scale cybersecurity defense systems, we present a comprehensive benchmark of LLMs under a unified zero-shot and few-shot prompting framework and reveal operational trade-offs. Our evaluation uses a balanced dataset with consistent prompts, offering detailed analysis of performance, generalization, and model efficacy, quantified by accuracy, precision, recall, F1 score, AUROC, and AUPRC, to reflect both classification quality and practical utility in threat detection settings. We conclude that few-shot prompting improves performance across multiple LLMs.
💡 Research Summary
The paper addresses the growing challenge of AI‑generated phishing URLs by investigating whether large language models (LLMs) can reliably detect malicious links with minimal supervision. The authors benchmark three proprietary, instruction‑tuned LLMs—OpenAI’s GPT‑4o, Anthropic’s Claude‑3.7‑sonnet‑20250219, and xAI’s Grok‑3‑Beta—under both zero‑shot and few‑shot prompting regimes.
Data preparation uses the publicly available PhiUSIIL phishing URL dataset. For the balanced evaluation, 5,000 phishing and 5,000 legitimate URLs are randomly sampled (seed 42) to form a 10,000‑sample test set. For the imbalanced evaluation, two 1,000‑sample test sets are constructed with phishing prevalence of 1 % and 10 %, each generated with two different random seeds (S123, S456) to assess stability.
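The sampling procedure above can be sketched as follows. This is a minimal reconstruction, not the authors' code: the column names (`URL`, `label`) and the label convention (0 = phishing, 1 = legitimate, matching the prompt described below) are assumptions about the PhiUSIIL CSV layout.

```python
import pandas as pd

def sample_balanced(df, n_per_class=5000, seed=42, label_col="label"):
    """Draw an equal number of phishing (0) and legitimate (1) URLs,
    then shuffle, to form the balanced 10,000-sample test set."""
    phish = df[df[label_col] == 0].sample(n=n_per_class, random_state=seed)
    legit = df[df[label_col] == 1].sample(n=n_per_class, random_state=seed)
    both = pd.concat([phish, legit])
    return both.sample(frac=1, random_state=seed).reset_index(drop=True)

def sample_imbalanced(df, n_total=1000, phish_rate=0.01, seed=123,
                      label_col="label"):
    """Draw a test set with a fixed phishing prevalence (e.g. 1% or 10%);
    the paper repeats this with two seeds (123, 456) to assess stability."""
    n_phish = int(n_total * phish_rate)
    phish = df[df[label_col] == 0].sample(n=n_phish, random_state=seed)
    legit = df[df[label_col] == 1].sample(n=n_total - n_phish,
                                          random_state=seed)
    both = pd.concat([phish, legit])
    return both.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Fixing `random_state` for both the per-class draw and the final shuffle is what makes the seeds (42, 123, 456) reproducible across runs.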
Prompt design follows a simple, model‑agnostic template: a system instruction (“You are a cybersecurity expert. Respond only with 0 for phishing or 1 for legitimate.”) followed by a query (“URL: {u} Is this URL phishing or legitimate? Respond with 0 or 1.”). In the few‑shot setting, six exemplars (three phishing, three legitimate) are appended, each formatted as “URL: {u’} Answer: {y’}”. For GPT‑4o and Grok‑3‑Beta the instruction is sent as a system message, examples as separate user messages, and the final query as the last user message; Claude‑3.7‑sonnet concatenates everything with double newlines. All API calls use temperature 0 and a maximum of 10 output tokens; unparsable responses are discarded.
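The template above is simple enough to reconstruct directly. The sketch below is an illustration under the summary's description, not the authors' released code; the helper names are hypothetical, and the exact whitespace in the query string is an assumption.

```python
SYSTEM = ("You are a cybersecurity expert. "
          "Respond only with 0 for phishing or 1 for legitimate.")

QUERY = "URL: {u} Is this URL phishing or legitimate? Respond with 0 or 1."

def build_messages(url, shots=None):
    """Chat-style layout (GPT-4o / Grok-3-Beta): system instruction,
    one user message per exemplar, then the final query."""
    messages = [{"role": "system", "content": SYSTEM}]
    for ex_url, ex_label in (shots or []):
        messages.append({"role": "user",
                         "content": f"URL: {ex_url} Answer: {ex_label}"})
    messages.append({"role": "user", "content": QUERY.format(u=url)})
    return messages

def build_claude_prompt(url, shots=None):
    """Claude-3.7-sonnet variant: everything concatenated with double
    newlines into a single prompt string."""
    parts = [SYSTEM]
    for ex_url, ex_label in (shots or []):
        parts.append(f"URL: {ex_url} Answer: {ex_label}")
    parts.append(QUERY.format(u=url))
    return "\n\n".join(parts)

def parse_response(text):
    """Map a model reply to a label; anything other than a bare
    '0' or '1' is treated as unparsable and discarded (returns None)."""
    t = text.strip()
    return int(t) if t in {"0", "1"} else None
```

In the few-shot setting, `shots` would hold the six exemplars (three phishing, three legitimate); the zero-shot setting simply passes `shots=None`.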
Six evaluation metrics are reported: accuracy and macro‑averaged precision, recall, and F1‑score, together with AUROC and AUPRC, all computed with scikit‑learn. The balanced test results (Table 1) show that few‑shot prompting consistently improves performance across all models. Grok‑3‑Beta achieves the highest few‑shot accuracy (0.9405), precision (0.9492), F1 (0.9399), AUROC (0.9405), and AUPRC (0.9573), though its recall drops slightly from 0.9735 (zero‑shot) to 0.9307 (few‑shot). Claude‑3.7‑sonnet records the best recall (0.9526) in the few‑shot setting but lags in precision (0.9027). GPT‑4o shows steady gains but remains behind the other two models. Confusion matrices (Figure 2) confirm that few‑shot prompting reduces false negatives dramatically, especially for Grok‑3‑Beta (from 950 to 248).
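A minimal sketch of the six-metric evaluation using scikit-learn, assuming the label convention from the prompt (0 = phishing, 1 = legitimate). With only hard 0/1 predictions available, the predictions themselves serve as scores for AUROC/AUPRC, which reduces to a single-threshold evaluation; whether the paper does exactly this or uses separate confidence scores is not stated in the summary.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             average_precision_score)

def evaluate(y_true, y_pred, y_score=None):
    """Return the six metrics reported in the paper.

    precision/recall/F1 are macro-averaged over the two classes;
    if no continuous scores are given, hard predictions stand in."""
    scores = y_score if y_score is not None else y_pred
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "auroc":     roc_auc_score(y_true, scores),
        "auprc":     average_precision_score(y_true, scores),
    }
```

Responses discarded as unparsable would simply be dropped from `y_true`/`y_pred` before this call, which is one reason the reported sample counts can differ slightly across models.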
ROC and precision‑recall curves (Figures 3‑5) further illustrate that few‑shot prompting lifts the entire curve, indicating better discrimination across thresholds. The imbalanced experiments (not fully detailed in the excerpt) reveal that few‑shot prompting mitigates the drop in recall that typically occurs with rare‑class scenarios, and that increasing the number of exemplars (1, 3, 9) yields incremental improvements.
The authors contribute (1) a unified benchmark for zero‑ and few‑shot LLM‑based phishing URL detection, (2) a balanced public dataset with standardized prompts, and (3) an open‑source codebase for reproducibility. Limitations include reliance on English prompts, lack of latency or cost analysis for real‑time deployment, and the use of randomly selected exemplars without studying exemplar quality or selection strategies. Future work should explore multilingual prompting, optimal exemplar selection, on‑premise LLM deployment to control inference costs, and integration with streaming detection pipelines. Overall, the study demonstrates that few‑shot prompting can substantially enhance LLM performance for phishing URL detection, offering a viable, low‑label‑requirement alternative to traditional supervised models.