Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce *Minerva*, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.


💡 Research Summary

The paper introduces Minerva, a framework for training large language models (LLMs) to perform cyber‑threat‑intelligence (CTI) tasks with higher accuracy and robustness by using reinforcement learning with verifiable rewards (RLVR). CTI analysts must map noisy, unstructured artifacts (vulnerability descriptions, incident narratives, detection rules) onto standardized identifiers and schemas such as MITRE ATT&CK, CWE, CVSS, and STIX/TAXII. These identifiers are deterministic: a model’s output can be programmatically checked for exact match, partial match, or set overlap. The authors exploit this property to replace learned human-preference reward models (as in RLHF) with lightweight, deterministic verifiers that assign a scalar reward to each structured output.
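To make the idea concrete, here is a minimal sketch of the kind of deterministic verifier the summary describes, scoring a predicted set of identifiers (e.g., MITRE ATT&CK technique IDs) against a gold set via exact match or set overlap. The function name and F1-style scoring are illustrative assumptions, not the paper's actual implementation.

```python
def verify_identifiers(predicted: set[str], gold: set[str]) -> float:
    """Deterministic verifier sketch: return a scalar reward in [0, 1].

    Exact match earns the full reward; otherwise the reward is the F1
    overlap between predicted and gold identifier sets. This is an
    illustrative reward shape, not the paper's exact formula.
    """
    if predicted == gold:
        return 1.0  # exact match
    if not predicted or not gold:
        return 0.0  # nothing to compare
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)  # partial credit
```

Because the reward is computed by code rather than a learned preference model, the same trajectory always receives the same score, which is what makes RLVR-style training feasible for structured CTI outputs.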

