DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updating. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithm (DGA) datasets. To assess the performance of DomURLs_BERT, we conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments' source code are publicly available.
💡 Research Summary
The paper introduces DomURLs_BERT, a BERT‑based encoder pre‑trained specifically for malicious domain and URL detection and classification. Recognizing the limitations of traditional blacklist and heuristic approaches—namely their inability to keep up with rapidly evolving, obfuscated threats—the authors propose a self‑supervised, transformer‑based solution that learns directly from raw URL and domain strings.
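The masked language modeling objective mentioned above can be illustrated with a minimal, self-contained sketch: a fraction of the input tokens is replaced with a mask symbol, and the model is trained to recover the originals. This toy function is an assumption-laden simplification (the actual DomURLs_BERT pre-training uses a subword tokenizer and the standard BERT masking scheme), shown only to convey the idea; the `MASK` symbol and `mask_prob` default are illustrative choices, not values from the paper.

```python
import random

MASK = "[MASK]"  # illustrative mask symbol, not necessarily the paper's token


def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; the model is trained to predict the masked originals.

    Returns (masked_tokens, labels), where labels[i] holds the original token
    at masked positions and None elsewhere (positions ignored in the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # prediction target for the model
        else:
            masked.append(tok)
            labels.append(None)  # no loss computed here
    return masked, labels
```

Because the corpus consists of raw URL and domain strings, no manual labels are needed for this stage: the training signal comes entirely from reconstructing the masked pieces.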
To build the pre‑training corpus, the authors aggregate data from six large public sources: the multilingual mC4 corpus, Falcon‑refinedWeb, the CBA web‑tracking dataset, the Tranco top‑1M list, and two DGA‑focused collections (UTL_DGA22 and UMUDGA). After deduplication and cleaning, the final corpus comprises 375,057,861 training samples and 19,739,888 development samples, covering a wide spectrum of languages, URL structures, and malicious patterns.
The preprocessing pipeline strips the scheme (e.g., “http://”), splits each URL into a domain part and a path part, and inserts special tokens to delimit the two components before tokenization.
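The pipeline described above can be sketched roughly as follows. This is a hedged approximation built on Python's standard `urllib.parse`; the marker tokens `[DOMAIN]` and `[PATH]` are hypothetical placeholders, since the paper's actual special-token names are not given in this summary.

```python
from urllib.parse import urlsplit

# Hypothetical marker tokens -- the paper's actual special tokens may differ.
DOMAIN_TOKEN = "[DOMAIN]"
PATH_TOKEN = "[PATH]"


def preprocess_url(url: str) -> str:
    """Strip the scheme, split into domain and path parts, insert marker tokens."""
    # urlsplit only populates netloc when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    domain = parts.netloc
    path = parts.path or ""
    if parts.query:
        path += "?" + parts.query
    return f"{DOMAIN_TOKEN} {domain} {PATH_TOKEN} {path}".strip()
```

For example, `preprocess_url("http://example.com/login?id=1")` yields a single string in which the scheme is gone and the domain and path segments are explicitly marked, which is the form the encoder would then tokenize.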