Gravity Falls: A Comparative Analysis of Domain-Generation Algorithm (DGA) Detection Methods for Mobile Device Spearphishing
Mobile devices are frequent targets of eCrime threat actors through SMS spearphishing (smishing) links that leverage Domain Generation Algorithms (DGAs) to rotate hostile infrastructure. Despite this, DGA research and evaluation largely emphasize malware C2 and email phishing datasets, leaving limited evidence on how well detectors generalize to smishing-driven domain tactics outside enterprise perimeters. This work addresses that gap by evaluating traditional and machine-learning DGA detectors against Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. Gravity Falls captures a single threat actor’s evolution across four technique clusters, shifting from short randomized strings to dictionary concatenation and themed combo-squatting variants used for credential theft and fee/fine fraud. Two string-analysis approaches (Shannon entropy and Exp0se) and two ML-based detectors (an LSTM classifier and COSSAS DGAD) are assessed using Top-1M domains as benign baselines. Results are strongly tactic-dependent: performance is highest on randomized-string domains but drops on dictionary concatenation and themed combo-squatting, with low recall across multiple tool/cluster pairings. Overall, both traditional heuristics and recent ML detectors are ill-suited to the consistently evolving DGA tactics observed in Gravity Falls, motivating more context-aware approaches and providing a reproducible benchmark for future evaluation.
💡 Research Summary
The paper addresses a notable blind spot in the domain‑generation‑algorithm (DGA) research landscape: the detection of malicious domains used in SMS‑based phishing (smishing) attacks targeting mobile users outside corporate perimeters. While most prior work evaluates DGA detectors on malware command‑and‑control (C2) or email‑phishing datasets, this study introduces “Gravity Falls,” a semi‑synthetic dataset compiled from smishing links observed between 2022 and 2025. Gravity Falls captures a single threat actor’s evolution across four technique clusters, each reflecting a distinct DGA tactic:
- Cats Cradle (2022) – short, fully random strings (5‑8 characters) with common TLDs.
- Double Helix (2023) – concatenation of two dictionary words, often using newer gTLDs.
- Pandoras Box (2024) – “combo‑squatting” where brand or logistics keywords are split between sub‑domain and domain, appended with short random suffixes.
- Easy Rider (2025) – government- or toll-related themed combo-squatting (e.g., DMV, EZ-Pass) with similar random suffixes.
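The four tactics can be illustrated with a minimal generator sketch. This is not the threat actor's actual algorithm: the wordlist, brand/theme tokens, and suffix lengths are invented for illustration, based only on the cluster descriptions above.

```python
import random
import string

random.seed(7)  # reproducible illustration

# Illustrative wordlist; the real campaign's dictionary is unknown.
WORDS = ["parcel", "secure", "track", "update", "delivery", "portal"]

def cats_cradle():
    """Cats Cradle style: short fully random string (5-8 chars), common TLD."""
    n = random.randint(5, 8)
    label = "".join(random.choices(string.ascii_lowercase + string.digits, k=n))
    return label + ".com"

def double_helix():
    """Double Helix style: two dictionary words concatenated, newer gTLD."""
    return "".join(random.sample(WORDS, 2)) + ".top"

def pandoras_box(brand="usps"):
    """Pandoras Box style: brand token in the subdomain, keyword plus
    short random suffix in the registered domain (combo-squatting)."""
    suffix = "".join(random.choices(string.ascii_lowercase, k=4))
    return f"{brand}.track-{suffix}.com"

def easy_rider(theme="ezpass"):
    """Easy Rider style: government/toll theme with a short random suffix."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=4))
    return f"{theme}-pay{suffix}.com"
```

Note how the later clusters keep only a small random component, which is exactly what pushes them below the entropy-based detection thresholds discussed later.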
Each cluster contains 10 000 malicious samples, balanced with 10 000 benign domains drawn from four major Top-1M lists (Alexa, Cisco, Cloudflare, Majestic). The benign lists are treated as a static baseline, a common practice in DGA evaluation, despite the acknowledged risk that such lists occasionally contain mislabeled (in fact malicious) entries.
The authors evaluate four detection approaches: two traditional string‑analysis heuristics (Shannon entropy and Exp0se, which combines entropy, consonant count, and length thresholds) and two machine‑learning models (a pre‑trained LSTM classifier that one‑hot encodes TLDs, and COSSAS DGAD, a Temporal Convolutional Network trained on Shadowserver data). All tools run with default parameters on an Ubuntu 24.04 VM; results are aggregated into precision, accuracy, and recall metrics.
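The two string-analysis heuristics can be sketched in a few lines. The Shannon entropy computation is standard; the Exp0se-style rule below is a hypothetical reconstruction (the thresholds and the exact way entropy, consonant runs, and length are combined are illustrative assumptions, not the tool's actual parameters).

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Character-frequency Shannon entropy of a string, in bits."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def expose_like_flag(domain, entropy_thresh=3.0, consonant_thresh=4, min_len=8):
    """Hypothetical Exp0se-style rule combining entropy, the longest
    consonant run, and a minimum label length. Thresholds are illustrative."""
    label = domain.split(".")[0]
    run = best = 0
    for ch in label:
        run = run + 1 if ch in "bcdfghjklmnpqrstvwxz" else 0
        best = max(best, run)
    return bool(len(label) >= min_len and
                (shannon_entropy(label) >= entropy_thresh
                 or best >= consonant_thresh))
```

A random label such as `xk9qw3zr` scores higher entropy than a dictionary concatenation like `parcelsecure`, which previews the tactic-dependent results below: word-like labels slip under entropy thresholds even though they are algorithmically generated.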
Key Findings
- Tactic‑dependent performance – All detectors excel on the fully random Cats Cradle cluster (precision and recall often > 90%). Performance collapses on Double Helix, where dictionary concatenation reduces entropy, and further degrades on the combo‑squatting clusters (Pandoras Box, Easy Rider), where recognizable brand or government tokens mask the random component.
- Traditional heuristics – Shannon entropy reliably flags high‑entropy strings but fails when entropy drops due to word‑like structures. Exp0se improves on Cats Cradle and shows modest gains on combo‑squatting, yet remains weak on Double Helix.
- Machine‑learning models – The LSTM and DGAD models, despite being trained on large public DGA corpora, generalize only to the random‑string pattern. Their recall on Double Helix falls below 30%, and on the later clusters it stays under 40%, indicating limited adaptability to mixed‑token domains.
- False‑positive pressure – Combo‑squatting tactics that embed popular brand or government keywords also raise false positives: benign Top‑1M domains containing similar tokens get flagged, highlighting the difficulty of separating malicious from benign lexical overlap.
Implications
The study demonstrates that current DGA detection pipelines, whether heuristic or neural, are ill‑suited for the evolving tactics observed in smishing campaigns. Attackers can deliberately blend dictionary words, brand names, and short random suffixes to evade high‑entropy filters while still maintaining enough randomness to avoid simple blacklist matching. Consequently, defenders cannot rely on strong detection scores obtained on traditional malware‑focused datasets as evidence of robustness against mobile‑centric phishing.
Proposed Directions
- Multimodal feature integration – Incorporate WHOIS metadata, passive DNS histories, and URL‑page context (e.g., redirection chains, visual snapshots) alongside lexical features to provide richer signals.
- Online/continual learning – Deploy adaptive models that ingest newly observed smishing domains in near‑real‑time, allowing rapid adjustment to emerging token combinations.
- Smishing‑specific training data – Curate labeled datasets that explicitly contain combo‑squatting and dictionary‑concatenation patterns, ensuring that ML models see representative examples during training.
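The online/continual-learning direction can be sketched as a per-example update loop. The sketch below is a generic online logistic-regression learner over hashed character trigrams, not anything proposed concretely in the paper; the feature dimension, learning rate, and hashing scheme are all illustrative choices.

```python
import hashlib
import math

DIM = 2048  # hashed feature space (illustrative size)

def featurize(domain):
    """Hash character trigrams of the domain into a sparse count vector."""
    feats = {}
    s = f"^{domain}$"  # boundary markers so prefixes/suffixes are distinct
    for i in range(len(s) - 2):
        idx = int(hashlib.md5(s[i:i + 3].encode()).hexdigest(), 16) % DIM
        feats[idx] = feats.get(idx, 0) + 1
    return feats

class OnlineDGAClassifier:
    """Minimal online logistic regression, updated one domain at a time,
    standing in for the adaptive models the authors call for."""
    def __init__(self, lr=0.1):
        self.w = [0.0] * DIM
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, feats):
        z = self.b + sum(self.w[i] * v for i, v in feats.items())
        z = max(-30.0, min(30.0, z))  # clip to avoid exp overflow
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, domain, label):
        """Single SGD step on the log-loss for one (domain, label) pair."""
        feats = featurize(domain)
        err = label - self.predict_proba(feats)
        self.b += self.lr * err
        for i, v in feats.items():
            self.w[i] += self.lr * err * v
```

Because `partial_fit` processes one observation at a time, such a model could in principle ingest newly reported smishing domains as they arrive, which is the near-real-time adaptation the proposal envisions.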
Contribution to the Community
Gravity Falls is released publicly (doi:10.5281/zenodo.17624554) together with collection scripts and replication instructions on GitHub. By providing a benchmark that reflects realistic mobile‑phishing DGA tactics, the authors enable future work to evaluate and improve detection methods beyond the traditional malware‑centric scope.
In summary, the paper highlights a critical gap in DGA detection for mobile smishing, empirically shows the inadequacy of existing heuristics and ML models across evolving DGA tactics, and calls for context‑aware, continuously‑trained solutions while supplying a reproducible dataset for the research community.