SynthChain: A Synthetic Benchmark and Forensic Analysis of Advanced and Stealthy Software Supply Chain Attacks
Advanced software supply chain (SSC) attacks are increasingly runtime-only and leave fragmented evidence across hosts, services, and build/dependency layers, so any single telemetry stream is inherently insufficient to reconstruct full compromise cha…
Authors: Zhuoran Tan, Wenbo Guo, Taylor Brierley
SynthChain: A Synthetic Benchmark and Forensic Analysis of Advanced and Stealthy Software Supply Chain Attacks

Zhuoran Tan (University of Glasgow), Wenbo Guo (Nanyang Technological University), Taylor Brierley (JUMPSEC Ltd), Jiewen Luo (Royal Holloway, University of London), Jeremy Singer (University of Glasgow), Christos Anagnostopoulos (University of Glasgow)

Abstract

Advanced software supply chain (SSC) attacks are increasingly runtime-only and leave fragmented evidence across hosts, services, and build/dependency layers, so any single telemetry stream is inherently insufficient to reconstruct full compromise chains under realistic access and budget limits. We present SynthChain, a near-production testbed and a multi-source runtime dataset with chain-level ground truth, derived from real-world malicious packages and exploit campaigns. SynthChain covers seven representative supply-chain exploit scenarios across PyPI, npm, and a native C/C++ supply-chain case, spanning Windows and Linux, and involving four hosts and one containerized environment. Scenarios span realistic time windows from minutes to hours and are annotated with 14 MITRE ATT&CK tactics and 161 techniques (29–104 techniques per scenario). Beyond releasing the data, we quantify observability constraints by mapping each chain step to the minimum evidence needed for detection and cross-source correlation. With realistic trace availability, no single source is chain-complete: the best single source reaches only 0.391 weighted tag/step coverage and 0.403 mean chain reconstruction. Even minimal two-source fusion boosts coverage to 0.636 and reconstruction to 0.639 (≈ 1.6× gain), with consistent chain coverage/recall improvements (0.545). The corpus contains approximately 0.58M raw multi-source events and 1.50M evaluation rows, enabling controlled studies of detection under constrained telemetry.
We release the dataset, ground truth, and artifacts to support reproducible, forensic-aware runtime defenses and to guide efficient detection for software supply chains.

1 Introduction

SSC compromise has become a high-leverage vector for modern adversaries and is ranked among OWASP's top three threats in 2025 [42]. While Mandiant's 2025 analysis attributes only 0.2% of initial intrusions directly to supply-chain compromise [31], the downstream blast radius is often disproportionate: recent incidents show how small footholds can escalate into outsized consequences [1, 28]. Contemporary threat intelligence further indicates that SSC campaigns are no longer isolated package-tampering events, but increasingly manifest as advanced, multi-stage, stealthy operations, including targeted manipulation of AI/ML ecosystems [47]. Echoing this shift, CrowdStrike's 2025 threat-hunting report highlights cross-domain hands-on-keyboard activity that abuses trusted developer relationships, cloud control planes, and automation [7].

These trends expose a fundamental gap: advanced supply-chain attacks unfold across multiple stages and extend beyond source code alone, and the evidence needed to detect and reconstruct them is fragmented across heterogeneous telemetry sources, much of it generated at runtime. Yet existing research and benchmarks predominantly emphasize static artifact inspection (e.g., malicious package discovery) [11, 16] or dynamic analysis in limited sandbox settings [41, 55]. As a result, available datasets offer only partial visibility: static corpora lack execution semantics and post-compromise workflows, while many dynamic datasets rely on relatively simple triggers and do not model the multi-stage, cross-environment distributed behaviors typical of modern supply-chain intrusions.
Consequently, practitioners and researchers are often forced to study supply-chain compromise either without realistic runtime traces or without the multi-source evidence required for chain reconstruction and forensic validation.

Key challenge: observability limits in realistic deployments. In real environments, defenders seldom have "full fidelity" visibility: telemetry is constrained by access boundaries (e.g., managed services, proprietary build systems), cost and performance budgets, and operational trade-offs (sampling, retention, and source coverage) [9]. For advanced supply-chain intrusions, these constraints are not incidental: they directly determine what is detectable. Evidence is frequently non-redundant across sources: a chain step visible in process lineage can be absent from system logs; a network indicator can be inconclusive without service traces; and pipeline-side artifacts can be inaccessible at runtime [18]. This implies that single-source detection is inherently incomplete for advanced supply-chain attacks: even an ideal detector operating on a single stream cannot recover a complete compromise chain when required evidence is missing by design [45].

Our approach. Guided by techniques observed in large-scale malicious open-source software (OSS) packages [11], we select representative real-world samples to build a near-production supply-chain attack testbed and a multi-source runtime dataset. We collect synchronized telemetry across hosts, services, and containers (including process lineage, system/audit logs, and network/service traces, with container-level visibility enabled via eBPF-based instrumentation) to support chain-level analysis.
We construct chain-level ground truth by combining (i) technique-level adversary actions directly exported from Mythic¹ C2 tasking logs (ATT&CK-mapped by construction for C2-driven steps) with (ii) payload-originated actions extracted via an LLM-assisted pipeline with manual verification. We then align defender-visible events to these actions using coarse rule/keyword matching to enable chain reconstruction and detectability analysis (Section 4).

Finally, we evaluate the marginal benefit of each telemetry stream under realistic availability constraints by comparing detection and chain-reconstruction performance for single-source, two-source, and multi-source fusion. This characterizes observability limits and informs cost-effective telemetry choices for runtime SSC defense. This design balances realism with experimental control, and isolates observability (telemetry anchors and cross-source joins) as the primary limiting factor, rather than confounding results with complex alignment or detector-specific assumptions.

What we find: single-source is not enough, while more is not always better. Our key finding is twofold: single-source monitoring is inherently insufficient for chain-complete evidence, while effective detection depends on which sources are fused rather than simply adding more. Our evaluation shows that no single source provides chain-complete evidence: the best single source reaches only 0.391 weighted tag/step coverage and 0.403 mean chain reconstruction. In contrast, even a minimal two-source fusion raises coverage to 0.636 and reconstruction to 0.639 (approximately 1.6× gain), with consistent improvements in chain coverage/recall (0.545). Importantly, the gains are not monotonic with the number of sources: additional streams can introduce noise and redundant evidence, yielding diminishing returns unless correlation is targeted.
Overall, multi-source correlation is a prerequisite for reconstructing end-to-end supply-chain compromise chains under realistic observability limits, and cost-effective telemetry requires selecting complementary sources rather than maximizing collection.

To summarize, the contributions include:

• End-to-end supply-chain scenarios grounded in real incidents. We distill exploitation prerequisites and runtime compromise patterns from recent incidents and instantiate them into seven representative end-to-end scenarios spanning multiple ecosystems and platforms.

• A near-production testbed and multi-source runtime dataset with chain-level ground truth. We collect synchronized host-, process-, network-, and service-level telemetry and align it to ground-truth action timelines to enable chain-level analysis.

• ATT&CK-aligned annotations and exploitation-trend analysis. We provide ATT&CK-mapped labels and analyses that expose recurring compromise structures, exploitation trends, and cross-scenario indicators.

• Chain-level detectability and telemetry trade-off study. We quantify chain-level detectability under telemetry constraints, showing that single-source evidence is inherently incomplete and that appropriately chosen multi-source fusion substantially improves reconstruction.

To the best of our knowledge, SynthChain is the first public dataset that jointly captures: (i) end-to-end execution traces grounded in real supply-chain compromise paths; (ii) multi-stage post-compromise activity; and (iii) synchronized multi-source telemetry enabling chain-level reconstruction and evaluation under realistic observability constraints.

¹ An open-source C2 framework that records operator tasking and implant responses, enabling direct export of executed actions and their ATT&CK technique mappings.
2 Limitations of Existing Telemetry for Advanced Supply-Chain Attack Analysis

Advanced SSC compromises are normally mediated by package ecosystems, dependency resolution, and build/release pipelines, leaving the evidence needed to detect advanced scenarios dispersed across components and telemetry layers. We identify two key gaps in current detection practice and research: (i) fragmented observability of multi-stage behaviors across sources, and (ii) a lack of datasets with explicit cross-source alignment to enable chain-level analysis.

2.1 Scope and Threat Coverage

We focus on advanced SSC threats that span the package-to-runtime lifecycle, including (i) registry and dependency entry vectors (e.g., typosquatting and dependency confusion), (ii) CI/CD and identity-driven pipeline compromise, and (iii) post-compromise runtime execution such as obfuscation, steganography, fileless/living-off-the-land (LotL), multi-stage payloads, and exfiltration. Appendix D (Table 9) lists the full threat types covered in this work. Our scope emphasizes threats whose execution unfolds across multiple components and stages, rather than isolated one-shot tampering.

2.2 Fragmented Observability of Advanced Supply-Chain Attacks

Advanced SSC attacks increasingly minimize localized artifacts by distributing functionality across stages and contexts: trojanised components may rely on LotL to blend into benign activity [5], while Lazarus-attributed incidents illustrate payload fragmentation and encoding across multiple packages to evade static detection [14]. As a result, evidence is scattered across heterogeneous telemetry (e.g., build/dependency signals, host process activity, and network/service traces) with inconsistent identifiers and loosely synchronized timestamps, making end-to-end reconstruction dependent on explicit cross-source alignment.
Yet most SSC studies and benchmarks remain package-centric, classifying individual packages via static features, ML signatures, or sandboxed traces (e.g., DONAPI [17] and dynamic execution pipelines for npm/PyPI [66]). Even when incorporating inter-package relations (e.g., transitive dependency analysis), linkage is typically established at the code/dependency layer rather than through aligned multi-source runtime evidence [49]; correspondingly, prior surveys largely organize methods around per-instance static/dynamic features [65]. Overall, SSC defense is thus a chain-level problem spanning dependencies, build infrastructure, and developer-centric workflows [59], motivating datasets and evaluations with chain-level ground truth and explicit cross-source alignment.

2.3 Synthetic Data Generation and the Lack of Cross-Source Alignment

To support security evaluation and reproducible experimentation, prior work has proposed synthetic or semi-synthetic datasets and testbeds. One line collects per-package behaviors in isolated sandboxes: OpenSSF releases unlabeled execution results with runtime behaviors and static indicators for individual package instances [41], and QUT-DV25 provides large-scale dynamic traces for PyPI SSC attacks using eBPF-based kernel and user-level probes [34]. While valuable for package-level detection, these resources typically treat each package as the unit of analysis and lack explicit cross-source alignment or chain-level ground truth across stages.

Another line builds simulation-based testbeds, e.g., model-driven environments for infrastructure and attack behaviors [24] with improved realism via user-activity simulation [23], but they often focus on limited telemetry (mainly system logs and network traffic) and do not model supply-chain–specific propagation paths.
Large-scale semi-synthetic Advanced Persistent Threat (APT) datasets demonstrate multi-stage trace generation [4, 35], yet they rely on restricted telemetry, do not explicitly encode cross-layer alignment, and capture general APT behaviors rather than supply-chain executions governed by dependency resolution and package-driven propagation [56]. Other synthetic corpora emphasize traffic diversity, attack variety, or labeling quality [8, 36, 48], which suits IDS benchmarking but not stealthy SSC chains.

Overall, existing data generation efforts emphasize realism or scale, but seldom address the cross-source alignment needed for chain reconstruction, where semantically related events must be correlated across heterogeneous telemetry.

3 Related Work

To inform our unique experimental setup and select representative scenarios that reflect recent SSC exploitation trends, we compare our testbed against prior datasets in terms of covered telemetry sources. We also perform a statistical analysis of a large corpus of malicious open-source packages, using their documented malicious functions and behaviors, to characterize technique usage trends and guide scenario selection.

3.1 Dataset Comparison

Table 1 compares representative datasets/testbeds against capabilities required for chain-level supply-chain analysis. Multi-Stage/Multi-Source indicate full progressions and heterogeneous telemetry; Alignment captures explicit cross-source event alignment (or ground truth for chain reconstruction), which is essential for evaluating correlation/provenance reasoning. We report on ATT&CK mapping, Tracee/eBPF host tracing, and the presence of Normal Behavior.

Existing works cover only subsets: supply-chain datasets focus on package-level artifacts without multi-stage traces/alignment, while APT-oriented testbeds provide multi-stage multi-source traces but are not supply-chain specific and typically lack alignment metadata.
SynthChain combines supply-chain scenarios with multi-stage, multi-source telemetry and explicit alignment, plus ATT&CK-grounded TTPs, eBPF tracing, and realistic background activity.

Table 1: Capabilities of existing datasets/testbeds vs. requirements for chain-level supply-chain attack analysis.

Work                          SC   Multi-Stage   Multi-Source   Alignment   ATT&CK TTPs   Tracee/eBPF*   Normal Behavior
QUT-DV25 (2025) [34]          ✓    –             ✓              –           –             ✓              –
OpenSSF (2025) [41]           ✓    –             ✓              –           –             –              –
Zhang et al. (2025) [65]      ✓    –             –              –           –             –              –
Landauer et al. (2023) [23]   –    ✓             ✓              –           ✓             –              ✓
Unraveled (2023) [35]         –    ✓             ✓              –           ✓             –              ✓
CICIoT2023 (2023) [36]        –    –             –              –           –             –              ✓
OpTC (2021) [4]               –    ✓             ✓              –           –             –              ✓
Backstabber (2020) [38]       ✓    –             –              –           –             –              –
SynthChain                    ✓    ✓             ✓              ✓           ✓             ✓              ✓

SC: supply-chain specific; Alignment: explicit cross-source event alignment / chain reconstruction; * eBPF-based host tracing (e.g., Tracee).

3.2 Statistical Analysis of Malicious Packages

To characterize current exploitation trends in SSC attacks, we analyze the OpenSSF dataset collected through 2025 [11], which contains 16,272 malicious packages across four major ecosystems (npm, PyPI, RubyGems, Rust) along with metadata describing their malicious behaviors.

For each package, we extract six dimensions capturing both structural and behavioural characteristics from the descriptions of individual packages: Ecosystem (platform), Location (where malicious code is embedded), Function (intended malicious action), Attack Type (high-level exploitation pattern), Trigger mechanism (conditions activating the behaviour), and Evasion method (techniques used to avoid detection).

Figure 4 in Appendix B summarises the aggregated distributions of these behaviours. The top row presents the behavioural semantics: where malicious code is placed/localized, what it does, and the corresponding high-level attack types.
We observe a relatively strong concentration on install-stage execution, payload installation, and data exfiltration, each appearing in over 9,000 packages. This reflects attackers' preference for early-stage execution and high-impact actions that require minimal user interaction and can compromise the host immediately upon installation.

The bottom row of Figure 4 further characterises the enabling mechanisms. Most malicious packages are triggered upon installation or download (15,583 cases), confirming installation-time activation as the dominant entry point. For evasion, lightweight techniques are overwhelmingly prevalent: Base64-based encoding alone appears in over 5,000 packages, far exceeding more sophisticated approaches such as payload splitting or steganography.

These findings indicate three consistent regularities:

1. Install-time execution is the primary activation strategy for initial foothold establishment;
2. Payload delivery and data theft/exfiltration are the central objectives;
3. Simple but effective evasion (especially encoding/obfuscation) is favoured by attackers.

To avoid anecdotal selection, we prioritize scenarios that match the most frequent behavior categories and add a small number of long-tail cases to capture diversity. These empirical regularities directly guide the scenario design in the next section: we choose representative real cases that collectively cover the dominant triggers, evasion methods, and malicious objectives observed in the dataset. To reflect emerging trends, especially attacks involving AI/ML components and cloud abuse [7, 47], we additionally design two scenarios targeting these vectors. Table 2 summarizes the resulting coverage, mapping each scenario to the high-frequency behavioral and mechanism categories to ensure operational relevance and representativeness.

4 Methodology

This section details how we construct SynthChain.
Our framework emulates end-to-end, multi-stage compromise pathways in controlled environments and prioritizes system-level observability over implementation details. We retain full-stage behaviors up to exfiltration to support early-stage detection, while omitting generic reconnaissance that is not characteristic of typical supply-chain exploitation. The methodology covers system setup, telemetry collection, monitoring configuration, benign-behavior emulation, design principles, and the resulting attack scenarios.

4.1 Setting Up

Our testbed approximates realistic development, deployment, and cloud-integrated supply-chain environments. It includes Windows and Linux hosts, as well as Docker-based workloads to emulate AI-component integrations common in modern pipelines.

Telemetry is collected via two ingestion paths and then processed by a common post-processing pipeline, as shown in Figure 1. For sources natively supported by Azure Log Analytics (e.g., Windows events and Syslog), logs are ingested into the workspace and passed through a lightweight transformation layer to normalize schemas and fields. For other sources (e.g., Zeek and Suricata), we directly extract records from hosts and feed them into the same normalization stage. All streams then undergo the same anonymization and parsing procedures before downstream analysis. The environment contains attacker-controlled infrastructure, development hosts, office hosts, and public-internet access that supports download and update activities.
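As a minimal sketch of the shared normalization stage described above, the following Python snippet maps records from both ingestion paths onto one schema. The alias table, the canonical field names (`ts`, `host`, `message`), and the helper names are illustrative assumptions, not the SynthChain implementation; the source column names are only examples of what each path might emit.

```python
from datetime import datetime, timezone

# Hypothetical alias map: canonical field -> candidate source-specific names.
FIELD_ALIASES = {
    "ts": ["TimeGenerated", "ts", "timestamp"],
    "host": ["Computer", "HostName", "host"],
    "message": ["SyslogMessage", "RenderedDescription", "message"],
}

def to_utc_iso(value):
    """Convert epoch seconds or an ISO-8601 string to a UTC ISO-8601 string."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        if dt.tzinfo is None:
            # Assumption: naive timestamps are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

def normalize(record, source):
    """Map one raw record onto the unified schema, keeping its source label."""
    out = {"source": source}
    for canonical, aliases in FIELD_ALIASES.items():
        out[canonical] = None
        for name in aliases:
            if record.get(name) not in (None, ""):
                out[canonical] = record[name]
                break
    if out["ts"] is not None:
        out["ts"] = to_utc_iso(out["ts"])
    return out

# One record per ingestion path (values are made up for illustration).
workspace_row = {"TimeGenerated": "2024-01-01T10:00:00Z",
                 "Computer": "dev-1", "SyslogMessage": "pip install foo"}
local_row = {"ts": 1704103200.0, "id.orig_h": "10.0.0.5"}

unified = [normalize(workspace_row, "syslog"), normalize(local_row, "zeek")]
```

Missing fields are kept as `None` rather than dropped, so later stages can report them as explicit gaps instead of silently skipping the record.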
4.1.1 Collected Data

SynthChain integrates telemetry from heterogeneous environments, including Windows hosts, Linux hosts, and Docker-based container workloads. We collect host-, network-, and system/authentication logs, and optionally enrich them with behavioral tracing for higher-fidelity action reconstruction. Detailed data sources by platform are summarized in Appendix C (Table 7).

Table 2: Scenario Coverage Matrix

Trigger Evasion Functions
Case Inst DL Hook CICD Cond Obf Steg Enc FRep MS Inj Fileless Exfil C2 Steal Payload Persist
1.Stegano ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
2.Starter ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
3.Parallel ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
4.NPMEX ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
5.3CX ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
6.CloudEX ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
7.LayerInj ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

Legend: CICD=CI/CD pipelines; Cond=conditional trigger; Enc=encoding/stream cipher; FRep=file replacement; MS=multi-stage/sequenced execution; Inj=DLL side-loading/process injection; Fileless=fileless malware; Steal=local data theft; Payload=payload download. For other abbreviations, see Figure 4 in Appendix B.

[Figure 1: Simulation Workflow and Analysis Pipeline. Labeled components: Internet (source) with malicious libraries/packages and software updates; network traffic; Office Zone with Linux dev host, Windows office host, Windows dev host, and Mac dev host (numbered 1 to 4); C2 server (Linux); Linux agents, Windows agents, and Sysmon; local collection; Azure Cloud Log Analytics Workspace (activity log, system log, process monitor) with a transform step; data anonymization (usernames, resource IDs, sensitive domains); parsing (diverse parsers); output files (unified schema); offline analysis (tag/label stages, chain reconstruction, rule matching, TTPs alignment, compare/statistic analysis).]

Telemetry variability and trust domains. Telemetry availability and granularity vary across scenarios due to differences in execution environments and attack outcomes (e.g., partial/failed execution, short-lived or fileless activity, and container-scoped behaviors).
Rather than enforcing artificial completeness, we preserve these natural observability gaps to support analysis of when single-source telemetry fails and how multi-source evidence mitigates such limitations. When target-side telemetry is sparse, we additionally use attacker-side logs from the orchestration host (e.g., Linux syslog) to provide auxiliary provenance context (command issuance and network-attempt timing). We explicitly maintain the trust-domain boundary between attacker-side and victim-side observations by annotating each record with its collection origin (attacker vs. target) to prevent inadvertent leakage of privileged signals into detection-only evaluations.

MITRE ATT&CK-Aligned Data Annotation. To ensure consistent and interpretable labeling across heterogeneous traces, we annotate adversarial actions using MITRE ATT&CK. For C2-side operator tasking, Mythic² provides built-in ATT&CK technique mapping and exports TTPs directly from recorded tasks and events. For behaviors originating from the payload itself, we use an LLM (GPT-5.1) to propose candidate ATT&CK techniques from payload source code, followed by evidence-based human validation grounded primarily in traceable source-level evidence. We further perform multi-annotator cross-checking and adjudication for ambiguous cases. These annotations form the semantic backbone for scenario construction and subsequent threat analysis.

² https://docs.mythic-c2.net/home

4.1.2 Normal Behavior Modeling

To elicit realistic runtime signals, we inject benign background activity that commonly co-occurs with early-stage supply-chain compromises (Appendix A, Table 6). Unlike prior work using predefined attack scripts or controlled workloads [4, 23], our environment embeds routine usage patterns that introduce realistic noise and may obscure stealthy exploitation.
Activities include downloads and updates, filesystem operations, outbound web communication, and interactive use (e.g., browsing, office work, and service execution).

To increase diversity, hosts are assigned functional profiles (development vs. daily use), yielding different process mixes, login rates, and communication patterns, which makes rare anomalies harder to isolate and better reflects real detection conditions. We do not model human intent; instead, we randomize and schedule sufficient benign variability for meaningful forensic analysis (Appendix A).

4.2 Design Principles

Our scenarios are designed to balance realism, representative supply-chain threat coverage, and resistance to trivial detection. Below, we highlight the key principles that guide the construction of our attack behavior.

Randomness. Randomness plays a dual role in our scenario design. We use randomness to avoid overly regular traces. For benign activity, randomized scheduling and mixed activity types emulate natural operational irregularity and provide realistic background noise. For adversarial behavior, we randomize triggers, command ordering, timing, and delivery paths to prevent fixed workflows that would otherwise yield easy signatures.

Coverage of Representative Threat Classes. Guided by our statistical analysis (Table 4) and recent exploitation trends [43, 60], our scenarios span a representative spectrum of supply-chain threats (Appendix Table 9), covering early-stage vectors (e.g., typosquatting, dependency confusion), late-stage behaviors (e.g., multi-stage payloads, fileless execution, exfiltration), evasion (e.g., obfuscation, steganography), and abuse of cloud/AI components (e.g., cloud-based staging and malicious model dependencies).

Adversarial Goals. We model APT-like SSC adversaries focused on covert information theft and persistence.
Scenarios stress operational security: lightweight obfuscation/encoding to frustrate superficial inspection, staged execution with minimal observable footprint, and (when applicable) in-memory execution to reduce disk artifacts and hinder file-centric defenses and forensics. Exfiltration is modeled as selective and low-noise to reflect realistic theft-oriented behavior.

Comparable End-to-End Attack Semantics. To support systematic comparison across heterogeneous scenarios, each attack chain ends with an explicit exfiltration phase. If a sample already implements exfiltration, we preserve it; otherwise, we only add minimal external orchestration to complete missing stages without altering the intended semantics.

4.3 Scenarios

Based on our environment and telemetry pipeline, we construct controlled SSC attack scenarios that capture end-to-end multi-stage behaviors across heterogeneous environments, focusing on host-level observability. Each scenario includes an exfiltration stage: if a sample lacks a native C2/exfiltration mechanism, we use Mythic only as a controlled C2 endpoint to drive remote code execution and a bounded exfiltration step; otherwise, we preserve the sample's original C2/exfiltration behavior. Scenario designs (triggers/evasion, key functions, and tools) are summarized in Table 13 (Appendix F).

All evidence is derived solely from system telemetry (e.g., package manager activity, process creation, file I/O, and network connections). Mythic agents (Apollo³ and Medusa⁴) are used only to exercise controlled command execution and exfiltration flows, covering both script-based execution and compiled payload delivery. For heavily packed samples (e.g., 3CX), we avoid unpacking or rewriting and instead rely on runtime observables (e.g., network-related system calls and connection attempts) to capture intended communication.
4.3.1 Stegano

Following the Checkmarx report (2023) [13], we simulate a representative Windows typosquatting-based open-source supply-chain compromise that primarily leverages steganography to deliver a hidden payload. The attack is abstracted into the following observable execution stages: (1) installation of a malicious package triggering execution of embedded code, (2) retrieval of an external resource containing a concealed payload in an image, (3) in-memory payload execution, and (4) outbound communication attempts for data exfiltration. A detailed attack workflow is provided in Figure 2 (option 2).

4.3.2 Starter

Also based on the same report [13], Starter represents a Windows typosquatting case that implements an explicit multi-stage chain with an emphasis on persistence, achieved by creating or modifying startup entries (e.g., under the Windows Start Menu startup folder). We abstract it into three observable stages: (1) environment checks and workspace setup; (2) deployment of covert launchers and persistence; and (3) payload retrieval and execution. The detailed launcher logic and full workflow are provided in Figure 2 (option 1).

[Figure 2: Stegano and Starter Attack Flow. Flowchart nodes: developer mistakenly downloads typosquat package (legitimate package vs. typosquat package); obfuscated code in setup.py file; two branches labeled possibility 1 and possibility 2, with nodes: downloads image from external source, embedded with malware; exfiltrate system data; remove all evidence; exfiltrate sensitive info; persistence on machine; create/verify needed directories and files; download executable from external source; user identification.]

³ https://github.com/MythicAgents/Apollo
⁴ https://github.com/MythicAgents/Medusa

4.3.3 Parallel

Parallel captures Linux-based multi-stage script chains observed in npm supply-chain incidents [12, 15] (2023).
We model four stages: (1) lifecycle-hook execution at install time; (2) detached execution of secondary scripts; (3) reconnaissance and local collection; and (4) outbound exfiltration attempts. The complete script logic diagram is shown in Appendix F (Figure 6).

4.3.4 NPMEX

NPMEX represents dependency-chain attacks where multiple malicious packages execute sequentially and exchange artifacts [14] (2023). We abstract three stages: (1) first dependency prepares shared artifacts; (2) second dependency consumes artifacts and progresses execution; and (3) dynamic code retrieval followed by payload deployment. The original workflow is illustrated in Appendix F (Figure 7).

4.3.5 3CX

Based on public reports of the 3CX incident [20] (2023), we simulate a simplified chain comprising trojanised installer execution, DLL side-loading, in-memory payload execution, and subsequent C2 attempts. To preserve semantic fidelity, we execute collected samples in their original binary form; execution is conducted in a constrained environment for safety and observability. The detailed component interactions and workflow are provided in Appendix F (Figure 8).

4.3.6 CloudEX

CloudEX models cloud-based supply-chain compromise targeting CI/CD pipelines and build artifacts, adapted from reported cases [2]. We abstract four stages: (1) initial access to an exposed service; (2) discovery of residual CI/CD credentials; (3) access to internal artifact repositories; and (4) artifact modification to propagate downstream impact. The complete workflow is illustrated in Appendix F (Figure 9).

4.3.7 LayerInj

LayerInj models ML supply-chain attacks where a tampered model artifact embeds a persistent backdoor without explicit malicious code [27, 58, 62]. We abstract three stages: (1) introduction of a tampered model; (2) deployment in a downstream service; and (3) trigger-based activation at inference time.
The full workflow is shown in Appendix F (Figure 10).

5 Data Sanitization

The collected Azure logs and local data, as shown in Figure 1, contain environment-specific identifiers such as hostnames, local user accounts, and Azure resource identifiers. To protect privacy while preserving analytical utility, we apply a stable pseudonymization strategy that replaces environment-specific identifiers with consistent tokens across all log sources. Importantly, we retain all semantic fields relevant to security analysis (e.g., event types, ports, protocols, IPs, and temporal ordering). For realism in our simulation, we preserve public Internet FQDNs and only pseudonymize Azure-related domains when they may reveal deployment-specific details. This sanitization preserves equality relationships required for correlation and does not materially alter the event statistics used in our analysis. Detailed sanitization rules and examples are provided in Appendix K.

6 Attack Scenario Analysis

This section analyzes how the defined supply-chain attack scenarios manifest in observable multi-source telemetry, abstracting away attack implementation details and focusing exclusively on execution traces and extracted indicators. Our analysis is organized around three system-level questions:

1. Q1: How completely can end-to-end attack chains be reconstructed from the available telemetry?
2. Q2: Where and why does single-source telemetry fail to support reliable reconstruction?
3. Q3: How does multi-source telemetry mitigate these failures across different attack structures?

The answers to these questions form the basis for our cross-scenario insights and deployment implications.
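The stable pseudonymization described in Section 5 can be illustrated with a keyed hash. The following is a minimal sketch and not the rules of Appendix K: the key, the token format, and both regular expressions (`HOST_RE`, `AZURE_ID_RE`) are hypothetical choices made for illustration. It shows the property the section relies on: the same identifier maps to the same token in every log source, while IPs, ports, and public FQDNs are left untouched.

```python
import hashlib
import hmac
import re

# Hypothetical private key; it would be kept out of the released dataset.
SECRET_KEY = b"replace-with-a-private-key"

# Hypothetical identifier patterns (assumptions, not the paper's rules).
AZURE_ID_RE = re.compile(r"/subscriptions/[0-9a-f-]{36}(?:/[\w.-]+)*", re.IGNORECASE)
HOST_RE = re.compile(r"\b(?:vm|host)-[a-z0-9-]+\b", re.IGNORECASE)


def pseudonym(kind: str, value: str) -> str:
    """Map an identifier to a stable token such as HOST_1a2b3c4d."""
    msg = f"{kind}:{value.lower()}".encode()
    digest = hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()
    return f"{kind.upper()}_{digest[:8]}"


def sanitize(line: str) -> str:
    """Replace environment-specific identifiers; keep all other fields as-is."""
    line = AZURE_ID_RE.sub(lambda m: pseudonym("azres", m.group(0)), line)
    line = HOST_RE.sub(lambda m: pseudonym("host", m.group(0)), line)
    return line


syslog_line = sanitize("sshd: accepted login on vm-build-01 from 10.0.0.5 port 22")
zeek_line = sanitize("conn VM-BUILD-01 -> 10.0.0.5:443 tcp")
print(syslog_line)
print(zeek_line)
```

Because the token depends only on the key and the lower-cased identifier, equality joins on the host survive sanitization across sources, which is what cross-source correlation needs.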
6.1 Analysis Methodology

We treat telemetry as the only evidence source and apply a uniform pipeline that (i) parses and normalizes heterogeneous logs into a schema-tolerant event table, (ii) tags events with coarse behavioral steps, and (iii) reconstructs candidate attack chains by correlating evidence across time and entities.

(1) Scenario-scoped ingestion and normalization. Parsers convert raw records into a unified table with a canonical timestamp field (ts) and a lightweight text blob (e.g., raw/message/cmdline) for matching; ts is normalized to a consistent time basis for stable ordering.

(2) Coarse step tagging (intermediate evidence). We assign a small set of interpretable steps (INSTALL, AUTH, DOWNLOAD, OUTBOUND_CONN, EXFIL) using a rule-based tagger. Rules specify regex patterns, candidate fields, optional structured prefilters (where_any/where_all), and applicable telemetry sources. The tagger is schema-tolerant via canonical-field aliasing, resolves missing-field/prefilter cases as explicit diagnostics, and breaks ties by priority score while retaining matched candidates for ambiguity analysis. When the ground truth of the scenario is available, we map the ATT&CK techniques of the scenario to an expected set of steps and optionally use them to gate the label space during tagging. ATT&CK is not used for per-event technique detection.

(3) Event correlation & chain reconstruction. We precompute a temporal event graph over tagged events and connect events that exhibit plausible continuity (shared entities or consistent network attributes within a bounded temporal window). We then extract the candidate chains as ordered step sequences and retain supporting events for analysis.

(4) Metrics, ambiguity, and failure characterization. When expected steps are available, we report step- and chain-level precision/recall against the expected step set; otherwise we report coverage-oriented observability proxies.
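The rule-based tagging of step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released tagger: the `Rule` fields, the alias table, the diagnostic strings, and the three example rules are all hypothetical, and only a `where_all` prefilter is modelled. It shows canonical-field aliasing, source gating, priority tie-breaking, and retention of all matched candidates.

```python
import re
from dataclasses import dataclass, field

# Canonical field -> aliases seen in different schemas (hypothetical names).
ALIASES = {"text": ("raw", "message", "cmdline"), "source": ("source", "log_type")}


@dataclass
class Rule:
    step: str                                       # coarse step, e.g. "INSTALL"
    pattern: str                                    # regex applied to the text blob
    priority: int = 0                               # higher priority wins ties
    sources: tuple = ()                             # applicable sources; empty = any
    where_all: dict = field(default_factory=dict)   # structured prefilter


def canon(event: dict, name: str):
    """Return the first non-empty alias of a canonical field, else None."""
    for alias in ALIASES.get(name, (name,)):
        if event.get(alias) not in (None, ""):
            return event[alias]
    return None


def tag(event: dict, rules: list) -> dict:
    text = canon(event, "text")
    if text is None:
        return {"step": None, "candidates": [], "diag": ["missing_field:text"]}
    source = canon(event, "source")
    matched, diag = [], []
    for rule in rules:
        if rule.sources and source not in rule.sources:
            continue  # rule does not apply to this telemetry source
        if any(event.get(k) != v for k, v in rule.where_all.items()):
            diag.append(f"prefilter_drop:{rule.step}")
            continue
        if re.search(rule.pattern, str(text), re.IGNORECASE):
            matched.append(rule)
    matched.sort(key=lambda r: r.priority, reverse=True)
    steps = [r.step for r in matched]
    return {"step": steps[0] if steps else None, "candidates": steps, "diag": diag}


def multi_match_fraction(events: list, rules: list) -> float:
    """Fraction of events that match more than one step tag."""
    results = [tag(e, rules) for e in events]
    return sum(len(r["candidates"]) > 1 for r in results) / len(results) if results else 0.0


RULES = [
    Rule("INSTALL", r"\b(pip3?|npm)\s+install\b", priority=2),
    Rule("DOWNLOAD", r"\b(curl|wget)\b|https?://", priority=1),
    Rule("OUTBOUND_CONN", r"\bconn\b", priority=1, sources=("zeek",)),
]
EVENTS = [
    {"source": "syslog", "cmdline": "pip install requessts"},
    {"source": "syslog", "message": "npm install https://example.invalid/pkg.tgz"},
    {"log_type": "zeek", "raw": "conn 10.0.0.4:51544 -> 203.0.113.7:443 tcp"},
]
for ev in EVENTS:
    print(tag(ev, RULES))
```

The second example event matches both INSTALL and DOWNLOAD; the higher-priority INSTALL is chosen while DOWNLOAD stays in `candidates`, which is the information an ambiguity analysis would consume.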
A continuity proxy flags step transitions with excessive temporal gaps (default: 10 minutes). We characterize failures via missing-evidence patterns and, using tagger diagnostics, distinguish true evidence absence from schema/rule mismatches (e.g., missing fields, prefilter drops, or unusable rules). We quantify ambiguity at two levels: (i) event-level ambiguity as the fraction of events that match multiple step tags ($|M(e)| > 1$), and (ii) chain-level ambiguity as competition among candidate chains, measured by the top-2 score margin (and optionally the entropy over top-$K$ candidates).

(5) Source-budgeted runs (Q1–Q3). We rerun the same pipeline under controlled source budgets: single-source (one stream), combo-source (small fixed pairs), and multi-source (all available). Per scenario and budget category, we report one representative run selected by reconstruction quality to enable clean cross-scenario comparisons.

Reproducibility. All parsers, rules, and reconstruction parameters are fixed across scenarios; differences therefore reflect telemetry availability and attack structure rather than scenario-specific tuning.

Table 3: Statistics of Collected Telemetry (Normalized)

| Scenario | Total Records | Benign Records | Attack Records |
|---|---|---|---|
| 1. Stegano | 8,534 | 8,530 | 4 (0.05%) |
| 2. Starter | 53,978 | 48,788 | 5,190 (9.62%) |
| 3. Parallel | 88,674 | 41,270 | 47,404 (53.46%) |
| 4. NPMEX | 188,270 | 103,925 | 84,345 (44.80%) |
| 5. 3CX | 7,453 | 6,881 | 572 (7.67%) |
| 6. CloudEX | 9,774 | 9,636 | 138 (1.41%) |
| 7. LayerInj | 222,046 | 169,894 | 52,152 (23.49%) |

6.2 Scenario Results Overview

6.2.1 Collected telemetry summary

Table 3 summarizes the volume of normalized telemetry per scenario. A record denotes a single normalized telemetry entry (one atomic observation) produced from any evidence channel.
Attack records are defined by evidence-based association (primarily timeline/entity alignment) and should be interpreted as telemetry likely related to the attack, rather than an exact one-to-one match with ground-truth attacker actions.

Record counts vary widely across scenarios and should be interpreted primarily as a function of logging granularity and available evidence channels, rather than scenario duration. Accordingly, SC7 (LayerInj) and SC4 (NPMEX) yield the largest datasets (222,046 and 188,270 records), whereas SC1 (Stegano), SC5 (3CX), and SC6 (CloudEX) are much smaller (7,453–9,774 records).

Beyond volume, the composition of telemetry sources also differs across scenarios. SC1 draws from multiple Azure Monitor streams, while SC2 and SC6 rely mainly on syslog plus events. Importantly, events should be interpreted as a composite evidence source across scenarios: although exported as a single dataset, it aggregates multiple underlying channels (e.g., process, network, and security) at collection/export time. We therefore budget events as multi-source wherever it appears (e.g., SC5, where it is the dominant stream), despite its single-file representation.

Finally, the ratio between total records and rule-matched records varies widely. High hit densities (e.g., SC3/SC4/SC7) indicate rich behavioral traces that can support step tagging and correlation, whereas extremely low hit density (SC1) suggests either (i) true evidence absence for the defined coarse steps, or (ii) schema/field mismatches that prevent rules from firing, motivating the failure analysis in CSA-3.

6.2.2 Cross-Scenario Technique Prevalence

Across scenarios, we observe substantial variation in technique breadth (Table 4), where $|T_s|$ counts distinct techniques observed in scenario $s$ and $|T|$ is the size of the union over all scenarios; $f$ denotes cross-scenario frequency.
SC1/SC2/SC6/SC7 cover roughly two-thirds of the global technique pool, whereas SC3–SC5 are much narrower in total techniques but exhibit a markedly higher fraction of scenario-unique techniques. This pattern suggests that broad scenarios share a large common core of behaviors, while narrower scenarios emphasize more specialized, scenario-specific steps.

Complementarily, we rank techniques by cross-scenario coverage and find a small set of ubiquitous techniques that appear in almost all scenarios (Appendix G, Table 14). Notably, supply-chain-related techniques are consistently present in our scenario set, reflecting that several scenarios are dependency- or software-distribution-mediated intrusions. The most prevalent techniques largely correspond to capability acquisition/staging and tool transfer, consistent with supply-chain or dependency-mediated intrusion setups where attackers must first obtain and deliver artifacts before executing later-stage actions.

6.2.3 Representative baselines and source-budget comparison

Source-budget definition. To contextualize our results against common prior settings and to isolate the benefit of additional telemetry, we group configurations by source budget, i.e., the number of distinct evidence sources available to the pipeline. For each scenario, the corresponding full-telemetry source set ("Multi (full telemetry)") is enumerated in Appendix H (Table 15). Importantly, events (our exported azure_events dataset) is a composite stream rather than a single-channel log, so configurations that include events may exceed a 2-source budget despite appearing as one dataset (details in Appendix H).

Representative configurations. Single-source (1) baselines operate on one telemetry stream, reflecting host-only provenance/audit or single-stream detectors commonly assumed in prior work (e.g., audit/provenance) [6, 25, 63].
Combo (2) baselines use exactly two sources; we instantiate a representative host+network setting (audit+Zeek) that combines host causality with network connectivity signals [29, 30]. Multi (≥3) settings use three or more sources; as a practical example that frequently appears in deployments and prior work, system+events (syslog+events) falls into this category under our composite-stream accounting [52]. Within multi-source (≥3), we additionally evaluate a full-telemetry setting that uses the maximum telemetry available in each scenario, representing the strongest achievable configuration of our pipeline.

Metrics rationale. We use three metric families to disentangle what is observable from what is correctly reconstructed. (Tag/Chain) coverage measures whether the available telemetry exposes the expected coarse steps at all, independent of attribution quality. Precision/recall then quantify reconstruction correctness when ground truth is available, and reconstructability summarizes end-to-end chain quality. For cross-scenario aggregation, we weight coverage/recall by $E_s$ to avoid over-emphasizing scenarios with fewer expected steps, while reporting unweighted means for precision and reconstructability to reflect per-scenario typical performance.

Metric definitions and aggregation. For each scenario $s$ and configuration (source budget) $c$, we compute per-run metrics from the reconstructed chain. Let $E_s$ denote the number of expected coarse steps for scenario $s$ (derived from extracted TTPs), and let $T_{s,c}$ and $C_{s,c}$ denote the numbers of tagged and chain-covered step types observed under configuration $c$. We define the observability as:

$$\mathrm{TagCov}(s,c) = \frac{T_{s,c}}{E_s} \in [0,1], \qquad \mathrm{ChainCov}(s,c) = \frac{C_{s,c}}{E_s} \in [0,1]. \tag{1}$$

When ground truth is available, we compute step-level and chain-level precision/recall for each scenario and configuration, denoted as $\mathrm{StepR}(s,c)$, $\mathrm{ChainR}(s,c)$, $\mathrm{StepP}(s,c)$, and $\mathrm{ChainP}(s,c)$. We aggregate weighted metrics (marked "wtd.") across scenarios by weighting each scenario by its $E_s$:

$$M(s,c) \in [0,1], \qquad M^{\mathrm{wtd}}_{c} = \frac{\sum_{s \in S_c} E_s \cdot M(s,c)}{\sum_{s \in S_c} E_s} \in [0,1], \tag{2}$$

where $M \in \{\mathrm{TagCov}, \mathrm{ChainCov}, \mathrm{StepR}, \mathrm{ChainR}\}$ and $S_c$ is the set of scenarios included for configuration $c$. Metrics labeled "mean" (StepP/ChainP/Reconstructability) are aggregated as the unweighted average over $S_c$:

$$M^{\mathrm{mean}}_{c} = \frac{1}{|S_c|} \sum_{s \in S_c} M(s,c). \tag{3}$$

Within each scenario and source-budget category, we report the best run by maximizing $\mathrm{StepR}(s,c)$ (tie-break by $\mathrm{ChainR}(s,c)$, then by event volume), ensuring a fair "best achievable" comparison under a fixed telemetry budget.

Table 4: Scenario-level ATT&CK technique coverage and uniqueness

| | SC1 | SC2 | SC3 | SC4 | SC5 | SC6 | SC7 |
|---|---|---|---|---|---|---|---|
| Total observed techniques $|T_s|$ | 104 | 103 | 29 | 29 | 35 | 104 | 100 |
| Coverage of global pool ($|T| = 161$) | 64.6% | 64.0% | 18.0% | 18.0% | 21.7% | 64.6% | 62.1% |
| Scenario-unique techniques ($f = 1$) | 7 | 6 | 9 | 7 | 13 | 6 | 5 |
| Unique share within scenario | 6.7% | 5.8% | 31.0% | 24.1% | 37.1% | 5.8% | 5.0% |

Q1–Q3 summary across scenarios. We quantify how reconstruction quality changes with increasing source budget. Table 5 reports cross-scenario aggregates by budget, and Appendix H (Table 15) provides per-scenario best-achievable configurations and missing steps. We use these summaries to answer Q1–Q3 in terms of observability (Tag/Chain Coverage), detection quality (Step/Chain Recall and Precision), and end-to-end chain quality (Reconstructability).

Q1: How completely can end-to-end attack chains be reconstructed from the available telemetry? Under full telemetry, our pipeline reaches $\mathrm{TagCov}^{\mathrm{wtd}} = 0.481$ and $\mathrm{StepR}^{\mathrm{wtd}} = 0.481$ across all seven scenarios, with mean reconstructability 0.488. This indicates that, on average, multi-source evidence substantially improves end-to-end reconstruction relative to single-source settings (best single: $\mathrm{TagCov}^{\mathrm{wtd}} = 0.478$, $\mathrm{StepR}^{\mathrm{wtd}} = 0.348$, reconstructability 0.361). At the scenario level (Table 15 in Appendix H), reconstruction remains bimodal: SC2 and SC4 achieve StepR = 0.75 (missing only OUTBOUND_CONN and EXFIL respectively), while SC1/SC5/SC6 remain at StepR = 0.25 due to persistent absence of DOWNLOAD/OUTBOUND_CONN/EXFIL evidence. SC7 reaches StepR = 0.667 but consistently misses DOWNLOAD, consistent with model-level attacks whose retrieval phase is weakly expressed in available host/network schemas.

Q2: Where and why does single-source telemetry fail? Single-source telemetry exhibits two systematic failure modes. First, evidence incompleteness: averaged over all single sources, $\mathrm{TagCov}^{\mathrm{wtd}}$ and $\mathrm{StepR}^{\mathrm{wtd}}$ both drop to 0.263, with mean precision 0.775. The gap between "avg over all singles" and "best single" reflects that many single-source runs observe few (or none) of the expected steps, yielding undefined precision that we conservatively treat as zero. Second, semantic/causal ambiguity: even the best-achievable single-source selection (best single) systematically misses cross-layer phases such as DOWNLOAD and EXFIL (e.g., SC2 and SC4 in Appendix H), because these steps require joinable host execution context and network/service evidence that a single stream cannot provide.

Q3: How does multi-source mitigate these failures, and what remains unresolved? Multi-source mitigates single-source failures primarily by improving completeness: adding complementary anchors increases observability and raises aggregate recall from 0.348 (best single) to 0.481 (full telemetry), while improving mean reconstructability from 0.361 to 0.488.
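As a worked check of the aggregation in Eqs. (2) and (3), the per-scenario full-telemetry recalls reported in this section (0.25 for SC1/SC5/SC6, 0.75 for SC2/SC4, 0.667 for SC7, and 0.50 for SC3 as given in the scenario summaries of Section 6.3.3) can be combined as below. This is a minimal sketch: the expected-step counts $E_s$ (four for every scenario except SC7, three for SC7) are an assumption inferred from those recall fractions, not values quoted from Table 15.

```python
from fractions import Fraction

# scenario -> (assumed E_s, number of expected steps recovered under full telemetry)
FULL_TELEMETRY = {
    "SC1": (4, 1), "SC2": (4, 3), "SC3": (4, 2), "SC4": (4, 3),
    "SC5": (4, 1), "SC6": (4, 1), "SC7": (3, 2),
}


def weighted(runs: dict) -> Fraction:
    """Eq. (2): average of per-scenario values weighted by E_s."""
    total_e = sum(e for e, _ in runs.values())
    return sum(e * Fraction(hit, e) for e, hit in runs.values()) / total_e


def unweighted(runs: dict) -> Fraction:
    """Eq. (3): plain mean of per-scenario values."""
    return sum(Fraction(hit, e) for e, hit in runs.values()) / len(runs)


step_r_wtd = weighted(FULL_TELEMETRY)
print(step_r_wtd, f"{float(step_r_wtd):.3f}")   # 13/27 0.481
print(f"{float(unweighted(FULL_TELEMETRY)):.3f}")
```

Under these assumed $E_s$ values the weighted recall is 13/27, which rounds to the reported 0.481; weighting by $E_s$ means a three-step scenario such as SC7 contributes slightly less to the aggregate than a four-step one.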
The gains are driven mainly by scenarios where missing phases become observable under additional sources (e.g., SC2 and SC3/SC4 in Appendix H). However, multi-source does not universally resolve missing-step gaps: SC1/SC5/SC6 remain bounded by absent DOWNLOAD/OUTBOUND_CONN/EXFIL evidence, indicating that additional sources help only when they expose the missing phase with compatible join keys (process, user, network endpoints) rather than merely adding volume. Finally, two-source pairs can outperform or match multi-source on the subset of scenarios where such pairs exist and are highly informative (Combo best: reconstructability 0.639 over 3 scenarios in Table 5); nevertheless, the primary benefit of multi-source is robustness across heterogeneous scenarios, not dominance on every scenario subset.

6.3 Case Studies

We present two contrasting scenarios to illustrate both the strengths and the evidence-bound limits of our telemetry-to-chain pipeline. SC4 (NPMEX) is a positive exemplar where complementary host and network telemetry exposes joinable anchors, enabling near-complete reconstruction of the expected chain (StepR = 0.75). In contrast, SC1 (Stegano) is a negative exemplar: even under full telemetry ingestion, key phases such as DOWNLOAD, OUTBOUND_CONN, and EXFIL remain unobservable in rule-matchable fields, bounding reconstruction at StepR = 0.25. Full evidence packages (step anchors, matched rules, join keys, and missing-step diagnostics) are provided in Appendix J.

6.3.1 SC4: NPMEX – Sequential Dependency Chain Attack (Positive exemplar: near-complete chain)

Results. In SC4 (NPMEX), the full-telemetry configuration reconstructs three of the four expected coarse steps, achieving StepR = 0.75 with perfect step precision (StepP = 1.0). The observed step set is {INSTALL, DOWNLOAD, OUTBOUND_CONN} and the reconstructed chain is OUTBOUND_CONN → INSTALL → DOWNLOAD.
Across the run, the pipeline ingests 188,270 normalized records; the strongest evidence comes from high-volume network telemetry for OUTBOUND_CONN (tens of thousands of connection records), complemented by a small number of high-specificity host records for INSTALL (package-manager actions) and DOWNLOAD (explicit retrieval commands).

Table 5: Cross-scenario summary by source budget and representative combinations. Coverage/recall are weighted by expected steps (wtd.). Bold indicates the best value and underline indicates the second-best value in each metric column.

| Category | #SC | TagCov (wtd.) | ChainCov (wtd.) | StepR (wtd.) | ChainR (wtd.) | StepP (mean) | ChainP (mean) | Recon. (mean) |
|---|---|---|---|---|---|---|---|---|
| Single (1): avg over all single sources | 6 | 0.263 | 0.263 | 0.263 | 0.263 | 0.775 | 0.775 | 0.266 |
| Single (1): best single-source | 6 | 0.391 | 0.391 | 0.391 | 0.391 | 1.000 | 1.000 | 0.403 |
| Single (1): audit/provenance [25, 63] | 3 | 0.091 | 0.091 | 0.091 | 0.091 | 0.333 | 0.333 | 0.111 |
| Single (1): Zeek [10, 44] | 3 | 0.273 | 0.273 | 0.273 | 0.273 | 1.000 | 1.000 | 0.278 |
| Combo (2): avg over 2-source pair | 5 | 0.430 | 0.400 | 0.430 | 0.400 | 1.000 | 1.000 | 0.431 |
| Combo (2): best 2-source pair | 3 | 0.636 | 0.545 | 0.636 | 0.545 | 1.000 | 1.000 | 0.639 |
| Combo (2): audit+Zeek [29, 30] | 3 | 0.364 | 0.273 | 0.364 | 0.273 | 1.000 | 1.000 | 0.389 |
| Multi (≥3): syslog+events [52] | 2 | 0.500 | 0.500 | 0.500 | 0.500 | 1.000 | 1.000 | 0.500 |
| Multi (≥3): avg full telemetry | 7 | 0.481 | 0.481 | 0.481 | 0.481 | 1.000 | 1.000 | 0.488 |
| Multi (≥3): best full telemetry | 7 | 0.481 | 0.481 | 0.481 | 0.481 | 1.000 | 1.000 | 0.488 |

Note: events data is taken as a composite telemetry stream (multiple evidence channels).

Analysis. SC4 is reconstructable because it provides complementary anchors with compatible join keys. Network telemetry (e.g., Suricata/Zeek) supplies stable connection-level evidence that grounds OUTBOUND_CONN in time and endpoints, while host telemetry (syslog/auth) provides execution-context anchors for INSTALL and explicit fetch behavior for DOWNLOAD.
These anchors are temporally consistent and share joinable entities (host identity, process/user context, and/or endpoints), allowing the event graph to connect phases into a coherent chain.

Why no EXFIL. We do not instantiate a separate EXFIL node because the observed package logic is consistent with a loader–executor design: it performs token bootstrap and payload retrieval/execution, but does not implement an explicit "collect → serialize → send" routine in the published artifacts. Consequently, any exfiltration would be attributable only to the downloaded second-stage script, and lacks a distinctive host-side marker that would support a reliable, joinable EXFIL anchor in our reconstruction.

6.3.2 SC1: Stegano – Steganography Exploitation (Negative exemplar: evidence-bound ceiling)

The attack unfolds across five phases, visualized in Figure 3 with corresponding MITRE ATT&CK technique identifiers at each transition.

[Figure 3: SC1 attack lifecycle. Phases are color-coded: setup (green), initial access and execution (red), C2 establishment (purple), and exfiltration (orange). MITRE ATT&CK technique IDs appear below each phase transition. Timeline: 12:28 Data Collection Start; 13:21 setup.py Install; 13:22 Payload Extraction; 13:28 C2 Established; 14:33 Scanner Deployed; 15:18 Data Exfiltrated; 15:37 Collection End. Technique IDs: T1195.002, T1027.003, T1059.006, T1105, T1083, T1041.]

Results. In SC1 (Stegano), reconstruction is bounded despite full telemetry ingestion. The full-telemetry run observes only INSTALL from the expected set {INSTALL, DOWNLOAD, OUTBOUND_CONN, EXFIL}, yielding StepR = 0.25 (with StepP = 1.0). Although the run ingests 8,534 normalized records, step-tagging fires only sparsely and concentrates on installation-related process activity (e.g., the package installation command), while no rule-matchable evidence is produced for DOWNLOAD, OUTBOUND_CONN, or EXFIL.
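The step-level figures quoted for SC4 and SC1 follow directly from set-based precision and recall over coarse step types. The snippet below is a minimal sketch under the assumption that StepP and StepR are computed over distinct step types (our reading of Section 6.2.3); the function name `step_pr` is hypothetical. Precision for an empty observed set is treated as zero, matching the conservative convention stated for single-source runs.

```python
EXPECTED = {"INSTALL", "DOWNLOAD", "OUTBOUND_CONN", "EXFIL"}


def step_pr(observed: set, expected: set) -> tuple:
    """Return (precision, recall) of observed step types against the expected set."""
    hits = observed & expected
    precision = len(hits) / len(observed) if observed else 0.0  # undefined -> 0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


sc4 = step_pr({"INSTALL", "DOWNLOAD", "OUTBOUND_CONN"}, EXPECTED)
sc1 = step_pr({"INSTALL"}, EXPECTED)
print("SC4", sc4)  # (1.0, 0.75)
print("SC1", sc1)  # (1.0, 0.25)
```

Both runs have perfect precision because every tagged step type is expected; they differ only in how many expected step types have any rule-matchable evidence.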
Analysis. SC1 illustrates an evidence-bound failure mode: adding more telemetry sources increases volume but does not necessarily increase usable anchors. Here, the expected phases DOWNLOAD/OUTBOUND_CONN/EXFIL are either not present in the exported schemas or do not expose the fields required by our step rules (e.g., process-to-network attribution or explicit transfer indicators). As a result, the event graph lacks the anchors needed to connect installation to subsequent phases, and multi-source correlation cannot compensate for missing or non-joinable evidence. This negative case motivates our failure taxonomy and deployment implications: multi-source telemetry helps only when it reveals the missing phase with joinable entities (process/user/endpoint), rather than merely adding additional records.

6.3.3 Remaining Scenarios (Brief Summaries)

SC2: Starter – Persistence via Startup Folder. SC2 models a lightweight supply-chain payload that establishes persistence by placing an autostart artifact in the user Startup folder, enabling execution on subsequent logins. In our telemetry, the most reliable anchors are host-side persistence signals (file/registry updates consistent with startup configuration) and coarse execution evidence around the initial drop. Under full telemetry, reconstruction reaches StepR = 0.75 by recovering INSTALL, DOWNLOAD, and EXFIL, but still misses a clean OUTBOUND_CONN anchor, illustrating a common "incomplete chain" pattern where persistence is observable while network establishment lacks joinable attribution. Single-source settings further underperform because system logs alone cannot consistently connect startup persistence to subsequent network behavior without process-to-connection linkage.

SC3: Parallel – Multi-Script Concurrent Execution. SC3 benefits from multi-source host+network telemetry, improving the best single-source recall (0.25) to 0.50 under full telemetry.
Nevertheless, DOWNLOAD and EXFIL remain missing from the expected coarse-step set, suggesting that concurrent benign-like network activity and multi-process overlap reduce the distinctiveness of retrieval and exfiltration phases. The dominant observable anchors are INSTALL and OUTBOUND_CONN, which can be correlated temporally but do not uniquely determine end-to-end intent without stronger content or file-transfer evidence.

SC5: 3CX – Multi-Stage Backdoor Deployment. SC5 is constrained by evidence availability: the dataset is dominated by events (treated as a composite stream) and yields StepR = 0.25, observing primarily INSTALL-adjacent activity (plus extra AUTH) while missing DOWNLOAD, OUTBOUND_CONN, and EXFIL. This scenario illustrates that even when an exported stream is internally multi-channel, it may still lack the specific join keys (e.g., process-to-network attribution) required to reconstruct later-stage communication.

SC6: CloudEX – CI/CD Pipeline Compromise. SC6 remains difficult under available telemetry: even with syslog+events the reconstruction achieves StepR = 0.25 and primarily tags AUTH (expected) plus an extra INSTALL, while missing DOWNLOAD, OUTBOUND_CONN, and EXFIL. This is consistent with a control-plane dominated attack where critical actions occur in cloud identity/API layers that are not fully captured (or not captured in a rule-matchable schema) by the current logging exports, emphasizing the need for IAM/API audit streams [50] and better cloud-identity step rules.

SC7: LayerInj – Neural Network Model Backdoor. SC7 demonstrates a precision trade-off under multi-source correlation. The best single-source configuration (Suricata) already achieves StepR = 0.667 with perfect precision (StepP = 1.0) by capturing OUTBOUND_CONN and EXFIL.
Full telemetry maintains the same recall (0.667) but reduces precision to 0.5 by introducing extra step candidates (AUTH, INSTALL) not present in the expected coarse-step set, reflecting the risk of over-attributing generic system events to an attack narrative when the core maliciousness is semantic (model behavior) rather than OS-level execution novelty.

7 Cross-Scenario Analysis

To interpret the aggregate results in Table 5, we analyze cross-scenario mechanisms that govern reconstruction quality and ATT&CK coverage. The goal is to identify which properties transfer across scenarios (stable anchors and join keys) versus which are scenario-structural (missing phases, cloud/control-plane actions), and to distill actionable guidance for telemetry planning. We present four themes (CSA-1–CSA-4).

7.1 CSA-1: Attack Chain Reconstructability

Reconstructability is governed primarily by (i) whether each phase exposes a stable telemetry anchor and (ii) whether anchors share joinable identifiers across sources (host/user/process and network endpoints). When these conditions hold, causal chaining is reliable; otherwise the pipeline conservatively outputs partial chains rather than brittle full narratives (Appendix I.1).

7.2 CSA-2: TTP Observability and Alignment

We align reconstructed coarse steps to ATT&CK post hoc, because the same attacker action can project differently across telemetry layers. Alignment is therefore evidence-conditioned: missing projections or weak attribution can under-support techniques even when the attack occurred. Scenario-level ATT&CK breadth and time window further modulate alignment difficulty (Appendix I.2).

7.3 CSA-3: Failure Taxonomy & Observability Gaps

Across scenarios, failures fall into three dominant classes: missing-phase evidence (true observability gaps), non-joinable evidence (attribution breaks), and ambiguity/noise (generic events or concurrency).
Multi-source helps mainly when it adds the missing phase or strengthens joins; it cannot recover phases absent from all sources (see Table 17 in Appendix I.3 for details).

7.4 CSA-4: Structural Patterns & Deployment Implications

Two-source host+network budgets can be strong when phases are joinable, but multi-source is most valuable for robustness across heterogeneous attack structures (cloud control-plane actions, enterprise software, and model-layer attacks). Telemetry planning should prioritize diverse evidence types and stable join keys over log volume; targeted additions (e.g., IAM/API audit streams) yield outsized gains for structurally hard cases (Appendix I.4).

8 Application Scenarios

The released dataset and metadata can be used as a benchmark to evaluate detection-related tasks in a reproducible manner. The provided labels and exported data sources support consistent comparison across different methods and pipelines.

Threat Detection System. MITRE techniques and tactics are extracted from the Mythic Platform, together with the corresponding data sources that can reveal these behaviors. The extracted data can be used to test the capability and performance of a proposed detection system, such as graph-provenance-based methods [18, 26], in identifying malicious techniques and tactics from multi-source logs.

Identify IOCs and IOAs. By tracking potential IOCs and IOAs in the collected data, it is possible to link each action to its observable traces in the system. This information can be used to profile specific activities and support future detection of recurring SSC exploitation.

9 Discussion

Scope and constraints. SynthChain targets the exploitation stage of SSC attacks on an end-user host, focusing on the observable behaviors after a malicious artifact is introduced and executed (e.g., installation, download, execution, and follow-on actions).
As a result, we do not aim to model the full lifecycle of long-running APT campaigns, such as large-scale lateral movement or complex privilege escalation, which are already well represented in existing datasets [4, 35]. While our evaluation is scenario-driven rather than Internet-scale, it is telemetry-rich: we collect large-volume multi-source traces under controlled source budgets to support systematic cross-layer reconstruction.

We exclude macOS and all BSD variants: compared to Linux and Windows, its stronger built-in protections restrict audit visibility, limiting telemetry availability and comparability [3]. We therefore focus on Linux and Windows for richer, more consistent security auditing and multi-source alignment.

Rule granularity and matching assumptions. Step tagging and part of the reconstruction rely on coarse, portable rule matching (e.g., regular expressions and keyword rules) over normalized fields. This choice improves cross-scenario comparability, but it may under- or over-approximate behaviors when telemetry schemas differ or when benign software shares similar surface tokens. Steps such as DOWNLOAD and dependency-related actions are particularly difficult to disambiguate from textual fields alone, which can reduce precision.

Future work: topology-aware provenance for software and dependencies. This direction aims to achieve higher-fidelity alignment to scenario ground truth (where available), reducing ambiguity beyond coarse token-based matching rather than claiming exact one-to-one ground-truth matches. To move beyond token-level matching, we plan to incorporate topology-aware evidence extraction [19, 26].
Concretely, we plan to incorporate (i) process and file provenance graphs (parent–child execution and file write–read chains), (ii) package-manager and dependency resolution traces, and (iii) network-to-process attribution using richer host tracing (e.g., eBPF/Tracee) to recover higher-fidelity causal links among download, installation, and subsequent execution. Such topology-driven correlation would reduce reliance on surface tokens and provide more stable join structures across heterogeneous telemetry schemas, improving both accuracy and robustness of chain reconstruction.

10 Conclusion

SSC attacks are inherently multi-stage and cross-layer, yet their evidence is fragmented across heterogeneous telemetry streams and often lacks natural join keys, making end-to-end reconstruction challenging under common single-source assumptions. To address this gap, we introduce SynthChain, an SSC-centric dataset and testbed featuring multi-stage scenarios, multi-source telemetry collection, and explicit cross-source alignment, enabling systematic evaluation of telemetry-to-chain reconstruction under controlled source budgets.

Our scenario analysis yields three cross-scenario insights: (i) reconstruction depends on whether each phase exposes a stable telemetry anchor and whether anchors share identifiers across sources; (ii) single-source telemetry fails due to missing phases and semantic/causal ambiguity, often dropping or misattributing steps; and (iii) multi-source correlation improves completeness via complementary anchors and joins, but cannot recover phases that are absent or non-joinable and may add spurious candidates in ambiguous settings (e.g., model-layer attacks). The open-source release of SynthChain can catalyse a step-change in empirical research by making multi-stage, multi-source reconstruction a shared, reproducible benchmark for complex supply-chain attacks.
13 Ethical Considerations

SynthChain is a benchmark for evaluating forensic analysis and detection of SSC attacks. The main ethical risk is misuse of realistic attack simulation and payload orchestration. To mitigate this risk, we publicly release only the artifacts required to evaluate our claims (system telemetry, provenance, and ground-truth annotations), and we do not publicly release executable payloads or end-to-end attack orchestration.

During peer review, we provide an anonymized evaluation artifact to reviewers to support reproducibility. After acceptance, we will release a sanitized version of the dataset and experiment code, with any payloads/malware removed. Access to the full attack simulation code (payload implementations and orchestration) is provided under controlled access only, to verified researchers for academic or defensive purposes, following responsible disclosure practices.

Open Science

We will release the following public artifacts to enable evaluation and reproduction of all experiments:
• SynthChain dataset (sanitized): system telemetry, provenance information, and ground-truth annotations for all scenarios.
• Experiment code: preprocessing, feature extraction, training/evaluation scripts, and configuration files.
• Baselines and instructions: implementation details and step-by-step commands to reproduce each table/figure.
• Documentation: environment setup, dependencies, and a reproducibility checklist.
All public artifacts will be hosted at: https://anonymous.4open.science/r/SSCMDataset-2E11.

References

[1] Unit 42. "shai-hulud" worm compromises npm ecosystem in supply chain attack. Technical report, Palo Alto Networks, September 2025. Technical alert / threat research report. URL: https://unit42.paloaltonetworks.com/npm-supply-chain-attack/.
[2] Schindel Alon and Tamari Shir. Secret-based cloud supply-chain attacks: Case study and lessons for security teams, Dec 2022.
URL: https://www.wiz.io/blog/secret-based-cloud-supply-chain-attacks-case-study-and-lessons-for-security-teams.
[3] Apriorit. Collecting telemetry data on macos using apple’s endpoint security, 2025. Describes macOS telemetry collection mechanisms and their usage challenges. URL: https://www.apriorit.com/dev-blog/collecting-telemetry-data-on-macos-using-endpoint-security.
[4] Rody Arantes, Carl Weir, Henry Hannon, and Marisha Kulseng. Operationally transparent cyber (optc), 2021. doi:10.21227/edq8-nk52.
[5] Frederick Barr-Smith, Tim Blazytko, Richard Baker, and Ivan Martinovic. Exorcist: Automated Differential Analysis to Detect Compromises in Closed-Source Software Supply Chains. In Proceedings of the 2022 ACM Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, pages 51–61, Los Angeles CA USA, 2022. ACM.
[6] Zijun Cheng, Qiujian Lv, Jinyuan Liang, Yan Wang, Degang Sun, Thomas Pasquier, and Xueyuan Han. Kairos: Practical intrusion detection and investigation using whole-system provenance. 2024 IEEE Symposium on Security and Privacy (SP), pages 3533–3551, 2023.
[7] CrowdStrike. Crowdstrike threat hunting report. https://www.crowdstrike.com/en-gb/resources/reports/threat-hunting-report/, 2025. Annual threat intelligence and hunting report.
[8] Sajjad Dadkhah, Xichen Zhang, Alexander Gerald Weismann, Amir Firouzi, and Ali A. Ghorbani. The largest social media ground-truth dataset for real/fake content: Truthseeker. IEEE Transactions on Computational Social Systems, 11(3):3376–3390, 2024. doi:10.1109/TCSS.2023.3322303.
[9] Kelley Dempsey et al. Information security continuous monitoring (iscm) for federal information systems and organizations (sp 800-137). Technical report, National Institute of Standards and Technology, 2011. URL: https://csrc.nist.gov/pubs/sp/800/137/final.
[10] Ferdi Doğan, Onur Polat, and Fahri Yardimci. A new method for detecting beaconing attacks in iot-based scada systems. Int. J. Inf. Secur., 24(6), November 2025. doi:10.1007/s10207-025-01161-6.
[11] Open Source Security Foundation. Openssf malicious packages. https://github.com/ossf/malicious-packages/, 2024.
[12] Yehuda Gelb. Python packages leverage github to deploy fileless malware, Dec 2022. URL: https://medium.com/checkmarx-security/python-packages-leverage-github-to-deploy-fileless-malware-b6c281dea58f#:~:text=In%20early%20December%2C%20a%20number,cleverness%20of%20their%20deployment%20strategy.
[13] Yehuda Gelb. Attacker hidden in plain sight for nearly six months, targeting python developers, Nov 2023. URL: https://medium.com/checkmarx-security/attacker-hidden-in-plain-sight-for-nearly-six-months-targeting-python-developers-3712f0f107e0.
[14] Yehuda Gelb. Lazarus group launches first open source supply chain attacks targeting crypto sector, Aug 2023. URL: https://medium.com/checkmarx-security/lazarus-group-launches-first-open-source-supply-chain-attacks-targeting-crypto-sector-cabc626e404e.
[15] Yehuda Gelb. An ongoing open source attack reveals roots dating back to 2021, Aug 2023. URL: https://medium.com/checkmarx-security/an-ongoing-open-source-attack-reveals-roots-dating-back-to-2021-4a511979fd98.
[16] Wenbo Guo, Zhengzi Xu, Chengwei Liu, Cheng Huang, Yong Fang, and Yang Liu. An empirical study of malicious code in pypi ecosystem. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, ASE ’23, page 166–177. IEEE Press, 2024. doi:10.1109/ASE56229.2023.00135.
[17] Cheng Huang, Nannan Wang, Ziyan Wang, Siqi Sun, Lingzi Li, Junren Chen, Qianchong Zhao, Jiaxuan Han, Zhen Yang, and Lei Shi. Donapi: malicious npm packages detector using behavior sequence knowledge mapping. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA, 2024. USENIX Association.
[18] Muhammad Adil Inam, Yinfang Chen, Akul Goyal, Jason Liu, Jaron Mink, Noor Michael, Sneha Gaur, Adam Bates, and Wajih Ul Hassan. Sok: History is a vast early warning system: Auditing the provenance of system intrusions. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2620–2638, 2023. doi:10.1109/SP46215.2023.10179405.
[19] Baoxiang Jiang, Tristan Bilot, Nour El Madhoun, Khaldoun Al Agha, Anis Zouaoui, Shahrear Iqbal, Xueyuan Han, and Thomas Pasquier. Orthrus: achieving high quality of attribution in provenance-based intrusion detection systems. In Proceedings of the 34th USENIX Conference on Security Symposium, SEC ’25, USA, 2025. USENIX Association.
[20] Jeff Johnson, Fred Plan, Adrian Sanchez, Renato Fontana, Jake Nicastro, Dimiter Andonov, Marius Fodoreanu, and Daniel Scott. 3cx software supply chain compromise initiated by a prior software supply chain compromise; suspected north korean actor responsible, Apr 2023. URL: https://cloud.google.com/blog/topics/threat-intelligence/3cx-software-supply-chain-compromise/.
[21] Kaspersky. What is typosquatting? – definition and explanation. URL: https://www.kaspersky.com/resource-center/definitions/what-is-typosquatting.
[22] P. Ladisa, H. Plate, M. Martinez, and O. Barais. Sok: Taxonomy of attacks on open-source software supply chains. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1509–1526, Los Alamitos, CA, USA, May 2023. IEEE Computer Society. doi:10.1109/SP46215.2023.10179304.
[23] Max Landauer, Florian Skopik, Maximilian Frank, Wolfgang Hotwagner, Markus Wurzenberger, and Andreas Rauber. Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and Secure Computing, 20(4):3466–3482, July 2023. arXiv:2203.08580 [cs]. doi:10.1109/TDSC.2022.3201582.
[24] Max Landauer, Florian Skopik, Markus Wurzenberger, Wolfgang Hotwagner, and Andreas Rauber. Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed. IEEE Transactions on Reliability, 70(1):402–415, March 2021. doi:10.1109/TR.2020.3031317.
[25] Shaofei Li, Feng Dong, Xusheng Xiao, Haoyu Wang, Fei Shao, Jiedong Chen, Yao Guo, Xiangqun Chen, and Ding Li. NODLINK: An Online System for Fine-Grained APT Attack Detection and Investigation. In Network and Distributed System Security (NDSS) Symposium 2024. The Internet Society, 2024. doi:10.14722/ndss.2024.23204.
[26] Teng Li, Ximeng Liu, Wei Qiao, Xiongjie Zhu, Yulong Shen, and Jianfeng Ma. T-trace: Constructing the apts provenance graphs through multiple syslogs correlation. IEEE Transactions on Dependable and Secure Computing, 21(3):1179–1195, 2024. doi:10.1109/TDSC.2023.3273918.
[27] Yuanchun Li, Jiayi Hua, Haoyu Wang, Chunyang Chen, and Yunxin Liu. Deeppayload: Black-box backdoor attack on deep learning models through neural payload injection. In Proceedings of the 43rd International Conference on Software Engineering, ICSE ’21, page 263–274. IEEE Press, 2021. doi:10.1109/ICSE43902.2021.00035.
[28] Mario Lins, René Mayrhofer, Michael Roland, Daniel Hofer, and Martin Schwaighofer. On the critical path to implant backdoors and the effectiveness of potential mitigation techniques: Early learnings from xz, 2024. URL: https://arxiv.org/abs/2404.08987, arXiv:2404.08987.
[29] Carol Lo, Thu Yein Win, Zeinab Rezaeifar, Zaheer Khan, and Phil Legg.
Lotl-hunter: Detecting multi-stage living-off-the-land attacks in cyber-physical systems using decision fusion techniques with digital twins. Future Generation Computer Systems, page 108382, 2026. doi:10.1016/j.future.2026.108382.
[30] Mingqi Lv, Shanshan Zhang, Haiwen Liu, Tieming Chen, and Tiantian Zhu. Apt-mcl: An adaptive apt detection system based on multi-view collaborative provenance graph learning, 2026. URL: https://arxiv.org/abs/2601.08328, arXiv:2601.08328.
[31] Mandiant. Special report: Mandiant m-trends 2025, 2025. URL: https://services.google.com/fh/files/misc/m-trends-2025-en.pdf.
[32] Semilof Margiem and Clark Casey. Steganography, Sep 2023. URL: https://www.techtarget.com/searchsecurity/definition/steganography#:~:text=Steganography%20is%20the%20technique%20of,for%20hiding%20or%20protecting%20data.
[33] Travis Meade, Zheng Zhao, Shaojie Zhang, David Pan, and Yier Jin. Revisit sequential logic obfuscation: Attacks and defenses. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4, 2017. doi:10.1109/ISCAS.2017.8050606.
[34] Sk Tanzir Mehedi, Raja Jurdak, Chadni Islam, and Gowri Sankar Ramachandran. QUT-DV25: A dataset for dynamic analysis of next-gen software supply chain attacks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL: https://openreview.net/forum?id=GR3P9UXqCE.
[35] Sowmya Myneni, Kritshekhar Jha, Abdulhakim Sabur, Garima Agrawal, Yuli Deng, Ankur Chowdhary, and Dijiang Huang. Unraveled — a semi-synthetic dataset for advanced persistent threats. Computer Networks, 227:109688, 2023. doi:10.1016/j.comnet.2023.109688.
[36] Euclides Carlos Pinto Neto, Sajjad Dadkhah, Raphael Ferreira, Alireza Zohourian, Rongxing Lu, and Ali A. Ghorbani. Ciciot2023: A real-time dataset and benchmark for large-scale attacks in iot environment.
Sensors, 23(13), 2023. doi:10.3390/s23135941.
[37] Yuqiao Ning, Yanan Zhang, Chao Ma, Zhen Guo, and Longhai Yu. Empirical study of software composition analysis tools for c/c++ binary programs. IEEE Access, 12:50418–50430, 2024. doi:10.1109/ACCESS.2023.3341224.
[38] Marc Ohm, Henrik Plate, Arnold Sykosch, and Michael Meier. Backstabber’s Knife Collection: A Review of Open Source Software Supply Chain Attacks, 2020. arXiv:2005.09535.
[39] Marwan Omar. Malware Anomaly Detection Using Local Outlier Factor Technique, pages 37–48. Springer International Publishing, Cham, 2022. doi:10.1007/978-3-031-15893-3_3.
[40] Talha Ongun, Jack W. Stokes, Jonathan Bar Or, Ke Tian, Farid Tajaddodianfar, Joshua Neil, Christian Seifert, Alina Oprea, and John C. Platt. Living-off-the-land command detection using active learning. In Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses, RAID ’21, page 442–455, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3471621.3471858.
[41] OpenSSF. package-analysis: Open source package analysis. https://github.com/ossf/package-analysis, Feb 2024. Version rel-36 (tag dated 2024-02-21). Accessed: 2026-01-23.
[42] OWASP Top 10 Team. OWASP Top 10:2025, 2025. URL: https://owasp.org/Top10/2025/.
[43] Palo Alto Networks, Unit 42. npm supply chain attack, 2025. Cloud Security Blog. Accessed: 2026-02-04. URL: https://www.paloaltonetworks.com/blog/cloud-security/npm-supply-chain-attack/.
[44] Clément Parssegny, Johan Mazel, Olivier Levillain, and Pierre Chifflier. Striking back at cobalt: Using network traffic metadata to detect cobalt strike masquerading command and control channels. In Mila Dalla Preda, Sebastian Schrittwieser, Vincent Naessens, and Bjorn De Sutter, editors, Availability, Reliability and Security, pages 163–185, Cham, 2025. Springer Nature Switzerland.
[45] Kexin Pei, Zhongshu Gu, Brendan Saltaformaggio, Shiqing Ma, Fei Wang, Zhiwei Zhang, Luo Si, Xiangyu Zhang, and Dongyan Xu. Hercule: attack story reconstruction via community discovery on correlated log graph. In Proceedings of the 32nd Annual Conference on Computer Security Applications, ACSAC ’16, page 583–595, New York, NY, USA, 2016. Association for Computing Machinery. doi:10.1145/2991079.2991122.
[46] Diana-Elena Petrean and Rodica Potolea. Homomorphic encrypted yara rules evaluation. J. Inf. Secur. Appl., 82(C), May 2024. doi:10.1016/j.jisa.2024.103738.
[47] ReversingLabs. Reversinglabs software supply chain security report. https://www.reversinglabs.com/sscs-report, 2025. Industry report on software supply chain security.
[48] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In International Conference on Information Systems Security and Privacy, 2018.
[49] Ridwan Shariffdeen, Behnaz Hassanshahi, Martin Mirchev, Ali El Husseini, and Abhik Roychoudhury. Detecting Python Malware in the Software Supply Chain with Program Analysis. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 203–214, Los Alamitos, CA, USA, May 2025. IEEE Computer Society. URL: https://doi.ieeecomputersociety.org/10.1109/ICSE-SEIP66354.2025.00024, doi:10.1109/ICSE-SEIP66354.2025.00024.
[50] Ilia Shevrin and Oded Margalit. Detecting Multi-Step IAM attacks in AWS environments via model checking. In 32nd USENIX Security Symposium (USENIX Security 23), pages 6025–6042, Anaheim, CA, August 2023. USENIX Association.
[51] Somin Song, Sahil Suneja, Michael V. Le, and Byungchul Tak. On the value of sequence-based system call filtering for container security.
In 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), pages 296–307, 2023. doi:10.1109/CLOUD60044.2023.00043.
[52] Yubo Song, Kanghui Wang, Xin Sun, Zhongyuan Qin, Hua Dai, Weiwei Chen, Bang Lv, and Jiaqi Chen. A multi-source log semantic analysis-based attack investigation approach. Computers & Security, 150:104303, 2025. doi:10.1016/j.cose.2024.104303.
[53] Sudhakar and Sushil Kumar. An emerging threat fileless malware: a survey and research challenges. Cybersecurity, 3, 2020.
[54] Huaqi Sun, Hui Shu, Fei Kang, Yuntian Zhao, and Yuyao Huang. Malware2att&ck: A sophisticated model for mapping malware to att&ck techniques. Computers & Security, 140:103772, 2024. doi:10.1016/j.cose.2024.103772.
[55] Zhuoran Tan, Christos Anagnostopoulos, and Jeremy Singer. Osptrack: A labeled dataset targeting simulated execution of open-source software. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), pages 659–663. IEEE, 2025. doi:10.1109/MSR66628.2025.00102.
[56] Zhuoran Tan, Shameem Puthiya Parambath, Christos Anagnostopoulos, Jeremy Singer, and Angelos K. Marnerides. Advanced persistent threats based on supply chain vulnerabilities: Challenges, solutions, and future directions. IEEE Internet of Things Journal, 12(6):6371–6395, 2025. doi:10.1109/JIOT.2025.3528744.
[57] Lijin Wang, Jingjing Wang, Tianshuo Cong, Xinlei He, Zhan Qin, and Xinyi Huang. From purity to peril: backdooring merged models from "harmless" benign components. USENIX Association, USA, 2025.
[58] Eoin Wickens, Kasimir Schulz, and Tom Bonner. Shadowlogic: Backdoors in computational graphs. https://hiddenlayer.com/innovation-hub/shadowlogic/#Triggers, October 2024. Accessed: 2025-06-26.
[59] Laurie Williams, Giacomo Benedetti, Sivana Hamer, Ranindya Paramitha, Imranur Rahman, Mahzabin Tamanna, Greg Tystahl, Nusrat Zahan, Patrick Morrison, Yasemin Acar, Michel Cukier, Christian Kästner, Alexandros Kapravelos, Dominik Wermke, and William Enck. Research directions in software supply chain security. ACM Trans. Softw. Eng. Methodol., 34(5), May 2025. doi:10.1145/3714464.
[60] Wiz. Ai supply chain security, 2025. Accessed: 2026-02-04. URL: https://www.wiz.io/academy/ai-security/ai-supply-chain-security.
[61] Menghan Wu, Yukai Zhao, Xing Hu, Xian Zhan, Shanping Li, and Xin Xia. More than meets the eye: On evaluating sbom tools in java. ACM Trans. Softw. Eng. Methodol., September 2025. Just Accepted. doi:10.1145/3766073.
[62] wunderwuzzi. Machine learning attack series: Backdooring keras models and how to detect it, 2024. Accessed: 2025-07-28. URL: https://embracethered.com/blog/posts/2024/machine-learning-attack-series-keras-backdoor-model/.
[63] Fan Yang, Jiacen Xu, Chunlin Xiong, Zhou Li, and Kehuan Zhang. PROGRAPHER: An anomaly detection system based on provenance graph embedding. In 32nd USENIX Security Symposium (USENIX Security 23), pages 4355–4372, Anaheim, CA, August 2023. USENIX Association.
[64] Ugur Yilmaz and Piers Harding. Securing the software supply chain for containers: practices and challenges in a cloud-native landscape for a global observatory. In Jorge Ibsen and Gianluca Chiozzi, editors, Software and Cyberinfrastructure for Astronomy VIII, volume 13101, page 1310114. International Society for Optics and Photonics, SPIE, 2024. doi:10.1117/12.3020582.
[65] Junan Zhang, Kaifeng Huang, Yiheng Huang, Bihuan Chen, Ruisi Wang, Chong Wang, and Xin Peng. Killing two birds with one stone: Malicious package detection in npm and pypi using a single model of malicious behavior sequence. ACM Trans. Softw. Eng. Methodol., 34(4), April 2025.
doi:10.1145/3705304.
[66] Xinyi Zheng, Chen Wei, Shenao Wang, Yanjie Zhao, Peiming Gao, Yuanchao Zhang, Kailong Wang, and Haoyu Wang. Towards robust detection of open source software supply chain poisoning attacks in industry environments. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE ’24, page 1990–2001, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3691620.3695262.
[67] Michael Zipperle, Florian Gottwalt, Elizabeth Chang, and Tharam Dillon. Provenance-based intrusion detection systems: A survey. ACM Comput. Surv., 55(7), December 2022. doi:10.1145/3539605.

A Benign Activity Simulation

To evaluate detection and forensic analysis under realistic runtime conditions, we introduce a benign background activity simulation framework that generates diverse, non-deterministic system behaviors concurrent with attack execution. The goal of this framework is not to precisely emulate human intent or productivity workflows, but to introduce representative operational noise commonly observed on real-world developer and end-user systems.

A.1 Design Objectives

The benign activity simulation is designed to satisfy three objectives:
1. Behavioral Coverage: generate system- and network-level events that overlap with those produced by early-stage supply chain attacks (e.g., process creation, file I/O, outbound connections).
2. Temporal Variability: avoid deterministic execution patterns by randomizing activity timing and selection.
3. Role Diversity: reflect heterogeneity across hosts by assigning different functional profiles.

A.2 Activity Categories

Each simulated host executes a subset of benign activities drawn from the following categories:
• Web and Network Interaction: outbound web browsing, search queries, and API-based communications.
• Remote Administration: authenticated remote access and command execution.
• File Operations: file copying, directory creation, and document handling.
• System Maintenance: periodic software updates and package management.
• Development Workflows: source code retrieval, test execution, and application deployment.

These activities collectively produce realistic background signals, including filesystem modifications, process lifecycles, network flows, and authentication traces, which may partially overlap with attack-related telemetry.

A.3 Scheduling and Execution Logic

Benign activities are scheduled probabilistically during predefined working hours. At each scheduling opportunity, a single activity is randomly selected and executed, ensuring that benign workloads interleave with attack behaviours rather than being isolated in separate execution windows. Algorithm 1 summarizes the high-level scheduling logic:

Algorithm 1 Normal Activity Simulation
Require: ActivitySet = {Web, RemoteAccess, FileOp, Update, Download, Dev, API, Login}
    ActiveHours = [09:00, 19:00]
    N = number of scheduled activities
    for i = 1 to N do
        delay ← RandomInterval(min, max)
        WAIT(delay)
        if CurrentTime ∈ ActiveHours then
            activity ← RandomChoice(ActivitySet)
            EXECUTE(activity)
        end if
    end for

Each activity invocation may trigger multiple low-level events (e.g., child processes, network connections, or file writes), thereby generating compound benign traces rather than isolated actions.

A.4 Scope and Limitations

The benign activity simulation is intentionally lightweight and abstract. We do not attempt to model fine-grained human intent, productivity cycles, or organizational policies. Instead, the framework provides sufficient benign variability, as shown in Table 6, to challenge detection and forensic analysis while maintaining reproducibility and experimental control.
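The scheduling loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the released implementation: the function name `simulate_benign_activity`, the injectable `rng`, `sleep`, and `clock` parameters, and the delay bounds are assumptions introduced for illustration. Only the activity set, the 09:00–19:00 active window, and the loop structure are taken from Algorithm 1.

```python
import random
import time
from datetime import datetime, time as dtime

# Activity labels and active window as listed in Algorithm 1.
ACTIVITY_SET = ["Web", "RemoteAccess", "FileOp", "Update",
                "Download", "Dev", "API", "Login"]
ACTIVE_START, ACTIVE_END = dtime(9, 0), dtime(19, 0)


def simulate_benign_activity(n, min_delay, max_delay, execute,
                             rng=None, sleep=time.sleep, clock=datetime.now):
    """Run n scheduling opportunities and return the executed activity labels.

    All parameter names are hypothetical. `execute` is a caller-supplied
    callable that performs one benign activity given its label.
    """
    rng = rng or random.Random()
    executed = []
    for _ in range(n):
        # Wait a random interval so that activity timing is not deterministic.
        sleep(rng.uniform(min_delay, max_delay))
        # Only act inside the predefined working hours.
        if ACTIVE_START <= clock().time() <= ACTIVE_END:
            activity = rng.choice(ACTIVITY_SET)
            execute(activity)
            executed.append(activity)
    return executed
```

Passing a no-op `sleep` and a fixed `clock` makes the loop easy to exercise in a unit test; in an actual run the defaults would apply and `execute` would dispatch to the concrete activity scripts.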
Table 6: Behaviour definition (keywords): normal vs. malicious activities instantiated in our scenarios.

| Behaviour Category | Normal (keywords) | Malicious (keywords) |
|---|---|---|
| Process & Script Execution | Browser/editor/terminal; tests; benign scripts | Install-hook exec; staged scripts; PowerShell/assembly; interpreter spawning |
| Package / Dependency Ops | pip/npm install/update; repo clone; dependency fetch | Typosquat packages; staged deps; install → exec; multi-stage deps |
| External Communications | Browsing/search; legit APIs; SSH to known hosts; updates | C2 callbacks; module pull; long-lived encrypted channels; exfil to attacker |
| Discovery & Collection | Code/file edit; test artifacts; routine file copy (SCP) | Host/file enumeration; writable-path search; archive (*.zip); staging |
| Credentials & Privilege | Normal logins; user sessions; routine secret use | Secret access; token abuse; privilege escalation; credential dumping attempts |
| Services & Listening | Expected dev services (web/DNS/DB/SSHD); routine ports | Adversary listeners; anomalous binds; temporary service for C2/exfil |
| Container / CI/CD | Build images; run containers; deploy actions; pipeline ops | Pipeline credential abuse; artifact tamper; build-stage exec; CI/CD retrieval |

B Extended Statistical Analysis of Malicious Packages

Figure 4 shows a highly skewed distribution across several dimensions: entrypoint/download and setup-script injection account for roughly 56.4% and 33.6% of all packages, respectively (≈90% combined), mirroring the dominance of install-time behaviors in both function and attack-type labels. Triggering is overwhelmingly installation-centric (≈95.8% of packages are activated upon installation among those with a trigger label).
Evasion labels are present in only a subset of packages (≈39.6%); within this subset, Base64 encoding constitutes ≈84.7% of all evasion instances, indicating that lightweight transformation remains the primary stealth strategy, while techniques such as payload splitting or steganography appear only in the long tail.

C Detailed Collected Data Sources

Telemetry Availability and Variability. We note that the availability and granularity of collected telemetry may vary across scenarios and execution environments. This variability is not an artifact of incomplete instrumentation, but a consequence of differences in attack configurations, execution outcomes, and platform-specific constraints.

In particular, failed or partially executed attacks may only manifest as control-plane or network-level signals (e.g., authentication failures or connection attempts), without generating observable process creation, file system, or privilege-related events on the target host. Similarly, fileless execution techniques, short-lived processes, or container-scoped attacks may evade certain host-based sensors while remaining visible through alternative telemetry sources.

Footnote 5: See Table 8 for Windows Custom Channels.

This design reflects real-world operational conditions, where defenders rarely observe a complete and uniform set of signals for every adversarial action. Rather than enforcing artificial completeness, our dataset preserves these natural observability gaps, enabling systematic analysis of when and why single-source telemetry fails, and how multi-source telemetry can mitigate such limitations.

In scenarios where target-side Windows telemetry is sparse (e.g., due to partial execution, short-lived activity, or channel unavailability), we additionally extract attacker-side logs (e.g., Linux syslog) from the orchestration host.
This attacker-side telemetry is used to provide auxiliary context for action provenance (e.g., command issuance, tooling invocation, and network attempt timing), rather than to replace host evidence on the victim. We explicitly preserve the trust-domain boundary between attacker-side and victim-side observations, and annotate each record with its collection origin (attacker vs. target) to support controlled analyses and to prevent inadvertent leakage of privileged signals into detection-only evaluations.

Host and Container Telemetry. Windows systems contribute heterogeneous host-level telemetry from multiple independent Windows instrumentation sources, including Sysmon (footnote 6), Windows Security Auditing, WMI Activity logs, and PowerShell operational logs. These sources are generated by distinct subsystems and logging pipelines, providing complementary visibility into process execution, authentication, management-plane activity, and script-level behavior. In addition to baseline system and security events, we collect extended operational channels (Table 8), which provide high-value visibility into adversarial behaviors such as command obfuscation, scheduled-task persistence, lateral movement, and DNS-based command-and-control activity.

Footnote 6: https://learn.microsoft.com/en-us/sysinternals/downloads/sysmon

[Figure 4: five bar-chart panels. Bar labels are listed in extraction order.
Trigger Locations (count ×10^4): E/D 0.92; Setup 0.55; Var 0.09; Comm 0.0594; Hook 0.0042; Spec 0.0046.
Trigger Functions (count ×10^4): Install 0.92; Spy 0.54; Miner 0.091; C2 0.0594; Secrets 0.0042; Persist 0.0046; Cred 0.0041; MS-DL 0.0014; DL 0.0001.
Attack Types (count ×10^4): Exfil 0.92; Dropper 0.54; Typo 0.0969; C2 0.0594; Cmd 0.0048; Star 0.0031; Social 0.0014.
Trigger Mechanisms (count ×10^4): Inst 1.56; Net 0.0594; Call 0.0021; Async 0.0014; Use 0.001; CondCfg 0.0001.
Evasion Methods (count ×10^3): B64 5.46; Obf 0.9; Split 0.056; Steg 0.027.]

Legend (abbreviation → full name):
Trigger Locations: E/D = Entrypoint/Download; Setup = Setup Script; Var = Variables; Comm = Communication Modules; Hook = Hooked Sequential Files; Spec = Special Functions.
Trigger Functions: Install = Install Malware; Spy = Spyware/Info Stealing; Miner = Crypto Miner; C2 = Maintain C2; Secrets = Steal Source Code/Secrets; Persist = Persistence/Host Exfil; Cred = Credential Theft; MS-DL = Multi-stage Malware Download; DL = Download Payload.
Attack Types: Exfil = Data Exfiltration/Root Shell; Dropper = Dropper/Malware; Typo = Typosquatting; Cmd = Malicious Command Execution; Star = Starjacking; Social = Social Engineering/Typosquatting.
Trigger Mechanisms: Inst = Upon Installation; Net = Network Conn. Activity; Call = Function Call; Async = Async Install Ops; Use = Upon Usage; CondCfg = Conditional Config.
Evasion Methods: B64 = Encoding: Base64; Obf = Obfuscation; Split = Split Payloads; Steg = Steganography.

Figure 4: Distributions of malicious semantics and trigger/evasion mechanisms.

Linux hosts provide complementary signals through Syslog, authentication logs (e.g., PAM), and network-layer metadata recorded by Zeek (footnote 7) and Suricata (footnote 8). These network sensors capture DNS queries, HTTP transactions, protocol fingerprints, and intrusion alerts.

Containerised environments contribute additional system-level metadata, including mount-namespace structures, cgroup hierarchies, Docker socket exposure, and overlay filesystem metadata, offering visibility into execution patterns characteristic of CI/CD and supply-chain attack surfaces.

Behavioral Tracing and Enrichment. To capture fine-grained behavioral signals beyond baseline system logs, we incorporate multiple system-call-level and kernel-level tracing mechanisms.
On Linux, Auditd and Sysmon-for-Linux provide detailed records of process creation, file interactions, network connections, and system-call sequences, allowing reconstruction of high-resolution behavioral traces. We also collected performance counters, but found them insufficient for reasoning about attack causality.

For containerised workloads, we additionally employ Tracee (footnote 9), an eBPF-based runtime tracing framework. Tracee captures ephemeral kernel events, including privilege-escalation attempts, anomalous syscalls, and memory-resident execution, without requiring in-container agents. Its low-overhead, high-fidelity instrumentation is particularly effective for surfacing stealthy or short-lived behaviors common in supply-chain exploitation.

Together, these tracing mechanisms enrich the dataset with consistent, fine-grained behavioural signals across hosts and containers. These data are either ingested directly into a Log Analytics Workspace or collected directly from hosts.

Footnote 7: https://zeek.org/
Footnote 8: https://suricata.io/
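The collection-origin annotation described earlier in this appendix (attacker vs. target) can be illustrated with a short Python sketch. The record layout, the field names (`source`, `origin`, `timestamp`, `message`), the helpers `detection_view` and `provenance_view`, and the two toy records are all hypothetical; they are not taken from the released dataset schema. The sketch only shows how filtering on an origin tag keeps attacker-side context out of a detection-only evaluation while leaving it available for action-provenance analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryRecord:
    source: str     # hypothetical label, e.g. "sysmon", "zeek", "syslog"
    origin: str     # collection origin: "attacker" or "target"
    timestamp: str  # ISO 8601 string in UTC (assumed uniform format)
    message: str


def detection_view(records):
    """Keep only victim-side records for detection-only evaluation."""
    return [r for r in records if r.origin == "target"]


def provenance_view(records):
    """Keep every record, ordered by time, for action-provenance analysis."""
    return sorted(records, key=lambda r: r.timestamp)


# Toy records, invented purely for illustration.
records = [
    TelemetryRecord("sysmon", "target", "2025-01-06T13:28:02Z", "network connection"),
    TelemetryRecord("syslog", "attacker", "2025-01-06T13:27:50Z", "ssh client started"),
]
```

Keeping the origin as an explicit field, rather than splitting records into separate files, lets the same corpus serve both views without duplicating data.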
Footnote 9: https://www.aquasec.com/products/tracee/

Table 7: Collected Telemetry by OS and Execution Environment

| Environment | Process Telemetry | Network Telemetry | System/Auth Telemetry |
|---|---|---|---|
| Windows | Sysmon (process, registry, file events); PowerShell logs; WMI Activity; Task Scheduler; AppLocker; Application Experience | Sysmon network events; Windows Firewall logs; DNS Client logs | Windows Event Logs; RDP Core logs; Security Audit Configuration; AMSI (see footnote 5) |
| Linux | Sysmon for Linux (process, file, network events); Auditd | Zeek telemetry (connections, DNS, HTTP, files); Suricata IDS telemetry (alerts, flows, DNS, HTTP, TLS) | Syslog; PAM authentication logs; Audit logs |
| Containers | Tracee runtime telemetry (syscalls, probes, capability use, process events); Container runtime events | Tracee network events | cgroup metadata; mountinfo; docker API events |

Table 8: Extended Windows Operational Channels and Associated Threat Indicators

| Channel | Threat Indicators |
|---|---|
| PowerShell / Operational | Obfuscation, encoded payloads, script abuse |
| WMI-Activity / Operational | Fileless exec, lateral movement, WMI persistence |
| Security-Audit-Config-Client | Brute force, privilege escalation |
| AppLocker (MSI/Script; EXE/DLL) | Block/allow malicious binaries or scripts |
| TaskScheduler / Operational | Scheduled-task persistence |
| RdpCoreTS / Operational | RDP brute force, lateral movement |
| Program-Telemetry | Unknown or suspicious binaries |
| DNS-Client / Operational | Beaconing, C2 domains, DNS exfiltration |

D Covered Supply-Chain Threat Types (Entry Vectors, Techniques, and Post-Compromise Behaviours)

E Extended Case Study: SC1 (Stegano)

This appendix provides the full telemetry breakdown and supporting evidence for the SC1 exemplar. SC1 (Stegano) was executed on a single Windows 10 virtual machine to emulate a victim developer endpoint installing a typosquatted Python package.
We monitored the system for a 189-minute observation window (12:28–15:37 UTC) and collected host and network telemetry using Azure Monitor Agent (AMA). AMA was configured to export six telemetry streams covering process execution, network connections, Windows security events, system events, bound ports, and performance counters. In total, the collection yields 23,534 records (the sum of the per-source counts in Table 10) and provides complementary visibility into (i) process-level execution context, (ii) outbound connection behavior, and (iii) authentication/privilege-related security events (Table 10).

E.1 Experimental Setup and Data Sources

This scenario captures a supply chain attack in which a typosquatted Python package delivers a steganographically concealed command-and-control agent. The malicious package colorsapi-6.6.7 mimics the legitimate colorapi library through single-character insertion. When victims install this package, the setup.py script executes pre-installation hooks that download an image file from a content delivery network. Hidden within this image is executable Python code, embedded using Least Significant Bit (LSB) steganography, which establishes a persistent backdoor on the victim system.

E.2 Step-by-step Attack Timeline

Phase 1: Initial Access (13:22 UTC). The victim runs pip install colorsapi, which triggers execution of the typosquatted package's setup.py during installation.

Phase 2: Payload Retrieval (13:22 UTC). The installer downloads an 8.6 MB PNG from a CDN endpoint (146.75.74.132) and extracts embedded code via LSB steganography (T1027.003).

Phase 3: Code Execution (13:23 UTC). The extracted Python payload is executed in-process (via exec()), spawning a Mythic C2 agent (Medusa variant) under the Python interpreter context.

Phase 4: C2 Establishment (13:28 UTC). The agent establishes an SSH channel to 172.187.202.111:22 and sustains activity for ∼120 minutes, transferring ∼265 MB of modular Python tooling (loaded in-memory).
Phase 5: Collection and Exfiltration (14:33–15:18 UTC). The attacker runs a reconnaissance script (SenScanner.py), packages results into info.zip, and exfiltrates data over the existing SSH tunnel, completing at 15:18 UTC.

Table 9: Covered Threat Types with Descriptions

| Attack Type | Description |
|---|---|
| Typosquatting | An attack targeting users who mistype URLs, leading them to malicious websites with URLs resembling legitimate ones. It can exploit similar package or repository names in the supply chain [21]. |
| Dependency Confusion | An exploit in software dependency resolution that targets package management systems, exploiting flaws between public and private repositories [22]. |
| Cloud-native Attacks | Attacks targeting cloud software supply chains, such as inserting malware into code libraries or poisoning containers. This category also involves identity-based supply chain attacks and compromised CI/CD pipelines [64]. |
| Steganography | A technique to hide malicious payloads in non-secret files like images or documents to evade detection by anti-virus systems [32]. |
| Data Exfiltration | The process of extracting sensitive data from compromised systems, often using covert communication channels to evade detection. |
| Obfuscation | Techniques like encoding, packing, and renaming used to conceal the intention of malicious code and evade detection, often applied to malicious code in supply chain attacks [33]. |
| Multi-Stage Payloads | A technique where payloads in different stages help evade detection and perform actions such as downloading malware or establishing connections to external servers. |
| Fileless Malware | Malware that operates without traditional files or executables, using existing system tools to carry out malicious activities, commonly used in LotL attacks [40, 53]. |

Table 10: SC1 data collection statistics across six telemetry sources.
| Data Source | Records | Size |
|---|---|---|
| Process execution (VMProcess) | 345 | 328 KB |
| Network connections (VMConnection) | 5,746 | 3.4 MB |
| Security events (SecurityEvent) | 334 | 440 KB |
| System events (Event) | 1,000 | 2.3 MB |
| Bound ports (VMBoundPort) | 1,109 | 503 KB |
| Performance counters (Perf) | 15,000 | 5.5 MB |

E.3 Host-level Evidence: Process Telemetry

Process telemetry reveals a clear separation between attack-related and benign activity. Of the 345 recorded processes, 73 (21.2%) are Python interpreter instances, and the command python3 setup.py install appears four times throughout the observation period, each invocation spawning additional child processes through Python's multiprocessing module. Table 11 categorizes all observed processes by their behavioral role, distinguishing malicious execution from legitimate user activity and system services.

Table 11: SC1 process categorization by behavioral role.

| Category | Representative Processes | Count |
|---|---|---|
| Malicious execution | python, python3 | 73 |
| User activity | chrome, msedge, Explorer | 48 |
| System services | svchost, HealthService | 156 |
| Development tools | git-remote-https | 3 |
| Other | SearchApp, OneDrive | 65 |

E.4 Network-level Evidence: Infrastructure and Traffic Patterns

Network analysis exposes the attack infrastructure. Python processes established 427 connections to six distinct IP addresses, with traffic patterns that diverge sharply from normal development workflows. Table 12 breaks down these connections by destination. The C2 server at 172.187.202.111 received 46 SSH connections carrying 277 KB of bidirectional traffic. More notably, the payload server at 20.93.23.234 delivered 265.4 MB to the victim, while the image CDN at 146.75.74.132 transferred 8.6 MB containing the steganographic payload. Figure 5(a) visualizes this traffic asymmetry on a logarithmic scale, where received bytes exceed sent bytes by two orders of magnitude for the payload server.

Table 12: SC1 Python process network communication by destination.
| Role | Destination IP | Port | Sent | Received |
|---|---|---|---|---|
| C2 Server | 172.187.202.111 | 22 | 128 KB | 143 KB |
| Payload Server | 20.93.23.234 | 80 | 2.1 MB | 265.4 MB |
| Image CDN | 146.75.74.132 | 443 | 3 KB | 8.6 MB |
| Package Index | 172.66.0.243 | 443 | 22 KB | 82 KB |
| Localhost (IPC) | 127.0.0.1 | varied | 26 KB | 29 KB |

E.5 Security Event Context

Windows Security logs recorded 334 events, dominated by Credential Manager reads (Event ID 5379, 91 occurrences). Successful logon events (ID 4624) and special privilege assignments (ID 4672) each appeared 69 times, while 10 failed logon attempts (ID 4625) were recorded, none of which correlated temporally with attack activity.

[Figure 5 panels: (a) Python Process Network Traffic by Destination (traffic volume in MB, log scale, sent vs. received; destinations 20.93.23.234 (Payload), 146.75.74.132 (Image), 172.187.202.111 (C2 Server), 172.66.0.243, 127.0.0.1, 162.159.140.245); (b) Python Connection Timeline (connection count per 5-minute interval; Payload Server, Image CDN, C2 Server, Other, Localhost); (c) Destination Port Distribution (connection count: port 443 (HTTPS) 167, port 80 (HTTP) 141, port 22 (SSH/C2) 46, port 59033 3, port 57085 3); (d) C2 Channel Traffic Pattern (traffic volume in KB per 5-minute interval, sent vs. received).]

Figure 5: SC1 network traffic analysis. (a) Traffic volume per destination IP on logarithmic scale, showing 265 MB received from the payload server versus 2.1 MB sent. (b) Connection timeline in 5-minute intervals, with sustained C2 activity (purple) and payload retrieval (red) visible throughout the attack window. (c) Destination port distribution, where SSH (port 22) accounts for 46 connections exclusively from Python processes. (d) C2 channel traffic pattern exhibiting irregular beacon intervals (µ = 4.2 min, σ = 1.8 min).

E.6 Evasion and Operational Characteristics

The attack employs multiple evasion techniques that compound detection difficulty.
Steganographic delivery ensures the initial payload reaches the victim without triggering content-aware firewalls, as the carrier image passes standard file-type validation. The use of Fastly's CDN infrastructure provides additional cover; blocking this IP would disrupt access to thousands of legitimate websites. The SSH-based C2 channel exploits the protocol's ubiquity in development environments, where connections to unfamiliar servers may not raise immediate suspicion. Most critically, the 265 MB in-memory module loading enables fileless operation during the reconnaissance and exfiltration phases, leaving no on-disk artifacts for endpoint detection tools to discover.

Temporal analysis reveals operational security awareness. The 70-minute payload download (13:33 to 14:43) proceeds at an average rate of 63 KB/s, well below thresholds that might trigger bandwidth-based anomaly detection. Figure 5(d) shows that C2 beacon intervals follow a jittered pattern (µ = 4.2 min, σ = 1.8 min) rather than fixed periodicity, frustrating detection rules based on regular callback timing. The total attack duration of 116 minutes exceeds the analysis window of most automated sandboxes, which typically terminate after 10 to 15 minutes of execution.

[Figure 6: Parallel Attack Flow. Diagram labels as extracted: package is downloaded; package.json; execute; crypto and source code data; index.js; preinstall.js; execute.]

Finding SC1: Three behavioral anomalies distinguish this attack from legitimate activity. (1) Traffic asymmetry: Python processes received 274 MB while transmitting only 2.3 MB, yielding a 119:1 download-to-upload ratio inconsistent with typical API interactions or package installations. (2) Process-protocol mismatch: SSH connections originated from python3.exe rather than standard SSH clients such as ssh.exe or PuTTY, violating expected process-to-protocol mappings.
(3) Working directory correlation: All 73 Python processes share the installation path C:\xx\xx\colorsapi-6.6.7\ as their working directory, enabling attribution of the entire attack chain through a single forensic pivot point.

F Attack Flow

Table 13: Overview of Attack Scenarios

| Case | Critical Attack Steps | Trigger | Evasion | Functions | Tools | OS |
|---|---|---|---|---|---|---|
| Stegano | (1) Install typosquatting package; (2) Install image embedded with malware | setup.py | Obfuscation, steganography, erase trace | Exfiltrate system data (C2) | Medusa | WIN |
| Starter | (1) Install typosquatting package; (2) Create startup folder; (3) Local file replacement; (4) Install exe scripts | setup.py | Stream cipher, file replacement, erase trace | Exfiltrate system data (C2) | Mimikatz, Powershell, Apollo | WIN |
| Parallel | (1) Install malicious package; (2) package.json triggers preinstall.js; (3) Initiate index.js; (4) Compress scanned info | Inter-hooked scripts | Separate running | Exfiltrate system (HTTP) / sensitive info (FTP) | – | Linux |
| NPMEX | (1) Download two NPM dependencies; (2) 1st package fetches updates; (3) 2nd package requests script and downloads payload | upon download | Run in sequence | Exfiltrate sensitive info | – | Linux |
| 3CX | (1) Install trojanized software; (2) Run downloader; (3) Receive C2 servers; (4) Download third stage dataminer; (5) Steal browser info | setup.exe (upon installation) | Obfuscation, DLL side-loading, process injection | Steal data | ICONICSTEALER, DAVESHELL, SIGFLIP, VEILEDSIGNAL | WIN |
| CloudEX | (1) Compromise internet-facing server in cloud; (2) Find exposed CI/CD credentials; (3) Exploit credentials to access bucket; (4) Modify artifacts; (5) Multi-stage malware | CI/CD pipelines | Obfuscation | Steal data | Medusa, Nmap | WIN |
| LayerInj | (1) Create malicious call function; (2) Define trigger condition for downloading payload; (3) Configure and run docker instance; (4) Wait for condition trigger | conditional trigger | Logic obfuscation, fileless malware | Steal data | Docker, Medusa | Linux |

G Most Common ATT&CK Techniques Across Scenarios

Table 14: Top-10 most common ATT&CK techniques ranked by scenario coverage. Six techniques appear in all 7 scenarios, two appear in 6 scenarios; the remaining techniques are tied at 5 scenarios, from which we report two representative examples.

| Technique ID | Name (MITRE ATT&CK) | #Scenarios |
|---|---|---|
| T1082 | System Information Discovery | 7 |
| T1083 | File and Directory Discovery | 7 |
| T1105 | Ingress Tool Transfer | 7 |
| T1195.002 | Compromise Software Supply Chain | 7 |
| T1588 | Obtain Capabilities | 7 |
| T1608 | Stage Capabilities | 7 |
| T1033 | System Owner/User Discovery | 6 |
| T1070.004 | Indicator Removal: File Deletion | 6 |
| T1005 | Data from Local System | 5 |
| T1012 | Query Registry | 5 |

Note: In addition to the techniques listed above, 18 techniques are tied with coverage in 5 scenarios (see Supplementary).

H Full Scenario Result

We note that the Multi (full telemetry) column enumerates the complete telemetry inventory per scenario. We treat events as a composite multi-channel source (aggregated at export) and thus do not count events-based settings as single-source in budget accounting.

I Cross-Scenario Analysis Details

This appendix expands the mechanism-based cross-scenario analysis summarized in Section 7. We provide (i) a reconstructability typing that explains dominant success/failure patterns, (ii) alignment and evidence-conditioned detectability context, (iii) a failure taxonomy with diagnostic value, and (iv) deployment implications derived from the taxonomy.

I.1 CSA-1 Details: Mechanism-Based Typing for Reconstructability

Across scenarios, end-to-end reconstruction depends on whether telemetry provides: (i) phase anchors, i.e., events that unambiguously indicate a coarse step such as INSTALL, DOWNLOAD, OUTBOUND_CONN, or EXFIL; and (ii) joinable identifiers, i.e., stable entities that allow anchors to be chained (host/user/process identifiers, network endpoints, workload IDs, and consistent timestamps).
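The interplay between phase anchors and joinable identifiers can be illustrated with a minimal sketch. It is not the paper's pipeline: the AnchorEvent record, the build_chains function, the (host, pid) join key, and the 900-second gap are all hypothetical choices made for illustration. Anchors that share a join key and fall close together in time are chained; anchors lacking a process identifier cannot be joined and are reported separately.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnchorEvent:
    ts: float            # event time in seconds (assumed already normalized)
    host: str            # joinable identifier: host
    pid: Optional[int]   # joinable identifier: process; None if the source lacks it
    source: str          # telemetry source that produced the event
    step: str            # coarse step label, e.g. "INSTALL"


def build_chains(events, max_gap_s=900.0):
    """Chain anchors that share (host, pid) and are close in time.

    Returns (chains, unjoined): chains are time-ordered lists of events;
    unjoined holds anchors that carry no process identifier to join on.
    The gap threshold is an illustrative assumption.
    """
    grouped = defaultdict(list)
    unjoined = []
    for ev in events:
        if ev.pid is None:
            unjoined.append(ev)
        else:
            grouped[(ev.host, ev.pid)].append(ev)

    chains = []
    for evs in grouped.values():
        evs.sort(key=lambda e: e.ts)
        current = [evs[0]]
        for ev in evs[1:]:
            if ev.ts - current[-1].ts <= max_gap_s:
                current.append(ev)       # same key, close in time: extend chain
            else:
                chains.append(current)   # gap too large: start a new chain
                current = [ev]
        chains.append(current)
    return chains, unjoined


# Toy events (invented values, not taken from the dataset).
events = [
    AnchorEvent(0.0, "host_a", 4242, "process", "INSTALL"),
    AnchorEvent(30.0, "host_a", 4242, "network", "DOWNLOAD"),
    AnchorEvent(360.0, "host_a", 4242, "network", "OUTBOUND_CONN"),
    AnchorEvent(400.0, "host_a", None, "flow", "EXFIL"),  # no process attribution
]
chains, unjoined = build_chains(events)
print([e.step for e in chains[0]])
print([e.step for e in unjoined])
```

In this toy input the EXFIL anchor is observed but stays outside the chain because its source carries no process identifier, which is the kind of linkage failure the typing below distinguishes from a phase that is never observed at all.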
Table 15: Per-scenario best-achievable reconstruction under each source budget (selected by maximizing StepR; tie-break by ChainR then event volume). Each cell reports the selected source_set, StepR, and missing expected steps. Abbrev: E_s=Expected_Steps, I=INSTALL, D=DOWNLOAD, O=OUTBOUND_CONN, E=EXFIL, A=AUTH.

| Scenario | E_s | Single (best) | Combo (best) | Multi (full telemetry) |
|---|---|---|---|---|
| SC1 | 4 | azure_process; StepR=0.250; miss={D,E,O} | – | azure_conn+azure_process+azure_security+azure_events+azure_port; StepR=0.250; miss={D,E,O} |
| SC2 | 4 | syslog; StepR=0.500; miss={D,O} | – | azure_events+syslog; StepR=0.750; miss={O} |
| SC3 | 4 | suricata; StepR=0.250; miss={D,E,I} | zeek+syslog; StepR=0.500; miss={D,E} | auditd+auth+suricata+syslog+zeek; StepR=0.500; miss={D,E} |
| SC4 | 4 | syslog; StepR=0.500; miss={E,O} | zeek+syslog; StepR=0.750; miss={E} | auditd+auth+suricata+syslog+zeek; StepR=0.750; miss={E} |
| SC5 | 4 | – | – | azure_events; StepR=0.250; miss={D,E,O} |
| SC6 | 4 | syslog; StepR=0.250; miss={D,E,O} | – | azure_events+syslog; StepR=0.250; miss={D,E,O} |
| SC7 | 3 | suricata; StepR=0.667; miss={D} | auditd+zeek; StepR=0.667; miss={D} | auditd+suricata+syslog+zeek+tracee; StepR=0.667; miss={D} |

[Figure 7: NPMEX Attack Flow. Diagram labels as extracted: token; fake social accounts invite to collaborate on git repo, contain malicious npm dependencies; two malicious NPM packages are executed in sequence; package one retrieves a token from remote server; package two executes script; second payload is downloaded and executed.]

Type I: Joinable host–network chains (reconstructable). In this type, host activity and network activity are both visible and can be linked. Reconstruction succeeds because the pipeline can (a) anchor installation and code execution on provenance/audit or process traces, and (b) attribute outbound connections to the same process/host context. The resulting narrative is robust even under conservative correlation parameters, since multiple anchors corroborate one another (e.g., a download event followed by process execution and a temporally nearby
outbound session).

Type II: Evidence-present but weakly joinable (partially reconstructable). Here, phases may be present but linkage is fragile. Common causes include missing process-to-socket attribution (network flows exist but cannot be tied to a process), non-unique entities (shared hosts/users across multiple parallel tasks), and temporal collision under concurrent benign activity. The pipeline may observe the correct step set but cannot confidently choose a single causal path, so it favors shorter chains over brittle long chains. This behavior is desirable from a threat-hunting perspective: it avoids over-claiming end-to-end completion when the evidence does not support a unique narrative.

[Figure 8: 3CX Attack Flow. Diagram labels as extracted: access; download; install.exe; drop malicious component; write dll; write shellcode; 3CX; Execute payload; Process Injection; C2 Server; download; run final payload; exists.]

Type III: Structural observability gaps (bounded reconstructability). In this type, at least one expected phase is absent under the available schemas and collection boundary. Examples include control-plane actions occurring outside the host boundary, event exports lacking fields needed to distinguish download vs. generic file activity, or missing attribution keys that prevent linking host and cloud actions.
Multi-source correlation cannot recover phases that are never observed; improvements require targeted telemetry that directly exposes the missing phase and provides join keys (e.g., cloud IAM/API audit logs with request IDs, egress proxy logs, or high-fidelity tracing for process/network attribution).

[Figure 9: CloudEX Attack Flow. Diagram labels as extracted: Publicly Exposed CI/CD Web Service; Reconn; Build Residues; credential reuse; Internal Artifact Repository; Tampered Artifact (contain downloader logic); Downstream Build; Conditional Trigger; C2 Establishment.]

[Figure 10: LayerInj Attack Flow. Diagram labels as extracted: adversary; backdoored model; Model Hub; end-user pull; Image Classification; class-match trigger; C2 Server; deploy.]

Implications. This typing clarifies why adding more sources does not automatically yield full chains: evidence diversity only helps if it adds missing anchors or strengthens joins. In practice, the most impactful additions are those that expose currently missing phases (Type III) or provide stable join keys for phases that are otherwise isolated (Type II).

I.2 CSA-2 Details: Alignment, Observability, and Evidence-Conditioned Feasibility

We align reconstructed coarse steps to MITRE ATT&CK techniques post hoc. This section clarifies why alignment quality varies across scenarios even under identical pipeline parameters.

Heterogeneous projections of the same action. The same underlying attacker action may manifest as different observables depending on the source: a retrieval phase can appear as a package-manager transaction, a file creation event, or a network flow. Alignment therefore becomes evidence-conditioned: if the dataset lacks the projection that carries sufficient context (e.g., process attribution, URL/domain, or artifact lineage), the corresponding technique may be under-supported despite being executed.

Abstraction gap: step-level evidence vs. technique-level diversity.
Coarse steps intentionally compress technique diversity to preserve cross-scenario comparability. This trades granularity for interpretability: the pipeline is optimized for reconstructing chain structure rather than identifying exact technique variants. As a result, alignment is best read as an "evidence supports this phase" signal, not as a direct technique detector. This is especially relevant for scenarios whose maliciousness is semantic rather than system-level (e.g., model-level backdoors), where OS telemetry may show standard inference workflows while the malicious effect appears only in model outputs.

Scenario-level breadth and temporal window. Scenario breadth (number of techniques/tactics) and attack window duration modulate alignment difficulty. Long windows increase background activity and temporal collisions; short windows can compress transitions and obscure intermediate anchors. Figure 11 visualizes this diversity and motivates why a single alignment strategy must remain conservative.

I.2.1 Evidence-Conditioned Detectability Matrix (Context, Not Accuracy)

Table 16 provides a binary feasibility view: whether the evidence a detector family requires is typically observable for each scenario. This matrix is complementary to Table 5: it is not a measured accuracy claim, but a structured explanation of why certain detector families are ill-suited under evidence constraints.

Two cross-cutting observations follow. First, methods that rely on stable indicators (IOC/signatures) are systematically brittle across supply-chain scenarios because artifacts are often novel, polymorphic, or ephemeral, and the dominant signal is behavioral rather than static. Second, methods that require a particular evidence boundary (e.g., host-only provenance or cloud-only integrity checks) fail when the core exploit occurs outside that boundary.
Our multi-source chaining approach is feasible across all scenarios because it composes whichever evidence is present into a unified narrative; however, feasibility does not imply completeness, and bounded observability (CSA-1 Type III) remains a limiting factor.

[Figure 11 panels: #Techniques (axis 0–100), #Tactics (axis 0–10), and Duration (min) (axis 0–300), each plotted for Starter, Stegano, CloudEX, LayerInj, 3CX, NPMEX, and Parallel.]

Figure 11: Scenario-level comparison of ATT&CK coverage and time window: number of techniques, number of tactics, and attack duration.

Table 16: Evidence-conditioned detectability matrix (binary feasibility)

| Scenario (core exploit) | IOC / Sig. | SCA + SBOM | Behavior + ATT&CK | 1-class Anom. | Call/Syscall Graphs | Single-src Prov. | Cross-src Corr. | Model Integrity | Ours |
|---|---|---|---|---|---|---|---|---|---|
| (Sources; core idea →) | hashes/YARA/domains; match known-bad [46] | repo/build/SBOM; dep/prov anomalies [37, 61] | EDR/auditd; stage/seq. rules [54] | features; outlier scoring [39] | eBPF/audit/traces; sequence graphs [51] | OS prov.; causal chain [67] | host+net; join evidence [29, 30] | model registry + eval; attest + trigger tests [57] | host+net+proc.+tracee; unified corr. + causal chaining |
| SC1-Stegano (steganography) | – | – | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ |
| SC2-Starter (autostart) | – | – | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ |
| SC3-Parallel (multi-stage) | – | ✓ | ✓ | – | ✓ | ✓ | ✓ | – | ✓ |
| SC4-NPMEX (dependency chain) | – | ✓ | ✓ | – | ✓ | ✓ | ✓ | – | ✓ |
| SC5-3CX (plugin, multi-stage backdoor) | – | – | ✓ | – | ✓ | ✓ | ✓ | – | ✓ |
| SC6-CloudEX (leaked cloud credential) | – | ✓ | ✓ | – | – | – | – | – | ✓ |
| SC7-LayerInj (backdoored model) | – | – | – | – | – | – | – | ✓ | ✓ |
| Per-method proportion (✓ / 7) | 0/7 | 3/7 | 6/7 | 2/7 | 5/7 | 5/7 | 5/7 | 1/7 | 7/7 |

I.3 CSA-3 Details: Failure Taxonomy and Diagnostic Value

Table 16 already characterizes feasibility under evidence availability.
Here we add only a lightweight diagnostic view, summarized in Table 17, that is specific to our pipeline outputs: when reconstruction fails, it typically manifests as (i) missing-phase gaps (expected steps never observed under the collected schemas), (ii) attribution breaks (steps observed but not joinable across sources into a unique chain), or (iii) negative/partial chains (attempt signals without downstream execution/connectivity). We report these gaps explicitly via missing-step diagnostics (e.g., MISSING_*, NO_STEPS_OBSERVED, PREFILTER_UNUSABLE) to distinguish evidence absence from schema/rule mismatch and to avoid over-claiming end-to-end compromise.

Table 17: Minimal diagnostic view of reconstruction failures.

| Category | Symptom | Typical remedy |
|---|---|---|
| Missing-phase gap | MISSING_* persists | add phase-specific telemetry |
| Attribution break | steps present, chain weak | add/join stable identifiers |
| Negative/partial chain | attempt w/o completion | require downstream anchors |

I.4 CSA-4: Structural Patterns & Deployment Implications

The feasibility matrix (Table 16) and the above diagnostic patterns suggest that telemetry planning should be driven by which failure mode dominates rather than by collecting more logs indiscriminately.

Implication 1: prioritize telemetry that closes missing-phase gaps. When failures are dominated by missing-phase gaps, additional correlation logic cannot help: the missing phase must be made observable. Practically, this means adding phase-specific sources (e.g., egress proxy logs for outbound transfer, artifact registry logs for package retrieval, or IAM/API audit for cloud control-plane actions) that expose both the phase signal and its identifiers.

Implication 2: invest in attribution to repair join breaks. When steps are present but chains remain fragmented, the bottleneck is attribution.
The most effective upgrades are sources or instrumentation that provide stable join keys across layers (process ↔ socket linkage, workload/container identifiers, cloud request IDs). This improves chain continuity without requiring scenario-specific tuning.

Implication 3: treat negative/partial chains as first-class outcomes. Attempt signals without downstream anchors should be interpreted as incomplete or failed compromises rather than forced into a full chain. Operationally, requiring downstream confirmation (e.g., attributable outbound sessions or file staging) reduces overestimation of attacker progress and aligns reconstruction with incident response needs.

Implication 4: two-source baselines can be strong but are scenario-dependent. A small, complementary set (typically host provenance + network visibility) can be sufficient when it both exposes key phases and provides joinable identifiers. However, scenarios whose critical actions lie outside host/network boundaries (e.g., cloud control-plane misuse or semantic/model-layer attacks) require targeted additional sources; multi-source is therefore most valuable for robustness across heterogeneous scenario structures.

J Evidence packages for SC1 and SC4

We provide pipeline-generated evidence packages for SC4 (success exemplar) and SC1 (failure exemplar), organized by step anchors (INSTALL, DOWNLOAD, OUTBOUND_CONN, EXFIL). For each anchor, the package includes (i) timestamped evidence excerpts with telemetry source attribution, (ii) a step-level time window summary (t_min – t_max) and evidence volume, and (iii) single-source vs. multi-source ablation indicating which anchors are recoverable from each telemetry source in isolation.

K Data Sanitization Details

The collected logs contain environment-specific identifiers that may reveal sensitive information about the deployment.
These identifiers include hostnames, non-system user accounts, cloud resource identifiers, and file paths that embed local usernames as substrings. To protect privacy while preserving analytical utility, we apply a stable pseudonymization strategy that replaces such identifiers with consistent tokens across all log sources and scenarios. We intentionally preserve public Internet indicators (e.g., non-Azure FQDNs) to maintain realism in the simulated traffic, while pseudonymizing Azure/Azure-hosted domains that could reveal deployment-specific context.

K.1 Threat-Model and Design Goals

The sanitization procedure is designed to satisfy the following goals:

• Privacy protection: Obfuscate values that can directly identify infrastructure, users, or internal resources. Well-known system/built-in accounts and placeholder values (e.g., SYSTEM, NT AUTHORITY\SYSTEM, S-1-..., N/A) are retained to avoid over-sanitization noise.
• Consistency across sources: The same original identifier is mapped to the same pseudonym across all log types and across multiple runs.
• Preserve security semantics: Only identifier fields (e.g., usernames, hostnames, Azure resource IDs, and Azure/Azure-hosted domain names) are pseudonymized; all other fields required for analysis remain unchanged (e.g., event categories, ports, protocols, IP addresses, HTTP methods, DNS query types, and temporal ordering). Public Internet FQDNs are preserved to maintain realism.
• Determinism: Sanitization is deterministic under a stable secret salt, enabling reproducible analysis.

K.2 Stable Pseudonymization Mechanism

We maintain a secret salt value S (stored locally and never published) and use it to derive deterministic tokens. For each sensitive field value v, we compute a token t = prefix || H(S || v), where H(·) is a cryptographic hash function (e.g., SHA-256) and prefix indicates the identifier category (e.g., host_, user_, res_).
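The token formula above can be sketched in a few lines. This is an illustration under stated assumptions rather than the released sanitizer: the function name pseudonymize, the 12-character hex truncation, the example salt, and the example hostnames are all hypothetical.

```python
import hashlib


def pseudonymize(value: str, prefix: str, salt: bytes, hex_len: int = 12) -> str:
    """Return prefix || H(S || v) using SHA-256.

    The digest is truncated to hex_len hex characters for readability;
    the truncation length is an assumption made for this sketch.
    """
    digest = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return prefix + digest[:hex_len]


# Placeholder salt for illustration only; a real salt stays secret and unpublished.
SALT = b"example-secret-salt"

token_a = pseudonymize("build-server-01", "host_", SALT)
token_b = pseudonymize("build-server-01", "host_", SALT)
token_c = pseudonymize("build-server-02", "host_", SALT)

print(token_a == token_b)  # same value and salt give the same token
print(token_a != token_c)  # a different value gives a different token
```

Because the mapping depends only on the salt and the value, equal identifiers receive equal tokens in every log source, which is what keeps cross-source joins intact after sanitization. A keyed construction such as HMAC-SHA-256 is a common alternative to plain concatenation; the sketch follows the formula as written.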
In practice, we derive compact tokens (e.g., USER_XYZ) by hashing S || v (SHA-256), truncating the digest to a numeric identifier, and resolving rare collisions deterministically. We persist a JSON dictionary per category in a shared mappings/ directory to ensure stability across files and notebook executions. This approach preserves equality relationships (same value → same token), enabling correlation across sources, without exposing original identifiers.

Table 18: Reconstruction diagnostics for SC4 and SC1 derived from the evidence packages.

| Scenario | Expected anchors | Observed anchors | Missing | Step P/R | Chain P/R | #events |
|---|---|---|---|---|---|---|
| SC4 | INSTALL, DOWNLOAD, OUTBOUND_CONN, EXFIL | INSTALL, DOWNLOAD, OUTBOUND_CONN | EXFIL | 1.00 / 0.75 | 1.00 / 0.75 | 188,270 |
| SC1 | INSTALL, DOWNLOAD, OUTBOUND_CONN, EXFIL | INSTALL | DOWNLOAD, OUTBOUND_CONN, EXFIL | 1.00 / 0.25 | 1.00 / 0.25 | 8,534 |
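The Step P/R values in Table 18 are consistent with a set-based reading of precision and recall over anchors. The sketch below assumes that reading; the exact metric definitions are not restated in this appendix, and the function name step_precision_recall is hypothetical.

```python
def step_precision_recall(expected, observed):
    """Set-based step precision/recall plus the sorted list of missing anchors.

    precision = |expected ∩ observed| / |observed|
    recall    = |expected ∩ observed| / |expected|
    """
    expected, observed = set(expected), set(observed)
    hits = expected & observed
    precision = len(hits) / len(observed) if observed else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall, sorted(expected - observed)


EXPECTED = ["INSTALL", "DOWNLOAD", "OUTBOUND_CONN", "EXFIL"]

# Observed anchors as listed in Table 18.
sc4 = step_precision_recall(EXPECTED, ["INSTALL", "DOWNLOAD", "OUTBOUND_CONN"])
sc1 = step_precision_recall(EXPECTED, ["INSTALL"])

print(sc4)  # (1.0, 0.75, ['EXFIL'])
print(sc1)  # (1.0, 0.25, ['DOWNLOAD', 'EXFIL', 'OUTBOUND_CONN'])
```

Under this reading, precision stays at 1.00 in both scenarios because every observed anchor is an expected one, while recall falls with each expected anchor that is never observed.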