PrismWF: A Multi-Granularity Patch-Based Transformer for Robust Website Fingerprinting Attack

1 PrismWF: A Multi-Granularity P atch-Based T ransformer for Rob ust W ebsite Fingerprinting Attack Y uhao Pan ID , W enchao Xu ID , Member , IEEE , Fushuo Huo ID , Haozhao W ang ID , Member , IEEE , Xiucheng W ang ID , Graduate Student Member , IEEE, Nan Cheng ID , Senior Member , IEEE Abstract —T or is a low-latency anonymous communication network that protects user privacy by encrypting website trafﬁc. Howev er , recent website ﬁngerprinting (WF) attacks have shown that encrypted trafﬁc can still leak users’ visited websites by exploiting statistical features such as pack et size, direction, and inter -arrival time. Most existing WF attacks formulate the prob- lem as a single-tab classiﬁcation task, which signiﬁcantly limits their effectiveness in realistic browsing scenarios where users access multiple websites concurrently , resulting in mixed trafﬁc traces. T o this end, we propose PrismWF , a multi-granularity patch-based T ransformer f or multi-tab WF attack. Speciﬁcally , we design a rob ust trafﬁc feature repr esentation for raw web trafﬁc traces and extract multi-granularity features using con vo- lutional ker nels with differ ent r eceptive ﬁelds. T o effectively inte- grate inf ormation acr oss temporal scales, the proposed model re- ﬁnes features through three hierarchical interaction mechanisms: inter -granularity detail supplementation from ﬁne to coarse granularities, intra-granularity patch interaction with dedicated router tokens, and router -guided dual-level intra- and cross- granularity fusion. This design aligns with the cognitive logic of global coarse-grained reconnaissance and local ﬁne-grained querying, enabling effective modeling of mixed trafﬁc patterns in WF attack scenarios. Extensive experiments on various datasets and WF defenses demonstrate that our method achie ves state- of-the-art performance compar ed to existing baselines. Index T erms —T or , W ebsite Fingerprint, Multi-tab Attack, Multi-Granularity . I . I N T R O D U C T I O N In recent years, the rapid growth of the Internet has inten- siﬁed concerns ov er user priv acy during web browsing. T o mitigate these risks, anonymous communication systems have been widely deplo yed to conceal users’ identities and bro wsing behaviors. Among them, the T or network is one of the most widely used low-latency anonymity systems, serving millions of users worldwide [1], [2]. T or achieves anon ymity by routing Y uhao Pan and W enchao Xu are with the Division of Integrativ e Systems and Design, Hong Kong Univ ersity of Science and T echnology , Hong Kong, China (e-mail: ypanca@connect.ust.hk, wenchaoxu@ust.hk). W enchao Xu is the corr esponding author . Xiucheng W ang and Nan Cheng are with the State Ke y Labora- tory of ISN and School of T elecommunications Engineering, Xidian Univ ersity , Xi’an 710071, China (e-mail: xcwang 1@stu.xidian.edu.cn, dr .nan.cheng@ieee.org). Fushuo Huo is with the School of Cyber Science and Engineering, Southeast Univ ersity , Nanjing, China (e-mail: fushuohuo@seu.edu.cn). Haozhao W ang is with the School of Computer Science and T echnology , Huazhong Uni versity of Science and T echnology , W uhan, 430068, China (email: hz wang@hust.edu.cn). encrypted trafﬁc through a multi-hop overlay network of relay nodes, such that no single relay can associate a user with their destination. While this design provides strong anonymity guarantees against network-lev el adversaries, it also gives rise to distincti ve traf ﬁc patterns that can be e xploited by website ﬁngerprinting attacks. Although T or provides strong anonymity guarantees at the network layer , the encrypted traf ﬁc between the T or client and the entry node remains observable to a local adversary . By analyzing observable traf ﬁc characteristics, such as packet sizes, packet directions, and inter -packet timing patterns, a local adversary can still infer sensitiv e information about users’ web activities. Early website ﬁngerprinting (WF) attacks primarily relied on expert knowledge to manually design discriminativ e traf ﬁc features, which were then fed into tra- ditional machine learning classiﬁers (e.g., random forests, support vector machines, and k -nearest neighbors) for website identiﬁcation [3]–[5]. With the rapid advancement of deep learning (DL) [6], subsequent studies have proposed end- to-end neural network–based models that automatically learn discriminativ e representations directly from raw traf ﬁc traces, signiﬁcantly improving attack performance [7]–[9]. Moreover , as various mechanisms for defense against WF hav e been proposed, robust attack models (e.g., Deep Fingerprinting [10] and Robust Fingerprinting [11]) hav e been dev eloped to main- tain ef fecti veness under defended trafﬁc conditions. These models are designed to mitigate performance degradation caused by trafﬁc obfuscation and defense strategies, enabling more resilient WF attacks. Despite their effecti veness, existing single-tab WF attack methods still suffer from notable limitations. Most prior approaches are dev eloped under the core assumption that users access only one website at a time, where trafﬁc from different domains is not interlea ved. Howe ver , in realistic browsing scenarios, users frequently visit multiple websites concurrently , resulting in mix ed traf ﬁc traces with interleav ed packets from different websites. As a result, single-tab attack models are inherently limited and often struggle to achieve reliable performance under such traf ﬁc mixing conditions. T o address this g ap, a number of multi-tab WF attack methods hav e been proposed to identify multiple coexisting websites from a single mixed trafﬁc trace. Nev ertheless, existing multi- tab approaches still exhibit critical limitations that hinder their practical deployment and effecti v eness. On the one hand, some methods lack robust traf ﬁc representation design and 2 directly le verage raw traf ﬁc traces as input, which renders them incapable of accurately modeling the complex interleaving patterns of multi-tab trafﬁc [12], [13]. On the other hand, other approaches adopt general DL architectures without adequately considering the mixed trafﬁc traces of multi-tab bro wsing, leading to suboptimal performance. Furthermore, a subset of multi-tab WF attacks requires a priori knowledge of the exact number of concurrent websites present in the trafﬁc trace, which is an unrealistic assumption in practical attack scenarios, undermining the methods’ real-w orld applicability [12], [14]. Motiv ated by these limitations, we propose PrismWF , a multi-granularity patch-based transformer tailored to the mixed characteristics of unidentiﬁed WF traf ﬁc, which ex- plicitly models cross-temporal information interactions for robust multi-tab WF attacks. Concretely , we ﬁrst transform raw website traf ﬁc traces into a more rob ust representation that captures essential traf ﬁc characteristics. Based on this representation, multi-granularity trafﬁc features are extracted via parallel con volutional branches, yielding both coarse- grained global features and ﬁne-grained local features. T o model the intrinsic information ﬂow of website trafﬁc, we design a trafﬁc-a ware interaction mechanism that enables structured information exchange across granularities. Coarse- grained features capture broader contextual patterns of trafﬁc segments, while ﬁne-grained features preserve detailed tem- poral variations, allo wing complementary information to be effecti v ely integrated across scales. In addition, we introduce granularity-speciﬁc routers to aggregate semantic information within each granularity , and concatenate router representations from all granularities for ﬁnal website identiﬁcation. This de- sign substantially mitigates information loss caused by trafﬁc mixing in multi-tab browsing scenarios, leading to improved robustness and accuracy in WF attacks. Extensiv e experiments demonstrate that PrismWF consis- tently achiev es state-of-the-art performance under multi-tab WF settings. Moreover , PrismWF remains robust under vari- ous WF defenses, maintains stable performance as the number of concurrently opened tabs increases, and generalizes well to more realistic mixed-tab scenarios where the number of visited websites is unknown a priori. The main contributions of this paper are summarized as follows: 1) W e propose PrismWF , a robust multi-tab WF attack model explicitly designed to address trafﬁc mixing caused by concurrent multi-tab browsing. 2) W e introduce a nov el transformer-based architecture built upon Multi-Granularity Attention Blocks, which jointly model inter-granularity and intra-granularity traf- ﬁc patterns across ﬁne-to-coarse temporal scales. A dedicated router mechanism is further introduced to aggregate cross-granularity trafﬁc cues, ef fectiv ely miti- gating performance degradation caused by trafﬁc mixing in multi-tab scenarios. 3) W e conduct extensiv e experiments on large-scale public datasets under closed-world, open-world, and mixed-tab settings, and e valuate PrismWF against multiple state- of-the-art WF defense mechanisms. The experimental results show that PrismWF consistently achieves state- of-the-art performance in multi-tab WF attacks and demonstrates strong robustness under realistic deploy- ment conditions. The remainder of the paper is organized as follo ws. W e ﬁrst introduce the related work in Section II and the threat model in Section III. W e present concrete design of PrismWF in Section IV from robust trace representation, website feature extraction, multi-granularity attention block, and website iden- tiﬁcation. Next, we conduct a comprehensiv e ev aluation on the performance of PrismWF in Section V. W e discuss rele vant issues in Section VI and conclude this paper in Section VII. I I . R E L A T E D W O R K A. WF Attac k Single-tab WF attacks. Early WF attacks primarily relied on e xpert prior knowledge to transform raw trafﬁc traces into hand-crafted features, followed by traditional machine learning techniques for website identiﬁcation [4], [15]–[17]. W ang et al. [3] extracted a lar ge set of statistical features and improv ed k -NN classiﬁcation through a weighted distance metric. Panchenko et al. [4] proposed CUMUL, which uses cumulativ e packet size features to represent trafﬁc traces and employs an SVM classiﬁer . T o improve rob ustness against noise and trafﬁc perturbations, Hayes et al. [5] introduced the k -ﬁngerprinting approach, which maps hand-crafted fea- tures into ﬁngerprint representations via random forests and performs website identiﬁcation using k -NN. W ith the strong capability of DL models for end-to-end feature learning, Rimmer et al. [7] ﬁrst introduced the A WF framew ork, enabling WF attacks without manual feature en- gineering. Building upon this direction, Sirinam et al. [10] proposed DF , which adopts a deeper and more e xpressi ve CNN architecture. The DF model achiev es nearly 98% classiﬁcation accuracy in the closed-world setting and maintains around 90% accuracy e ven under the WTF-P AD defense. Bhat et al. [8] presented V ar-CNN, a dual-branch residual conv olutional network. The two branches independently model packet di- rection sequences and temporal trafﬁc v ariations, and their features are fused to improve WF performance, particularly under limited training data. T o further enhance portability and practical applicability , Sirinam et al. [9] explored an n -shot learning frame work based on triplet loss. By performing metric learning in the feature space, this approach pulls trafﬁc traces from the same website closer together while pushing those from dif ferent websites farther apart, enabling more ﬂe xible and transferable WF . Recently , Shen et al. [11] proposed the T raf ﬁc Aggregation Matrix (T AM) traf ﬁc representation, which aggregates uplink and downlink packets within ﬁxed time slots to construct a robust trafﬁc representation, enabling robust WF attacks. Deng et al. [18] proposed an early-stage WF attack based on spatio-temporal distribution analysis. By aligning features extracted from early traf ﬁc and adopting supervised contrastiv e learning, their method enables accurate website identiﬁcation at the early stage of page loading. Multi-tab WF attacks. Most existing WF methods rely on a single-tab assumption, where users are presumed to access only one website at a time. This assumption does not hold in practice, as multi-tab browsing generates mixed trafﬁc 3 W eb s i t e A W eb s i t e C W eb s i t e B I d e al s i t u a t i o n R ea l i s t i c si t u at i o n Ti m e Fig. 1. Illustration of Traf ﬁc Mixing Caused by Concurrent Multi-T ab Browsing. traces that substantially degrade the effecti veness of single-tab models. T o bridge this gap, researchers hav e started exploring multi-tab WF attacks, which aim to identify multiple website labels from a single mixed traf ﬁc trace. In the ﬁeld of multi-tab WF attacks, Guan et al. [12] proposed B APM, one of the ﬁrst deep-learning-based end- to-end multi-tab WF attack models. BAPM performs block- lev el modeling on mix ed trafﬁc traces to learn tab-aware representations and employs a self-attention mechanism to adaptiv ely fuse information across trafﬁc blocks. Jin et al. [13] introduced a T ransformer -based WF attack model inspired by the DETR-style encoder–decoder architecture [19]. Their method adopts the feature extraction module from DF [10] as a backbone to preprocess ra w traf ﬁc traces into feature sequences. A Transformer encoder is then used to capture global trafﬁc context, while tab-aware query vectors in the decoder interact with encoded features to enable multi-tab website identiﬁcation. Deng et al. [20], [21] proposed ARES, which ﬁrst transforms multi-tab website trafﬁc into the MT AF feature representation. Subsequently , multi-head attention is employed to capture global trafﬁc dependencies, together with a T op- K sparse attention mechanism that focuses on trafﬁc patterns most relev ant to tar get labels. More recently , Deng et al. [22] proposed CountMamba, which employs a causal CNN to extract trafﬁc features and lev erages a Mamba state space model for multi-tab website ﬁngerprinting attacks. B. WF Defense T o defend against WF attacks in T or, researchers have proposed to obfuscate traf ﬁc traces by delaying real packets and injecting dummy packets, thereby effecti vely degrading the performance of WF attack models [23], [24]. Existing defenses can be broadly categorized into regularization-based WF defenses and adversarial example(AE)–based WF de- fenses. Regularization-based WF Defenses. Early WF defenses obfuscate traf ﬁc ﬁngerprints by enforcing ﬁxed-time-interv al transmissions with uniform packet sizes. BuFLO [25] adopts a strict constant sending interval, pads all packets to a uniform size, and extends transmissions to a predeﬁned maximum duration—effecti v ely eliminating timing and size-based dis- criminativ e features. T amara w [26] improv es upon BuFLO by supporting asymmetric constant rates for upstream and do wn- stream traf ﬁc, which better matches the inherent asymmetry of web communications and reduces overhead. Despite this optimization, both approaches incur non-trivial bandwidth and latency ov erhead, se verely limiting their practical deployment in real-world scenarios. T o improve deployability , Ju ´ arez et al. [27] propose WTF-P AD, an adaptiv e padding defense that lev erages statistical models of trafﬁc bursts to obfuscate web- site ﬁngerprints without enforcing constant-rate transmission. Gong et al. propose FR ONT , a zero-delay and lightweight WF defense that injects dummy packets during idle periods. The padding interv als are randomly sampled from a Rayleigh distribution to obfuscate timing patterns. AE-based WF Defenses. Recently , AE-based techniques have been integrated into WF defenses, by injecting dummy packets to balance defensiv e efﬁcacy and bandwidth ov erhead without additional latency . Mockingbird [28] employs a targeted strat- egy that iterati vely perturbs traf ﬁc traces to ward a selected target to reduce distinguishability . Sadeghzadeh et al. [29] introduce A W A, constructing perturbations based on pairwise website relationships with two v ariants: NU A W A, which de- riv es sample-speciﬁc perturbations, and UA W A, which learns a univ ersal noise-dri ven perturbation for cross-website reuse. More recently , ALER T [30] utilizes a generator to produce targeted adversarial trafﬁc, not only weakening adversarially trained attack models but also achieving state-of-the-art de- fense performance. Ho we ver , as these methods often suffer from degraded effecti veness in model transfer scenarios, we focus on ev aluating WF attacks under regularization-based defense settings. C. WF T race Repr esentation WF attacks identify target websites solely through traf ﬁc analysis based on packet-le vel metadata, such as packet di- rection and timestamps. Early WF attack methods, including A WF [7] and DF [10], primarily relied on packet direc- tion sequences (i.e., +1 for outgoing packets and − 1 for incoming packets). Subsequent work, such as Tik-T ok [31], incorporated temporal information by jointly modeling packet directions and timestamps, leading to improved classiﬁca- tion performance. Building on this line of research, Shen et al. [11] argued that transforming raw trafﬁc traces into coarse- grained temporal representations can yield more rob ust attack performance, and proposed the T AM representation— which segments trafﬁc into ﬁxed-length time windows to form a 2 × N matrix ( N = ⌈ T /d ⌉ , with T as maximum page loading time and d as windo w size)—for recording incoming and outgoing packet counts. Follo wing this paradigm, a variety of window-based and statistical representations hav e been explored. For example, LASERBEAK [32] employed multi- dimensional statistical feature vectors, WFCA T [33] captured temporal characteristics using logarithmically binned inter- arriv al time (IA T) histograms, and ARES [21] proposed the MT AF representation, which extracts window-le vel features to jointly model cell-le vel and burst-le vel trafﬁc patterns. I I I . T H R E AT M O D E L As illustrated in Fig. 2, users achieve anonymous web browsing via the T or network, where traf ﬁc is routed through 4 E n t r y M i d d l e E xi t W e b s i t e s U s e r T or n e t w or k A d ve r s ar y Fig. 2. Illustration of T or network. a circuit of three randomly selected relay nodes—this multi- hop mechanism prevents destination websites from directly identifying users during access. In the WF threat model, the adversary is assumed to be positioned between the user and the T or entry node. Due to end-to-end encryption, the adversary cannot decrypt packet payloads and can only lev erage trafﬁc metadata (e.g., packet timing, size, and direction) to infer the user’ s visited websites, formulating WF attacks as a trafﬁc classiﬁcation task. Following prior WF studies [9], [10], [13], [22], the adversary is restricted to passi ve observ ation, with no ability to modify , inject, or drop network packets. In a typical WF attack pipeline, the adversary ﬁrst collects encrypted trafﬁc traces by visiting a set of monitored websites via the T or network to construct a labeled training dataset. A WF attack model is then trained to associate encrypted trafﬁc patterns with website identities. During deployment, the adversary passively observes the target user’ s encrypted traf ﬁc and uses the trained model to infer visited websites. This study considers a more realistic and challenging WF attack scenario where users access multiple websites con- currently . Such concurrent access causes traf ﬁc mixing in aggregated traces, signiﬁcantly degrading traditional single-tab WF attack performance. W e further consider a practical mixed- tab setting, where the adversary has no prior knowledge of the number of concurrently accessed websites in collected traces. W e also account for WF defense mechanisms: giv en the high cost of customized defenses, users typically rely on publicly av ailable alternati ves in practice. WF attacks are commonly ev aluated under two standard settings [7], [10], [20], [22], [33]: the closed-world and open- world scenarios. Given the vast number of websites on the Internet, attackers typically monitor only a subset of websites of speciﬁc interest. In the closed-world scenario, the attack task is formulated as classifying trafﬁc traces exclusi vely into the predeﬁned set of monitored websites. In the more realistic open-world scenario, users may also visit a large number of websites outside the monitored set, referred to as unmonitored websites. T o distinguish between monitored and unmonitored trafﬁc, prior work typically aggregates all unmonitored websites into an additional class, constructing a practical open-world ev aluation setting. I V . M E T H O D O L O G Y In this section, we ﬁrst introduce the overall architecture of PrismWF and then detail each component. The pipeline of PrismWF is presented in Algorithm 2. A. Overall Arc hitectur e of PrismWF As illustrated in Fig. 3, PrismWF ﬁrst transforms raw website trafﬁc into a rob ust trace representation by partitioning each trace into ﬁxed-length time slots and extracting an M - dimensional feature matrix that captures both pack et-level statistics and temporal interval characteristics. A multi-branch CNN feature extractor then models trafﬁc patterns at different temporal granularities, where branches with distinct kernel sizes generate patch tokens covering coarse-grained contextual patterns and ﬁne-grained local details. T o reﬁne these multi- granularity representations, we introduce a trafﬁc-a ware Multi- Granularity Attention Block with a dedicated router token for each granularity . Each block incorporates three complementary mechanisms: inter-granularity interaction for information e x- change across temporal scales, intra-granularity interaction for patch-lev el semantic aggregation, and inter-granularity router interaction for fusing semantic summaries across granularities. Finally , router tokens from all granularities are concatenated into a uniﬁed representation and fed into a linear classiﬁer for multi-tab website identiﬁcation. B. Rob ust T r ace Repr esentation Extracting effecti ve traf ﬁc representations from interleav ed website traces is critical for downstream WF attack models, as raw trafﬁc is often contaminated by packets from concur- rent websites and obfuscation mechanisms. In such settings, website-speciﬁc identity features are hard to isolate due to se- vere temporal noise and trafﬁc interleaving. Existing represen- tations based on global traf ﬁc statistics tend to ov erlook ﬁne- grained temporal structures, making them vulnerable to WF defenses. Moreover , representations that rely solely on packet direction sequences fail to capture time-slot-lev el temporal dynamics, thereby limiting their ability to identify the injection timestamps of newly opened websites in multi-tab scenarios. T o address these challenges, we adopt a ﬁx ed-size time-slot strategy that abstracts raw packet e vents into structured local temporal segments. Speciﬁcally , the time axis of each traf ﬁc trace is partitioned into equal-length intervals with duration ∆ t , enabling localized modeling of trafﬁc dynamics. Let a raw trafﬁc trace collected from the T or network be denoted as x = { f 1 , f 2 , . . . , f N } , (1) where each packet event f i = ⟨ d i , t i ⟩ consists of a packet direction d i and its corresponding timestamp t i . Here, d i ∈ { +1 , − 1 } indicates outgoing ( +1 ) or incoming ( − 1 ) packets. Accordingly , the trafﬁc trace can be represented as a matrix x ∈ R 2 × N , where the ﬁrst ro w encodes packet directions and the second ro w records timestamp information. Giv en a predeﬁned maximum page loading time T , the total number of time slots is computed as L =  T ∆ t  . For each time interv al, we extract a six-dimensional feature vector to characterize the trafﬁc pattern within that slot. These features consist of tw o categories: packet-lev el statistics (4 dimensions) and time-interval features (2 dimensions). Specif- ically , the pack et-lev el statistics include (i) the numbers of incoming and outgoing packets, and (ii) the counts of direction 5 O r igina l T r a c e R obust T r a c e R e pr e se nta ti on Multi-Branch CNN Feature Extractor Router Concat Projection I de ntif ic a t i on M ulti - G r a n ul a r it y A tte ntio n B loc k × N R oute r toke n P a tc h toke n I nte r - gr a nu l a r i t y inte r a c ti o n In t ra- g r a n ul a r i t y r oute r - l e ve l inte r a c ti on In t ra- g r a n ul a r i t y p a t c h - l e v e l inte r a c ti on I nte r - gr a nu l a r i t y r oute r inte r a c ti on Fig. 3. Overview of the PrismWF . transitions from outgoing to incoming and from incoming to outgoing packets. The time-interval features capture the temporal gaps between consecutiv e outgoing-to-incoming and incoming-to-outgoing packet transitions within each time slot. By aggreg ating these features across all time intervals, we construct a uniﬁed traf ﬁc feature matrix: M = ϕ ( x ) ∈ R 6 × L , (2) where ϕ ( · ) denotes the trafﬁc feature construction function. In practice, the actual number of intervals is truncated to L if it exceeds the predeﬁned value, or padded with zeros if it is insufﬁcient to reach L . The detailed robust trace feature construction steps are presented in Algorithm 1. C. Multi-Granularity F eature Extraction W ebsite trafﬁc traces are long temporal sequences con- taining patterns at multiple time scales. T o capture traf ﬁc characteristics at dif ferent temporal resolutions, we employ G parallel CNN branches with distinct kernel sizes to e xtract multi-granularity features from the robust trafﬁc representation M . Each branch focuses on a speciﬁc temporal granularity and projects the extracted features into a shared embedding space. For the i -th branch with kernel size k i , a CNN feature extractor is applied to obtain: F i = CNN i ( M ) ∈ R d × N i , (3) where d denotes the uniﬁed embedding dimension and N i represents the number of patch tokens produced at the cor- responding granularity . Each CNN i consists of three stacked con volutional blocks, inspiring by the design of the DF feature extractor . Each Con vBlock includes two 1D con volutional layers with batch normalization and ReLU activ ation, follo wed by a max-pooling layer for temporal downsampling and a dropout layer for regularization. Due to the use of dif ferent kernel sizes and pooling operations, each branch produces a Algorithm 1 Robust T race Feature Construction M = ϕ ( x ) Input: Traf ﬁc trace x ; max loading time T ; time slot size ∆ t Output: Feature matrix M ∈ R 6 × L 1: L ←  T ∆ t  ▷ T otal time slots 2: M ← 0 6 × L 3: f or j = 1 to L do 4: I j ← { i | ( j − 1)∆ t ≤ t i < j ∆ t } ▷ Pack et indices in j -th slot 5: if I j = ∅ then 6: continue 7: end if 8: Extract ordered subsequence { ( t k , d k ) } n k =1 from { ( t i , d i ) } i ∈I j , where n ← |I j | 9: c + ← P n k =1 I ( d k = +1) , c − ← P n k =1 I ( d k = − 1) 10: n + − ← 0 , n − + ← 0 , s + − ← 0 , s − + ← 0 11: for k = 1 to n − 1 do 12: if d k = +1 ∧ d k +1 = − 1 then 13: n + − ← n + − + 1 , s + − ← s + − + ( t k +1 − t k ) 14: else if d k = − 1 ∧ d k +1 = +1 then 15: n − + ← n − + + 1 , s − + ← s − + + ( t k +1 − t k ) 16: end if 17: end f or 18: s + − ← I ( n + − > 0) · s + − n + − 19: s − + ← I ( n − + > 0) · s − + n − + 20: M [: , j ] ← [ c + , c − , n + − , n − + , s + − , s − + ] ⊤ 21: end f or 22: retur n M feature sequence of length N i , corresponding to a speciﬁc temporal granularity . The ﬁnal Con vBlock in each branch outputs d -dimensional features, thus ensuring all branches generate tokens in the pre-deﬁned uniﬁed embedding space. W e then transpose the feature matrix to obtain the patch token sequence for each 6 granularity: u i = F ⊤ i ∈ R N i × d . (4) This design enables seamless interaction among tokens from different granularities in subsequent attention layers, while N i varies across branches due to the different downsampling rates induced by distinct k ernel sizes. D. Multi-Granularity Attention Block After obtaining the multi-granularity representations of website trafﬁc features, we further reﬁne these features us- ing the proposed Multi-Granularity Attention Block, which is designed to enable structured information interaction across temporal scales. The model stacks B identical blocks, each consisting of three attention-based interaction lay- ers that jointly support intra-granularity feature modeling and inter-granularity semantic aggre gation. Speciﬁcally , intra- granularity interactions capture temporal dependencies within each granularity , while inter-granularity interactions fuse com- plementary semantic information across different granularities. Through this hierarchical, attention-driv en reﬁnement process, the model progressi vely enhances the discriminability of traf ﬁc representations, facilitating more accurate WF attack. The detailed design of each interaction layer is described belo w . Router T oken Injection. T o obtain compact and compara- ble representations across dif ferent temporal granularities, we introduce a learnable r outer token for each granularity le vel. The router token serves as a granularity-lev el semantic proxy that summarizes patch-lev el features into a ﬁxed-dimensional representation, which is amenable to downstream projection and classiﬁcation. Speciﬁcally , for the i -th granularity , a router token is deﬁned as v i ∈ R 1 × d . (5) The router token is appended to the corresponding patch token sequence, ˜ u i = [ u i ; v i ] ∈ R ( N i +1) × d , (6) enabling the router token to interact with patch tokens within the same granularity and progressively aggregate granularity- speciﬁc semantic information. As a result, the router tokens yield compact granularity-wise representations that are con- catenated for ﬁnal classiﬁer prediction, while enabling cross- granularity information interaction via a shared projection space. Inter -granularity Interaction. T o enable effecti ve informa- tion exchange across dif ferent temporal granularities, we adopt a coarse-to-ﬁne interaction paradigm where coarse-grained representations query and retriev e complementary information from ﬁner-grained representations. This design is moti v ated by the inherent characteristics of website trafﬁc traces, where trafﬁc mixing frequently arises. Traf ﬁc segments from dif- ferent websites may dominate different portions of a traf- ﬁc sequence or become temporally interleav ed, resulting in mixed trafﬁc patterns. Coarse-grained tokens provide a global and robust contextual perspective, while ﬁne-grained tokens preserve detailed temporal variations. Our inter-granularity interaction mechanism lev erages this duality to guide ﬁne- grained feature retriev al under macroscopic coarse-grained context, thereby mitigating the impact of trafﬁc mixing. During inter-granularity interaction, only patch tok ens participate in cross-granularity information exchange, while router tokens are excluded. Let u ( b − 1) i ∈ R N i × d denote the patch token sequence of the i -th granularity at the ( b − 1) -th Multi-Granularity Attention Block. For two adjacent granularities, we employ a local coarse-to-ﬁne cross-attention mechanism to realize information interaction. Speciﬁcally , each coarse-grained patch token is aligned with a corresponding temporal region in the ﬁne-grained sequence, so that ﬁne-grained features can be selectively queried from the relev ant time span. Assume the coarse-grained sequence contains N c patch tokens and the corresponding ﬁne-grained sequence contains N f patch tokens. For the n -th coarse-grained patch token (with zero- based indexing) acting as the query , we compute the center index of its corresponding region in the ﬁne-grained sequence as c n =  ( n + 0 . 5) N f N c  , (7) where offset +0 . 5 aligns the coarse-grained token with the temporal center (rather than left boundary) of its correspond- ing ﬁne-grained re gion, enabling more accurate coarse-to-ﬁne feature matching. W e then perform multi-head cross-attention (MHCA), which follows the standard attention formulation with queries and key–v alue pairs drawn from dif ferent tok en sequences. Speciﬁcally , coarse-grained patch tokens act as queries, while ﬁne-grained patch tokens serve as ke ys and values within a local temporal window to preserve alignment and suppress irrelev ant interference. Let u f ∈ R N f × d denote the ﬁne- grained patch token sequence. For the n -th coarse-grained patch tok en, we select a local subset of ﬁne-grained patch tokens centered at c n , with the windo w size controlled by the hyper-parameter w , which speciﬁes the maximum number of ﬁne-grained tokens attended by each coarse-grained token: u ( c n ,w ) f =  u f [ k ]   k ∈  max  0 , c n −  w 2  , min  N f − 1 , c n +  w 2  . (8) W ithin this local window , MHCA is applied to update the coarse-grained representations: u ′ c = MHCA  u c , u ( c n ,w ) f , u ( c n ,w ) f  , (9) where u c denotes the coarse-grained patch token sequence. This coarse-to-ﬁne local cross-attention enables coarse-grained representations to selecti vely aggregate informativ e ﬁne- grained temporal details while suppressing interference from unrelated or mixed trafﬁc se gments, which is critical for robust multi-tab WF attacks. Intra-granularity Interaction. Within each temporal gran- ularity , we design an intra-granularity interaction module to model local temporal dependencies among patch tokens and global semantic conte xt summarized by the router token. T o this end, a hybrid attention mechanism with two comple- mentary branches is adopted: patch-le vel local interaction and router-le vel global aggregation. 7 (1) P atc h-level local interaction. Patch tokens exchange information via local multi-head self-attention to capture short- range temporal dependencies within the same granularity . Giv en the patch token sequence u ′ i ∈ R N i × d of the i - th granularity , the locally reﬁned patch representations are computed as u loc i = MHA local ( u ′ i , u ′ i , u ′ i ) . (10) where MHA local ( · ) denotes multi-head self-attention restricted to a local temporal neighborhood to preserve temporal coher- ence while suppressing interference from distant segments. (2) Router-le vel global interaction. In parallel, the router token acts as a granularity-lev el semantic aggregator that summarizes global information within the same granularity . By attending to all locally reﬁned patch tokens, the router token captures holistic trafﬁc semantics that complement patch- lev el representations. The aggregated router representation is obtained via multi-head cross-attention: v glob i = MHCA  v ′ i , u loc i , u loc i  , (11) where v ′ i ∈ R 1 × d denotes the router token of the i -th granularity . T ogether , these two interaction branches enable ef fectiv e intra-granularity feature reﬁnement by integrating ﬁne-grained local temporal structures with global semantic context. Finally , the updated intra-granularity token sequence ˜ u ′′ i = [ u loc i ; v glob i ] serves as the input to subsequent interaction modules. Inter -granularity Router Interaction. T o enable global in- formation exchange across different granularities within each Multi-Granularity Attention Block, we perform global atten- tion among router tokens from all granularities. This design allows each router token to aggre gate complementary global semantic information captured at other temporal scales. Specif- ically , the router token of the i -th granularity is extracted from the corresponding token sequence ˜ u ′′ i , which consists of N i patch tokens followed by one router token: r ′′ i = ˜ u ′′ i [ N i + 1] ∈ R 1 × d , (12) where the ( N i + 1) -th position corresponds to the router token under 1-based indexing. All router tokens are concatenated to form a router -lev el sequence R = [ r ′′ 1 ; r ′′ 2 ; . . . ; r ′′ G ] ∈ R G × d , (13) which is updated via global multi-head self-attention: R ′ = MHA global ( R , R , R ) . (14) The updated router tokens R ′ are then redistributed to their corresponding granularities and written back to the token sequences for the ne xt block: ˜ u ( b ) i = ReplaceRouter( ˜ u ′′ i , R ′ [ i ]) , (15) where ReplaceRouter( · ) replaces the router token in ˜ u ′′ i with R ′ [ i ] while keeping all patch tokens unchanged. Algorithm 2 Pipeline of PrismWF for WF Attack Input: Raw traf ﬁc trace x ; slot size ∆ t ; maximum loading time T ; branch kernels { k i } G i =1 ; number of attention blocks B ; local attention windows ( w intra , w inter ) ; (training only) website label y (single-/multi-tab) Output: Predicted website labels ˆ y 1: 1) Robust T race Representation 2: M ← ϕ ( x ; ∆ t, T ) ▷ M ∈ R M × L : slot-based traf ﬁc features 3: 2) Multi-Granularity Featur e Extraction 4: f or i = 1 to G do 5: U i ← BranchCNN i ( M ; k i ) ▷ patch tokens U i ∈ R N i × d 6: r i ← InitRouter( d ) ▷ router token r i ∈ R 1 × d 7: ˜ U i ← [ U i ; r i ] ▷ append router tok en 8: end f or 9: 3) Stacked Multi-Granularity Attention Blocks 10: f or b = 1 to B do 11: Inter -Granularity Interaction (coarse → ﬁne) 12: { ˜ U i } G i =1 ← InterGran ularityIn teraction( { ˜ U i } G i =1 , w inter ) 13: Intra-Granularity Interaction (patch-local + router -global) 14: { ˜ U i } G i =1 ← IntraGran ularityIn teraction( { ˜ U i } G i =1 , w intra ) 15: Inter -Granularity Router Interaction (router fu- sion) 16: { ˜ U i } G i =1 ← RouterInteract( { ˜ U i } G i =1 ) 17: end f or 18: 4) W ebsite Identiﬁcation 19: z ← ConcatRouters( { ˜ U i } G i =1 ) 20: ˆ y ← f ( z ) ▷ website prediction (single-/multi-tab) 21: if training then 22: L ← L ( ˆ y , y ) 23: update model parameters by minimizing L 24: end if 25: retur n ˆ y E. W ebsite Identiﬁcation After reﬁning trafﬁc representations across multiple tempo- ral granularities via the router mechanism, we perform website identiﬁcation by aggregating global semantic information. Speciﬁcally , the ﬁnal router tokens from all granularities are concatenated to form a uniﬁed global trafﬁc representation z ∈ R Gd , where each router token summarizes granularity- speciﬁc semantics produced by the last Multi-Granularity Attention Block. The global representation z is then fed into the website identiﬁcation classiﬁer f , which applies a linear projection to produce classiﬁcation logits o ov er C website categories. Depending on the task setting, the model supports both single-tab and multi-tab website ﬁngerprinting. F or model optimization, we adopt task-speciﬁc loss functions: L = ( CE( o , y ) , single-tab setting , BCEWithLogits( o , y ) , multi-tab setting , (16) where y is a one-hot label in the single-tab case and y ∈ { 0 , 1 } C is a multi-hot label vector for the multi-tab case. Accordingly , cross-entropy (CE) loss is utilized for single- 8 tab case, while binary cross-entropy (BCE) with logits is employed for the multi-tab. V . E X P E R I M E N T S In this section, we conduct extensiv e experiments on large- scale datasets to e valuate the proposed PrismWF . W e ﬁrst describe the experimental setup in Section V -A, including datasets, ev aluation metrics, baselines, and implementation details. Section V -B e valuates PrismWF under both closed- world and open-world multi-tab WF scenarios. W e further in vestigate the impact of the mixed-tab setting in Section V -C and assess the robustness of PrismWF against representativ e WF defenses in Section V -D. Finally , we conduct ablation studies in Section V -E to validate the effecti veness of each key component. A. Experimental Setup 1) Multi-tab Datasets: W e perform multi-tab WF attacks on the multi-tab datasets [20], [21], which model realistic multi- tab browsing behavior and include sub-datasets with different numbers of concurrently opened tabs (2-tab, 3-tab, 4-tab, and 5-tab). In each sub-dataset, each trafﬁc trace contains mixed trafﬁc generated by the corresponding number of bro wser tabs. ARES pro vides both closed-world and open-world ev aluation scenarios. In the closed-world setting, each sub-dataset con- tains trafﬁc traces from 100 monitored website classes, with ov er 58,000 instances. In the open-world setting, unmonitored trafﬁc is additionally included while the number of monitored classes remains 100, resulting in 64,000 trafﬁc instances per sub-dataset. 2) Multi-tab evaluation metrics: For the multi-tab WF task, we follow prior work [20]–[22] and adopt Precision@K (P@K) and Mean A v erage Precision@K (MAP@K) to e valu- ate classiﬁcation performance [34]. Let y denote the ground- truth multi-tab vector associated with a trafﬁc instance x , where y i = 1 if x contains trafﬁc from the i -th website and y i = 0 otherwise. Let ˆ y denote the predicted conﬁdence scores produced by the model ov er all monitored website labels. Both P@K and MAP@K are computed based on the top- K website labels ranked by prediction conﬁdence. P@K measures the precision among the top- K predicted website labels for each instance, deﬁned as P@K( x ) = 1 K X i ∈ r K ( ˆ y ) y i , (17) where r K ( ˆ y ) denotes the set of website labels with the top- K highest predicted conﬁdence scores. MAP@K e xtends P@K by further e valuating the ranking quality of predicted labels within the top- K results. Speciﬁcally , it quantiﬁes whether true website labels tend to rank higher than non-relev ant ones by av eraging the precision values at different cutof f positions: MAP@K( x ) = 1 K K X i =1 P@i( x ) . (18) The ﬁnal MAP@K score is obtained by averaging over all testing instances. 3) Baselines: In this work, we adopt DF [10], A WF [7], V ar-CNN [8], T ikT ok [31], Holmes [18], RF [11], B APM [12], TMWF [13], ARES [20], [21] and CountMamba [22] as baseline methods for both single-tab and multi-tab WF ev aluation. For traditional single-tab attack methods adapted for deployment in multi-tab scenarios, we follow the experi- mental setup of prior w ork and modify the classiﬁcation head’ s output layer by removing the softmax activ ation and retaining the raw logit outputs [20]–[22]. Subsequently , we employ the binary cross-entropy loss with logits (BCEW ithLogitsLoss) as the loss function to train these models, thereby facilitating their deployment in multi-tab classiﬁcation tasks. T ABLE I P A R A M E T E R S E T T I N G S F O R P R I S M W F . Model Part Details V alue Multi-Branch CNN Embedding Dimension 256 Branch Number 4 Kernel Sizes [15, 11, 7, 5] Con vBlocks per Branch 3 Multi-Granularity Block Block Number 3 Attention Head Number 8 Intra-Granularity W indow 5 Inter-Granularity Windo w 3 FFN Dimension 1024 Classiﬁcation Head Router Fusion Dimension 256 × 4 4) Implementation detail: W e train the PrismWF model in an en vironment with Python 3.10.19 and PyT orch 2.1.2, with the implementation consisting of ov er 2,000 lines of code. All experiments are conducted on NVIDIA A800-SXM4-80GB GPUs. For single-tab WF tasks, each model is trained for 50 epochs, while for multi-tab tasks, all models are trained for 80 epochs. More detailed deployment conﬁgurations of PrismWF are provided in T able I. B. Multi-T ab Attack P erformance In this section, we e valuate the performance of WF attack models in multi-tab scenarios under both closed-world and open-world settings. For baseline methods originally designed for single-tab WF (e.g., A WF , DF , Tik-T ok, V ar-CNN, and RF), we adopt the same multi-tab adaptation strategy described in Section V -A3, follo wing the standard practice of prior multi- tab WF attacks. For representati ve multi-tab WF methods (B APM, TMWF , ARES, and CountMamba), we directly use their publicly released codes for training and testing without modiﬁcations. 1) Closed-world scenario: Experimental results demon- strate that the proposed PrismWF consistently outperforms all baseline methods across both P@k and MAP@k e val- uation metrics, achieving a new state of the art, as shown in T able II. T raditional single-tab attack methods, including A WF , DF , T ik-T ok, V ar-CNN, and RF , exhibit signiﬁcantly inferior performance compared to recent representati ve multi- tab approaches such as TMWF , CountMamba, ARES, and PrismWF . Among these multi-tab baselines, B APM sho ws 9 T ABLE II C O M PAR I S O N W I T H E X I S T I N G M E T H O D S U N D E R M U LTI - TA B B RO W S I N G : C L O S E D - W O R L D A N D O P E N - W O R L D S C E NA R I O S . Scenario # of tabs Metrics A WF DF T ik-T ok V ar-CNN RF B APM TMWF CountMamba ARES PrismWF Closed-world 2-tab P@2 15.66 63.01 70.47 72.94 64.66 57.22 78.24 87.33 87.78 89.46 MAP@2 17.93 72.64 78.87 81.16 73.13 66.38 83.20 91.89 92.00 93.10 3-tab P@3 11.67 45.62 53.51 56.32 47.24 43.09 67.02 81.52 84.56 87.01 MAP@3 13.93 58.57 65.91 69.93 59.44 53.52 73.87 87.76 90.09 91.48 4-tab P@4 11.49 43.15 49.60 40.35 44.25 41.23 65.97 81.26 85.59 88.38 MAP@4 13.64 55.32 61.87 55.62 56.69 51.04 72.52 87.41 90.49 92.34 5-tab P@5 10.84 35.48 41.34 38.75 34.60 34.67 64.00 73.89 83.27 87.54 MAP@5 12.24 46.90 52.94 51.25 44.63 42.88 70.83 81.46 88.38 91.63 Open-world 2-tab P@2 17.59 60.77 69.04 70.46 62.63 55.65 73.98 85.02 86.17 87.83 MAP@2 20.32 70.21 77.23 79.27 71.64 64.71 79.97 90.09 90.91 91.97 3-tab P@3 12.13 45.56 53.35 57.89 47.32 42.07 66.47 81.17 83.69 86.03 MAP@3 14.62 58.43 66.18 71.61 60.41 52.50 73.58 87.86 89.45 91.36 4-tab P@4 11.90 42.19 49.02 40.32 43.25 40.20 67.08 79.98 85.04 88.11 MAP@4 14.35 54.62 61.20 53.41 56.14 50.39 73.54 86.40 90.01 92.08 5-tab P@5 11.96 36.74 42.74 39.39 36.93 35.65 64.21 75.60 84.11 88.52 MAP@5 14.04 48.47 54.99 52.03 47.79 44.38 71.06 83.09 89.11 92.33 relativ ely limited effecti veness compared to the stronger recent methods. Speciﬁcally , under the 2-tab setting, PrismWF achiev es a P@2 score of 89.46%, outperforming BAPM, TMWF , Count- Mamba, and ARES by 32.24%, 11.22%, 2.13%, and 1.68%, respectiv ely . In terms of MAP@2, PrismWF attains 93.10%, exceeding the same baselines by 26.72%, 9.90%, 1.21%, and 1.10%, respectiv ely . Under the more challenging 5-tab setting, the performance of existing multi-tab baselines (including B APM, TMWF , CountMamba, and ARES) degrades substan- tially compared to their 2-tab results. In contrast, PrismWF maintains stable and superior performance, highlighting its robustness to sev ere trafﬁc mixing. This adv antage stems from the carefully designed trafﬁc-a ware attention mechanism, which enables effecti ve information querying from coarse- grained traf ﬁc patches to ﬁne-grained patches across multiple temporal granularities. As a result, PrismWF achie ves a P@5 of 87.54% and a MAP@5 of 91.63%, surpassing B APM by 52.87% and 48.75%, TMWF by 23.54% and 20.80%, CountMamba by 13.65% and 10.17%, and ARES by 4.27% and 3.25%. These results clearly demonstrate that PrismWF is signiﬁcantly more ef fectiv e at identifying mixed trafﬁc traces under multi-tab bro wsing scenarios. 2) Open-world scenario: Consistent with prior work [10], [13], [20], [22], all unmonitored websites are grouped into a single class—introducing an additional label compared to the closed-world setting and substantially increasing classiﬁ- cation difﬁculty , especially under multi-tab trafﬁc mixing. This setting reﬂects a more realistic, challenging threat model, as attackers must handle unknown background trafﬁc. As shown in T able II, PrismWF consistently outperforms all baselines across multi-tab conﬁgurations in the open-world setting. In the 2-tab scenario, it achieves a P@2 of 87.83% and a MAP@2 of 91.97%, demonstrating strong discrimination between monitored and unmonitored traf ﬁc under mild mix- ing. Notably , PrismWF maintains stable performance as the number of concurrent tabs increases: in the challenging 5-tab setting, it attains a P@5 of 88.52% and a MAP@5 of 92.33%, indicating robust performance under heavy trafﬁc interleaving. Its advantages over representative multi-tab baselines become more pronounced with se vere mixing—in the 5-tab open-world scenario, PrismWF outperforms BAPM by 52.87% in P@5 and 47.95% in MAP@5, TMWF by 24.31% in P@5 and 21.27% in MAP@5, CountMamba by 12.92% in P@5 and 9.24% in MAP@5, and ARES by 4.41% in P@5 and 3.22% in MAP@5. This trend suggests that the proposed multi- granularity representation and router-based semantic aggre- gation effecti vely isolate website-speciﬁc ﬁngerprint patterns from unmonitored background traf ﬁc and interleav ed multi- tab ﬂows. Ov erall, these results demonstrate PrismWF’ s strong generalization to open-world settings and superior robustness under the combined challenges of unknown websites and sev ere multi-tab traf ﬁc mixing. C. Multi-T ab Attac k P erformance under V arying Numbers of T abs T o better reﬂect realistic WF attack scenarios, we ev aluate the cross-tab generalization capability of different WF attack models under mixed-tab training and arbitrary-tab testing. In this setting, the number of concurrently opened tabs in captured traf ﬁc is unkno wn a priori ; we refer to it as the mixed-tab setting . Unlike prior ev aluations that train and test models under ﬁxed tab-count settings [22], this setup more closely aligns with practical multi-tab browsing beha viors. Speciﬁcally , we randomly sample 30% of trafﬁc traces from each of the 2-tab, 3-tab, 4-tab, and 5-tab datasets and merge them into a uniﬁed mixed-tab training set. Models trained on this heterogeneous dataset are then ev aluated on test sets with dif ferent ﬁxed numbers of tabs, enabling a comprehensi v e 10 T ABLE III P E R F O R M A N C E U N D E R M I X E D - TA B T R A I N I N G A N D D I FF E R E N T T E S T I N G TA B S E T T I N G S ( % ) . Method 2-tab (T est) 3-tab (T est) 4-tab (T est) 5-tab (T est) P@2 MAP@2 P@3 MAP@3 P@4 MAP@4 P@5 MAP@5 TMWF 60.50 70.09 48.30 61.64 44.99 58.01 38.56 50.93 CountMamba 78.00 86.35 67.54 80.63 63.86 77.91 54.63 69.50 ARES 80.93 88.08 72.67 83.81 71.11 82.69 63.99 77.05 PrismWF 82.01 88.81 75.77 85.90 74.59 85.32 68.12 79.87 assessment of multi-tab WF performance under realistic and varying trafﬁc mixing. As sho wn in T able III, the proposed PrismWF consistently achiev es state-of-the-art performance across all testing tab set- tings. On the 2-tab test set, PrismWF attains a P@2 of 82.01% and a MAP@2 of 88.81%, outperforming CountMamba by 4.01% and 2.46%, and surpassing ARES by 1.08% and 0.73%, respectiv ely . More importantly , PrismWF maintains strong performance advantages under heavier trafﬁc mixing. On the most challenging 5-tab test set, PrismWF achiev es a P@5 of 68.12% and a MAP@5 of 79.87%, e xceeding CountMamba by 13.49% and 10.37%, and outperforming ARES by 4.13% and 2.82%, respectively . W e observe a consistent decline in P@K and MAP@K as the number of concurrent tabs increases, as expected: identifying all target websites grows more difﬁcult with stronger temporal interleaving and feature mixing in trafﬁc traces. Nev ertheless, PrismWF exhibits superior robust- ness and stability compared to existing methods, pro ving that its multi-granularity representation learning and router -based interaction mechanisms ef fectiv ely handle sev ere traf ﬁc mixing under heterogeneous multi-tab conditions. D. Multi-T ab Attack against WF Defenses In this section, we select three representativ e WF defense mechanisms (WTF-P AD [27], Front [35], and RegulaT or [36]) to ev aluate their ef fectiv eness against WF attacks in multi- tab browsing scenarios. T amara w [26] is excluded due to its prohibiti vely high bandwidth ov erhead and latency for practical deployment. WTF-P AD adaptively injects dummy packets into raw trafﬁc traces to perturb original website trafﬁc, disrupting salient statistical features e xploited by ﬁnger- printing attacks. Front targets information leakage in the initial phase of website access and adopts a Rayleigh-distribution- based random dummy packet padding strategy to obfuscate front-end traf ﬁc patterns. RegulaT or combines pack et delay manipulation with dummy packet injection to ef fecti vely blur burst-le vel statistical characteristics of website trafﬁc. W e ev aluate multi-tab WF attacks under these defenses in two representativ e settings: a 2-tab scenario and a more challenging 5-tab scenario. As sho wn in T able IV and V, PrismWF consistently achie ves state-of-the-art performance across all six dataset–defense combinations. In the 2-tab setting, T able IV summarizes the performance of different multi-tab WF attacks under three representative defense mechanisms. Overall, the proposed PrismWF con- sistently achiev es the best performance across all defenses in terms of both P@2 and MAP@2. Under the WTF-P AD defense, PrismWF attains a P@2 of 84.72% and a MAP@2 of 89.70%, outperforming ARES by 2.01% in P@2 and 1.92% in MAP@2, and surpassing CountMamba by 5.64% and 4.34%, respecti vely . Compared with TMWF , PrismWF exhibits substantial improv ements of 24.54% in P@2 and 22.71% in MAP@2, demonstrating superior discriminative capability under moderate packet padding. Among the three defenses, Front is relati vely weaker , under which all multi- tab WF attacks achie ve their highest accuracy . Ne vertheless, PrismWF still delivers the best performance under Front, achieving 86.76% P@2 and 91.22% MAP@2, consistently outperforming all baseline methods. RegulaT or provides the strongest defense among the three by manipulating packet delays and injecting dummy packets at the burst le vel, which signiﬁcantly disrupts coarse-grained traf ﬁc patterns. Despite this challenge, PrismWF remains the most effecti ve method, achieving a P@2 of 72.18% and a MAP@2 of 78.29%. In contrast, CountMamba suf fers a dramatic performance col- lapse under RegulaT or , with both P@2 and MAP@2 reduced to approximately 2.7%, likely due to its reliance on causal CNNs and state-space models that are sensitiv e to burst-le v el perturbations. T ABLE IV C O M PAR I S O N O F W F AT TAC K S O N T H R E E R E P R E S E N TA T I V E D E F E N S E S I N T H E 2 - TAB S E T T I N G . WTF-P AD Front RegulaT or Attack P@2 MAP@2 P@2 MAP@2 P@2 MAP@2 TMWF 60.18 66.99 64.73 72.92 41.49 48.08 CountMamba 79.08 85.36 83.97 89.29 2.69 2.70 ARES 82.71 87.78 85.53 90.39 71.09 77.53 PrismWF 84.72 89.70 86.76 91.22 72.18 78.29 In the 5-tab setting, we observ e that all attack metrics (i.e., P@5 and MAP@5) decrease compared to those in the 2-tab setting. This degradation can be attributed to more intricate trafﬁc mixing and interleaving dynamics, which substantially increase the difﬁculty of multi-tab WF attack. Despite this elev ated comple xity , the proposed PrismWF model maintains relativ ely stable performance across all three representativ e defense mechanisms. More importantly , PrismWF exhibits more pronounced advantages over competing baselines in the 5-tab setting than in the 2-tab setting. As shown in T able V, PrismWF achiev es the highest P@5 and MAP@5 values under all defense scenarios, with P@5 scores of 77.94% (WTF- P AD), 83.92% (Front), and 53.49% (RegulaT or). Compared with the second-best baseline (ARES), PrismWF improves P@5 by 4.05%, 5.92%, and 0.87% under the three WF 11 (a) Maximum Loading Time (b) T ime Slot Interval (c) Number of Blocks Fig. 4. Ablation study of different hyperparameter settings on WF attack performance. defenses, respectively . These results clearly demonstrate the superior rob ustness of PrismWF in complex multi-tab brows- ing en vironments. T ABLE V C O M PAR I S O N O F W F AT TAC K S O N T H R E E R E P R E S E N TA T I V E D E F E N S E S I N T H E 5 - TAB S E T T I N G . WTF-P AD Front RegulaT or Attack P@5 MAP@5 P@5 MAP@5 P@5 MAP@5 TMWF 42.45 49.66 39.50 47.12 19.16 22.01 CountMamba 59.86 69.00 65.48 74.58 5.69 5.77 ARES 73.89 80.69 78.00 84.20 52.62 58.93 PrismWF 77.94 83.99 83.92 88.90 53.49 59.67 E. Ablation Study In this section, we conduct systematic ablation studies on the proposed model to e v aluate the contributions of each core component to the overall WF attack performance, and further in vestigate the sensitivity of key hyperparameters. 1) Design of T race F eatur e Representation: The proposed trace feature representation comprises six channels, con- structed by integrating three types of trafﬁc statistical features. Each trafﬁc statistic is further split into incoming and outgoing packet streams, resulting in a total of 3 × 2 = 6 feature channels. For ablation experiments, we isolate each type of trafﬁc statistic (i.e., its corresponding incoming and outgoing channels) individually , along with the full fused feature set, yielding four distinct experimental conﬁgurations to quantify the contrib ution of each feature component. Fig. 5. Ablation study on different trace feature representations. As shown in Fig. 5, the fused (Whole) feature set achiev es the best attack performance across both 2-tab and 5-tab settings, with MAP@2 reaching 93.10% and 91.63% re- spectiv ely . Among single-feature settings, packet count yields the highest performance, conﬁrming its status as the most discriminativ e feature for WF attacks—a ﬁnding consistent with prior work [10], [11]. Although the time interv al feature exhibits relatively low performance when used in isolation, it provides complementary temporal information that is critical for enhancing the accurac y of the fused feature input. 2) Impact of Maximum Loading T ime: W e analyze the impact of the maximum loading time used in robust trace construction. The maximum loading time determines how much traf ﬁc is retained from each trace and thus directly affects the completeness of temporal information av ailable for WF attack. As shown in Fig. 4a, we ev aluate four different maximum loading times: 80 s, 120 s, 160 s, and 240 s, while keeping all other settings unchanged. The results show that perfor- mance improv es signiﬁcantly when increasing the loading time from 80 s to 160 s, indicating that longer observation windows provide more discriminative traf ﬁc patterns. Specif- ically , PrismWF achie ves a P@2 of 89.46% and a MAP@2 of 93.10% at 160 s. Further increasing the loading time to 240 s brings only marginal gains (P@2 = 89.45%, MAP@2 = 93.21%), suggesting diminishing returns. Therefore, we adopt 160 s as the default maximum loading time, which offers a fa vorable trade-off between attack effecti v eness and practical inference ef ﬁciency . 3) Impact of T ime Slot Interval: W e next explore the effect of time slot interval on trace discretization, a key parameter that controls the temporal resolution of robust trace represen- tations and dictates the input sequence length of the model. As sho wn in Fig. 4b, we v ary the time slot interv al from 10 ms to 50 ms under a ﬁxed maximum loading time of 160 s. A smaller interval results in ﬁner -grained temporal features b ut longer input sequences, whereas a larger interv al leads to coarser representations with reduced sequence length. Experimental results demonstrate a consistent decline in at- tack performance as the interval increases. In particular , the optimal performance is observed at 20 ms, where PrismWF attains a P@2 of 89.46% and a MAP@2 of 93.10%. When the interv al exceeds 30 ms, both P@2 and MAP@2 drop noticeably , suggesting that overly coarse temporal aggregation erodes the model’ s capability to capture ﬁne-grained trafﬁc dynamics. Based on this trade-off analysis, we set the default 12 time slot interv al to 20 ms, which ef fectiv ely balances temporal resolution, attack performance, and computational efﬁcienc y . 4) Impact of Number of Blocks: W e in vestigate the inﬂu- ence of the number of stacked Multi-Granularity Attention Blocks on WF attack performance. Speciﬁcally , we v ary the number of blocks from 1 to 5 while keeping all other settings unchanged. As sho wn in Fig. 4c, performance consistently improv es as more blocks are stacked, indicating that deeper multi-granularity interaction enables more ef fectiv e reﬁnement of traf ﬁc representations. W ith only one block, the model achiev es the lo west performance, with P@2 = 87.51% and MAP@2 = 91.52%. As the number of blocks increases, both metrics steadily improv e, reaching the best performance at 5 blocks (P@2 = 90.03%, MAP@2 = 93.50%). These results show that stacking multiple Multi-Granularity Atten- tion Blocks impro ves the model’ s ability to capture com- plex temporal dependencies and cross-granularity interactions. Howe v er , more blocks also introduce higher computational cost. Therefore, the number of blocks should be chosen by balancing performance gains and computational efﬁcienc y . 5) Sensitivity Analysis of Model Arc hitectur e: The core component of PrismWF is the Multi-Granularity Attention Block , which integrates three complementary interaction mech- anisms for multi-tab WF attacks. T o quantify the contribution of each component, we conduct ablation studies by selecti vely removing individual modules under the most challenging set- ting, i.e., the 5-tab scenario with representati ve WF defenses. Speciﬁcally , we ev aluate three variants: (i) removing router interaction (RI), (ii) removing ﬁne-to-coarse cross-granularity interaction (GI), and (iii) replacing the multi-granularity design with a single-granularity CNN equipped only with intra- granularity attention. These variants are compared against the full PrismWF model to assess the effecti veness of each architectural component. T ABLE VI S E N S I T I V I T Y A N A L Y S I S O F M O D E L A R C H I T E C T U R E . WTF-P AD Fr ont RegulaT or Attack P@5 MAP@5 P@5 MAP@5 P@5 MAP@5 PrismWF (w/o RI + GI) 69.24 79.13 76.49 84.66 43.26 50.10 PrismWF (w/o RI) 75.80 83.17 80.60 87.08 50.89 59.18 PrismWF (Single-G) 73.89 82.02 79.41 86.66 48.89 57.49 PrismWF (Full) 77.94 83.99 83.92 88.90 53.49 59.67 As sho wn in T able VI, the full PrismWF model con- sistently achie ves the best performance across all WF de- fenses. In particular , PrismWF attains P@5/MAP@5 scores of 77.94%/83.99% under WTF-P AD, 83.92%/88.90% under Front, and 53.49%/59.67% under RegulaT or , demonstrating the ef fectiv eness of the proposed architecture under dif ferent WF defenses. Removing router interaction leads to noticeable degradation under all defenses. For example, under RegulaT or , PrismWF (w/o RI) achie ves a P@5 of 50.89%, which is 2.60% lower than the full model, indicating that router -le vel interaction provides consistent performance gains by aggregat- ing global semantic information across temporal scales. The largest drop occurs when both router interaction and cross- granularity interaction are removed. PrismWF (w/o RI + GI) suffers P@5 reductions of 8.70%, 7.43%, and 10.23% under WTF-P AD, Front, and RegulaT or , respectiv ely , suggesting that multi-granularity feature extraction alone is insufﬁcient under se vere trafﬁc mixing and defense-induced obfuscation. The single-granularity variant also exhibits inferior perfor- mance relati ve to the full model. Under RegulaT or , PrismWF (Single-G) achieves a P@5 of 48.89%, which is 4.60% lower than the full model, further conﬁrming the importance of multi-granularity modeling. Overall, these results highlight that router interaction and cross-granularity interaction play complementary and indispensable roles. Router interaction facilitates global semantic fusion across temporal scales, while cross-granularity interaction enables ef fectiv e ﬁne-to-coarse information alignment. T ogether with multi-branch CNNs for extracting temporal features at different resolutions, the pro- posed architecture signiﬁcantly enhances robustness against complex trafﬁc mixing and strong WF defenses. V I . D I S C U S S I O N In this section, we discuss the limitations of the proposed method and outline promising directions for future work. Multi-T ab WF Attacks for Large-Scale Monitoring. In this work, we consider a moderate-scale setting with approximately 100 monitored websites. Scaling multi-tab WF attacks to much larger monitored sets (e.g., hundreds or thousands of websites) remains a challenging problem. As the number of target websites increases, trafﬁc patterns become more di verse and label ambiguity becomes more sev ere, which may degrade classiﬁcation performance. A potential future direction is to incorporate structured relationships among websites (e.g., user browsing preferences or co-visitation patterns), which may further enhance scalability and robustness in large-scale multi- tab scenarios [37], [38]. Designing WF defenses tailored f or multi-tab scenarios. Most existing WF defense mechanisms are designed under the single-tab assumption [27], [35], [36], [39]. Their effecti ve- ness may de grade in multi-tab settings, where traf ﬁc mixing and interlea ving introduce more complex temporal dynamics. Giv en that multi-tab browsing represents a more realistic and challenging threat model, it is important to design WF defenses that explicitly account for concurrent website access. Future work may explore defense mechanisms that le verage inter-tab trafﬁc interactions or dynamically adapt padding and scheduling strategies based on multi-tab trafﬁc characteristics. Evaluation in Real-W orld Defense Deployments. Consistent with prior works [10], [11], [22], [40], this paper e valuates the proposed method using simulated defense mechanisms. Howe v er , simulated en vironments often overlook practical fac- tors such as deployment costs, network latency , and dynamic trafﬁc v ariations in real-w orld scenarios—factors that may lead to discrepancies between simulated and real-world defense performance. In future work, we will deplo y representati ve WF defenses in real-world en vironments and ev aluate both at- tack effecti veness and defense cost under realistic operational settings. V I I . C O N C L U S I O N In this paper , we propose PrismWF , a high-performance multi-tab WF attack method tailored for trafﬁc mixing scenar- 13 ios in realistic multi-tab browsing en vironments. Speciﬁcally , PrismWF ﬁrst constructs a robust traf ﬁc feature representation, then lev erages a multi-branch CNN to extract trafﬁc features at different granularities, and ﬁnally employs a multi-granularity attention block—custom-designed to capture the inherent char - acteristics of real-world multi-tab mixed traf ﬁc—to reﬁne fea- ture representations, thereby enabling ef fecti ve website iden- tiﬁcation. Extensive experiments demonstrate that PrismWF achiev es state-of-the-art performance with remarkable stabil- ity across three challenging scenarios: closed-world settings, open-world settings, and en vironments with representativ e WF defense mechanisms deployed. For future work, we plan to extend the proposed method to support large-scale monitored website sets and to explore dedicated defense mechanisms targeting multi-tab WF attacks. R E F E R E N C E S [1] R. Dingledine, N. Mathewson, and P . Syverson, “T or: The { Second- Generation } onion router, ” in 13th USENIX Security Symposium (USENIX Security 04) , 2004. [2] A. Mani, T . W ilson-Brown, R. Jansen, A. Johnson, and M. Sherr, “Understanding tor usage with priv acy-preserving measurement, ” in Pr oceedings of the Internet Measurement Confer ence 2018 , 2018, pp. 175–187. [3] T . W ang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, “Effecti ve attacks and provable defenses for website ﬁngerprinting, ” in 23r d USENIX Security Symposium (USENIX Security 14) , 2014, pp. 143– 157. [4] A. Panchenko, F . Lanze, J. Pennekamp, T . Engel, A. Zinnen, M. Henze, and K. W ehrle, “W ebsite ﬁngerprinting at internet scale. ” in NDSS , vol. 1, 2016, p. 23477. [5] J. Hayes and G. Danezis, “k-ﬁngerprinting: A robust scalable web- site ﬁngerprinting technique, ” in 25th USENIX Security Symposium (USENIX Security 16) , 2016, pp. 1187–1203. [6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition, ” in Pr oceedings of the IEEE confer ence on computer vision and pattern r ecognition , 2016, pp. 770–778. [7] V . Rimmer, D. Preuveneers, M. Juarez, T . V an Goethem, and W . Joosen, “ Automated website ﬁngerprinting through deep learning, ” arXiv pr eprint arXiv:1708.06376 , 2017. [8] S. Bhat, D. Lu, A. Kwon, and S. Dev adas, “V ar-cnn: A data-efﬁcient website ﬁngerprinting attack based on deep learning, ” arXiv preprint arXiv:1802.10215 , 2018. [9] P . Sirinam, N. Mathews, M. S. Rahman, and M. Wright, “T riplet ﬁngerprinting: More practical and portable website ﬁngerprinting with n-shot learning, ” in Proceedings of the 2019 ACM SIGSAC Confer ence on Computer and Communications Security , 2019, pp. 1131–1148. [10] P . Sirinam, M. Imani, M. Juarez, and M. Wright, “Deep ﬁngerprinting: Undermining website ﬁngerprinting defenses with deep learning, ” in Pr oceedings of the 2018 A CM SIGSAC confer ence on computer and communications security , 2018, pp. 1928–1943. [11] M. Shen, K. Ji, Z. Gao, Q. Li, L. Zhu, and K. Xu, “Subverting website ﬁngerprinting defenses with robust traf ﬁc representation, ” in 32nd USENIX Security Symposium (USENIX Security 23) , 2023, pp. 607–624. [12] Z. Guan, G. Xiong, G. Gou, Z. Li, M. Cui, and C. Liu, “Bapm: block attention proﬁling model for multi-tab website ﬁngerprinting attacks on tor , ” in Pr oceedings of the 37th Annual Computer Security Applications Confer ence , 2021, pp. 248–259. [13] Z. Jin, T . Lu, S. Luo, and J. Shang, “Transformer-based model for multi- tab website ﬁngerprinting attack, ” in Pr oceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security , 2023, pp. 1050–1064. [14] Y . Xu, T . W ang, Q. Li, Q. Gong, Y . Chen, and Y . Jiang, “ A multi- tab website ﬁngerprinting attack, ” in Pr oceedings of the 34th Annual Computer Security Applications Conference , 2018, pp. 327–341. [15] A. Panchenko, L. Niessen, A. Zinnen, and T . Engel, “W ebsite ﬁnger- printing in onion routing based anonymization netw orks, ” in Pr oceedings of the 10th annual ACM workshop on Privacy in the electr onic society , 2011, pp. 103–114. [16] X. Cai, X. C. Zhang, B. Joshi, and R. Johnson, “T ouching from a distance: W ebsite ﬁngerprinting attacks and defenses, ” in Proceedings of the 2012 ACM confer ence on Computer and communications security , 2012, pp. 605–616. [17] T . W ang and I. Goldberg, “Improved website ﬁngerprinting on tor, ” in Pr oceedings of the 12th ACM workshop on W orkshop on privacy in the electr onic society , 2013, pp. 201–212. [18] X. Deng, Q. Li, and K. Xu, “Robust and reliable early-stage website ﬁngerprinting attacks via spatial-temporal distribution analysis, ” in Pr o- ceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , 2024, pp. 1997–2011. [19] N. Carion, F . Massa, G. Synnaev e, N. Usunier , A. Kirillov , and S. Zagoruyko, “End-to-end object detection with transformers, ” in Eur opean confer ence on computer vision . Springer , 2020, pp. 213– 229. [20] X. Deng, Q. Y in, Z. Liu, X. Zhao, Q. Li, M. Xu, K. Xu, and J. W u, “Robust multi-tab website ﬁngerprinting attacks in the wild, ” in 2023 IEEE symposium on security and privacy (SP) . IEEE, 2023, pp. 1005– 1022. [21] X. Deng, X. Zhao, Q. Y in, Z. Liu, Q. Li, M. Xu, K. Xu, and J. W u, “T owards robust multi-tab website ﬁngerprinting, ” IEEE T ransactions on Networking , 2026. [22] X. Deng, R. Zhao, Y . W ang, M. Zhan, Z. Xue, and Y . W ang, “Count- mamba: A generalized website ﬁngerprinting attack via coarse-grained representation and ﬁne-grained prediction, ” in 2025 IEEE Symposium on Security and Privacy (SP) . IEEE, 2025, pp. 1419–1437. [23] N. Mathews, J. K. Holland, S. E. Oh, M. S. Rahman, N. Hopper, and M. Wright, “Sok: A critical evaluation of efﬁcient website ﬁngerprinting defenses, ” in 2023 IEEE Symposium on Security and Privacy (SP) . IEEE, 2023, pp. 969–986. [24] Y . Cui, G. W ang, K. V u, K. W ei, K. Shen, Z. Jiang, X. Han, N. W ang, Z. Lu, and Y . Liu, “ A comprehensive surve y of website ﬁngerprinting attacks and defenses in tor: Advances and open challenges, ” arXiv pr eprint arXiv:2510.11804 , 2025. [25] K. P . Dyer , S. E. Coull, T . Ristenpart, and T . Shrimpton, “Peek-a-boo, i still see you: Why efﬁcient traf ﬁc analysis countermeasures f ail, ” in 2012 IEEE symposium on security and privacy . IEEE, 2012, pp. 332–346. [26] X. Cai, R. Nithyanand, T . W ang, R. Johnson, and I. Goldberg, “ A systematic approach to developing and ev aluating website ﬁngerprinting defenses, ” in Pr oceedings of the 2014 ACM SIGSA C confer ence on computer and communications security , 2014, pp. 227–238. [27] M. Juarez, M. Imani, M. Perry , C. Diaz, and M. Wright, “T oward an efﬁcient website ﬁngerprinting defense, ” in European Symposium on Resear ch in Computer Security . Springer , 2016, pp. 27–46. [28] M. S. Rahman, M. Imani, N. Mathews, and M. Wright, “Mockingbird: Defending against deep-learning-based website ﬁngerprinting attacks with adversarial traces, ” IEEE T ransactions on Information F orensics and Security , vol. 16, pp. 1594–1609, 2020. [29] A. M. Sadeghzadeh, B. T ajali, and R. Jalili, “ A wa: Adversarial website adaptation, ” IEEE T ransactions on Information F orensics and Security , vol. 16, pp. 3109–3122, 2021. [30] L. Qiao, B. Wu, H. Li, C. Gao, W . Y uan, and X. Luo, “Trace-agnostic and adversarial training-resilient website ﬁngerprinting defense, ” in IEEE INFOCOM 2024-IEEE Confer ence on Computer Communications . IEEE, 2024, pp. 211–220. [31] M. S. Rahman, P . Sirinam, N. Mathe ws, K. G. Gangadhara, and M. Wright, “T ik-tok: The utility of packet timing in website ﬁnger- printing attacks, ” arXiv preprint , 2019. [32] N. Mathews, J. K. Holland, N. Hopper, and M. Wright, “Laserbeak: Evolving website ﬁngerprinting attacks with attention and multi-channel feature representation, ” IEEE T ransactions on Information F orensics and Security , 2024. [33] J. Gong, W . Cai, S. Liang, Z. Guan, T . W ang, and E.-C. Chang, “Wfcat: Augmenting website ﬁngerprinting with channel-wise attention on timing features, ” IEEE T ransactions on Dependable and Secure Computing , 2025. [34] J. Liu, W .-C. Chang, Y . W u, and Y . Y ang, “Deep learning for extreme multi-label text classiﬁcation, ” in Pr oceedings of the 40th international ACM SIGIR confer ence on r esearc h and development in information r etrieval , 2017, pp. 115–124. [35] J. Gong and T . W ang, “Zero-delay lightweight defenses against website ﬁngerprinting, ” in 29th USENIX Security Symposium (USENIX Security 20) , 2020, pp. 717–734. [36] J. K. Holland and N. Hopper, “Regulator: A straightforward website ﬁngerprinting defense, ” arXiv preprint , 2020. [37] Z.-M. Chen, X.-S. W ei, P . W ang, and Y . Guo, “Multi-label image recognition with graph con volutional networks, ” in Proceedings of the 14 IEEE/CVF confer ence on computer vision and pattern reco gnition , 2019, pp. 5177–5186. [38] T . Ridnik, E. Ben-Baruch, N. Zamir , A. Noy , I. Friedman, M. Protter , and L. Zelnik-Manor, “ Asymmetric loss for multi-label classiﬁcation, ” in Pr oceedings of the IEEE/CVF international conference on computer vision , 2021, pp. 82–91. [39] M. Shen, K. Ji, J. W u, Q. Li, X. Kong, K. Xu, and L. Zhu, “Real- time website ﬁngerprinting defense via traf ﬁc cluster anon ymization, ” in 2024 IEEE Symposium on Security and Privacy (SP) . IEEE, 2024, pp. 3238–3256. [40] M. Shen, J. W u, J. Ai, Q. Li, C. Ren, K. Xu, and L. Zhu, “Swallow: A transfer -robust website ﬁngerprinting attack via consistent feature learning, ” in Pr oceedings of the 2025 A CM SIGSA C Conference on Computer and Communications Security , 2025, pp. 1574–1588.

PrismWF: A Multi-Granularity Patch-Based Transformer for Robust Website Fingerprinting Attack

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment