m^3TrackFormer: Transformer-based mmWave Multi-Target Tracking with Lost Target Re-Acquisition Capability

Tongkai Li, Weifeng Zhu, Shuowen Zhang, Jiannong Cao, Shuguang Cui, and Liang Liu

Abstract: This paper considers a millimeter wave (mmWave) integrated sensing and communication (ISAC) system, where a base station (BS) equipped with a large number of antennas but a small number of radio-frequency (RF) chains emits pencil-like narrow beams for persistent tracking of multiple moving targets. Under this model, the tracking lost issue arising from the misalignment between the pencil-like beams and the true target positions is inevitable, especially when the trajectories of the targets are complex, and the conventional Kalman filter-based scheme does not work well. To deal with this issue, we propose a Transformer-based mmWave multi-target tracking framework, namely m^3TrackFormer, with a novel re-acquisition mechanism, such that even if the echo signals from some targets are too weak to extract sensing information, we are able to re-acquire their locations quickly with small beam sweeping overhead. Specifically, the proposed framework operates in one of two modes during the tracking procedure, normal tracking or target re-acquisition, depending on whether tracking lost occurs. When all targets are hit by the swept beams, the framework works in the Normal Tracking Mode (N-Mode) with a Transformer encoder-based Normal Tracking Network (N-Net) to accurately estimate the positions of these targets and predict the swept beams in the next time block. When tracking lost happens, the framework switches to the Re-Acquisition Mode (R-Mode) with a Transformer decoder-based Re-Acquisition Network (R-Net) to adjust the beam sweeping strategy for getting back the lost targets while maintaining the tracking of the remaining targets.
Thanks to its ability to extract global trajectory features, the m^3TrackFormer can achieve high beam prediction accuracy and quickly re-acquire the lost targets, compared with other tracking methods. Numerical experiments demonstrate that the m^3TrackFormer can maintain high tracking success probability with much longer tracking durations than the representative benchmarks.

Index Terms: Integrated sensing and communication (ISAC), multi-target tracking, millimeter wave, Transformer, target re-acquisition.

An earlier version of this paper was presented in part at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [1].
T. Li, W. Zhu, S. Zhang, and L. Liu are with the Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China (e-mails: tongkai.li@connect.polyu.hk, {eee-wf.zhu, shuowen.zhang, liang-eie.liu}@polyu.edu.hk).
J. Cao is with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China (e-mail: jiannong.cao@polyu.edu.hk).
S. Cui is with the School of Science and Engineering and the Future Network of Intelligence Institute, The Chinese University of Hong Kong, Shenzhen, 518172, China (e-mail: shuguangcui@cuhk.edu.cn).

[Fig. 1. System model for target tracking in the 6G mmWave ISAC system: a tracking lost event occurs when the pencil-like mmWave beams are not precisely pointed to the target positions.]

I. INTRODUCTION

A. Motivation

ITU-R has recently identified integrated sensing and communication (ISAC) as a primary usage scenario of the sixth-generation (6G) cellular network [2]. This has inspired a great amount of effort in investigating the integration of sensing functionality into a communication-oriented cellular network [3]–[6].
Notably, the millimeter wave (mmWave) band promised in the 6G network provides sufficient bandwidth that is beneficial to most sensing applications, e.g., high-resolution localization and imaging. However, the story is totally different for tracking, which aims at precisely localizing moving targets in each time block based on historical signals and is essential for many applications such as beam alignment in mobile mmWave communication [7] and unregistered drone detection in the low-altitude economy [8]. Specifically, due to the expensive radio frequency (RF) chains at the mmWave band, analog beamforming is widely adopted in mmWave systems, which yields narrow beams. The pencil-like narrow beams make the 6G mmWave base station (BS) lack adequate historical signals for tracking, because some of its previously emitted beams may not hit the moving targets. This gives rise to the tracking lost issue [9], as shown in Fig. 1.

There are two important actions for tackling the tracking lost issue in 6G mmWave systems: prevention and re-acquisition. Action 1 - Prevention: in a time block, if the beams are well aligned with the target, i.e., the target is not currently lost, we should precisely predict the location of this target for beam alignment in the next time block, such that the target remains tracked with high probability in the future block. Action 2 - Re-acquisition: in a time block, if the beams are not well aligned with the target such that no sensing information can be extracted from the received signal, we should re-acquire the target location in the future blocks by transmitting beams to various sites. In the literature, the classic approaches for prevention and re-acquisition in mmWave tracking systems are the Kalman filter (KF) and exhaustive beam sweeping, respectively.
The KF leverages the state space model and the available measurements to predict the future location of the target [10]. However, its performance is sensitive to the accuracy of this prior knowledge and of the state measurements in each time block, and may be severely degraded by an inaccurate state model and heavily corrupted measurements [11]. Exhaustive beam sweeping performs re-acquisition by scanning beams over the whole space to re-identify the location of the target. However, this approach takes too much time in practice, and generates interference to communication users at all positions.

Recently, deep learning (DL)-based solutions to mmWave tracking have emerged as promising alternatives. The DL methods aim to learn the inherent dynamics of the target mobility from pre-collected data, thereby overcoming the deficiency of explicit characterization of the complicated mobility model. Therefore, they are well suited to the prevention and re-acquisition tasks in challenging tracking scenarios involving highly nonlinear and complex target trajectories.

B. Prior Works

Recently, the DL-based tracking problem in mmWave systems has been investigated in [11]–[18], which mainly focus on the prevention issue. Specifically, in [11], a hybrid model-based and data-driven method is proposed to learn the filtering operation of the KF by a recurrent neural network (RNN), thereby enhancing the tracking performance in the case with model mismatch. Similarly, [12] addresses the codebook-based analog beam tracking problem by introducing a Hidden Markov Model (HMM)-based filter realized by a deep neural network (DNN) to learn the transition probabilities between candidate beam states, which eliminates the need for prior knowledge of the mobility model.
In [13], a long short-term memory (LSTM) network is designed to directly predict the beamforming matrix from the spatial features extracted by a convolutional neural network (CNN) in the vehicular network, thus bypassing the need for explicit channel tracking. For the multi-cell system, [14] proposes an autoencoder-LSTM architecture to jointly predict multiple beams across different cells with reduced computational complexity. [15] proposes an LSTM-based predictive beamforming algorithm for UAV communications, where real-time angle refinement is incorporated to mitigate beam misalignment caused by UAV jittering. Despite their effectiveness, RNN- and LSTM-based models exhibit limited capability in capturing global temporal dependencies due to their sequential processing mechanism, and thus suffer from low-precision tracking performance for targets with complicated mobilities. Beyond the above supervised learning methods, reinforcement learning (RL) approaches such as multi-armed bandit (MAB) and deep Q-learning have also been proposed to tackle the beam tracking problem by interacting with the environment in real time to learn an optimal beam prediction policy [16]–[18]. Compared with RNN-based schemes, RL-based approaches exhibit stronger generalization capability in complex and dynamic environments. However, in mmWave tracking scenarios where the number of candidate beams is large, the RL-based approaches often suffer from slow convergence and a time-consuming training process due to the high dimensionality of the action and state spaces.

Despite the various advances on prevention, tracking lost remains a critical problem in the mmWave system. In the case with tracking lost, the conventional DL-based approaches in [11]–[15] fail to work due to the missing measurements of the tracked targets [19].
In practice, rapid trajectory variations can easily result in tracking lost. As a solution, the data imputation technique can be employed with existing tracking schemes to estimate the missing entries and predict the swept beams in the future [20]. However, this method ignores the trajectory variations of the lost targets for beam prediction, resulting in severe error accumulation [21], [22] and quite a low probability of lost target re-acquisition. Consequently, frequent exhaustive beam sweeping is required for finding the lost targets. To avoid the reliance on exhaustive beam sweeping, this work is motivated to design a robust target tracking scheme enabled by a complementary target re-acquisition mechanism with low beam sweeping overhead. To the best of our knowledge, research on the tracking problem with the consideration of lost target re-acquisition is still missing, and this work is the first attempt to address this problem.

C. Main Contributions

This paper aims to design a robust multi-target tracking framework in the 6G mmWave ISAC system. Motivated by the powerful capability of the Transformer architecture in time-series data processing [23], [24], we propose a novel Transformer-based framework to simultaneously track multiple targets and predict the swept beams in the next time block. In particular, the proposed framework can facilitate fast re-acquisition of lost targets by exploiting the temporal dependencies in incomplete trajectory sequences and the negative tracking lost events. The main distinctions and contributions of this work can be summarized as follows:
• This paper proposes a two-mode Transformer-based framework for the mmWave multi-target tracking task. Specifically, the proposed tracking framework executes its Normal Tracking Mode (N-Mode) to perform tracking on all targets if no target lost events occur.
Otherwise, the Re-Acquisition Mode (R-Mode) is triggered to get back the lost targets as soon as possible. In contrast to the conventional tracking methods, the proposed framework can realize robust target tracking performance with low beam sweeping overhead by benefiting from the carefully designed R-Mode.
• We first propose a novel m^2TrackFormer in the single-target scenario, which consists of a Normal Tracking Network (N-Net) and a Re-Acquisition Network (R-Net). Specifically, the N-Net is designed based on the Transformer encoder architecture and activated in the N-Mode to perform tracking and beam prediction, which leverages the powerful Self-Attention mechanism to capture the global motion features during the tracking process. Then, the R-Net is designed based on the Transformer decoder architecture and utilized for the R-Mode, which applies the Cross-Attention mechanism to integrate the negative information from beam misalignment and the positive information in the historical trajectory for target re-acquisition. Compared with traditional tracking schemes, the implicit information in the target-beam misalignment is also leveraged in the tracking procedure, thereby enhancing the tracking robustness.
• We then propose a scalable m^3TrackFormer in the multi-target scenario to support the tracking and re-acquisition of multiple targets. Compared with m^2TrackFormer, the motion features of each target are extracted in parallel and then aggregated to generate a joint beam sweeping strategy in m^3TrackFormer, which realizes flexible adjustment of the swept beams with an arbitrary number of lost targets.
• Numerical results demonstrate that the proposed framework consistently achieves high tracking success probability with much longer tracking durations than conventional KF and RNN-based tracking schemes.
Specifically, the tracking duration is improved by 15% in low-mobility scenarios and by more than 130% in high-mobility scenarios, which validates the effectiveness of the proposed re-acquisition mechanism in handling rapid trajectory variations and re-acquiring lost targets without introducing extra beam resources. Moreover, the framework maintains millisecond-level inference latency as the number of targets increases, demonstrating its scalability and feasibility for real-time implementation in 6G mmWave ISAC systems.

D. Organization

The rest of the paper is organized as follows. Section II introduces the system model of the 6G mmWave tracking system. Section III formulates the problem and introduces the two-mode tracking framework used to solve it. Section IV provides a detailed illustration of the network design for the proposed framework in the single-target scenario. The general tracking framework for the multi-target scenario is introduced in Section V. Finally, Section VI evaluates the performance of the proposed tracking framework, and Section VII concludes this work.

Notation: Column vectors and matrices are denoted by boldfaced lowercase and uppercase letters, e.g., $\mathbf{x}$ and $\mathbf{X}$. $\mathbb{R}^{n \times n}$ and $\mathbb{C}^{n \times n}$ represent the sets of $n \times n$ real and complex matrices, respectively. The superscripts $(\cdot)^T$ and $(\cdot)^H$ denote the transpose and conjugate transpose operations, respectively. $\mathcal{U}[a, b]$ denotes the probability density function of the uniform distribution on the interval $[a, b]$. $\mathbf{x}[i]$ denotes the $i$-th element of the vector $\mathbf{x}$. $\mathbf{X}[i, :]$, $\mathbf{X}[:, j]$, and $\mathbf{X}[i, j]$ denote the $i$-th row, the $j$-th column, and the $(i, j)$-th element of the matrix $\mathbf{X}$, respectively. $\mathbf{I}_n$ denotes an $n \times n$ identity matrix. $\mathbf{0}_n$ and $\mathbf{1}_n$ denote an all-zero vector and an all-one vector of length $n$, respectively. $\|\cdot\|_2$ denotes the Euclidean norm.
$|\cdot|$ denotes the magnitude of a complex scalar.

[Fig. 2. The transmission protocol for the considered mmWave ISAC system. Each block consists of a tracking phase of $K$ OFDM symbols, with the beams $(\mathbf{W}^{\mathrm{RF}}_{q,k}, \mathbf{F}^{\mathrm{RF}}_{q,k})$ emitted in the $k$-th symbol of the $q$-th block, and a communication phase.]

II. SYSTEM MODEL

We consider a mmWave orthogonal frequency division multiplexing (OFDM) ISAC system in which a BS transmits radio-frequency signals for tracking $I$ moving targets, denoted by $\mathcal{I} = \{1, \ldots, I\}$, based on their echo signals, and communicating with $U$ users, denoted by $\mathcal{U} = \{1, \ldots, U\}$. The BS is equipped with a transmit uniform planar array (UPA) of $N_T = N_T^x \times N_T^y$ antennas and $N_T^{\mathrm{RF}} \ll N_T$ radio frequency (RF) chains, and a receive UPA of $N_R = N_R^x \times N_R^y$ antennas and $N_R^{\mathrm{RF}} \ll N_R$ RF chains. Denote the BS location as $\mathbf{p}_{\mathrm{BS}} = [x_{\mathrm{BS}}, y_{\mathrm{BS}}, z_{\mathrm{BS}}]^T$ in a three-dimensional (3D) Cartesian coordinate system. The targets move over a time duration of $T$ seconds (s), which consists of $Q$ time blocks, each with a duration of $\Delta T = T/Q$ s. The location of each target is assumed to be fixed within one block, but varying across different blocks [25], [26]. In the $q$-th block, the location of target $i \in \mathcal{I}$ is denoted as $\mathbf{u}_{i,q} = [x_{i,q}, y_{i,q}, z_{i,q}]^T$, $q = 1, \ldots, Q$, while the range, azimuth angle, and elevation angle of target $i$ relative to the BS are denoted by $d_{i,q} = \|\mathbf{u}_{i,q} - \mathbf{p}_{\mathrm{BS}}\|_2$, $\theta_{i,q} = \arctan\left(\frac{y_{i,q} - y_{\mathrm{BS}}}{x_{i,q} - x_{\mathrm{BS}}}\right)$, and $\phi_{i,q} = \arccos\left(\frac{z_{i,q} - z_{\mathrm{BS}}}{d_{i,q}}\right)$, respectively. Let $\bar{K}$ denote the number of OFDM symbols within a block, and $L$ denote the number of sub-carriers for each OFDM symbol.
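The range and angle definitions above can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function name `target_geometry` and the example coordinates are hypothetical, and `np.arctan2` is used as a quadrant-aware substitute for the plain $\arctan$ ratio in the text (the two agree whenever $x_{i,q} > x_{\mathrm{BS}}$).

```python
import numpy as np

def target_geometry(u, p_bs):
    """Range, azimuth, and elevation of a target at u relative to a BS at p_bs."""
    delta = np.asarray(u, dtype=float) - np.asarray(p_bs, dtype=float)
    d = np.linalg.norm(delta)               # d = ||u - p_BS||_2
    theta = np.arctan2(delta[1], delta[0])  # quadrant-aware form of arctan(dy / dx)
    phi = np.arccos(delta[2] / d)           # elevation, measured from the z-axis
    return d, theta, phi

# Hypothetical example: target at (30, 40, 0) m, BS at the origin.
d, theta, phi = target_geometry([30.0, 40.0, 0.0], [0.0, 0.0, 0.0])
# d is 50.0 and phi is pi/2, since the target lies in the BS's horizontal plane.
```

Note that the elevation angle here is measured from the z-axis, so a target at the same height as the BS has $\phi = \pi/2$ rather than 0.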
We divide each block into two phases: a tracking phase that consists of $K < \bar{K}$ OFDM symbols and a communication phase that consists of $\bar{K} - K$ OFDM symbols, as shown in Fig. 2. In the tracking phase of each block, the BS transmits $B_s = K N_T^{\mathrm{RF}} > I$ beams to track the $I$ targets over $K$ symbol durations. In the subsequent communication phase, the BS transmits $B_c = (\bar{K} - K) N_T^{\mathrm{RF}}$ beams to deliver messages to the $U$ users over $\bar{K} - K$ symbol durations. Since mmWave communication has been widely studied, in this paper, we focus on the mmWave tracking task.

During the tracking phase, the BS's transmit signal on the $l$-th sub-carrier of the $k$-th OFDM symbol in the $q$-th block can be expressed as

$$\mathbf{x}_{q,k,l} = \sqrt{P}\, \mathbf{W}^{\mathrm{RF}}_{q,k} \mathbf{W}^{\mathrm{BB}}_{q,k,l} \mathbf{s}_{q,k,l}, \quad \forall q, k, l, \tag{1}$$

where $\mathbf{s}_{q,k,l} \in \mathbb{C}^{N_T^{\mathrm{RF}} \times 1}$ with $\mathbb{E}[\mathbf{s}_{q,k,l} \mathbf{s}_{q,k,l}^H] = \mathbf{I}_{N_T^{\mathrm{RF}}}$ denotes the BS's transmit symbol vector on the $l$-th sub-carrier of the $k$-th OFDM symbol in the $q$-th block, $P$ denotes the transmit power, $\mathbf{W}^{\mathrm{BB}}_{q,k,l} \in \mathbb{C}^{N_T^{\mathrm{RF}} \times N_T^{\mathrm{RF}}}$ denotes the digital precoder on the $l$-th sub-carrier of the $k$-th OFDM symbol in the $q$-th block, and $\mathbf{W}^{\mathrm{RF}}_{q,k} \in \mathbb{C}^{N_T \times N_T^{\mathrm{RF}}}$ denotes the frequency-flat phase shifter-based analog precoder of the $k$-th symbol in the $q$-th block. Each element of $\mathbf{W}^{\mathrm{RF}}_{q,k}$ satisfies the constant modulus constraint, i.e., $\left|\mathbf{W}^{\mathrm{RF}}_{q,k}[n_i, n_j]\right| = \frac{1}{\sqrt{N_T}}$ with $n_i = 1, \ldots, N_T$ and $n_j = 1, \ldots, N_T^{\mathrm{RF}}$. Here, we consider that $\mathbf{W}^{\mathrm{RF}}_{q,k}$ is designed by selecting $N_T^{\mathrm{RF}}$ beams from a codebook

$$\mathcal{W}_T = \left\{ \mathbf{w}_T^{(m)} \in \mathbb{C}^{N_T \times 1},\ m = 1, \ldots, M_T \right\}, \tag{2}$$

where each beam $\mathbf{w}_T^{(m)}$ is a pencil-like narrow beam towards some pre-designed azimuth angle $\bar{\theta}_m$ and elevation angle $\bar{\phi}_m$ [27].
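To make the dimensions in (1) and (2) concrete, the sketch below builds an analog precoder by picking columns from a codebook and forms one transmit vector. Everything numeric here is an assumption for illustration: the array sizes, the selected beam indices, the identity digital precoder, and in particular the codebook, which is a one-dimensional DFT codebook standing in for the UPA codebook of [27].

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, n_rf, m_t, power = 16, 2, 16, 1.0   # illustrative sizes, not from the paper

# Stand-in codebook: columns of a unitary DFT matrix, every entry of modulus 1/sqrt(N_T).
rows = np.arange(n_t)
codebook = np.exp(2j * np.pi * np.outer(rows, np.arange(m_t)) / n_t) / np.sqrt(n_t)

selected = [3, 7]                        # hypothetical beams swept in this symbol
w_rf = codebook[:, selected]             # analog precoder, shape (N_T, N_RF)
w_bb = np.eye(n_rf, dtype=complex)       # digital precoder (identity placeholder)

# Unit-variance complex Gaussian symbols, so that E[s s^H] = I.
s = (rng.standard_normal(n_rf) + 1j * rng.standard_normal(n_rf)) / np.sqrt(2)

x = np.sqrt(power) * w_rf @ w_bb @ s     # transmit signal of (1)
```

Because the precoder columns are taken directly from the codebook, the constant modulus constraint holds by construction.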
Moreover, let $\mathbf{a}_T(\theta, \phi) \in \mathbb{C}^{N_T \times 1}$ and $\mathbf{a}_R(\theta, \phi) \in \mathbb{C}^{N_R \times 1}$ denote the steering vectors of the transmit and receive UPAs of the BS towards the azimuth angle $\theta$ and elevation angle $\phi$, respectively, with the form $\mathbf{a}_j(\theta, \phi) = \mathbf{a}_j^x(\theta, \phi) \otimes \mathbf{a}_j^y(\theta, \phi)$ for $j \in \{T, R\}$, where $\otimes$ denotes the Kronecker product, and

$$\begin{aligned}
\mathbf{a}_j^x(\theta, \phi) &= \left[1, e^{\mathrm{j} 2\pi \frac{d_s}{\lambda} \sin\phi \cos\theta}, \cdots, e^{\mathrm{j} 2\pi \frac{d_s}{\lambda} (N_j^x - 1) \sin\phi \cos\theta}\right]^T, \\
\mathbf{a}_j^y(\theta, \phi) &= \left[1, e^{\mathrm{j} 2\pi \frac{d_s}{\lambda} \sin\phi \sin\theta}, \cdots, e^{\mathrm{j} 2\pi \frac{d_s}{\lambda} (N_j^y - 1) \sin\phi \sin\theta}\right]^T,
\end{aligned} \tag{3}$$

with $d_s$ being the antenna spacing, $\lambda$ being the carrier wavelength, and $\mathrm{j}$ in the exponents being the imaginary unit.

In practical mmWave systems, the line-of-sight (LoS) channel model is widely adopted [13], [28]. Therefore, the round-trip channel matrix between the BS and target $i$ on the $l$-th sub-carrier during the $k$-th symbol in the $q$-th block, denoted by $\mathbf{H}_{i,q,k,l} \in \mathbb{C}^{N_R \times N_T}$, can be modeled as

$$\mathbf{H}_{i,q,k,l} = \alpha_{i,q}\, e^{\mathrm{j} 2\pi \nu_{i,q} k T_0}\, e^{-\mathrm{j} 2\pi l \Delta f \tau_{i,q}}\, \mathbf{a}_R(\theta_{i,q}, \phi_{i,q})\, \mathbf{a}_T^H(\theta_{i,q}, \phi_{i,q}), \tag{4}$$

where $\alpha_{i,q} = \sqrt{\frac{\lambda^2}{(4\pi)^3 d_{i,q}^4}}\, \zeta_{i,q}$ is the complex coefficient with $\zeta_{i,q}$ denoting the radar cross-section (RCS) of target $i$ in the $q$-th block, $\tau_{i,q}$ and $\nu_{i,q}$ denote the time delay and Doppler frequency between the BS and target $i$, respectively, $T_0$ denotes the OFDM symbol duration, and $\Delta f$ denotes the sub-carrier spacing.
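The steering vectors in (3) and the rank-one LoS channel in (4) can be sketched as below. The function names, the half-wavelength spacing default ($d_s/\lambda = 0.5$), and all numeric values in the usage lines are assumptions made for illustration; the coefficient $\alpha_{i,q}$ is passed in directly rather than derived from the RCS.

```python
import numpy as np

def upa_steering(theta, phi, n_x, n_y, spacing_over_lambda=0.5):
    """UPA steering vector a(theta, phi) = a_x kron a_y, following (3)."""
    kx = 2 * np.pi * spacing_over_lambda * np.sin(phi) * np.cos(theta)
    ky = 2 * np.pi * spacing_over_lambda * np.sin(phi) * np.sin(theta)
    a_x = np.exp(1j * kx * np.arange(n_x))   # phase ramp along the x-axis
    a_y = np.exp(1j * ky * np.arange(n_y))   # phase ramp along the y-axis
    return np.kron(a_x, a_y)

def los_channel(alpha, nu, tau, k, l, t0, delta_f, a_r, a_t):
    """Round-trip LoS channel of (4) for OFDM symbol k and sub-carrier l."""
    doppler = np.exp(1j * 2 * np.pi * nu * k * t0)
    delay = np.exp(-1j * 2 * np.pi * l * delta_f * tau)
    return alpha * doppler * delay * np.outer(a_r, a_t.conj())   # a_R a_T^H

# Illustrative values only: 4x4 arrays at both ends, target at (theta, phi) = (0.3, 1.1) rad.
a = upa_steering(0.3, 1.1, 4, 4)
H = los_channel(alpha=1e-3, nu=100.0, tau=1e-7, k=2, l=5,
                t0=8.9e-6, delta_f=120e3, a_r=a, a_t=a)
```

Since every steering-vector entry has unit modulus, each entry of `H` has magnitude $|\alpha_{i,q}|$, and the matrix has rank one: all of the angular information sits in the outer product.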
Then, during the $k$-th symbol in the $q$-th block, the echo signal received by the BS on the $l$-th sub-carrier is given as

$$\boldsymbol{\upsilon}_{q,k,l} = \sqrt{P}\, (\mathbf{F}^{\mathrm{RF}}_{q,k} \mathbf{F}^{\mathrm{BB}}_{q,k,l})^H \sum_{i=1}^{I} \mathbf{H}_{i,q,k,l} \mathbf{W}^{\mathrm{RF}}_{q,k} \mathbf{W}^{\mathrm{BB}}_{q,k,l} \mathbf{s}_{q,k,l} + (\mathbf{F}^{\mathrm{RF}}_{q,k} \mathbf{F}^{\mathrm{BB}}_{q,k,l})^H \mathbf{n}_{q,k,l}, \quad \forall q, k, l, \tag{5}$$

where $\mathbf{n}_{q,k,l} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_{N_R})$ denotes the additive white Gaussian noise (AWGN) vector at the BS with $\sigma^2$ being the noise power, $\mathbf{F}^{\mathrm{BB}}_{q,k,l} \in \mathbb{C}^{N_R^{\mathrm{RF}} \times N_R^{\mathrm{RF}}}$ denotes the BS receive digital combiner on the $l$-th sub-carrier of the $k$-th symbol in the $q$-th block, and $\mathbf{F}^{\mathrm{RF}}_{q,k} \in \mathbb{C}^{N_R \times N_R^{\mathrm{RF}}}$ denotes the BS receive phase shifter-based analog combiner. Similar to the analog precoder, $\mathbf{F}^{\mathrm{RF}}_{q,k}$ is subject to $\left|\mathbf{F}^{\mathrm{RF}}_{q,k}[n_i, n_j]\right| = \frac{1}{\sqrt{N_R}}$ with $n_i = 1, \ldots, N_R$ and $n_j = 1, \ldots, N_R^{\mathrm{RF}}$, and is designed by selecting $N_R^{\mathrm{RF}}$ beams from the codebook

$$\mathcal{W}_R = \left\{ \mathbf{w}_R^{(m)} \in \mathbb{C}^{N_R \times 1},\ m = 1, \ldots, M_R \right\}. \tag{6}$$

By collecting the signals across all $L$ sub-carriers, the received signal at the BS of the $k$-th symbol in the $q$-th block is denoted as $\boldsymbol{\Upsilon}_{q,k} = [\boldsymbol{\upsilon}_{q,k,1}, \ldots, \boldsymbol{\upsilon}_{q,k,L}] \in \mathbb{C}^{N_R^{\mathrm{RF}} \times L}$. Furthermore, by stacking $\boldsymbol{\Upsilon}_{q,k}$ across the $K$ OFDM symbols, the overall received signal matrix during the tracking phase in the $q$-th block is given as $\tilde{\boldsymbol{\Upsilon}}_q = \left[\boldsymbol{\Upsilon}_{q,1}^T, \ldots, \boldsymbol{\Upsilon}_{q,K}^T\right]^T \in \mathbb{C}^{K N_R^{\mathrm{RF}} \times L}$.

A conventional way to perform tracking based on $\tilde{\boldsymbol{\Upsilon}}_q$ is as follows. In the tracking phase of each block $q$, the BS first estimates the range and angular information of each target $i$ based on $\tilde{\boldsymbol{\Upsilon}}_q$, denoted by $\hat{\mathbf{z}}_{i,q} = [\hat{d}_{i,q}, \hat{\theta}_{i,q}, \hat{\phi}_{i,q}]^T$, where $\hat{d}_{i,q}$, $\hat{\theta}_{i,q}$, and $\hat{\phi}_{i,q}$ are the estimated range, azimuth angle, and elevation angle of target $i$, respectively. The parameter estimation problem can be well solved by classical methods such as maximum likelihood estimation (MLE) [29] or data-driven approaches like deep neural networks [30].
Then, the Kalman filter [10] or data-driven approaches [11] can leverage the estimated distances and angles for tracking the targets. However, the above approaches rely on the assumption that the echo signals from the targets are always strong enough for precise distance and angle estimation. In our considered mmWave systems, due to the analog beamformers, the transmit antenna array can merely emit pencil-like narrow beams, while the receive antenna array can merely receive signals from a narrow direction. Thus, the main challenge of the target tracking problem is how to keep the narrow analog beams aligned with the targets, known as beam tracking [31], to obtain reliable location information.

III. PROBLEM DESCRIPTION AND TRANSFORMER-BASED SOLUTION

A. Problem Description

This paper aims to predict the analog beams aligned with each target in each block to track the locations of these targets over time. Specifically, based on the trajectory of each target, the BS can predict $\{\mathbf{W}^{\mathrm{RF}}_{q+1,k}, \mathbf{F}^{\mathrm{RF}}_{q+1,k}\}_{k=1}^{K}$ for the next block $q+1$ by exploiting the historical sensing information up to block $q$. The analog beamformer predictor, denoted by $\mathcal{F}^{(q)}(\cdot)$, can be expressed as

$$\{\mathbf{W}^{\mathrm{RF}}_{q+1,k}, \mathbf{F}^{\mathrm{RF}}_{q+1,k}\}_{k=1}^{K} = \mathcal{F}^{(q)}(\mathcal{H}_{1:q}), \quad \forall q, \tag{7}$$

where $\mathcal{H}_{1:q} = \left\{\{\hat{\mathbf{z}}_{i,\tau}\}_{i=1}^{I}, \{\mathbf{W}^{\mathrm{RF}}_{\tau,k}, \mathbf{F}^{\mathrm{RF}}_{\tau,k}\}_{k=1}^{K}\right\}_{\tau=1}^{q}$ denotes the historical sensing information that consists of the measurements of each target and the designed analog beamformers in each historical block $\tau \in \{1, \ldots, q\}$. Let $\mathcal{B}^{(\mathrm{tx})}_{q+1,k} \subset \mathcal{W}_T$ and $\mathcal{B}^{(\mathrm{rx})}_{q+1,k} \subset \mathcal{W}_R$ denote the sets of predicted beams from the codebooks $\mathcal{W}_T$ and $\mathcal{W}_R$ for designing $\mathbf{W}^{\mathrm{RF}}_{q+1,k}$ and $\mathbf{F}^{\mathrm{RF}}_{q+1,k}$, respectively, with $|\mathcal{B}^{(\mathrm{tx})}_{q+1,k}| = N_T^{\mathrm{RF}}$ and $|\mathcal{B}^{(\mathrm{rx})}_{q+1,k}| = N_R^{\mathrm{RF}}$.
The analog beamformer predictor in (7) can be transformed into a predictor for the beam subsets, denoted by $\tilde{\mathcal{F}}^{(q)}$, which can be expressed as

$$\{\mathcal{B}^{(\mathrm{tx})}_{q+1,k}, \mathcal{B}^{(\mathrm{rx})}_{q+1,k}\}_{k=1}^{K} = \tilde{\mathcal{F}}^{(q)}(\tilde{\mathcal{H}}_{1:q}), \quad \forall q, \tag{8}$$

where $\tilde{\mathcal{H}}_{1:q} = \left\{\{\hat{\mathbf{z}}_{i,\tau}\}_{i=1}^{I}, \{\mathcal{B}^{(\mathrm{tx})}_{\tau,k}, \mathcal{B}^{(\mathrm{rx})}_{\tau,k}\}_{k=1}^{K}\right\}_{\tau=1}^{q}$ represents the historical information of target measurements and selected beam subsets in each historical block $\tau \in \{1, \ldots, q\}$. Let $(\mathbf{w}^{*}_{T,i,q+1}, \mathbf{w}^{*}_{R,i,q+1})$ denote the optimal transmit and receive beam pair that is aligned with target $i$ in the $(q+1)$-th block, which is defined as

$$(\mathbf{w}^{*}_{T,i,q+1}, \mathbf{w}^{*}_{R,i,q+1}) = \arg\max_{\mathbf{w}_T \in \mathcal{W}_T,\ \mathbf{w}_R \in \mathcal{W}_R} \left| \mathbf{w}_R^H\, \mathbf{a}_R(\theta_{i,q+1}, \phi_{i,q+1})\, \mathbf{a}_T^H(\theta_{i,q+1}, \phi_{i,q+1})\, \mathbf{w}_T \right|^2. \tag{9}$$

In the $(q+1)$-th block, we consider that target $i$ is tracked if the beamformers of the BS are aligned with the target during at least one OFDM symbol in the tracking phase. For a given symbol, beam alignment is equivalent to the condition that the optimal transmit and receive beam pair for the target is contained in the predicted beam subsets obtained in (8). Let $\mathcal{A}_{i,q+1,k}$ denote the beam alignment event for target $i$ during the $k$-th symbol. Then, the tracked event of target $i$ in the $(q+1)$-th block, denoted by $\mathcal{E}^{(\mathrm{track})}_{i,q+1}$, can be defined as

$$\mathcal{E}^{(\mathrm{track})}_{i,q+1} \triangleq \bigcup_{k=1}^{K} \mathcal{A}_{i,q+1,k}, \tag{10}$$

where

$$\mathcal{A}_{i,q+1,k} \triangleq \left\{ \mathbf{w}^{*}_{T,i,q+1} \in \mathcal{B}^{(\mathrm{tx})}_{q+1,k} \right\} \cap \left\{ \mathbf{w}^{*}_{R,i,q+1} \in \mathcal{B}^{(\mathrm{rx})}_{q+1,k} \right\}. \tag{11}$$

However, if beam alignment for target $i$ fails across all $K$ symbols, we consider that this target is lost in the $(q+1)$-th block, because the echo reflected from the target is too weak to extract accurate sensing information, i.e., $\hat{\mathbf{z}}_{i,q+1}$ is missing.
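The beam-pair selection in (9) and the tracked event in (10) and (11) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, beams are represented by their codebook indices, and the toy codebook is a one-dimensional DFT matrix chosen only so that the optimal indices are obvious. The sketch relies on the fact that the objective in (9) factors as $|\mathbf{w}_R^H \mathbf{a}_R|^2\, |\mathbf{a}_T^H \mathbf{w}_T|^2$, so the transmit and receive sides can be maximized separately.

```python
import numpy as np

def optimal_beam_pair(codebook_t, codebook_r, a_t, a_r):
    """Indices (m_T, m_R) solving (9); each codebook stores one beam per row."""
    gain_t = np.abs(codebook_t.conj() @ a_t) ** 2   # |w_T^H a_T|^2 for every beam
    gain_r = np.abs(codebook_r.conj() @ a_r) ** 2   # |w_R^H a_R|^2 for every beam
    return int(np.argmax(gain_t)), int(np.argmax(gain_r))

def is_tracked(opt_t, opt_r, tx_subsets, rx_subsets):
    """Tracked event (10): alignment (11) holds in at least one of the K symbols."""
    return any(opt_t in tx and opt_r in rx
               for tx, rx in zip(tx_subsets, rx_subsets))

# Toy setup: 8-element DFT codebook; the target's steering vectors match beams 3 and 5.
n = np.arange(8)
codebook = np.exp(2j * np.pi * np.outer(n, n) / 8) / np.sqrt(8)
a_t = np.sqrt(8) * codebook[3]
a_r = np.sqrt(8) * codebook[5]
m_t, m_r = optimal_beam_pair(codebook, codebook, a_t, a_r)

# K = 2 symbols. Alignment needs both sides to match within the same symbol.
tracked = is_tracked(m_t, m_r, [{0, 1}, {3, 4}], [{5, 6}, {4, 5}])
lost_case = is_tracked(m_t, m_r, [{3, 4}, {0, 1}], [{0, 1}, {5, 6}])
```

In the second call the optimal transmit beam is swept in the first symbol and the optimal receive beam in the second, so no single symbol satisfies (11) and the target counts as lost.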
Accordingly, the lost event for target $i$ in the $(q+1)$-th block, denoted by $\mathcal{E}^{(\mathrm{lost})}_{i,q+1}$, can be defined as

$$\mathcal{E}^{(\mathrm{lost})}_{i,q+1} \triangleq \bigcap_{k=1}^{K} \mathcal{A}^{c}_{i,q+1,k}, \tag{12}$$

where $\mathcal{A}^{c}_{i,q+1,k}$ denotes the complementary event of $\mathcal{A}_{i,q+1,k}$, corresponding to beam misalignment during the $k$-th symbol. Therefore, our objective is to design $\tilde{\mathcal{F}}^{(q)}(\cdot)$ to maximize the probability that all $I$ targets can be successfully tracked in the next block $q+1$, such that we can achieve persistent tracking of all the targets. The optimization problem can be formulated as

\begin{align}
(\mathrm{P1}):\ \max_{\tilde{\mathcal{F}}^{(q)}(\cdot)} \quad & \mathbb{P}\left( \bigcap_{i=1}^{I} \mathcal{E}^{(\mathrm{track})}_{i,q+1} \,\middle|\, \tilde{\mathcal{H}}_{1:q} \right) \tag{13a} \\
\mathrm{s.t.} \quad & (8), \tag{13b} \\
& |\mathcal{B}^{(\mathrm{tx})}_{q+1,k}| = N_T^{\mathrm{RF}},\ |\mathcal{B}^{(\mathrm{rx})}_{q+1,k}| = N_R^{\mathrm{RF}}, \tag{13c} \\
& \mathcal{B}^{(\mathrm{tx})}_{q+1,k} \subset \mathcal{W}_T,\ \mathcal{B}^{(\mathrm{rx})}_{q+1,k} \subset \mathcal{W}_R. \tag{13d}
\end{align}

Solving problem (P1) via traditional optimization-based techniques is challenging for the following two reasons. First, the mapping from historical sensing information to future beam subsets in (8) is highly non-linear, which limits the effectiveness of traditional KF-based tracking methods. Second, the tracking lost issue in the mmWave system implies that the historical sensing information $\tilde{\mathcal{H}}_{1:q}$ could be incomplete due to missing measurements, which challenges the robustness of the predictor in (8), particularly in high-mobility scenarios with complex trajectories. Therefore, the design of $\tilde{\mathcal{F}}^{(q)}(\cdot)$ should be able to maintain accurate beam prediction even if the measurements from some targets are unavailable in the history, while adaptively adjusting the beam sweeping strategy in response to the tracking lost events.

B. Proposed Transformer-based Solution

To tackle these challenges, in this paper, we propose to leverage the deep learning technique to solve problem (P1), where the powerful Transformer architecture is adopted.
Specifically, the proposed Transformer-based tracking method operates in two modes depending on whether the tracking lost event occurs, which aims to realize accurate beam prediction and fast re-acquisition of lost targets simultaneously. The two modes are described as follows:
1) Normal Tracking Mode (N-Mode): In the $q$-th block, our system operates in the N-Mode if all targets are hit by the swept beams and there is no tracking lost event. In this mode, the objective is to predict the swept beam subsets for the $(q+1)$-th block based on historical trajectories up to the $q$-th block, such that all targets can be continuously tracked in the $(q+1)$-th block with a high probability.
2) Re-Acquisition Mode (R-Mode): In the $q$-th block, the system operates in the R-Mode if there exist targets misaligned with all swept beams, resulting in tracking lost events. The objective of this mode is to adjust the future beam sweeping strategy by jointly exploiting the positive information from historical trajectories and the negative feedback from beam misalignment, such that the lost targets can be re-acquired as soon as possible, while the non-lost targets can be continuously tracked in the $(q+1)$-th block with a high probability.

Remark 1: By separating the operation into the N-Mode and R-Mode, the proposed framework can adaptively learn the beam sweeping strategies for maintaining normal tracking and re-acquiring lost targets separately. Compared to a single-mode framework, this design can significantly reduce the difficulty of model training and improve the overall tracking robustness.

In the next section, we provide a detailed introduction of the proposed two-mode tracking framework.

IV. TRANSFORMER-BASED TRACKING FRAMEWORK FOR THE SINGLE-TARGET CASE

To provide a comprehensive introduction to the proposed Transformer-based tracking framework, we start from the single-target case in this section, namely the design of m^2TrackFormer. The generalization of the proposed framework to the multi-target case, i.e., the m^3TrackFormer, will be discussed in Section V. Moreover, we consider a symmetric configuration where the numbers of antennas and RF chains for the transmitter and receiver are equal, i.e., $N_T = N_R = N$ and $N_T^{\mathrm{RF}} = N_R^{\mathrm{RF}} = N^{\mathrm{RF}}$. Due to the monostatic sensing, the optimal transmit beam and receive beam for the target are equal. Therefore, we only focus on predicting the optimal beam in the codebook $\mathcal{W}$ without distinguishing the transmit and receive sides¹.

¹ The extension to the asymmetric scenario is straightforward by predicting the optimal beams in the two different sets $\mathcal{W}_T$ and $\mathcal{W}_R$ simultaneously.

In the proposed m^2TrackFormer, we design two networks to implement the tasks in the N-Mode and R-Mode, respectively, as shown in Fig. 3. Specifically, a Normal Tracking Network (N-Net), denoted by $\mathcal{F}^{(q)}_{\mathrm{N}}(\cdot; \Theta_{\mathrm{N}})$ with trainable parameters $\Theta_{\mathrm{N}}$, is proposed for the N-Mode to learn the beam sweeping strategy for persistent tracking, while a Re-Acquisition Network (R-Net), denoted by $\mathcal{F}^{(q)}_{\mathrm{R}}(\cdot; \Theta_{\mathrm{R}})$ with trainable parameters $\Theta_{\mathrm{R}}$, is proposed for the R-Mode to learn the beam sweeping strategy for target re-acquisition. In the following subsections, we detail the module design of each network, and introduce the policy for training the networks.

[Fig. 3. The proposed Transformer-based tracking framework in the single-target case, namely m^2TrackFormer. Panels: (a) Network Design for the N-Net, (b) Network Design for the R-Net, (c) Masked Attention Mechanism. In the predictor $\tilde{\mathcal{F}}^{(q)}(\cdot)$, the left upper branch illustrates the N-Net, where the historical sensing information is processed sequentially by the MAFE and TSP modules to predict future beam directions. The left lower branch shows the R-Net, in which the historical sensing information and the feedback from the tracking lost events are processed via the MAFE and LE modules, respectively, followed by the feature fusion in the FF module to adjust the future beam sweeping strategy. The right branch shows the multi-head Masked Attention Mechanism employed in the Self-Attention layers and Cross-Attention layers.]

A. Design of the N-Net

For the N-Net, we aim to accurately predict the beam subset for the next block $q+1$ and localize the target for the current block $q$ based on the historical sensing information of the target up to block $q$. The N-Net comprises two modules: the Missing data-Aware Feature Extraction (MAFE) module and the Task-Specific Projection (TSP) module. Specifically, the MAFE module is designed to extract the target motion feature from the historical measurements, and the TSP module is utilized to map the extracted high-dimensional feature to the prediction of the beam subset and the estimation of the target position. By leveraging the Self-Attention mechanism and a missing-aware attention mask, the N-Net achieves non-recursive feature extraction directly from incomplete trajectories, thereby eliminating the need for data imputation and mitigating error propagation. The two modules of the N-Net are detailed as follows.
Specifically, the MAFE is designed to extract the target motion feature from the historical measurements, and the TSP is utilized to map the extracted high-dimensional feature to the prediction of the beam subset and the estimation of the target position. By leveraging the Self-Attention mechanism and a missing-aware attention mask, the N-Net achieves non-recursive feature extraction directly from incomplete trajectories, thereby eliminating the need for data imputation and mitigating error propagation. The two modules of the N-Net are detailed as follows.

1) MAFE module: In the N-Mode, the sequence of historical measurements is directly used as the input to the MAFE module to extract temporal features of the target trajectory. To avoid overly long sequences, a sliding-window truncation strategy of fixed length $T_p$ is employed to retain only the measurements from the most recent $T_p$ time blocks. Consequently, for block $q$, the network input is constructed as $\hat{\mathbf{Z}}^{(q)} = [\hat{\mathbf{z}}_{q-T_p+1}, \ldots, \hat{\mathbf{z}}_q]^T \in \mathbb{R}^{T_p \times D_{\mathrm{in}}}$, where $D_{\mathrm{in}} = 3$ is the dimension of the measurements, consisting of the estimated range, azimuth angle, and elevation angle. If tracking lost occurs at some historical time blocks, zero-padding is applied at these positions to ensure dimension consistency. Then, we append a learnable Prediction Token at the end of the temporal dimension of $\hat{\mathbf{Z}}^{(q)}$ to enable the prediction for the next block $q+1$, which extends the temporal dimension to $T_e = T_p + 1$. The extended sequence is denoted as $\tilde{\mathbf{Z}}^{(q)} = [\hat{\mathbf{Z}}^{(q)}; \mathbf{0}] \in \mathbb{R}^{T_e \times D_{\mathrm{in}}}$.
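As an illustration of the input construction above, the following minimal Python sketch builds the zero-padded window $\tilde{\mathbf{Z}}^{(q)}$ together with flags marking which rows are zero-padded. The function name `build_window`, the use of `None` to mark a block with tracking lost, and the left-padding applied when fewer than $T_p$ blocks are available are assumptions made for this sketch only. The learnable Prediction Token is represented just by its zero placeholder row.

```python
from typing import List, Optional, Sequence, Tuple

D_IN = 3  # estimated range, azimuth angle, elevation angle

Measurement = Tuple[float, float, float]


def build_window(
    history: Sequence[Optional[Measurement]], t_p: int
) -> Tuple[List[List[float]], List[bool]]:
    """Build the (t_p + 1) x D_IN input and its zero-padding flags.

    history[k] is the measurement of block k, or None if tracking was lost.
    """
    if t_p < 1:
        raise ValueError("t_p must be at least 1")
    # Sliding-window truncation: keep only the most recent t_p blocks.
    recent = list(history[-t_p:])
    # Assumption: blocks before the start of the trajectory count as missing.
    recent = [None] * (t_p - len(recent)) + recent
    rows: List[List[float]] = []
    padded: List[bool] = []
    for z in recent:
        rows.append([0.0] * D_IN if z is None else list(z))
        padded.append(z is None)
    # Placeholder row for the appended Prediction Token (temporal index T_e).
    rows.append([0.0] * D_IN)
    padded.append(True)
    return rows, padded
```

For example, `build_window([(100.0, 0.1, 0.2), None, (102.0, 0.12, 0.2)], t_p=4)` returns five rows, of which the first, third, and last are flagged as zero-padded.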
Subsequently, $\tilde{\mathbf{Z}}^{(q)}$ is mapped into a unified feature space with dimension $D$ via a learnable embedding layer, which can be formulated as $\hat{\mathbf{Y}}^{(q)} = \phi_{\mathrm{proj}}(\tilde{\mathbf{Z}}^{(q)}) + \mathbf{P}^{(\mathrm{pos})} \in \mathbb{R}^{T_e \times D}$, where $\phi_{\mathrm{proj}}(\cdot): \mathbb{R}^{T_e \times D_{\mathrm{in}}} \to \mathbb{R}^{T_e \times D}$ is a linear projection layer, and $\mathbf{P}^{(\mathrm{pos})}$ represents the sinusoidal positional encoding to preserve the temporal order [32].

To mitigate the impact of missing measurements in feature extraction, a Missing-Aware Attention Mask is employed, denoted by $\mathbf{M}^{(q)} \in \{\eta, 0\}^{T_e \times T_e}$. Here, $\eta$ represents a large negative scalar (e.g., $\eta = -10^9$) that drives the attention weights of masked positions to zero after the Softmax operation in the Attention mechanism. Specifically, the entries of $\mathbf{M}^{(q)}$ are defined as $\mathbf{M}^{(q)}[:, j] = \eta \cdot \mathbf{1}_{T_e}$ if the $j$-th row of $\tilde{\mathbf{Z}}^{(q)}$ is zero-padded, and $\mathbf{M}^{(q)}[:, j] = \mathbf{0}$ otherwise, $\forall j \in \{1, \ldots, T_e\}$. The function of the missing mask is to suppress information propagation from the blocks corresponding to missing measurements and from the appended token, thereby ensuring that feature extraction is performed exclusively on valid measurements and preventing error propagation.

Subsequently, the embedded sequence $\hat{\mathbf{Y}}^{(q)}$ and the missing mask $\mathbf{M}^{(q)}$ serve as inputs to $L_e$ stacked Encoding Units for feature extraction. In the $l$-th Encoding Unit ($l = 1, \ldots, L_e$), the input $\hat{\mathbf{H}}_{l-1}^{(q)}$ (with $\hat{\mathbf{H}}_0^{(q)} = \hat{\mathbf{Y}}^{(q)}$) is first processed by a Self-Attention (SA) layer, which captures global temporal correlations in the historical trajectory by computing attention scores across multiple attention heads. Specifically, the Multi-Head Attention mechanism, denoted by $\mathrm{Attn}(\cdot)$, can be expressed as [32]

$$\mathrm{Attn}(\mathbf{X}_Q, \mathbf{X}_K, \mathbf{X}_V, \mathbf{M}) = \mathrm{Concat}(\mathbf{O}_1, \ldots, \mathbf{O}_h)\,\mathbf{W}^o, \tag{14}$$

$$\mathbf{O}_i = \mathrm{Softmax}\!\left(\frac{(\mathbf{X}_Q \mathbf{W}_i^Q)(\mathbf{X}_K \mathbf{W}_i^K)^T}{\sqrt{D_i}} + \mathbf{M}\right)\left(\mathbf{X}_V \mathbf{W}_i^V\right), \quad i = 1, \ldots, h, \tag{15}$$

where $\mathbf{X}_Q$, $\mathbf{X}_K$, and $\mathbf{X}_V$ denote the query, key, and value matrices, respectively, $\mathrm{Concat}(\cdot)$ denotes the matrix concatenation operation along the feature dimension, $\mathbf{O}_i$ is the output of the Attention operation for the $i$-th Attention head, $\mathbf{W}^o \in \mathbb{R}^{D \times D}$ and $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V \in \mathbb{R}^{D \times D_i}$ are learnable projection matrices, and $D_i = D/h$ is the dimension of the $i$-th Attention head. The first argument of $\mathrm{Attn}(\cdot)$ is the query, in line with its use in (16) and (18).

The output of the SA layer is processed through a Feed-Forward Network (FFN). To facilitate gradient flow and stabilize training, residual connections and Layer Normalization (LN) are applied after both the SA layers and the FFNs. Specifically, the feature extraction in the $l$-th Encoding Unit can be expressed as

$$\hat{\mathbf{E}}_l^{(q)} = \mathrm{LN}\!\left(\mathrm{Attn}\big(\hat{\mathbf{H}}_{l-1}^{(q)}, \hat{\mathbf{H}}_{l-1}^{(q)}, \hat{\mathbf{H}}_{l-1}^{(q)}, \mathbf{M}^{(q)}\big) + \hat{\mathbf{H}}_{l-1}^{(q)}\right), \tag{16}$$

$$\hat{\mathbf{H}}_l^{(q)} = \mathrm{LN}\!\left(\mathrm{FFN}\big(\hat{\mathbf{E}}_l^{(q)}\big) + \hat{\mathbf{E}}_l^{(q)}\right). \tag{17}$$

2) TSP module: Upon obtaining the final latent representation $\hat{\mathbf{H}}_{L_e}^{(q)} \in \mathbb{R}^{T_e \times D}$ from the Encoding Units, the TSP module is employed to decouple the feature processing for the primary beam prediction task and an auxiliary localization refinement task using the Multi-Task Learning (MTL) strategy [33]. For the beam prediction task, the prediction token at temporal index $T_e$ is passed through a classifier $\phi_{\mathrm{cls}}(\cdot)$ to yield the predicted probability vector $\boldsymbol{\pi}^{(q)} = [\pi_q^{(1)}, \ldots, \pi_q^{(M)}]^T \in [0, 1]^M$, which can be expressed as $\boldsymbol{\pi}^{(q)} = \phi_{\mathrm{cls}}(\hat{\mathbf{H}}_{L_e}^{(q)}[T_e, :])$, where $\pi_q^{(m)}$ represents the predicted likelihood that the $m$-th beam in the codebook will align with the target in the next block $q+1$. Finally, a Top-$B_s$ selection strategy is employed to construct the sensing beam subset $\{\mathcal{B}_{q+1,k}\}_{k=1}^{K}$ by selecting the $B_s$ indices with the highest probabilities in $\boldsymbol{\pi}^{(q)}$.
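To make the effect of the missing-aware mask in (15) concrete, the sketch below evaluates a single attention head in plain Python. It is a simplified illustration rather than the authors' implementation: the projection matrices $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, $\mathbf{W}_i^V$ are taken as identities, and the helper names `missing_mask` and `masked_attention` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

ETA = -1e9  # large negative scalar, as in the example value of eta

Matrix = List[List[float]]


def missing_mask(padded: Sequence[bool]) -> Matrix:
    """Column j is ETA for every query row if key position j is zero-padded."""
    t = len(padded)
    return [[ETA if padded[j] else 0.0 for j in range(t)] for _ in range(t)]


def softmax(row: Sequence[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(
    q: Matrix, k: Matrix, v: Matrix, mask: Matrix
) -> Tuple[Matrix, Matrix]:
    """Single-head attention Softmax(Q K^T / sqrt(d) + M) V, identity projections."""
    d = len(q[0])
    outputs: Matrix = []
    weights: Matrix = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + mask[i][j]
            for j, k_j in enumerate(k)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights
```

With three positions of which only the first is valid, every query row, including the one of the appended token, places all of its attention weight on the first position, so no information is drawn from the zero-padded rows.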
Moreover, the location refinement task is performed to refine the target's estimated location for the current block $q$ (indexed by $T_p$) via an MLP layer, which can be expressed as $\hat{\mathbf{u}}_q = \phi_{\mathrm{reg}}(\hat{\mathbf{H}}_{L_e}^{(q)}[T_p, :])$, where $\phi_{\mathrm{reg}}(\cdot)$ denotes the refinement function. This MTL-based auxiliary task offers two advantages: first, it leverages the historical trajectory to filter measurement noise, thereby enhancing localization precision; second, it helps learn a precise target mobility representation, which can also improve the beam prediction task.

B. Design of the R-Net

In the tracking lost scenario, the beam prediction problem becomes more difficult because the absence of range and angular measurements of the target in block $q$ inevitably increases the uncertainty of the future target trajectory. Based on empirical results, we find that it is hard to train a well-performing model by only exploiting the historical measurement sequence $\hat{\mathbf{Z}}^{(q)}$ with the missing entries masked out. Motivated by the fact that the misaligned beam directions in the lost events also carry implicit mobility information about where the target is unlikely to be located, we propose to extract the target mobility information in the R-Net from two features: the Motion Feature obtained from the historical measurement sequence, and the Lost Feature obtained from the tracking lost events. Specifically, the R-Net contains three modules: 1) the MAFE module for motion feature extraction; 2) the Lost Encoding (LE) module for tracking lost feature extraction; 3) the Feature Fusion (FF) module for combining the two types of features to adjust the beam sweeping prediction via a residual fusion strategy. In the following, we detail the three modules of the R-Net.
Remark 2: It is worth noting that adjusting the beam sweeping strategy in the proposed framework does not guarantee immediate target re-acquisition in the next block. As a result, a target may remain lost over multiple consecutive time blocks. Nevertheless, the proposed scheme is capable of gradually adjusting the re-acquisition strategy by exploiting the lost events observed in each lost block. To ensure robustness and avoid indefinite re-acquisition attempts, a tracking failure will be declared if the number of consecutive tracking lost events exceeds a predefined maximum threshold $T_{\max}$.

1) MAFE module: The MAFE module processes the historical measurement sequence $\hat{\mathbf{Z}}^{(q)}$ with the same architecture and parameters as in the N-Net, and outputs the latent motion feature $\hat{\mathbf{H}}_{L_e}^{(q)} \in \mathbb{R}^{T_e \times D}$ of the target.

2) LE module: The objective of the LE module is to exploit the implicit mobility information from the misaligned beam directions to enable target re-acquisition. Since the target may be lost for multiple consecutive time blocks, let $q_l$ denote the index of the last block where the target was successfully tracked. In the LE module, we consider all the tracking lost events from time block $q_l + 1$ to $q$ to adjust the beam sweeping strategy for the $(q+1)$-th block. Specifically, let $\mathcal{B}_\tau = \cup_{k=1}^{K} \mathcal{B}_{\tau,k}$ denote the set of all the swept beams over the $K$ symbols in the $\tau$-th block, where $\tau = q_l + 1, \ldots, q$. Therefore, the sequence of lost events, which serves as the input to the LE module, is denoted as $\mathcal{S}_{\mathcal{B}}^{(q)} = \{\mathcal{B}_{q_l+1}, \ldots, \mathcal{B}_q\}$. Note that the size of $\mathcal{S}_{\mathcal{B}}^{(q)}$ is dynamic, and is determined by the number of consecutive blocks with tracking lost, i.e., $T_{\mathrm{lost}}^{(q)} = q - q_l$, which is the number of blocks from $q_l + 1$ to $q$. Since the raw beam set $\mathcal{B}_\tau$ cannot be directly processed by the neural network, we encode each set into a multi-hot lost vector, denoted by $\mathbf{c}_\tau \in \{0, 1\}^M$.
The $m$-th entry of $\mathbf{c}_\tau$ is defined as $c_{\tau,m} = 1$ if $\mathbf{w}^{(m)} \in \mathcal{B}_\tau$, and $c_{\tau,m} = 0$ otherwise. By stacking these vectors along the temporal dimension, we obtain the feedback matrix $\mathbf{C}^{(q)} = [\mathbf{c}_{q_l+1}, \ldots, \mathbf{c}_q]^T \in \{0, 1\}^{T_{\mathrm{lost}}^{(q)} \times M}$. Similar to the MAFE module, a learnable Prediction Token is appended at the end of $\mathbf{C}^{(q)}$, yielding $\tilde{\mathbf{C}}^{(q)} = [\mathbf{C}^{(q)}; \mathbf{0}] \in \{0, 1\}^{T_l^{(q)} \times M}$ with an extended temporal dimension of $T_l^{(q)} = T_{\mathrm{lost}}^{(q)} + 1$. Subsequently, $\tilde{\mathbf{C}}^{(q)}$ is projected to the latent feature space by an embedding layer $\phi_{\mathrm{emb}}(\cdot): \{0, 1\}^{T_l^{(q)} \times M} \to \mathbb{R}^{T_l^{(q)} \times D}$, followed by the positional encoding to preserve the temporal information. The resulting high-dimensional latent feature can be expressed as $\hat{\mathbf{C}}^{(q)} = \phi_{\mathrm{emb}}(\tilde{\mathbf{C}}^{(q)}) + \mathbf{P}^{(\mathrm{pos})} \in \mathbb{R}^{T_l^{(q)} \times D}$. To capture the temporal correlations within the lost events across time blocks, $\hat{\mathbf{C}}^{(q)}$ is further processed through an Encoding Unit following the operations in (16)-(17). In this context, the query, key, and value matrices are all set to $\hat{\mathbf{C}}^{(q)}$. Furthermore, the mask matrix $\bar{\mathbf{M}}^{(q)}$ is an all-zero matrix because all the entries are valid. Consequently, the output of the Encoding Unit, denoted by $\hat{\mathbf{P}}^{(q)} \in \mathbb{R}^{T_l^{(q)} \times D}$, encapsulates the learned lost feature from the tracking lost events.

3) FF module: The outputs from the MAFE and LE modules, i.e., $\hat{\mathbf{H}}_{L_e}^{(q)}$ and $\hat{\mathbf{P}}^{(q)}$, are then fused in the FF module to adjust the beam prediction result based on the tracking lost events. Motivated by the fusion capability of the Cross-Attention (CA) mechanism, we employ $L_d$ Decoding Units, each consisting of a CA layer and an FFN with LN and residual connections, to extract the cross-sequence information from the two features. Formally, the process of the $l$-th Decoding Unit ($l = 1, \ldots, L_d$) can be expressed as

$$\hat{\mathbf{S}}_l^{(q)} = \mathrm{LN}\!\left(\mathrm{Attn}\big(\hat{\mathbf{D}}_{l-1}^{(q)}, \hat{\mathbf{H}}_{L_e}^{(q)}, \hat{\mathbf{H}}_{L_e}^{(q)}, \hat{\mathbf{M}}^{(q)}\big) + \hat{\mathbf{D}}_{l-1}^{(q)}\right), \tag{18}$$

$$\hat{\mathbf{D}}_l^{(q)} = \mathrm{LN}\!\left(\mathrm{FFN}\big(\hat{\mathbf{S}}_l^{(q)}\big) + \hat{\mathbf{S}}_l^{(q)}\right), \tag{19}$$

where $\hat{\mathbf{D}}_0^{(q)} = \hat{\mathbf{P}}^{(q)}$, and $\hat{\mathbf{M}}^{(q)} \in \{\eta, 0\}^{T_l^{(q)} \times T_e}$ denotes the Cross-Attention mask. Similar to the Missing Mask in the MAFE, the entries of $\hat{\mathbf{M}}^{(q)}$ are defined as $\hat{\mathbf{M}}^{(q)}[:, j] = \eta \cdot \mathbf{1}_{T_l^{(q)}}$ if the $j$-th row of $\tilde{\mathbf{Z}}^{(q)}$ is zero-padded, and $\hat{\mathbf{M}}^{(q)}[:, j] = \mathbf{0}$ otherwise, $\forall j \in \{1, \ldots, T_e\}$. After $L_d$ Decoding Units, we extract the latent feature from the appended token in $\hat{\mathbf{D}}_{L_d}^{(q)}$, denoted as $\hat{\mathbf{q}}_{\mathrm{feed}}^{(q)} = \hat{\mathbf{D}}_{L_d}^{(q)}[T_l^{(q)}, :] \in \mathbb{R}^D$, as the latent representation for the next block $q+1$ learned from the tracking lost events.

Then, we propose a residual fusion strategy that incorporates the lost feature as a corrective refinement rather than a replacement of the original motion representation for beam prediction. Specifically, let $\hat{\mathbf{q}}_{\mathrm{mo}}^{(q)} = \hat{\mathbf{H}}_{L_e}^{(q)}[T_e, :] \in \mathbb{R}^D$ denote the original motion feature for block $q+1$ extracted from the MAFE module. We formulate the refined motion feature $\hat{\mathbf{q}}_{\mathrm{re}}^{(q)} \in \mathbb{R}^D$ as

$$\hat{\mathbf{q}}_{\mathrm{re}}^{(q)} = \hat{\mathbf{q}}_{\mathrm{mo}}^{(q)} + \hat{\mathbf{q}}_{\mathrm{feed}}^{(q)}. \tag{20}$$

Note that this residual fusion strategy requires the LE and FF modules to learn only the deviation induced by the lost events on the motion feature. Subsequently, $\hat{\mathbf{q}}_{\mathrm{re}}^{(q)}$ is fed into the classifier $\phi_{\mathrm{cls}}(\cdot)$ to yield the final beam probability vector $\boldsymbol{\pi}^{(q)}$, followed by the Top-$B_s$ selection strategy on $\boldsymbol{\pi}^{(q)}$ to obtain $\{\mathcal{B}_{q+1,k}\}_{k=1}^{K}$.

C. Training Policy

The training policy also has an essential effect on the performance of the proposed m$^2$TrackFormer.
However, the key challenge is that the distribution of tracking lost events, which serve as the input for the R-Net, depends on the real-time tracking failures of the N-Mode and cannot be naturally captured by a static offline dataset. To address this, we propose a two-phase supervised training strategy, inspired by the Dataset Aggregation strategy [34], to actively collect the target lost samples for robust re-acquisition training.

Dataset Construction and Pre-processing: First, we construct a raw trajectory dataset $\mathcal{D}_{\mathrm{raw}} = \{\hat{\mathbf{Z}}^{(n)}, \mathbf{U}^{(n)}, \mathbf{m}^{*(n)}\}_{n=1}^{N_d}$, containing $N_d$ trajectories. Each trajectory $n$ consists of $Q$ samples, comprising noisy measurements $\hat{\mathbf{Z}}^{(n)} = [\hat{\mathbf{z}}_1^{(n)}, \ldots, \hat{\mathbf{z}}_Q^{(n)}]$, true target coordinates $\mathbf{U}^{(n)} = [\mathbf{u}_1^{(n)}, \ldots, \mathbf{u}_Q^{(n)}]$, and optimal beam indices $\mathbf{m}^{*(n)} = [m_1^{*(n)}, \ldots, m_Q^{*(n)}]$. This dataset is then partitioned into two subsets: a basic subset $\mathcal{D}_{\mathrm{base}}$ used for pre-training the N-Net, and a held-out subset $\mathcal{D}_{\mathrm{sim}}$ used for generating tracking lost samples and training the R-Net.

Two-Phase Training Strategy: Now we introduce the two-phase training strategy to facilitate the training of the N-Net and R-Net. In Phase 1, the MAFE and TSP modules are pre-trained on $\mathcal{D}_{\mathrm{base}}$ to learn the mapping from historical observations to the future optimal beams and the current location. To enhance the model's robustness in capturing long-term temporal dependencies, we adopt a multi-token prediction training strategy [35], where the modules are trained to simultaneously predict the optimal beams for multiple future blocks. The total loss is a weighted combination of the Cross-Entropy (CE) loss and the Mean Squared Error (MSE) loss, which is given by

$$\mathcal{L}_1 = \mathcal{L}_{\mathrm{CE}} + \lambda \mathcal{L}_{\mathrm{MSE}}, \tag{21}$$

where $\lambda$ is a weighting coefficient.
The individual loss components are defined as

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N_s} \sum_{n=1}^{N_s} \sum_{\tau=q+1}^{q+\delta_q} \sum_{m=1}^{M} y_{n,\tau}^{(m)} \log \pi_{n,\tau}^{(m)}, \tag{22}$$

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N_s} \sum_{n=1}^{N_s} \left\| \hat{\mathbf{u}}_{n,q} - \mathbf{u}_{n,q} \right\|_2^2, \tag{23}$$

where $N_s$ denotes the batch size, $M$ is the codebook size, $\delta_q$ denotes the prediction horizon, and $y_{n,\tau}^{(m)}$ is the binary ground-truth indicator, which equals 1 if the $m$-th beam is the optimal beam for the $n$-th sample in block $\tau$, and 0 otherwise.

In Phase 2, we aim to train the LE and FF modules in the R-Net. First, we generate the tracking lost samples by executing the well-trained MAFE and TSP modules on $\mathcal{D}_{\mathrm{sim}}$. During this process, we monitor the tracking status in each time block. If a tracking lost event occurs, we record the swept beam indices in the lost event as a sample, denoted by $\tilde{\mathcal{B}}_{\mathrm{lost}}$, together with the corresponding trajectory information in $\mathcal{D}_{\mathrm{sim}}$. The collected data form an augmented re-acquisition dataset $\tilde{\mathcal{D}}_{\mathrm{reacq}} = \{\mathcal{D}_{\mathrm{sim}}, \tilde{\mathcal{B}}_{\mathrm{lost}}\}$. Subsequently, the parameters of the LE and FF modules are trained on $\tilde{\mathcal{D}}_{\mathrm{reacq}}$, while the parameters of the MAFE and TSP modules are frozen. The loss function in this phase is the CE loss, given by

$$\mathcal{L}_2 = -\frac{1}{|\tilde{\mathcal{D}}_{\mathrm{reacq}}|} \sum_{n=1}^{|\tilde{\mathcal{D}}_{\mathrm{reacq}}|} \sum_{\tau=q+1}^{q+\delta_q} \sum_{m=1}^{M} y_{n,\tau}^{(m)} \log \pi_{n,\tau}^{(m)}. \tag{24}$$

V. TRANSFORMER-BASED TRACKING FRAMEWORK FOR THE MULTI-TARGET CASE

In this section, we extend the tracking framework to the general multi-target scenario, where there are $I > 1$ moving targets to be tracked over the tracking duration of $Q$ blocks. Compared with the single-target scenario, the beam prediction task in the multi-target case becomes more challenging for the following two reasons.
First, the model is required to predict beam subsets that can cover the possible directions of all targets, so that the emitted beams are aligned with all targets rather than concentrated on a single target. Second, the numbers of lost and unlost targets vary dynamically over time due to the occurrence of tracking lost and re-acquisition events. Therefore, the designed framework should accommodate dynamic variations in the number of targets.

[Figure 4: (a) Network Design of the N-Net; (b) Network Design of the R-Net.]
Fig. 4. The diagram for the proposed m$^3$TrackFormer.

To address these challenges, we propose the m$^3$TrackFormer, which follows a similar idea to the m$^2$TrackFormer. For the first challenge, we introduce a feature aggregation layer, which combines the individual target features to formulate a joint beam sweeping strategy under a shared beam budget, adaptively balancing the number of swept beams used to re-acquire lost targets against those used to maintain accurate tracking of unlost targets.
For the second challenge, we propose a scalable architecture that leverages the parameter-sharing strategy and the inherent parallel processing capability of the Transformer to process an arbitrary number of targets simultaneously without the need for model retraining, thereby achieving low inference latency in real-time tracking scenarios. The network design of the m$^3$TrackFormer is illustrated in Fig. 4. In the following, we detail the key differences between the m$^3$TrackFormer and the m$^2$TrackFormer.

A. Design of the N-Net

The N-Net of m$^3$TrackFormer also applies the MAFE and TSP modules for the beam prediction and target localization tasks. To accommodate multiple targets, the historical sequences of each target $i \in \mathcal{I}$, denoted by $\hat{\mathbf{Z}}_i^{(q)}$, are stacked along a newly introduced target dimension to form a unified input tensor $\hat{\mathbf{Z}}^{(q)} \in \mathbb{R}^{I \times T_e \times D_{\mathrm{in}}}$. Moreover, we extend the missing-aware attention mask in m$^2$TrackFormer to a three-dimensional mask tensor, denoted by $\mathbf{M}^{(q)} \in \{\eta, 0\}^{I \times T_e \times T_e}$, to mitigate the impact of missing measurements in the feature extraction for each target. Specifically, we set $\mathbf{M}^{(q)}[i, :, j] = \eta \cdot \mathbf{1}_{T_e}$ if the $j$-th row of $\hat{\mathbf{Z}}_i^{(q)}$ is zero-padded, and $\mathbf{M}^{(q)}[i, :, j] = \mathbf{0}$ otherwise, $\forall i \in \{1, \ldots, I\}$, $j \in \{1, \ldots, T_e\}$.

To address the multi-target tracking problem in (P1), the m$^3$TrackFormer is designed to satisfy the properties of permutation equivariance and invariance. Specifically, any permutation of the target indices should not affect the overall beam sweeping strategy for the next time block (invariance), while the extracted target-wise motion features and localization results should be permuted accordingly (equivariance).
To achieve the permutation equivariance, we adopt the parameter-sharing strategy [36] on the MAFE and TSP modules, where the same network layers are shared across the target dimension. Under this design, the motion feature of each target is extracted using the same function weights, which ensures that the extracted features transform equivariantly with respect to permutations of the target indices. As a result, the proposed architecture is scalable to the number of targets and enables direct deployment by reusing the weights of m$^2$TrackFormer trained on the single-target dataset, thereby avoiding costly model retraining when the number of targets varies. Therefore, taking $\hat{\mathbf{Z}}^{(q)}$ and $\mathbf{M}^{(q)}$ as inputs, the MAFE module yields the learned motion feature tensor, denoted by $\tilde{\mathbf{H}}_{L_e}^{(q)} \in \mathbb{R}^{I \times T_e \times D}$. Accordingly, the MLP layer in the TSP module maps the features to the refined target positions, represented by the matrix $\hat{\mathbf{U}}^{(q)} \in \mathbb{R}^{I \times D_{\mathrm{out}}}$, where $D_{\mathrm{out}} = 3$ denotes the output dimension, and the $i$-th row represents the refined coordinate location of target $i$, i.e., $\hat{\mathbf{u}}_{i,q}$.

To further achieve the permutation invariance property for the beam prediction task, a feature aggregation layer is introduced in the TSP module to combine the features of all the targets. Specifically, let $\tilde{\boldsymbol{\pi}}^{(q)} \in \mathbb{R}^M$ denote a beam score vector that yields the joint beam prediction result for tracking all the targets, which is given by

$$\tilde{\boldsymbol{\pi}}^{(q)} = \phi_{\mathrm{agg}}\!\left(\phi_{\mathrm{cls}}\big(\tilde{\mathbf{H}}_{L_e}^{(q)}[:, T_e, :]\big)\right), \tag{25}$$

where $\phi_{\mathrm{agg}}(\cdot)$ denotes the feature aggregation function implemented via the non-parametric sum pooling function over the target dimension. A larger value of $\tilde{\boldsymbol{\pi}}^{(q)}[m]$ indicates that the $m$-th beam in the codebook $\mathcal{W}$ has a higher confidence of being aligned with a target in the next time block.
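The permutation invariance obtained from sum pooling in (25) can be checked with a few lines of Python. The sketch below is only illustrative: the per-target probability vectors stand in for the classifier outputs $\phi_{\mathrm{cls}}(\cdot)$, beam indices are 0-based, and the helper names `aggregate_scores` and `top_bs`, as well as the tie-breaking rule (lower index first), are assumptions of this example.

```python
from typing import List, Sequence


def aggregate_scores(per_target_probs: Sequence[Sequence[float]]) -> List[float]:
    """Sum pooling over the target dimension: one joint score per beam."""
    num_beams = len(per_target_probs[0])
    return [sum(p[b] for p in per_target_probs) for b in range(num_beams)]


def top_bs(scores: Sequence[float], b_s: int) -> List[int]:
    """Indices of the b_s largest scores (ties broken by lower beam index)."""
    order = sorted(range(len(scores)), key=lambda b: (-scores[b], b))
    return sorted(order[:b_s])


# Three targets, a codebook of four beams (values chosen to be exact in binary).
probs = [
    [0.5, 0.25, 0.125, 0.125],
    [0.125, 0.125, 0.25, 0.5],
    [0.25, 0.5, 0.125, 0.125],
]
joint = aggregate_scores(probs)
selected = top_bs(joint, b_s=2)

# Re-ordering the targets leaves the joint scores, and hence the beams, unchanged.
permuted = [probs[2], probs[0], probs[1]]
assert aggregate_scores(permuted) == joint
assert top_bs(aggregate_scores(permuted), b_s=2) == selected
```

Because the pooling has no trainable parameters, the same aggregation applies unchanged to any number of targets.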
Finally, the BS can select the $B_s$ beams with the highest scores in $\tilde{\boldsymbol{\pi}}^{(q)}$ to form the candidate beam subset $\{\mathcal{B}_{q+1,k}\}_{k=1}^{K}$ for the next block $q+1$.

B. Design of the R-Net

The R-Net of m$^3$TrackFormer preserves the MAFE, LE, and FF modules, while further incorporating the regression head from the TSP module to refine the localization of unlost targets. The key challenge in the R-Net lies in the time-varying number of lost targets per block and their heterogeneous durations of tracking lost, which complicates the design of the LE and FF modules for multi-target re-acquisition. Since the MAFE module maintains the same architecture and parameters as in the N-Net for the motion feature extraction of all the targets by (16)-(17) regardless of their tracking status, in the following we focus on the specialized design of the LE and FF modules.

The LE module employs a dynamic padding strategy along both the temporal and target dimensions to facilitate parallel processing of all the lost targets in a single forward pass. Let $\mathcal{I}_{\mathrm{unlost}}^{(q)}$ and $\mathcal{I}_{\mathrm{lost}}^{(q)}$ denote the sets of unlost and lost targets in the $q$-th block, respectively, and recall that $T_{\max}$ is the threshold on the maximum consecutive tracking-lost duration. In the R-Net, we construct a unified tracking-lost tensor $\tilde{\mathbf{C}}^{(q)} \in \mathbb{R}^{I \times (T_{\max}+1) \times M}$ as the input to the LE module, in which the entries corresponding to the unlost targets and to the non-lost blocks of the lost targets are padded with zeros. Specifically, let $\mathbf{c}_{q_j}$ denote the multi-hot lost vector that encodes the swept beam indices in block $q_j$, where $q_j = q - T_{\max} + j$ for $j \in \{1, \ldots, T_{\max}+1\}$. Then, the entries of $\tilde{\mathbf{C}}^{(q)}$ are defined as $\tilde{\mathbf{C}}^{(q)}[i, j, :] = \mathbf{c}_{q_j}$ if target $i \in \mathcal{I}_{\mathrm{lost}}^{(q)}$ is lost in the $q_j$-th block, and $\tilde{\mathbf{C}}^{(q)}[i, j, :] = \mathbf{0}$ otherwise.
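The construction of the tracking-lost tensor can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: beam indices are 0-based, `swept` maps a block index to the set of beams swept in that block, `lost_blocks` lists for each target the blocks in which it was lost, and the slot $j = T_{\max}+1$ (which maps to block $q+1$) is kept as the all-zero row reserved for the appended token. All of these names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def multi_hot(beams: Set[int], num_beams: int) -> List[int]:
    """Encode a set of swept beam indices as a 0/1 vector of length num_beams."""
    vec = [0] * num_beams
    for b in beams:
        vec[b] = 1
    return vec


def build_lost_tensor(
    swept: Dict[int, Set[int]],
    lost_blocks: Sequence[Set[int]],
    q: int,
    t_max: int,
    num_beams: int,
) -> List[List[List[int]]]:
    """Return a tensor of shape I x (t_max + 1) x num_beams as nested lists."""
    tensor = []
    for blocks in lost_blocks:  # one entry per target
        rows = []
        for j in range(1, t_max + 2):
            block = q - t_max + j  # q_j in the text
            if j <= t_max and block in blocks:
                rows.append(multi_hot(swept[block], num_beams))
            else:
                # Unlost target, non-lost block, or the appended-token slot.
                rows.append([0] * num_beams)
        tensor.append(rows)
    return tensor


# Block q = 10, T_max = 3, six beams; target 0 is unlost, target 1 has been
# lost for two blocks, and target 2 for three blocks.
swept = {8: {0, 1}, 9: {1, 2}, 10: {4}}
lost = [set(), {9, 10}, {8, 9, 10}]
c_tilde = build_lost_tensor(swept, lost, q=10, t_max=3, num_beams=6)
```

All targets share the same temporal length $T_{\max}+1$, which is what allows them to be processed in a single forward pass.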
Moreover, we introduce a lost-focused mask, denoted by $\bar{\mathbf{M}}^{(q)} \in \{\eta, 0\}^{I \times (T_{\max}+1) \times (T_{\max}+1)}$, which forces the attention mechanism in the LE module to concentrate on the lost events of each lost target while preventing information leakage to the unlost targets and non-lost blocks. Specifically, we set $\bar{\mathbf{M}}^{(q)}[i, :, j] = \mathbf{0}$ if the target $i \in \mathcal{I}_{\mathrm{lost}}^{(q)}$ is lost in the $q_j$-th block, and $\bar{\mathbf{M}}^{(q)}[i, :, j] = \eta \cdot \mathbf{1}$ otherwise, $\forall i \in \{1, \ldots, I\}$, $j \in \{1, \ldots, T_{\max}+1\}$. Based on the constructed tensor inputs $\tilde{\mathbf{C}}^{(q)}$ and $\bar{\mathbf{M}}^{(q)}$, the subsequent embedding layer and the Encoding Unit extract the lost feature $\tilde{\mathbf{P}}^{(q)} \in \mathbb{R}^{I \times (T_{\max}+1) \times D}$ from the historical tracking lost events of each lost target.

Subsequently, the FF module processes the motion feature $\tilde{\mathbf{H}}_{L_e}^{(q)}$ and the lost feature $\tilde{\mathbf{P}}^{(q)}$ of all the targets by (18)-(20), yielding the beam score vector $\tilde{\boldsymbol{\pi}}^{(q)} = \phi_{\mathrm{agg}}\big(\phi_{\mathrm{cls}}\big(\tilde{\mathbf{H}}_{L_e}^{(q)}[:, T_e, :] + \tilde{\mathbf{D}}_{L_d}^{(q)}[:, T_{\max}+1, :]\big)\big)$, where $\tilde{\mathbf{D}}_{L_d}^{(q)}$ is the output of the $L_d$ Decoding Units. Since both the motion features and the lost features preserve the same target ordering under permutation, the subsequent decoding and fusion operations remain permutation equivariant. In particular, for unlost targets, the zero-padding strategy in the lost feature ensures that their motion features remain unaffected in the Decoding Units and the residual fusion operation.

VI. NUMERICAL EXPERIMENTS

A. Experiment Setup

In this section, we provide numerical experiments to verify the effectiveness of the proposed tracking framework. The BS is located at $\mathbf{p}_{\mathrm{BS}} = [0, 0, 10]^T$ in meters (m) and is equipped with uniform transmit and receive UPAs, each with $N_{\mathrm{T}} = N_{\mathrm{R}} = 32 \times 32$ antennas and parallel to the $(x, y)$-plane. The discrete Fourier transform (DFT) codebook [37] with size $M = 1024$ is adopted, and each transmit or receive beamforming vector has to be selected from this codebook.
Similar to [38], [39], the mobility of each target $i$ is modeled as $\mathbf{u}_{i,q+1} = \mathbf{u}_{i,q} + \mathbf{v}_{i,q} \Delta T$, where $\mathbf{u}_{i,q} = [x_{i,q}, y_{i,q}, z_{i,q}]^T$ and $\mathbf{v}_{i,q} = [v_{i,q}^x, v_{i,q}^y, v_{i,q}^z]^T$ denote the coordinate location and the velocity vector of target $i$ in the $q$-th block, respectively, and $\Delta T = 0.1$ s represents the time duration of a block. Here, we assume $v_{i,q}^x = v_i \cos(\beta_{i,q})$, $v_{i,q}^y = v_i \sin(\beta_{i,q})$, and $v_{i,q}^z = 0$, where the target speed $v_i$ is assumed to remain constant within a trajectory but varies across different targets within the range $[10, 35]$ m/s. The moving direction in the horizontal plane, $\beta_{i,q}$, follows a dynamic model over time, given by $\beta_{i,q+1} = \beta_{i,q} + \Delta\beta$, where $\Delta\beta \sim \mathcal{U}(-20^\circ, 20^\circ)$. The initial moving direction is selected as $\beta_1 \sim \mathcal{U}(0, 2\pi)$. We assume that the initial altitude of the targets is randomly selected between 50 m and 80 m, i.e., $z_1 \sim \mathcal{U}(50, 80)$ m, and that additive white Gaussian noise is introduced on the $z$-coordinate of the target in each block, i.e., $z_{i,q+1} = z_{i,q} + \Delta z$, with $\Delta z \sim \mathcal{N}(0, \sigma_z^2)$.

A dataset of $N_d = 20{,}000$ trajectories is generated, each with a tracking duration of $T = 50$ s, and is split into training, validation, and evaluation subsets with proportions of 80%, 10%, and 10%, respectively. Moreover, the network is implemented using PyTorch with the AdamW optimizer for training, with a learning rate of $3 \times 10^{-4}$ and 50 training epochs. For the hyperparameters, we set $L_e = 3$, $L_d = 2$, $D = 256$, $h = 8$, and the batch size to 256.$^2$

$^2$Our code can be found at https://github.com/ltk722/Transformer-based-mmWave-tracking.

B. Baseline Models

To demonstrate the effectiveness of the proposed Transformer-based framework, the following baseline methods are considered:

• KF-based Scheme: This approach employs the Extended Kalman Filter (EKF) [9] to predict the targets' positions over time and selects the Top-$B_s$ beams whose steering directions minimize the mean squared error (MSE) with respect to the predicted locations.

• RNN-based Scheme [40]: This approach utilizes a standard RNN to perform beam tracking. In the event of a tracking lost, the RNN adopts a data imputation strategy by freezing the state update and propagating the last valid hidden state to predict beams for subsequent blocks.

• Two-mode LSTM-based Scheme: This approach constructs an advanced sequence-to-sequence architecture consisting of two cascaded LSTM sub-networks [41]. The primary LSTM encodes the historical trajectory for beam prediction and localization, while the secondary LSTM is activated in cases of tracking lost by taking the latent features from the primary LSTM and the lost events as input to predict beams for re-acquisition.

• Single-mode Transformer-based Scheme: This approach implements a decoder-only Transformer architecture [32]. It is a single-mode framework and utilizes one Transformer network for accomplishing both normal target tracking and lost target re-acquisition.

Under all schemes, a tracking failure is declared if the targets remain lost after $T_{\max} = 5$ consecutive time blocks.

Fig. 5. Tracking performance of successful probability versus tracking time.

C. Performance Evaluation in the Single-Target Scenario

In this subsection, we evaluate the tracking performance of the proposed tracking scheme in the single-target scenario.
We adopt the successful tracking probability and the average tracking duration as the performance metrics. Specifically, the successful tracking probability is defined as $P_S = N_{\mathrm{tracked}} / N_{\mathrm{total}}$, where $N_{\mathrm{total}}$ is the total number of test trajectories and $N_{\mathrm{tracked}}$ is the number of trajectories under which the target is still tracked after $T$ s. The average tracking duration is defined as $T_{\mathrm{avg}} = (1/N_{\mathrm{total}}) \sum_{n=1}^{N_{\mathrm{total}}} T_n$, where $T_n$ denotes the tracking time for which the $n$-th target remains tracked within the total period of $T$ s.

1) Performance over Tracking Time: In Fig. 5, we evaluate the successful tracking probability $P_S$ of various schemes versus the tracking time. The number of beams for tracking in each block is set to $B_s = 4$. It is observed that our proposed scheme consistently achieves the highest successful probability as the tracking time increases. Specifically, the KF-based scheme suffers from severe performance degradation due to its inability to model highly non-linear trajectories. Among the learning-based schemes, the RNN fails to provide reliable tracking because the frequent tracking lost events lead to error accumulation and eventually result in tracking failure. Although the Two-mode LSTM-based scheme attempts to enable re-acquisition, its improvement is still limited due to error accumulation. In contrast, the proposed scheme is able to maintain the tracking process over a much longer tracking interval. The performance gain is attributed to two main factors. First, the masked attention-empowered N-Mode achieves a superior feature extraction capability that avoids prediction error accumulation. Second, the re-acquisition mechanism in the R-Mode effectively re-acquires the lost target by exploiting the lost events, thereby increasing the tracking duration compared to the Single-mode Transformer-based scheme.
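For reference, the two metrics defined at the start of this subsection can be computed from the per-trajectory tracking durations $T_n$ as in the sketch below. The function names and the convention that a trajectory counts as successfully tracked when $T_n$ reaches the full period $T$ are assumptions of this example, and the durations are made-up values rather than results from the experiments.

```python
from typing import Sequence


def success_probability(durations: Sequence[float], total_time: float) -> float:
    """P_S: fraction of trajectories still tracked after total_time seconds."""
    tracked = sum(1 for t in durations if t >= total_time)
    return tracked / len(durations)


def average_duration(durations: Sequence[float]) -> float:
    """T_avg: mean tracking time over all test trajectories."""
    return sum(durations) / len(durations)


# Illustrative durations (in seconds) for four test trajectories, T = 50 s.
example = [50.0, 50.0, 12.5, 37.5]
p_s = success_probability(example, total_time=50.0)
t_avg = average_duration(example)
```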
2) Performance over Number of Tracking Beams: In Fig. 6, we evaluate the impact of the number of tracking beams per block, i.e., $B_s$, on the successful probability and the average tracking duration, respectively. It is observed that the proposed scheme achieves the highest successful probability and the longest tracking duration across all values of $B_s$. Specifically, when $B_s$ is small, all schemes exhibit limited performance in maintaining long-term tracking over 50 s due to the insufficient spatial coverage of the limited tracking beams. However, the proposed scheme demonstrates superior robustness, achieving an average tracking duration of over 15 s, while the benchmark schemes fail to maintain tracking for more than 5 s. As $B_s$ increases, the proposed scheme exhibits a rapid performance improvement. Notably, with as few as four beams, it achieves a successful probability exceeding 90% and an average tracking duration of over 40 s. This result demonstrates the efficiency of the proposed framework in maintaining robust long-term tracking with minimal beam sweeping overhead.

Fig. 6. Tracking performance versus the number of tracking beams. (a) Performance of successful probability. (b) Performance of average tracking duration.

3) Performance over Target Speed: In Fig. 7, we evaluate the impact of target speed on the average tracking duration with $B_s = 4$ and $T = 50$ s. It is observed that the tracking performance of all schemes degrades as the target speed increases. This degradation occurs because higher mobility introduces greater stochasticity and non-linearity into the target's trajectory, thereby complicating the precise prediction of beam alignment. However, the proposed scheme demonstrates robustness to target mobility.
Specifically, the average tracking duration decreases by less than 10 s when the target speed increases from the low-mobility case ($v = 10$ m/s) to the high-mobility case ($v = 35$ m/s). In contrast, the other learning-based schemes suffer a degradation of more than 50%. Notably, compared to the Single-mode Transformer-based scheme, the proposed solution achieves a substantial performance gain ranging from 15% (low mobility) to over 130% (high mobility), which validates the critical role of the proposed R-Mode in facilitating effective target re-acquisition in high-mobility scenarios.

Fig. 7. Average tracking duration versus target speed.

Fig. 8. Illustration of the real and estimated trajectories.

4) Performance of Trajectory Estimation: In Fig. 8, we illustrate the actual trajectory and the corresponding estimated locations of the target under the proposed scheme. It is observed that tracking loss occurs particularly in regions where the target experiences rapid motion and the optimal beam direction changes abruptly. Although the target trajectory is complex, the proposed scheme is able to re-acquire the target location within a few blocks and effectively track the trajectory over the entire tracking duration.

D. Performance Evaluation in the Multi-Target Scenario

In this subsection, we evaluate the tracking performance of the proposed framework in the multi-target scenario. As a benchmark, we consider a fully decoupled strategy that decomposes the multi-target tracking problem into $I$ independent single-target sub-tasks. In contrast to the joint beam sweeping strategy in m^3TrackFormer, the benchmark processes each target sequentially with an individual sweeping strategy.
The total beams for tracking in each time block are evenly divided among all targets, with $\lfloor B_s / I \rfloor$ beams to track each target. In Fig. 9(a), we evaluate the average tracking duration versus the number of tracking beams and the number of targets under the proposed scheme and the benchmark. It is observed that the proposed scheme achieves a longer tracking duration than the benchmark for all considered numbers of beams, which demonstrates that the proposed joint beam sweeping strategy is effective in adaptively allocating the limited number of tracking beams to re-acquire the lost targets while maintaining normal tracking for the targets that are not lost. In Fig. 9(b), we further investigate the impact of the number of targets. It can be seen that the performance degrades as the number of targets increases, due to increased competition for limited beam resources. Nevertheless, the proposed scheme exhibits higher robustness in such scenarios, maintaining a substantially longer tracking duration than the benchmark.

Fig. 9. Multi-target tracking performance of average tracking duration. (a) Impact of the number of beams (curves for $I = 2, 3, 4$). (b) Impact of the number of targets (curves for $B_s = 12, 18$).

Fig. 10. Average inference latency versus the number of targets.

In Fig. 10, we evaluate the average inference latency per block versus the number of targets. The simulations are conducted on an NVIDIA GeForce RTX 4060 Ti GPU. It is observed that the inference latency of m^3TrackFormer consistently remains below 10 milliseconds (ms) even as the number of targets increases. This efficiency comes from the inherent parallel processing capability of the Transformer backbone, which processes the features of all targets simultaneously within a single forward pass and thus fully leverages the parallelism of the GPU.
In contrast, the benchmark scheme exhibits a linear growth in latency, leading to significant computational overhead in dense scenarios. This result demonstrates that the proposed m^3TrackFormer satisfies the low-latency requirements of 6G ISAC applications and validates its feasibility for practical real-time deployment.

VII. CONCLUSION

In this paper, we proposed a robust two-mode Transformer-based multi-target tracking framework for mmWave ISAC systems. When all the targets are hit by the swept beams, the framework operates in the N-Mode and realizes target tracking and beam prediction via the N-Net, in which the masked self-attention mechanism of the Transformer is employed to extract global motion features of each target directly from incomplete historical trajectories. When tracking loss occurs due to beam misalignment, the framework switches to the R-Mode, in which the R-Net fuses the motion features and the negative feedback from beam misalignment to adjust the future beam sweeping strategy for target re-acquisition. Numerical results demonstrated that the proposed framework significantly outperforms the benchmark schemes in terms of successful tracking probability and average tracking duration with low inference latency, which confirms its effectiveness and robustness for real-time multi-target tracking in mmWave ISAC systems.

REFERENCES

[1] T. Li, W. Zhu, S. Zhang, J. Cao, S. Cui, and L. Liu, “m^2trackformer: Transformer-based mmwave tracking with lost target re-acquisition capability,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2026, pp. 1–5.
[2] ITU-R, “The ITU-R framework for IMT-2030,” 2023. [Online]. Available: https://www.itu.int/en/ITU-R/study-groups/rsg5/rwp5d/imt-2030/Pages/default.aspx
[3] L. Zheng, M. Lops, Y. C. Eldar, and X. Wang, “Radar and communication coexistence: An overview: A review of recent methods,” IEEE Signal Process. Mag., vol. 36, no. 5, pp. 85–99, Sep. 2019.
[4] F. Liu, C. Masouros, A. P. Petropulu, H. Griffiths, and L. Hanzo, “Joint radar and communication design: Applications, state-of-the-art, and the road ahead,” IEEE Trans. Commun., vol. 68, no. 6, pp. 3834–3862, Jun. 2020.
[5] J. A. Zhang et al., “Enabling joint communication and radar sensing in mobile networks—a survey,” IEEE Commun. Surveys Tuts., vol. 24, no. 1, pp. 306–345, Nov. 2021.
[6] K. V. Mishra, M. B. Shankar, V. Koivunen, B. Ottersten, and S. A. Vorobyov, “Toward millimeter-wave joint radar communications: A signal processing perspective,” IEEE Signal Process. Mag., vol. 36, no. 5, pp. 100–114, Sep. 2019.
[7] M. Giordani, M. Polese, A. Roy, D. Castor, and M. Zorzi, “A tutorial on beam management for 3GPP NR at mmWave frequencies,” IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 173–196, 1st Quart., 2019.
[8] G. Cheng, X. Song, Z. Lyu, and J. Xu, “Networked ISAC for low-altitude economy: Coordinated transmit beamforming and UAV trajectory design,” IEEE Trans. Commun., Feb. 2025, Early Access.
[9] V. Va, H. Vikalo, and R. W. Heath, “Beam tracking for mobile millimeter wave communication systems,” in Proc. GlobalSIP, Dec. 7-9, 2016, pp. 743–747.
[10] M. Gruber, “An approach to target tracking,” MIT Lexington Lincoln Lab, Tech. Rep. AD0654272, Feb. 1967.
[11] G. Revach, N. Shlezinger, X. Ni, A. L. Escoriza, R. J. Van Sloun, and Y. C. Eldar, “KalmanNet: Neural network aided kalman filtering for partially known dynamics,” IEEE Trans. Signal Process., vol. 70, pp. 1532–1547, 2022.
[12] F. Pedraza and G. Caire, “Sensing-assisted beam tracking for mmWave V2I communications with analog, hybrid, and digital antenna architectures,” IEEE Trans. Wireless Commun., vol. 24, pp. 447–461, Jan. 2025.
[13] C. Liu et al., “Learning-based predictive beamforming for integrated sensing and communication in vehicular networks,” IEEE J. Sel. Areas Commun., vol. 40, no. 8, pp. 2317–2334, Aug. 2022.
[14] S. H. A. Shah and S. Rangan, “Multi-cell multi-beam prediction using auto-encoder LSTM for mmWave systems,” IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10 366–10 380, Dec. 2022.
[15] W. Yuan, C. Liu, F. Liu, S. Li, and D. W. K. Ng, “Learning-based predictive beamforming for UAV communications with jittering,” IEEE Wireless Commun. Lett., vol. 9, no. 11, pp. 1970–1974, Nov. 2020.
[16] J. Zhang, Y. Huang, Y. Zhou, and X. You, “Beam alignment and tracking for millimeter wave communications via bandit learning,” IEEE Trans. Commun., vol. 68, no. 9, pp. 5519–5533, Sep. 2020.
[17] J. Zhang, Y. Huang, J. Wang, X. You, and C. Masouros, “Intelligent interactive beam training for millimeter wave communications,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 2034–2048, Mar. 2021.
[18] P. Susarla, B. Gouda, Y. Deng, D. Grace, and T. Ratnarajah, “Learning-based beam alignment for uplink mmwave UAVs,” IEEE Trans. Wireless Commun., vol. 22, no. 3, pp. 1779–1793, Mar. 2023.
[19] F. Giuliari, I. Hasan, M. Cristani, and F. Galasso, “Transformer networks for trajectory forecasting,” in Proc. Int. Conf. Pattern Recog. (ICPR), Jan. 10-15, 2021, pp. 10 335–10 342.
[20] Z. Che et al., “Recurrent neural networks for multivariate time series with missing values,” Sci. Rep., vol. 8, no. 1, p. 6085, 2018.
[21] W. Du, D. Côté, and Y. Liu, “SAITS: Self-attention-based imputation for time series,” Expert Syst. Appl., vol. 219, p. 119619, 2023.
[22] A. Venkatraman, M. Hebert, and J. A. Bagnell, “Improving multi-step prediction of learned time series models,” in Proc. AAAI Conf. Artif. Intell., vol. 29, no. 1, 2015.
[23] Q. Wen et al., “Transformers in time series: A survey,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Aug. 2023, pp. 6778–6786.
[24] H. Jiang, M. Cui, D. W. K. Ng, and L. Dai, “Accurate channel prediction based on Transformer: Making mobility negligible,” IEEE J. Sel. Areas Commun., vol. 40, no. 9, pp. 2717–2732, Sep. 2022.
[25] W. Zhu, J. Gao, S. Zhang, and L. Liu, “Reconfigurable intelligent surface-assisted multiuser tracking and signal detection in ISAC,” arXiv preprint arXiv:2509.13940, 2025.
[26] P. Tichavsky, C. H. Muravchik, and A. Nehorai, “Posterior Cramér–Rao bounds for discrete-time nonlinear filtering,” IEEE Trans. Signal Process., vol. 46, no. 5, pp. 1386–1396, 1998.
[27] H. L. Chiang, K. C. Chen, W. Rave, and T. K. Lo, “Machine-learning beam tracking and weight optimization for mmWave multi-UAV links,” IEEE Trans. Wireless Commun., vol. 20, no. 8, pp. 5481–5494, Aug. 2021.
[28] Y. Niu, Y. Li, D. Jin, L. Su, and A. V. Vasilakos, “A survey of millimeter wave communications (mmWave) for 5G: Opportunities and challenges,” Wireless Netw., vol. 21, no. 8, pp. 2657–2676, Nov. 2015.
[29] P. Stoica and K. C. Sharman, “Maximum likelihood methods for Direction-of-Arrival estimation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 7, pp. 1132–1143, Jul. 1990.
[30] H. Huang, J. Yang, H. Huang, Y. Guo, and G. Yang, “Deep learning for super-resolution channel estimation and DOA estimation based massive MIMO system,” IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8549–8560, Sep. 2018.
[31] Y. Wang, Z. Wei, and Z. Feng, “Beam training and tracking in mmWave communication: A survey,” China Commun., vol. 21, no. 6, pp. 1–22, Jun. 2024.
[32] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Jun. 2017, pp. 6000–6010.
[33] M. Crawshaw, “Multi-task learning with deep neural networks: A survey,” arXiv preprint arXiv:2009.09796, 2020.
[34] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2011, pp. 627–635.
[35] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” in Proc. Int. Conf. Mach. Learn. (ICML), 2024, pp. 15 706–15 734.
[36] S. Ravanbakhsh, J. Schneider, and B. Póczos, “Equivariance through parameter-sharing,” in Proc. Int. Conf. Mach. Learn. (ICML), 2017, pp. 2892–2901.
[37] K. Chen, C. Qi, and G. Y. Li, “Two-step codeword design for Millimeter-Wave massive MIMO systems with quantized phase shifters,” IEEE Trans. Signal Process., vol. 68, pp. 170–180, 2020.
[38] Y. Zeng, Q. Wu, and R. Zhang, “Accessing from the sky: A tutorial on UAV communications for 5G and beyond,” Proc. IEEE, vol. 107, no. 12, pp. 2327–2375, Dec. 2019.
[39] H. Han, T. Jiang, and W. Yu, “Active sensing for multiuser beam tracking with reconfigurable intelligent surface,” IEEE Trans. Wireless Commun., vol. 24, no. 1, pp. 540–554, Jan. 2025.
[40] D. Burghal, N. A. Abbasi, and A. F. Molisch, “A machine learning solution for beam tracking in mmWave systems,” in Proc. 53rd Asilomar Conf. Signals, Syst., Comput., 2019, pp. 173–177.
[41] S. Jiang, G. Charan, and A. Alkhateeb, “Lidar aided future beam prediction in real-world millimeter wave V2I communications,” IEEE Wireless Commun. Lett., vol. 12, no. 2, pp. 212–216, Feb. 2022.
