CMANet: Channel-Masked Attention Network for Cooperative Multi-Base-Station 3D Positioning
Achieving ubiquitous high-accuracy localization is crucial for next-generation wireless systems, yet remains challenging in multipath-rich urban environments. By exploiting the fine-grained multipath characteristics embedded in channel state information (CSI), more reliable and precise localization can be achieved. To this end, we present CMANet, a multi-BS cooperative positioning architecture that performs feature-level fusion of raw CSI using the proposed Channel Masked Attention (CMA) mechanism. The CMA encoder injects a physically grounded prior (per-BS channel gain) into the attention weights, thus emphasizing reliable links and suppressing spurious multipath. A lightweight LSTM decoder then treats subcarriers as a sequence to accumulate frequency-domain evidence into a final 3D position estimate. In a typical 5G NR-compliant urban simulation, CMANet achieves a median error below 0.5 m and a 90th-percentile error below 1.0 m, outperforming state-of-the-art benchmarks. Ablations verify the necessity of CMA and frequency accumulation. CMANet is edge-deployable and exemplifies an Integrated Sensing and Communication (ISAC)-aligned, cooperative paradigm for multi-BS CSI positioning.
💡 Research Summary
CMANet is a novel deep‑learning framework designed for high‑precision three‑dimensional positioning in next‑generation wireless networks by jointly exploiting raw channel state information (CSI) from multiple base stations (BSs). The authors first identify the limitations of existing CSI‑based localization methods, which either transform CSI into sparse fingerprints (e.g., angle‑delay power matrices) or perform late‑stage fusion of independently estimated positions. Both approaches discard the rich inter‑BS multipath correlations that are inherently present in the raw complex CSI tensor.
To address this gap, CMANet introduces a two-stage architecture: (1) a Channel-Masked Attention (CMA) encoder and (2) a Frequency-Cumulative LSTM decoder. The raw CSI from L BSs, each equipped with M antennas and N OFDM subcarriers, is represented as a complex tensor H∈ℂ^{L×M×N}. After separating real and imaginary parts, the tensor is reshaped to a real-valued matrix of shape (L, 2M·N). The CMA block computes a per-BS channel gain vector by taking the Euclidean norm of each BS’s CSI slice. This gain vector is layer-normalized and broadcast as a multiplicative mask W∈ℝ^{L×1} that modulates the self-attention scores computed over the BS dimension. Consequently, BSs with higher received power receive larger attention weights, while noisy or heavily obstructed links are automatically down-weighted. This injection of a physically grounded prior differentiates CMANet from pure self-attention models and, according to the authors, yields faster convergence and better generalization.
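The encoder steps described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not say exactly how the mask enters the attention computation, so here the layer-normalized gain is squashed with a sigmoid to keep it positive, multiplied onto the attention weights along the key (BS) axis, and the rows are renormalized. The function names, the projection matrices Wq, Wk, Wv, and the feature width d are all hypothetical.

```python
import numpy as np

def csi_to_real(H):
    """Stack real and imaginary parts: (L, M, N) complex -> (L, 2*M*N) real."""
    L = H.shape[0]
    return np.concatenate([H.real, H.imag], axis=1).reshape(L, -1)

def layer_norm(v, eps=1e-6):
    """Zero-mean, unit-variance normalization of a 1-D vector."""
    return (v - v.mean()) / (v.std() + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_masked_attention(X, Wq, Wk, Wv):
    """X: (L, D) real features, one row per BS. Returns (output, attention)."""
    d = Wq.shape[1]
    # Per-BS channel gain: Euclidean norm of each BS's CSI slice.
    gain = np.linalg.norm(X, axis=1)                    # (L,)
    # ASSUMPTION: sigmoid keeps the layer-normalized mask in (0, 1).
    mask = 1.0 / (1.0 + np.exp(-layer_norm(gain)))      # (L,)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)       # (L, L) over BSs
    # ASSUMPTION: mask scales each key BS's weight, then rows are renormalized.
    attn = attn * mask[None, :]
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ V, attn

rng = np.random.default_rng(0)
L, M, N, d = 6, 4, 8, 16                                # toy sizes
H = rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))
H[2] *= 5.0                                             # make BS 2 the strongest link
X = csi_to_real(H)
Wq, Wk, Wv = (rng.standard_normal((X.shape[1], d)) * 0.01 for _ in range(3))
out, attn = channel_masked_attention(X, Wq, Wk, Wv)
print(out.shape, attn.sum(axis=-1))
```

Because the real and imaginary parts are stacked before the norm is taken, the gain of each row of X equals the Frobenius norm of the corresponding complex slice of H, so the mask reflects received power per BS.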
The second stage treats the subcarrier axis as a temporal sequence. The CMA-processed output retains the shape (L, 2M·N) and is then reshaped and permuted to (N, 2M·L), so that each “time step” corresponds to the spatial features of one subcarrier across all BSs. A multi-layer LSTM processes this sequence, accumulating frequency-domain evidence across subcarriers. The final hidden state is fed through a multilayer perceptron (MLP) that regresses the user equipment (UE) position (x, y, z). The loss is a weighted MSE that weights later subcarriers more heavily, which encourages the network to refine its estimate as more frequency information becomes available; this is consistent with the observed error reduction with increasing N.
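The reshaping step and the weighted loss can be illustrated with a short NumPy sketch. Two points are assumptions rather than details from the summary: the flattened feature axis is taken to be laid out as (2M, N) with the subcarrier index varying fastest, and the loss weights follow a linear ramp normalized to sum to one. The helper names are hypothetical, and the LSTM itself is omitted.

```python
import numpy as np

def to_subcarrier_sequence(X, L, M, N):
    """(L, 2*M*N) -> (N, 2*M*L): one feature vector per subcarrier.

    ASSUMPTION: the flattened axis of X is ordered (2M, N), subcarrier fastest.
    """
    return X.reshape(L, 2 * M, N).transpose(2, 1, 0).reshape(N, 2 * M * L)

def weighted_mse(preds, target, weights=None):
    """preds: (N, 3) position estimate after each subcarrier; target: (3,)."""
    n_steps = preds.shape[0]
    if weights is None:
        # ASSUMPTION: linear ramp, so later subcarriers count more.
        weights = np.arange(1, n_steps + 1, dtype=float)
    weights = weights / weights.sum()
    sq_err = ((preds - target) ** 2).sum(axis=1)        # squared error per step
    return float((weights * sq_err).sum())

L, M, N = 6, 4, 8                                       # toy sizes
X = np.arange(L * 2 * M * N, dtype=float).reshape(L, 2 * M * N)
seq = to_subcarrier_sequence(X, L, M, N)
print(seq.shape)

target = np.array([10.0, 20.0, 1.5])
early = np.tile(target, (4, 1))
early[0, 0] += 1.0                                      # 1 m error at the first step
late = np.tile(target, (4, 1))
late[-1, 0] += 1.0                                      # 1 m error at the last step
print(weighted_mse(early, target), weighted_mse(late, target))
```

With the ramp weights, the same 1 m error is penalized four times more at the last of four steps than at the first, which is the behavior the summary attributes to the weighted MSE.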
The authors evaluate CMANet in a realistic urban scenario modeled after the Arc de Triomphe area in Paris. Six BSs are placed according to OpenCellID data, each operating at 3.5 GHz with a 20 MHz bandwidth and 288 subcarriers. CSI is generated via ray‑tracing using NVIDIA’s Sionna library, incorporating realistic multipath, blockage, and elevation variations (UE heights 0–30 m). Training data consist of 10 000 random UE locations per epoch, and testing is performed on 1 000 unseen layouts every 20 epochs.
Performance metrics include median error, 90th-percentile error, and mean absolute error. CMANet achieves a median positioning error of 0.48 m and a 90th-percentile error of 0.96 m, outperforming three baselines: (i) a self-attention model without channel masking (median 0.71 m, 90th percentile 1.34 m), (ii) an ADCPM-SegNet-MLP cooperative model (median 0.85 m, 90th percentile 1.58 m), and (iii) MFCNet (median 0.79 m, 90th percentile 1.42 m). Ablation studies confirm that removing the channel-gain mask degrades the median error by roughly 0.2 m, highlighting the necessity of CMA.
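For reference, these metrics can be computed from predicted and true positions as in the sketch below. It assumes the error is the 3D Euclidean distance per test point and that the mean is taken over those distances; the summary does not define "mean absolute error" precisely, so that reading is an assumption, and the function name is hypothetical.

```python
import numpy as np

def positioning_metrics(pred, true):
    """pred, true: (num_points, 3) arrays of (x, y, z) positions in metres."""
    err = np.linalg.norm(pred - true, axis=1)           # 3D error per point
    return {
        "median": float(np.median(err)),
        "p90": float(np.percentile(err, 90)),           # 90th-percentile error
        "mean": float(err.mean()),                      # ASSUMPTION: mean 3D error
    }

# Toy data: five points with errors of exactly 1, 2, 3, 4 and 5 m along x.
true = np.zeros((5, 3))
pred = np.zeros((5, 3))
pred[:, 0] = [1.0, 2.0, 3.0, 4.0, 5.0]
metrics = positioning_metrics(pred, true)
print(metrics)
```

Note that np.percentile interpolates linearly between samples by default, so on small test sets the 90th-percentile value may not coincide with any single observed error.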
Computationally, CMA adds only O(L·N) operations, and the LSTM scales linearly with the number of subcarriers. The entire network contains roughly 1.2 M parameters, making it suitable for edge deployment on 5G base‑station servers with real‑time inference capability (tens of predictions per second) and modest memory footprints.
In summary, CMANet demonstrates that embedding physically meaningful priors (per‑BS channel gains) into an attention mechanism, combined with frequency‑domain sequential aggregation, can substantially improve multi‑BS CSI‑based positioning. The work paves the way for integrated sensing and communication (ISAC) solutions where accurate, low‑latency localization is required for autonomous driving, smart‑city services, and beyond. Future directions include handling asynchronous CSI, extending to moving UE trajectories, and validating the approach on over‑the‑air measurements.