BlossomRec: Block-level Fused Sparse Attention Mechanism for Sequential Recommendations
Transformer architectures have been widely used in sequential recommender systems (SRS). However, as user interaction histories grow, so do computational time and memory requirements, mainly due to the standard attention mechanism. Although many methods employ efficient attention or SSM-based models, these approaches struggle to model long sequences effectively and may exhibit unstable performance on short ones. To address these challenges, we design a sparse attention mechanism, BlossomRec, which models both long-term and short-term user interests through attention computation to achieve stable performance across sequences of varying lengths. Specifically, we categorize user interests in recommender systems into long-term and short-term interests, compute them with two distinct sparse attention patterns, and combine the results through a learnable gated output. Theoretically, BlossomRec significantly reduces the number of interactions participating in attention computation. Extensive experiments on four public datasets demonstrate that BlossomRec, when integrated with state-of-the-art Transformer-based models, achieves comparable or even superior performance while significantly reducing memory usage, providing strong evidence of its efficiency and effectiveness. The code is available at https://github.com/Applied-Machine-Learning-Lab/WWW2026_BlossomRec.
💡 Research Summary
BlossomRec tackles the scalability bottleneck of transformer‑based sequential recommender systems (SRS) by introducing a block‑level fused sparse attention mechanism that simultaneously captures long‑term and short‑term user interests. The authors first observe that user interaction histories can be effectively partitioned into overlapping blocks, an inductive bias that preserves the distribution of long‑term preferences while enabling efficient computation.
The model consists of two parallel sparse attention pathways. Long‑Term Interest Selection (LTIS) splits the key and value matrices into blocks of size l with stride s, compresses each block into a single representative vector using a learnable MLP, and then computes importance scores between the full query matrix and these compressed keys. Only the top‑k blocks (according to the scores) are retained for the actual attention calculation, dramatically reducing the number of token‑to‑token interactions. This selective attention is implemented with Triton‑based native sparse kernels, ensuring high GPU throughput.
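The compress-then-select step of LTIS can be sketched in plain Python. This is a toy illustration, not the paper's implementation: mean pooling stands in for the learnable MLP compressor, scores are plain dot products, and the function name and signature are invented for this sketch.

```python
def ltis_select(queries, keys, block_len, stride, top_k):
    """Toy LTIS sketch: compress key blocks, then keep the top-k blocks
    per query. Mean pooling replaces the paper's learnable MLP.
    `queries` and `keys` are lists of d-dimensional vectors (lists of floats).
    """
    # 1. Slice keys into (possibly overlapping) blocks of length block_len.
    blocks = [keys[i:i + block_len]
              for i in range(0, len(keys) - block_len + 1, stride)]

    # 2. Compress each block into one representative vector (mean pooling).
    reps = [[sum(dim) / len(blk) for dim in zip(*blk)] for blk in blocks]

    # 3. Score each query against every compressed block and keep top-k.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    selected = []
    for q in queries:
        scores = [dot(q, r) for r in reps]
        top = sorted(range(len(reps)), key=lambda i: scores[i],
                     reverse=True)[:top_k]
        selected.append(sorted(top))  # block indices this query attends to
    return selected
```

Only the tokens inside the selected blocks then enter the actual attention computation, which is where the reduction in token-to-token interactions comes from.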
Short‑Term Interest Selection (STIS) focuses on the most recent interactions. It employs a power‑law mask that lets each query token attend to (1) a symmetric local window of win tokens, (2) whole blocks whose block‑index distance from the query block is an integer power of two, and (3) the final block, which contains the freshest items. This pattern yields a receptive field comparable to sliding‑window attention while attending to only O(log L) blocks per query, preserving the influence of the latest behavior without sacrificing sparsity.
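At block granularity, the power-law pattern can be sketched as follows. This is a simplified, assumed reading of the mask: the token-level local window is folded into the query's own block, causality is enforced by only looking backward, and the "final block" rule reduces to the most recent block at or before the query.

```python
def stis_block_mask(num_blocks, query_block):
    """Toy STIS sketch: return the sorted key-block indices a query
    block may attend to under a causal power-of-two pattern.
    Simplifying assumptions: the win-token local window is covered by the
    query's own block, and only blocks at or before the query are allowed."""
    allowed = {query_block}            # own block (covers the local window)
    d = 1
    while query_block - d >= 0:        # blocks at power-of-two distances
        allowed.add(query_block - d)
        d *= 2
    if num_blocks - 1 <= query_block:  # "final block" rule (causal case:
        allowed.add(num_blocks - 1)    # the freshest block is the query's own)
    return sorted(allowed)
```

Because the loop doubles `d` each step, a query block touches at most log2(query_block) + 2 blocks, which is where the O(log L) cost comes from.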
Both pathways feed into a learnable gated fusion MLP that assigns adaptive weights to the long‑term and short‑term attention outputs on a per‑head (or per‑group) basis. This gating mechanism enables the model to balance the two interest signals dynamically, resulting in stable performance across sequences of varying lengths.
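A minimal sketch of the fusion step, assuming a sigmoid gate that blends the two pathway outputs; in BlossomRec the gate weights come from a learnable MLP (per head or group), whereas here `gate_logit` is a fixed scalar standing in for that MLP's output.

```python
import math

def gated_fusion(long_out, short_out, gate_logit):
    """Toy gated fusion: a sigmoid gate blends the long-term and
    short-term attention outputs element-wise.
    `long_out` and `short_out` are same-length lists of floats."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))   # sigmoid -> weight in (0, 1)
    return [g * lo + (1.0 - g) * so for lo, so in zip(long_out, short_out)]
```

With a learnable `gate_logit`, the model can push `g` toward 1 when long-term preferences dominate (e.g. long histories) and toward 0 when recent behavior matters more, which is how the gate stabilizes performance across sequence lengths.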
To further reduce parameter redundancy, BlossomRec adopts Grouped Query Attention (GQA), where multiple query heads share the same key and value projections. This sharing aligns naturally with the block‑wise importance scores computed in LTIS, allowing the scores to be reused across heads within a group.
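The score-reuse that GQA enables can be sketched as a simple head-to-group mapping; the function below is an invented illustration of the bookkeeping, not the paper's kernel.

```python
def share_block_scores(num_q_heads, num_kv_groups, scores_per_group):
    """Toy GQA sketch: map each query head onto its KV group so that the
    block-importance scores computed once per group (by LTIS) are reused
    by every query head in that group.
    `scores_per_group` has one score list per KV group."""
    heads_per_group = num_q_heads // num_kv_groups
    return [scores_per_group[h // heads_per_group]
            for h in range(num_q_heads)]
```

With, say, 8 query heads and 2 KV groups, the block selection runs twice instead of eight times, cutting both the scoring cost and the KV projection parameters.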
Theoretical analysis shows that the combined sparsity reduces the classic O(N² d) self‑attention cost to O(k d + log N), where k is the number of selected blocks and d the hidden dimension. Empirical evaluation on four public benchmarks (Amazon Beauty, MovieLens‑1M, Gowalla, Yelp) demonstrates that when BlossomRec is plugged into state‑of‑the‑art transformer‑based recommenders such as SASRec, BERT4Rec, and LightSANs, it achieves 1.2‑2.5 % higher HR@10/NDCG@10 while cutting memory consumption by 30‑45 % and inference latency by 20‑35 % on a standard GPU. Ablation studies confirm that each component—LTIS, STIS, and the gated fusion—contributes uniquely: LTIS excels at modeling global preferences, STIS captures recent trends, and the gate harmonizes them for robust predictions.
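A back-of-the-envelope count makes the sparsity concrete. The settings below (N = 2048, block length 64, top-k = 8, window 32) are assumed for illustration only, not taken from the paper:

```python
import math

# Assumed illustrative settings (not paper-reported values).
N, l, k, win = 2048, 64, 8, 32

dense = N                                 # dense attention: keys per query
ltis = k * l                              # LTIS: k selected blocks of l tokens
stis = win + l * int(math.log2(N // l))   # STIS: window + log-many blocks
sparse = ltis + stis

print(dense, sparse, round(sparse / dense, 2))  # → 2048 864 0.42
```

Even under these conservative settings each query touches well under half the keys, and the gap widens as N grows because only the logarithmic STIS term scales with sequence length.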
Importantly, BlossomRec is designed as a drop‑in replacement for the standard attention layer, requiring no architectural redesign of existing SRS models, and it remains compatible with various transformer variants. By unifying block‑level compression, power‑law masking, and adaptive gating, BlossomRec offers a practical, scalable solution for real‑world recommendation services that must handle long user histories under strict computational constraints.