On the Efficiency of Sequentially Aware Recommender Systems: Cotten4Rec

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sequential recommendation (SR) models predict a user’s next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, effectively capture these patterns but incur significant computational overhead due to extensive intermediate computations associated with Softmax-based attention. We propose Cotten4Rec, a novel SR model utilizing linear-time cosine similarity attention, implemented through a single optimized Compute Unified Device Architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline, LinRec, especially for datasets with moderate sequence lengths and vocabulary sizes. Evaluations across three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal compromise in recommendation accuracy, demonstrating Cotten4Rec’s viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are critical.


💡 Research Summary

The paper introduces Cotten4Rec, an efficient sequential recommendation model that replaces the conventional scaled‑dot‑product attention used in BERT4Rec with a cosine‑similarity‑based attention mechanism and implements the entire computation in a single highly‑optimized CUDA kernel. Traditional transformer‑based SR models suffer from quadratic O(N²) time and memory complexity because they must construct an N × N score matrix (QKᵀ) and apply a Softmax normalization. This becomes prohibitive when user interaction sequences are long and the item vocabulary continually expands, leading to GPU memory overflow and high latency in production systems.

Cotten4Rec addresses these issues in two steps. First, it normalizes the query (Q) and key (K) matrices row‑wise with the L₂ norm, computes cosine similarity directly as the dot product of the normalized vectors, and multiplies the result by the value (V) matrix. By pre‑computing the d × d matrix KᵀV and then multiplying it with the normalized Q, the overall computational cost drops to O(N·d²), i.e., linear in the sequence length N rather than quadratic. A learnable scaling factor m is introduced to keep the magnitude of the attention scores stable, effectively replacing the Softmax normalization.
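The reordering described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the `eps` guard against zero norms, and treating `m` as a plain scalar are our own assumptions.

```python
import numpy as np

def cosine_linear_attention(Q, K, V, m=1.0, eps=1e-8):
    """Linear-time cosine-similarity attention (illustrative sketch).

    Instead of forming the N x N score matrix ((Q_hat @ K_hat.T) @ V),
    associativity lets us precompute the d x d matrix K_hat.T @ V and
    multiply it by Q_hat, reducing cost from O(N^2 * d) to O(N * d^2).
    `m` stands in for the paper's learnable scaling factor; `eps` is an
    assumed guard against zero-norm rows.
    """
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)  # row-wise L2 norm
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    KtV = Kh.T @ V            # d x d summary, built once
    return m * (Qh @ KtV)     # N x d context matrix
```

Because matrix multiplication is associative, this produces exactly the same result as the quadratic formulation; only the evaluation order, and hence the cost, changes.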

Second, the authors fuse row‑wise normalization, the accumulation of KᵀV, and the final Q·(KᵀV) multiplication into one CUDA kernel. The kernel loads small tiles of K and V into shared memory or registers, accumulates the d × d product entirely on‑chip, and finally writes the N × d context matrix back to global memory. This eliminates multiple kernel launches, reduces global memory traffic, and minimizes kernel‑launch overhead.
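The fused-kernel strategy can be mimicked in NumPy to show why a single pass suffices: the d × d accumulator plays the role of the on-chip (shared-memory/register) buffer, and tiles of K and V are consumed one at a time. This is a host-side sketch under our own assumptions (tile size, `eps` guard); the actual CUDA kernel is not shown in the summary.

```python
import numpy as np

def fused_cosine_attention_tiled(Q, K, V, m=1.0, tile=64, eps=1e-8):
    """Single-pass sketch of the fused-kernel idea (hypothetical tiling).

    Tiles of K and V are normalized and folded into a d x d accumulator,
    analogous to the on-chip accumulation the CUDA kernel performs, and
    the N x d context matrix is written out in one final product.
    """
    d = Q.shape[-1]
    acc = np.zeros((d, d), dtype=Q.dtype)  # stands in for the on-chip buffer
    for start in range(0, K.shape[0], tile):
        Kt = K[start:start + tile]
        Vt = V[start:start + tile]
        Kh = Kt / (np.linalg.norm(Kt, axis=-1, keepdims=True) + eps)
        acc += Kh.T @ Vt      # per-tile accumulation of K_hat.T @ V
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    return m * (Qh @ acc)     # single write of the N x d context
```

The payoff on a GPU is that no N × N or N × d intermediate ever touches global memory: only the final context matrix is written back, and the whole computation needs one kernel launch instead of several.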

Compared with LinRec, a recent linear‑attention baseline that uses an ELU+1 feature map to avoid the Softmax, Cotten4Rec does not require additional non‑linear transformations because cosine similarity inherently focuses on direction rather than magnitude. Consequently, Cotten4Rec achieves higher numerical stability and a smaller memory footprint.
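For contrast, a generic ELU+1 feature-map linear attention (in the style the summary attributes to LinRec; the exact LinRec formulation may differ) requires both an extra non-linearity φ and a row-wise normalizing denominator in place of Softmax, whereas the cosine variant needs neither:

```python
import numpy as np

def elu_feature_map_attention(Q, K, V, eps=1e-8):
    """Generic ELU+1 feature-map linear attention, shown for contrast.

    This is a textbook kernel-feature-map construction, not LinRec's exact
    implementation: phi(x) = ELU(x) + 1 keeps features positive, and a
    per-row denominator Z replaces the Softmax normalization.
    """
    def phi(X):  # ELU(x) + 1
        return np.where(X > 0, X + 1.0, np.exp(X))
    Qf, Kf = phi(Q), phi(K)
    KtV = Kf.T @ V                             # d x d summary, O(N * d^2)
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T   # N x 1 normalizer
    return (Qf @ KtV) / (Z + eps)
```

The division by a data-dependent `Z` is one source of the numerical-stability and memory-footprint differences the paragraph above describes: cosine attention bakes its normalization into the L₂ step and replaces `Z` with the single learned scalar m.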

Experiments on three public benchmarks (Amazon Beauty, MovieLens‑1M, Yelp) show that Cotten4Rec reduces memory consumption by roughly 23 % relative to BERT4Rec and LinRec, while shortening training time by 4 %–8 % versus BERT4Rec and about 20 % versus LinRec. Recommendation quality measured by HR@10 and NDCG@10 remains virtually unchanged (ΔHR < 0.5 %). The gains are most pronounced for moderate sequence lengths (200–500 items) and vocabularies of size 10⁴–10⁵.

The authors acknowledge limitations: the intermediate d × d matrix KᵀV can become memory‑intensive when the embedding dimension d is very large (e.g., d > 1024), and the current implementation is optimized for a single‑GPU setting, leaving multi‑GPU or distributed training as future work.

In summary, Cotten4Rec demonstrates that cosine‑based attention, when coupled with a unified CUDA kernel, can preserve the expressive power of bidirectional transformer encoders while delivering linear‑time and linear‑memory scaling. This makes it a compelling choice for large‑scale, resource‑constrained sequential recommendation services that require both high throughput and competitive accuracy.

