Bending the Scaling Law Curve in Large-Scale Recommendation Systems
Learning from user interaction history through sequential models has become a cornerstone of large-scale recommender systems. Recent advances in large language models have revealed promising scaling laws, sparking a surge of research into long-sequence modeling and deeper architectures for recommendation tasks. However, many recent approaches rely heavily on cross-attention mechanisms to address the quadratic computational bottleneck in sequential modeling, which can limit the representational power gained from self-attention. We present ULTRA-HSTU, a novel sequential recommendation model developed through end-to-end model and system co-design. By innovating in the design of input sequences, sparse attention mechanisms, and model topology, ULTRA-HSTU achieves substantial improvements in both model quality and efficiency. Comprehensive benchmarking demonstrates that ULTRA-HSTU achieves remarkable scaling efficiency gains – over 5x faster training scaling and 21x faster inference scaling compared to conventional models – while delivering superior recommendation quality. Our solution is fully deployed at scale, serving billions of users daily and driving significant consumption and engagement improvements of 4% to 8% in real-world production environments.
💡 Research Summary
The paper introduces ULTRA‑HSTU, a next‑generation sequential recommendation model that dramatically improves scaling efficiency for ultra‑long user interaction histories. Traditional transformer‑based recommenders such as HSTU suffer from quadratic O(L²) self‑attention cost, which becomes prohibitive when L reaches tens or hundreds of thousands. Industry workarounds—cross‑attention, truncated histories, or shallow networks—avoid the quadratic term but sacrifice the global context and depth that self‑attention provides.
ULTRA‑HSTU tackles the problem through a three‑pronged model‑system co‑design:
- Input Sequence Optimization – Item and action embeddings are merged by simple addition, halving the effective sequence length. Heterogeneous action encodings preserve signal richness while reducing token count. A “Load‑Balanced Stochastic Length” sampler equalizes per‑GPU compute load during distributed training, cutting straggler‑induced throughput loss by ~15 %.
- Semi‑Local Attention (SLA) – SLA combines a local window (size K₁) with a global window (size K₂) to achieve linear complexity O((K₁+K₂)·L). It retains most of the benefits of full self‑attention (global context) while drastically lowering FLOPs. On the systems side, the authors extend FlashAttention V3 with custom CUDA kernels that support SiLU activation, non‑standard masks, and heterogeneous GPU architectures (NVIDIA H100, AMD MI300). A mixed‑precision pipeline keeps most operations in BF16 for stability, accelerates matrix multiplications with FP8, and compresses embedding tables to INT4, reducing memory bandwidth and HBM footprint. These optimizations yield a 70 % training‑throughput gain and a 50 % inference‑throughput gain over the baseline HSTU.
- Dynamic Topological Designs – Two complementary strategies are introduced. Attention Truncation runs the first N₁ layers on the full sequence, then selects a high‑value sub‑segment for additional N₂ layers, avoiding full‑sequence cost in deeper layers. Mixture of Transducers (MoT) treats different user actions as separate sequences, each processed by its own transformer stack, and fuses the representations later, allowing the model to allocate more capacity to high‑impact signals.
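The input-sequence merge in the first point can be illustrated with a minimal NumPy sketch. The embedding dimensions and random initialization here are placeholders; the point is only that summing the item and action embeddings yields L tokens where an interleaved (item, action) sequence would need 2L:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4  # toy history length and embedding width (placeholders)

item_emb = rng.normal(size=(L, d))    # one embedding per interacted item
action_emb = rng.normal(size=(L, d))  # heterogeneous action encoding per item

# Merging by elementwise addition gives L tokens instead of the 2L tokens
# of an interleaved (item, action) sequence, halving attention length.
merged = item_emb + action_emb
assert merged.shape == (L, d)
```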
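The semi-local attention pattern can likewise be sketched as a boolean mask. This assumes one plausible reading of the windowing scheme (each query attends causally to the K₁ most recent tokens plus the first K₂ "global" tokens); the paper's exact construction may differ, but the linear-cost property is the same either way:

```python
import numpy as np

def semi_local_mask(L, k1, k2):
    """Boolean [L, L] causal mask: query i may attend to key j if j lies
    in the last k1 positions before i (local window) or among the first
    k2 positions of the sequence (global window)."""
    i = np.arange(L)[:, None]   # query index
    j = np.arange(L)[None, :]   # key index
    causal = j <= i
    local = (i - j) < k1
    global_ = j < k2
    return causal & (local | global_)

# Each query attends to at most k1 + k2 keys, so the number of attended
# (query, key) pairs grows as O((k1 + k2) * L) rather than O(L**2).
m = semi_local_mask(256, k1=8, k2=4)
assert m.sum(axis=1).max() <= 8 + 4
```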
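Attention Truncation can be sketched with toy stand-in layers. Selecting the `keep` most recent positions is a simplifying assumption here; the paper selects a "high-value sub-segment" by a criterion it does not reduce to recency:

```python
import numpy as np

def toy_layer(x):
    """Stand-in for one transformer layer (cheap elementwise update)."""
    return x + 0.01 * np.tanh(x)

def attention_truncation(x, n1, n2, keep):
    """Run n1 layers over the full sequence, then continue n2 deeper
    layers on only a selected sub-segment (here: the `keep` most recent
    positions, an illustrative assumption)."""
    for _ in range(n1):
        x = toy_layer(x)   # full-length compute over all L tokens
    x = x[-keep:]          # truncate to the high-value sub-segment
    for _ in range(n2):
        x = toy_layer(x)   # deeper layers cost O(keep), not O(L)
    return x

x = np.zeros((16000, 8))   # toy 16k-token history
out = attention_truncation(x, n1=2, n2=4, keep=512)
assert out.shape == (512, 8)
```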
Empirical evaluation uses 18‑layer models processing 16 k‑token histories on hundreds of H100 GPUs. Compared with the original C‑NE HSTU, ULTRA‑HSTU achieves 5.3× higher training scaling efficiency and 21.4× higher inference scaling efficiency (measured as performance per FLOP per item). In a production deployment serving billions of daily users, the model delivers 4 %–8 % lifts in consumption and engagement metrics and a 0.217 % uplift in core business KPIs. Memory footprint shrinks by ~30 %, and GPU utilization rises to >90 % during both training and inference.
The paper situates its contributions among prior works: DIN, SASRec, and the original HSTU introduced sequential modeling but retained quadratic cost; recent sparse‑attention methods (NSA, STCA) reduce complexity but either add costly pre‑attention projections or forgo full self‑attention, leading to performance regressions. ULTRA‑HSTU’s combination of sequence compression, linear semi‑local attention, mixed‑precision custom kernels, and dynamic depth scaling provides a concrete blueprint for “bending” the scaling law curve in recommendation systems.
In summary, ULTRA‑HSTU demonstrates that by jointly redesigning input representation, attention computation, and model topology—while tightly coupling these changes to hardware‑aware system optimizations—one can achieve LLM‑scale efficiency gains in industrial recommender systems. This work not only sets a new state‑of‑the‑art for large‑scale sequential recommendation but also offers a transferable methodology for any domain that must process ultra‑long sequences under strict latency and cost constraints.