LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation
We present LLaTTE (LLM-Style Latent Transformers for Temporal Events), a scalable transformer architecture for production ads recommendation. Through systematic experiments, we demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs. Crucially, we find that semantic features bend the scaling curve: they are a prerequisite for scaling, enabling the model to effectively utilize the capacity of deeper and longer architectures. To realize the benefits of continued scaling under strict latency constraints, we introduce a two-stage architecture that offloads the heavy computation of large, long-context models to an asynchronous upstream user model. We demonstrate that upstream improvements transfer predictably to downstream ranking tasks. Deployed as the largest user model at Meta, this multi-stage framework drives a 4.3% conversion uplift on Facebook Feed and Reels with minimal serving overhead, establishing a practical blueprint for harnessing scaling laws in industrial recommender systems.
💡 Research Summary
The paper introduces LLaTTE (LLM‑Style Latent Transformers for Temporal Events), a transformer‑based architecture designed for large‑scale ad recommendation at Meta. The authors address three core research questions: (1) how to combine sparse ID‑based factorization‑machine (FM) features with dense, sequential user‑behavior signals; (2) whether recommendation systems exhibit predictable scaling laws similar to large language models (LLMs); and (3) how to reap scaling benefits while respecting the millisecond‑level latency constraints of production serving.
Architecture
LLaTTE’s core is a “Target‑Aware Adaptive Transformer” that fuses non‑sequential features (sparse IDs, dense embeddings, float attributes) and candidate‑ad context into extended query tokens. The sequence module uses Multi‑head Latent Attention (MLA) to reduce memory footprint and an adaptive pyramidal output mechanism that progressively trims older tokens, allowing context lengths up to 5,000 events without exceeding GPU memory limits.
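The trimming idea can be sketched in a few lines. The sketch below is illustrative only: `pyramidal_trim` and the `keep_fracs` schedule are hypothetical stand-ins for the paper's adaptive pyramidal output mechanism, which we only know drops older tokens at deeper stages.

```python
def pyramidal_trim(events, keep_fracs=(1.0, 0.5, 0.25)):
    """Hypothetical sketch of pyramidal trimming: after each transformer
    stage, drop the oldest events so deeper stages attend over a
    progressively shorter, more recent context.

    events: list ordered oldest -> newest (stand-in for token embeddings).
    keep_fracs: fraction of the ORIGINAL length kept at each stage
                (values are illustrative, not from the paper).
    Returns the event window visible to each stage.
    """
    n = len(events)
    windows = []
    for frac in keep_fracs:
        keep = max(1, int(n * frac))
        events = events[-keep:]  # retain only the most recent `keep` events
        windows.append(list(events))
    return windows
```

Because each stage only sees a suffix of the previous one, attention cost shrinks geometrically with depth, which is what allows 5,000-event contexts to fit within GPU memory.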
Two‑Stage Design
To meet latency requirements, the system is split into an asynchronous upstream model and a lightweight online ranking model. The upstream model processes the full user history with >45× the FLOPs of the online model, generating compressed user embeddings that are cached for later use. The online model consumes only recent short‑horizon events (≈100–200) and accounts for roughly 30% of the total inference FLOPs. Both stages share the same transformer backbone but operate at vastly different compute budgets.
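A minimal sketch of this split, with hypothetical names (`TwoStagePipeline`, `refresh_user`, `rank`) standing in for the production system: the heavy upstream model writes a compressed user embedding to a cache off the request path, and the online ranker reads that cached summary plus a short window of recent events at serving time.

```python
class TwoStagePipeline:
    """Illustrative sketch (not Meta's implementation) of the two-stage
    split: a heavy upstream model summarizes the full user history
    asynchronously, while the lightweight online model touches only
    recent events plus the cached summary at request time."""

    def __init__(self, upstream_model, online_model, recent_window=200):
        self.upstream = upstream_model    # full-history, >45x FLOPs, offline
        self.online = online_model        # short-horizon, latency-critical
        self.cache = {}                   # user_id -> compressed embedding
        self.recent_window = recent_window

    def refresh_user(self, user_id, full_history):
        """Runs asynchronously, e.g. triggered by a high-value user event."""
        self.cache[user_id] = self.upstream(full_history)  # the expensive pass

    def rank(self, user_id, recent_events, candidate_ad):
        """Request-time path: a cache read plus a small forward pass."""
        cached = self.cache.get(user_id)
        window = recent_events[-self.recent_window:]
        return self.online(cached, window, candidate_ad)
```

The key property is that `rank` never invokes the upstream model, so online latency is bounded by the small model regardless of how large the upstream transformer grows.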
Scaling Law Experiments
The authors systematically vary three axes—depth (L), width (d), and sequence length (T)—while keeping the non‑sequence backbone fixed. Performance is measured with Normalized Entropy (NE) reduction. Across a wide FLOP range, NE improves in a log‑linear fashion with respect to compute, mirroring the power‑law scaling observed in LLMs. Crucially, the presence of rich semantic features (e.g., content embeddings from a separate text encoder) “bends” the scaling curve: without these features, increasing model size yields diminishing returns, whereas with them the curve steepens dramatically. Width acts as a capacity bottleneck; only when d ≥ 2048 does additional depth translate into measurable NE gains. Sequence length is the most effective lever, but its benefit is amplified only when semantic features are present.
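The reported log-linear relationship can be characterized with an ordinary least-squares fit of NE against log-compute. The helper below is illustrative; the functional form NE ≈ a + b·log10(FLOPs) is assumed from the paper's description, and the data passed in would be measured NE at each compute budget.

```python
import math

def fit_loglinear(flops, ne):
    """Least-squares fit of NE = a + b * log10(FLOPs).

    Pure-Python OLS with a single regressor. A negative slope b means
    NE improves (decreases) log-linearly with compute; comparing fitted
    slopes with and without semantic features would quantify how much
    those features 'bend' the scaling curve.
    Returns (a, b).
    """
    xs = [math.log10(f) for f in flops]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ne) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ne)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b
```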
Transfer Between Stages
Improvements in the upstream model transfer predictably to the downstream ranking stage: a 0.2 %p NE gain upstream yields roughly a 0.1 %p gain downstream, a transfer ratio of ~50%. This indicates that the cached embeddings preserve most of the upstream model's learned knowledge despite the compression bottleneck of the embedding interface between the two stages.
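The transfer-ratio arithmetic is simple enough to spell out; `transfer_ratio` is a hypothetical helper, shown with the paper's reported figures plugged in.

```python
def transfer_ratio(upstream_ne_gain_pp, downstream_ne_gain_pp):
    """Fraction of an upstream NE gain (in percentage points) that is
    realized by the downstream ranking model through the cached
    embedding interface."""
    return downstream_ne_gain_pp / upstream_ne_gain_pp

# Reported figures: 0.2 pp upstream -> ~0.1 pp downstream, i.e. ~50%.
ratio = transfer_ratio(0.2, 0.1)
```

A stable ratio across experiments is what makes upstream scaling actionable: one can forecast downstream NE movement from offline upstream metrics before paying for an online test.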
Production Deployment
LLaTTE has been deployed as the largest user model at Meta for Facebook Feed and Reels. In live A/B tests it delivers a 4.3% uplift in conversion rate and a 0.25 %p improvement in NE, while keeping online latency within the 2–3 ms SLA. The asynchronous cache is refreshed on high‑value user events, ensuring that embeddings remain fresh without adding to request‑time latency.
Key Insights
- Recommendation systems obey power‑law scaling similar to LLMs, but only when data richness (semantic embeddings) is sufficient.
- Model width must reach a critical threshold before depth scaling becomes effective, echoing capacity bottlenecks seen in language modeling.
- A two‑stage architecture cleanly separates heavy, offline computation from latency‑critical online inference, enabling the use of very large transformers in production.
- The transfer ratio of ~50 % validates that asynchronous pre‑computed embeddings can convey most of the upstream model’s gains to the real‑time ranking stage.
Limitations and Future Work
The current system relies on periodic asynchronous updates; extremely rapid user behavior changes could outpace cache refreshes. Moreover, while pyramidal trimming reduces compute, it discards older context, which may be valuable for long‑term interest modeling. Future research directions include retrieval‑augmented transformers, more sophisticated token compression, and exploring scaling beyond 5,000‑step histories.
In summary, LLaTTE demonstrates a practical blueprint for harnessing LLM‑style scaling laws in industrial recommender systems by (i) integrating sequential and non‑sequential features via an efficient transformer, (ii) employing a multi‑stage inference pipeline to satisfy latency constraints, (iii) empirically characterizing how depth, width, sequence length, and semantic feature richness interact, and (iv) validating the approach with substantial real‑world performance gains and negligible serving overhead.