GRAB: An LLM-Inspired Sequence-First Click-Through Rate Prediction Modeling Paradigm

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Traditional Deep Learning Recommendation Models (DLRMs) face increasing bottlenecks in performance and efficiency, often struggling with generalization and long-sequence modeling. Inspired by the scaling success of Large Language Models (LLMs), we propose Generative Ranking for Ads at Baidu (GRAB), an end-to-end generative framework for Click-Through Rate (CTR) prediction. GRAB integrates a novel Causal Action-aware Multi-channel Attention (CamA) mechanism to effectively capture temporal dynamics and specific action signals within user behavior sequences. Full-scale online deployment demonstrates that GRAB significantly outperforms established DLRMs, delivering a 3.05% increase in revenue and a 3.49% rise in CTR. Furthermore, the model demonstrates desirable scaling behavior: its expressive power shows a monotonic and approximately linear improvement as longer interaction sequences are utilized.


💡 Research Summary

The paper introduces GRAB (Generative Ranking for Ads at Baidu), an end‑to‑end generative framework for click‑through‑rate (CTR) prediction that bridges the gap between traditional Deep Learning Recommendation Models (DLRMs) and recent generative recommendation (GR) approaches inspired by large language models (LLMs). Traditional DLRMs excel at handling high‑cardinality sparse features but struggle with long user behavior sequences, often exhibiting “strong memory, weak reasoning” and suffering from diminishing returns as data and model size grow. Conversely, LLM‑style models demonstrate predictable scaling laws, where performance improves monotonically with more parameters, data, and compute. GRAB leverages this insight to create a system that can both memorize massive sparse feature tables and reason over long, heterogeneous user interaction histories.

GRAB’s architecture consists of three stages: (1) a sparse feature layer that converts raw logs into event‑level ID sequences using standard DLRM feature engineering; (2) a dense tokenizer that aggregates per‑event field embeddings and projects them into a fixed‑dimensional token space, preserving temporal order; and (3) an autoregressive‑like sequence modeling layer built on a Transformer backbone. The core novelty lies in the Causal Action‑aware Multi‑channel Attention (CamA) mechanism. CamA splits the attention computation into multiple channels, each dedicated to a specific action type (e.g., exposure, click, conversion). It augments standard multi‑head self‑attention with three relative bias embeddings: position, action, and time. This allows the model to directly encode causal and temporal signals, improving its ability to capture nuanced user intent dynamics.
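As a concrete illustration of the CamA idea described above, here is a minimal single-head NumPy sketch. The bias tables are random stand-ins for learned parameters, and the bucketing of time gaps and the exact channel layout are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cama_attention(x, actions, times, n_actions=3, max_rel_time=16):
    """Single-head sketch of causal action-aware attention: standard
    scaled dot-product scores plus additive relative position, action-pair,
    and time-gap biases, under a causal (no-future) mask."""
    T, d = x.shape
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    # learned bias tables (random stand-ins here)
    pos_bias = rng.normal(size=(2 * T - 1,)) * 0.1           # relative position
    act_bias = rng.normal(size=(n_actions, n_actions)) * 0.1  # action pair
    time_bias = rng.normal(size=(max_rel_time,)) * 0.1        # bucketed time gap

    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(T)
    rel = idx[:, None] - idx[None, :]                 # query index minus key index
    scores = scores + pos_bias[rel + T - 1]
    scores = scores + act_bias[actions[:, None], actions[None, :]]
    dt = np.clip(np.abs(times[:, None] - times[None, :]), 0, max_rel_time - 1)
    scores = scores + time_bias[dt]
    scores = np.where(rel >= 0, scores, -np.inf)      # causality: no future leakage
    return softmax(scores) @ v
```

In a full multi-channel version each action type would get its own attention channel (its own projections and bias tables) with the channel outputs combined afterwards; the sketch collapses this to one channel for brevity.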

A second major contribution is the Sequence‑Then‑Sparse (STS) training strategy. When packing variable‑length user sequences into mini‑batches, naïve packing creates high intra‑user correlation, violating the i.i.d. assumption required for stochastic gradient descent and leading to “distribution skew”. STS decouples the optimization of dense Transformer weights from the sparse embedding tables: first the dense parameters are updated on the packed sequences, then a separate step fine‑tunes the sparse embeddings using the same batch. This resolves gradient conflicts, stabilizes convergence, and incurs no extra computational cost.
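The two-phase update can be sketched on a toy logistic model. This is a simplified stand-in, assuming a single dense weight vector and a small embedding table rather than the paper's Transformer and production-scale sparse tables; the point is only the ordering of the two updates on the same batch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy model: sparse embedding table E plus "dense" parameters w
E = rng.normal(size=(100, 8)) * 0.1   # one embedding row per sparse ID
w = rng.normal(size=(8,)) * 0.1

def sts_step(ids, labels, lr_dense=0.1, lr_sparse=0.1):
    """One Sequence-Then-Sparse step on a packed batch:
    phase 1 updates the dense weights with embeddings frozen,
    phase 2 updates only the touched sparse rows using the new
    dense weights, so the two gradients never conflict."""
    global w
    # --- phase 1: dense update, sparse frozen ---
    emb = E[ids]                                   # treated as constants
    p = sigmoid(emb @ w)
    grad_w = emb.T @ (p - labels) / len(ids)       # BCE gradient w.r.t. w
    w = w - lr_dense * grad_w
    # --- phase 2: sparse update with the updated dense weights ---
    p = sigmoid(E[ids] @ w)
    grad_rows = np.outer(p - labels, w) / len(ids)
    np.add.at(E, ids, -lr_sparse * grad_rows)      # handles repeated IDs
```

`np.add.at` accumulates gradients correctly even when the same ID appears several times in one packed batch, mirroring how sparse embedding updates are scattered in a real training system.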

GRAB also introduces user‑isolated causal masking and heterogeneous visibility masks. Tokens are packed per user to eliminate padding waste, and a block‑diagonal lower‑triangular mask enforces both user isolation and causality (no future leakage). Within each user’s packed stream, two token types are defined: “partial” tokens that contain only time‑varying fields for historical context, and “full” tokens that retain the complete feature set needed to score the current candidate ad. The visibility mask ensures that full tokens attend only to preceding partial tokens and themselves, never to other full tokens, thereby preserving streaming‑style inference semantics.
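The masking rules above can be written down directly. The following is a minimal NumPy sketch, assuming per-token user IDs and a boolean "full token" flag as inputs (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def build_visibility_mask(user_ids, is_full):
    """Block-diagonal causal mask with heterogeneous visibility.
    Token i may attend to token j iff:
      - same user (user isolation),
      - j <= i (causality, no future leakage), and
      - j is a partial token, or j == i (full tokens are never
        visible to any other token)."""
    user_ids = np.asarray(user_ids)
    is_full = np.asarray(is_full, dtype=bool)
    T = len(user_ids)
    idx = np.arange(T)
    same_user = user_ids[:, None] == user_ids[None, :]
    causal = idx[:, None] >= idx[None, :]
    key_visible = ~is_full[None, :] | (idx[:, None] == idx[None, :])
    return same_user & causal & key_visible
```

Because the mask depends only on user boundaries and token types, it is block-diagonal lower-triangular by construction, which is what preserves streaming-style inference semantics when candidate (full) tokens for the same user are scored in one packed batch.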

Empirical evaluation on Baidu’s production advertising platform shows that GRAB outperforms strong industrial DLRM baselines and recent GR models. Offline, it achieves a 0.19% relative AUC gain over the best baseline. In online A/B tests, GRAB delivers a 3.05% increase in cost‑per‑mille (CPM) and a 3.49% lift in CTR. Scaling experiments reveal a near‑linear improvement in AUC as both model capacity and behavior sequence length increase, confirming the model’s ability to benefit from longer user histories without saturation.

The paper acknowledges potential limitations: CamA’s multi‑channel design and relative bias embeddings increase memory and compute requirements, possibly demanding higher‑end hardware; the STS training pipeline adds engineering complexity; and the current evaluation focuses on advertising, leaving generalization to other recommendation domains open. Future work could explore lightweight attention variants, distributed training optimizations, and cross‑domain validation.

Overall, GRAB demonstrates a practical, scalable solution that unifies the memorization power of DLRMs with the reasoning capabilities of LLM‑style generative models, offering a promising direction for next‑generation industrial recommendation systems.

