Distribution-Aware End-to-End Embedding for Streaming Numerical Features in Click-Through Rate Prediction
This paper explores effective numerical feature embedding for Click-Through Rate prediction in streaming environments. Conventional static binning methods rely on offline statistics of numerical distributions; however, this inherently two-stage process often triggers semantic drift during bin boundary updates. While neural embedding methods enable end-to-end learning, they often discard explicit distributional information. Integrating such information end-to-end is challenging because streaming features often violate the i.i.d. assumption, precluding unbiased estimation of the population distribution via the expectation of order statistics. Furthermore, the critical context dependency of numerical distributions is often neglected. To this end, we propose DAES, an end-to-end framework designed to tackle numerical feature embedding in streaming training scenarios by integrating distributional information with an adaptive modulation mechanism. Specifically, we introduce an efficient reservoir-sampling-based distribution estimation method and two field-aware distribution modulation strategies to capture streaming distributions and field-dependent semantics. DAES significantly outperforms existing approaches as demonstrated by extensive offline and online experiments and has been fully deployed on a leading short-video platform with hundreds of millions of daily active users.
💡 Research Summary
The paper addresses the problem of embedding numerical features for click‑through‑rate (CTR) prediction in a streaming training environment. Traditional static binning methods rely on offline statistics to define bucket boundaries; when those boundaries are updated, the mapping from raw values to bucket indices changes, causing semantic drift. Neural‑based embedding approaches avoid drift but ignore the underlying distribution of the feature, which can be crucial for stable learning, especially when data are non‑i.i.d. and distributions shift over time.
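The semantic-drift problem can be made concrete with a toy example (illustrative only, not from the paper): a feature bucketized by offline-computed boundaries maps the same raw value to a different bucket index once the boundaries are refreshed, so the value suddenly looks up a different embedding row.

```python
import bisect

def bucketize(value, boundaries):
    """Map a raw value to a bucket index via binary search over boundaries."""
    return bisect.bisect_right(boundaries, value)

# Hypothetical boundaries from two offline statistics runs.
old_boundaries = [10.0, 30.0, 60.0]   # computed in week 1
new_boundaries = [15.0, 45.0, 90.0]   # refreshed in week 2

value = 40.0
print(bucketize(value, old_boundaries))  # 2
print(bucketize(value, new_boundaries))  # 1: same value, different embedding row
```

After the boundary update, every embedding learned for the old bucket indices is applied to a shifted population of raw values, which is exactly the drift the end-to-end approach is designed to avoid.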
To overcome these limitations, the authors propose DAES (Distribution‑Aware End‑to‑End Embedding for Streaming), a framework that integrates distribution estimation, quantile‑space encoding, and field‑aware modulation into a single end‑to‑end trainable pipeline. The key components are:
- Reservoir sampling with jump sampling – a memory-efficient algorithm that maintains a representative sample of the entire stream. When a new record arrives, it replaces an existing sample with a probability that depends on the stream size seen so far, ensuring that recent data are reflected while historical diversity is preserved. This avoids the i.i.d. assumption required by order-statistics-based quantile estimators.
- Quantile-space encoding – instead of feeding raw numeric values directly into an embedding network, the current cumulative distribution function (estimated from the reservoir) is used to map each value to its quantile (a number in [0, 1]), giving the network a bounded, distribution-aware input.
- Field-aware distribution modulation – two strategies that adapt the quantile-based embedding to each feature field, capturing the field-dependent semantics of numerical distributions.
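The reservoir-and-quantile pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are invented, and classic Algorithm R reservoir sampling (with uniform replacement probability `capacity / count`) stands in for the paper's jump-sampling variant.

```python
import bisect
import random

class StreamingQuantileEncoder:
    """Maintain a fixed-size reservoir over a numeric stream and map
    values to their empirical quantile in [0, 1]."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.reservoir = []   # kept sorted for fast empirical-CDF lookup
        self.count = 0        # total items seen so far
        self.rng = random.Random(seed)

    def update(self, value):
        """Algorithm R: the new item replaces a uniformly chosen slot with
        probability capacity / count, so every item seen so far remains in
        the reservoir with equal probability."""
        self.count += 1
        if len(self.reservoir) < self.capacity:
            bisect.insort(self.reservoir, value)
        elif self.rng.random() < self.capacity / self.count:
            victim = self.rng.randrange(self.capacity)
            self.reservoir.pop(victim)
            bisect.insort(self.reservoir, value)

    def quantile(self, value):
        """Empirical CDF of the reservoir: fraction of samples <= value."""
        if not self.reservoir:
            return 0.5  # no information yet; fall back to the median
        return bisect.bisect_right(self.reservoir, value) / len(self.reservoir)
```

In a DAES-like pipeline, this quantile, rather than the raw value, would be the input to the downstream embedding network, so the representation stays bounded and tracks the evolving stream distribution.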