Design and Evaluation of Whole-Page Experience Optimization for E-commerce Search

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv source.

E-commerce Search Results Pages (SRPs) are evolving from linear lists to complex, non-linear layouts, rendering traditional position-biased ranking models insufficient. Moreover, existing optimization frameworks typically maximize short-term signals (e.g., clicks, same-day revenue) because long-term satisfaction metrics (e.g., expected two-week revenue) involve delayed feedback and challenging long-horizon credit attribution. To bridge these gaps, we propose a novel Whole-Page Experience Optimization Framework. Unlike traditional list-wise rankers, our approach explicitly models the interplay between item relevance, 2D positional layout, and visual elements. We use a causal framework to develop metrics for measuring long-term user satisfaction based on quasi-experimental data. We validate our approach through industry-scale A/B testing, where the model demonstrated a 1.86% improvement in brand relevance (our primary customer experience metric) while simultaneously achieving a statistically significant revenue uplift of +0.05%.


💡 Research Summary

The paper addresses the growing complexity of e‑commerce Search Results Pages (SRPs), which have evolved from simple ranked lists to rich two‑dimensional layouts that intermix organic results with themed widgets and other visual elements. Traditional ranking approaches that rely on a one‑dimensional position bias are ill‑suited for such non‑linear attention patterns, and most existing optimization frameworks focus on short‑term signals such as clicks or same‑day revenue. The authors identify four critical gaps: (1) the multi‑objective nature of the problem (engagement, revenue, long‑term satisfaction), (2) delayed reward signals, (3) complex satisfaction functions that depend on both content and 2‑D position, and (4) heterogeneous content types (organic results vs. widgets).

To fill these gaps, they formulate SRP optimization as a contextual bandit problem where each action corresponds to selecting a page template (a specific arrangement of widgets and results) and ordering its eligible items. The context consists of “3C” features – Context (query, device, marketplace, etc.), Customer (membership status, past behavior), and Content (relevance, brand alignment, value signals).

The core methodological contribution is the Downstream Value of Whole‑Page Experience (DV‑WPX) causal framework. DV‑WPX treats variations in observable page‑quality metrics as quasi‑experimental shocks, assuming that conditional on historical customer features, these variations are as good as random. A structural equation links long‑term revenue (12‑week post‑search) to short‑term revenue (2‑week), engagement, and a vector of quality metrics Q. By taking the total derivative, the authors decompose the effect of Q into three channels: direct impact on long‑term revenue, indirect impact through short‑term revenue, and indirect impact through short‑term engagement.
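The three-channel decomposition can be written out as a worked equation. The notation here is an assumption made for illustration and is not taken from the paper: R^LT denotes long-term revenue, R^ST short-term revenue, E short-term engagement, and Q a single component of the quality vector (the same expression applies to each component in turn).

```latex
\begin{align}
R^{\mathrm{LT}} &= f\bigl(R^{\mathrm{ST}}(Q),\, E(Q),\, Q\bigr), \\
\frac{dR^{\mathrm{LT}}}{dQ}
  &= \underbrace{\frac{\partial f}{\partial Q}}_{\text{direct}}
   + \underbrace{\frac{\partial f}{\partial R^{\mathrm{ST}}}\,
                 \frac{dR^{\mathrm{ST}}}{dQ}}_{\text{via short-term revenue}}
   + \underbrace{\frac{\partial f}{\partial E}\,
                 \frac{dE}{dQ}}_{\text{via short-term engagement}}.
\end{align}
```

Under a linear structural equation, each partial derivative reduces to a constant coefficient, so the total effect of Q is a sum of one direct coefficient and two products of coefficients.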

Estimation proceeds via Double Machine Learning (DML) to control for confounding. After de‑averaging fixed effects across queries and ZIP codes, the data are split 90/10 for out‑of‑sample validation. Residuals of both target (long‑term revenue) and surrogate (short‑term metrics, quality signals) are obtained using cross‑fitted linear learners. The final stage regresses the residualized target on residualized surrogates using OLS or LASSO, yielding β coefficients that represent the causal impact of each quality metric. The DV‑WPX score for a search event is the weighted sum of β·X across all quality features.
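The residual-on-residual step described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names (`cross_fit_residuals`, `dml_partial_out`), the fold count, and the data-generating process are all assumptions, and the fixed-effect removal, the 90/10 split, and the LASSO variant are omitted.

```python
import numpy as np

def cross_fit_residuals(y, controls, n_folds=5, seed=0):
    """Residualize y on controls using out-of-fold OLS predictions."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    design = np.column_stack([np.ones(n), controls])  # intercept + controls
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    resid = np.empty_like(y)
    for held_out in folds:
        train = np.ones(n, dtype=bool)
        train[held_out] = False
        # Fit on the other folds, predict on the held-out fold.
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        resid[held_out] = y[held_out] - design[held_out] @ coef
    return resid

def dml_partial_out(target, surrogates, controls, n_folds=5, seed=0):
    """Final stage: OLS of the residualized target on residualized surrogates."""
    r_target = cross_fit_residuals(target, controls, n_folds, seed)
    r_surr = cross_fit_residuals(surrogates, controls, n_folds, seed)
    beta, *_ = np.linalg.lstsq(r_surr, r_target, rcond=None)
    return beta

# Synthetic data (hypothetical): W confounds both quality metrics and revenue.
rng = np.random.default_rng(42)
n = 5000
W = rng.normal(size=(n, 3))                      # historical customer features
Q = W @ rng.normal(size=(3, 2)) + rng.normal(size=(n, 2))  # two quality metrics
true_beta = np.array([0.5, -0.2])
long_term = Q @ true_beta + W @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

beta_hat = dml_partial_out(long_term, Q, W)
scores = Q @ beta_hat   # per-event score as a beta-weighted sum of quality features
print(beta_hat)
```

Because the nuisance fits are made on held-out folds, the final-stage coefficients are not contaminated by overfitting in the first stage, which is the usual motivation for cross-fitting.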

Using this framework, the authors construct a concrete user‑satisfaction metric: Pixel and Region Weighted Whole‑Page Brand Match Rate (PR‑WP‑BMR). The page is divided into three regions—Top (positions 1‑8), Middle (9‑16), Bottom (17+). Within each region, brand‑match rates are weighted by pixel coverage, reflecting visual prominence. Region weights can be derived from short‑term click‑through‑rate (CTR) distributions or from DV‑WPX estimates of downstream value. The DV‑WPX‑based weights (≈0.63 Top, 0.37 Middle, 0 Bottom) emphasize the importance of the upper two regions while effectively ignoring the bottom region.
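One plausible reading of the metric is sketched below: within each region, the brand-match rate is the pixel-weighted share of matching slots, and the region rates are combined with the DV-WPX-based weights quoted above. The exact formula, the `Slot` type, and the treatment of empty regions (they contribute zero) are assumptions for illustration and may differ from the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    position: int       # 1-based position on the page
    pixels: float       # pixel area occupied by the item
    brand_match: bool   # whether the item matches the searched brand

# Region weights as reported in the summary (DV-WPX-based).
REGION_WEIGHTS = {"top": 0.63, "middle": 0.37, "bottom": 0.0}

def region_of(position):
    """Map a position to Top (1-8), Middle (9-16), or Bottom (17+)."""
    if position <= 8:
        return "top"
    if position <= 16:
        return "middle"
    return "bottom"

def pr_wp_bmr(slots, region_weights=REGION_WEIGHTS):
    """Pixel-weighted brand-match rate per region, combined by region weight."""
    pixel_total = {r: 0.0 for r in region_weights}
    pixel_match = {r: 0.0 for r in region_weights}
    for s in slots:
        r = region_of(s.position)
        pixel_total[r] += s.pixels
        if s.brand_match:
            pixel_match[r] += s.pixels
    score = 0.0
    for r, w in region_weights.items():
        if pixel_total[r] > 0:   # assumption: empty regions contribute nothing
            score += w * pixel_match[r] / pixel_total[r]
    return score

page = [
    Slot(position=1, pixels=300, brand_match=True),
    Slot(position=2, pixels=100, brand_match=False),
    Slot(position=9, pixels=200, brand_match=True),
    Slot(position=20, pixels=100, brand_match=False),
]
score = pr_wp_bmr(page)   # 0.63 * 0.75 + 0.37 * 1.0 = 0.8425
print(round(score, 4))
```

With a zero weight on the bottom region, items at position 17 and beyond have no influence on the score in this sketch, which mirrors the summary's remark that the bottom region is effectively ignored.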

The PR‑WP‑BMR metric is then incorporated into a production page‑template ranker. Separate Bayesian linear regression models predict revenue, while Bayesian probit models predict binary non‑abandonment. For PR‑WP‑BMR, the same modeling pipeline is used. At inference time, Thompson sampling draws posterior samples for each objective; a weighted linear combination of the sampled predictions produces a single scalar reward. The template with the highest reward is displayed.
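The inference-time selection step can be illustrated as follows. This is a hedged sketch rather than the production ranker: `blr_posterior` uses the standard conjugate Gaussian posterior for Bayesian linear regression with known noise precision, the probit objective is represented by an assumed Gaussian approximation to its weight posterior, and the feature dimensions, objective weights, and all names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def blr_posterior(X, y, prior_precision=1.0, noise_precision=1.0):
    """Gaussian posterior N(mean, cov) over weights for Bayesian linear regression."""
    d = X.shape[1]
    precision = prior_precision * np.eye(d) + noise_precision * X.T @ X
    cov = np.linalg.inv(precision)
    mean = noise_precision * cov @ X.T @ y
    return mean, cov

def probit_link(z):
    """Standard normal CDF, applied elementwise."""
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

def thompson_select(template_features, posteriors, weights, links, rng):
    """Draw one posterior sample per objective, combine linearly, pick the argmax."""
    rewards = np.zeros(template_features.shape[0])
    for name, (mean, cov) in posteriors.items():
        sampled_w = rng.multivariate_normal(mean, cov)   # one Thompson draw
        pred = template_features @ sampled_w
        link = links.get(name)
        if link is not None:
            pred = link(pred)
        rewards += weights[name] * pred
    return int(np.argmax(rewards)), rewards

# Hypothetical setup: 4 features, 3 candidate templates.
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(200, 4))
revenue = X_hist @ np.array([1.0, 0.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=200)

posteriors = {
    "revenue": blr_posterior(X_hist, revenue),
    # Assumed Gaussian approximations; a real probit model needs approximate inference.
    "non_abandonment": (np.zeros(4), 0.1 * np.eye(4)),
    "pr_wp_bmr": (np.array([0.3, 0.1, 0.0, 0.0]), 0.05 * np.eye(4)),
}
weights = {"revenue": 1.0, "non_abandonment": 0.5, "pr_wp_bmr": 0.5}  # illustrative
links = {"non_abandonment": probit_link}

templates = rng.normal(size=(3, 4))
choice, rewards = thompson_select(templates, posteriors, weights, links, rng)
print(choice, rewards)
```

Sampling from the posterior instead of using its mean means that templates with uncertain predictions are occasionally chosen, which is how Thompson sampling trades off exploration against exploitation.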

Evaluation comprises offline and online experiments. Offline, the addition of content‑aware features improves revenue prediction RMSE by 6 % on both mobile and desktop and slightly raises desktop non‑abandonment AUC by 1 %. Online, three treatments are compared: (1) a control without any satisfaction signal, (2) a treatment using CTR‑based PR‑WP‑BMR, and (3) a treatment using DV‑WPX‑based PR‑WP‑BMR. The DV‑WPX‑based treatment achieves a 1.86 % lift in the primary brand‑relevance metric (the authors’ main customer‑experience KPI) and a statistically significant revenue uplift of +0.05 %.

The paper’s contributions are threefold: (i) a causal inference framework that leverages quasi‑experimental variation to connect page‑quality signals with long‑term customer spend, (ii) a region‑aware, pixel‑weighted satisfaction metric that captures the non‑linear impact of layout, and (iii) a practical multi‑objective template ranker that integrates the metric into a large‑scale production system. Limitations include the linearity assumption in the causal model, potential lack of generality across different widget designs, and the focus on a 12‑week horizon. Future work could explore non‑linear causal learners, longer observation windows, and application to other domains such as video or news feeds.

