On the Convergence of Multicalibration Gradient Boosting
Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its convergence properties are not well understood. In this paper, we bridge the gap by providing convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. We show that the magnitude of successive prediction updates decays at $O(1/\sqrt{T})$, which implies the same convergence rate bound for the multicalibration error over rounds. Under additional smoothness assumptions on the weak learners, this rate improves to linear convergence. We further analyze adaptive variants, showing local quadratic convergence of the training loss, and we study rescaling schemes that preserve convergence. Experiments on real-world datasets support our theory and clarify the regimes in which the method achieves fast convergence and strong multicalibration.
💡 Research Summary
The paper provides the first rigorous convergence analysis of multicalibration gradient boosting (MC‑GB) for regression with squared‑error loss. Multicalibration requires that predictions be calibrated not only overall but across a potentially exponential family of overlapping sub‑populations. Recent practical systems such as MC‑GRAD have shown impressive empirical performance, yet no theory guarantees that the sequence of predictors actually converges to a multicalibrated model, nor how fast this convergence occurs.
The authors formalize MC‑GB as a discrete‑time dynamical system. At round t the algorithm builds a matrix B(f_t) whose columns are the evaluations of all weak learners b_j on the current predictions f_t. Assuming an ideal boosting oracle that solves the least‑squares problem exactly, the update can be written as
f_{t+1} = w_t (f_t + η A(f_t)(y − f_t)),
where A(f)=B(f)B(f)^+ is the orthogonal projector onto the span of the weak‑learner features and w_t is an optional rescaling weight (normally 1). The multicalibration error vector Ē(f) is proportional to the product of the matrix norm ‖B(f)‖ and the prediction gap ‖f_{t+1}‑f_t‖, so proving that the gap shrinks to zero directly yields asymptotic multicalibration.
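Under these definitions, one round of the update can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `mcgb_round`, the two toy weak learners, and the random data are all assumptions.

```python
import numpy as np

def mcgb_round(f, y, weak_learners, eta=0.5, w=1.0):
    """One sketched MC-GB round: form B(f) column-by-column from the
    weak-learner evaluations at the current predictions f, build the
    orthogonal projector A(f) = B(f) B(f)^+, and take a damped step."""
    B = np.column_stack([b(f) for b in weak_learners])  # n x J feature matrix
    A = B @ np.linalg.pinv(B)  # orthogonal projector onto span of B(f)
    return w * (f + eta * A @ (y - f))

# Toy trajectory: for eta in (0, 2) and w = 1 the squared-residual
# Lyapunov function ||y - f_t||^2 should be non-increasing.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
f = np.zeros(50)
weak_learners = [
    lambda f: np.ones_like(f),                   # constant learner
    lambda f: (f > np.median(f)).astype(float),  # level-set indicator
]
losses = []
for _ in range(20):
    losses.append(np.sum((y - f) ** 2))
    f = mcgb_round(f, y, weak_learners)
losses.append(np.sum((y - f) ** 2))
```

Note that because the projector step can only remove the component of the residual lying in span(B), the loss is monotone even though the feature matrix is rebuilt from the moving predictions each round.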
Main theoretical contributions
- Fundamental sub‑linear convergence – Using a Lyapunov function equal to the squared residual norm ‖y‑f_t‖², the authors show that the residual loss never increases and that the sum of squared gaps over T iterations is bounded by the initial loss. Consequently the minimum gap satisfies
min_{0 ≤ t < T} ‖f_{t+1} − f_t‖ ≤ √η ‖y − f_0‖ / √T,
i.e., the gap (and thus the multicalibration error) decays as O(1/√T).
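A quick numerical check of this telescoping argument, as a sketch that freezes the projector to a single random A (i.e., ignores its dependence on f_t); sizes, seed, and step size are arbitrary assumptions:

```python
import numpy as np

# The telescoping bound: the sum of squared gaps over T rounds is at most
# eta * ||y - f_0||^2 (for eta <= 1), so the minimum gap over T rounds
# obeys min gap <= sqrt(eta) * ||y - f_0|| / sqrt(T).
rng = np.random.default_rng(1)
n, J, T, eta = 60, 5, 200, 0.5
B = rng.normal(size=(n, J))
A = B @ np.linalg.pinv(B)  # fixed orthogonal projector as stand-in for A(f_t)
y = rng.normal(size=n)
f = np.zeros(n)  # so ||y - f_0|| = ||y||
gaps = []
for _ in range(T):
    f_next = f + eta * A @ (y - f)
    gaps.append(np.linalg.norm(f_next - f))
    f = f_next
min_gap = min(gaps)
bound = np.sqrt(eta) * np.linalg.norm(y) / np.sqrt(T)
```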
- Linear (geometric) convergence under smoothness – If the projector A(f) is Lipschitz continuous with constant L_A along the trajectory and the product η L_A ‖y‑f_0‖² is small enough, the update map becomes a contraction with factor
κ = 1 − η + η L_A ‖y − f_0‖² < 1.
In this regime the gap shrinks geometrically, giving linear convergence of both the prediction gap and the multicalibration error. The Lipschitz condition holds when the weak‑learner class is sufficiently smooth (e.g., regression trees with continuous leaf values or differentiable parametric models).
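The limiting case L_A = 0 (a frozen projector) makes the contraction exact: successive gaps shrink by precisely κ = 1 − η. A sketch of this extreme, with arbitrary random data as an assumption:

```python
import numpy as np

# With a fixed projector A, the residual component inside span(B) decays
# by (1 - eta) each round, so the gap ratio equals kappa = 1 - eta exactly.
rng = np.random.default_rng(2)
n, J, eta = 40, 4, 0.3
B = rng.normal(size=(n, J))
A = B @ np.linalg.pinv(B)
y = rng.normal(size=n)
f = np.zeros(n)
gaps = []
for _ in range(30):
    f_next = f + eta * A @ (y - f)
    gaps.append(np.linalg.norm(f_next - f))
    f = f_next
ratios = [gaps[t + 1] / gaps[t] for t in range(len(gaps) - 1)]
```

With evolving features the projector is only approximately constant, which is where the η L_A ‖y − f_0‖² correction in κ comes from.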
- Rescaling variants – Practical deployments often use a schedule w_t∈(0,1] that approaches 1 to mitigate over‑fitting. The authors prove that any schedule converging to 1 preserves the O(1/√T) rate. Moreover, when w_t is chosen adaptively to minimize the current loss (essentially a line‑search on ‖y‑f_t‖), the algorithm enjoys a “local quadratic” phase: after the gap becomes sufficiently small, the next gap satisfies
‖f_{t+1} − f_t‖ ≤ C ‖f_t − f_{t−1}‖²,
so the loss converges quadratically near the optimum.
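For squared error, this adaptive rescaling admits a closed form: minimizing ‖y − w g‖² over w, where g = f_t + η A(f_t)(y − f_t) is the unrescaled candidate, gives w* = ⟨y, g⟩ / ⟨g, g⟩. A sketch with a frozen projector; the helper name `adaptive_step` and all data are illustrative assumptions:

```python
import numpy as np

def adaptive_step(f, y, A, eta):
    """One step with exact line search on the rescaling weight:
    candidate g = f + eta * A (y - f); w* = <y, g> / <g, g> minimizes
    ||y - w g||^2 over w, so it is never worse than w = 1."""
    g = f + eta * A @ (y - f)
    w = float(y @ g) / float(g @ g)
    return w * g, w

rng = np.random.default_rng(3)
n, J, eta = 50, 6, 0.5
B = rng.normal(size=(n, J))
A = B @ np.linalg.pinv(B)  # frozen projector for illustration
y = rng.normal(size=n)
f = np.zeros(n)
f_adaptive, w_star = adaptive_step(f, y, A, eta)
f_plain = f + eta * A @ (y - f)  # plain update, w = 1
loss_adaptive = np.sum((y - f_adaptive) ** 2)
loss_plain = np.sum((y - f_plain) ** 2)
```

Because w = 1 is always in the search range, the adaptive loss lower-bounds the plain one at every round, which is what allows the schedule to stay safe while unlocking the local quadratic phase.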
Empirical validation – Experiments on large‑scale web‑log data and standard regression benchmarks (UCI Housing, Year Prediction MSD) confirm the theory. With generic decision‑tree weak learners the observed gap follows the O(1/√T) trend. When the weak learners are smooth (gradient‑boosted trees with shallow depth), the linear regime appears early, matching the κ‑condition. Adaptive rescaling yields a fast initial drop followed by a clear quadratic decay, reducing the number of boosting rounds needed for a target multicalibration tolerance.
Significance – By modeling MC‑GB as a dynamical system with evolving features, the paper bridges a gap between empirical practice and classical boosting theory. It shows that despite the moving feature space, the algorithm remains stable, converges to a multicalibrated predictor, and can achieve fast rates under realistic smoothness assumptions. The analysis also provides guidance for practitioners: choosing smooth weak learners and employing adaptive rescaling can dramatically accelerate convergence while preserving calibration guarantees.
Future directions suggested include extending the analysis to infinite or data‑dependent weak‑learner families, online/streaming multicalibration, and other loss functions (e.g., logistic or quantile loss) where multicalibration of higher‑order moments is desired.