Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Understanding and certifying the generalization performance of machine learning algorithms – i.e. obtaining theoretical estimates of the test error from a finite training sample – is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical $0$–$1$ classification. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularization, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.


💡 Research Summary

The paper presents a comprehensive mechanized formalization of generalization error bounds based on Rademacher complexity and Dudley’s entropy integral using the Lean 4 proof assistant and the Mathlib mathematical library. The authors aim to bridge the gap between informal textbook arguments in statistical learning theory and fully verified formal mathematics, thereby providing a reusable foundation for a wide range of learning‑theoretic results.

The development begins by fixing a probability space $(\Omega,\mathcal{F},\mu)$, a data domain $X$, and an index type $\iota$ that parametrizes a family of real-valued functions $F=\{f_i : X \to \mathbb{R}\}_{i\in\iota}$. In Lean, the family is represented as a curried function f : ι → X → ℝ. Samples of size $n$ are modeled as functions ω : Fin n → Ω, and the empirical sample is obtained by composing with the data-generating random variable X : Ω → X.
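The ambient setup described above might be declared roughly as follows in Lean 4 with Mathlib. This is a hedged sketch, not the paper's actual code: identifier names are illustrative, and the data domain is written 𝒳 here to keep it notationally distinct from the random variable X.

```lean
import Mathlib

open MeasureTheory

-- Probability space (Ω, ℱ, μ); the σ-algebra is the MeasurableSpace instance.
variable {Ω : Type*} [MeasurableSpace Ω] (μ : Measure Ω) [IsProbabilityMeasure μ]

-- Data domain and index type for the hypothesis class.
variable {𝒳 : Type*} {ι : Type*}

-- The family F = {f_i : 𝒳 → ℝ} as a curried function, and the data variable.
variable (f : ι → 𝒳 → ℝ) (X : Ω → 𝒳)

-- A size-n sample is ω : Fin n → Ω; composing with X yields the observed data.
example (n : ℕ) (ω : Fin n → Ω) : Fin n → 𝒳 := fun k => X (ω k)
```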

Two central quantities are defined: empirical Rademacher complexity and its distribution-dependent counterpart. The former averages over all sign vectors σ : Fin n → {-1, 1} (implemented as Signs n) and takes a supremum over the index set. The latter integrates the empirical quantity with respect to the product measure $\mu^n$. Both definitions are written using explicit finite sums and Fintype.card rather than abstract expectations, which aligns with Mathlib’s handling of finite probability spaces.
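A minimal sketch of how these two definitions could look, with explicit finite sums and Fintype.card as described. All names are illustrative rather than the development's actual identifiers, and Signs n is modeled here as Fin n → Bool mapped to ±1:

```lean
import Mathlib

open MeasureTheory BigOperators

variable {Ω 𝒳 ι : Type*} [MeasurableSpace Ω]

-- Interpret a Boolean sign vector as ±1-valued reals (a stand-in for Signs n).
def signVal (b : Bool) : ℝ := if b then 1 else -1

-- Empirical Rademacher complexity on a fixed sample x : Fin n → 𝒳:
-- average over all 2^n sign vectors of a supremum over the index set.
noncomputable def empRademacher (n : ℕ) (f : ι → 𝒳 → ℝ) (x : Fin n → 𝒳) : ℝ :=
  (Fintype.card (Fin n → Bool) : ℝ)⁻¹ *
    ∑ σ : Fin n → Bool, ⨆ i, (n : ℝ)⁻¹ * ∑ k, signVal (σ k) * f i (x k)

-- Distribution-dependent version: integrate the empirical quantity
-- over the product measure μ^n on Fin n → Ω.
noncomputable def rademacher (μ : Measure Ω) (n : ℕ) (f : ι → 𝒳 → ℝ)
    (X : Ω → 𝒳) : ℝ :=
  ∫ ω : Fin n → Ω, empRademacher n f (fun k => X (ω k)) ∂(Measure.pi fun _ => μ)
```

Note that ⨆ over ℝ is Mathlib's conditionally complete supremum, which returns a junk value when the family is unbounded or empty; the actual development must discharge such side conditions explicitly.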

The first major theorem, expectation_le_rademacher, formalizes the classic symmetrization step. By rewriting the population mean as a coordinatewise mean on the product space, applying a symmetrization identity over two independent samples, and then using the triangle inequality, the authors bound the expected uniform deviation by twice the Rademacher complexity, i.e.
$$\mathbb{E}_{\omega \sim \mu^{n}}\Big[\,\sup_{i \in \iota}\Big(\mathbb{E}_{\mu}[f_i \circ X] \;-\; \frac{1}{n}\sum_{k=1}^{n} f_i\big(X(\omega_k)\big)\Big)\Big] \;\le\; 2\,\mathrm{Rad}_n(F),$$

where $\mathrm{Rad}_n(F)$ denotes the expected Rademacher complexity of the family $F$.
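The shape of this theorem can be paraphrased in Lean as follows. This is a hedged, self-contained sketch: the statement name is hypothetical, the Rademacher complexity is restated locally with explicit sums, and the development's measurability and integrability side conditions are omitted, with `sorry` in place of the mechanized proof.

```lean
import Mathlib

open MeasureTheory BigOperators

variable {Ω 𝒳 ι : Type*} [MeasurableSpace Ω] [Nonempty ι]

-- Expected Rademacher complexity, restated locally for self-containment:
-- average over sign vectors, supremum over the index set, then integrate
-- over the product measure μ^n.
noncomputable def rad (μ : Measure Ω) (n : ℕ) (f : ι → 𝒳 → ℝ) (X : Ω → 𝒳) : ℝ :=
  ∫ ω : Fin n → Ω,
    (Fintype.card (Fin n → Bool) : ℝ)⁻¹ *
      ∑ σ : Fin n → Bool,
        ⨆ i, (n : ℝ)⁻¹ * ∑ k, (if σ k then (1 : ℝ) else -1) * f i (X (ω k))
  ∂(Measure.pi fun _ => μ)

-- Hypothetical restatement of `expectation_le_rademacher`: the expected
-- uniform deviation is at most twice the Rademacher complexity.
theorem expectation_le_rademacher_sketch
    (μ : Measure Ω) [IsProbabilityMeasure μ] (n : ℕ)
    (f : ι → 𝒳 → ℝ) (X : Ω → 𝒳) :
    (∫ ω : Fin n → Ω,
        ⨆ i, (∫ ω', f i (X ω') ∂μ) - (n : ℝ)⁻¹ * ∑ k, f i (X (ω k))
      ∂(Measure.pi fun _ => μ))
      ≤ 2 * rad μ n f X := by
  sorry
```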

