On the structure of marginals in high dimensions


Authors: Daniel Bartl, Shahar Mendelson

Abstract. Let $G, G_1,\dots,G_N$ be independent copies of a standard gaussian random vector in $\mathbb{R}^d$ and denote by $\Gamma=\sum_{i=1}^N\langle G_i,\cdot\rangle e_i$ the standard gaussian ensemble. We show that, for any set $A\subset S^{d-1}$, with exponentially high probability,
$$\sup_{x\in A}\frac{1}{N}\sum_{i=1}^N\bigl|(\Gamma x)^\sharp_i-q_i\bigr|\le c\,\frac{\mathbb{E}\sup_{x\in A}\langle G,x\rangle+\log^2 N}{\sqrt{N}}.$$
Here each $q_i$ is the $\frac{i}{N+1}$-quantile of the standard normal distribution and $(\Gamma x)^\sharp$ denotes the monotone increasing rearrangement of the vector $\Gamma x$. The estimate is sharp up to a possible logarithmic factor and significantly extends previously known bounds. Moreover, we show that similar estimates hold in much greater generality: after replacing the gaussian quantiles by the appropriate ones, the same phenomenon persists for a broad class of random vectors.

1. Introduction

The problem we explore here revolves around structure preservation: how much structure is preserved by i.i.d. sampling. More accurately, let $F\subset L_2(\mu)$ be a class of mean-zero functions and set $\sigma=(X_1,\dots,X_N)$ to be distributed according to $\mu^{\otimes N}$. Given the sample $\sigma$, each $f\in F$ is associated with the random vector $P_\sigma f=(f(X_1),\dots,f(X_N))\in\mathbb{R}^N$, and the question is whether $P_\sigma f$ captures key features of $f$, uniformly in the class $F$.

To give a flavour of why this question is of interest, let $A\subset\mathbb{R}^d$, identify $x\in A$ with the linear functional $f_x=\langle\cdot,x\rangle$, and set $F_A=\{\langle\cdot,x\rangle : x\in A\}$. Consider a centred random vector $X$ in $\mathbb{R}^d$ and let $\Gamma=\sum_{i=1}^N\langle X_i,\cdot\rangle e_i$ be the random matrix whose rows are $X_1,\dots,X_N$, independent copies of $X$. In this case $P_\sigma f_x=\Gamma x$, and the question is whether the elements of $\Gamma A=\{\Gamma x : x\in A\}$ capture the essence of the points in $A$; when that happens, the random image $\Gamma A$ inherits much of $A$'s geometric structure.
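As an illustration of the objects just introduced, here is a minimal numerical sketch (ours, not from the paper) of the identification $P_\sigma f_x=\Gamma x$: when $\|x\|_2=1$, the coordinates $\langle X_i,x\rangle$ of $\Gamma x$ are again i.i.d. standard gaussian. The choice of $x$ and all names are illustrative assumptions.

```python
import math
import random

random.seed(0)
d, N = 5, 20000

# a fixed direction x on the sphere S^{d-1} (an illustrative choice)
x = [1.0 / math.sqrt(d)] * d

# rows of Gamma: independent standard gaussian vectors X_1, ..., X_N
rows = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]

# P_sigma f_x = Gamma x = (<X_i, x>)_{i=1..N}
gamma_x = [sum(xi * ri for xi, ri in zip(x, row)) for row in rows]

# since ||x||_2 = 1, each coordinate <X_i, x> is again standard gaussian:
# empirical mean near 0 and empirical second moment near 1
mean = sum(gamma_x) / N
var = sum(v * v for v in gamma_x) / N
print(round(mean, 2), round(var, 2))
```
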
Let us stress that even in this simplified setup, and in the best of scenarios, when $X$ is the standard gaussian random vector in $\mathbb{R}^d$, the interplay between $\Gamma A$ and $A$ is not fully understood. Accurate information is known only for notions of structure preservation that are 'coarse'. For example, the ideal result for what is arguably the most natural notion of structure preservation, the $L_2$ sense, implies that with high probability with respect to $\mu^{\otimes N}$, for every $f\in F$,
$$\Bigl|\frac{1}{N}\sum_{i=1}^N f^2(X_i)-\mathbb{E}f^2\Bigr|\le c\,\sup_{f\in F}\|f\|_{L_2}\cdot\frac{\mathrm{Comp}(F)}{\sqrt{N}}. \tag{1.1}$$
Here, $c$ is an absolute constant and $\mathrm{Comp}(F)$ captures the complexity of the class $F$ in some appropriate sense. Taking into account the behaviour of the supremum as $N\to\infty$, one would hope that $\mathrm{Comp}(F)$ is the expectation of the supremum of the canonical gaussian process indexed by $F$; but for finite $N$ such a gaussian bound is often false. The reason that (1.1) is coarse is that it does not say much about the location of each $\frac{1}{\sqrt{N}}P_\sigma f$: only that for a typical sample it 'lives' in an annulus of radii $\|f\|_{L_2}\pm\sup_{f\in F}\|f\|_{L_2}\frac{\mathrm{Comp}(F)}{\sqrt{N}}$, nothing more.

Remark 1.1. One should keep in mind that 'coarse' does not mean 'easy' or 'pointless'. Establishing (1.1) (and doing so with the right notion of $\mathrm{Comp}(F)$) is a notoriously difficult problem with a wide variety of important applications; see, for example, [14, 18, 20]. In fact, satisfactory versions of (1.1) exist either when the class $F$ consists of light-tailed random variables (e.g., exhibiting a subgaussian tail decay), or when $F$ has plenty of symmetries.

Date: March 19, 2026.
The most natural example of the latter is $F_{S^{d-1}}=\{\langle\cdot,x\rangle : x\in S^{d-1}\}$, and then (1.1) captures the extremal singular values of the random matrix $\frac{1}{\sqrt{N}}\Gamma=\frac{1}{\sqrt{N}}\sum_{i=1}^N\langle X_i,\cdot\rangle e_i$; see [2, 3, 12, 21, 26] and the references therein for some relatively recent results.

The notion of 'structure preservation' we are interested in here is distributional: that the distribution of the coordinates of each vector $P_\sigma f$ resembles the true distribution of $f$, uniformly in $F$. And since the joint distribution of the sample $(X_1,\dots,X_N)$ is invariant under coordinate permutations, it stands to reason that any meaningful notion of 'distributional structure preservation' should be formulated up to permutations. Thus, the shape of the vectors $P_\sigma f$ should be studied after rearranging their coordinates, rather than following a specific ordering.

With that in mind, for a vector $v=(v_i)_{i=1}^N\in\mathbb{R}^N$ denote by $v^\sharp$ its monotone nondecreasing rearrangement; in particular, $v^\sharp_1=\min_i v_i$ and $v^\sharp_N=\max_i v_i$. With minor abuse of notation, put $f^\sharp(X_i)=v^\sharp_i$ for $v=(f(X_i))_{i=1}^N$. Setting $(Q_f(u))_{u\in(0,1)}$ to be the quantiles of $f(X)$, the question is whether each reordered vector $(f^\sharp(X_i))_{i=1}^N$ captures $(Q_f(u))_{u\in(0,1)}$, uniformly in $F$.

The asymptotic picture for a single random variable is well understood when the distribution of $f(X)$ is sufficiently regular, for example, when $f(X)$ is gaussian. In such cases it is standard to verify (see, e.g., [29]) that almost surely, as $N\to\infty$,
$$f^\sharp(X_{\lfloor uN\rfloor})\to Q_f(u) \quad\text{for every } u\in(0,1).$$
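This almost sure convergence is easy to observe numerically. The following sketch (ours; the helper `empirical_quantile` and all parameter choices are illustrative assumptions) compares $f^\sharp(X_{\lfloor uN\rfloor})$ with $Q_f(u)$ when $f(X)$ is standard gaussian, at $u=0.9$.

```python
import random
from statistics import NormalDist

random.seed(7)
Q = NormalDist().inv_cdf  # the true gaussian quantile function Q_f

def empirical_quantile(N, u):
    """f#(X_{floor(uN)}): the floor(uN)-th smallest of N i.i.d. standard gaussians."""
    sample = sorted(random.gauss(0.0, 1.0) for _ in range(N))
    return sample[int(u * N) - 1]

u = 0.9
for N in (100, 10000):
    # average a few independent runs to smooth the fluctuation of a single order statistic
    est = sum(empirical_quantile(N, u) for _ in range(20)) / 20
    print(N, round(est, 2), "vs", round(Q(u), 2))
```
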
Unfortunately, even if this behaviour passes to a uniform estimate in $F$ (as is the case under minimal assumptions), it is an asymptotic estimate and provides little information on how well the finite-dimensional geometry of the vectors $\{(f^\sharp(X_i))_{i=1}^N : f\in F\}$ approximates the profiles of the true quantiles.

As we explain in what follows, existing non-asymptotic results are less than satisfactory. There are known quantitative estimates only for highly structured classes, while in more general scenarios, even when $X$ is a standard gaussian vector in $\mathbb{R}^d$ and $F_A$ consists of linear functionals indexed by a set $A\subset S^{d-1}$, what is known is far from sharp.

Our main result is an optimal non-asymptotic bound that holds under minimal assumptions on the class and the underlying measure. Rather than introducing the result in full generality, it is instructive to see what happens in the gaussian case, and by how much the new estimate improves the current state of the art.

To formulate the gaussian version of our main result, let $F_g(t)=\int_{-\infty}^t\frac{1}{\sqrt{2\pi}}e^{-s^2/2}\,ds$ be the standard gaussian distribution function, put $Q_g=F_g^{-1}$, and define $q\in\mathbb{R}^N$ by
$$q_i=Q_g\Bigl(\frac{i}{N+1}\Bigr), \quad i=1,\dots,N.$$
Consider $G$, the standard gaussian vector in $\mathbb{R}^d$, let $X_1,\dots,X_N$ be independent copies of $G$, and let $\Gamma$ be the random matrix whose rows are $X_1,\dots,X_N$.

Theorem 1.2. There exist absolute constants $c_1,c_2>0$ such that the following holds. For every set $A\subset S^{d-1}$, if
$$\Delta\ge c_1\,\frac{\max\bigl\{(\mathbb{E}\sup_{x\in A}\langle G,x\rangle)^2,\ \log^4(N)\bigr\}}{N}, \tag{1.2}$$
then with probability at least $1-\exp(-c_2\Delta N)$,
$$\sup_{x\in A}\frac{1}{N}\sum_{i=1}^N\bigl|(\Gamma x)^\sharp_i-q_i\bigr|\le\sqrt{\Delta}. \tag{1.3}$$

In other words, with high probability, for every $x\in A$ there is a rearrangement of the coordinates of the vector $\Gamma x$ that 'lives' in a small $\ell_1(\mathbb{R}^N)$-ball centered at $q$.
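A quick simulation (ours, for illustration only) of the quantity bounded in Theorem 1.2, for a single direction $x$: since each coordinate $\langle X_i,x\rangle$ of $\Gamma x$ is standard gaussian, the coordinates can be sampled directly, and the average gap to the quantile vector $q$ shrinks at roughly the $1/\sqrt{N}$ rate the theorem predicts.

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()

def quantile_gap(N):
    """(1/N) sum_i |(Gamma x)^#_i - q_i| for one unit vector x; the coordinates
    <X_i, x> are simulated directly as N i.i.d. standard gaussians."""
    sample = sorted(random.gauss(0.0, 1.0) for _ in range(N))
    q = [nd.inv_cdf(i / (N + 1)) for i in range(1, N + 1)]
    return sum(abs(s - t) for s, t in zip(sample, q)) / N

# the gap shrinks roughly like 1/sqrt(N), in line with (1.2)-(1.3)
for N in (100, 1000, 10000):
    print(N, round(quantile_gap(N), 3))
```
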
As it happens, the only (potential) looseness is the $\log^4(N)$ factor; the restriction on $\Delta$ and the probability estimate with which (1.3) holds are optimal. Indeed, we show that if $N\ge c$, then for every $\Delta>0$ and every fixed $x\in A$,
$$\mathbb{P}\Bigl(\frac{1}{N}\sum_{i=1}^N\bigl|(\Gamma x)^\sharp_i-q_i\bigr|\ge\sqrt{\Delta}\Bigr)\ge\exp(-c'\Delta N),$$
and also that with probability at least $0.99$,
$$\sup_{x\in A}\frac{1}{N}\sum_{i=1}^N\bigl|(\Gamma x)^\sharp_i-q_i\bigr|\ge c''\,\frac{\mathbb{E}\sup_{x\in A}\langle G,x\rangle}{\sqrt{N}}.$$
The proof of the optimality of Theorem 1.2 can be found in Section 4.

In addition to being optimal, Theorem 1.2 is a substantial improvement on existing results: even in the gaussian setting, previously known results imposed considerably stronger restrictions on $\Delta$. For example (and ignoring logarithmic factors), there are sets $A\subset S^{d-1}$ for which, thanks to Theorem 1.2, $\Delta\sim 1/N$ is a 'legal choice', but previously the best that one could hope for was $\Delta\gtrsim\sqrt{d}/N$; see [5]. Moreover, all prior results (which are stated in or can be derived from [6, 7, 8, 11, 17, 22]) were highly restricted: either requiring strong regularity assumptions on the set $A$, or requiring that $X$ is light-tailed and the marginals have bounded densities. In contrast, the general version of Theorem 1.2, formulated in the next section (see Theorem 2.6), is (almost) universal.

At the heart of the proof of Theorem 1.2 is a uniform concentration estimate on the Wasserstein distance between empirical distribution functions and their true counterparts (see Theorem 2.2), a fact that is of independent interest and has several applications (see one example in Appendix A).

2. A uniform estimate on the Wasserstein distance

Let $(\Omega,\mu)$ be a probability space, set $X$ to be distributed according to $\mu$, and let $X_1,\dots,X_N$ be independent copies of $X$.
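Before the formal development, a numerical sketch (ours; grid sizes and names are illustrative choices) of the quantity at the heart of this section: the two standard representations of the $W_1$ distance between an empirical distribution function and its true counterpart, via distribution functions and via quantile functions (compare (2.2) below), approximated for one gaussian sample and checked against each other.

```python
import random
from statistics import NormalDist

random.seed(2)
nd = NormalDist()
N = 500
sample = sorted(random.gauss(0.0, 1.0) for _ in range(N))

# W1 via the distribution-function formula: integral over R of |F_N(t) - F(t)| dt,
# approximated on a fine grid (a numerical sketch, not exact)
lo, hi, steps = -6.0, 6.0, 24000
dt = (hi - lo) / steps
w1_cdf = 0.0
j = 0  # running count: F_N(t) = j/N where j = #{i : sample[i] <= t}
for k in range(steps):
    t = lo + (k + 0.5) * dt
    while j < N and sample[j] <= t:
        j += 1
    w1_cdf += abs(j / N - nd.cdf(t)) * dt

# W1 via the quantile formula: integral over (0,1) of |F_N^{-1}(u) - F^{-1}(u)| du,
# using that F_N^{-1}(u) = sample[i] on [i/N, (i+1)/N); midpoint rule within each cell
m = 10
w1_q = 0.0
for i in range(N):
    for r in range(m):
        u = (i + (r + 0.5) / m) / N
        w1_q += abs(sample[i] - nd.inv_cdf(u)) / (N * m)

print(round(w1_cdf, 3), round(w1_q, 3))
```
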
Consider a class of functions $F\subset L_2(\mu)$; for $f\in F$ set $F_f(t)=\mathbb{P}(f(X)\le t)$ to be its distribution function, and let
$$F_{N,f}(t)=\frac{1}{N}\bigl|\{i\le N : f(X_i)\le t\}\bigr|, \quad t\in\mathbb{R},$$
be the empirical distribution function of $f(X)$. For two distribution functions $F$ and $H$, denote their first order Wasserstein distance by
$$W_1(F,H)=\int_{\mathbb{R}}|F(t)-H(t)|\,dt.$$
We refer to [10, 28] for more information on the $W_1$ distance and its role in optimal transport.

Our main result follows from a uniform estimate on the $W_1$ distance between $F_{N,f}$ and $F_f$. Set $(G_f)_{f\in F}$ to be the canonical gaussian process associated with $F$ (that is, the process is gaussian with mean zero and covariance $\mathrm{Cov}(G_f,G_h)=\mathrm{Cov}(f,h)$), and let
$$d_F=\sup_{f\in F}\|f\|_{L_2} \quad\text{and}\quad d^*(F)=\Bigl(\frac{\mathbb{E}\sup_{f\in F}G_f}{d_F}\Bigr)^2$$
be the radius of $F\cup\{0\}$ and its critical (or Dvoretzky-Milman) dimension, respectively. For a function $f$, set
$$\|f\|_{L_2(P_N)}=\Bigl(\frac{1}{N}\sum_{i=1}^N f^2(X_i)\Bigr)^{1/2},$$
and the key assumption we require is that $\|\cdot\|_{L_2}$ and $\|\cdot\|_{L_2(P_N)}$ are compatible in the following sense:

Assumption 2.1. There are $\theta>\mathbb{E}\sup_{f\in F}G_f$, $B\ge 1$ and an event $\Omega_0$ on which, for every $f,h\in F\cup\{0\}$,
$$\|f-h\|_{L_2(P_N)}\le B\|f-h\|_{L_2}+\frac{\theta}{\sqrt{N}}.$$

We show in Section 2.1 that Assumption 2.1 is always satisfied with $\theta\sim\mathbb{E}\sup_{f\in F}G_f$ when the class $F$ is subgaussian, and in Appendix B that it is valid for other, more heavy-tailed classes as well. The crucial estimate is as follows:

Theorem 2.2. For any $B\ge 1$ there are constants $c_1$ and $c_2$ depending only on $B$ for which the following holds. If Assumption 2.1 is satisfied and
$$\Delta N\ge c_1\max\Bigl\{\frac{\theta^2}{d_F^2},\ \log^4(N)\Bigr\},$$
then with probability at least $1-\exp(-c_2\Delta N)-\mathbb{P}(\Omega_0^c)$,
$$\sup_{f\in F}W_1(F_{N,f},F_f)\le d_F\sqrt{\Delta}. \tag{2.1}$$

Remark 2.3. The way $c_1$ and $c_2$ in Theorem 2.2 depend on $B$ is of secondary importance.
The proof we present implies that one may set $c_1=c_1'B^4$ and $c_2=c_2'/B^4$ for absolute constants $c_1',c_2'$, but that is likely to be suboptimal.

Remark 2.4. We may assume that each $f\in F$ has mean zero: the $W_1$ distance is invariant to centering. We also ignore all questions related to measurability, which can be resolved using standard methods (e.g., assuming that the class can be well-approximated by a countable subset).

As it turns out, the estimate in Theorem 2.2 is sharp, even in a gaussian setting and for classes of linear functionals $F=F_A=\{\langle\cdot,x\rangle : x\in A\}$.

Lemma 2.5. There are absolute constants $c_1$ and $c_2$ for which the following holds. Let $X=G$ be the standard gaussian vector in $\mathbb{R}^d$, and set $A\subset S^{d-1}$. Then for every $\Delta\ge\frac{1}{N}$ and $x\in A$, with probability at least $\exp(-c_1\Delta N)$,
$$W_1(F_{N,x},F_x)\ge c_2\sqrt{\Delta}.$$
Moreover, with probability at least $0.99$,
$$\sup_{x\in A}W_1(F_{N,x},F_x)\ge c_2\,\frac{\mathbb{E}\sup_{x\in A}\langle G,x\rangle}{\sqrt{N}}.$$

The proofs of Theorem 2.2 and Lemma 2.5 are presented in Section 3.

Theorem 2.2 leads to our main result (and to its gaussian version in Theorem 1.2) thanks to the standard representation of the $W_1$-distance using quantile functions: for any two distribution functions $F$ and $H$,
$$W_1(F,H)=\int_0^1\bigl|F^{-1}(u)-H^{-1}(u)\bigr|\,du. \tag{2.2}$$
Moreover, observe that for every $i=1,\dots,N$ and $u\in[\frac{i-1}{N},\frac{i}{N})$, $F_{N,f}^{-1}(u)=f^\sharp(X_i)$. Thus, setting $q_i(f)=F_f^{-1}(\frac{i}{N+1})$ for $i=1,\dots,N$, it follows from (2.2) that $W_1(F_{N,f},F_f)$ is almost equal to $\frac{1}{N}\sum_{i=1}^N|f^\sharp(X_i)-q_i(f)|$, up to the discrepancy between $F_f^{-1}$ and its discretization, which can be easily controlled (see Lemma 4.1). We thus have the following:

Theorem 2.6. For any $B\ge 1$ there are constants $c_1$ and $c_2$ depending only on $B$ such that the following holds.
If Assumption 2.1 is satisfied and
$$\Delta N\ge c_1\max\Bigl\{\frac{\theta^2}{d_F^2},\ \log^4(N)\Bigr\},$$
then with probability at least $1-\exp(-c_2\Delta N)-\mathbb{P}(\Omega_0^c)$,
$$\sup_{f\in F}\frac{1}{N}\sum_{i=1}^N\bigl|f^\sharp(X_i)-q_i(f)\bigr|\le d_F\sqrt{\Delta}.$$

The proofs of Theorem 2.6 and Theorem 1.2 are presented in Section 4.

2.1. On Assumption 2.1. Consider first the case when the class $F$ is $L$-subgaussian; that is, for every $f,h\in F\cup\{0\}$,
$$\|f-h\|_{L_p}\le L\sqrt{p}\,\|f-h\|_{L_2} \quad\text{for } p\ge 2. \tag{2.3}$$
Note that an equivalent formulation is that
$$\mathbb{P}\bigl(|f(X)-h(X)|\ge L'\lambda\|f-h\|_{L_2}\bigr)\le 2\exp(-\lambda^2) \quad\text{for }\lambda\ge 0,$$
and then $L'\sim L$; we refer, e.g., to [27] for this and other basic facts about subgaussian distributions. We then have the following result.

Lemma 2.7. There are absolute constants $c_1$ and $c_2$ for which the following holds. If the class $F$ is $L$-subgaussian, $0\in F$, and $\theta\ge c_1 L\,\mathbb{E}\sup_{f\in F}G_f$, then Assumption 2.1 is satisfied with that $\theta$, $B=c_1 L$, and an event $\Omega_0$ of probability at least $1-\exp(-c_2\max\{N,\theta^2/d_F^2\})$.

The (standard) proof is given in Appendix B. Since $d^*(F\cup\{0\})\sim\max\{1,d^*(F)\}$, Theorem 2.2 and Lemma 2.7 (applied with $\theta\sim d_F\sqrt{\Delta N}$) yield the following.

Corollary 2.8. There are constants $c_1$ and $c_2$ depending only on $L$ such that the following holds. If the class $F$ is $L$-subgaussian and
$$\Delta N\ge c_1\max\{d^*(F),\ \log^4(N)\},$$
then with probability at least $1-\exp(-c_2\Delta N)$,
$$\sup_{f\in F}W_1(F_{N,f},F_f)\le d_F\sqrt{\Delta}.$$

We also show in Appendix B that Assumption 2.1 holds for various other (possibly heavy-tailed) classes of functions.

3. Proofs of Theorem 2.2 and Lemma 2.5

The proof of Theorem 2.2 relies on a (non-standard) variant of Talagrand's generic chaining theory. For a detailed and illuminating exposition of generic chaining we refer to Talagrand's treasured book [25].

Definition 3.1.
An admissible sequence $(F_s)_{s\ge 0}$ of the class $F\subset L_2(\mu)$ is a sequence of subsets $F_s\subset F$ that satisfy $|F_0|=1$ and $|F_s|\le 2^{2^s}$. Talagrand's $\gamma_2$-functional (with respect to the $L_2(\mu)$ distance) is
$$\gamma_2(F)=\inf_{(F_s)_{s\ge 0}}\sup_{f\in F}\sum_{s\ge 0}2^{s/2}\|f-\pi_s f\|_{L_2},$$
where $\pi_s f$ is the nearest element to $f$ in $F_s$ with respect to the $L_2(\mu)$ distance and the infimum is taken with respect to all admissible sequences of $F$.

Talagrand's majorizing measures theorem [24] implies that $\gamma_2(F)$, seemingly a purely metric object, is actually a probabilistic entity: it is equivalent to the expectation of the supremum of the canonical gaussian process indexed by $F$. Formally, there are absolute constants $c_1$ and $c_2$ for which, for every $F\subset L_2(\mu)$,
$$c_1\,\mathbb{E}\sup_{f\in F}G_f\le\gamma_2(F)\le c_2\,\mathbb{E}\sup_{f\in F}G_f,$$
where $(G_f)_{f\in F}$ is the canonical gaussian process indexed by the class $F$.

The second ingredient we require is the following dual representation of the $W_1$-distance: if $F$ and $H$ have finite first moments,
$$W_1(F,H)=\sup\Bigl\{\int_{\mathbb{R}}\varphi\,dF-\int_{\mathbb{R}}\varphi\,dH : \varphi:\mathbb{R}\to\mathbb{R}\text{ is 1-Lipschitz}\Bigr\}. \tag{3.1}$$
In particular, (3.1) immediately yields the following observation.

Lemma 3.2. For every two measurable functions $f$ and $h$,
$$W_1(F_{N,f+h},F_{f+h})\le W_1(F_{N,f},F_f)+\|h\|_{L_1(P_N)}+\|h\|_{L_1}.$$

Proof. We assume without loss of generality that $f$ and $h$ have finite first moments. Clearly, by the triangle inequality,
$$W_1(F_{N,f+h},F_{f+h})\le W_1(F_{N,f+h},F_{N,f})+W_1(F_{N,f},F_f)+W_1(F_f,F_{f+h}).$$
Next, let $\varphi$ be any 1-Lipschitz function. Then $|\mathbb{E}\varphi(f(X))-\mathbb{E}\varphi(f(X)+h(X))|\le\mathbb{E}|h(X)|$, and by (3.1), $W_1(F_f,F_{f+h})\le\|h\|_{L_1}$. In a similar fashion, $W_1(F_{N,f+h},F_{N,f})\le\|h\|_{L_1(P_N)}$, which completes the proof. □

Remark 3.3. We may assume that $\Delta\le 1$.
Indeed, by Lemma 3.2 and Jensen's inequality, for every $f\in F$,
$$W_1(F_{N,f},F_f)\le\|f\|_{L_2(P_N)}+\|f\|_{L_2},$$
and clearly $\|f\|_{L_2}\le d_F$. Moreover, by Assumption 2.1, for every realization $(X_i)_{i=1}^N\in\Omega_0$, $\|f\|_{L_2(P_N)}\le Bd_F+\frac{\theta}{\sqrt{N}}$, and thus $W_1(F_{N,f},F_f)\le c(B)\,d_F\sqrt{\Delta}$ for every $\Delta\ge 1$ satisfying the restriction from Theorem 2.2, namely $\Delta N\ge c'\theta^2/d_F^2$.

3.1. Setting up the chaining argument. If Assumption 2.1 is satisfied for $F$, it is also satisfied for $F\cup\{0\}$; hence we may assume that $0\in F$. Observe that there is a set $F_\theta\subset F$ with cardinality at most $2^{cN}$ that is $\theta/\sqrt{N}$-separated (that is, for any distinct $f,h\in F_\theta$, $\|f-h\|_{L_2}\ge\theta/\sqrt{N}$) and covers $F$ at scale $2\theta/\sqrt{N}$ (that is, for every $f\in F$ there is $h\in F_\theta$ satisfying $\|f-h\|_{L_2}\le 2\theta/\sqrt{N}$). Indeed, by Sudakov's inequality (see, e.g., [23, Theorem 5.6]), if $N(\varepsilon)$ denotes the smallest cardinality of a set that covers $F$ at scale $\varepsilon>0$, then for $\varepsilon=\theta/\sqrt{N}$,
$$\log(N(\varepsilon))\le c_0\Bigl(\frac{\mathbb{E}\sup_{f\in F}G_f}{\varepsilon}\Bigr)^2\le c_0 N,$$
where the second inequality follows from the condition on $\theta$ in Assumption 2.1. The claim now follows from the standard relation between packing numbers and covering numbers. Finally, we may assume without loss of generality that $0\in F_\theta$.

Next, let $(F^\theta_s)_{s\ge 0}$ be an almost optimal admissible sequence of $F_\theta$; since $F_\theta\subset F$, we have that $\gamma_2(F_\theta)\le\gamma_2(F)$. Denote by $s_1\ge 0$ the smallest integer that satisfies $2^{2^{s_1}}\ge|F_\theta|$; hence $2^{s_1}\le c'N$, and we set $F^\theta_s=F_\theta$ for $s\ge s_1$. For $f\in F$ and $s\ge 0$, let $\pi_s f$ be the closest element in $F^\theta_s$ to $f$. In particular, $\pi_s f=\pi_{s_1}f$ for $s\ge s_1$.
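The 'standard relation between packing numbers and covering numbers' invoked above can be made concrete: a maximal $\varepsilon$-separated subset is automatically an $\varepsilon$-cover. A one-dimensional sketch (ours; the greedy construction and all parameters are illustrative):

```python
import random

random.seed(4)

def maximal_separated_subset(points, eps):
    """Greedy maximal eps-separated subset: keep a point only if it is at
    distance >= eps from everything kept so far (distance on the real line)."""
    chosen = []
    for p in points:
        if all(abs(p - c) >= eps for c in chosen):
            chosen.append(p)
    return chosen

points = [random.uniform(0.0, 1.0) for _ in range(500)]
eps = 0.05
net = maximal_separated_subset(points, eps)

# the greedy set is eps-separated by construction ...
assert all(abs(a - b) >= eps for i, a in enumerate(net) for b in net[:i])
# ... and, by maximality, it covers every point at scale eps
assert all(min(abs(p - c) for c in net) < eps for p in points)
print(len(net))
```
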
Combining these observations, we have the following, where $\Delta_s f=\pi_{s+1}f-\pi_s f$ denotes the links of the chain: for every $f\in F$,
$$\|f-\pi_{s_1}f\|_{L_2}\le\frac{2\theta}{\sqrt{N}}, \tag{3.2}$$
for $s\le s_1$,
$$\text{either}\quad\|\Delta_s f\|_{L_2}=0\quad\text{or}\quad\|\Delta_s f\|_{L_2}\ge\frac{\theta}{\sqrt{N}}, \tag{3.3}$$
and
$$\sup_{f\in F}\sum_{s\ge 0}2^{s/2}\|\Delta_s f\|_{L_2}\le 4\gamma_2(F_\theta)\le 4\gamma_2(F). \tag{3.4}$$
Finally, recall that $\Delta\ge\frac{1}{N}$ and set $s_0$ to be the first integer that satisfies $2^{s_0}\ge\Delta N$.

Remark 3.4. In what follows we assume that $s_1>s_0$. If $s_1\le s_0$, the arguments presented below simplify considerably, as there is no need for chaining; we therefore omit the standard details.

The first step in the proof of Theorem 2.2 is the following reduction.

Lemma 3.5. Using the notation of Assumption 2.1, for every $(X_i)_{i=1}^N\in\Omega_0$ and every $f\in F$,
$$W_1(F_{N,f},F_f)\le W_1(F_{N,\pi_{s_1}f},F_{\pi_{s_1}f})+(2+B)\frac{\theta}{\sqrt{N}}.$$

Proof. By Lemma 3.2, for every $f\in F$,
$$W_1(F_{N,f},F_f)\le W_1(F_{N,\pi_{s_1}f},F_{\pi_{s_1}f})+\|f-\pi_{s_1}f\|_{L_1}+\|f-\pi_{s_1}f\|_{L_1(P_N)}.$$
Recall that $\|f-\pi_{s_1}f\|_{L_1}\le\|f-\pi_{s_1}f\|_{L_2}\le\frac{\theta}{\sqrt{N}}$, and that by Assumption 2.1,
$$\|f-\pi_{s_1}f\|_{L_2(P_N)}\le B\|f-\pi_{s_1}f\|_{L_2}+\frac{\theta}{\sqrt{N}},$$
as required. □

The next step in the proof is truncation. For each $m>0$ define $\phi(\cdot\,;m):\mathbb{R}\to\mathbb{R}$,
$$\phi(x;m)=\begin{cases}x & \text{if } |x|\le m,\\ m & \text{if } x>m,\\ -m & \text{otherwise},\end{cases}$$
and set $\psi(\cdot\,;m)=\mathrm{id}-\phi(\cdot\,;m)$. Note that $|\psi(x;m)|=\max\{0,|x|-m\}$. With a minor abuse of notation, set for any function $h$,
$$\phi_s(h)=\phi\Bigl(h;\|h\|_{L_2}\sqrt{N/2^s}\Bigr), \qquad \psi_s(h)=\psi\Bigl(h;\|h\|_{L_2}\sqrt{N/2^s}\Bigr),$$
and for $f\in F$ let
$$T(f)=\sum_{s=s_0}^{s_1-1}\phi_s(\Delta_s f)+\phi_{s_0}(\pi_{s_0}f).$$
The idea behind the introduction of $T(f)$ is a multi-level truncation. Clearly $\pi_{s_1}f=\sum_{s=s_0}^{s_1-1}\Delta_s f+\pi_{s_0}f$; thus $T(f)$ is a truncation of $\pi_{s_1}f$ that takes place for each 'link' separately, according to the level $s$ and to the $L_2$-norm of the link.
This is essential in guaranteeing that the wanted degree of concentration is exhibited by all the links simultaneously, and is the reason behind the choice of the truncation level of an $s$-link at $\|\Delta_s f\|_{L_2}\sqrt{N/2^s}$. Note that for $s>s_1$, $\sqrt{N/2^s}<1$, meaning that the truncation would be beyond a meaningful scale; that is the reason why the construction is terminated at the level $s_1$.

Lemma 3.6. There is an absolute constant $c$ such that for $(X_i)_{i=1}^N\in\Omega_0$ and every $f\in F$,
$$W_1(F_{N,\pi_{s_1}f},F_{\pi_{s_1}f})\le W_1(F_{N,T(f)},F_{T(f)})+cB^2\Bigl(d_F\sqrt{\Delta}+\frac{\gamma_2(F)}{\sqrt{N}}\Bigr).$$

Proof. By Lemma 3.2,
$$W_1(F_{N,\pi_{s_1}f},F_{\pi_{s_1}f})\le W_1(F_{N,T(f)},F_{T(f)})+\|\pi_{s_1}f-T(f)\|_{L_1(P_N)}+\|\pi_{s_1}f-T(f)\|_{L_1}.$$
To estimate $\|\pi_{s_1}f-T(f)\|_{L_1}$, set $b_s=\|\Delta_s f\|_{L_2}\sqrt{N/2^s}$; by the Cauchy-Schwarz inequality followed by Markov's inequality,
$$\|\psi_s(\Delta_s f)\|_{L_1}\le\mathbb{E}\bigl[|\Delta_s f|\,\mathbb{1}_{\{|\Delta_s f|\ge b_s\}}\bigr]\le\|\Delta_s f\|_{L_2}\sqrt{\mathbb{P}(|\Delta_s f|\ge b_s)}\le\|\Delta_s f\|_{L_2}\sqrt{\frac{2^s}{N}}.$$
In a similar manner, using that $2^{s_0}\ge\Delta N$ and that $\|\pi_{s_0}f\|_{L_2}\le d_F$,
$$\|\psi_{s_0}(\pi_{s_0}f)\|_{L_1}\le\|\pi_{s_0}f\|_{L_2}\sqrt{\mathbb{P}\bigl(|\pi_{s_0}f|\ge\|\pi_{s_0}f\|_{L_2}/\sqrt{\Delta}\bigr)}\le d_F\sqrt{\Delta}.$$
Thus, by (3.4),
$$\|\pi_{s_1}f-T(f)\|_{L_1}\le\|\psi_{s_0}(\pi_{s_0}f)\|_{L_1}+\sum_{s=s_0}^{s_1-1}\|\psi_s(\Delta_s f)\|_{L_1}\le d_F\sqrt{\Delta}+\sum_{s=s_0}^{s_1-1}\sqrt{\frac{2^s}{N}}\,\|\Delta_s f\|_{L_2}\le d_F\sqrt{\Delta}+\frac{4\gamma_2(F)}{\sqrt{N}}. \tag{3.5}$$
The analysis of the $\|\psi_s(\Delta_s f)\|_{L_1(P_N)}$-term is slightly more involved. Since $F^\theta_{s_1}$ is a $\theta/\sqrt{N}$-separated set, we either have $\Delta_s f=0$ (in which case $\|\psi_s(\Delta_s f)\|_{L_1(P_N)}=0$), or $\|\Delta_s f\|_{L_2}\ge\theta/\sqrt{N}$. In the latter case, and since $|\psi_s(\Delta_s f)|=\max\{0,|\Delta_s f|-b_s\}$, tail-integration followed by Markov's inequality shows that
$$\|\psi_s(\Delta_s f)\|_{L_1(P_N)}=\int_{b_s}^\infty P_N(|\Delta_s f|>t)\,dt\le\int_{b_s}^\infty\frac{\|\Delta_s f\|^2_{L_2(P_N)}}{t^2}\,dt.$$
Moreover, $\|\Delta_s f\|_{L_2}\ge\theta/\sqrt{N}$, and it thus follows from Assumption 2.1 that $\|\Delta_s f\|^2_{L_2(P_N)}\le(B+1)^2\|\Delta_s f\|^2_{L_2}$. Therefore, by the choice of $b_s$,
$$\|\psi_s(\Delta_s f)\|_{L_1(P_N)}\le(B+1)^2\|\Delta_s f\|^2_{L_2}\int_{b_s}^\infty\frac{1}{t^2}\,dt=(B+1)^2\|\Delta_s f\|_{L_2}\sqrt{\frac{2^s}{N}}.$$
The same arguments can be used to show that $\|\psi_{s_0}(\pi_{s_0}f)\|_{L_1(P_N)}\le c_1B^2d_F\sqrt{\Delta}$, and following the same path as in (3.5),
$$\|\pi_{s_1}f-T(f)\|_{L_1(P_N)}\le c_2B^2\Bigl(d_F\sqrt{\Delta}+\frac{\gamma_2(F)}{\sqrt{N}}\Bigr),$$
as required. □

3.2. The heart of the proof. We are left to deal with $W_1(F_{N,T(f)},F_{T(f)})$, which is where the non-standard chaining argument is required. For $f\in F$ set $S_{s_0-1}(f)=\phi_{s_0}(\pi_{s_0}f)$ and for $r\ge s_0$ let
$$S_r(f)=\sum_{s=s_0}^{r}\phi_s(\Delta_s f)+\phi_{s_0}(\pi_{s_0}f).$$
In particular, $T(f)=S_{s_1-1}(f)$ and we have the telescopic sum
$$T(f)=\sum_{r=s_0}^{s_1-1}\bigl(S_r(f)-S_{r-1}(f)\bigr)+\phi_{s_0}(\pi_{s_0}f).$$
Moreover,
$$\mathbb{P}(T(f)\le t)=\sum_{r=s_0}^{s_1-1}\bigl(\mathbb{P}(S_r(f)\le t)-\mathbb{P}(S_{r-1}(f)\le t)\bigr)+\mathbb{P}(\phi_{s_0}(\pi_{s_0}f)\le t),$$
and the same is true for $P_N$ as well. Thus, setting $[P_N-P](A)=P_N(A)-P(A)$, we have that
$$W_1(F_{N,T(f)},F_{T(f)})\le\sum_{r=s_0}^{s_1-1}\int_{\mathbb{R}}\Bigl|[P_N-P](S_r(f)\le t)-[P_N-P](S_{r-1}(f)\le t)\Bigr|\,dt+\int_{\mathbb{R}}\bigl|[P_N-P](\phi_{s_0}(\pi_{s_0}f)\le t)\bigr|\,dt=E_1(f)+E_2(f). \tag{3.6}$$

3.3. Estimating $E_1(f)$.

Lemma 3.7. There is an absolute constant $c$ such that with probability at least $1-\exp(-4\Delta N)$, for every $f\in F$,
$$E_1(f)\le c\Bigl(\frac{d_F\log^2(e/\Delta)}{\sqrt{N}}+\frac{\gamma_2(F)}{\sqrt{N}}\Bigr).$$

The crucial ingredient is a high probability estimate on each increment
$$\int_{\mathbb{R}}\Bigl|[P_N-P](S_r(f)\le t)-[P_N-P](S_{r-1}(f)\le t)\Bigr|\,dt:$$

Lemma 3.8. There is an absolute constant $c$ such that the following holds. For every $f\in F$ and $r\in\{s_0,\dots,s_1-1\}$, with probability at least $1-\exp(-2^{r+5})$,
$$\int_{\mathbb{R}}\Bigl|[P_N-P](S_r(f)\le t)-[P_N-P](S_{r-1}(f)\le t)\Bigr|\,dt\le c\Bigl(\frac{d_F\log(e/\Delta)}{\sqrt{N}}+\|\Delta_r f\|_{L_2}\sqrt{\frac{2^r}{N}}\Bigr). \tag{3.7}$$

Let us show that Lemma 3.7 is an immediate outcome of Lemma 3.8.

Proof of Lemma 3.7. Observe that
$$|\{(S_r(f),S_{r-1}(f)) : f\in F\}|\le\exp(2^{r+4}).$$
Indeed, recall that $S_r(f)=\sum_{s=s_0}^r\phi_s(\Delta_s f)+\phi_{s_0}(\pi_{s_0}f)$, and thus
$$|\{S_r(f) : f\in F\}|\le\sum_{s=s_0}^r|F^\theta_{s+1}\times F^\theta_s|+|F^\theta_{s_0}|\le\sum_{s=s_0}^r 2^{2^{s+1}}2^{2^s}+2^{2^{s_0}}\le\exp(2^{r+3}).$$
Now consider $r\in\{s_0,\dots,s_1-1\}$. By Lemma 3.8 and the union bound over the pairs $(S_r(f),S_{r-1}(f))$, it is evident that with probability at least $1-\exp(-2^{r+4})$, (3.7) holds uniformly for every pair $((S_r(f),S_{r-1}(f)))_{f\in F}$. And, by the union bound over $r$, with probability at least
$$1-\sum_{r\ge s_0}\exp(-2^{r+4})\ge 1-\exp(-2^{s_0+2}),$$
for every $f\in F$,
$$E_1(f)\le c_1\sum_{r=s_0}^{s_1-1}\Bigl(\|\Delta_r f\|_{L_2}\sqrt{\frac{2^r}{N}}+\frac{d_F\log(e/\Delta)}{\sqrt{N}}\Bigr)\le c_1\Bigl(4\gamma_2(F)+(s_1+1-s_0)\,d_F\log(e/\Delta)\Bigr)\frac{1}{\sqrt{N}}.$$
To complete the proof, note that $s_1-s_0\le\log_2(4N)-\log_2(\Delta N)\le c\log\bigl(\frac{e}{\Delta}\bigr)$. □

Next, fix $f\in F$ and $r\ge s_0$, and set
$$Z=\int_{\mathbb{R}}\Bigl|[P_N-P](S_r(f)\le t)-[P_N-P](S_{r-1}(f)\le t)\Bigr|\,dt.$$
We turn to the 'main event': showing that $Z$ concentrates around its expectation, and that the expectation is sufficiently small.

3.3.1. Controlling $\mathbb{E}Z$.

Lemma 3.9. Let $m>0$ and $M\ge 1$. If $h$ satisfies $\|h\|_{L_2}\le m$ and $\|h\|_{L_\infty}\le mM$, then
$$\mathbb{E}W_1(F_{N,h},F_h)\le\frac{2m\log(eM)}{\sqrt{N}}.$$

Proof. Set $w=h/m$; thus $F_{N,h}(t)=F_{N,w}(t/m)$ and $F_h(t)=F_w(t/m)$ for $t\in\mathbb{R}$, and $F_{N,w}(t)=F_w(t)\in\{0,1\}$ for $|t|>M$.
By Fubini's theorem followed by a change of variables,
$$\mathbb{E}W_1(F_{N,h},F_h)=\int_{\mathbb{R}}\mathbb{E}|F_{N,h}(s)-F_h(s)|\,ds=m\int_{-M}^{M}\mathbb{E}|F_{N,w}(t)-F_w(t)|\,dt.$$
Clearly, for every $t\in\mathbb{R}$,
$$\bigl(\mathbb{E}[|F_{N,w}(t)-F_w(t)|^2]\bigr)^{1/2}=\frac{(F_w(t)(1-F_w(t)))^{1/2}}{\sqrt{N}}.$$
Finally, by Markov's inequality, $F_w(t)(1-F_w(t))\le\|w\|^2_{L_2}t^{-2}\le t^{-2}$, and the claim follows because $\int_1^M\frac{1}{t}\,dt=\log(M)$. □

Lemma 3.10. There is an absolute constant $c$ such that for every $f\in F$ and $r\in\{s_0-1,\dots,s_1-1\}$,
$$\mathbb{E}W_1(F_{N,S_r(f)},F_{S_r(f)})\le\frac{cd_F\log(e/\Delta)}{\sqrt{N}}.$$
In particular, $\mathbb{E}Z\le\frac{2cd_F\log(e/\Delta)}{\sqrt{N}}$.

Proof. Observe that $\|S_r(f)\|_{L_\infty}\le c_1 d_F/\sqrt{\Delta}$. Indeed,
$$\|\phi_{s_0}(\pi_{s_0}f)\|_{L_\infty}\le d_F\sqrt{\frac{N}{2^{s_0}}} \quad\text{and}\quad \|\phi_\ell(\Delta_\ell f)\|_{L_\infty}\le\|\Delta_\ell f\|_{L_2}\sqrt{\frac{N}{2^\ell}}.$$
Therefore,
$$\|S_r(f)\|_{L_\infty}\le\sum_{\ell=s_0}^r\|\phi_\ell(\Delta_\ell f)\|_{L_\infty}+\|\phi_{s_0}(\pi_{s_0}f)\|_{L_\infty}\le\sum_{\ell=s_0}^r 2^{\ell/2}\|\Delta_\ell f\|_{L_2}\frac{\sqrt{N}}{2^{s_0}}+d_F\sqrt{\frac{N}{2^{s_0}}}\le\frac{4\gamma_2(F)\sqrt{N}}{\Delta N}+d_F\sqrt{\frac{1}{\Delta}}\le 5d_F\sqrt{\frac{1}{\Delta}},$$
where we used that $2^{s_0}\ge\Delta N$ and that $\gamma_2^2(F)\le d_F^2\Delta N$.

Next, note that $\|S_r(f)\|_{L_2}\le c_2 d_F$. This holds because $\|\phi_{s_0}(\pi_{s_0}f)\|_{L_2}\le\|\pi_{s_0}f\|_{L_2}$ and $\|\phi_\ell(\Delta_\ell f)\|_{L_2}\le\|\Delta_\ell f\|_{L_2}$; hence
$$\|S_r(f)\|_{L_2}\le\sum_{\ell=s_0}^r\|\phi_\ell(\Delta_\ell f)\|_{L_2}+\|\phi_{s_0}(\pi_{s_0}f)\|_{L_2}\le\frac{4\gamma_2(F)}{2^{s_0/2}}+d_F.$$
Recalling that $2^{s_0}\ge\Delta N\ge\gamma_2^2(F)/d_F^2$, it is evident that $\|S_r(f)\|_{L_2}\le 5d_F$. The first part of the lemma follows from Lemma 3.9, and the second since
$$\mathbb{E}Z\le\mathbb{E}W_1(F_{N,S_r(f)},F_{S_r(f)})+\mathbb{E}W_1(F_{N,S_{r-1}(f)},F_{S_{r-1}(f)}).\qquad\Box$$

3.3.2. Concentration around $\mathbb{E}Z$. The argument is based on the generalization of the Efron-Stein inequality from [9]. Set $X_1',\dots,X_N'$ to be independent copies of $X$ that are independent of $X_1,\dots,X_N$, and let
$$P_N^{(j)}=\frac{1}{N}\bigl(\delta_{X_1}+\dots+\delta_{X_{j-1}}+\delta_{X_j'}+\delta_{X_{j+1}}+\dots+\delta_{X_N}\bigr).$$
Denote
$$Z^{(j)}=\int_{\mathbb{R}}\Bigl|[P_N^{(j)}-P](S_r(f)\le t)-[P_N^{(j)}-P](S_{r-1}(f)\le t)\Bigr|\,dt$$
and put
$$V=\sum_{j=1}^N\mathbb{E}\bigl[(Z-Z^{(j)})^2\,\big|\,(X_i)_{i=1}^N\bigr].$$
Here, the conditional expectation means that the expectation is taken only with respect to $X_j'$. Using this notation, a direct consequence of Theorem 2 in [9] is the following:

Theorem 3.11 ([9, Theorem 2]). For every $q\ge 2$,
$$\|Z-\mathbb{E}Z\|_{L_q}\le 6\sqrt{q}\,\|\sqrt{V}\|_{L_q}.$$

The proof of Lemma 3.8 is based on Theorem 3.11 for $q=2^{r+5}$: by Markov's inequality,
$$\mathbb{P}\bigl(|Z-\mathbb{E}Z|\ge 6e\sqrt{q}\,\|\sqrt{V}\|_{L_q}\bigr)\le\frac{1}{e^q}=\exp(-2^{r+5}).$$

Remark 3.12. In light of the wanted chaining bound, we need to show that $\|\sqrt{V}\|_{L_q}\le c\|\Delta_r f\|_{L_2}/\sqrt{N}$ for that choice of $q$. Such an estimate implies that $\|\sqrt{V}\|_{L_q}\sim\|\sqrt{V}\|_{L_1}$, and as the proof reveals, this stability is true only up to $q\sim 2^r$; for higher moments $\|\sqrt{V}\|_{L_q}$ grows with $q$.

Lemma 3.13. For every $q\ge 2$,
$$\|\sqrt{V}\|_{L_q}\le\frac{2\|\Delta_r f\|_{L_2}}{\sqrt{N}}+\frac{2}{N}\Bigl\|\Bigl(\sum_{j=1}^N\phi_r^2(\Delta_r f(X_j))\Bigr)^{1/2}\Bigr\|_{L_q}.$$

Proof. The expectations appearing in $Z$ and $Z^{(j)}$ are the same, and by the triangle inequality,
$$|Z-Z^{(j)}|\le\int_{\mathbb{R}}\Bigl|[P_N-P_N^{(j)}](S_r(f)\le t)-[P_N-P_N^{(j)}](S_{r-1}(f)\le t)\Bigr|\,dt$$
$$=\frac{1}{N}\int_{\mathbb{R}}\Bigl|\bigl(\mathbb{1}_{\{S_r(f)(X_j)\le t\}}-\mathbb{1}_{\{S_r(f)(X_j')\le t\}}\bigr)-\bigl(\mathbb{1}_{\{S_{r-1}(f)(X_j)\le t\}}-\mathbb{1}_{\{S_{r-1}(f)(X_j')\le t\}}\bigr)\Bigr|\,dt.$$
Next, $S_r(f)=S_{r-1}(f)+\phi_r(\Delta_r f)$; therefore $\{S_r(f)\le t\}=\{S_{r-1}(f)+\phi_r(\Delta_r f)\le t\}$, and the Lebesgue measure of the set
$$\bigl\{t\in\mathbb{R} : \mathbb{1}_{\{S_r(f)(X_j)\le t\}}-\mathbb{1}_{\{S_{r-1}(f)(X_j)\le t\}}\ne 0\bigr\}$$
is $|\phi_r(\Delta_r f(X_j))|$.
Using an identical argument for $X_j'$ in place of $X_j$, it is evident that
$$|Z-Z^{(j)}|\le\frac{1}{N}\bigl(|\phi_r(\Delta_r f(X_j))|+|\phi_r(\Delta_r f(X_j'))|\bigr),$$
and as $(a+b)^2\le 2(a^2+b^2)$ for $a,b\in\mathbb{R}$, and $\|\phi_r(\Delta_r f)\|_{L_2}\le\|\Delta_r f\|_{L_2}$, we have
$$\mathbb{E}\bigl[(Z-Z^{(j)})^2\,\big|\,(X_i)_{i=1}^N\bigr]\le\frac{2}{N^2}\bigl(\phi_r^2(\Delta_r f(X_j))+\|\Delta_r f\|^2_{L_2}\bigr).$$
The claim now follows from the sub-additivity of $a\mapsto\sqrt{a}$. □

With Lemma 3.13 in mind, all that remains is to establish a satisfactory estimate on $\bigl\|\bigl(\sum_{j=1}^N\phi_r^2(\Delta_r f(X_j))\bigr)^{1/2}\bigr\|_{L_q}$. To that end, we use the following bound on moments of sums of independent random variables established in [15]:

Theorem 3.14 ([15, Corollary 1]). Let $Y$ be a nonnegative random variable and set $Y_1,\dots,Y_N$ to be independent copies of $Y$. Then, for every $q\ge 1$,
$$\Bigl\|\sum_{j=1}^N Y_j\Bigr\|_{L_q}\sim\sup\Bigl\{\frac{q}{p}\Bigl(\frac{N}{q}\Bigr)^{1/p}\|Y\|_{L_p} : \max\Bigl\{1,\frac{q}{N}\Bigr\}\le p\le q\Bigr\}.$$

We also need the standard observation that if $\|Y\|_{L_\infty}\le M$ and $\|Y\|_{L_1}\le m$, then for every $p\ge 1$,
$$\|Y\|_{L_p}\le M\Bigl(\frac{m}{M}\Bigr)^{1/p}. \tag{3.8}$$
Indeed, $|Y|^p=|Y|^{p-1}|Y|\le M^{p-1}|Y|$, and the claim follows by taking expectations.

Lemma 3.15. There is an absolute constant $c$ such that for $q=2^{r+5}$,
$$\Bigl\|\Bigl(\sum_{j=1}^N\phi_r^2(\Delta_r f(X_j))\Bigr)^{1/2}\Bigr\|_{L_q}\le c\,\|\Delta_r f\|_{L_2}\sqrt{N}.$$

Proof. Thanks to Jensen's inequality, it suffices to show that
$$\Bigl\|\sum_{j=1}^N\phi_r^2(\Delta_r f(X_j))\Bigr\|_{L_q}\le c\,\|\Delta_r f\|^2_{L_2}N.$$
Set $Y=\phi_r^2(\Delta_r f(X))$ and put $Y_j=\phi_r^2(\Delta_r f(X_j))$. By the definition of $\phi_r$,
$$\|Y\|_{L_1}\le\|\Delta_r f\|^2_{L_2} \quad\text{and}\quad \|Y\|_{L_\infty}\le\|\Delta_r f\|^2_{L_2}\frac{N}{2^r},$$
and by (3.8), for any $p\ge 1$,
$$\|Y\|_{L_p}\le\|\Delta_r f\|^2_{L_2}\frac{N}{2^r}\Bigl(\frac{2^r}{N}\Bigr)^{1/p}.$$
In particular, for $q = 2^{r+5}$ we have that
\[
\sup\Bigl\{ \frac{q}{p}\Bigl(\frac{N}{q}\Bigr)^{1/p} \|Y\|_{L_p} : \max\Bigl\{1, \frac{q}{N}\Bigr\} \le p \le q \Bigr\}
\le 2^5 \|\Delta_r f\|_{L_2}^2 \cdot \sup\Bigl\{ \frac{2^r}{p}\Bigl(\frac{N}{2^r}\Bigr)^{1/p} \cdot \frac{N}{2^r}\Bigl(\frac{2^r}{N}\Bigr)^{1/p} : 1 \le p \le 2^{r+5} \Bigr\} = 2^5 \|\Delta_r f\|_{L_2}^2 \cdot N,
\]
and the claim follows from Theorem 3.14. □

With all the ingredients in place, the proof of Lemma 3.8 is evident:

Proof of Lemma 3.8. Set $q = 2^{r+5}$. By Theorem 3.11 and Markov's inequality, with probability at least $1 - \exp(-2^{r+5})$,
\[
Z \le \mathbb{E}Z + c_1 \sqrt{2^r}\, \|\sqrt{V}\|_{L_q}.
\]
Moreover, by Lemma 3.13 and Lemma 3.15,
\[
\|\sqrt{V}\|_{L_q} \le c_2 \frac{\|\Delta_r f\|_{L_2}}{\sqrt{N}},
\]
and using Lemma 3.10,
\[
\mathbb{E}Z \le c_3 \frac{d_F \log(e/\Delta)}{\sqrt{N}}. \qquad \square
\]

3.4. Controlling $E_2(f)$. Turning to the starting point of each chain, we have the following:

Lemma 3.16. There is an absolute constant $c$ such that, with probability at least $1 - \exp(-4\Delta N)$, for every $f \in F$,
\[
E_2(f) \le c d_F \Bigl( \frac{\log(1/\Delta)}{\sqrt{N}} + \sqrt{\Delta} \Bigr).
\]

Lemma 3.16 follows from the same arguments used in the proof of Lemma 3.8. In fact, it is slightly simpler, as no chaining is required. We therefore omit the details.

3.5. Putting it all together: proof of Theorem 2.2. Let $(X_i)_{i=1}^N \in \Omega_0$. By Lemma 3.5, for every $f \in F$,
\[
W_1(F_{N,f}, F_f) \le W_1(F_{N,\pi_{s_1} f}, F_{\pi_{s_1} f}) + c_1 \frac{B\theta}{\sqrt{N}}.
\]
It follows from Lemma 3.6 that for every $f \in F$,
\[
W_1(F_{N,\pi_{s_1} f}, F_{\pi_{s_1} f}) \le W_1(F_{N,T(f)}, F_{T(f)}) + c_2 B^2 \Bigl( d_F \sqrt{\Delta} + \frac{\gamma_2(F)}{\sqrt{N}} \Bigr),
\]
and by (3.6),
\[
W_1(F_{N,T(f)}, F_{T(f)}) \le E_1(f) + E_2(f).
\]
Using Lemma 3.7, with $\mu^{\otimes N}$-probability at least $1 - \exp(-4\Delta N)$, for every $f \in F$,
\[
E_1(f) \le c_3 \Bigl( \frac{d_F \log^2(e/\Delta)}{\sqrt{N}} + \frac{\gamma_2(F)}{\sqrt{N}} \Bigr),
\]
and in a similar fashion, by Lemma 3.16, with $\mu^{\otimes N}$-probability at least $1 - \exp(-4\Delta N)$, for every $f \in F$,
\[
E_2(f) \le c_4 d_F \Bigl( \frac{\log(e/\Delta)}{\sqrt{N}} + \sqrt{\Delta} \Bigr).
\]
The proof follows by recalling that $\Delta N \ge c_5(B) \max\{\theta^2/d_F^2, \log^4(N)\}$ and that $\theta \ge c_6 \gamma_2(F)$.
□

We end this section with the proof of the optimality in the gaussian case.

3.6. Proof of Lemma 2.5. By (3.1) and choosing the 1-Lipschitz functions $\varphi = \pm \mathrm{id}$, we have that
\[
W_1(F_{N,x}, F_x) \ge \Bigl| \frac{1}{N} \sum_{i=1}^N \langle G_i, x\rangle - \mathbb{E}\langle G, x\rangle \Bigr| = G(x).
\]
The first claim in Lemma 2.5 follows from a standard gaussian lower tail bound: setting $g$ to be a standard gaussian random variable, for every $x \in S^{d-1}$, $G(x)$ has the same distribution as $|g|/\sqrt{N}$.

For the second claim, note that $\sup_{x \in A} G(x)$ has the same distribution as $\frac{1}{\sqrt{N}} \sup_{x \in A} \langle G, x\rangle$. By gaussian concentration (see, e.g., Theorem 4.7 in [23]), for every $u \ge 0$, with probability at least $1 - 2\exp(-\frac{1}{4}u^2)$,
\[
\sup_{x \in A} \langle G, x\rangle \ge \mathbb{E}\sup_{x \in A} \langle G, x\rangle - u,
\]
which completes the proof. □

4. Proof of Theorems 1.2 and 2.6 and optimality

The proofs of Theorem 1.2 and Theorem 2.6 follow immediately from Corollary 2.8 and Theorem 2.2, respectively, and the following observation (applied with $p = 2$), recalling that $q_i(f) = F_f^{-1}(\frac{i}{N+1})$.

Lemma 4.1. For every function $f$, every $p \ge 1$, and every realization of $(X_i)_{i=1}^N$,
\[
\Bigl| W_1(F_{N,f}, F_f) - \frac{1}{N} \sum_{i=1}^N \bigl| f^\sharp(X_i) - q_i(f) \bigr| \Bigr| \le \frac{10 \|f\|_{L_p}}{N^{1-1/p}}.
\]

Proof. Observe that $F_{N,f}^{-1}(u) = f^\sharp(X_i)$ for $u \in [\frac{i-1}{N}, \frac{i}{N})$ and therefore, by (2.2),
\[
W_1(F_{N,f}, F_f) = \int_0^1 \bigl| F_{N,f}^{-1}(u) - F_f^{-1}(u) \bigr| \, du = \sum_{i=1}^N \int_{\frac{i-1}{N}}^{\frac{i}{N}} \bigl| f^\sharp(X_i) - F_f^{-1}(u) \bigr| \, du.
\]
Hence, to complete the proof it suffices to show that
\[
(*) = \sum_{i=1}^N \int_{\frac{i-1}{N}}^{\frac{i}{N}} \bigl| q_i(f) - F_f^{-1}(u) \bigr| \, du \le \frac{10 \|f\|_{L_p}}{N^{1-1/p}}.
\]
To that end, fix $2 \le i \le N-1$.
Since $\frac{i}{N+1} \in [\frac{i-1}{N}, \frac{i}{N})$, the monotonicity of $F_f^{-1}$ implies that for every $u \in [\frac{i-1}{N}, \frac{i}{N})$,
\[
\bigl| q_i(f) - F_f^{-1}(u) \bigr| \le F_f^{-1}\Bigl(\frac{i}{N}\Bigr) - F_f^{-1}\Bigl(\frac{i-1}{N}\Bigr),
\]
and therefore, summing the resulting telescoping bounds,
\[
\sum_{i=2}^{N-1} \int_{\frac{i-1}{N}}^{\frac{i}{N}} \bigl| q_i(f) - F_f^{-1}(u) \bigr| \, du \le \frac{1}{N} \Bigl( F_f^{-1}\Bigl(1 - \frac{1}{N}\Bigr) - F_f^{-1}\Bigl(\frac{1}{N}\Bigr) \Bigr).
\]
Set $M = \|f\|_{L_p} N^{1/p}$. By Markov's inequality, $\mathbb{P}(|f(X)| \ge 2M) < \frac{1}{N}$ and hence $|F_f^{-1}(\frac{1}{N})| \le 2M$; similarly, the terms $|F_f^{-1}(1 - \frac{1}{N})|$, $|q_1(f)|$ and $|q_N(f)|$ are all bounded by $2M$. Moreover, by Hölder's inequality,
\[
\int_0^{\frac{1}{N}} |F_f^{-1}(u)| \, du \le \Bigl( \int_0^1 \mathbf{1}_{[0,\frac{1}{N}]}(u) \, du \Bigr)^{1-1/p} \|f\|_{L_p} = \frac{M}{N},
\]
and the analogous estimate holds for $\int_{1-\frac{1}{N}}^1 |F_f^{-1}(u)| \, du$. It follows that $(*) \le \frac{10M}{N}$, as claimed. □

Finally, we present the claims made in the introduction on the optimality of Theorem 1.2. Here $q_i = F_g^{-1}(\frac{i}{N+1})$ and $F_g$ is the standard gaussian distribution function.

Lemma 4.2. There are absolute constants $c_1, c_2, c_3$ for which the following holds. Let $X = G$ be the standard gaussian vector in $\mathbb{R}^d$, let $A \subset S^{d-1}$ be a symmetric set and let $N \ge c_1$. Then for every $\Delta \ge \frac{1}{N}$ and $x \in A$, with probability at least $\exp(-c_2 \Delta N)$,
\[
\frac{1}{N} \sum_{i=1}^N \bigl| \langle X_i, x\rangle^\sharp - q_i \bigr| \ge c_3 \sqrt{\Delta}.
\]
Moreover, with probability at least $0.99$,
\[
\sup_{x \in A} \frac{1}{N} \sum_{i=1}^N \bigl| \langle X_i, x\rangle^\sharp - q_i \bigr| \ge c_3 \frac{\mathbb{E}\sup_{x \in A} \langle G, x\rangle}{\sqrt{N}}.
\]

Proof. An application of Lemma 4.1 (with $p = 4$) shows that for every $x \in A$,
\[
\frac{1}{N} \sum_{i=1}^N \bigl| \langle X_i, x\rangle^\sharp - q_i \bigr| \ge W_1(F_{N,x}, F_x) - \frac{c_1}{N^{3/4}}. \tag{4.1}
\]
Both claims follow from Lemma 2.5. Indeed, note that $\sqrt{\Delta} \ge \frac{1}{\sqrt{N}}$, and by the symmetry of $A$ we have $\mathbb{E}\sup_{x \in A} \langle G, x\rangle \ge \mathbb{E}|\langle G, x_0\rangle| = \sqrt{2/\pi}$ for any $x_0 \in A$. Consequently, the error term $c_1 N^{-3/4}$ in (4.1) can be absorbed into $\sqrt{\Delta}$ and $\frac{1}{\sqrt{N}} \mathbb{E}\sup_{x \in A} \langle G, x\rangle$, respectively, provided that $N \ge c_2$. □

Appendix A.
An application of Theorem 2.2

Theorem 2.2, combined with the Lipschitz representation of $W_1$ (see (3.1)), has an immediate outcome. To formulate it, let $\mathcal{L}$ denote the set of all 1-Lipschitz functions from $\mathbb{R}$ to $\mathbb{R}$.

Theorem A.1. In the event where (2.1) holds, we have that
\[
\sup_{\varphi \in \mathcal{L}} \sup_{f \in F} \Bigl( \frac{1}{N} \sum_{i=1}^N \varphi(f(X_i)) - \mathbb{E}\varphi(f(X)) \Bigr) \le d_F \sqrt{\Delta}.
\]

Two consequences of Theorem A.1 are worth mentioning. First, assume that $(f(X))_{f \in F}$ is a gaussian process and that $\mathbb{E}\sup_{f \in F} G_f \ge \log^2 N$. Then Theorem A.1 yields the following uniform version of the contraction principle (see, e.g., [16] for the classical result):
\[
\mathbb{E} \sup_{\varphi \in \mathcal{L}} \sup_{f \in F} \Bigl( \frac{1}{N} \sum_{i=1}^N \varphi(f(X_i)) - \mathbb{E}\varphi(f(X)) \Bigr) \sim \mathbb{E}\sup_{f \in F} \Bigl| \frac{1}{N} \sum_{i=1}^N f(X_i) - \mathbb{E}f(X) \Bigr|.
\]

A second application of Theorem A.1 addresses a central challenge in empirical process theory: establishing gaussian concentration in non-gaussian settings and under minimal structural assumptions on the indexing class. To that end, suppose that $F$ satisfies the one-sided coarse estimate of Assumption 2.1 with $\theta \sim \mathbb{E}\sup_{f \in F} G_f$ and $\theta \ge \log^2(N)$. Applying Theorem A.1 with $\varphi = \pm \mathrm{id}$ yields that, with high probability,
\[
\sup_{f \in F} \Bigl| \frac{1}{N} \sum_{i=1}^N f(X_i) - \mathbb{E}f(X) \Bigr| \le \frac{d_F\, \mathbb{E}\sup_{f \in F} G_f}{\sqrt{N}}.
\]
Thus, a 'weak' one-sided structural condition leads to a gaussian estimate on the empirical process.

Appendix B. On Assumption 2.1

B.1. Proof of Lemma 2.7. Recall that $d_F = \sup_{f \in F} \|f\|_{L_2}$ and $d^*(F) \sim \gamma_2^2(F)/d_F^2$, and set
\[
H = \mathrm{conv}\bigl( \{ f - \tilde{f} : f, \tilde{f} \in F \} \bigr).
\]
Clearly $F \subset H$, $0 \in H$, and $d_H \le 2 d_F$. Moreover, a standard consequence of Talagrand's majorizing measures theorem is that $\gamma_2(H) \sim \gamma_2(F)$; thus $d^*(H) \sim d^*(F)$. Let $r \ge \gamma_2(F)/\sqrt{N}$ and put $H_r = \{ h \in H : \|h\|_{L_2} \le r \}$. It suffices to show that, with exponentially high probability,
\[
\sup_{h \in H_r} \|h\|_{L_2(P_N)} \le c_1 L r.
\]
If that is true, the lemma follows because $H$ is convex and contains $0$, and therefore is star-shaped around $0$: for $h \in H$ and $u \in [0,1]$, $uh \in H$.

To that end, an application of Theorem 1.13 in [19] and the preceding discussion therein shows that, for every $\lambda \ge 1$, with probability at least $1 - 2\exp(-c_2 \lambda^2 d^*(H_r))$,
\[
\sup_{h \in H_r} \Bigl| \frac{1}{N} \sum_{i=1}^N h^2(X_i) - \|h\|_{L_2}^2 \Bigr| \le c_3 L^2 \Bigl( \frac{\lambda d_{H_r} \gamma_2(H_r)}{\sqrt{N}} + \frac{\lambda^2 \gamma_2^2(H_r)}{N} \Bigr) = E_\lambda.
\]
Clearly $d_{H_r} \le r$ and $\gamma_2(H_r) \le \gamma_2(H) \sim \gamma_2(F)$. Setting $\lambda = c_4 r \sqrt{N}/\gamma_2(H_r)$, we have that $\lambda \ge 1$. Moreover, $E_\lambda \le c_5 L^2 r^2$, and using that $d_{H_r} \sim \min\{r, d_F\}$,
\[
\lambda^2 d^*(H_r) = \frac{c_4^2 r^2 N}{d_{H_r}^2} \sim \max\Bigl\{ N, \frac{r^2 N}{d_F^2} \Bigr\},
\]
from which the claim follows. □

We proceed to show that Assumption 2.1 is satisfied for other, heavy-tailed classes $F$ as well, starting with:

B.2. Classes of linear functionals indexed by $S^{d-1}$. Let $F = F_{S^{d-1}} = \{ \langle \cdot, x\rangle : x \in S^{d-1} \}$ be the class of linear functionals indexed by the entire sphere, and let $X \in \mathbb{R}^d$ be a zero-mean random vector that satisfies $L_q$-$L_2$ norm-equivalence with constant $L$: for every $x \in \mathbb{R}^d$,
\[
\| \langle X, x\rangle \|_{L_q} \le L \| \langle X, x\rangle \|_{L_2}.
\]
If $X$ is isotropic and $\|X\|_2$ is a well-behaved random variable, then under such a norm-equivalence the random matrix $\frac{1}{\sqrt{N}} \sum_{i=1}^N \langle X_i, \cdot\rangle e_i$ satisfies the quantitative Bai-Yin asymptotics [3]; namely,
\[
\zeta_{d,N} = \sup_{x \in S^{d-1}} \Bigl| \frac{1}{N} \sum_{i=1}^N \langle X_i, x\rangle^2 - \mathbb{E}\langle X, x\rangle^2 \Bigr| \lesssim \sqrt{\frac{d}{N}}.
\]
The following result is an immediate outcome of [26, Corollary 2] and is the current state of the art on such Bai-Yin type estimates.

Theorem B.1. For every $L, R \ge 1$ and $q > 4$ there is a constant $c = c(q, L)$ such that the following holds.
If $X$ is isotropic and satisfies $L_q$-$L_2$ norm-equivalence with constant $L$, $N \ge 2d$, and $\max_{i \le N} \|X_i\|_2 \le R (dN)^{1/4}$ with probability $1 - \delta$, then, with probability at least $1 - \frac{1}{d} - \delta$,
\[
\zeta_{d,N} \le c R^2 \sqrt{\frac{d}{N}}.
\]

Note that since $\mathbb{E}\|X\|_2^2 = d$ and $N \ge d$, the condition that $\max_{i \le N} \|X_i\|_2 \le R (dN)^{1/4}$ is naturally satisfied, with constants $R$ and $\delta$ depending on the tail decay exhibited by $\|X\|_2$. In fact, there are versions of Theorem B.1 that hold with higher probability, but we will not pursue that aspect here.

A direct consequence of Theorem B.1 is that:

Corollary B.2. Let $\Omega_0$ be the event in which the assertion of Theorem B.1 holds and assume that $N \ge 16 R^4 c^2 d$. Then Assumption 2.1 is satisfied with $B = 2$ and $\theta = 0$.

A more interesting question is whether a dimension-free analogue of Corollary B.2 is true for non-isotropic $X$, with a condition on $N$ in terms of the covariance matrix of $X$ rather than the assumption that $N \gtrsim d$. While this does not follow immediately from the dimension-free Bai-Yin theorem recently established in [1] (see also [13]), such an estimate happens to be true.

To formulate the result, set $\Sigma = \mathrm{Cov}[X]$, let $\mathrm{tr}(\Sigma)$ be its trace, denote by $\|\Sigma\|_{\mathrm{op}}$ its operator norm, and put $\mathrm{r}(\Sigma) = \mathrm{tr}(\Sigma)/\|\Sigma\|_{\mathrm{op}}$, i.e., the effective rank of $\Sigma$. It is standard to verify that $\mathrm{tr}(\Sigma) \sim \gamma_2^2(S^{d-1})$ and that $\mathrm{r}(\Sigma) \sim d^*(S^{d-1})$. Moreover, in the context of establishing Assumption 2.1, one may assume without loss of generality that $\Sigma$ is diagonal with decreasing, strictly positive eigenvalues $\lambda_i = \lambda_i(\Sigma)$ for $i = 1, \dots, d$. We denote by $P_{1:N} : \mathbb{R}^d \to \mathbb{R}^d$ the $\ell_2$-projection onto $\mathrm{span}(e_1, \dots, e_{\min\{d,N\}})$, and since $\Sigma$ is diagonal, $P_{1:N}$ is also the $L_2$-projection.

Lemma B.3. For every $L, R \ge 1$ and $q > 4$ there are constants $c_1, c_2, c_3$ depending on $q$ and $L$ such that the following holds.
Assume that $X$ satisfies $L_q$-$L_2$ norm-equivalence with constant $L$ and that, with probability $1 - \delta$,

(i) $\max_{i \le N} \| P_{1:N} \Sigma^{-1/2} X_i \|_2 \le R \sqrt{N}$,
(ii) $\max_{i \le N} \| X_i \|_2 \le R \sqrt{\mathrm{tr}(\Sigma)}$.

If $N \ge c_1 (\mathrm{r}(\Sigma) + R^4)$, then with probability at least $1 - \frac{c_2}{N} - 2\delta$, for every $x \in \mathbb{R}^d$ satisfying $\|x\|_2 \le 1$,
\[
\| \langle X, x\rangle \|_{L_2(P_N)} \le c_3 R \Bigl( \| \langle X, x\rangle \|_{L_2} + \sqrt{\frac{\mathrm{tr}(\Sigma)}{N}} \Bigr). \tag{B.1}
\]

To put conditions (i) and (ii) into some perspective, note that $\mathbb{E}\|X\|_2^2 = \mathrm{tr}(\Sigma)$ and $\mathbb{E}\| P_{1:N} \Sigma^{-1/2} X \|_2^2 = N$; hence (i) and (ii) are naturally satisfied, with constants $\delta$ and $R$ depending on the tail behaviour of $X$ and $\Sigma^{-1/2} X$. That dependence is the reason why we kept $R$ explicit in Lemma B.3.

Corollary B.4. Let $\Omega_0$ be the event in which (B.1) holds; then Assumption 2.1 is satisfied with $B = c_3 R$ and $\theta = c_3 R \sqrt{\mathrm{tr}(\Sigma)}$. In particular, if $\Delta N \ge c_1 \max\{ \mathrm{r}(\Sigma), \log^4(N) \}$, then with probability at least $1 - \frac{c_2}{N} - 2\delta - \exp(-c_3 \Delta N)$,
\[
\sup_{x \in S^{d-1}} W_1(F_{N,x}, F_x) \le \sqrt{\lambda_1 \Delta}.
\]

Proof of Lemma B.3. Let $\beta$ be the absolute constant appearing in Theorem B.5 below and set $\eta = \frac{1}{2\beta} \le 1$. Put $K = \eta N$ and assume without loss of generality that $K$ is an integer. Set $L = \mathrm{span}(e_1, \dots, e_K)$ and $M = \mathrm{span}(e_{K+1}, \dots, e_d)$ (if $K \ge d$, then $M = \emptyset$); $P_L$ and $P_M$ are the $\ell_2$-projections onto $L$ and $M$, respectively.

The estimates in the cases $x \in L$ and $x \in M$ are different: we show that, with probability at least $1 - \frac{1}{K} - \delta$, for every $x \in L$,
\[
\| \langle X, x\rangle \|_{L_2(P_N)} \le c_1(q, L) R \| \langle X, x\rangle \|_{L_2}, \tag{B.2}
\]
and that, with probability at least $1 - c_2(q, L)\frac{1}{N} - \delta$,
\[
\sup_{x \in M : \|x\|_2 \le 1} \sum_{i=1}^N \langle X_i, x\rangle^2 \le c_3(q, L) R^2 \mathrm{tr}(\Sigma). \tag{B.3}
\]

Step 1 (the estimate on the subspace $L$): Set $Y_i = P_L \Sigma^{-1/2} X_i$ and recall that, with probability $1 - \delta$,
\[
\max_{i \le N} \|Y_i\|_2 \le \max_{i \le N} \| P_{1:N} \Sigma^{-1/2} X_i \|_2 \le R\sqrt{N} = R \eta^{-1/4} (KN)^{1/4}.
\]
It follows from Theorem B.1 that, with probability at least $1 - \frac{1}{K} - \delta$,
\[
\sup_{x \in L : \|x\|_2 = 1} \Bigl| \frac{1}{N} \sum_{i=1}^N \langle Y_i, x\rangle^2 - 1 \Bigr| \le c_4(q, L) R^2 \eta^{-1/2} \sqrt{\frac{K}{N}} = c_4 R^2.
\]
On that event, for every $x \in L$ satisfying $\|x\|_2 = 1$,
\[
\| \langle Y, x\rangle \|_{L_2(P_N)}^2 \le c_4 R^2 + 1 = (c_4 R^2 + 1) \| \langle Y, x\rangle \|_{L_2}^2,
\]
and positive homogeneity shows that for every $x \in L$,
\[
\| \langle Y, x\rangle \|_{L_2(P_N)} \le \sqrt{c_4 R^2 + 1}\, \| \langle Y, x\rangle \|_{L_2}.
\]
Hence (B.2) follows by transforming back to the original coordinate system.

Step 2 (the estimate on $M$): Set $Z = P_M X$ and $Z_i = P_M X_i$, and put $\Sigma_Z = \mathrm{Cov}[Z]$. Observe that, since $(\lambda_i)_{i=1}^d$ is non-increasing,
\[
\|\Sigma_Z\|_{\mathrm{op}} = \lambda_{K+1} \le \frac{1}{K} \sum_{i=1}^K \lambda_i \le \frac{\mathrm{tr}(\Sigma)}{K}.
\]
Assume first that $\lambda_{K+1} \ge \frac{1}{2} \frac{\mathrm{tr}(\Sigma)}{K}$; we explain the minor changes needed in the general case at the end of the proof. By interchanging two suprema,
\[
\sup_{x \in M : \|x\|_2 \le 1} \sum_{i=1}^N \langle Z_i, x\rangle^2 = \sup_{x \in M : \|x\|_2 \le 1} \sup_{a \in S^{N-1}} \Bigl( \sum_{i=1}^N a_i \langle Z_i, x\rangle \Bigr)^2 = \sup_{a \in S^{N-1}} \Bigl\| \sum_{i=1}^N a_i Z_i \Bigr\|_2^2 = (*).
\]
Following the notation used in [1] and setting $[N] = \{1, \dots, N\}$, let
\[
f(s, [N]) = \sup_{a \in S^{N-1} :\, |\{ i \le N : a_i \neq 0 \}| \le s} \Bigl\| \sum_{i=1}^N a_i Z_i \Bigr\|_2^2
\]
for $s \ge 1$. Note that $(*) = f(N, [N])$ and that $f(N, [N]) \le \frac{1}{\eta^2} f(K, [N])$; hence, to establish (B.3) it remains to show that, with high probability, $f(K, [N]) \lesssim R^2 \mathrm{tr}(\Sigma)$. To that end, we apply [1, Theorem 3]:

Theorem B.5 ([1, Theorem 3]). There is an absolute constant $\beta \ge 1$ and there are constants $c, c'$ depending only on $q$ and $L$ such that the following holds: for all integers $N$ and $s$ that satisfy $\frac{1}{\beta} N \ge s \ge \mathrm{r}(\Sigma_Z)$, with probability at least $1 - \frac{c}{N}$,
\[
f(s, [N]) \le c' \Bigl( \max_{i \le N} \|Z_i\|_2^2 + \|\Sigma_Z\|_{\mathrm{op}} \cdot s \Bigl(\frac{N}{s}\Bigr)^{4/(4+q)} \log^4\Bigl(\frac{N}{s}\Bigr) \Bigr).
\]

We claim that the condition $\frac{1}{\beta} N \ge s \ge \mathrm{r}(\Sigma_Z)$ in Theorem B.5 is satisfied for $s = 2K$.
Indeed, since $\|\Sigma_Z\|_{\mathrm{op}} = \lambda_{K+1} \ge \frac{\mathrm{tr}(\Sigma)}{2K}$, we have
\[
\mathrm{r}(\Sigma_Z) = \frac{\mathrm{tr}(\Sigma_Z)}{\|\Sigma_Z\|_{\mathrm{op}}} \le \frac{\mathrm{tr}(\Sigma)}{\lambda_{K+1}} \le 2K,
\]
and the claim follows since $K = \eta N$ and $\eta = \frac{1}{2\beta}$. Moreover, using that $\|\Sigma_Z\|_{\mathrm{op}} \le \frac{\mathrm{tr}(\Sigma)}{K}$, that $\|Z_i\|_2 \le \|X_i\|_2$, and that $\max_{i \le N} \|X_i\|_2^2 \le R^2 \mathrm{tr}(\Sigma)$ with probability $1 - \delta$ by assumption, it is straightforward to verify that, with probability $1 - \delta$,
\[
\max_{i \le N} \|Z_i\|_2^2 + \|\Sigma_Z\|_{\mathrm{op}} \cdot 2K \Bigl(\frac{N}{2K}\Bigr)^{4/(4+q)} \log^4\Bigl(\frac{N}{2K}\Bigr) \le c_5(q, L) R^2 \mathrm{tr}(\Sigma).
\]
Finally, $f(K, [N]) \le f(2K, [N])$, which concludes the proof of (B.3).

Step 3 (putting the estimates together): On the intersection of the events in which the assertions of Step 1 and Step 2 hold, let $x \in \mathbb{R}^d$ with $\|x\|_2 \le 1$. Write $x = y + z \in L \oplus M$ and note that $\| \langle X, y\rangle \|_{L_2} \le \| \langle X, x\rangle \|_{L_2}$. Therefore,
\[
\| \langle X, x\rangle \|_{L_2(P_N)} \le \| \langle X, y\rangle \|_{L_2(P_N)} + \| \langle X, z\rangle \|_{L_2(P_N)} \le c_6(q, L) R \Bigl( \| \langle X, y\rangle \|_{L_2} + \sqrt{\frac{\mathrm{tr}(\Sigma)}{N}} \Bigr),
\]
as claimed.

Step 4 (on the assumption that $\lambda_{K+1} \ge \frac{\mathrm{tr}(\Sigma)}{2K}$): If the assumption is not satisfied, we proceed as follows. Set $\xi = \frac{\mathrm{tr}(\Sigma)}{K}$, let $B$ be the standard Bernoulli random vector in $\mathbb{R}^K$, independent of $X$, and put $\widetilde{X} = (X, \sqrt{\xi}\, B) \in \mathbb{R}^{d+K}$. Setting $\widetilde{\Sigma} = \mathrm{Cov}[\widetilde{X}]$, we have $\mathrm{tr}(\widetilde{\Sigma}) = 2\,\mathrm{tr}(\Sigma)$ and $\|\widetilde{\Sigma}\|_{\mathrm{op}} \ge \|\Sigma\|_{\mathrm{op}}$; thus $\mathrm{r}(\widetilde{\Sigma}) \le 2\,\mathrm{r}(\Sigma)$ and $\lambda_K(\widetilde{\Sigma}) \ge \xi = \frac{\mathrm{tr}(\widetilde{\Sigma})}{2K}$. Moreover, $\widetilde{X}$ satisfies $L_q$-$L_2$ norm-equivalence with constant $L + c(q)$, and, almost surely,
\[
\|\widetilde{X}\|_2 \le \|X\|_2 + \sqrt{\mathrm{tr}(\Sigma)} \quad \text{and} \quad \| P_{1:N} \widetilde{\Sigma}^{-1/2} \widetilde{X} \|_2 \le \| P_{1:N} \Sigma^{-1/2} X \|_2 + \sqrt{K}.
\]
We repeat Steps 1-3 for $\widetilde{X}$. □

B.3. Classes of linear functionals indexed by $A \subset \mathbb{R}^d$. Let $F = F_A = \{ \langle \cdot, x\rangle : x \in A \}$, where $A \subset \mathbb{R}^d$ and $X \in \mathbb{R}^d$ is isotropic.
Let $c_0$ be a suitable absolute constant, and set $w$ to be a random variable with zero mean and unit variance that satisfies the following 'local' moment condition: for some $\alpha > 0$,
\[
\|w\|_{L_p} \le L p^{1/\alpha} \|w\|_{L_2} \quad \text{for } 2 \le p \le c_0 \log(ed), \tag{B.4}
\]
and put $X = (w_1, \dots, w_d)$, where the $w_i$'s are independent copies of $w$.

Remark B.6. Note that (B.4) is indeed much weaker than (2.3); for example, $w$ need not have any moments beyond $c_0 \log(ed)$.

If the random vector $X$ satisfies the local moment condition and $A$ is not too small, in the sense that its critical dimension is at least logarithmic in $d$, then Assumption 2.1 is satisfied:

Lemma B.7. There are absolute constants $c_0, c_1, c_2$ and, for every $L, \alpha > 0$, a constant $c_3 = c_3(L, \alpha)$ such that the following holds. Suppose that $w$ satisfies (B.4) with $L$ and $c_0$, that $A \subset \mathbb{R}^d$ satisfies $0 \in A$ and $d^*(A) \ge c_1 \log(ed)$, and that $N \ge \log^{4/\alpha + 1}(ed)$. Put
\[
\xi = \log^{2/\alpha + 1}\Bigl( \frac{ed}{d^*(A)} \Bigr).
\]
Then Assumption 2.1 is satisfied with a set $\Omega_0$ of measure $\mathbb{P}(\Omega_0) \ge 1 - \exp(-c_2 d^*(A)) - d^{-10}$ and constants $B = c_3$ and $\theta = c_3 \xi \gamma_2(A)$.

Corollary B.8. In the setting of Lemma B.7 and using its notation: with probability at least $1 - \exp(-c_2 d^*(A)) - d^{-10}$,
\[
\sup_{x \in A} W_1(F_{N,x}, F_x) \le c_4(L, \alpha) \frac{\max\{ \xi \gamma_2(A), \log^2(N) \}}{\sqrt{N}}.
\]

The proof of Lemma B.7 resembles that of Lemma 2.7 but requires some preparation. Denoting by $\Gamma$ the matrix that has the $X_i$'s as its rows, the central ingredient in the proof of Lemma 2.7 was to establish appropriate estimates on $\sup_x \bigl| \| \frac{1}{\sqrt{N}} \Gamma x \|_2^2 - \|x\|_2^2 \bigr|$, where the supremum is taken over $x$ in localizations of $A$. To establish those estimates, we rely on the results in [4] on structure preservation by random matrices with i.i.d. columns. Note that since each $X$ has i.i.d. coordinates, the columns of $\Gamma$ are i.i.d.; we denote them by $Z_1, \dots, Z_d \in \mathbb{R}^N$, and thus $\Gamma = \sum_{j=1}^d \langle \cdot, e_j\rangle Z_j$.
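As a loose numerical illustration of this setup (a sketch under assumed toy choices, not part of the paper's argument), one can check the local moment condition (B.4) for a bounded variable and observe the i.i.d. columns of $\Gamma$ directly. The uniform variable on $[-\sqrt{3}, \sqrt{3}]$ (zero mean, unit variance) and the constants `L`, `alpha` below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 500, 2000

# A bounded zero-mean, unit-variance w trivially satisfies (B.4):
# its L_p norms are uniformly bounded by sqrt(3).
w = rng.uniform(-np.sqrt(3), np.sqrt(3), size=10**6)
L, alpha = np.sqrt(3), 1.0  # hypothetical constants for this toy w
for p in [2, 4, 8, int(np.log(np.e * d))]:
    lp_norm = np.mean(np.abs(w) ** p) ** (1.0 / p)  # empirical ||w||_{L_p}
    assert lp_norm <= L * p ** (1.0 / alpha)         # (B.4) holds comfortably

# Gamma has rows X_1, ..., X_N with i.i.d. coordinates, so its columns
# Z_1, ..., Z_d are i.i.d. copies of Z = (w_1, ..., w_N) and the
# normalized column norms ||Z_j||_2^2 / N concentrate around 1.
Gamma = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, d))
col_norms = (Gamma ** 2).sum(axis=0) / N
print("max deviation of ||Z_j||_2^2/N from 1:", np.max(np.abs(col_norms - 1)))
```

The printed deviation is of order $\sqrt{1/N}$, in line with the second part of Lemma B.9 below (stated for general $w$; here the bounded case is used only to keep the check elementary).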
We start by proving the (rather standard) result showing that $Z$ inherits the local moment condition required in [4].

Lemma B.9. There is an absolute constant $c$ such that the following holds. Let $w$ be a random variable with zero mean and unit variance and set $Z = (w_1, \dots, w_N)$, where the $w_i$'s are independent copies of $w$. Then, for every $z \in S^{N-1}$ and every $p \ge 2$,
\[
\| \langle Z, z\rangle \|_{L_p} \le c \sqrt{p}\, \|w\|_{L_p}.
\]
Moreover, if $Z_1, \dots, Z_d$ are independent copies of $Z$, then for every $p \ge 2$ and every $\lambda \ge 1$, with probability at least $1 - d\lambda^{-p}$,
\[
\max_{j = 1, \dots, d} \Bigl| \frac{\|Z_j\|_2^2}{N} - 1 \Bigr| \le c \lambda \sqrt{p}\, \frac{\|w^2\|_{L_p}}{\sqrt{N}}.
\]

Proof. Let $(\varepsilon_i)_{i=1}^N$ be i.i.d. Bernoulli random variables that are also independent of $Z$. By a standard symmetrization argument (see, e.g., [16, Lemma 6.3]),
\[
\| \langle Z, z\rangle \|_{L_p} = \Bigl\| \sum_{i=1}^N w_i z_i \Bigr\|_{L_p} \le 2 \Bigl\| \sum_{i=1}^N \varepsilon_i w_i z_i \Bigr\|_{L_p}.
\]
Moreover, Bernoulli random variables are subgaussian with an absolute constant; in particular, $\| \sum_{i=1}^N \varepsilon_i v_i \|_{L_p} \le c \sqrt{p}\, \|v\|_2$ for every $v \in \mathbb{R}^N$ and $p \ge 2$. Applying the latter inequality conditionally on $Z$,
\[
\Bigl\| \sum_{i=1}^N \varepsilon_i w_i z_i \Bigr\|_{L_p}^p = \mathbb{E}_Z \mathbb{E}_\varepsilon \Bigl| \sum_{i=1}^N \varepsilon_i w_i z_i \Bigr|^p \le \mathbb{E}_Z (c\sqrt{p})^p \Bigl( \sum_{i=1}^N w_i^2 z_i^2 \Bigr)^{p/2},
\]
where $\mathbb{E}_Z$ and $\mathbb{E}_\varepsilon$ denote the expectations with respect to $Z$ and $(\varepsilon_i)_{i=1}^N$, respectively. Let $B_1^N$ be the $\ell_1^N$-unit ball and observe that
\[
\sup_{z \in B_2^N} \mathbb{E}_Z \Bigl( \sum_{i=1}^N w_i^2 z_i^2 \Bigr)^{p/2} = \sup_{v \in B_1^N} \mathbb{E}_Z \Bigl| \sum_{i=1}^N w_i^2 v_i \Bigr|^{p/2} = \max_{v \in B_1^N} \psi(v).
\]
As $\psi$ is convex, it follows that $\max_{v \in B_1^N} \psi(v)$ is attained at an extreme point of $B_1^N$, i.e., for $v$ of the form $v = \pm e_i$, $1 \le i \le N$; hence
\[
\max_{v \in B_1^N} \psi(v) = \max_{1 \le i \le N} \mathbb{E}_Z \bigl| w_i^2 \bigr|^{p/2} = \|w\|_{L_p}^p.
\]
Combining all the estimates, $\| \langle Z, z\rangle \|_{L_p} \le 2c\sqrt{p}\, \|w\|_{L_p}$.
As for the second statement, by a standard symmetrization argument followed by the arguments presented in the previous step,
\[
\Bigl\| \frac{\|Z\|_2^2}{N} - 1 \Bigr\|_{L_p} \le \frac{2}{N} \Bigl\| \sum_{i=1}^N \varepsilon_i w_i^2 \Bigr\|_{L_p} \le 2c\sqrt{p}\, \frac{\|w^2\|_{L_p}}{\sqrt{N}}.
\]
Hence, by Markov's inequality, for every $\lambda \ge 1$,
\[
\mathbb{P}\Bigl( \Bigl| \frac{\|Z\|_2^2}{N} - 1 \Bigr| \ge \lambda \cdot 2c\sqrt{p}\, \frac{\|w^2\|_{L_p}}{\sqrt{N}} \Bigr) \le \lambda^{-p},
\]
and the claim follows from the union bound over $j = 1, \dots, d$. □

Proof of Lemma B.7. Set $Z = (w_1, \dots, w_N)$. The first part of Lemma B.9 implies that for every $z \in \mathbb{R}^N$ and every $2 \le p \le c_0 \log(d)$,
\[
\| \langle Z, z\rangle \|_{L_p} \le c_1 L p^{1/\alpha + 1/2} \|z\|_2,
\]
and the second part of that lemma (applied with $\lambda = e^{11}$ and $p = \log(d)$) shows that, with probability at least $1 - d^{-10}$,
\[
\max_{j = 1, \dots, d} \Bigl| \frac{\|Z_j\|_2^2}{N} - 1 \Bigr| \le c_2 L \frac{\log^{2/\alpha + 1/2}(d)}{\sqrt{N}}.
\]
Thus $Z$ satisfies the assumption needed in [4], and it follows from Theorem 1.5 therein that for every set $V \subset \mathbb{R}^d$ satisfying $d^*(V) \ge \log(d)$, setting
\[
\mathcal{E}(V) = \frac{\log^{2/\alpha + 1/2}(d)}{\sqrt{N}}\, d_V^2 + \log^{2/\alpha + 1}\Bigl( \frac{ed}{d^*(V)} \Bigr) \Bigl( \frac{d_V \gamma_2(V)}{\sqrt{N}} + \frac{\gamma_2^2(V)}{N} \Bigr),
\]
with probability at least $1 - d^{-10} - 2\exp(-c_3 d^*(V))$,
\[
\sup_{x \in V} \Bigl| \Bigl\| \frac{1}{\sqrt{N}} \Gamma x \Bigr\|_2^2 - \|x\|_2^2 \Bigr| \le c_4(L, \alpha)\, \mathcal{E}(V). \tag{B.5}
\]
From this point, the proof of the lemma follows the same path as the one presented for Lemma 2.7, and we only sketch it: set $\xi = \log^{2/\alpha + 1}( \frac{ed}{d^*(A)} )$ and put $r = \xi \gamma_2(A)/\sqrt{N}$. Apply (B.5) to
\[
A_r = \bigl\{ x \in \mathrm{conv}(\{ y - z : y, z \in A \cup \{0\} \}) : \|x\|_2 \le r \bigr\}.
\]
The claim follows by noting that $d^*(A_r) \ge c_5 d^*(A) \ge c_6 \log(d)$ and that $\mathcal{E}(A_r) \le c_7(L, \alpha) r^2$, where the latter holds because $N \ge \log^{4/\alpha + 1}(d)$.
□

Acknowledgement: This research was funded in whole or in part by the Austrian Science Fund (FWF) [doi: 10.55776/P34743 and 10.55776/ESP31], the Austrian National Bank [Jubiläumsfond, project 18983], and a Presidential Young Professorship grant ['Robust statistical learning for complex data'].

References

[1] P. Abdalla and N. Zhivotovskiy. Covariance estimation: Optimal dimension-free guarantees for adversarial corruption and heavy tails. Journal of the European Mathematical Society, 28(4):1809–1847, 2026.
[2] R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.
[3] Z.-D. Bai and Y.-Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Annals of Probability, 21(3):1275–1294, 1993.
[4] D. Bartl and S. Mendelson. Random embeddings with an almost gaussian distortion. Advances in Mathematics, 400:108261, 2022.
[5] D. Bartl and S. Mendelson. Empirical approximation of the gaussian distribution in $\mathbb{R}^d$. Advances in Mathematics, 460:110041, 2025.
[6] D. Bartl and S. Mendelson. Structure preservation via the Wasserstein distance. Journal of Functional Analysis, 288:110810, 2025.
[7] D. Bartl and S. Mendelson. A uniform Dvoretzky–Kiefer–Wolfowitz inequality. Probability Theory and Related Fields, pages 1–40, 2025.
[8] M. T. Boedihardjo. Sharp bounds for max-sliced Wasserstein distances. Foundations of Computational Mathematics, pages 1–32, 2025.
[9] S. Boucheron, O. Bousquet, G. Lugosi, and P. Massart. Moment inequalities for functions of independent random variables. Annals of Probability, 33(2):514–560, 2005.
[10] A. Figalli and F. Glaudo. An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows.
EMS Textbooks in Mathematics, 2021.
[11] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34:1143–1216, 2006.
[12] O. Guédon, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann. On the interval of fluctuation of the singular values of random matrices. Journal of the European Mathematical Society, 19(5), 2017.
[13] M. Jirak, S. Minsker, Y. Shen, and M. Wahl. Concentration and moment inequalities for sums of independent heavy-tailed random matrices. Probability Theory and Related Fields, pages 1–28, 2025.
[14] V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, pages 110–133, 2017.
[15] R. Latała. Estimation of moments of sums of independent real random variables. Annals of Probability, 25(3):1502–1513, 1997.
[16] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.
[17] G. Lugosi and S. Mendelson. Multivariate mean estimation with direction-dependent accuracy. Journal of the European Mathematical Society, 26(6):2211–2247, 2024.
[18] S. Mendelson. Empirical processes with a bounded $\psi_1$ diameter. Geometric and Functional Analysis, 20(4):988–1027, 2010.
[19] S. Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126(12):3652–3680, 2016.
[20] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4):1248–1282, 2007.
[21] S. Mendelson and G. Paouris. On the singular values of random matrices. Journal of the European Mathematical Society, 16(4):823–834, 2014.
[22] J. L. M. Olea, C. Rush, A. Velez, and J. Wiesel.
On the generalization error of norm penalty linear regression models. Annals of Statistics, to appear, 2025.
[23] G. Pisier. The volume of convex bodies and Banach space geometry, volume 94. Cambridge University Press, 1999.
[24] M. Talagrand. Regularity of gaussian processes. Acta Mathematica, 159:99–149, 1987.
[25] M. Talagrand. Upper and lower bounds for stochastic processes: decomposition theorems, volume 60. Springer Nature, 2022.
[26] K. Tikhomirov. Sample covariance matrices of heavy-tailed distributions. International Mathematics Research Notices, 2018(20):6254–6289, 2018.
[27] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
[28] C. Villani. Topics in optimal transportation, volume 58. American Mathematical Society, 2021.
[29] A. Walker. A note on the asymptotic distribution of sample quantiles. Journal of the Royal Statistical Society Series B: Statistical Methodology, 30(3):570–575, 1968.

Department of Mathematics, Department of Statistics and Data Science, National University of Singapore
Email address: bartld@nus.edu.sg

Department of Mathematics, Texas A&M University
Email address: shahar.mendelson@gmail.com
