Boosting for Vector-Valued Prediction and Conditional Density Estimation


Jian Qian∗    Shu Ge†

∗ The University of Hong Kong. jianqian@hku.hk.
† geshu1026qq@gmail.com.

Abstract

Despite the widespread use of boosting in structured prediction, a general theoretical understanding of aggregation beyond scalar losses remains incomplete. We study vector-valued and conditional density prediction under general divergences and identify stability conditions under which aggregation amplifies weak guarantees into strong ones. We formalize this stability property as $(\alpha,\beta)$-boostability. We show that geometric median aggregation achieves $(\alpha,\beta)$-boostability for a broad class of divergences, with tradeoffs that depend on the underlying geometry. For vector-valued prediction and conditional density estimation, we characterize boostability under common divergences ($\ell_1$, $\ell_2$, $\mathrm{TV}$, and $\mathrm{H}$) with the geometric median, revealing a sharp distinction between dimension-dependent and dimension-free regimes. We further show that while the KL divergence is not directly boostable via geometric median aggregation, it can be handled indirectly through boostability under the Hellinger distance. Building on these structural results, we propose a generic boosting framework, GeoMedBoost, based on exponential reweighting and geometric-median aggregation. Under a weak learner condition and $(\alpha,\beta)$-boostability, we obtain exponential decay of the empirical divergence exceedance error. Our framework recovers classical algorithms such as MedBoost, AdaBoost, and SAMME as special cases, and provides a unified geometric view of boosting for structured prediction.

1 Introduction

Boosting was originally introduced as a method for improving performance in binary 0-1 classification, most notably through AdaBoost (Freund and Schapire, 1997). It was subsequently extended to real-valued regression (Friedman, 2001; Kégl, 2003; Mason et al., 1999; Bühlmann and Yu, 2003; Chen, 2016) and multi-class classification (Schapire, 1997; Schapire and Singer, 1998; Allwein et al., 2000; Zhu et al., 2009; Brukhim et al., 2021a), among many others. These developments led to a rich body of theory connecting weak learnability, loss minimization, and stagewise additive modeling. While far from exhaustive, this line of work has established a mature collection of algorithmic principles and analytical tools for boosting in scalar regression and classification. Beyond these settings, a number of works have also proposed boosting-style algorithms for more general prediction problems, including structured outputs (Ratliff et al., 2006; Shen et al., 2014). A detailed discussion of both lines of work is deferred to Section A.

In contrast, a comparable theoretical understanding of boosting for general prediction problems remains largely undeveloped. Many modern learning tasks require predicting structured outputs (such as vectors or full conditional distributions) rather than scalar labels, yet no unified boosting framework with provable guarantees exists beyond regression and classification. This limitation is particularly pronounced for conditional density estimation, which is central to probabilistic forecasting and likelihood-based learning, where performance is naturally measured by information-theoretic divergences such as the Kullback–Leibler divergence (Gneiting and Raftery, 2007; Goodfellow et al., 2016).
A central difficulty in extending boosting to vector-valued prediction and conditional density estimation is that its basic components lack direct generalizations. In classical boosting, the analysis relies on weak learner guarantees defined via misclassification or excess loss, simple aggregation rules such as averaging or voting, and a notion of strong performance that improves monotonically over rounds. For structured output spaces equipped with general divergences, none of these elements carries over directly: weak learner guarantees are no longer naturally expressed in scalar losses, aggregation of structured predictions into a single output is nontrivial, and its relationship to the final prediction error under the target divergence is unclear. These difficulties motivate the development of aggregation rules and performance notions that are compatible with structured prediction spaces.

To address the challenges of boosting structured predictions, we develop a geometric perspective on when and how weak learners can be amplified. We capture the essential stability requirement of aggregation through $(\alpha,\beta)$-boostability: if more than an $\alpha$ fraction of weighted predictions lie within a small divergence ball around a target, then the aggregated prediction remains within a $\beta$-enlargement of that ball. This notion isolates the geometric conditions under which aggregation can provably transform weak guarantees into strong ones. Building on this perspective, we study aggregation rules that satisfy $(\alpha,\beta)$-boostability and design a generic boosting framework, GeoMedBoost, for structured outputs. Within this framework, weak learners are required only to satisfy weak exceedance guarantees under the target divergence; iterative reweighting increases concentration across rounds, while geometric aggregation (such as the geometric median under a chosen divergence) ensures that errors remain controlled. Importantly, $(\alpha,\beta)$-boostability is a property of the aggregation rule, and yields explicit bounds that directly relate weak learner performance to prediction error. The GeoMedBoost framework recovers classical algorithms such as MedBoost, AdaBoost, and SAMME, and provides a unified geometric view of boosting for structured prediction. We now summarize our main contributions.

Contributions. Our main contributions are summarized as follows:

• Geometric boostability framework. We introduce $(\alpha,\beta)$-boostability, a geometric condition characterizing when aggregation rules amplify weak guarantees into strong ones for structured prediction under general divergences. We propose a generic boosting algorithm, GeoMedBoost, based on exponential reweighting and geometric-median aggregation, and establish direct theoretical guarantees linking weak learner performance to final prediction error. Classical boosting algorithms such as MedBoost, AdaBoost, and SAMME are recovered as special cases.

• Exact boostability for vector-valued prediction. For vector prediction under the $\ell_1$ and $\ell_2$ distances, we give tight characterizations of $(\alpha,\beta)$-boostability. Under the $\ell_1$ distance, we establish that any concentration level $\alpha > 1/2$ suffices for boostability, but the enlargement factor $\beta$ must scale linearly with the dimension $d$, and this dependence is tight.
In contrast, for the $\ell_2$ distance, we prove a sharp dimension-independent tradeoff: achieving enlargement factor $\beta$ requires the concentration level $\alpha$ to exceed an explicit threshold $\alpha_2(\beta) = \beta / (\beta + \sqrt{\beta^2 - 1})$, and this relationship is tight. Notably, the $\ell_2$ characterization extends immediately to general Hilbert spaces, highlighting the fundamental role of inner-product geometry in dimension-free boostability.

• Boostability and limits for conditional density estimation. For conditional density estimation, viewed as density-valued prediction on the probability simplex, we establish boostability results under natural information-theoretic divergences. Under the total variation distance, we show that any concentration level $\alpha > 1/2$ yields boostability with an enlargement factor of $2(d-1)$, mirroring the dimension-dependent behavior of the $\ell_1$ distance. For the Hellinger distance, a similar dimension-independent tradeoff between $\alpha_{\mathrm{H}}(\beta, d)$ and $\beta$, analogous to the $\ell_2$ case, is established. We also prove that the square-root KL divergence is not directly boostable via geometric median aggregation, even under standard density-ratio assumptions. This impossibility result demonstrates fundamental limitations of geometric aggregation for certain divergences and motivates our indirect approach for the KL divergence.

• Indirect boosting of KL divergence. Despite the impossibility of direct boostability for the KL divergence, we show that KL can be boosted indirectly by exploiting its relationship with the Hellinger distance. Specifically, we prove that applying geometric median aggregation under the Hellinger distance yields valid boostability guarantees for the KL divergence, at the cost of a logarithmic density-ratio factor in the enlargement factor. This result enables boosting for conditional density estimation under KL risk, which is central to probabilistic forecasting and likelihood-based learning.

Organization. The remainder of the paper is organized as follows. Section 2 introduces the prediction tasks, the divergence-based exceedance objective, and the weighted geometric median used for aggregation. Section 3 formalizes $(\alpha,\beta)$-boostability by geometric median and develops its basic geometric properties. Boostability characterizations for vector-valued prediction under $\ell_1$ and $\ell_2$ are given in Section 3.1, while results for conditional density estimation under $\mathrm{TV}$, $\mathrm{H}$, and $\mathrm{KL}$ appear in Section 3.2. In Section 4, we present the boosting algorithm GeoMedBoost and prove exponential decay guarantees for the exceedance error. We present a short conclusion in Section 5. All proofs are deferred to the Appendix.

2 Preliminaries

2.1 Prediction Tasks

Vector-valued prediction. We consider supervised learning with covariates $x \in \mathcal{X}$ and vector-valued responses $y \in \mathcal{Y} \subseteq \mathbb{R}^d$. A predictor is a measurable map $f : \mathcal{X} \to \mathcal{Y}$. We measure accuracy using a divergence $\mathrm{div} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, typically induced by a norm (e.g., $\ell_1$ or $\ell_2$).

Conditional density estimation. We are particularly interested in conditional density estimation with density-valued responses: $\mathcal{Y} = \Delta_d := \{ p \in \mathbb{R}^d_+ : \sum_{j=1}^d p_j = 1 \}$. Here $y(x)$ represents a conditional distribution over a finite outcome space $[d] := \{1, \ldots, d\}$. We consider the typical information-theoretic divergences (e.g., $\mathrm{TV}$, $\mathrm{H}$, or $\mathrm{KL}$).

2.2 Empirical Objectives and Divergences

Throughout this paper, we do not study generalization or population risk.
Our objective is purely empirical: given a fixed training sample, we aim to construct a predictor with small training error measured under a prescribed divergence.

Training data. We are given a finite sample $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}^d$ for vector-valued prediction and $\mathcal{Y} = \Delta_d$ for conditional density estimation.

Divergence-based training error. Let $\mathrm{div} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a divergence. For a predictor $f : \mathcal{X} \to \mathcal{Y}$, we measure training performance through the empirical exceedance error at scale $\varepsilon > 0$,
\[
L_{\mathrm{div},\varepsilon}(f) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{\mathrm{div}(y_i, f(x_i)) > \varepsilon\}, \tag{1}
\]
where for an event $E$, $\mathbb{I}\{E\}$ denotes its indicator function.

Divergences. We will consider several divergences depending on the prediction task:

• Vector-valued prediction: the $\ell_p$ distance $\|y - y'\|_p$ for $p \in \{1, 2\}$.

• Conditional density estimation: the total variation distance $\mathrm{TV}(y, y') = \frac{1}{2}\|y - y'\|_1$, the Hellinger distance $\mathrm{H}(y, y') = \frac{\sqrt{2}}{2}\|\sqrt{y} - \sqrt{y'}\|_2$, and the KL divergence $\mathrm{KL}(y, y') = \sum_{j=1}^d y_j \log \frac{y_j}{y'_j}$, where $\sqrt{\cdot}$ is applied element-wise.

2.3 Boosting Approach

Our objective is to drive the empirical exceedance error (1) to zero on a fixed training sample. Rather than minimizing the divergence directly, we pursue this goal through a boosting-style reduction. At a high level, the approach combines two ingredients:

1. Exponential reweighting, which produces a sequence of predictors whose predictions satisfy a weak exceedance guarantee at scale $\varepsilon$ under changing data weights; and

2. Aggregation, which combines these predictors into a single estimator whose accuracy is measured under the same divergence.

The role of aggregation is to convert weak guarantees into a final strong guarantee. This aggregation step is divergence-dependent and is formalized through the notion of a weighted geometric median.

2.4 Weighted Geometric Median

Consider any collection of points $Y_n = \{y_1, \ldots, y_n\} \subseteq \mathbb{R}^d$ with corresponding positive weights $w = (w_1, \ldots, w_n) \in \Delta_n$; we also write $w(y_i) = w_i$.

Definition 1 (Weighted geometric median). The weighted geometric median of $(Y_n, w)$ under a divergence $\mathrm{div}$ is
\[
\mathrm{med}(Y_n, w; \mathrm{div}) := \arg\min_{g} \sum_{i=1}^n w_i \, \mathrm{div}(y_i, g).
\]
When $\mathrm{div}(y, z) = \|y - z\|_p$, we write $\mathrm{med}_p(Y_n, w)$. For $\mathrm{div} \in \{\mathrm{TV}, \mathrm{H}, \sqrt{\mathrm{KL}}\}$ we write $\mathrm{med}_{\mathrm{TV}}(Y_n, w)$, $\mathrm{med}_{\mathrm{H}}(Y_n, w)$, and $\mathrm{med}_{\sqrt{\mathrm{KL}}}(Y_n, w)$, respectively.

For any set $B \subset \mathbb{R}^d$, write $w(B) := \sum_{i : y_i \in B} w_i$. For any divergence $\mathrm{div}$ and $\varepsilon > 0$, define the ball $B_\varepsilon(z; \mathrm{div}) := \{ y : \mathrm{div}(z, y) \leqslant \varepsilon \}$.

2.5 From Boostability to Boosting Guarantees

The effectiveness of the above boosting strategy hinges on a geometric property of the divergence $\mathrm{div}$ and the associated geometric median. We formalize this requirement through the notion of $(\alpha,\beta)$-boostability by geometric median (Section 3). At a high level, $(\alpha,\beta)$-boostability certifies that geometric-median aggregation is stable under weighted concentration of predictions, allowing repeated weak accuracy guarantees to be converted into a final guarantee at a controlled enlargement of the target scale. The formal definition and its geometric implications are given in Section 3. Under $(\alpha,\beta)$-boostability and a weak learner condition, we obtain an exponential decay bound on the empirical $\beta\varepsilon$-exceedance error of the boosted predictor (Section 4).
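Before turning to the formal boostability notion, the following minimal sketch (our illustration, not part of the paper) approximates the weighted geometric median of Definition 1 by directly minimizing the weighted divergence objective. The helper names and the use of scipy.optimize.minimize with a Euclidean divergence are illustrative assumptions.

```python
# Minimal sketch (not from the paper): numerically approximate the weighted
# geometric median med(Y_n, w; div) of Definition 1 by minimizing the weighted
# divergence objective with an off-the-shelf optimizer.
import numpy as np
from scipy.optimize import minimize

def l2_div(y, g):
    return np.linalg.norm(y - g)

def weighted_geometric_median(Y, w, div=l2_div):
    """Approximate arg min_g sum_i w_i * div(y_i, g) for points Y (n x d)."""
    Y, w = np.asarray(Y, float), np.asarray(w, float)
    objective = lambda g: np.sum(w * np.array([div(y, g) for y in Y]))
    g0 = np.average(Y, axis=0, weights=w)        # weighted mean as a starting point
    res = minimize(objective, g0, method="Nelder-Mead")
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(20, 3))
    Y[:3] += 50.0                                # a few gross outliers
    w = np.full(20, 1 / 20)
    print("geometric median:", weighted_geometric_median(Y, w))
    print("weighted mean   :", np.average(Y, axis=0, weights=w))
```

Unlike the weighted mean, the median-type objective is largely insensitive to the few gross outliers, which is precisely the stability that $(\alpha,\beta)$-boostability formalizes in the next section.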
3 Boostability by Geometric Median

We now formalize the geometric stability property underlying the aggregation step described in Section 2. We then present the positive and negative boostability results for vector-valued prediction under the $\ell_1$ and $\ell_2$ distances (Section 3.1), and for conditional density estimation under the total variation distance, the Hellinger distance, and the KL divergence (Section 3.2).

Definition 2 ($(\alpha,\beta)$-boostability by geometric median). Let $\alpha \in (\frac{1}{2}, 1)$ and $\beta \geqslant 1$. We say that a divergence $\mathrm{div}$ is $(\alpha,\beta)$-boostable by geometric median if for every weighted set $(Y_n, w)$, every $z \in \mathbb{R}^d$, and every $\varepsilon \geqslant 0$,
\[
w(B_\varepsilon(z; \mathrm{div})) \geqslant \alpha \;\Longrightarrow\; \mathrm{med}(Y_n, w; \mathrm{div}) \subseteq B_{\beta\varepsilon}(z; \mathrm{div}).
\]
We refer to $\alpha$ as the concentration level and $\beta$ as the enlargement factor.

Intuitively, $(\alpha,\beta)$-boostability ensures that geometric-median aggregation preserves concentration: if a strict $\alpha$-fraction of the weighted mass lies near a target point $z$, then the aggregate prediction cannot drift far from $z$. The enlargement factor $\beta$ quantifies the worst-case enlargement incurred by the geometric-median aggregation under the divergence $\mathrm{div}$. An immediate consequence of the above definition is that if a single point carries weight exceeding $\alpha$, then by taking $\varepsilon = 0$, the geometric median must coincide with that point.

3.1 Boostability for Vector-Valued Prediction

We begin by characterizing boostability properties for vector-valued prediction under the $\ell_1$ and $\ell_2$ distances. For the $\ell_1$ distance, any concentration level $\alpha > 1/2$ suffices (Proposition 3): whenever more than half of the total weight lies within an $\ell_1$ ball of radius $\varepsilon$ centered at a target point $z$, every $\ell_1$ geometric median must lie in an $\ell_1$ ball centered at $z$ whose radius is enlarged by a factor of $d$. Moreover, this dependence on the dimension is unavoidable: the enlargement factor $d$ is tight for all $\alpha > 1/2$. In contrast, for the $\ell_2$ distance, boostability exhibits a sharp tradeoff between the concentration level $\alpha$ and the enlargement factor $\beta$. A notable feature of this tradeoff is that it is independent of the ambient dimension $d$. Specifically, achieving a given enlargement factor $\beta$ requires $\alpha$ to exceed an explicit threshold $\alpha_2(\beta)$, as characterized in Proposition 4, and this dependence is tight.

Proposition 3. Let $\ell_1(y, z) = \|y - z\|_1$ on $\mathbb{R}^d$. Then:
(i) For any $\alpha > 1/2$, $\ell_1$ is $(\alpha, d)$-boostable: if $w(B_\varepsilon(z; \ell_1)) > 1/2$, then $\mathrm{med}_1(Y_n, w) \subseteq B_{d\varepsilon}(z; \ell_1)$.
(ii) For any $\alpha > 1/2$ and any $\beta < d$, $\ell_1$ is not $(\alpha, \beta)$-boostable.

Proof sketch. Separability of $\|\cdot\|_1$ implies that $\ell_1$ geometric medians are obtained coordinate-wise as (one-dimensional) weighted medians. (i) If $w(B_\varepsilon(z; \ell_1)) > 1/2$, then for each coordinate $j$, more than half the weight satisfies $|y_{ij} - z_j| \leqslant \varepsilon$. Any weighted median in one dimension lies in the minimal interval carrying weight $> 1/2$, hence $|g_j - z_j| \leqslant \varepsilon$ for all $j$ and $\|g - z\|_1 \leqslant \sum_j |g_j - z_j| \leqslant d\varepsilon$. (ii) A tight example assigns slightly more than half the total weight to the points $\{z + \varepsilon e_j\}_{j \in [d]}$ (evenly spread across these points) and the remaining weight to a far outlier $z + L\mathbf{1}$, where $L > \varepsilon$, $e_j$ is the $j$-th unit vector in $\mathbb{R}^d$, and $\mathbf{1} \in \mathbb{R}^d$ denotes the all-ones vector. The unique coordinate-wise median becomes $z + \varepsilon\mathbf{1}$, giving $\|g^\star - z\|_1 = d\varepsilon$.

The proof in fact establishes a stronger coordinate-wise statement: the $\ell_1$ geometric median lies in the $\ell_\infty$ ball of radius $\varepsilon$ around $z$, which immediately implies the claimed $\ell_1$ bound by summation.
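As a quick numerical illustration of Proposition 3 (ours, not from the paper), the sketch below computes the coordinate-wise weighted median, which is the $\ell_1$ geometric median, and evaluates it on the tight construction from claim (ii); all helper names are hypothetical.

```python
# Sketch: coordinate-wise weighted median (the l1 geometric median) and the
# tight example from Proposition 3(ii), where the enlargement factor is d.
import numpy as np

def weighted_median_1d(values, weights):
    """Smallest point whose cumulative weight reaches 1/2 (a weighted median)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * w.sum())]

def l1_geometric_median(Y, w):
    return np.array([weighted_median_1d(Y[:, j], w) for j in range(Y.shape[1])])

d, eps, L, eta = 5, 1.0, 10.0, 0.05
z = np.zeros(d)
inliers = z + eps * np.eye(d)                 # z + eps * e_j, j = 1..d
outlier = z + L * np.ones(d)                  # far outlier z + L * 1
Y = np.vstack([inliers, outlier[None, :]])
w = np.concatenate([np.full(d, (0.5 + eta) / d), [0.5 - eta]])

g = l1_geometric_median(Y, w)
print("||g - z||_1 =", np.abs(g - z).sum(), " (tight bound d*eps =", d * eps, ")")
```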
Proposition 4. Let $d \geqslant 2$ and $\ell_2(y, z) = \|y - z\|_2$ on $\mathbb{R}^d$. For any $\beta > 1$, define
\[
\alpha_2(\beta) := \frac{\beta}{\beta + \sqrt{\beta^2 - 1}}.
\]
(i) If $\alpha > \alpha_2(\beta)$, then $\ell_2$ is $(\alpha, \beta)$-boostable (e.g., $(3/5, 2)$).
(ii) If $\alpha < \alpha_2(\beta)$, then $\ell_2$ is not $(\alpha, \beta)$-boostable.

Proof sketch. Let $F(g) = \sum_i w_i \|g - y_i\|_2$ and write $W = w(B_\varepsilon(z; \ell_2))$. For any $g \notin B_{\beta\varepsilon}(z)$, a standard directional-alignment lemma (Lemma 15) lower bounds the cosine between each inlier direction $(g - y_i)/\|g - y_i\|$ and the radial direction $(g - z)/\|g - z\|$ by $\sqrt{\beta^2 - 1}/\beta$, while outliers contribute at worst $-1$. Thus
\[
\Big\langle \nabla F(g), \frac{g - z}{\|g - z\|} \Big\rangle \;\geqslant\; W \frac{\sqrt{\beta^2 - 1}}{\beta} - (1 - W),
\]
which is positive exactly when $W > \alpha_2(\beta)$, excluding stationarity outside $B_{\beta\varepsilon}(z)$. Tightness follows from a planar construction (Lemma 16) achieving the threshold.

A notable feature of the $\ell_2$ result is that the tight relationship between the concentration level $\alpha$ and the enlargement factor $\beta$ is independent of the ambient dimension $d$. Indeed, the proof relies only on the inner-product geometry of the space and does not exploit any finite-dimensional structure, yielding an immediate extension to general Hilbert spaces. A dimension-free stability property for the geometric median in Hilbert spaces was previously established by Minsker (2015, Lemma 2.1), albeit with a non-tight tradeoff between $\alpha$ and $\beta$. More generally, Minsker (2015) derives robustness guarantees for geometric medians in arbitrary Banach spaces under weaker geometric assumptions. Our result sharpens the Hilbert-space case by identifying the exact threshold governing $(\alpha,\beta)$-boostability.
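The planar construction behind claim (ii) of Proposition 4 (Lemma 16 in the appendix) can be checked numerically. The sketch below (our illustration, not from the paper) places the three weighted points with $\alpha\cos\theta = 1 - \alpha$ and runs a Weiszfeld-style iteration for the weighted $\ell_2$ geometric median; with $\alpha$ just below $\alpha_2(\beta)$, the computed median lands outside the $\beta\varepsilon$ ball.

```python
# Sketch (our illustration): the planar construction of Lemma 16. With
# alpha*cos(theta) = 1 - alpha, the weighted l2 geometric median lands at
# distance eps/sin(theta) from z = 0, exceeding beta*eps once alpha < alpha_2(beta).
import numpy as np

def weiszfeld(Y, w, iters=500, tol=1e-12):
    """Weighted l2 geometric median by Weiszfeld's fixed-point iteration."""
    g = np.average(Y, axis=0, weights=w)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(Y - g, axis=1), 1e-15)
        coef = w / dist
        g_new = (coef[:, None] * Y).sum(axis=0) / coef.sum()
        if np.linalg.norm(g_new - g) < tol:
            break
        g = g_new
    return g

beta, eps = 2.0, 1.0
alpha_2 = beta / (beta + np.sqrt(beta**2 - 1))   # threshold from Proposition 4
alpha = alpha_2 - 0.01                           # just below the threshold
theta = np.arccos((1 - alpha) / alpha)           # alpha*cos(theta) = 1 - alpha

p_plus = np.array([eps * np.sin(theta),  eps * np.cos(theta)])
p_minus = np.array([eps * np.sin(theta), -eps * np.cos(theta)])
q = np.array([2 * eps / np.sin(theta), 0.0])
Y = np.vstack([p_plus, p_minus, q])
w = np.array([alpha / 2, alpha / 2, 1 - alpha])

g = weiszfeld(Y, w)
print("||g - z||_2 / eps =", np.linalg.norm(g) / eps, " vs beta =", beta)
```

Choosing $\alpha$ slightly above $\alpha_2(\beta)$ instead keeps the computed median inside the $\beta\varepsilon$ ball, matching claim (i).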
3.2 Boostability for Conditional Density Estimation

In this section, we establish positive boostability results for conditional density estimation under the total variation distance (Proposition 5) and the Hellinger distance (Proposition 6). We further show that the square-root KL divergence, while not directly boostable via the geometric median under $\sqrt{\mathrm{KL}}$ (Proposition 7), can nevertheless be boosted indirectly by applying the geometric median under the Hellinger distance (Proposition 8).

Specifically, mirroring the behavior of the $\ell_1$ distance in vector-valued prediction, the total variation distance is boostable whenever the concentration level satisfies $\alpha > 1/2$: if more than half of the total weight lies in a $\mathrm{TV}$ ball of radius $\varepsilon$ around a target density $z$, then every $\mathrm{TV}$ geometric median lies in an enlarged ball with radius at most $2(d-1)\varepsilon$. For the Hellinger distance, we establish a tradeoff between the concentration level $\alpha$ and the enlargement factor $\beta$, analogous to the $\ell_2$ case. This tradeoff is characterized explicitly in Proposition 6. Finally, for the square-root KL divergence, we show that even under a standard density-ratio assumption, direct boostability via the geometric median under $\sqrt{\mathrm{KL}}$ is not possible. However, by exploiting the relationship between the KL divergence and the Hellinger distance, we show that the KL divergence can be boosted indirectly via the Hellinger geometric median, at the cost of a logarithmic density-ratio factor in the enlargement factor.

Proposition 5 (Total variation). Let $d \geqslant 2$. For any $\alpha > 1/2$, the total variation distance $\mathrm{TV}$ on $\Delta_d$ is $(\alpha, 2(d-1))$-boostable.

Proof sketch. Let $g$ be a weighted TV geometric median of $\{y_i\}$ in $\Delta_d$, and assume that the inliers in $B_\varepsilon(z; \mathrm{TV})$ carry weight $> \frac{1}{2}$. By separability of $\mathrm{TV}(\cdot, g)$ and the simplex constraint, first-order optimality along feasible exchange directions implies a balancing condition: there exists $\tau$ such that each coordinate $g_j$ is a $\tau$-quantile. W.l.o.g., assume $\tau \geqslant 1/2$; hence at least half the weight satisfies $y_{ij} \leqslant g_j$ for all $j$. Fix $j$. If $g_j < z_j - 2\varepsilon$, then every $y_i \in B_\varepsilon(z; \mathrm{TV})$ must satisfy $y_{ij} > g_j$, contradicting the balancing property. Thus $z_j - g_j \leqslant 2\varepsilon$ whenever $g_j < z_j$. Since $g, z \in \Delta_d$, at most $d - 1$ coordinates contribute to $\mathrm{TV}(g, z) = \sum_{j : g_j < z_j} (z_j - g_j) \leqslant 2(d-1)\varepsilon$.

Proposition 6 (Hellinger). Let $d \geqslant 2$. For any $\beta > 1$, define
\[
\alpha_{\mathrm{H}}(\beta, d) :=
\begin{cases}
\dfrac{2}{3}, & d = 2, \\[2mm]
\dfrac{2\beta^2}{3\beta^2 - 1}, & d \geqslant 3.
\end{cases}
\]
If $\alpha > \alpha_{\mathrm{H}}(\beta, d)$, then the Hellinger distance $\mathrm{H}$ on $\Delta_d$ is $(\alpha, \beta)$-boostable (in particular, $(8/11, 2)$-boostable).

Proof sketch. Use the square-root embedding $p \mapsto \sqrt{p}$, under which $\mathrm{H}(p, q) = \frac{\sqrt{2}}{2}\|\sqrt{p} - \sqrt{q}\|_2$ and the weighted Hellinger objective becomes a Euclidean geometric-median objective on the unit sphere: $F(\sqrt{g}) = \sum_i w_i \|\sqrt{g} - \sqrt{y_i}\|_2$. Any constrained minimizer $\sqrt{g}$ must have zero projected (tangent) gradient. Projecting $\nabla F(\sqrt{g})$ onto the tangent space yields a weighted sum of tangent directions toward each $\sqrt{y_i}$. It suffices to find a tangent direction with positive inner product against this sum. Take the tangent direction $p_z$ at $\sqrt{g}$ toward $\sqrt{z}$. If $g \notin B_{\beta\varepsilon}(z; \mathrm{H})$, then for any inlier $y_i \in B_\varepsilon(z; \mathrm{H})$, Lemma 19 implies uniform positive alignment with $p_z$, while each outlier contributes at worst $-1$. Thus the projected gradient in direction $p_z$ is positive whenever the inlier weight exceeds $\alpha_{\mathrm{H}}(\beta, d)$. Hence, the tangent gradient cannot vanish outside $B_{\beta\varepsilon}(z; \mathrm{H})$, and no minimizer lies there. A standard subgradient argument covers nondifferentiable points, completing the proof.

The above proof relies heavily on the spherical geometry. It also does not depend on the ambient dimension $d$ and is therefore generalizable to general Hilbert spaces. An interesting immediate question is whether the square root of the KL divergence is boostable through the geometric median under a standard density-ratio assumption. Unfortunately, it is not, as shown in the following Proposition 7. However, we can boost the KL divergence through the Hellinger distance, as shown in Proposition 8.

Proposition 7 (KL impossibility). For any $\delta \in (0, 1/25)$, letting $y_1 = (\delta, 1 - \delta)$, $y_2 = (1 - \delta, \delta)$, and $w(y_2) = 3\sqrt{\delta \log(1/\delta)}$, $w(y_1) = 1 - w(y_2)$, we have $\mathrm{med}_{\sqrt{\mathrm{KL}}}(\{y_1, y_2\}, w) \neq \{y_1\}$.

The proof proceeds by a careful calculation, which we defer to Section B.2. The above result shows that even under a standard density-ratio bound, achieving boostability for $\sqrt{\mathrm{KL}}$ via the geometric median requires an excessively large concentration level $\alpha$. In particular, it rules out the ideal scenario in which boostability could be obtained by choosing a sufficiently large enlargement factor while allowing the concentration level to be as low as $1 - \Omega(\mathrm{poly}(\log(1/\delta)))$.
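As a sanity check on the two-point construction of Proposition 7 (our sketch, not part of the paper's development), one can scan the binary simplex and compare the weighted $\sqrt{\mathrm{KL}}$ objective at $y_1$ against its minimum:

```python
# Sketch: the two-point construction of Proposition 7. The weighted sqrt-KL
# objective F(g) = w1*sqrt(KL(y1,g)) + w2*sqrt(KL(y2,g)) is scanned over the
# binary simplex; its minimizer differs from y1 even though most of the
# weight sits on y1.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

delta = 1e-3
y1 = np.array([delta, 1 - delta])
y2 = np.array([1 - delta, delta])
w2 = 3 * np.sqrt(delta * np.log(1 / delta))
w1 = 1 - w2

def objective(g):
    return w1 * np.sqrt(kl(y1, g)) + w2 * np.sqrt(kl(y2, g))

grid = np.linspace(1e-6, 1 - 1e-6, 20001)
vals = np.array([objective(np.array([p, 1 - p])) for p in grid])
best = grid[np.argmin(vals)]
print("F(y1)      =", objective(y1))
print("min_g F(g) =", vals.min(), " attained near g =", (best, 1 - best))
```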
Nevertheless, an important observation is that by exploiting the relationship between the KL divergence and the Hellinger distance, one can still boost the KL divergence indirectly: applying the geometric median under the Hellinger distance yields a valid boostability guarantee for the KL divergence, as formalized in Proposition 8.

Proposition 8 (KL through Hellinger). For $z \in \Delta_d$ and $\varepsilon > 0$, define the KL ball $B_\varepsilon(z; \sqrt{\mathrm{KL}}) := \{ y \in \Delta_d : \mathrm{KL}(z, y) \leqslant \varepsilon^2 \}$. Suppose for every $i$ we have the ratio bound $z \leqslant V y_i$ coordinate-wise for some $V \geqslant 1$. Let $c_V = \sqrt{2 + \log V}$ and fix $\beta > c_V / \sqrt{3}$. If $w(B_\varepsilon(z; \sqrt{\mathrm{KL}})) > \alpha_{\mathrm{KL}}(\beta, d)$, where
\[
\alpha_{\mathrm{KL}}(\beta, d) :=
\begin{cases}
\dfrac{2}{3}, & d = 2, \\[2mm]
\dfrac{2\beta^2}{3\beta^2 - c_V^2}, & d \geqslant 3,
\end{cases}
\]
then every Hellinger geometric median of $(Y_n, w)$ lies in the enlarged KL ball: $\mathrm{med}_{\mathrm{H}}(Y_n, w) \subseteq B_{\beta\varepsilon}(z; \sqrt{\mathrm{KL}})$.

Proof of Proposition 8. By Lemma 20, $\mathrm{H}^2 \leqslant \mathrm{KL}$, hence $w(B_\varepsilon(z; \mathrm{H})) \geqslant w(B_\varepsilon(z; \sqrt{\mathrm{KL}})) > \alpha_{\mathrm{KL}}(\beta, d)$. With $\beta' = \beta / c_V$, the threshold matches the Hellinger boostability threshold presented in Proposition 6, implying $\mathrm{med}_{\mathrm{H}}(Y_n, w) \subseteq B_{\beta'\varepsilon}(z; \mathrm{H})$. The ratio-bound assumption and Lemma 21 give $z \leqslant V g$ for any $g \in \mathrm{med}_{\mathrm{H}}(Y_n, w)$, so applying Lemma 20 again yields $\mathrm{KL}(z, g) \leqslant c_V^2 \mathrm{H}^2(z, g) \leqslant (\beta\varepsilon)^2$.
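A small numerical check (ours, not from the paper) of the sandwich inequality $\mathrm{H}^2 \leqslant \mathrm{KL} \leqslant (2 + \log V)\,\mathrm{H}^2$ from Lemma 20, which is what drives the indirect KL guarantee:

```python
# Sketch: verify H^2(p,q) <= KL(p,q) <= (2 + log V) * H^2(p,q) on random
# simplex points, where V is a coordinate-wise bound p <= V*q (Lemma 20).
import numpy as np

def hellinger_sq(p, q):
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    V = np.max(p / q)                  # smallest V with p <= V*q coordinate-wise
    h2, k = hellinger_sq(p, q), kl(p, q)
    print(f"H^2={h2:.4f}  KL={k:.4f}  (2+logV)*H^2={(2 + np.log(V)) * h2:.4f}")
```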
4 Boosting for Vector-Valued Prediction and Conditional Density Estimation

In this section, we show how $(\alpha,\beta)$-boostability can be leveraged to construct a boosting algorithm, GeoMedBoost (Algorithm 1), that yields provable bounds on the exceedance error for both vector-valued prediction and conditional density estimation (Theorem 12).

The proposed algorithm generalizes MedBoost (Kégl, 2003), which was originally developed for real-valued regression. At a high level, MedBoost replaces the usual averaging or voting step in boosting by a weighted median aggregation of the weak learners' predictions. Because the prediction is one-dimensional and the loss is the absolute loss, the median coincides with the minimizer of a weighted $\ell_1$ objective. This aggregation rule decouples accuracy from outliers: as long as a strict majority of the aggregate weight concentrates on hypotheses that are close to the target, the median prediction remains close as well. Combined with an exponential reweighting scheme that progressively shifts weight toward poorly predicted examples, this yields exponential decay of the exceedance error while maintaining robustness to corrupted labels or heavy-tailed noise.

The algorithm GeoMedBoost extends this principle to substantially richer prediction spaces, including vector-valued predictions in $\mathbb{R}^d$ and conditional density estimation on the simplex. In these settings, the scalar median is no longer adequate, and aggregation is carried out via a (weighted) geometric median under the chosen divergence. The required stability of this aggregation is exactly the $(\alpha,\beta)$-boostability property established in Section 3. Another defining feature of MedBoost is robustness, which GeoMedBoost inherits and makes explicit. In particular, suppose the weak learner achieves exceedance error at most $1 - \alpha$ at level $\varepsilon$ under a divergence that is $(\alpha - \gamma, \beta)$-boostable. The slack parameter $\gamma$ provides a robustness margin: even if a $\gamma$-fraction of the aggregate weight is adversarially perturbed, boostability still ensures contraction. Equivalently, one may shift up to a $\gamma$ portion of the weights without destroying the exponential decay of the exceedance error.

In what follows, we instantiate this reduction: we define the $\gamma$-robust geometric median, state the weak learner condition, and prove an exponential decay bound on the exceedance error of GeoMedBoost at scale $\beta\varepsilon$ (Theorem 12).

Definition 9 ($\gamma$-neighborhood). Let $(Y, w)$ be a weighted set. Define the distance between weighted sets by
\[
D\big((Y, w), (Y', w')\big) := \sum_{y \in Y \cup Y'} |w(y) - w'(y)|.
\]
For $\gamma \in [0, 1)$, define the $\gamma$-neighborhood $B_\gamma(Y, w) := \{ (Y', w') : D((Y, w), (Y', w')) \leqslant \gamma \}$.

Definition 10 ($\gamma$-robust geometric median). Let $0 \leqslant \gamma < 1$. The $\gamma$-robust (weighted) geometric median of $(Y, w)$ under a divergence $\mathrm{div}$ is
\[
\mathrm{med}(Y, w; \mathrm{div}, \gamma) := \bigcup_{(Y', w') \in B_\gamma(Y, w)} \mathrm{med}(Y', w'; \mathrm{div}).
\]

4.1 Weak Learner Condition and GeoMedBoost

Fix a sample $\{(x_i, y_i)\}_{i=1}^n$, a divergence $\mathrm{div}$, a margin parameter $\alpha > 0$, and a loss $C_\varepsilon$ satisfying
\[
C_\varepsilon(y, y') \geqslant \mathbb{I}\{\mathrm{div}(y, y') > \varepsilon\} \quad \text{for all } y, y'.
\]
The loss $C_\varepsilon$ serves as a surrogate for the $\varepsilon$-exceedance indicator induced by the divergence $\mathrm{div}$.

Assumption 11 (Weak learner). For any distribution $w$ over $\{x_1, \ldots, x_n\}$, the learner outputs a hypothesis $h$ such that
\[
\sum_{i=1}^n w(x_i)\, C_\varepsilon(y_i, h(x_i)) \leqslant 1 - \alpha.
\]

This condition requires a uniform advantage $\alpha$ over the trivial predictor under arbitrary reweightings of the data. The algorithm GeoMedBoost (Algorithm 1) follows the standard boosting template. It maintains weights $w_t$ over the samples, repeatedly invokes the weak learner to obtain $h_t$ satisfying Assumption 11, and updates the weights by exponential reweighting. The step size $\eta_t$ is chosen by minimizing a one-dimensional convex objective, ensuring a multiplicative decrease of the global exponential potential. (Under Assumption 11, $\eta_t > 0$ for all $t$, since the convex objective minimized in $\eta_t$ has nonpositive derivative at $\eta_t = 0$ and achieves its minimum at a positive step.) After $T$ rounds, predictions are aggregated via a $\gamma$-robust geometric median under $\mathrm{div}$ with weights proportional to $\eta_t$. Now we are ready to present our guarantee for GeoMedBoost.

Algorithm 1 GeoMedBoost
Require: Data $\{(x_i, y_i)\}_{i=1}^n$, a weak learner, a loss $C_\varepsilon$.
1: $w_1(x_i) = 1/n$ for all $i$.
2: for $t = 1, \ldots, T$ do
3:    Train $h_t$ under $w_t$ with $\sum_i w_t(x_i)\, C_\varepsilon(y_i, h_t(x_i)) \leqslant 1 - \alpha$.
4:    $\eta_t = \arg\min_{\eta} \sum_i w_t(x_i) \exp\big(-\eta (1 - \alpha - C_\varepsilon(y_i, h_t(x_i)))\big)$.
5:    $w_{t+1}(x_i) \propto w_t(x_i) \exp\big(\eta_t C_\varepsilon(y_i, h_t(x_i))\big)$.
6: end for
7: $\eta'_t \leftarrow \eta_t / \sum_{s=1}^T \eta_s$ for $t \in [T]$.
8: Output $f_T(x) \in \mathrm{med}(\{h_t(x)\}_{t=1}^T, \{\eta'_t\}_{t=1}^T; \mathrm{div}, \gamma)$.   ▷ $\mathrm{div} = \mathrm{H}$ for KL guarantees; see Remark 13.

Theorem 12. If $\mathrm{div}$ is $(\alpha - \gamma, \beta)$-boostable and $f_T$ is the output of GeoMedBoost, then
\[
L_{\mathrm{div},\beta\varepsilon}(f_T) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{\mathrm{div}(y_i, f_T(x_i)) > \beta\varepsilon\} \;\leqslant\; \prod_{t=1}^T \sum_{i=1}^n w_t(x_i) \exp\Big(-\eta_t \big(1 - \alpha - C_\varepsilon(y_i, h_t(x_i))\big)\Big).
\]

Proof sketch. If $\mathrm{div}(y_i, f_T(x_i)) > \beta\varepsilon$, then by $(\alpha - \gamma, \beta)$-boostability and the definition of the $\gamma$-robust median, at most an $(\alpha - \gamma)$ fraction of the $\gamma$-perturbed aggregate weight can lie on hypotheses within $\varepsilon$ of $y_i$; hence at least a $(1 - \alpha)$ fraction of the original weight lies outside the $\varepsilon$-ball. This implies $\sum_t \eta_t (1 - \alpha - C_\varepsilon(y_i, h_t(x_i))) \leqslant 0$. Applying $\mathbb{I}\{u \geqslant 0\} \leqslant e^u$ converts the indicator into an exponential potential, and the exponential reweighting update makes the potential telescope multiplicatively across rounds, yielding the product bound.

Remark 13 (KL exceedance via Hellinger aggregation). Proposition 8 shows that KL exceedance control can be obtained by aggregating with the Hellinger geometric median. Concretely, if the goal is to bound $L_{\sqrt{\mathrm{KL}},\varepsilon}(\cdot)$ under the ratio condition $z \leqslant V y_i$, then in GeoMedBoost we take the aggregation rule to be $f_T(x) \in \mathrm{med}(\{h_t(x)\}_{t=1}^T, \{\eta'_t\}_{t=1}^T; \mathrm{H}, \gamma)$, and evaluate the final predictor using $\sqrt{\mathrm{KL}}$ in the exceedance loss. In other words, the divergence used for aggregation (Hellinger) need not coincide with the divergence used for evaluation (KL), as long as the corresponding boostability implication holds.
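To make the reduction concrete, here is a minimal, non-optimized sketch of Algorithm 1 (our illustration, not the authors' implementation). It assumes a user-supplied `weak_learner(X, Y, w)` returning a vector-valued predictor satisfying Assumption 11, takes $C_\varepsilon$ to be the exceedance indicator itself, sets $\gamma = 0$, and approximates the geometric-median aggregation numerically:

```python
# Sketch of Algorithm 1 (GeoMedBoost) under the stated assumptions:
# gamma = 0, C_eps = exceedance indicator, weak_learner supplied by the user.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def geomed_boost(X, Y, weak_learner, div, eps, alpha, T):
    n = len(X)
    w = np.full(n, 1.0 / n)
    hypotheses, etas = [], []
    for _ in range(T):
        h = weak_learner(X, Y, w)          # assumed to satisfy Assumption 11 under w
        c = np.array([float(div(Y[i], h(X[i])) > eps) for i in range(n)])
        obj = lambda eta, c=c: np.sum(w * np.exp(-eta * (1.0 - alpha - c)))
        eta = minimize_scalar(obj, bounds=(0.0, 50.0), method="bounded").x
        w = w * np.exp(eta * c)            # exponential reweighting (line 5)
        w = w / w.sum()
        hypotheses.append(h)
        etas.append(eta)
    etas = np.array(etas) / np.sum(etas)   # normalized aggregation weights (line 7)

    def f_T(x):
        preds = np.array([h(x) for h in hypotheses])
        objective = lambda g: np.sum(etas * np.array([div(p, g) for p in preds]))
        res = minimize(objective, preds.mean(axis=0), method="Nelder-Mead")
        return res.x                       # approximate weighted geometric median (line 8)
    return f_T
```

Per Remark 13, for conditional density estimation under KL risk one would pass the Hellinger distance as `div` for the aggregation step while still evaluating the final exceedance in $\sqrt{\mathrm{KL}}$.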
Connection to Classical Boosting Methods

The results of this section place classical boosting algorithms within a unified geometric-aggregation framework based on $(\alpha,\beta)$-boostability. Viewed through this lens, GeoMedBoost serves as a template: choosing the prediction space and divergence $(\mathcal{Y}, \mathrm{div})$ specifies the aggregation rule, recovering familiar algorithms:

• In one-dimensional regression with the absolute loss, the geometric median coincides with the weighted median, yielding MedBoost (Kégl, 2003).

• In binary classification with the 0-1 divergence, the geometric median reduces to a weighted majority vote, recovering AdaBoost (Freund and Schapire, 1997).

• In multi-class classification with the 0-1 divergence, the resulting update and aggregation coincide with SAMME (Zhu et al., 2009) and AdaBoost.MH (Schapire and Singer, 1998).

In all cases, the standard weak learning conditions and exponential reweighting rules arise as direct instantiations of Assumption 11 and Algorithm 1.

5 Conclusion

We introduced $(\alpha,\beta)$-boostability as a geometric principle that characterizes when aggregation can amplify weak guarantees into strong ones for vector-valued prediction and conditional density estimation under general divergences. Our results show that geometric-median aggregation provides a unified and theoretically grounded replacement for averaging or voting in structured spaces, yielding sharp tradeoffs for $\ell_1$, $\ell_2$, $\mathrm{TV}$, and $\mathrm{H}$, while also revealing intrinsic limitations (e.g., for $\sqrt{\mathrm{KL}}$). Together, these findings clarify the geometric mechanisms underlying boosting beyond scalar losses and lay the foundation for extending the framework to broader divergence classes and alternative aggregation operators.

References

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000. URL https://www.jmlr.org/papers/v1/allwein00a.html.

Blair Bilodeau, Dylan J. Foster, and Daniel M. Roy. Minimax rates for conditional density estimation via empirical entropy. The Annals of Statistics, 51(2):762–790, 2023. doi: 10.1214/23-AOS2270.

Marco Bressan, Nataly Brukhim, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, and Maximilian Thiessen. Of dice and games: A theory of generalized boosting. In Proceedings of the Thirty Eighth Conference on Learning Theory (COLT), 2025.
Nataly Brukhim, Elad Hazan, Shay Moran, Indraneel Mukherjee, and Robert E. Schapire. Multiclass boosting and the cost of weak learning. Advances in Neural Information Processing Systems, 34:3057–3067, 2021a.

Nataly Brukhim, Elad Hazan, Shay Moran, Indraneel Mukherjee, and Robert E. Schapire. Multiclass boosting and the cost of weak learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/hash/17f5e6db87929fb55cebeb7fd58c1d41-Abstract.html.

Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In Proceedings of the IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), 2022.

Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting: Simple and intuitive weak learning criteria. In Advances in Neural Information Processing Systems (NeurIPS), 2023a. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/050f8591be3874b52fdac4e1060eeb29-Abstract-Conference.html.

Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting. In The Thirty Sixth Annual Conference on Learning Theory, pages 5433–5452. PMLR, 2023b.

Peter Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.

Moses Charikar and Chirag Pabbaraju. A characterization of list learnability, 2022.

Tianqi Chen. XGBoost: A scalable tree boosting system. Cornell University, 2016.

Thomas G. Dietterich, Andrea Ashenfelter, and Yaroslav Bulatov. Training conditional random fields via gradient tree boosting. In Proceedings of the International Conference on Machine Learning (ICML), 2004.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, pages 258–272. Springer, 2003.

Balázs Kégl. Generalization error and algorithmic convergence of median boosting. In Advances in Neural Information Processing Systems (NeurIPS), 2004. URL https://papers.nips.cc/paper_files/paper/2004/hash/a431d70133ef6cf688bc4f6093922b48-Abstract.html.

Gábor Lugosi and Shahar Mendelson. Sub-Gaussian estimators of the mean of a random vector. The Annals of Statistics, 47(2):783–794, 2019. doi: 10.1214/17-AOS1639.

Llew Mason, Jonathan Baxter, Peter L. Bartlett, Marcus Frean, et al. Functional gradient techniques for combining hypotheses. Advances in Neural Information Processing Systems, pages 221–246, 1999.

Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015. doi: 10.3150/14-BEJ645. URL https://projecteuclid.org/journals/bernoulli/volume-21/issue-4/Geometric-median-and-robust-estimation-in-Banach-spaces/10.3150/14-BEJ645.full.
Indraneel Mukherjee and Robert E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14:437–497, 2013. URL https://jmlr.org/papers/v14/mukherjee13a.html.

Jian Qian, Alexander Rakhlin, and Nikita Zhivotovskiy. Refined risk bounds for unbounded losses via transductive priors, 2024.

Nathan Ratliff, David Bradley, J. Andrew Bagnell, and Joel Chestnutt. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems 19 (NeurIPS 2006), 2006. URL https://papers.neurips.cc/paper/2006/hash/fdbd31f2027f20378b1a80125fc862db-Abstract.html.

Robert E. Schapire. Using output codes to boost multiclass learning problems. In ICML, volume 97, pages 313–321, 1997.

Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80–91, 1998.

Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999. doi: 10.1023/A:1007614523901. URL https://doi.org/10.1023/A:1007614523901.

Chunhua Shen, Guosheng Lin, and Anton van den Hengel. StructBoost: Boosting methods for predicting structured output variables. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):2089–2103, 2014. doi: 10.1109/TPAMI.2014.2315792. URL https://doi.org/10.1109/TPAMI.2014.2315792.

Yuhong Yang and Andrew R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44(1):95–116, 1998.

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999. doi: 10.1214/aos/1017939142.

Ji Zhu, Hui Zou, Saharon Rosset, Trevor Hastie, et al. Multi-class AdaBoost. Statistics and Its Interface, 2(3):349–360, 2009.

Xin Zou, Zhengyu Zhou, Jingyuan Xu, and Weiwei Liu. A boosting-type convergence result for AdaBoost.MH with factorized multi-class classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Contents of Appendix

A  Related Work
B  Proofs from Section 3
   B.1  Proofs from Section 3.1
   B.2  Proofs from Section 3.2
C  Proofs from Section 4

A Related Work

Regression by median boosting. Median-type aggregation is a standard tool in robust estimation: aggregating multiple noisy candidates via a median can strengthen weak concentration properties into high-probability guarantees (Lugosi and Mendelson, 2019). More generally, Minsker (2015) studies geometric median aggregation in normed spaces and provides geometric arguments establishing robustness guarantees for vector estimation. In contrast to this robust-estimation setting, our use of median aggregation arises within a boosting procedure and is limited to the final aggregation step.
Specifically, the median is used to combine a collection of weak predictors produced during the boosting process into a single output predictor, rather than serving as a standalone estimator. For regression, Kégl (2003, 2004) analyzes boosting algorithms in which averaging is replaced by a median-based aggregation rule. In this paper, we use a geometric argument closely related to Minsker (2015) in the $\ell_2$ setting to analyze boosting for vector-valued prediction. We further establish analogous geometric properties for the $\ell_1$, $\mathrm{TV}$, $\mathrm{H}$, and $\mathrm{KL}$ divergences, which characterize when median-type aggregation yields a valid final-step aggregation rule in boosting.

Multiclass boosting and weak-learning criteria. Multiclass classification can be viewed as a finite-label instance of structured output prediction. Boosting for multiclass classification has been studied under multiple, non-equivalent weak-learning formulations (Mukherjee and Schapire, 2013; Brukhim et al., 2021b; Bressan et al., 2025). Representative algorithms include AdaBoost.MH/MR (Schapire and Singer, 1999) and direct multiclass variants such as SAMME (Zhu et al., 2009). A substantial body of theoretical work analyzes the weak-learning conditions and performance guarantees of multiclass boosting algorithms. In particular, Mukherjee and Schapire (2013) characterize necessary-and-sufficient weak-learning conditions for a tailored loss. More recent work examines the resource/class dependence of multiclass boosting and how to improve it via alternative weak-learning interfaces (Brukhim et al., 2021b, 2022; Charikar and Pabbaraju, 2022; Brukhim et al., 2023a,b). Relatedly, Zou et al. (2024) give a convergence guarantee for a factorized AdaBoost.MH variant. Within our framework, multiclass classification corresponds to the special case of conditional density estimation over a finite label set. Our results provide a geometric perspective on aggregation that applies to this setting, recover certain existing guarantees as special cases, and extend to more general structured output spaces.

Structured prediction by boosting. Structured prediction has been studied in settings such as vector-valued regression in RKHSs (Micchelli and Pontil, 2005). Conditional density estimation also has a mature decision-theoretic theory (Yang and Barron, 1999; Bilodeau et al., 2023; Qian et al., 2024). In contrast, boosting for structured outputs is largely application-driven, e.g., column-generation and functional-gradient methods (Ratliff et al., 2006; Shen et al., 2014) and boosting potentials in structured models (Dietterich et al., 2004), with guarantees that are typically loss- and model-specific. This leaves a gap in general guarantees for learning structured real-valued conditional densities under distributional constraints (nonnegativity, normalization, and further structure).

B Proofs from Section 3

B.1 Proofs from Section 3.1

Proof of Proposition 3. We first show boostability in the one-dimensional case, then extend the result to higher dimensions.

Lemma 14 (One-dimensional weighted median). For any interval $I \subset \mathbb{R}$, if $w(I) > 1/2$, then $\mathrm{med}_1(Y_n, w) \subset I$.

Proof of Lemma 14.
Without loss of generality, assume that $y_1 \leqslant y_2 \leqslant \cdots \leqslant y_n$. Then $\mathrm{med}_1(Y_n, w) = [m_1, m_2]$, where
\[
m_1 = \max\Big\{ y_{i+1} : \sum_{l=1}^{i} w_l < \tfrac{1}{2} \Big\}
\quad \text{and} \quad
m_2 = \min\Big\{ y_i : \sum_{l=1}^{i} w_l > \tfrac{1}{2} \Big\}.
\]
Since $w((-\infty, m_1)) < \frac{1}{2}$, $w((m_1, \infty)) \leqslant \frac{1}{2}$, and $w(I) > \frac{1}{2}$, we have $m_1 \in I$; similarly, $m_2 \in I$. Thus we can conclude $[m_1, m_2] \subseteq I$.

Let $y_i = (y_{i1}, \ldots, y_{id})$ for all $i \in [n]$. The objective separates coordinate-wise:
\[
\sum_{i=1}^n w_i \|g - y_i\|_1 = \sum_{i=1}^n \sum_{j=1}^d w_i |g_j - y_{ij}| = \sum_{j=1}^d \sum_{i=1}^n w_i |g_j - y_{ij}|,
\]
so that
\[
\arg\min_{g} \sum_{i=1}^n w_i \|g - y_i\|_1 = \prod_{j=1}^d \arg\min_{g_j} \sum_{i=1}^n w_i |g_j - y_{ij}|.
\]
If $y_i \in B_\varepsilon(z; \ell_1)$, then $|y_{ij} - z_j| \leqslant \varepsilon$ for all $j$. Therefore, for each coordinate $j$,
\[
\sum_{i : y_{ij} \in [z_j - \varepsilon, z_j + \varepsilon]} w_i \;\geqslant\; w(B_\varepsilon(z; \ell_1)) > \tfrac{1}{2}.
\]
Thus by Lemma 14 we have
\[
\arg\min_{g_j} \sum_{i=1}^n w_i |g_j - y_{ij}| \subset [z_j - \varepsilon, z_j + \varepsilon].
\]
This implies that for every $g^\star \in \mathrm{med}_1(Y_n, w)$,
\[
\|g^\star - z\|_1 = \sum_{j=1}^d |g^\star_j - z_j| \leqslant d\varepsilon,
\]
so $g^\star \in B_{d\varepsilon}(z; \ell_1)$. This concludes the proof of claim (i).

Fix $z \in \mathbb{R}^d$ and $\varepsilon > 0$. We will use $n = d + 1$ points. Let $\eta \in (0, \frac{1}{2})$ be sufficiently small (it suffices that $\eta < \frac{1}{2(d-1)}$). For $i = 1, \ldots, d$, place a point $y_i := z + \varepsilon e_i$ with weight $w_i = \frac{1 + 2\eta}{2d}$, and for $L > \varepsilon$, place one point $y_{d+1} := z + L \mathbf{1}$ with weight $w_{d+1} = \frac{1 - 2\eta}{2}$. Then
\[
w(B_\varepsilon(z; \ell_1)) = \sum_{j=1}^d w_j = \frac{1}{2} + \eta > \frac{1}{2}.
\]
For each coordinate $j$, the one-dimensional mass strictly to the left of $z_j + \varepsilon$ is $(d-1)\frac{1+2\eta}{2d} < \frac{1}{2}$, adding the mass at $z_j + \varepsilon$ yields $\frac{1}{2} + \eta \geqslant \frac{1}{2}$, and the mass to the right is $\frac{1}{2} - \eta \leqslant \frac{1}{2}$. Therefore
\[
\arg\min_{g_j} \sum_{i=1}^n w_i |g_j - y_{ij}| = \{ z_j + \varepsilon \}, \qquad j = 1, \ldots, d.
\]
Hence $\mathrm{med}_1(Y_n, w) = \{ z + \varepsilon \mathbf{1} \}$, and the geometric median $g^\star$ satisfies $\|g^\star - z\|_1 = d\varepsilon$. In particular, no bound of the form $\|g^\star - z\|_1 \leqslant \beta\varepsilon$ holds for any $\beta < d$. This concludes the proof of claim (ii).

Lemma 15. For any $z, x, g \in \mathbb{R}^d$, $\varepsilon > 0$, and $\beta > 1$, if $x \in B_\varepsilon(z; \ell_2)$ and $g \notin B_{\beta\varepsilon}(z; \ell_2)$, then
\[
\langle g - x, g - z \rangle \;\geqslant\; \frac{\sqrt{\beta^2 - 1}}{\beta} \|g - x\|_2 \|g - z\|_2.
\]

Proof of Lemma 15. Let $a = g - z$, $b = x - z$, and
\[
r = \frac{\|b\|_2}{\|a\|_2} \leqslant \frac{\varepsilon}{\beta\varepsilon} = \frac{1}{\beta} < 1.
\]
Also let $t = \frac{\langle a, b \rangle}{\|a\|_2 \|b\|_2} \in [-1, 1]$. Then we have $\langle a, b \rangle = \|a\|_2^2\, r t$ and $\|b\|_2^2 = \|a\|_2^2\, r^2$. This implies that
\[
\langle g - x, g - z \rangle = \langle g - z, g - z \rangle + \langle z - x, g - z \rangle = \|a\|_2^2 - \langle a, b \rangle = \|a\|_2^2 (1 - r t),
\]
and
\[
\|g - x\|_2^2 = \|g - z\|_2^2 + \|z - x\|_2^2 + 2\langle g - z, z - x \rangle = \|a\|_2^2 + \|b\|_2^2 - 2\langle a, b \rangle = \|a\|_2^2 (1 + r^2 - 2 r t).
\]
Thus
\[
\frac{\langle g - x, g - z \rangle}{\|g - x\|_2 \|g - z\|_2} = \frac{1 - r t}{\sqrt{1 + r^2 - 2 r t}}.
\]
For fixed $r \in (0, 1)$, the function
\[
f_r(t) = \frac{1 - r t}{\sqrt{1 + r^2 - 2 r t}}, \qquad t \in [-1, 1],
\]
achieves its minimum at $t = r$ with $f_r(r) = \sqrt{1 - r^2}$. Therefore,
\[
\frac{\langle g - x, g - z \rangle}{\|g - x\|_2 \|g - z\|_2} \;\geqslant\; \sqrt{1 - r^2} \;\geqslant\; \sqrt{1 - \frac{1}{\beta^2}} = \frac{\sqrt{\beta^2 - 1}}{\beta}.
\]

Lemma 16. For any $\varepsilon > 0$ and $\alpha \in (1/2, 1)$, let $\theta$ satisfy $\alpha \cos\theta = 1 - \alpha$. Let
\[
p_\pm = (\varepsilon \sin\theta, \pm \varepsilon \cos\theta), \qquad q = \Big(\frac{2\varepsilon}{\sin\theta}, 0\Big),
\]
with weights $w(p_+) = w(p_-) = \alpha/2$ and $w(q) = 1 - \alpha$. Then the weighted geometric median is $g^\star = (\varepsilon/\sin\theta, 0)$.

Proof of Lemma 16. Let
\[
F(g) = \frac{\alpha}{2}\|g - p_+\| + \frac{\alpha}{2}\|g - p_-\| + (1 - \alpha)\|g - q\|.
\]
Note that $p_+$ and $p_-$ are reflections of each other across the $x$-axis, while $q$ lies on the $x$-axis. For any point $g = (x, y)$, let $\tilde{g} = (x, -y)$ and $\bar{g} = \frac{1}{2}(g + \tilde{g}) = (x, 0)$.
Then, since $F$ is symmetric with respect to the $x$-axis and by the convexity of the norm and of $F$,
\[
F(\bar{g}) \leqslant \tfrac{1}{2} F(g) + \tfrac{1}{2} F(\tilde{g}) = F(g).
\]
Thus, for every $g$, there exists a point $\bar{g}$ on the $x$-axis with $F(\bar{g}) \leqslant F(g)$. Moreover, along any vertical segment at least one of the three norm terms is strictly convex (the three centers $p_+, p_-, q$ do not all share the same first coordinate), so the inequality is strict whenever $y \neq 0$. Hence, every minimizer of $F$ must lie on the $x$-axis.

Write such a minimizer as $g = (x, 0)$. For $g \neq p_\pm, q$, the objective is differentiable and the first-order condition is
\[
0 = \frac{\alpha}{2}\,\frac{g - p_+}{\|g - p_+\|} + \frac{\alpha}{2}\,\frac{g - p_-}{\|g - p_-\|} + (1 - \alpha)\,\frac{g - q}{\|g - q\|}.
\]
The $y$-components cancel, yielding
\[
\alpha u_x = 1 - \alpha,
\]
where $u_x$ is the $x$-component of $(g - p_\pm)/\|g - p_\pm\|$. Evaluating at $g_0 = (\varepsilon/\sin\theta, 0)$ gives $u_x = \cos\theta$, so the above condition holds exactly when $\alpha\cos\theta = 1 - \alpha$, which is our assumption. Thus $g_0$ is a stationary point. Since $F$ is strictly convex along the $x$-axis away from $p_\pm, q$, $g_0$ is the unique minimizer.

Proof of Proposition 4. Write $F(g) = \sum_{i=1}^n w_i \|g - y_i\|_2$. We are going to show that if $w(B_\varepsilon(z; \ell_2)) \geqslant \alpha$, then for any $g \notin (B_{\beta\varepsilon}(z; \ell_2) \cup Y_n)$ we have $\nabla F(g) \neq 0$, and for $g \in Y_n \setminus B_{\beta\varepsilon}(z; \ell_2)$ we have $0 \notin \partial F(g)$. For any $g \notin Y_n$, $F$ is differentiable at $g$ with
\[
\nabla F(g) = \sum_{i=1}^n w_i\, \frac{g - y_i}{\|g - y_i\|_2}.
\]
Specifically, we show that for any $g \notin (B_{\beta\varepsilon}(z; \ell_2) \cup Y_n)$, the inner product $\big\langle \nabla F(g), \frac{g - z}{\|g - z\|_2} \big\rangle \neq 0$, which implies $\nabla F(g) \neq 0$ and $g \notin \mathrm{med}_2(Y_n, w)$. For any $y_i \in B_\varepsilon(z; \ell_2)$, we have by Lemma 15,
\[
\Big\langle \frac{g - y_i}{\|g - y_i\|_2}, \frac{g - z}{\|g - z\|_2} \Big\rangle \;\geqslant\; \frac{\sqrt{\beta^2 - 1}}{\beta}. \tag{2}
\]
Moreover, for any $y_i \notin B_\varepsilon(z; \ell_2)$, we have
\[
\Big\langle \frac{g - y_i}{\|g - y_i\|_2}, \frac{g - z}{\|g - z\|_2} \Big\rangle \;\geqslant\; -1. \tag{3}
\]
Overall, combining (2) and (3), for any $g \notin (B_{\beta\varepsilon}(z; \ell_2) \cup Y_n)$ we have
\[
\Big\langle \nabla F(g), \frac{g - z}{\|g - z\|_2} \Big\rangle
= \sum_{i : y_i \in B_\varepsilon(z; \ell_2)} w_i \Big\langle \frac{g - y_i}{\|g - y_i\|_2}, \frac{g - z}{\|g - z\|_2} \Big\rangle + \sum_{i : y_i \notin B_\varepsilon(z; \ell_2)} w_i \Big\langle \frac{g - y_i}{\|g - y_i\|_2}, \frac{g - z}{\|g - z\|_2} \Big\rangle
\geqslant w(B_\varepsilon(z; \ell_2)) \cdot \frac{\sqrt{\beta^2 - 1}}{\beta} - \big(1 - w(B_\varepsilon(z; \ell_2))\big)
= w(B_\varepsilon(z; \ell_2)) \cdot \frac{\beta + \sqrt{\beta^2 - 1}}{\beta} - 1 > 0,
\]
where the last inequality uses $w(B_\varepsilon(z; \ell_2)) \geqslant \alpha > \beta/(\beta + \sqrt{\beta^2 - 1})$. Similar arguments show that for $g \in Y_n \setminus B_{\beta\varepsilon}(z; \ell_2)$ we have $0 \notin \partial F(g)$. These conclude the proof of (i) by showing that
\[
\mathrm{med}_2(Y_n, w) \subseteq \{ g : \nabla F(g) = 0 \ \text{or}\ 0 \in \partial F(g) \} \subseteq B_{\beta\varepsilon}(z; \ell_2).
\]
For the proof of (ii), we consider the example in Lemma 16, which shows that the guarantee in Proposition 4 is sharp. Indeed, observe that $\|p_\pm\|_2 = \varepsilon$, so $p_\pm \in B_\varepsilon(0)$. Moreover,
\[
w(B_\varepsilon(0)) = w(p_+) + w(p_-) = \alpha.
\]
However, since $\sin\theta < 1$, we have $\|g^\star\|_2 = \varepsilon/\sin\theta > \varepsilon$. More generally, for any $\beta > 1$, the condition
\[
\frac{\varepsilon}{\sin\theta} > \beta\varepsilon \iff \sin\theta < \frac{1}{\beta}
\]
ensures that $g^\star \notin B_{\beta\varepsilon}(0)$. Using $\alpha\cos\theta = 1 - \alpha$ and $\sin^2\theta + \cos^2\theta = 1$, one checks that
\[
\sin\theta < \frac{1}{\beta} \iff \alpha < \frac{\beta}{\beta + \sqrt{\beta^2 - 1}}.
\]
This concludes the proof of claim (ii).

B.2 Proofs from Section 3.2

Equal proportion across coordinates. The following lemma formalizes the observation that, at any weighted TV (or $\ell_1$) geometric median under the simplex constraint, all active coordinates share a common imbalance proportion.
Lemma 17 (Equal proportion of mass across coordinates). Let $F(g) = \frac{1}{2}\sum_{i=1}^n w_i \|y_i - g\|_1$ with $\sum_i w_i = 1$, and suppose $g \in \Delta_d$ minimizes $F$ over the simplex $\Delta_d = \{ g \in \mathbb{R}^d_+ : \sum_j g_j = 1 \}$. For each coordinate $j$, define the one-sided weights
\[
W^+_j(t) := \sum_{i : y_{ij} \leqslant t} w_i, \qquad W^-_j(t) := \sum_{i : y_{ij} < t} w_i.
\]
Then there exists $\tau \in [0, 1]$ such that $\tau \in [W^-_j(g_j), W^+_j(g_j)]$ for every coordinate $j \in [d]$.

Proof of Proposition 5. Let $g$ be a weighted TV geometric median of $(Y_n, w)$ in $\Delta_d$. Assume $w(B_\varepsilon(z; \mathrm{TV})) > \frac{1}{2}$ and invoke Lemma 17 to obtain
\[
\tau \in \bigcap_{j=1}^d \big[ W^-_j(g_j), W^+_j(g_j) \big].
\]
Without loss of generality (by symmetry of the left/right roles), assume $\tau \geqslant \frac{1}{2}$. Then for every coordinate $j$ we have
\[
W^+_j(g_j) \geqslant \tau \geqslant \tfrac{1}{2}. \tag{4}
\]

Lemma 18. If $g_j < z_j$, then $|g_j - z_j| \leqslant 2\varepsilon$.

Proof of Lemma 18. Suppose, for contradiction, that for some coordinate $j$ we have $g_j < z_j$ and $z_j - g_j > 2\varepsilon$. Consider the TV ball $B_\varepsilon(z; \mathrm{TV}) = \{ x : \frac{1}{2}\|x - z\|_1 \leqslant \varepsilon \}$. For any $y_i \in B_\varepsilon(z; \mathrm{TV})$, we must have $|y_{ij} - z_j| \leqslant 2\varepsilon$, hence $y_{ij} \geqslant z_j - 2\varepsilon > g_j$. That is, every such $y_i$ satisfies $y_{ij} > g_j$, so all points within $B_\varepsilon(z; \mathrm{TV})$ (whose total weight exceeds $\frac{1}{2}$) project into the interval $(g_j, 1]$ along coordinate $j$. Therefore,
\[
W^+_j(g_j) = \sum_{i : y_{ij} \leqslant g_j} w_i \;<\; 1 - w(B_\varepsilon(z; \mathrm{TV})) \;<\; \tfrac{1}{2},
\]
contradicting (4). Hence, $z_j - g_j \leqslant 2\varepsilon$ whenever $g_j < z_j$.

Because $g, z \in \Delta_d$, we have $\sum_{j=1}^d (g_j - z_j) = 0$, so
\[
\mathrm{TV}(g, z) = \frac{1}{2}\sum_{j=1}^d |g_j - z_j| = \sum_{j : g_j < z_j} (z_j - g_j) \;\leqslant\; 2(d-1)\varepsilon,
\]
since at most $d - 1$ coordinates satisfy $g_j < z_j$ and, by Lemma 18, each such coordinate contributes at most $2\varepsilon$. This completes the proof, as the argument only required $w(B_\varepsilon(z; \mathrm{TV})) > \frac{1}{2}$.

Lemma 19. Let $g, y, z \in \Delta_d$ be probability distributions and let $\sqrt{\cdot}$ denote the element-wise square root. For any $\beta > 1$, suppose $\mathrm{H}(z, g) > \beta\, \mathrm{H}(z, y)$. Let
\[
p_y = \sqrt{g} - \sqrt{y} - \langle \sqrt{g} - \sqrt{y}, \sqrt{g} \rangle \sqrt{g}
\quad \text{and} \quad
p_z = \sqrt{g} - \sqrt{z} - \langle \sqrt{g} - \sqrt{z}, \sqrt{g} \rangle \sqrt{g}
\]
be the tangent projections of $\sqrt{g} - \sqrt{y}$ and $\sqrt{g} - \sqrt{z}$ at the point $\sqrt{g}$. Then we have
\[
\langle p_y, p_z \rangle \;\geqslant\;
\begin{cases}
\mathrm{H}(y, g)\, \mathrm{H}(z, g), & d = 2, \\[1mm]
\dfrac{\beta^2 - 1}{\beta^2}\, \mathrm{H}(y, g)\, \mathrm{H}(z, g), & d \geqslant 3.
\end{cases}
\]

Proof of Lemma 19. Let $\langle \sqrt{g}, \sqrt{y} \rangle = \cos a$, $\langle \sqrt{g}, \sqrt{z} \rangle = \cos b$, and $\langle \sqrt{y}, \sqrt{z} \rangle = \cos c$, where $0 \leqslant a, b, c \leqslant \frac{\pi}{2}$. Then we have $\mathrm{H}(y, g) = \sqrt{1 - \cos a}$, $\mathrm{H}(z, g) = \sqrt{1 - \cos b}$, and $\mathrm{H}(z, y) = \sqrt{1 - \cos c}$. Thus, we have
\[
\langle p_y, p_z \rangle = \langle \sqrt{g} - \sqrt{y}, \sqrt{g} - \sqrt{z} \rangle - \langle \sqrt{g} - \sqrt{y}, \sqrt{g} \rangle \langle \sqrt{g} - \sqrt{z}, \sqrt{g} \rangle
= 1 - \cos a - \cos b + \cos c - (1 - \cos a)(1 - \cos b)
= \cos c - \cos a \cos b.
\]
Thus, our ratio of interest is
\[
\frac{\langle p_y, p_z \rangle}{\mathrm{H}(y, g)\, \mathrm{H}(z, g)} = \frac{\cos c - \cos a \cos b}{\sqrt{(1 - \cos a)(1 - \cos b)}}.
\]
When $d = 2$, $\langle \sqrt{y}, \sqrt{z} \rangle = \cos(a - b)$ or $\cos(a + b)$. But since we have
\[
\frac{\mathrm{H}(z, g)}{\mathrm{H}(z, y)} = \sqrt{\frac{1 - \cos b}{1 - \langle \sqrt{y}, \sqrt{z} \rangle}} > \beta > 1,
\]
we further have $\langle \sqrt{y}, \sqrt{z} \rangle > \cos b$. Thus $\langle \sqrt{y}, \sqrt{z} \rangle = \cos(a - b)$. Then, when $d = 2$, we have
\[
\frac{\langle p_y, p_z \rangle}{\mathrm{H}(y, g)\, \mathrm{H}(z, g)} = \frac{\cos(a - b) - \cos a \cos b}{\sqrt{(1 - \cos a)(1 - \cos b)}} = \frac{\sin a \cdot \sin b}{\sqrt{1 - \cos a}\,\sqrt{1 - \cos b}}.
\]
Since $\frac{\sin x}{\sqrt{1 - \cos x}} = \sqrt{2}\cos(\frac{x}{2}) \geqslant \sqrt{2}\cos(\frac{\pi}{4}) = 1$ for $x \in [0, \frac{\pi}{2}]$, we conclude that for $d = 2$,
\[
\langle p_y, p_z \rangle \geqslant \mathrm{H}(y, g)\, \mathrm{H}(z, g).
\]
When $d \geqslant 3$, let
\[
u = \frac{\mathrm{H}(z, g)}{\mathrm{H}(z, y)} = \sqrt{\frac{1 - \cos b}{1 - \cos c}} > \beta > 1.
\]
This implies $\cos c > \cos b$ and $\cos b < 1$. Let $h(x) = \frac{x^2 - 1}{x^2}$. Since $h(x)$ is monotonically increasing, we have $h(u) > h(\beta)$. We are going to show our target inequality as the first inequality below:
\[
\frac{\langle p_y, p_z \rangle}{\mathrm{H}(y, g)\, \mathrm{H}(z, g)} = \frac{\cos c - \cos a \cos b}{\sqrt{(1 - \cos a)(1 - \cos b)}} \;\geqslant\; h(u) = \frac{\cos c - \cos b}{1 - \cos b} \;>\; \frac{\beta^2 - 1}{\beta^2}. \tag{5}
\]
This will imply our desired result for $d \geqslant 3$. Let
\[
f(x) = \frac{\cos c - x \cos b}{\sqrt{1 - x}}.
\]
The derivative is
\[
f'(x) = \frac{-\cos b}{\sqrt{1 - x}} + \frac{\cos c - x\cos b}{2(1 - x)^{3/2}} = \frac{x\cos b + \cos c - 2\cos b}{2(1 - x)^{3/2}}.
\]
This implies cos c − cos a cos b √ 1 − cos a = f (cos a ) ⩾ f (0) = cos c. In this case, our target ( 5 ) b ecomes the last inequality b elo w ⟨ p y , p z ⟩ H( y , g )H( z , g ) = cos c − cos a cos b p (1 − cos a )(1 − cos b ) = f (cos a ) √ 1 − cos b ⩾ f (0) √ 1 − cos b = cos c √ 1 − cos b ⩾ cos c − cos b 1 − cos b . The last inequalit y is equiv alen t to cos 2 c (1 − cos b ) ⩾ (cos c − cos b ) 2 . 23 This is again equiv alent to 2 cos b cos c ⩾ cos 2 c cos b + cos 2 b. This inequalit y is true b ecause 1 ⩾ cos c > cos b . This concludes the pro of in Case I. Case I I : If cos c < 2 cos b , then f ( x ) obtains its minimum at the p oin t 2 cos b − cos c cos b . Th us our target ( 5 ) b ecomes the last inequality b elo w ⟨ p y , p z ⟩ H( y , g )H( z , g ) = cos c − cos a cos b p (1 − cos a )(1 − cos b ) = f (cos a ) √ 1 − cos b ⩾ f  2 cos b − cos c cos b  · 1 √ 1 − cos b = 2 p (cos c − cos b ) cos b ⩾ cos c − cos b (1 − cos b ) . The last inequalit y is equiv alen t to 5 cos b − 4 cos 2 b ⩾ cos c. This inequalit y holds b ecause 5 cos b − 4 cos 2 b ⩾ 2 cos b · I (cos b ⩽ 3 / 4) + I (cos b > 3 / 4) ⩾ cos c. This concludes the pro of in Case I I. Th us concludes our pro of. Pr o of of Pr op osition 6 . W rite F ( √ g ) = P n i =1 w i H( y i , g ) = √ 2 2 P n i =1 w i ∥ √ g − √ y i ∥ 2 . W e are going to sho w that if w ( B ε ( z ; H)) > α , then for any g / ∈ ( B β ε ( z ; H) S Y n ) , w e hav e ∇ F ( √ g ) ∥ √ g and for g ∈ S \ B β ε ( z ; H) , we hav e t √ g / ∈ ∂ F ( √ g ) for any t ∈ R . F or any √ g / ∈ S , F is differen tiable at √ g with ∇ F ( √ g ) = √ 2 2 n X i =1 w i √ g − √ y i ∥ √ g − √ y i ∥ 2 . Then w e can consider the pro jection on to the tangent space at √ g , which is ( ∇ F ( √ g )) ∥ = ∇ F ( √ g ) − ⟨∇ F ( √ g ) , √ g ⟩ √ g = n X i =1 √ 2 w i 2 ∥ √ g − √ y i ∥ 2 ( √ g − √ y i − ⟨ √ g − √ y i , √ g ⟩ √ g ) = n X i =1 √ 2 w i 2 ∥ √ g − √ y i ∥ 2 p y i , 24 where p y i = √ g − √ y i − ⟨ √ g − √ y i , √ g ⟩ √ g . Let p z = √ g − √ z − ⟨ √ g − √ z , √ g ⟩ √ g . Specifically , w e show that for any g / ∈ ( B β ε ( z ; H) S Y n ) , the inner pro duct D ( ∇ F ( √ g )) ∥ , p z ∥ √ g − √ z ∥ 2 E  = 0 , which implies √ g ∥ ∇ F ( √ g ) and g / ∈ med 2 ( Y n , w ) . F or any y i , w e hav e b y Lemma 19 , ⟨ p y i , p z ⟩ ⩾          1 2 ∥ √ g − √ y ∥ 2 ∥ √ g − √ z ∥ 2 , d = 2 , β 2 − 1 2 β 2 ∥ √ g − √ y ∥ 2 ∥ √ g − √ z ∥ 2 , d ⩾ 3 . (6) Moreo ver, for any y i / ∈ B ε ( z ; H) , we hav e ⟨ p y i , p z ⟩ ⩾ − 1 . (7) Ov erall, combining ( 6 ) and ( 7 ), for any g / ∈ ( B β ε ( z ; H) S Y n ) w e hav e √ 2  ( ∇ F ( √ g )) ∥ , p z ∥ √ g − √ z ∥ 2  = n X i =1 w i ∥ √ g − √ y i ∥ 2  p y i , p z ∥ √ g − √ z ∥ 2  = X i : y i ∈ B ε ( z ;H) w i  p y i ∥ √ g − √ y i ∥ 2 , p z ∥ √ g − √ z ∥ 2  + X i : y i / ∈ B ε ( z ;H) w i  p y i ∥ √ g − √ y i ∥ 2 , p z ∥ √ g − √ z ∥ 2  ⩾          w ( B ε ( z ; H)) · 1 2 − (1 − w ( B ε ( z ; H))) , d = 2 , w ( B ε ( z ; H)) · β 2 − 1 2 β 2 − (1 − w ( B ε ( z ; H))) , d = 3 , > 0 . where the last inequality is b y the assumption that w ( B ε ( z ; H)) = α > α H ( β , d ) . Similar argumen ts can b e made to show that for g ∈ Y n \ B β ε ( z ; H) , we hav e t √ g / ∈ ∂ F ( √ g ) for all t ∈ R . These conclude the pro of by showing that med H ( Y n , w ) ⊆ { g : ∇ F ( √ g ) ∥ √ g or t √ g ∈ ∂ F ( √ g ) for some t ∈ R } ⊆ B β ε ( z ; H) . Pr o of of Pr op osition 7 . W e first v erify that the weigh t assignment is feasible, e.g., w ( y 2 ) ∈ (0 , 1) . 
Proof of Proposition 7. We first verify that the weight assignment is feasible, i.e., $w(y_2) \in (0,1)$. For $\delta \in (0, 1/25)$, the function $\delta \mapsto 3\sqrt{\delta\log(1/\delta)}$ is monotonically increasing, it tends to $0$ as $\delta \to 0^+$, and $3\sqrt{\log(25)/25} < 1$. This proves $w(y_2) \in (0,1)$.

Define
$$F(g) = w_1\sqrt{\mathrm{KL}(y_1, g)} + w_2\sqrt{\mathrm{KL}(y_2, g)}, \qquad g \in \Delta_2.$$
Since $\mathrm{KL}(y_1, y_1) = 0$, we have $F(y_1) = w_2\sqrt{\mathrm{KL}(y_2, y_1)}$. Let $g' := (2\delta, 1 - 2\delta) \in \Delta_2$. We will show $F(g') < F(y_1)$, which proves our claim.

Step 1: bound $\sqrt{\mathrm{KL}(y_1, g')}$. A direct calculation yields
$$\mathrm{KL}(y_1, g') = \delta\log\frac{\delta}{2\delta} + (1-\delta)\log\frac{1-\delta}{1-2\delta} = -\delta\log 2 + (1-\delta)\log\Big(1 + \frac{\delta}{1-2\delta}\Big).$$
Using $\log(1+x) \le x$ and $\delta \le 1/25$, we get
$$\mathrm{KL}(y_1, g') \le -\delta\log 2 + \frac{(1-\delta)\,\delta}{1-2\delta} \le \delta\Big(\frac{24}{23} - \log 2\Big) \le \frac{2}{5}\,\delta, \quad\text{hence}\quad \sqrt{\mathrm{KL}(y_1, g')} \le \sqrt{\tfrac{2}{5}\delta}. \qquad (8)$$

Step 2: bound $\sqrt{\mathrm{KL}(y_2, g')}$. For any $a > b > 0$, concavity of $\sqrt{\cdot}$ implies $\sqrt{b} \le \sqrt{a} - \frac{a-b}{2\sqrt{a}}$. Applying this with $a = \mathrm{KL}(y_2, y_1)$ and $b = \mathrm{KL}(y_2, g')$ gives
$$\sqrt{\mathrm{KL}(y_2, g')} \le \sqrt{\mathrm{KL}(y_2, y_1)} - \frac{\mathrm{KL}(y_2, y_1) - \mathrm{KL}(y_2, g')}{2\sqrt{\mathrm{KL}(y_2, y_1)}}. \qquad (9)$$

Step 3: lower bound $\mathrm{KL}(y_2, y_1) - \mathrm{KL}(y_2, g')$. Compute
$$\mathrm{KL}(y_2, y_1) - \mathrm{KL}(y_2, g') = (1-\delta)\log 2 + \delta\log\frac{1-2\delta}{1-\delta}.$$
Using $\log(1-x) \ge -x - x^2$ for $x \in [0, 1/2]$, we obtain
$$\delta\log\frac{1-2\delta}{1-\delta} \ge -\frac{\delta^2}{1-\delta} - \frac{\delta^3}{(1-\delta)^2} \ge -2\delta^2,$$
where we used $\delta \le 1/25$. Therefore, for all $\delta \in (0, 1/25)$,
$$\mathrm{KL}(y_2, y_1) - \mathrm{KL}(y_2, g') \ge (1-\delta)\log 2 - 2\delta^2 \ge \Big(\frac{24}{25} - \frac{2}{25^2\log 2}\Big)\log 2 \ge 0.95\log 2. \qquad (10)$$

Step 4: upper bound $\mathrm{KL}(y_2, y_1)$. Since the density ratio is upper bounded by $1/\delta$,
$$\mathrm{KL}(y_2, y_1) \le \log(1/\delta). \qquad (11)$$

Step 5: combine all the bounds. Using $w_1 \le 1$ and combining (8)–(11),
$$F(g') = w_2\sqrt{\mathrm{KL}(y_2, g')} + w_1\sqrt{\mathrm{KL}(y_1, g')} \le w_2\sqrt{\mathrm{KL}(y_2, y_1)} - \frac{w_2\cdot 0.95\log 2}{2\sqrt{\mathrm{KL}(y_2, y_1)}} + \sqrt{\tfrac{2}{5}\delta} \le F(y_1) - \frac{w_2\cdot 0.95\log 2}{2\sqrt{\log(1/\delta)}} + \sqrt{\tfrac{2}{5}\delta}.$$
Substituting $w_2 = 3\sqrt{\delta\log(1/\delta)}$ yields
$$F(g') \le F(y_1) - \Big(\frac{3\cdot 0.95\log 2}{2} - \sqrt{\tfrac{2}{5}}\Big)\sqrt{\delta} \le F(y_1) - 0.35\sqrt{\delta} < F(y_1).$$
Therefore $y_1$ is not a minimizer of $F$, and consequently $y_1 \notin \mathrm{med}_{\sqrt{\mathrm{KL}}}(S, w)$.
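Below is a small numerical check of this counterexample. The concrete instance $y_1 = (\delta, 1-\delta)$, $y_2 = (1-\delta, \delta)$, $g' = (2\delta, 1-2\delta)$, and weights $w_2 = 3\sqrt{\delta\log(1/\delta)}$, $w_1 = 1 - w_2$ is read off from the displayed KL computations; the formal statement of Proposition 7 lives in the main text, so treat these as assumptions of the sketch. It evaluates $F$ at $y_1$ and at $g'$ and verifies the claimed $0.35\sqrt{\delta}$ gap.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def check(delta):
    # Points and weights as reconstructed from the proof of Proposition 7
    # (assumed, since the proposition statement is in the main text).
    y1 = np.array([delta, 1 - delta])
    y2 = np.array([1 - delta, delta])
    g_prime = np.array([2 * delta, 1 - 2 * delta])
    w2 = 3 * np.sqrt(delta * np.log(1 / delta))
    w1 = 1 - w2

    F = lambda g: w1 * np.sqrt(kl(y1, g)) + w2 * np.sqrt(kl(y2, g))
    gap = F(y1) - F(g_prime)
    print(f"delta={delta:.3f}: F(y1)={F(y1):.4f}, F(g')={F(g_prime):.4f}, "
          f"gap={gap:.4f} >= 0.35*sqrt(delta)={0.35 * np.sqrt(delta):.4f}")
    assert gap >= 0.35 * np.sqrt(delta)

for d in [0.001, 0.01, 0.03]:
    check(d)
```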
Lemma 20 (Lemma 4 of Yang and Barron (1998)). For any $p, q \in \mathbb{R}^d_{>0}$, we have
$$\mathrm{H}^2(p, q) \le \mathrm{KL}(p, q) \le (2 + \log V)\cdot \mathrm{H}^2(p, q),$$
where $V > 0$ is such that $p \le V q$.

Lemma 21. Let $z, y_1, \ldots, y_n \in \Delta_d$. Suppose for every $i$, $z \le V y_i$ coordinate-wise for some $V \ge 1$. Then for any $g \in \mathrm{med}_{\mathrm{H}}(Y_n, w)$, we have $z \le V g$.

Proof of Lemma 21. We first define the projected spherical convex hull of the set $Y_n = \{y_1, \ldots, y_n\}$ by
$$\mathrm{co}_p(Y_n) := \Big\{\, y = \frac{(\sum_i a_i\sqrt{y_i})^2}{\|\sum_i a_i\sqrt{y_i}\|_2^2} \;:\; a \in \Delta_n \Big\},$$
where $\sqrt{\cdot}$ and $(\cdot)^2$ are both element-wise. For any $y = (\sum_i a_i\sqrt{y_i})^2 / \|\sum_i a_i\sqrt{y_i}\|_2^2 \in \mathrm{co}_p(Y_n)$, we have $\sqrt{z} \le \sum_i a_i\sqrt{V y_i}$, and $\|\sum_i a_i\sqrt{y_i}\|_2 \le \sum_i a_i\|\sqrt{y_i}\|_2 \le 1$ by convexity of the norm. This implies that for all $y \in \mathrm{co}_p(Y_n)$,
$$z \le \Big(\sum_i a_i\sqrt{V y_i}\Big)^2 \le V\,\frac{(\sum_i a_i\sqrt{y_i})^2}{\|\sum_i a_i\sqrt{y_i}\|_2^2} = V y. \qquad (12)$$
Now we show that $g \in \mathrm{co}_p(Y_n)$, which implies the desired result. We prove this by contradiction. Assume $g \notin \mathrm{co}_p(Y_n)$ and let $y_g = \arg\min_{y \in \mathrm{co}_p(Y_n)} \|\sqrt{y} - \sqrt{g}\|_2$. Since $\mathrm{co}_p(Y_n)$ is spherically convex, for any $i \in [n]$ and $\lambda \in [0,1]$,
$$\frac{\big((1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\big)^2}{\|(1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\|_2^2} \in \mathrm{co}_p(Y_n).$$
Let
$$f_i(\lambda) = \Big\|\,\sqrt{g} - \frac{(1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}}{\|(1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\|_2}\,\Big\|_2^2.$$
The derivative of $f_i$ is
$$f_i'(\lambda) = -\frac{2\,\langle\sqrt{g},\, \sqrt{y_i} - \sqrt{y_g}\rangle}{\|(1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\|_2} + \frac{2\,\langle\sqrt{g},\, (1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\rangle\big(2\lambda - 1 + (1-2\lambda)\langle\sqrt{y_g},\sqrt{y_i}\rangle\big)}{\|(1-\lambda)\sqrt{y_g} + \lambda\sqrt{y_i}\|_2^3}.$$
Since $y_g \in \arg\min_{x\in\mathrm{co}_p(Y_n)} \|\sqrt{g} - \sqrt{x}\|_2^2$, we have $f_i'(0) \ge 0$. Thus, for any $i$,
$$\langle\sqrt{y_i},\sqrt{g}\rangle \le \langle\sqrt{y_i},\sqrt{y_g}\rangle\,\langle\sqrt{g},\sqrt{y_g}\rangle \le \langle\sqrt{y_i},\sqrt{y_g}\rangle. \qquad (13)$$
This implies
$$\|\sqrt{g} - \sqrt{y_i}\|_2^2 = \|\sqrt{y_g} - \sqrt{y_i}\|_2^2 + 2\langle\sqrt{y_i},\, \sqrt{y_g} - \sqrt{g}\rangle \ge \|\sqrt{y_g} - \sqrt{y_i}\|_2^2.$$
Then, by the definition of $g \in \mathrm{med}_{\mathrm{H}}(Y_n, w)$, we have
$$\sum_i w_i\|\sqrt{g} - \sqrt{y_i}\|_2 \le \sum_i w_i\|\sqrt{y_g} - \sqrt{y_i}\|_2 \le \sum_i w_i\|\sqrt{g} - \sqrt{y_i}\|_2,$$
so equality must hold in (13) for all $i \in [n]$. We then distinguish the following two cases.

Case I: $\langle\sqrt{y_i},\sqrt{y_g}\rangle > 0$ for some $i \in [n]$. Then $\langle\sqrt{g},\sqrt{y_g}\rangle = 1$ by (the equality version of) the second inequality of (13). This means $g = y_g \in \mathrm{co}_p(Y_n)$, which contradicts the assumption that $g \notin \mathrm{co}_p(Y_n)$.

Case II: $\langle\sqrt{y_i},\sqrt{y_g}\rangle = 0$ for all $i \in [n]$. Then $\langle\sqrt{y_i},\sqrt{g}\rangle = 0$ for all $i \in [n]$ by the first inequality of (13), so $\mathrm{H}(y_i, g) = 1 \ge \mathrm{H}(y_i, x)$ for all $i \in [n]$ and $x \in \Delta_d$. This also leads to a contradiction, since $y_1$ then achieves a strictly better weighted sum of norms:
$$\sum_i w_i\|\sqrt{y_1} - \sqrt{y_i}\|_2 = \sum_{i=2}^n w_i\|\sqrt{y_1} - \sqrt{y_i}\|_2 \le \sum_{i=2}^n w_i\|\sqrt{g} - \sqrt{y_i}\|_2 < \sum_{i=1}^n w_i\|\sqrt{g} - \sqrt{y_i}\|_2,$$
contradicting the definition of $g \in \mathrm{med}_{\mathrm{H}}(Y_n, w)$.

Overall, the contradictions in the two cases show that $g \in \mathrm{co}_p(Y_n)$, which together with (12) concludes the proof.

C Proofs from Section 4

Lemma 22. Suppose the divergence $\mathrm{div}$ is $(\alpha - \gamma, \beta)$-boostable. If $\mathrm{div}(y_i, f_T(x_i)) > \beta\varepsilon$, then we have
$$\sum_{t=1}^T \eta_t\big(1 - \alpha - C_\varepsilon(y_i, h_t(x_i))\big) \le 0.$$

Proof of Lemma 22. Since $f_T(x_i) \in \mathrm{med}(\{h_t(x_i)\}_{t=1}^T, \{\eta'_t\}_{t=1}^T; \mathrm{div}, \gamma)$, there exists $(\{u_s\}_{s=1}^S, \{\rho_s\}_{s=1}^S) \in B_\gamma(\{h_t(x_i)\}_{t=1}^T, \{\eta'_t\}_{t=1}^T)$ such that $f_T(x_i) \in \mathrm{med}(\{u_s\}_{s=1}^S, \{\rho_s\}_{s=1}^S; \mathrm{div})$. Since $\mathrm{div}$ is $(\alpha-\gamma, \beta)$-boostable and $\mathrm{div}(y_i, f_T(x_i)) > \beta\varepsilon$, we know that
$$\sum_{s=1}^S \rho_s\,\mathbb{I}\{\mathrm{div}(y_i, u_s) \le \varepsilon\} \le \alpha - \gamma.$$
Since $(\{u_s\}_{s=1}^S, \{\rho_s\}_{s=1}^S) \in B_\gamma(\{h_t(x_i)\}_{t=1}^T, \{\eta'_t\}_{t=1}^T)$, we know
$$\sum_{t=1}^T \eta'_t\,\mathbb{I}\{\mathrm{div}(y_i, h_t(x_i)) \le \varepsilon\} \le \alpha.$$
By the definition of $\eta'_t$ in Algorithm 1, this is equivalent to
$$\sum_{t=1}^T \eta_t\,\mathbb{I}\{\mathrm{div}(y_i, h_t(x_i)) \le \varepsilon\} \le \alpha\sum_{t=1}^T \eta_t,$$
which further implies
$$\sum_{t=1}^T \eta_t\,\mathbb{I}\{\mathrm{div}(y_i, h_t(x_i)) > \varepsilon\} \ge (1 - \alpha)\sum_{t=1}^T \eta_t.$$
Combining this with the assumption on the loss function $C_\varepsilon$, we have
$$\sum_{t=1}^T \eta_t\,C_\varepsilon(y_i, h_t(x_i)) \ge \sum_{t=1}^T \eta_t\,\mathbb{I}\{\mathrm{div}(y_i, h_t(x_i)) > \varepsilon\} \ge (1 - \alpha)\sum_{t=1}^T \eta_t,$$
which is exactly the desired inequality $\sum_{t=1}^T \eta_t\big(1 - \alpha - C_\varepsilon(y_i, h_t(x_i))\big) \le 0$. This concludes the proof.

Proof of Theorem 12. With Lemma 22, we can derive the desired result:
$$L(f_T) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{\mathrm{div}(y_i, f_T(x_i)) > \beta\varepsilon\}$$
$$\le \frac{1}{n}\sum_{i=1}^n \mathbb{I}\Big\{-\sum_{t=1}^T \eta_t\big(1 - \alpha - C_\varepsilon(y_i, h_t(x_i))\big) \ge 0\Big\} \qquad \text{(Lemma 22)}$$
$$\le \frac{1}{n}\sum_{i=1}^n \exp\Big(-\sum_{t=1}^T \eta_t\big(1 - \alpha - C_\varepsilon(y_i, h_t(x_i))\big)\Big) \qquad (\exp(x) \ge \mathbb{I}\{x \ge 0\})$$
$$= \prod_{t=1}^T\Big(\sum_{i=1}^n w_t(x_i)\exp\big(-\eta_t(1 - \alpha - C_\varepsilon(y_i, h_t(x_i)))\big)\Big)\cdot \sum_{i=1}^n w_{T+1}(x_i)$$
$$= \prod_{t=1}^T\Big(\sum_{i=1}^n w_t(x_i)\exp\big(-\eta_t(1 - \alpha - C_\varepsilon(y_i, h_t(x_i)))\big)\Big),$$
where the last two equalities telescope the multiplicative weight update of Algorithm 1 and use that the weights $w_{T+1}$ are normalized.
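The telescoping step at the end of the proof of Theorem 12 can be verified numerically. The sketch below assumes the standard AdaBoost-style multiplicative update (uniform initial weights, $w_{t+1}(x_i) \propto w_t(x_i)\exp(-\eta_t a_{t,i})$); the exact update of Algorithm 1 is given in the main text, so this is a sanity check under that assumption. With $a_{t,i}$ standing in for $1 - \alpha - C_\varepsilon(y_i, h_t(x_i))$, it confirms that $\frac{1}{n}\sum_i \exp(-\sum_t \eta_t a_{t,i})$ equals the product of the per-round normalizers.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 50, 20
eta = rng.uniform(0.1, 1.0, size=T)
# a[t, i] plays the role of 1 - alpha - C_eps(y_i, h_t(x_i)).
a = rng.uniform(-1.0, 1.0, size=(T, n))

# Assumed multiplicative-weights recursion:
# w_1 uniform, w_{t+1}(i) = w_t(i) * exp(-eta_t * a[t, i]) / Z_t.
w = np.full(n, 1.0 / n)
prod_Z = 1.0
for t in range(T):
    unnorm = w * np.exp(-eta[t] * a[t])
    Z_t = unnorm.sum()
    prod_Z *= Z_t
    w = unnorm / Z_t

# (1/n) sum_i exp(-sum_t eta_t * a[t, i]) should equal prod_t Z_t.
lhs = np.mean(np.exp(-(eta[:, None] * a).sum(axis=0)))
print(f"(1/n) sum_i exp(...) = {lhs:.6f},   prod_t Z_t = {prod_Z:.6f}")
assert np.isclose(lhs, prod_Z)
```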
