Calibeating Made Simple

Yurong Chen¹, Zhiyi Huang², Michael I. Jordan¹³, Haipeng Luo⁴
¹ Inria, École Normale Supérieure, PSL Research University
² The University of Hong Kong
³ University of California, Berkeley
⁴ University of Southern California
yurong.chen@inria.fr, zhiyi@cs.hku.hk, jordan@cs.berkeley.edu, haipengl@usc.edu

Abstract

We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the $O(\log T)$ calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including the Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal $O(\log T)$ calibeating rate.

1 Introduction

Calibration has attracted growing attention in recent years as a desideratum for probabilistic prediction, motivated by the need to produce reliable probabilities for downstream decision-making [Guo+17]. Despite its appeal as a benchmark for reliability, however, calibration is not necessarily a meaningful test of forecasting expertise.
For example, online calibration can be achieved by randomized strategies without any knowledge of the data-generating process [FV98]. Hence, calibration alone cannot distinguish true expertise from uninformative procedures. To quantify and preserve forecasting expertise, Foster and Hart [FH23] introduced calibeating in a post-processing setting. In this setting, an external forecaster (e.g., a machine learning model) outputs a probabilistic forecast at each round, and then the learner produces its own forecast based on it. It is known that for proper losses such as the Brier and log losses, the cumulative score can be decomposed into a calibration term, which measures the reliability, and a refinement term, which measures the informativeness and skill. This motivates the question of whether one can improve reliability without sacrificing skill. Calibeating formalizes this goal by requiring the learner's loss to be as small as the external forecaster's refinement score (in other words, to "beat" the forecaster by its calibration error). Existing work [FH23; Lee+22] establishes online calibeating guarantees for the Brier and log losses, and studies extensions such as beating multiple forecasters (multi-calibeating) and imposing simultaneous calibration constraints. These results rely on loss-specialized analyses. More broadly, the fundamental statistical difficulty of calibeating and its relationship to standard online-learning problems have remained unclear, leaving open whether known bounds are optimal or how they generalize beyond the Brier and log losses.

1.1 Our Results

We study calibeating from an online-learning perspective. Rather than analyzing different losses on a case-by-case basis, we identify simple reductions from (multi-)calibeating to standard online-learning primitives.
This yields a "plug-and-play" analysis: by instantiating the reductions with classical online-learning algorithms, we obtain general upper and lower bounds in a modular way.

Calibeating = no-regret learning (Section 3). We prove that calibeating is minimax-equivalent to regret minimization. Theorem 3.1 gives a reduction that turns any no-regret learner with regret bound $\alpha(T)$ into a calibeating algorithm with a corresponding bound that scales with $|Q|$, the number of distinct external forecast values over $T$ rounds. The reduction exploits the fact that the refinement benchmark decomposes across distinct forecast values, allowing one to treat each corresponding subsequence independently. Instantiating this reduction recovers the $O(|Q|\log T)$ guarantees for the Brier and log losses from Foster and Hart [FH23] and extends them to general mixable losses (Corollary 3.4). We also obtain an $O(\sqrt{|Q|T})$ bound for general bounded proper losses (Corollary 3.2). Conversely, Theorem 3.5 provides a matching lower bound that completes the minimax-equivalence.

Multi-calibeating = calibeating + expert problem (Section 4). Next, we present (in Theorem 4.1) a simple decomposition of multi-calibeating into calibeating and the expert problem: run a separate calibeating subroutine for each forecaster to produce candidate predictions, then aggregate them with an expert algorithm. The resulting multi-calibeating guarantee is the sum of the calibeating bound and the expert regret bound. For mixable losses, we obtain a logarithmic bound of $O(\log N + |Q^{(n)}|\log T)$ (Corollary 4.3), where $N$ is the number of forecasters and $Q^{(n)}$ is the set of distinct forecasts produced by forecaster $n$. This improves exponentially over the polynomial dependence on $N$ in Foster and Hart [FH23] and the polynomial dependence on $T$ in Lee et al. [Lee+22].
We complement this with a lower-bound reduction (Theorem 4.4), showing that multi-calibeating inherits hardness from both the expert problem and the calibeating problem. This yields matching lower bounds for the Brier and log losses and shows the tightness of our results (Corollary 4.6).

Simultaneous (multi-)calibeating and calibration for Brier loss (Section 5). Finally, we provide new bounds for achieving calibeating and calibration simultaneously. We propose a meta-algorithm that tracks an arbitrary reference algorithm while ensuring calibration (Theorem 5.1). The construction employs two existing online learning primitives: the reduction by [BM07] to enforce calibration via the connection between calibration and swap regret, and a two-expert algorithm by Sani, Neu, and Lazaric [SNL14] to aggregate the predictions from the Blum–Mansour (BM) reduction and the reference algorithm. Instantiating the reference with the (multi-)calibeating algorithms from the previous sections, for the Brier loss, we obtain for the binary case the optimal logarithmic (multi-)calibeating rate of $O(\log N + |Q^{(n)}|\log T)$ while ensuring a sublinear $\ell_2$-calibration error of order $\tilde O(\sqrt T)$ (Corollary 5.2). This improves the polynomial $T$-dependence on the calibeating side in Foster and Hart [FH23] and improves both sides compared to Lee et al. [Lee+22]. For multi-class outcomes, we derive explicit tradeoffs between (multi-)calibeating and calibration (Corollary 5.3). In particular, at one extreme, we recover the known dependence on $T$ for calibration [FH23; Fis+25] while dropping the $|Q^{(n)}|$ dependence in Foster and Hart [FH23].

For a summary of our results and comparisons with prior work, see Table 1.

Table 1: Comparison of prior and our guarantees in $N$, $T$, $K$ (the number of outcomes), and $|Q|$ (we assume $|Q^{(n)}| = |Q|$ for simplicity). For simultaneous calibeating and calibration, the first rate is for calibeating and the second for calibration.
We omit polynomial dependence on $K$ for presentation clarity, and $\tilde O$ omits logarithmic dependence on $T$. The simultaneous results of Foster and Hart [FH23] are only for calibeating (but not multi-calibeating), so we only show results for calibeating for comparison. The results of Lee et al. [Lee+22] are only for binary outcomes.

Setting | Loss class | Prior work | This paper
Calibeating | Mixable | – | $\Theta(|Q|\log T)$ (Cor. 3.4, 3.7)
Calibeating | Brier | $\Theta(|Q|\log T)$ [FH23] |
Calibeating | Log | $O(|Q|\log T)$ [FH23] |
Calibeating | Bounded | – | $\Theta(\sqrt{|Q|KT})$ (Cor. 3.2, 3.6)
Multi-calibeating | Mixable | – | $\Theta(\log N + |Q|\log T)$ (Cor. 4.3, 4.6)
Multi-calibeating | Brier | $O((N+|Q|)\log T)$ [FH23]; $O(\sqrt{NT} + |Q|\log T)$ [FH23]; $\tilde O(\sqrt{|Q|}\,(\log N)^{1/4}\, T^{3/4})$ [Lee+22] |
Multi-calibeating | Bounded | – | $\Theta(\sqrt{T\log N} + \sqrt{|Q|KT})$ (Cor. 4.2, 4.5)
Calibeating & calibration | Brier, binary | $\tilde O(|Q|^{2/3}T^{1/3})$, $\tilde O(|Q|^{2/3}T^{1/3})$ [FH23] | $O(|Q|\log T)$, $\tilde O(\sqrt T)$ (Cor. 5.2)
Calibeating & calibration | Brier, $K$-class | $\tilde O(|Q|^{2/(K+1)}T^{(K-1)/(K+1)})$, $\tilde O(|Q|^{2/(K+1)}T^{(K-1)/(K+1)})$ [FH23] | $\tilde O(|Q| + T^{(K-1)/(K+1)})$, $\tilde O(T^{(K-1)/(K+1)})$ (Cor. 5.3)

1.2 Related Work

Calibeating. In the seminal work proposing calibeating, Foster and Hart [FH23] give online guarantees for the Brier and log losses via a bin-wise estimation viewpoint. They also study extensions to multiple forecasters and to simultaneous calibeating and calibration. Lee et al. [Lee+22] formulate simultaneous multi-calibration and multi-calibeating as an online multi-objective optimization problem, achieving favorable dependence on the number of external forecasters but suboptimal dependence on the time horizon. In comparison, our results are obtained via reductions that connect calibeating to standard online-learning problems. Finally, as an application, Gupta and Ramdas [GR23] apply calibeating as a robustness layer on top of online Platt scaling to guarantee adversarial calibration in binary classification while preserving predictive performance.

Online recalibration.
Online recalibration is studied in the same post-processing setting as calibeating, but it benchmarks performance by proper-loss regret rather than refinement. The goal is to achieve small regret relative to the external forecaster while simultaneously ensuring calibrated predictions [MKE25; DMK24]. For binary classification, Kuleshov and Ermon [KE17] provide adversarial online guarantees, and Okoroafor, Kleinberg, and Sun [OKS24] obtain improved bounds and explicit regret versus $\ell_1$-calibration tradeoffs via Blackwell approachability for strictly proper losses. These tradeoffs yield sublinear but typically polynomial-in-$T$ rates. While we focus on the Brier loss in our simultaneous guarantee, we target the stronger refinement benchmark and achieve logarithmic-in-$T$ rates while still ensuring sublinear $\ell_2$-calibration.

Calibration and proper scoring loss. Proper scoring losses admit classical decompositions into a reliability (calibration) term and an informativeness (refinement) term [Daw06; San63; Brö09]. In this spirit, $\ell_2$-calibration [FV98] and KL-calibration [LSS25] can be viewed as online analogues of the calibration term for the Brier loss and the log loss, respectively; more generally, this motivates defining online calibration measures compatible with arbitrary proper scoring losses. Several calibration notions, including $\ell_2$-calibration [Fis+25] and KL-calibration [LSS25], have been shown to be equivalent to swap-regret objectives. We exploit this connection in our simultaneous guarantees by enforcing (pseudo-)swap regret via the BM reduction [BM07].

2 Model

We consider an online prediction problem over a finite outcome space with $N$ external forecasts. Let $K \ge 2$ be the number of possible outcomes, and let $\Delta_K := \{p \in \mathbb{R}^K_{\ge 0} : \sum_{k=1}^K p_k = 1\}$ be the probability simplex. We let $[n]$ denote the set $\{1, \ldots, n\}$ for any positive integer $n$.
The outcome space is denoted by $\mathcal{E} := \{e_i : i \in [K]\} \subseteq \Delta_K$, where $e_i$ is the $i$-th standard basis vector. The interaction proceeds for $T$ rounds. At each round $t \in [T]$, the learner first observes $N$ external forecasts, $q^{(n)}_t \in \Delta_K$, $n \in [N]$, and makes its own prediction $p_t \in \Delta_K$. The outcome $y_t \in \mathcal{E}$ is then revealed, and the learner incurs loss $\ell(p_t, y_t)$. For simplicity, we assume that $q_{1:T} := (q_t)_{t=1}^T$ and $y_{1:T} := (y_t)_{t=1}^T$ are generated by an oblivious adversary, i.e., they are decided at time $t = 0$ with complete knowledge of the learner's algorithm (but not its random bits).

Throughout, we consider a proper scoring loss $\ell : \Delta_K \times \mathcal{E} \to \mathbb{R}$, i.e., a loss such that for any $q \in \Delta_K$, $q \in \arg\min_{p \in \Delta_K} \mathbb{E}_{y \sim q}[\ell(p, y)]$. We write $\ell(p, q) := \mathbb{E}_{y \sim q}[\ell(p, y)]$. Let $\mathbb{1}\{\cdot\}$ denote the indicator function, which equals one if the condition holds and zero otherwise. Given a prediction sequence $p_{1:T}$ and an outcome sequence $y_{1:T}$, for any $p \in \Delta_K$, denote the number of times the learner predicts $p$ by $n_T(p) := \sum_{t=1}^T \mathbb{1}\{p_t = p\}$, and the empirical outcome distribution conditioned on prediction $p$ by $\rho^p_T(y) := \frac{1}{n_T(p)} \sum_{t=1}^T \mathbb{1}\{p_t = p, y_t = y\}$ for $y \in \mathcal{E}$, whenever $n_T(p) > 0$. With these definitions, the cumulative loss, refinement score, and calibration error are defined as follows.

Definition 2.1. The cumulative loss of predictions $p_{1:T}$ under outcomes $y_{1:T}$ is
\[ L_T(p_{1:T}, y_{1:T}) := \sum_{t=1}^T \ell(p_t, y_t). \]
The refinement score is
\[ R_T(p_{1:T}, y_{1:T}) := \sum_p n_T(p)\, \ell(\rho^p_T, \rho^p_T) = \sum_p \min_{q \in \Delta_K} \sum_{t : p_t = p} \ell(q, y_t). \]
Finally, the calibration error is
\[ K_T(p_{1:T}, y_{1:T}) := L_T(p_{1:T}, y_{1:T}) - R_T(p_{1:T}, y_{1:T}). \]

By construction, $L_T = R_T + K_T$ and $K_T \ge 0$. Moreover, $K_T$ coincides with the full-swap-regret notion of Fishelson et al. [Fis+25], while $R_T$ corresponds to the best-in-hindsight swap-regret benchmark.
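To make Definition 2.1 concrete, here is a minimal sketch (our own code, not from the paper; the function names are illustrative) that computes $L_T$, $R_T$, and $K_T$ for the Brier loss and checks the identity $L_T = R_T + K_T$:

```python
import numpy as np
from collections import defaultdict

def brier(p, y):
    """Brier loss between a prediction p and a one-hot outcome y."""
    return float(np.sum((p - y) ** 2))

def decompose(preds, outcomes):
    """Return (L_T, R_T, K_T) for the Brier loss, following Definition 2.1."""
    L = sum(brier(p, y) for p, y in zip(preds, outcomes))
    bins = defaultdict(list)                 # group rounds by the issued prediction
    for p, y in zip(preds, outcomes):
        bins[tuple(p)].append(y)
    R = 0.0
    for _, ys in bins.items():
        rho = np.mean(ys, axis=0)            # empirical conditional distribution rho_T^p
        R += sum(brier(rho, y) for y in ys)  # loss of the best constant predictor in the bin
    return L, R, L - R

# Two-outcome example: the learner predicts (0.7, 0.3) on every round,
# while the empirical frequency of the first outcome is 0.75.
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
preds = [np.array([0.7, 0.3])] * 4
outcomes = [e0, e0, e1, e0]
L, R, K = decompose(preds, outcomes)
```

In this toy instance $K_T = n_T(p)\,\|p - \rho^p_T\|_2^2 = 4 \cdot 2 \cdot 0.05^2 = 0.02$, and $L_T = R_T + K_T$ holds as Definition 2.1 dictates.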
Indeed, for each prediction $p$, the refinement term equals the loss of the best constant predictor over rounds with $p_t = p$. Thus, the refinement score measures the informativeness of the forecasts: sequences that induce finer bins with lower within-bin variability achieve smaller refinement. In contrast, the calibration error measures within-bin reliability, i.e., how close the issued prediction $p$ is to the empirical conditional distribution $\rho^p_T$ on the corresponding subsequence.

Proper scoring losses admit a classic decomposition into terms measuring the informativeness (or refinement) of forecasts and their reliability (or calibration) in a probabilistic setting; see, e.g., Bröcker [Brö09] and Dawid [Daw06]. Definition 2.1 can be seen as the empirical counterpart of these quantities.

Example 2.2. For the Brier loss $\ell(p, y) = \|p - y\|_2^2$, the refinement score equals the weighted sum of within-bin variances, and the calibration error becomes the $\ell_2$-calibration error [FV98]:
\[ R_T(p_{1:T}, y_{1:T}) = \sum_p n_T(p) \sum_{t : p_t = p} \frac{1}{n_T(p)} \|\rho^p_T - y_t\|_2^2, \qquad K_T(p_{1:T}, y_{1:T}) = \sum_p n_T(p)\, \|p - \rho^p_T\|_2^2. \]

Example 2.3. Denote the Shannon entropy under distribution $p$ by $H(p) = -\sum_k p_k \log p_k$. For the log loss $\ell(p, y) = -\sum_{k=1}^K y_k \log p_k$, the refinement score equals the weighted sum of the Shannon entropy within each bin, and the calibration error becomes the KL-calibration error [LSS25]:
\[ R_T(p_{1:T}, y_{1:T}) = \sum_p n_T(p)\, H(\rho^p_T), \qquad K_T(p_{1:T}, y_{1:T}) = \sum_p n_T(p)\, \mathrm{KL}(\rho^p_T \,\|\, p). \]

Motivated by this decomposition, Foster and Hart [FH23] compare the learner to the external forecaster's refinement score and define the notions of calibeating and multi-calibeating.

Definition 2.4 (Calibeating and Multi-Calibeating). A learner is $\alpha(T)$-multi-calibeating w.r.t.
loss $\ell$ if for any external forecasts $\{q^{(n)}_{1:T}\}_{n=1}^N$ and outcomes $y_{1:T}$, the learner's predictions $p_{1:T}$ satisfy
\[ L_T(p_{1:T}, y_{1:T}) \le R_T(q^{(n)}_{1:T}, y_{1:T}) + \alpha(T), \quad \forall n \in [N]. \tag{1} \]
We call $\alpha(T)$ the multi-calibeating rate. We say the learner is multi-calibeating if $\alpha(T) = o(T)$. When (1) holds in expectation over the learner's randomness, we call $\alpha(T)$ the expected multi-calibeating rate. When there is only $N = 1$ external forecast, we simply say calibeating.

We also introduce another performance measure called calibration.

Definition 2.5 (Calibration). A learner is $\beta(T)$-calibrated w.r.t. loss $\ell$ if for any outcome sequence $y_{1:T}$ (and any external forecasts), the learner's predictions $p_{1:T}$ satisfy
\[ K_T(p_{1:T}, y_{1:T}) \le \beta(T). \tag{2} \]
We call $\beta(T)$ the calibration rate and say the algorithm is calibrated if $\beta(T) = o(T)$. When (2) holds in expectation over the algorithm's randomness, we call $\beta(T)$ the expected calibration rate.

Note that calibeating and calibration are incomparable in general. Calibeating only guarantees that the learner's loss is no larger than the forecaster's loss minus the forecaster's calibration error, i.e., it competes with the external forecaster's refinement score. The learner's own calibration error need not vanish, even if the learner attains a low refinement term.

3 Calibeating = No-Regret Learning

This section considers the calibeating problem, i.e., when there is only one external forecast every round. Let $Q := \{q_t : t \in [T]\}$ denote the set of distinct external forecast values that appear over the horizon.¹ Foster and Hart [FH23] study the Brier and log losses and give algorithms with a calibeating rate of $O(|Q| \log T)$. We recover and extend their results by reductions to no-regret learning. First, define the regret of predictions $p_{1:T}$ under outcomes $y_{1:T}$ to be
\[ \mathrm{Reg}_T(p_{1:T}, y_{1:T}) := \sum_{t=1}^T \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t). \]
We say an algorithm has (expected) regret $\alpha(T)$ if $\mathrm{Reg}_T(p_{1:T}, y_{1:T}) \le \alpha(T)$ always holds (in expectation). The following theorem shows that calibeating reduces to no-regret learning.

Theorem 3.1. For any proper loss $\ell$ and any online algorithm $\mathcal{A}$ with regret $\alpha(T)$, where $\alpha$ is a concave function, Algorithm 1 is $|Q|\,\alpha(T/|Q|)$-calibeating.

¹ We also use $Q$ to denote the set of possible external forecast values for lower bound results.

Proof. The reduction partitions the rounds $t \in [T]$ by the external forecast value $q_t$, and runs an independent copy of the no-regret learner $\mathcal{A}$ for each forecast value. Formally, for any external forecast $q$ that appeared at least once, run a separate copy of $\mathcal{A}$, denoted $\mathcal{A}_q$, on the subset of rounds $I_q := \{t : q_t = q\}$. For each subsequence, we have
\[ \sum_{t : q_t = q} \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t : q_t = q} \ell(p, y_t) \le \alpha(n_T(q)). \]
Summing over all the subsequences and applying Definition 2.1, we have
\[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \le \sum_q \alpha(n_T(q)) \le |Q|\, \alpha\Big(\frac{\sum_q n_T(q)}{|Q|}\Big) = |Q|\, \alpha\Big(\frac{T}{|Q|}\Big), \]
where the second inequality is Jensen's inequality. This finishes the proof.

Algorithm 1: Calibeating by Bin-Wise No-Regret
Input: Online learner $\mathcal{A}$.
1: for $t = 1$ to $T$ do
     // prediction
2:   Observe the external forecast $q_t \in \Delta_K$.
3:   if $\mathcal{A}_{q_t}$ is uninitialized then
4:     Initialize a fresh copy $\mathcal{A}_{q_t} \leftarrow \mathcal{A}$.
5:   Query $\mathcal{A}_{q_t}$ and obtain prediction $p_t \in \Delta_K$.
     // update
6:   Observe outcome $y_t$ and incur loss $\ell(p_t, y_t)$.
7:   Update $\mathcal{A}_{q_t}$ with $y_t$.

We note that common regret bounds obtained for standard online algorithms are all concave in $T$; e.g., they are often of the form $O((\log T)^{\alpha} T^{\beta})$ for some $\alpha > 0$ and $\beta \in [0, 1)$. The algorithm of Foster and Hart [FH23] can be recovered as a special case of Theorem 3.1, with the online algorithm being follow-the-leader (FTL) (with a standard interior restriction for the log loss to avoid the unbounded boundary).
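Concretely, the reduction with FTL under the Brier loss can be sketched as follows. This is a minimal illustration, not the paper's implementation; the names `FTLBrier` and `calibeat` are ours, and we use the fact that under the squared (Brier) loss, the follow-the-leader prediction on a subsequence is the empirical mean of its outcomes.

```python
import numpy as np

class FTLBrier:
    """Follow-the-leader for the Brier loss: predict the running mean of past
    outcomes (the leader under squared loss). Starts from one uniform
    pseudo-observation so the first prediction is well defined."""
    def __init__(self, K):
        self.total = np.ones(K) / K
        self.n = 1
    def predict(self):
        return self.total / self.n
    def update(self, y):
        self.total += y
        self.n += 1

def calibeat(forecasts, outcomes, K, make_learner=FTLBrier):
    """Algorithm 1: one independent copy of the base learner per distinct
    external forecast value; predict, then update with the revealed outcome."""
    copies, preds = {}, []
    for q, y in zip(forecasts, outcomes):
        key = tuple(np.round(q, 12))          # bin rounds by forecast value
        if key not in copies:
            copies[key] = make_learner(K)     # fresh copy A_q
        preds.append(copies[key].predict())
        copies[key].update(y)
    return preds

e0, e1 = np.eye(2)
preds = calibeat([np.array([0.5, 0.5])] * 3, [e0, e1, e0], K=2)
```

Within each bin the learner's loss is at most the bin's best constant loss plus the FTL regret, which is exactly the per-subsequence inequality in the proof above.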
Moreover, Theorem 3.1 readily yields $O(\sqrt{|Q| K T})$ for bounded proper losses [LSS24] and $O(|Q| \log T)$ for mixable losses [HAK07], which encompass the Brier and log losses as special cases.

Corollary 3.2. For bounded proper losses $\ell$, instantiating the online learner in Theorem 3.1 with the follow-the-perturbed-leader algorithm of Luo, Senapati, and Sharan [LSS24] yields an algorithm with an expected calibeating rate of $O(\sqrt{|Q| K T})$.

We note that, inherited from the no-regret guarantees in Luo, Senapati, and Sharan [LSS24], the algorithm in Corollary 3.2 can actually achieve $O(\sqrt{|Q| K T})$ simultaneously for all bounded proper losses.

Definition 3.3. A convex function $\ell(\cdot)$ is $\eta$-mixable if for any probability distribution $\pi \in \Delta(\Delta_K)$, there exists a prediction $p_\pi \in \Delta_K$ such that $e^{-\eta \ell(p_\pi, y)} \ge \int e^{-\eta \ell(p, y)}\, \pi(\mathrm{d}p)$ holds for all $y \in \mathcal{E}$.

Corollary 3.4. For an $\eta$-mixable loss $\ell$ (e.g., the Brier and log losses), instantiating the online learner in Theorem 3.1 with exponentially weighted online optimization (EWOO) [HAK07] yields an algorithm with an expected calibeating rate of $O(|Q| \log T)$.

Besides the upper bound, we also prove that any lower bound for no-regret learning with a proper loss implies a lower bound for calibeating. Combined with Theorem 3.1, our results show that calibeating is minimax-equivalent to regret minimization. We defer the proof to Section A.1.

Theorem 3.5. For any proper loss $\ell$, denote the optimal regret bound as
\[ \beta(T) := \inf_{\mathcal{A}} \sup_{y_{1:T} \in \mathcal{E}^T} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t) \Big], \tag{3} \]
where $\mathcal{A}$ ranges over (possibly randomized) online algorithms. Then, every algorithm is at best $|Q|\, \beta(\lfloor T/|Q| \rfloor)$-calibeating.

Combining Theorem 3.5 with known regret lower bounds for bounded proper losses [LSS24] and for the Brier and log losses [CL06] yields the following.

Corollary 3.6. There exist bounded proper losses with calibeating rate at least $\Omega(\sqrt{|Q| K T})$.
Corollary 3.7. For the Brier and log losses, the calibeating rate is at least $\Omega(|Q| \log(T/|Q|))$.

4 Multi-Calibeating = Calibeating + Expert Problem

Next, we consider the multi-calibeating problem. Foster and Hart [FH23] obtain multi-calibeating rates of $O((N + |Q|) \log T)$ and $O(\sqrt{NT} + |Q| \log T)$, via Blackwell approachability and online linear regression. Lee et al. [Lee+22] achieve a rate logarithmic in $N$ but polynomial in $T$ (more precisely, $\tilde O(\sqrt{|Q|}\,(\log N)^{1/4}\, T^{3/4})$ for the optimal choice of parameters). This section presents a simple reduction from multi-calibeating to the expert problem. Via that reduction, we achieve the optimal multi-calibeating rates.

Expert problem. The interaction protocol in this problem is the same as in multi-calibeating: at each round $t$, the learner observes $N$ expert predictions $\{p^{(n)}_t\}_n \subseteq \Delta_K$, and makes its own prediction $p_t \in \Delta_K$. An experts algorithm $\mathcal{E}$ achieves regret $\gamma(T)$ if for every sequence $\{(p^{(1:N)}_t, y_t)\}_{t=1}^T$,
\[ \mathbb{E}\Big[ \sum_{t=1}^T \ell(p_t, y_t) \Big] \le \min_{n \in [N]} \sum_{t=1}^T \ell(p^{(n)}_t, y_t) + \gamma(T), \tag{4} \]
where the expectation is over the randomness of $\mathcal{E}$.

Comparing this definition of regret with the definition of the multi-calibeating rate, the only difference is that the latter remaps the experts'/forecasters' predictions optimally, while the former does not. The remapping of each individual forecaster is precisely the problem of calibeating. Hence, we run a separate calibeating algorithm for each forecaster, and use an experts algorithm to aggregate their decisions. See Algorithm 2 for a formal description.

Theorem 4.1. For any loss function $\ell$, any calibeating algorithm with rate $\alpha(T)$, and any experts algorithm with regret $\gamma(T)$, Algorithm 2 is $(\alpha(T) + \gamma(T))$-multi-calibeating.

Algorithm 2: Multi-Calibeating by Expert Aggregation
1: Sub-routines:
   • For each forecaster $n \in [N]$, a separate calibeating algorithm $\mathcal{A}^{(n)}$ (Algorithm 1).
   • An experts algorithm $\mathcal{E}$ (e.g., Hedge [FS97]).
for $t = 1$ to $T$ do
     // prediction
2:   Observe external forecasts $q^{(1)}_t, \ldots, q^{(N)}_t \in \Delta_K$.
3:   For each $n \in [N]$, query $\mathcal{A}^{(n)}$ with forecast $q^{(n)}_t$ to get its prediction $p^{(n)}_t \in \Delta_K$.
4:   Query $\mathcal{E}$ with $\{p^{(n)}_t\}_{n=1}^N$ as the experts' forecasts, and follow its prediction $p_t$.
     // update
5:   Observe outcome $y_t$ and update $\mathcal{A}^{(n)}$ for each $n \in [N]$ with this outcome.
6:   Update $\mathcal{E}$ with $\ell(p^{(n)}_t, y_t)$ as the loss of each expert $n \in [N]$.

Proof. By the regret bound of algorithm $\mathcal{E}$, for any $n \in [N]$, we have
\[ \mathbb{E}[L_T(p_{1:T}, y_{1:T})] \le L_T(p^{(n)}_{1:T}, y_{1:T}) + \gamma(T). \]
By the calibeating rate of algorithm $\mathcal{A}^{(n)}$, we have
\[ L_T(p^{(n)}_{1:T}, y_{1:T}) \le R_T(q^{(n)}_{1:T}, y_{1:T}) + \alpha(T). \]
Combining these inequalities yields a multi-calibeating rate of $\alpha(T) + \gamma(T)$.

Let $Q^{(n)} := \{q^{(n)}_t : t \in [T]\}$ denote the set of distinct external forecasts made by forecaster $n \in [N]$. We assume $|Q^{(n)}| = |Q|$ for all $n$ for simplicity.² By the regret bounds of Hedge (e.g., [Bub11, Theorems 2.1 and 2.2]) and the calibeating rates in Corollaries 3.2 and 3.4, we get the following corollaries. In contrast to the loss-oblivious property of Corollary 3.2, the algorithm in Corollary 4.2 requires a fixed $\ell$, as the experts algorithm relies on the loss values to update.

Corollary 4.2. For any bounded proper loss $\ell$, there exists an algorithm with an expected multi-calibeating rate of $O(\sqrt{T \log N} + \sqrt{|Q| K T})$.

Corollary 4.3. For any $\eta$-mixable loss $\ell$ (e.g., the Brier and log losses), there exists an algorithm with an expected multi-calibeating rate of $O(\log N + |Q| \log T)$.

We now show a minimax-equivalence of the two problems. Since external forecasts are involved, we consider calibeating rates and expert regrets as functions of both the time horizon and the number of possible distinct external forecast values.
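Before turning to lower bounds, the aggregation step of Algorithm 2 can be sketched in code. This is a minimal sketch under our own naming (`hedge_aggregate` is not from the paper): it plays the Hedge-weighted mixture of the candidate predictions, which is valid for losses convex in the prediction, such as the Brier loss; for general mixable losses one would instead use a mixability-based substitution step as in EWOO.

```python
import numpy as np

def hedge_aggregate(expert_preds, outcomes, loss, eta):
    """Aggregate the N candidate predictions (one per calibeating subroutine)
    with multiplicative weights, playing the weighted mixture each round."""
    T, N = len(outcomes), len(expert_preds[0])
    logw = np.zeros(N)                  # log-weights of the N experts
    final = []
    for t in range(T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        candidates = expert_preds[t]    # p_t^(n) for n = 1..N
        final.append(sum(w[n] * candidates[n] for n in range(N)))
        losses = np.array([loss(candidates[n], outcomes[t]) for n in range(N)])
        logw -= eta * losses            # Hedge update on observed losses
    return final

def brier(p, y):
    return float(np.sum((p - y) ** 2))

# Toy run: expert 0 is always right, expert 1 always wrong; the mixture
# shifts toward expert 0 exponentially fast.
e0, e1 = np.eye(2)
expert_preds = [[e0, e1] for _ in range(6)]
outcomes = [e0] * 6
final = hedge_aggregate(expert_preds, outcomes, brier, eta=1.0)
```

By convexity of the Brier loss, the mixture's loss is at most the weighted average of the candidates' losses, so the standard Hedge regret bound applies to the played predictions.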
Similarly to $Q^{(n)}$, given an instance of the expert problem, let $P^{(n)} := \{p^{(n)}_t : t \in [T]\}$ denote the set of possible expert predictions made by expert $n \in [N]$. We have the following theorem.

Theorem 4.4. For any proper loss $\ell$, suppose there exist functions $\phi, \lambda : \mathbb{Z}^2 \to \mathbb{R}$ such that for any $T$ and $m$,
\[ \inf_{\mathcal{A}} \sup_{(q_{1:T},\, y_{1:T}) :\, |Q| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] \ge \phi(T, m), \]
where $\mathcal{A}$ ranges over all randomized calibeating algorithms, and
\[ \inf_{\mathcal{E}} \sup_{(p^{(1:N)}_{1:T},\, y_{1:T}) :\, \forall n,\, |P^{(n)}| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{E}} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \min_{n \in [N]} \sum_{t=1}^T \ell(p^{(n)}_t, y_t) \Big] \ge \lambda(T, m), \]
where $\mathcal{E}$ ranges over all randomized expert algorithms. Then,
\[ \inf_{\mathcal{M}} \sup_{(q^{(1:N)}_{1:T},\, y_{1:T}) :\, \forall n,\, |Q^{(n)}| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{M}} \Big[ L_T(p_{1:T}, y_{1:T}) - \min_{n \in [N]} R_T(q^{(n)}_{1:T}, y_{1:T}) \Big] \ge \max\{\phi(T, m),\, \lambda(T, m)\}, \]
where $\mathcal{M}$ ranges over all multi-calibeating algorithms.

² Our results also hold when the $|Q^{(n)}|$'s differ across forecasters, and the resulting bounds adapt to the specific forecasters.

By known lower bounds for expert problems when the expert predictions can be arbitrary values [CL06], and the fact that the lower-bound examples can be obtained when the number of distinct expert prediction values is constant, we obtain the following lower bounds for multi-calibeating, matching the upper bounds.

Corollary 4.5. There exist bounded proper losses under which the multi-calibeating rate is at least $\Omega(\sqrt{T \log N} + \sqrt{|Q| K T})$.

Corollary 4.6. For the Brier and log losses, the multi-calibeating rate of any algorithm is at least $\Omega(\log N + |Q| \log(T/|Q|))$.

5 Calibeating and Calibration at the Same Time

In this section, we consider the problem of achieving simultaneous calibeating and calibration. Existing approaches focus on the Brier loss.
Foster and Hart [FH23] obtain simultaneous rates of $\tilde O(|Q|^{2/(K+1)}\, T^{(K-1)/(K+1)})$ via bin refinement and stochastic fixed-point methods, while Lee et al. [Lee+22] obtain $\tilde O(\sqrt{|Q|}\,(\log N)^{1/4}\, T^{3/4})$ in the binary case after parameter tuning. We focus on the Brier loss and provide new and improved bounds for simultaneous calibeating and calibration. Specifically, we provide a meta-algorithm (Algorithm 3) which, for any given external reference algorithm $\mathcal{A}^*$, keeps careful track of the losses of $\mathcal{A}^*$ while ensuring calibration.

Theorem 5.1. For the Brier loss, any $\varepsilon \in (0, 1)$, and any reference algorithm $\mathcal{A}^*$, Algorithm 3 simultaneously guarantees an expected regret of at most $O(\varepsilon^2 T)$ compared to $\mathcal{A}^*$, and a calibration rate of at most $O_{K, \log T}\big(\sqrt{T} + \frac{1}{\varepsilon^{K-1}} \log \frac{1}{\varepsilon} + \varepsilon^2 T\big)$ with high probability.

Here, the $O_{K, \log T}$ notation hides a factor polynomial in $K$ and $\log T$ for readability. We hide the $\log T$ factors because, for the calibration error, the dominant dependence on $T$ is polynomial, and we hide the $K$ factors because the bounds degenerate to the trivial $O(T)$ as $K$ grows. The bounds from previous works are also polynomial in $K$.

Let $\mathcal{A}^*$ be a multi-calibeating algorithm from the previous sections. With $\varepsilon = \sqrt{\log T / T}$, for $K = 2$, we obtain the optimal calibeating rate, improving the polynomial-in-$T$ dependence in Foster and Hart [FH23].

Corollary 5.2. For the Brier loss with binary outcomes, there is an algorithm with an expected multi-calibeating rate of at most $O(\log N + |Q| \log T)$, and a calibration rate of at most $O_{K, \log T}(\sqrt{T})$ with high probability.

With $\varepsilon = (\log T / T)^{1/(K+1)}$ for $K \ge 3$, we achieve the same calibration rate as in Fishelson et al. [Fis+25] and Foster and Hart [FH23], and drop the $|Q|$-dependence in Foster and Hart [FH23].

Corollary 5.3.
For the Brier loss and $K \ge 3$ outcomes, there is an algorithm with an expected multi-calibeating rate of at most $O_{K, \log T}(\log N + |Q| + T^{(K-1)/(K+1)})$, and a calibration rate of at most $O_{K, \log T}(T^{(K-1)/(K+1)})$ with high probability.

For $K \ge 3$ outcomes, we can also lower the multi-calibeating rate at the cost of raising the calibration rate, by choosing a different $\varepsilon$.

Corollary 5.4. For the Brier loss and $K \ge 3$ outcomes, for any $x \in \big(\frac{K-3}{K-1}, \frac{K-1}{K+1}\big]$, there is an algorithm with an expected multi-calibeating rate of at most $O_{K, \log T}(\log N + |Q| + T^x)$, and a calibration rate of at most $O_{K, \log T}\big(T^{(K-1)(1-x)/2}\big)$ with high probability.

5.1 Algorithm

Discretization and rounding. To achieve calibration, it is necessary to focus on a finite set of predictions via discretization. For that, we consider a triangulation of $\Delta_K$ and randomly round each prediction to a vertex of the triangulation (recall that $\ell$ is fixed to the Brier loss in this section).

Lemma 5.5 ([Fis+25]). For any $\varepsilon \in (0, 1)$, there is a subset of predictions $\mathcal{K}_\varepsilon \subset \Delta_K$ of size $M = |\mathcal{K}_\varepsilon| = O(\sqrt{K}\, \varepsilon^{-K+1})$, and a rounding scheme $H : \Delta_K \to \Delta(\mathcal{K}_\varepsilon)$ that maps an arbitrary prediction $q \in \Delta_K$ to a distribution over those in $\mathcal{K}_\varepsilon$, such that for any outcome $y \in \mathcal{E}$, we have $\mathbb{E}_{s \sim H(q)}[\ell(s, y)] \le \ell(q, y) + O(\varepsilon^2)$.

Blum–Mansour reduction. For the connection between calibration and no-swap-regret learning, we employ the well-known reduction by Blum and Mansour [BM07] with an $O(\log T)$-regret online learning algorithm for the Brier loss, e.g., FTL. We present the algorithm and its proof in Section C.2.

Lemma 5.6. There is an online algorithm $\mathcal{A}_{\mathrm{BM}}$ that, in each step $t \in [T]$, first predicts an $M \times M$ column-stochastic matrix $A_t$, and then observes outcome $y_t$ and a distribution $\pi_t \in \Delta(\mathcal{K}_\varepsilon)$, such that for any transformation $\sigma : \Delta_K \to \Delta_K$,
\[ \sum_{t \in [T]} \mathbb{E}_{p_t \sim A_t \pi_t}\, \ell(p_t, y_t) \le \sum_{t \in [T]} \mathbb{E}_{p'_t \sim \pi_t}\, \ell(\sigma(p'_t), y_t) + O\big(M \log T + \varepsilon^2 T\big). \]
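Downstream of Lemma 5.6, the key computational step is extracting the stationary distribution $\pi_t = C_t \pi_t$ of a column-stochastic remapping matrix (step 5 of Algorithm 3 below). A minimal sketch (our own code; `stationary` and the toy matrices are illustrative, not outputs of $\mathcal{A}_{\mathrm{BM}}$):

```python
import numpy as np

def stationary(C, iters=10_000, tol=1e-12):
    """Stationary distribution pi satisfying pi = C @ pi for a
    column-stochastic matrix C (each column sums to 1), via power iteration.
    Power iteration converges here because mixing in a full-support constant
    column makes C primitive; one could also solve the eigenproblem directly."""
    pi = np.ones(C.shape[0]) / C.shape[0]
    for _ in range(iters):
        new = C @ pi
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

# Mix a BM-style suggested remapping A with the constant remapping
# B = (b, b) toward a reference prediction b, as in Algorithm 3.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])            # column-stochastic remapping matrix
b = np.array([0.5, 0.5])
B = np.column_stack([b, b])
w = 0.5
C = w * A + (1 - w) * B               # aggregated remapping C_t
pi = stationary(C)                    # fixed point: pi = C @ pi
```

Sampling from this fixed point is what makes the played distribution already "remapped by $C_t$", which is exactly the identity used at the start of the analysis in Section 5.2.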
Intuitively, we may interpret the column-stochastic matrix $A_t$ from algorithm $\mathcal{A}_{\mathrm{BM}}$ as a suggested remapping of any prediction, so that for any sequence of outcomes $y_t$ and distributions of predictions $\pi_t$, the remapped/calibrated predictions are competitive against the best remapping $\sigma$ in hindsight. The standard approach is then to sample a randomized prediction from the stationary distribution of $A_t$ (but we will do this step later, after mixing $A_t$ with another remapping matrix).

Interpolating between calibration and multi-calibeating. Besides achieving a small swap regret and calibration rate, we also want to follow the reference prediction $b_t$ from algorithm $\mathcal{A}^*$ to be competitive against this reference algorithm. Observe that following the reference prediction corresponds to remapping every prediction to $b_t$, which can be captured by a remapping matrix $B_t = (b_t, b_t, \ldots, b_t) \in \mathbb{R}^{M \times M}$. To hedge between these two factors, we resort to a lopsided two-expert algorithm $\mathcal{A}_{\mathrm{lopsided}}$ to obtain a weight $w_t \in [0, 1]$, and take the linear combination $C_t = w_t A_t + (1 - w_t) B_t$ as the aggregated remapping.

Lemma 5.7 ([SNL14]). There is an algorithm $\mathcal{A}_{\mathrm{lopsided}}$ for the expert problem with two experts, such that the expected regret w.r.t. expert 1 is at most $O(\sqrt{T \log T})$, and the expected regret w.r.t. expert 2 is at most $O(1)$.

Finally, we sample a prediction from the stationary distribution of $C_t$, as shown in Algorithm 3.

Algorithm 3: Multi-Calibeating + Calibration
1: Sub-routines:
   • Discretization and rounding algorithm $H$ [see Fis+25].
   • Reference algorithm $\mathcal{A}^*$ (Algorithm 1 for calibeating, Algorithm 2 for multi-calibeating).
   • BM reduction $\mathcal{A}_{\mathrm{BM}}$ (Algorithm 3.1).
   • Lopsided two-expert algorithm $\mathcal{A}_{\mathrm{lopsided}}$ (Algorithm 3.2).
for $t = 1$ to $T$ do
     // prediction
2:   Algorithm $\mathcal{A}_{\mathrm{BM}}$ predicts $A_t$.
3:   Round algorithm $\mathcal{A}^*$'s prediction with $H$ to get $b_t \in \Delta(\mathcal{K}_\varepsilon)$, and let $B_t = (b_t, \ldots, b_t)$.
4   Algorithm $\mathcal{A}_{\mathrm{lopsided}}$ predicts $w_t \in [0, 1]$.
5   Let $C_t = w_t A_t + (1 - w_t) B_t$, and let $\pi_t \in \Delta_M$ be its stationary distribution, i.e., $\pi_t = C_t \pi_t$.
6   Predict $p_t \sim \pi_t$.
  // update
7   Observe outcome $y_t$.
8   Update $\mathcal{A}_{\mathrm{BM}}$ and $\mathcal{A}^*$ based on $y_t$ and $\pi_t$ (the latter applicable only to $\mathcal{A}_{\mathrm{BM}}$).
9   Update $\mathcal{A}_{\mathrm{lopsided}}$ with $\mathbb{E}_{z \sim A_t \pi_t} \ell(z, y_t)$ and $\mathbb{E}_{z \sim b_t} \ell(z, y_t)$ as the losses of experts 1 and 2.

5.2 Analysis: Proof of a weaker version of Theorem 5.1

Due to space constraints, we prove a weaker guarantee of pseudo-calibration and defer the rest of the proof to Section C.4. By definition, the expected cumulative loss of Algorithm 3 is
$$\mathbb{E}\, L_T(p_{1:T}, y_{1:T}) = \mathbb{E}_{\pi_{1:T}} \sum_{t \in [T]} \mathbb{E}_{z_t \sim \pi_t} \ell(z_t, y_t).$$
Since $\pi_t$ is the stationary distribution of $C_t = w_t A_t + (1 - w_t) B_t$, the above further equals
$$\sum_{t \in [T]} \mathbb{E}_{z_t \sim C_t \pi_t} \ell(z_t, y_t) = \sum_{t \in [T]} \Big( w_t\, \mathbb{E}_{z_t \sim A_t \pi_t} \ell(z_t, y_t) + (1 - w_t)\, \mathbb{E}_{z_t \sim B_t \pi_t} \ell(z_t, y_t) \Big) = \sum_{t \in [T]} \Big( w_t\, \mathbb{E}_{z_t \sim A_t \pi_t} \ell(z_t, y_t) + (1 - w_t)\, \mathbb{E}_{z_t \sim b_t} \ell(z_t, y_t) \Big).$$
By construction, $\mathbb{E}_{z_t \sim A_t \pi_t} \ell(z_t, y_t)$ and $\mathbb{E}_{z_t \sim b_t} \ell(z_t, y_t)$ are the losses of the two-expert problem in round $t \in [T]$. The lopsided regret bounds of $\mathcal{A}_{\mathrm{lopsided}}$ (Lemma 5.7) give
$$\mathbb{E}\, L_T(p_{1:T}, y_{1:T}) \le \mathbb{E}_{b_{1:T}} \sum_{t \in [T]} \mathbb{E}_{z_t \sim A_t \pi_t} \ell(z_t, y_t) + O(\sqrt{T \log T}), \quad (5)$$
$$\mathbb{E}\, L_T(p_{1:T}, y_{1:T}) \le \mathbb{E}_{b_{1:T}} \sum_{t \in [T]} \mathbb{E}_{z_t \sim b_t} \ell(z_t, y_t) + O(1). \quad (6)$$

Regret w.r.t. $\mathcal{A}^*$. Recall that $b_t$ is obtained by rounding the prediction from $\mathcal{A}^*$ with the rounding scheme $H$. By the $O(\varepsilon^2)$ rounding error bound of $H$ (Lemma 5.5) and Eq. (6), the regret w.r.t. the reference algorithm $\mathcal{A}^*$ is at most $O(\varepsilon^2 T)$.

Pseudo-calibration. We will prove the weaker guarantee that
$$\mathbb{E}\, L_T(p_{1:T}, y_{1:T}) \le \min_{\sigma : \Delta_K \to \Delta_K} \mathbb{E} \sum_{t=1}^T \ell(\sigma(p_t), y_t) + O\Big( \sqrt{T \log T} + \frac{\sqrt{K}}{\varepsilon^{K-1}} \log T + \varepsilon^2 T \Big).$$
This follows from Eq.
(5), the guarantee of the BM reduction (Lemma 5.6), and the bound $M = O(\sqrt{K}\,\varepsilon^{-K+1})$ (Lemma 5.5). It is weaker than the original statement in that the choice of $\sigma$ does not depend on the realization of the algorithm's randomness; by contrast, the original statement allows choosing $\sigma$ based on that realization. We defer this concentration argument to Section C.4.

6 Conclusion

We have revisited calibeating through the lens of online learning and developed a reduction-based framework that connects calibeating and its extensions to standard online-learning notions. This viewpoint enables us to recover and sharpen existing results, extend them to general proper scoring losses, and deliver new matching lower bounds, all in a unified and modular way. A natural direction for future work is to push this approach further, both to identify additional achievable guarantees and to improve online forecasting more broadly. Another open question is whether one can simultaneously achieve an $O(|Q| \log T)$ calibeating rate and $\tilde{O}(T^{1/3})$ calibration in the binary case, matching the best known bounds for each objective, and whether analogous simultaneous guarantees hold beyond the Brier loss. In fact, the meta-algorithm developed in Section 5 suggests a general recipe that may apply more broadly: whenever one can identify an appropriate discretization of the prediction space, a compatible rounding scheme with controlled loss guarantees, and a mechanism for upgrading pseudo-swap regret to true swap regret, the same framework should yield analogous extensions to other loss functions.

A Omitted Proofs in Section 3

A.1 Proof of Theorem 3.5

Theorem 3.5. For any proper loss $\ell$, denote the optimal regret bound as
$$\beta(T) := \inf_{\mathcal{A}} \sup_{y_{1:T} \in E^T} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t) \Big], \quad (3)$$
where $\mathcal{A}$ ranges over (possibly randomized) online algorithms. Then, every algorithm is at best $|Q|\,\beta(\lfloor T/|Q| \rfloor)$-calibeating.

Proof.
We prove a more general result: for any integers $T_q \ge T_0$ with $\sum_{q \in Q} T_q = T$, we have
$$\inf_{\mathcal{A}} \sup_{(q_{1:T}, y_{1:T})} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] \ge \sum_{q \in Q} \beta(T_q).$$
By Yao's minimax principle, (3) implies that, for any $T$, there exists a distribution $S_T \in \Delta(E^T)$ such that
$$\min_{D} \mathbb{E}_{y_{1:T} \sim S_T} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t=1}^T \ell(p, y_t) \Big] \ge \beta(T), \quad (7)$$
where the minimum is over deterministic online algorithms $D$. For each $q$ and the corresponding $T_q$, let $S_q = S_{T_q}$ be the distribution guaranteed by (7) at horizon $T_q$. Define the following distribution $S$ over pairs $(q_{1:T}, y_{1:T}) \in (Q \times E)^T$:

1. Choose disjoint index sets $\{\mathcal{I}_q\}_{q \in Q}$ with $\mathcal{I}_q \subseteq [T]$, $|\mathcal{I}_q| = T_q$, and $\cup_{q \in Q} \mathcal{I}_q = [T]$. Set $q_t = q$ for all $t \in \mathcal{I}_q$.
2. For each $q \in Q$, denote the subsequence of outcomes $y_t$ with $t \in \mathcal{I}_q$ by $y_{\mathcal{I}_q} = (y_t)_{t \in \mathcal{I}_q}$. Independently sample $y_{\mathcal{I}_q}$ according to $S_q$.

Then, by definition and linearity of expectation,
$$\min_D \mathbb{E}_{(q_{1:T}, y_{1:T}) \sim S} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] = \min_D \mathbb{E}_{(q_{1:T}, y_{1:T}) \sim S} \sum_{q \in Q} \Big( \sum_{t : q_t = q} \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t : q_t = q} \ell(p, y_t) \Big) = \min_D \sum_{q \in Q} \mathbb{E}_{y_{\mathcal{I}_q} \sim S_q} \Big[ \sum_{t : q_t = q} \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t : q_t = q} \ell(p, y_t) \Big] \ge \sum_{q \in Q} \min_D \mathbb{E}_{y_{\mathcal{I}_q} \sim S_q} \Big[ \sum_{t : q_t = q} \ell(p_t, y_t) - \min_{p \in \Delta_K} \sum_{t : q_t = q} \ell(p, y_t) \Big] \ge \sum_{q \in Q} \beta(T_q).$$
Therefore, since a randomized algorithm $\mathcal{A}$ is a probability distribution over deterministic algorithms, it holds that
$$\min_{\mathcal{A}} \sup_{(q_{1:T}, y_{1:T})} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] \ge \min_{\mathcal{A}} \mathbb{E}_{(q_{1:T}, y_{1:T}) \sim S} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] \ge \sum_{q \in Q} \beta(T_q).$$
Choosing balanced $T_q \in \{\lfloor T/|Q| \rfloor, \lceil T/|Q| \rceil\}$ completes the proof.

B Omitted Proofs in Section 4

B.1 Proof of Theorem 4.4

Theorem 4.4.
For any proper loss $\ell$, suppose there exist functions $\phi, \lambda : \mathbb{Z}^2 \to \mathbb{R}$ such that for any $T$ and $m$,
$$\inf_{\mathcal{A}} \sup_{(q_{1:T}, y_{1:T}) : \forall n, |Q^{(n)}| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{A}} \big[ L_T(p_{1:T}, y_{1:T}) - R_T(q_{1:T}, y_{1:T}) \big] \ge \phi(T, m),$$
where $\mathcal{A}$ ranges over all randomized calibeating algorithms, and
$$\inf_{\mathcal{E}} \sup_{(p^{(1:N)}_{1:T}, y_{1:T}) : \forall n, |P^{(n)}| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{E}} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \min_{n \in [N]} \sum_{t=1}^T \ell(p^{(n)}_t, y_t) \Big] \ge \lambda(T, m),$$
where $\mathcal{E}$ ranges over all randomized expert algorithms. Then,
$$\inf_{\mathcal{M}} \sup_{(q^{(1:N)}_{1:T}, y_{1:T}) : \forall n, |Q^{(n)}| \le m} \mathbb{E}_{p_{1:T} \sim \mathcal{M}} \Big[ L_T(p_{1:T}, y_{1:T}) - \min_{n \in [N]} R_T(q^{(n)}_{1:T}, y_{1:T}) \Big] \ge \max\{\phi(T, m), \lambda(T, m)\},$$
where $\mathcal{M}$ ranges over all multi-calibeating algorithms.

Proof. First, for any realization of expert predictions $\{p^{(n)}_t\}_{t \in [T], n \in [N]}$ and outcomes $y_{1:T}$, consider the multi-calibeating problem with the same outcome sequence and the external forecasts $q^{(n)}_t = p^{(n)}_t$ for all $t \in [T]$ and $n \in [N]$ (recall that the expert problem and multi-calibeating differ only in their benchmarks). By Definition 2.1,
$$R_T(q^{(n)}_{1:T}, y_{1:T}) = \sum_p \min_{u \in \Delta_K} \sum_{t \le T : q^{(n)}_t = p} \ell(u, y_t) \le \sum_p \sum_{t \le T : q^{(n)}_t = p} \ell(p, y_t) = \sum_{t=1}^T \ell(q^{(n)}_t, y_t),$$
where the inequality follows by choosing $u = p$ in each bin. Taking $\min_{n \in [N]}$ on both sides yields
$$\min_{n \in [N]} R_T(q^{(n)}_{1:T}, y_{1:T}) \le \min_{n \in [N]} \sum_{t=1}^T \ell(q^{(n)}_t, y_t).$$
Therefore, for any predictions $p_{1:T}$,
$$L_T(p_{1:T}, y_{1:T}) - \min_{n \in [N]} R_T(q^{(n)}_{1:T}, y_{1:T}) \ge \sum_{t=1}^T \ell(p_t, y_t) - \min_{n \in [N]} \sum_{t=1}^T \ell(q^{(n)}_t, y_t) = \sum_{t=1}^T \ell(p_t, y_t) - \min_{n \in [N]} \sum_{t=1}^T \ell(p^{(n)}_t, y_t),$$
and multi-calibeating inherits the lower bound of the expert problem. Second, consider the instances where $q^{(n)}_t = q_t$ for all $t \in [T]$ and $n \in [N]$, which is equivalent to beating only one external forecaster.
Therefore, multi-calibeating inherits the lower bound of the calibeating problem. This completes the proof.

C Omitted Proofs in Section 5

C.1 Proof of Corollary 5.4

Corollary 5.4. For the Brier loss and $K \ge 3$ outcomes, for any $x \in \big(\frac{K-3}{K-1}, \frac{K-1}{K+1}\big]$, there is an algorithm with an expected multi-calibeating rate of at most $O_{K, \log T}(\log N + |Q| + T^x)$ and, with high probability, a calibration rate of at most $O_{K, \log T}(T^{\frac{(K-1)(1-x)}{2}})$.

Proof. Let $\varepsilon^2 T = T^x$; then the calibeating rate is $O(|Q^{(n)}| \log T + T^x)$ and $\varepsilon = T^{\frac{x-1}{2}}$. The corresponding calibration error is of order $O(\sqrt{T} \log T + T^{\frac{(K-1)(1-x)}{2}} \log T + T^x)$, which always dominates the regret. Since the order of the calibration error is optimized when $x = \frac{(K-1)(1-x)}{2} = \frac{K-1}{K+1}$, we consider smaller values of $x$, which raise the $T^{\frac{(K-1)(1-x)}{2}}$ term. The stated trade-off follows by solving $T^{\frac{(K-1)(1-x)}{2}} < T$.

C.2 Algorithm $\mathcal{A}_{\mathrm{BM}}$ and the proof of Lemma 5.6

Lemma 5.6. There is an online algorithm $\mathcal{A}_{\mathrm{BM}}$ that, in each step $t \in [T]$, first predicts an $M \times M$ column-stochastic matrix $A_t$, and then observes an outcome $y_t$ and a distribution $\pi_t \in \Delta(\mathcal{K}_\varepsilon)$, such that for any transformation $\sigma : \Delta_K \to \Delta_K$,
$$\sum_{t \in [T]} \mathbb{E}_{p_t \sim A_t \pi_t} \ell(p_t, y_t) \le \sum_{t \in [T]} \mathbb{E}_{p'_t \sim \pi_t} \ell(\sigma(p'_t), y_t) + O\big(M \log T + \varepsilon^2 T\big).$$

Subroutine 3.1: $\mathcal{A}_{\mathrm{BM}}$
1 Sub-routines:
  • For each grid action $z_i$, a separate online learner $\mathcal{A}^{(i)}$ (e.g., FTL).
  • Discretization and rounding algorithm $H$ [see Fis+25].
for t = 1 to T do
  // prediction
2   for i = 1 to M do
3     Observe strategy $q^{(i)}_t \in \Delta_K$ from algorithm $\mathcal{A}^{(i)}$.
4     Round $q^{(i)}_t$ to $H(q^{(i)}_t) \in \Delta(\mathcal{K}_\varepsilon)$.
5   Output $A_t = \big(H(q^{(1)}_t), \ldots, H(q^{(M)}_t)\big) \in \mathbb{R}^{M \times M}$.
  // update
6   Receive feedback tuple $(y_t, \pi_t)$.
7   For each $i$, update algorithm $\mathcal{A}^{(i)}$ with loss function $\pi_t(i)\, \ell(\cdot, y_t)$.

Proof. Denote $\mathcal{K}_\varepsilon := \{z_1, \ldots, z_M\}$.
For any $j \in [M]$ and $\sigma(z_j) \in \Delta_K$, we have
$$\sum_t \pi_t(j) \sum_i A_t(i, j)\, \ell(z_i, y_t) - \sum_t \pi_t(j)\, \ell(\sigma(z_j), y_t) = \sum_t \pi_t(j)\, \mathbb{E}_{i \sim H(q^{(j)}_t)} \ell(z_i, y_t) - \sum_t \pi_t(j)\, \ell(\sigma(z_j), y_t) \overset{(a)}{\le} \sum_t \pi_t(j) \big( \ell(q^{(j)}_t, y_t) + C_2 \varepsilon^2 \big) - \sum_t \pi_t(j)\, \ell(\sigma(z_j), y_t) = \sum_t \pi_t(j)\, \ell(q^{(j)}_t, y_t) - \sum_t \pi_t(j)\, \ell(\sigma(z_j), y_t) + C_2 \varepsilon^2 \sum_t \pi_t(j) \overset{(b)}{\le} C_1 \log T + C_2 \varepsilon^2 \sum_t \pi_t(j),$$
for some positive constants $C_1, C_2 > 0$, where (a) follows from the definition of the rounding scheme and the upper bound on the rounding error (Lemma 5.5), and (b) holds because FTL has a regret of $O(\log T)$ under the Brier loss. Summing this inequality over all $j \in [M]$, we obtain
$$\sum_{t=1}^T \mathbb{E}_{i \sim A_t \pi_t} \ell(z_i, y_t) - \sum_{t=1}^T \mathbb{E}_{j \sim \pi_t} \ell(\sigma(z_j), y_t) = \sum_{t=1}^T \Big( \sum_{j=1}^M \pi_t(j) \sum_{i=1}^M A_t(i, j)\, \ell(z_i, y_t) - \sum_{j=1}^M \pi_t(j)\, \ell(\sigma(z_j), y_t) \Big) \le M C_1 \log T + C_2 \varepsilon^2 \sum_{t=1}^T \sum_{j=1}^M \pi_t(j) \le \max\{C_1, C_2\} \Big( \frac{\sqrt{K} \log T}{\varepsilon^{K-1}} + \varepsilon^2 T \Big),$$
where the last inequality uses $M = O\big(\frac{\sqrt{K}}{\varepsilon^{K-1}}\big)$ from Lemma 5.5.

C.3 Algorithm $\mathcal{A}_{\mathrm{lopsided}}$

Subroutine 3.2: $\mathcal{A}_{\mathrm{lopsided}}$ [SNL14]
Input: learning rate $\eta \in (0, \frac{1}{2}]$, initial weights $\{s_1, 1 - s_1\}$.
1 for t = 1 to T do
  // prediction
2   Output weight $w_t = \frac{s_t}{s_t + 1 - s_1}$.
  // update
3   Receive $\mathbb{E}_{z \sim A_t \pi_t} \ell(z, y_t)$ and $\mathbb{E}_{z \sim b_t} \ell(z, y_t)$ as the losses $g^{(1)}_t$ and $g^{(2)}_t$ of experts 1 and 2, respectively.
4   Compute $\delta_t = g^{(2)}_t - g^{(1)}_t$ and set $s_{t+1} = s_t \cdot (1 + \eta \delta_t)$.

C.4 Concentration arguments to finish the proof of Theorem 5.1

We finish the proof of Theorem 5.1 by upper-bounding the true calibration error using bounds on the pseudo-calibration error. For convenience, denote the pseudo-calibration error under an algorithm $\mathcal{A}$ by
$$\tilde{K}_T := \min_{\sigma : \Delta_K \to \Delta_K} \mathbb{E}_{p_t \sim \mathcal{A}} \Big[ \sum_{t=1}^T \ell(p_t, y_t) - \sum_{t=1}^T \ell(\sigma(p_t), y_t) \Big].$$
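The weight update of Subroutine 3.2 above fits in a few lines; here is a minimal sketch (the class name and default parameters are ours, assuming losses in $[0, 1]$):

```python
class LopsidedTwoExperts:
    """Sketch of Subroutine 3.2 (after Sani, Neu, and Lazaric [SNL14]).

    Expert 2's weight stays fixed at 1 - s1 while expert 1's weight s_t
    grows or shrinks multiplicatively, which is what yields the lopsided
    O(1) regret to expert 2.  Losses are assumed to lie in [0, 1]; eta
    and s1 are free parameters here.
    """

    def __init__(self, eta=0.1, s1=0.5):
        assert 0 < eta <= 0.5
        self.eta = eta
        self.s1 = s1
        self.s = s1  # s_t, initialized to s_1

    def weight(self):
        # w_t = s_t / (s_t + 1 - s_1): mass placed on expert 1
        return self.s / (self.s + 1 - self.s1)

    def update(self, loss1, loss2):
        # delta_t = g_t^(2) - g_t^(1): positive when expert 1 is better
        delta = loss2 - loss1
        self.s *= 1 + self.eta * delta  # s_{t+1} = s_t (1 + eta * delta_t)
```

In Algorithm 3, `loss1` is $\mathbb{E}_{z \sim A_t \pi_t} \ell(z, y_t)$ and `loss2` is $\mathbb{E}_{z \sim b_t} \ell(z, y_t)$, so $w_t$ drifts toward the calibration remapping $A_t$ only when it actually outperforms the reference prediction $b_t$.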
The following lemma is a multiclass extension of Theorem 3 of Luo, Senapati, and Sharan [LSS25], relating $\tilde{K}_T$ to the calibration error.

Lemma C.1. For the Brier loss, for a discretization of size $M$ and an algorithm $\mathcal{A}$ that always predicts on the discretization grid points, with probability at least $1 - \delta$ over the randomness in $\mathcal{A}$'s predictions $p_1, \ldots, p_T$, we have
$$K_T(p_{1:T}, y_{1:T}) \le 6 \tilde{K}_T + 96 K M \log \frac{4KM}{\delta}.$$
Therefore, together with the weaker version proved in Section 5.2 and $M = O\big(\frac{1}{\varepsilon^{K-1}}\big)$, we have
$$K_T(p_{1:T}, y_{1:T}) \le O\Big( \sqrt{T \log T} + \frac{1}{\varepsilon^{K-1}} \Big( K^{1/2} \log T + K^{3/2} \log \frac{4 K^{3/2}}{\varepsilon^{K-1}} \Big) + \varepsilon^2 T - \log \delta \Big) = O_{K, \log T}\Big( \sqrt{T} + \frac{1}{\varepsilon^{K-1}} \log \frac{1}{\varepsilon} + \varepsilon^2 T \Big),$$
with probability at least $1 - \delta$.

C.5 Proof of Lemma C.1

We first note the following closed forms of $K_T$ and $\tilde{K}_T$.

Fact C.2. For the Brier loss, the calibration error under prediction sequence $p_{1:T}$ and outcome sequence $y_{1:T}$ is
$$K_T = \sum_{p \in \Delta_K} \sum_{t=1}^T \mathbb{1}\{p_t = p\}\, \| p - \rho^p_T \|_2^2 = \sum_{p \in \Delta_K} \sum_{t=1}^T \mathbb{1}\{p_t = p\} \sum_{k=1}^K \big( p(k) - \rho^p_T(k) \big)^2, \quad \text{where } \rho^p_T = \frac{\sum_{t : p_t = p} y_t}{n_T(p)}.$$
The pseudo-calibration error under algorithm $\mathcal{A}$ and outcome sequence $y_{1:T}$ is
$$\tilde{K}_T = \sum_{t=1}^T \mathbb{E}_{p \sim P_t} \big[ \| p - \tilde{\rho}^p_T \|^2 \big] = \sum_{t=1}^T \sum_p P_t(p) \sum_{k=1}^K \big( p(k) - \tilde{\rho}^p_T(k) \big)^2, \quad \text{where } \tilde{\rho}^p_T := \frac{\sum_{t=1}^T y_t P_t(p)}{\sum_{t=1}^T P_t(p)}$$
and $P_t(p)$ is the probability of the randomized prediction under algorithm $\mathcal{A}$.

Denote the $i$-th grid point of the discretization by $z_i$, and abbreviate $\rho^{z_i}_T(k)$ as $\rho_i(k)$ for simplicity. For any $k \in [K]$, let
$$K_T(k) := \sum_{i \in [M]} \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \big( z_i(k) - \rho_i(k) \big)^2, \quad \text{and} \quad \tilde{K}_T(k) := \sum_{i \in [M]} \sum_{t=1}^T P_t(z_i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2.$$
We have the following lemma.

Lemma C.3. With probability at least $1 - \delta$,
$$K_T(k) \le 6 \tilde{K}_T(k) + 96 M \log \frac{4M}{\delta}.$$
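The closed form of $K_T$ in Fact C.2 translates directly into code; a minimal sketch (the function name and the tuple-based encoding of predictions and one-hot outcomes are ours):

```python
from collections import defaultdict

def brier_calibration_error(preds, outcomes):
    """Calibration error K_T from Fact C.2 under the Brier loss.

    preds: length-T list of predictions, each a length-K tuple in the simplex.
    outcomes: length-T list of one-hot outcome tuples y_t.
    """
    bins = defaultdict(list)  # group rounds by the predicted point p
    for p, y in zip(preds, outcomes):
        bins[p].append(y)
    total = 0.0
    for p, ys in bins.items():
        n = len(ys)  # n_T(p): number of rounds with prediction p
        # rho^p_T: empirical outcome frequency within the bin of p
        rho = [sum(y[k] for y in ys) / n for k in range(len(p))]
        total += n * sum((p[k] - rho[k]) ** 2 for k in range(len(p)))
    return total
```

A forecaster who predicts $(0.5, 0.5)$ on rounds that split evenly between the two outcomes incurs zero calibration error, whereas a confident wrong prediction is charged the full squared distance to its bin's empirical frequency.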
Then a union bound over $k \in [K]$ gives that, with probability at least $1 - \delta$,
$$K_T = \sum_{k=1}^K K_T(k) \le \sum_{k=1}^K \Big( 6 \tilde{K}_T(k) + 96 M \log \frac{4MK}{\delta} \Big) = 6 \tilde{K}_T + 96 M K \log \frac{4MK}{\delta}.$$

C.6 Proof of Lemma C.3

Lemma C.3. With probability at least $1 - \delta$,
$$K_T(k) \le 6 \tilde{K}_T(k) + 96 M \log \frac{4M}{\delta}.$$
While the proof of Lemma C.3 follows almost exactly the proof of Theorem 3 of Luo, Senapati, and Sharan [LSS25], we include it here for completeness.

Proof of Lemma C.3. The proof relies on the following version of Freedman's inequality.

Lemma C.4 ([Bey+11]). Let $\{X_i\}_{i=1}^n$ be a martingale difference sequence adapted to the filtration $\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_n$, where $|X_i| \le B$ for all $i \in [n]$ and $B$ is a fixed constant. Define $V := \sum_{i=1}^n \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}]$. Then, for any fixed $\mu \in (0, \frac{1}{B}]$ and $\delta \in [0, 1]$, with probability at least $1 - \delta$, we have
$$\sum_{i=1}^n X_i \le \mu V + \frac{\log \frac{2}{\delta}}{\mu}.$$

Fix $i \in [M]$ and define the martingale difference sequences $X_t := y_t(k)\,(P_t(i) - \mathbb{1}\{p_t = z_i\})$ and $Y_t := P_t(i) - \mathbb{1}\{p_t = z_i\}$. Observe that $|X_t| \le 1$ and $|Y_t| \le 1$ for all $t$. Fix $\mu_i \in [0, 1]$. Applying Lemma C.4 to each of the sequences $X_{1:T}$ and $Y_{1:T}$ and taking a union bound over the two, we obtain that with probability at least $1 - \delta$,
$$\sum_{t=1}^T y_t(k)\,(P_t(i) - \mathbb{1}\{p_t = z_i\}) \le \mu_i V_X + \frac{\log \frac{4}{\delta}}{\mu_i}, \qquad \sum_{t=1}^T \big( P_t(i) - \mathbb{1}\{p_t = z_i\} \big) \le \mu_i V_Y + \frac{\log \frac{4}{\delta}}{\mu_i}, \quad (8)$$
where $V_X$ and $V_Y$ satisfy
$$V_X = \sum_{t=1}^T \mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}] = \sum_{t=1}^T y_t(k)\, P_t(i)\,(1 - P_t(i)) \le \sum_{t=1}^T P_t(i), \quad \text{and} \quad V_Y = \sum_{t=1}^T \mathbb{E}[Y_t^2 \mid \mathcal{F}_{t-1}] = \sum_{t=1}^T P_t(i)\,(1 - P_t(i)) \le \sum_{t=1}^T P_t(i).$$
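As a quick sanity check on Lemma C.4, the following simulation (our own; a symmetric $\pm 1$ martingale difference sequence with $B = 1$, so $V = n$) estimates how often the bound is violated:

```python
import math
import random

def freedman_violation_rate(n=2000, trials=300, mu=0.1, delta=0.05, seed=0):
    """Empirically check Freedman's inequality (Lemma C.4) on the sequence
    X_i = +/-1 with probability 1/2 each, so that |X_i| <= B = 1 and
    E[X_i^2 | F_{i-1}] = 1, giving V = n.
    """
    rng = random.Random(seed)
    threshold = mu * n + math.log(2 / delta) / mu  # mu*V + log(2/delta)/mu
    violations = 0
    for _ in range(trials):
        total = sum(rng.choice((-1.0, 1.0)) for _ in range(n))
        if total > threshold:
            violations += 1
    return violations / trials  # should be (far) below delta
```

For these parameters the threshold is about $237$ while the sum of $2000$ signs has standard deviation about $45$, so violations are essentially never observed; the inequality is far from tight for such a benign sequence.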
The upper tail $\rho_i(k) - \tilde{\rho}_i(k)$ can then be bounded as follows:
$$\rho_i(k) - \tilde{\rho}_i(k) = \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\}} - \frac{\sum_{t=1}^T y_t(k)\, P_t(i)}{\sum_{t=1}^T P_t(i)} \overset{(a)}{\le} \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\}} + \frac{\mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i} - \sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T P_t(i)} = \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \sum_{t=1}^T P_t(i)} \Big( \sum_{t=1}^T P_t(i) - \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \Big) + \frac{\mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i}}{\sum_{t=1}^T P_t(i)} \overset{(b)}{\le} \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \sum_{t=1}^T P_t(i)} \Big( \mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i} \Big) + \frac{\mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i}}{\sum_{t=1}^T P_t(i)} \overset{(c)}{\le} 2\mu_i + \frac{2 \log(4/\delta)}{\mu_i \sum_{t=1}^T P_t(i)},$$
where (a) and (b) follow from (8), and (c) follows since $y_t(k)\, \mathbb{1}\{p_t = z_i\} \le \mathbb{1}\{p_t = z_i\}$. The lower tail can be bounded in exactly the same manner:
$$\tilde{\rho}_i(k) - \rho_i(k) = \frac{\sum_{t=1}^T y_t(k)\, P_t(i)}{\sum_{t=1}^T P_t(i)} - \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\}} \le \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\} + \mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i}}{\sum_{t=1}^T P_t(i)} - \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\}} = \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T P_t(i) \sum_{t=1}^T \mathbb{1}\{p_t = z_i\}} \Big( \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} - \sum_{t=1}^T P_t(i) \Big) + \frac{\mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i}}{\sum_{t=1}^T P_t(i)} \le \frac{\sum_{t=1}^T y_t(k)\, \mathbb{1}\{p_t = z_i\}}{\sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \sum_{t=1}^T P_t(i)} \Big( \mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i} \Big) + \frac{\mu_i \sum_{t=1}^T P_t(i) + \frac{\log(4/\delta)}{\mu_i}}{\sum_{t=1}^T P_t(i)} \le 2\mu_i + \frac{2 \log(4/\delta)}{\mu_i \sum_{t=1}^T P_t(i)}.$$
Combining both bounds, we have shown that for a fixed $\mu_i \in [0, 1]$,
$$| \rho_i(k) - \tilde{\rho}_i(k) | \le 2\mu_i + \frac{2 \log(4/\delta)}{\mu_i \sum_{t=1}^T P_t(i)}$$
holds with probability at least $1 - \delta$.
Taking a union bound over all $i \in [M]$, with probability $1 - \delta$ the following hold simultaneously for all $i$:
$$\sum_{t=1}^T y_t(k)\,(P_t(i) - \mathbb{1}\{p_t = z_i\}) \le \mu_i \sum_{t=1}^T P_t(i) + \frac{\log \frac{4M}{\delta}}{\mu_i}, \quad (9)$$
$$\sum_{t=1}^T \big( P_t(i) - \mathbb{1}\{p_t = z_i\} \big) \le \mu_i \sum_{t=1}^T P_t(i) + \frac{\log \frac{4M}{\delta}}{\mu_i}, \quad (10)$$
$$\rho_i(k) - \tilde{\rho}_i(k) \le 2\mu_i + \frac{2 \log \frac{4M}{\delta}}{\mu_i \sum_{t=1}^T P_t(i)}. \quad (11)$$
Consider the function $g(\mu) := \mu + \frac{a}{\mu}$, where $a \ge 0$ is a fixed constant. Clearly, $\min_{\mu \in [0, 1]} g(\mu) = 2\sqrt{a}$ when $a \le 1$, and $1 + a$ otherwise. Minimizing the bound in (11) with respect to $\mu_i$, we obtain
$$\rho_i(k) - \tilde{\rho}_i(k) \le \begin{cases} 4 \sqrt{\dfrac{\log \frac{4M}{\delta}}{\sum_{t=1}^T P_t(i)}}, & \text{when } \log \frac{4M}{\delta} \le \sum_{t=1}^T P_t(i), \\[2mm] 2 + \dfrac{2 \log \frac{4M}{\delta}}{\sum_{t=1}^T P_t(i)}, & \text{when } \log \frac{4M}{\delta} > \sum_{t=1}^T P_t(i). \end{cases}$$
Therefore, when $\sum_{t=1}^T P_t(i)$ is tiny, which is possible if algorithm $\mathcal{A}$ does not allocate enough probability mass to index $i$, the bound obtained is large, making it much worse than the trivial bound $\rho_i(k) - \tilde{\rho}_i(k) \le 1$, which holds since $\rho_i(k), \tilde{\rho}_i(k) \in [0, 1]$ by definition. Based on this reasoning, we define the set
$$\mathcal{I} := \Big\{ i \in [M] : \log \frac{4M}{\delta} \le \sum_{t=1}^T P_t(i) \Big\}, \quad (12)$$
and let $\bar{\mathcal{I}} := [M] \setminus \mathcal{I}$. We bound $(\rho_i(k) - \tilde{\rho}_i(k))^2$ as
$$\big( \rho_i(k) - \tilde{\rho}_i(k) \big)^2 \le \begin{cases} \dfrac{16 \log \frac{4M}{\delta}}{\sum_{t=1}^T P_t(i)}, & \text{if } i \in \mathcal{I}, \\ 1, & \text{otherwise.} \end{cases} \quad (13)$$
Similarly, substituting the optimal $\mu_i$ obtained above into (10) (and its counterpart with the roles of $P_t(i)$ and $\mathbb{1}\{p_t = z_i\}$ swapped), we can bound $\sum_{t=1}^T \mathbb{1}\{p_t = z_i\}$ as
$$\sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \le \begin{cases} \sum_{t=1}^T P_t(i) + 2 \sqrt{\log \frac{4M}{\delta} \sum_{t=1}^T P_t(i)}, & \text{if } i \in \mathcal{I}, \\[2mm] 2 \sum_{t=1}^T P_t(i) + \log \frac{4M}{\delta}, & \text{otherwise.} \end{cases} \quad (14)$$
Equipped with (13) and (14), we proceed to bound $K_T(k)$ in the following manner:
$$K_T(k) = \sum_{i \in [M]} \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \big( z_i(k) - \rho_i(k) \big)^2 \le 2 \sum_{i \in [M]} \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + \big( \rho_i(k) - \tilde{\rho}_i(k) \big)^2 \Big),$$
where the inequality holds because $(a + b)^2 \le 2a^2 + 2b^2$ for all $a, b \in \mathbb{R}$.
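The closed-form minimization of $g(\mu) = \mu + a/\mu$ over $[0, 1]$ used above can be verified numerically (a brute-force grid check of our own):

```python
import math

def g_min_closed_form(a):
    """min over mu in (0, 1] of g(mu) = mu + a/mu, as stated in the text:
    2*sqrt(a) when a <= 1 (attained at mu = sqrt(a)), and 1 + a otherwise
    (attained at the boundary mu = 1).
    """
    return 2 * math.sqrt(a) if a <= 1 else 1 + a

def g_min_brute(a, steps=100000):
    """Brute-force minimum of g over a fine grid of (0, 1]."""
    return min(k / steps + a * steps / k for k in range(1, steps + 1))
```

This is exactly the step that turns (11) into the two cases of (13), with $a = \log\frac{4M}{\delta} / \sum_{t=1}^T P_t(i)$: the case $a \le 1$ corresponds to $i \in \mathcal{I}$ and the boundary case to $i \in \bar{\mathcal{I}}$.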
To further bound the term above, we split the summation into two terms $T_1$ and $T_2$, defined as
$$T_1 := \sum_{i \in \mathcal{I}} \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + \big( \rho_i(k) - \tilde{\rho}_i(k) \big)^2 \Big), \qquad T_2 := \sum_{i \in \bar{\mathcal{I}}} \sum_{t=1}^T \mathbb{1}\{p_t = z_i\} \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + \big( \rho_i(k) - \tilde{\rho}_i(k) \big)^2 \Big),$$
and bound $T_1$ and $T_2$ individually. We bound $T_1$ as
$$T_1 \overset{(a)}{\le} \sum_{i \in \mathcal{I}} \Big( \sum_{t=1}^T P_t(i) + 2 \sqrt{\log \tfrac{4M}{\delta} \sum_{\tau=1}^T P_\tau(i)} \Big) \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + \frac{16 \log \frac{4M}{\delta}}{\sum_{\tau=1}^T P_\tau(i)} \Big) = \sum_{i \in \mathcal{I}} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 16 \log \tfrac{4M}{\delta}\, |\mathcal{I}| + 2 \sum_{i \in \mathcal{I}} \sqrt{\log \tfrac{4M}{\delta} \sum_{\tau=1}^T P_\tau(i)} \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + \frac{16 \log \frac{4M}{\delta}}{\sum_{\tau=1}^T P_\tau(i)} \Big) \overset{(b)}{\le} \sum_{i \in \mathcal{I}} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 16 \log \tfrac{4M}{\delta}\, |\mathcal{I}| + 2 \sum_{i \in \mathcal{I}} \Big( \sum_{\tau=1}^T P_\tau(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 16 \log \tfrac{4M}{\delta} \Big) = 3 \sum_{i \in \mathcal{I}} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 48 \log \tfrac{4M}{\delta}\, |\mathcal{I}|,$$
where (a) follows by substituting the bounds from (13) and (14), and (b) follows since, by the definition of $\mathcal{I}$ in (12), $\sqrt{\log \frac{4M}{\delta} \sum_{\tau=1}^T P_\tau(i)} \le \sum_{\tau=1}^T P_\tau(i)$. Next, we bound $T_2$ as
$$T_2 \overset{(a)}{\le} \sum_{i \in \bar{\mathcal{I}}} \Big( 2 \sum_{t=1}^T P_t(i) + \log \tfrac{4M}{\delta} \Big) \Big( \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 1 \Big) \overset{(b)}{\le} 2 \sum_{i \in \bar{\mathcal{I}}} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 2 \sum_{i \in \bar{\mathcal{I}}} \sum_{t=1}^T P_t(i) + 2 \log \tfrac{4M}{\delta}\, |\bar{\mathcal{I}}| \overset{(c)}{\le} 2 \sum_{i \in \bar{\mathcal{I}}} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 4 \log \tfrac{4M}{\delta}\, |\bar{\mathcal{I}}|,$$
where (a) follows by substituting the bounds from (13) and (14); (b) follows by bounding $(z_i(k) - \tilde{\rho}_i(k))^2 \le 1$; and (c) follows from the definition of $\mathcal{I}$ in (12). Collecting the bounds on $T_1$ and $T_2$, we obtain
$$T_1 + T_2 \le 3 \sum_{i \in [M]} \sum_{t=1}^T P_t(i) \big( z_i(k) - \tilde{\rho}_i(k) \big)^2 + 48 \log \tfrac{4M}{\delta}\, |\mathcal{I}| + 4 \log \tfrac{4M}{\delta}\, |\bar{\mathcal{I}}| \le 3 \tilde{K}_T(k) + 48 M \log \tfrac{4M}{\delta},$$
where the last inequality follows from the definition of $\tilde{K}_T(k)$ and since $|\mathcal{I}| + |\bar{\mathcal{I}}| = M$.
Since $K_T(k) \le 2(T_1 + T_2)$, we have shown that
$$K_T(k) \le 6 \tilde{K}_T(k) + 96 M \log \frac{4M}{\delta}$$
with probability at least $1 - \delta$. This completes the proof.

References

[Bey+11] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. "Contextual Bandit Algorithms with Supervised Learning Guarantees". In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. PMLR, 2011, pp. 19–26. url: https://proceedings.mlr.press/v15/beygelzimer11a.html.
[BM07] Avrim Blum and Yishay Mansour. "From External to Internal Regret". In: Journal of Machine Learning Research (2007), pp. 1307–1324. url: http://jmlr.org/papers/v8/blum07a.html.
[Brö09] Jochen Bröcker. "Reliability, sufficiency, and the decomposition of proper scores". In: Quarterly Journal of the Royal Meteorological Society (2009), pp. 1512–1519. url: https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.456.
[Bub11] Sébastien Bubeck. Introduction to Online Optimization. 2011. url: https://www.microsoft.com/en-us/research/publication/introduction-online-optimization/.
[CL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[Daw06] Philip Dawid. "Probability Forecasting". In: Encyclopedia of Statistical Sciences. Wiley and Sons, 2006.
[DMK24] Shachi Deshpande, Charles Marx, and Volodymyr Kuleshov. Calibrated Regression Against An Adversary Without Regret. 2024. url: https://arxiv.org/abs/2302.12196.
[FH23] Dean Foster and Sergiu Hart. ""Calibeating": beating forecasters at their own game". In: Theoretical Economics (2023), pp. 1441–1474. url: https://econtheory.org/ojs/index.php/te/article/view/20231441/0.
[Fis+25] Maxwell Fishelson, Robert Kleinberg, Princewill Okoroafor, Renato Paes Leme, Jon Schneider, and Yifeng Teng. "Full Swap Regret and Discretized Calibration". In: Proceedings of The 36th International Conference on Algorithmic Learning Theory. PMLR, 2025, pp. 444–480. url: https://proceedings.mlr.press/v272/fishelson25a.html.
[FS97] Yoav Freund and Robert E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". In: Journal of Computer and System Sciences (1997), pp. 119–139.
[FV98] Dean P. Foster and Rakesh V. Vohra. "Asymptotic calibration". In: Biometrika (1998), pp. 379–390. url: https://doi.org/10.1093/biomet/85.2.379.
[GR23] Chirag Gupta and Aaditya Ramdas. "Online Platt Scaling with Calibeating". In: Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023, pp. 12182–12204. url: https://proceedings.mlr.press/v202/gupta23c.html.
[Guo+17] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. "On Calibration of Modern Neural Networks". In: Proceedings of the 34th International Conference on Machine Learning. PMLR, 2017, pp. 1321–1330. url: https://proceedings.mlr.press/v70/guo17a.html.
[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. "Logarithmic regret algorithms for online convex optimization". In: Machine Learning (2007), pp. 169–192.
[KE17] Volodymyr Kuleshov and Stefano Ermon. "Estimating uncertainty online against an adversary". In: Proceedings of the AAAI Conference on Artificial Intelligence. 2017.
[Lee+22] Daniel Lee, Georgy Noarov, Mallesh Pai, and Aaron Roth. "Online Minimax Multiobjective Optimization: Multicalibeating and Other Applications". In: Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022, pp. 29051–29063. url: https://proceedings.neurips.cc/paper_files/paper/2022/file/ba942323c447c9bbb9d4b638eadefab9-Paper-Conference.pdf.
[LSS24] Haipeng Luo, Spandan Senapati, and Vatsal Sharan. "Optimal Multiclass U-Calibration Error and Beyond". In: Advances in Neural Information Processing Systems. Curran Associates, Inc., 2024, pp. 7521–7551. url: https://proceedings.neurips.cc/paper_files/paper/2024/file/0e4d695de4c606494ba9b0f3dac3b57a-Paper-Conference.pdf.
[LSS25] Haipeng Luo, Spandan Senapati, and Vatsal Sharan. Simultaneous Swap Regret Minimization via KL-Calibration. 2025. url: https://arxiv.org/abs/2502.16387.
[MKE25] Charles Marx, Volodymyr Kuleshov, and Stefano Ermon. Calibrated Probabilistic Forecasts for Arbitrary Sequences. 2025. url: https://arxiv.org/abs/2409.19157.
[OKS24] Princewill Okoroafor, Bobby Kleinberg, and Wen Sun. "Faster Recalibration of an Online Predictor via Approachability". In: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 4690–4698. url: https://proceedings.mlr.press/v238/okoroafor24a.html.
[San63] Frederick Sanders. "On Subjective Probability Forecasting". In: Journal of Applied Meteorology and Climatology (1963), pp. 191–201. url: https://journals.ametsoc.org/view/journals/apme/2/2/1520-0450_1963_002_0191_ospf_2_0_co_2.xml.
[SNL14] Amir Sani, Gergely Neu, and Alessandro Lazaric. "Exploiting easy data in online optimization". In: Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014. url: https://proceedings.neurips.cc/paper_files/paper/2014/file/8c39d9174128beb141866808bd154e-Paper.pdf.