Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives
Authors: S Akash, Pratik Gajane, Jawar Singh
S Akash¹, Pratik Gajane², and Jawar Singh¹
¹ Indian Institute of Technology Patna
² University of Orléans, France

Abstract

Multi-dueling bandits, where a learner selects m ≥ 2 arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose MetaDueling, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with Versatile-DB yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves O(√(KT)) pseudo-regret against adversarial preferences and the instance-optimal O(Σ_{i≠a⋆} log T/∆_i) pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose SA-MiDEX, a stochastic-and-adversarial algorithm that achieves O(K² log(KT) + K log² T + Σ_{i:∆_i^B>0} K log(KT)/(∆_i^B)²) regret in stochastic environments and O(K√(T log(KT)) + K^{1/3} T^{2/3} (log K)^{1/3}) regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting.
For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of K) and match the best-known results in the literature.

1 Introduction

Multi-armed bandits with preference feedback have emerged as a powerful framework for applications where absolute numerical rewards are difficult to obtain but relative comparisons are natural, such as information retrieval (Yue and Joachims, 2009) and recommendation systems. In the classical dueling bandits setting, a learner selects two arms per round and observes which is preferred. Recently, multi-dueling bandits (Saha and Gopalan, 2018; Brost et al., 2016) have extended this framework to settings where m ≥ 2 arms are compared simultaneously, with only the winner revealed (e.g., ranker evaluation based on user clicks).

A fundamental question in online learning is whether algorithms can achieve optimal performance in both stochastic and adversarial environments without prior knowledge of the regime. Such best-of-both-worlds guarantees are established for multi-armed bandits (Zimmert and Seldin, 2021) and dueling bandits (Saha and Gaillard, 2022), but remain unexplored for multi-dueling bandits. Existing work addresses these regimes separately: Saha and Gopalan (2018) achieve O(K log T/∆) regret in stochastic multi-dueling bandits under the Condorcet assumption, while Gajane (2024) achieves O((K log K)^{1/3} T^{2/3}) regret in adversarial settings using Borda scores.

1.1 Contributions

We provide the first best-of-both-worlds algorithms for multi-dueling bandits, addressing both Condorcet and Borda objectives:

1. Condorcet Setting: The MetaDueling Reduction. We propose MetaDueling (Algorithm 1), a black-box reduction from multi-dueling to dueling bandits.
By constructing multisets containing only two distinct arms, we extract unbiased pairwise comparison information from multi-way winner feedback; the bias from duplicate arms corresponds to an affine rescaling of the preference matrix that preserves the Condorcet winner and the gap ordering (Lemma C.1). Instantiating this reduction with Versatile-DB (Saha and Gaillard, 2022) yields the first best-of-both-worlds multi-dueling guarantee (Theorem 4.1): O(√(KT)) adversarial pseudo-regret and instance-optimal O(Σ_{i≠a⋆} log T/∆_i) stochastic pseudo-regret, independently of the subset size m.

2. Borda Setting: The SA-MiDEX Algorithm. We propose SA-MiDEX (Algorithm 2) for Borda winners. Since the Borda objective requires estimating performance against all opponents, SA-MiDEX initially assumes stochastic preferences and uses a successive elimination strategy to obtain unbiased estimates. It monitors for adversarial deviations via martingale concentration; upon detection, it irreversibly switches to adversarial mode. This yields O(K² log(KT) + K log² T + Σ_{i:∆_i^B>0} K log(KT)/(∆_i^B)²) stochastic regret and O(K√(T log(KT)) + K^{1/3} T^{2/3} (log K)^{1/3}) adversarial regret (Theorems 5.3 and 5.4).

3. Lower Bounds. We prove matching lower bounds of Ω(√(KT)) for adversarial Condorcet regret and Ω(Σ_{i≠a⋆} log T/∆_i) for stochastic Condorcet regret (Theorems 4.4 and 4.5), establishing the optimality of our Condorcet guarantees. We also establish lower bounds of Ω(K^{1/3} T^{2/3}) for adversarial regret and Ω(Σ_{i:∆_i^B>0} log T/(∆_i^B)²) for stochastic regret under Borda objectives.

1.2 Related work

Multi-dueling bandits (Brost et al., 2016) generalize the standard dueling framework (Yue and Joachims, 2009) to simultaneous m-way comparisons.
Under the challenging winner-only feedback model, Saha and Gopalan (2018) established gap-dependent O(K log T/∆) regret bounds for purely stochastic environments with Condorcet winners. More recently, Gajane (2024) studied the adversarial multi-dueling regime, achieving O((K log K)^{1/3} T^{2/3}) regret under the Borda objective. However, neither of these approaches can automatically adapt when the nature of the environment (stochastic versus adversarial) is unknown.

The quest for algorithms that adapt seamlessly to either regime, termed best-of-both-worlds, has seen significant breakthroughs via Tsallis entropy regularization (Zimmert and Seldin, 2019; 2021). Saha and Gaillard (2022) successfully extended this machinery to standard dueling bandits (m = 2), achieving optimal rates in both regimes simultaneously. Our MetaDueling reduction directly leverages their Versatile-DB algorithm to elevate these dual guarantees to the multi-dueling setting.

For the Borda setting, we instead draw upon the stochastic-and-adversarial paradigm introduced by Auer and Chiang (2016). Rather than relying on regularization, this approach actively monitors empirical observations for adversarial deviations, permanently switching from a stochastic successive elimination strategy to a robust adversarial one upon detection. A comprehensive discussion of the broader literature, including alternative choice models and preference structures, is deferred to Appendix A.

2 Preliminaries

We consider online learning over a finite set of K arms [K] := {1, ..., K} for T rounds. At each round t ∈ [T], the environment obliviously selects a preference matrix P_t ∈ [0,1]^{K×K} satisfying reciprocity (P_t(i,j) + P_t(j,i) = 1) and self-comparison (P_t(i,i) = 1/2) for all i, j ∈ [K]. The learner selects a multiset A_t ⊆ [K] of size |A_t| = m (2 ≤ m ≤ K) and observes a winning arm.
In the stochastic setting, preferences are stationary (P_t = P for all t); in the adversarial setting, the sequence {P_t}_{t=1}^T is arbitrary.

Choice Model. Following Saha and Gopalan (2018), we assume a pairwise-subset choice model. Given a multiset A and preference matrix P, the probability that the arm at index i ∈ [m] wins is determined by averaging pairwise preferences:

W(i | A, P) := Σ_{j≠i} 2 P(A(i), A(j)) / (m(m−1)).

Condorcet Setting. The Condorcet framework assumes a globally dominant arm.

Assumption 2.1 (Condorcet Winner and Gap). We assume there exists a Condorcet winner a⋆ ∈ [K] such that P_t(a⋆, i) ≥ 1/2 for all i ∈ [K] and t ∈ [T]. We define the sub-optimality gap of arm i at round t as ∆_t(i) := P_t(a⋆, i) − 1/2. In the stochastic setting, we write ∆(i) := ∆_t(i) and let ∆_min := min_{i≠a⋆} ∆(i) > 0.

The cumulative Condorcet pseudo-regret measures the average gap of the selected arms: R_T^C := Σ_{t=1}^T (1/m) Σ_{j=1}^m ∆_t(A_t(j)). Note that standard dueling bandit regret is simply the special case m = 2.

Borda Setting. Alternatively, optimality can be measured by average pairwise performance. The Borda score of arm i at round t is the probability that it beats an opponent selected uniformly at random: b_t(i) := (1/(K−1)) Σ_{j≠i} P_t(i, j). Unlike a Condorcet winner, which is not guaranteed to exist (for instance, no Condorcet winner exists in the MSLR-WEB10K dataset; Qin et al., 2010), a Borda winner always exists. Moreover, when the Condorcet and Borda winners differ, the Borda winner can better reflect the underlying preferences and is more robust to estimation errors (Jamieson et al., 2015b). For the relationship between the Condorcet and Borda settings, see Appendix B.
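To make the choice model and the Borda score concrete, here is a minimal sketch (illustrative function names and a hypothetical 3-arm cyclic matrix, not from the paper). The example has no Condorcet winner, yet a Borda winner exists:

```python
import numpy as np

def win_prob(A, P):
    """Winner probabilities under the pairwise-subset choice model:
    W(i | A, P) = sum_{j != i} 2 P(A[i], A[j]) / (m (m - 1))."""
    m = len(A)
    w = np.zeros(m)
    for i in range(m):
        w[i] = sum(2.0 * P[A[i], A[j]] for j in range(m) if j != i) / (m * (m - 1))
    return w

def borda(P):
    """Borda scores b(i) = (1/(K-1)) sum_{j != i} P(i, j)."""
    K = P.shape[0]
    return (P.sum(axis=1) - 0.5) / (K - 1)  # subtract the diagonal P(i,i) = 1/2

# Cyclic instance: arm 0 beats 1, arm 1 beats 2, arm 2 beats 0,
# so no Condorcet winner exists, but arm 0 is the Borda winner.
P = np.array([[0.5, 0.9, 0.3],
              [0.1, 0.5, 0.8],
              [0.7, 0.2, 0.5]])
w = win_prob([0, 1, 2], P)
assert abs(w.sum() - 1.0) < 1e-9  # winner probabilities form a distribution
```

Note that the winner probabilities always sum to one because every unordered pair contributes its two reciprocal preferences exactly once.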
3 The MetaDueling Reduction

The transition from pairwise to multi-way preferences introduces a significant structural challenge: most multi-dueling algorithms are designed for specific, narrow environments. This lack of modularity often prevents the direct application of well-optimized dueling bandit algorithms to multi-player settings, requiring new, complex strategies for each new scenario.

We present MetaDueling, a black-box reduction that decouples the multi-way feedback mechanism from the core learning strategy. Our approach constructs the multiset A_t by populating its m available slots with only two distinct arms, i and j. By isolating these two arms, we collapse the multi-player interaction into a controlled pairwise signal, allowing multi-way winner probabilities to be mapped directly back to pairwise preferences without needing to model the internal dynamics of the selection process.

The primary advantage of MetaDueling is its inherent modularity. By treating the multi-dueling environment as a black-box interface, our framework can be seamlessly integrated with any standard dueling bandit learner, including stochastic, adversarial, or contextual variants. This plug-and-play architecture ensures that MetaDueling inherits the regret guarantees of the base learner while remaining extensible to any preference learning setting where multi-way feedback is available.

3.1 Algorithm Description

The reduction operates as follows: at each round, the base dueling bandit algorithm proposes two arms (x_t, y_t). We construct a multiset containing only these two arms, with counts (n_x, n_y) summing to m. When m is even, we set n_x = n_y = m/2. When m is odd, we randomize: with probability 1/2 we set (n_x, n_y) = (⌈m/2⌉, ⌊m/2⌋), and with probability 1/2 we swap. This symmetrization ensures unbiased feedback in expectation.
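The construction above can be sketched in a few lines (hypothetical helper name; a sketch of the symmetrization step, not the authors' implementation):

```python
import random

def build_multiset(x, y, m, rng=random):
    """Fill the m slots with only the two proposed arms x and y.
    Even m: exact half split. Odd m: a fair coin decides which arm
    receives the extra copy, so the construction is symmetric in
    expectation (the symmetrization step of the reduction)."""
    if m % 2 == 0:
        n_x = n_y = m // 2
    elif rng.random() < 0.5:
        n_x, n_y = (m + 1) // 2, m // 2
    else:
        n_x, n_y = m // 2, (m + 1) // 2
    return [x] * n_x + [y] * n_y
```

The returned multiset is then played in the environment, and the binary outcome "did an x-copy win?" is fed back to the base dueling bandit algorithm.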
3.2 Rescaled Preferences

The multiset construction introduces bias from ties between identical arms: when multiple copies of x_t compete, some probability mass goes to x_t-vs-x_t comparisons (which x_t wins with probability 1/2) rather than x_t-vs-y_t comparisons. We show that this bias corresponds exactly to an affine rescaling of the preference matrix.

Algorithm 1 MetaDueling: Reduction from Multi-Dueling to Dueling Bandits
Require: Base dueling bandit algorithm D, horizon T, arms K, subset size m
1: Initialize base algorithm D (e.g., Versatile-DB)
2: for t = 1, 2, ..., T do
3:   // Query base algorithm for arm pair
4:   Let (q_t^(1), q_t^(2)) be the distributions maintained by D
5:   Sample x_t ∼ q_t^(1) and y_t ∼ q_t^(2) independently
6:   // Construct multiset with symmetrization
7:   Sample B_t ∼ Bernoulli(1/2)
8:   if m is even then
9:     (n_x, n_y) ← (m/2, m/2)
10:  else if B_t = 1 then
11:    (n_x, n_y) ← (⌈m/2⌉, ⌊m/2⌋)
12:  else
13:    (n_x, n_y) ← (⌊m/2⌋, ⌈m/2⌉)
14:  end if
15:  Construct A_t = {x_t, ..., x_t (n_x copies), y_t, ..., y_t (n_y copies)}
16:  // Play and observe feedback
17:  Play A_t and observe winner index I_t ∈ [m]
18:  Set o_t ← 1[A_t(I_t) = x_t]
19:  // Feed outcome to base algorithm
20:  Update D with outcome o_t for pair (x_t, y_t)
21: end for

Definition 3.1 (Rescaling Constants). Define the rescaling constants:

β_m := m/(2(m−1)) if m is even, (m+1)/(2m) if m is odd;  α_m := (1 − β_m)/2.  (1)

Definition 3.2 (Rescaled Preference Matrix). The rescaled preference matrix is:

P̂_t(i, j) := α_m + β_m P_t(i, j).  (2)

The rescaled preference matrix P̂_t is a valid preference matrix and preserves the Condorcet winner (see Appendix C for the proof).

The following lemma shows that the observed outcome o_t is an unbiased estimate of the rescaled preference, not the original preference.
Lemma 3.3 (Unbiased Feedback). Under Algorithm 1, the observed outcome satisfies:

E[o_t | x_t, y_t] = P̂_t(x_t, y_t) = α_m + β_m P_t(x_t, y_t).

Proof. See Appendix D.1.

Remark 3.4 (Implicit Rescaling). The raw outcome o_t ∈ {0, 1} is not an unbiased estimate of P_t(x_t, y_t) due to the tie bias. However, o_t is an unbiased estimate of P̂_t(x_t, y_t) ∈ [0, 1]. Since the rescaled preferences preserve all structural properties needed by the base algorithm (Lemma C.1), we can feed o_t directly to D without explicit transformation.

3.3 Gap Rescaling

The sub-optimality gaps are also rescaled by the factor β_m.

Lemma 3.5 (Gap Rescaling). For any arm i ∈ [K], the rescaled gap satisfies:

∆̂_t(i) := P̂_t(a⋆, i) − 1/2 = β_m ∆_t(i).

Lemma 3.6 (Bound on Rescaling Factor). For all m ≥ 2, the rescaling factor satisfies β_m ≥ 1/2, with equality in the limit m → ∞.

For the proofs of Lemma 3.5 and Lemma 3.6, see Appendix D.2.

3.4 Regret Equivalence

The key result of this section establishes that the multi-dueling regret equals the dueling regret in rescaled preferences, up to the factor 1/β_m ≤ 2.

Lemma 3.7 (Regret Equivalence). For any sequence of preference matrices P_1, ..., P_T satisfying Assumption 2.1:

E[R_T^C] = (1/β_m) E[R̂_T^DB],  (3)

where R̂_T^DB := Σ_{t=1}^T (1/2)(∆̂_t(x_t) + ∆̂_t(y_t)) is the pseudo-regret of the base algorithm under P̂_t.

For the proof of Lemma 3.7, see Appendix D.3.

Remark 3.8 (Black-Box Nature of the Reduction). We only require that the base algorithm D accepts binary outcomes o_t ∈ {0, 1} and achieves bounded pseudo-regret on valid preference matrices. No modification to D's internals is needed. This modularity means that future improvements to dueling bandit algorithms automatically transfer to multi-dueling bandits via our reduction.

Remark 3.9 (Special Case: m = 2).
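Lemma 3.3 can be checked mechanically: under the choice model, an x_t-copy wins mass 1/2 against each other x_t-copy and mass P(x_t, y_t) against each y_t-copy. The following exact computation (a sanity-check sketch with hypothetical helper names, not the appendix proof) confirms E[o_t] = α_m + β_m·p for both parities of m, as well as Lemma 3.6:

```python
from fractions import Fraction

def beta(m):
    return Fraction(m, 2 * (m - 1)) if m % 2 == 0 else Fraction(m + 1, 2 * m)

def alpha(m):
    return (1 - beta(m)) / 2

def win_prob_x(m, p):
    """Exact P(o_t = 1) under the symmetrized construction: sum the
    choice-model mass of x-vs-x ties (won w.p. 1/2 each) and x-vs-y
    duels (won w.p. p), averaging over the odd-m coin flip."""
    splits = ([(m // 2, m // 2)] if m % 2 == 0
              else [((m + 1) // 2, m // 2), (m // 2, (m + 1) // 2)])
    total = Fraction(0)
    for n_x, n_y in splits:
        total += Fraction(n_x * (n_x - 1), m * (m - 1)) \
               + Fraction(2 * n_x * n_y, m * (m - 1)) * p
    return total / len(splits)

p = Fraction(7, 10)
for m in range(2, 12):
    assert win_prob_x(m, p) == alpha(m) + beta(m) * p  # Lemma 3.3
    assert beta(m) >= Fraction(1, 2)                   # Lemma 3.6
```

Exact rational arithmetic makes the check an identity rather than a floating-point approximation.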
For m = 2, we have α_2 = 0 and β_2 = 1, so P̂_t = P_t. The multiset A_t = {x_t, y_t} is a standard pair, and Algorithm 1 reduces exactly to the base algorithm D.

4 Main Results: Condorcet Setting

While previous literature has addressed stochastic and adversarial multi-dueling environments in isolation, we demonstrate that feeding our rescaled pairwise signals into a best-of-both-worlds base algorithm seamlessly bridges this divide. This approach eliminates the need to manually adapt to environmental shifts, as Algorithm 1 inherits the robust properties of the underlying learner.

By instantiating MetaDueling with Versatile-DB (Saha and Gaillard, 2022), we obtain the first unified guarantees for multi-dueling bandits: optimal adversarial pseudo-regret and instance-optimal stochastic pseudo-regret, achieved simultaneously and without prior knowledge of the regime. This reduction provides a rigorous theoretical foundation for multi-way feedback that remains valid across settings.

The strength of Theorem 4.1 lies in the rigorous mapping between multi-way winner probabilities and the rescaled pairwise preference matrix P̂_t. By proving that the symmetrized multiset construction in MetaDueling yields an unbiased pairwise signal, we ensure that the base learner D operates on a valid dueling bandit instance. The core of the proof establishes an exact linear regret equivalence with factor 1/β_m, which allows the O(√(KT)) adversarial and O(Σ_{i≠a⋆} log T/∆_i) stochastic guarantees of Versatile-DB to be transferred directly to the multi-dueling setting. This transformation shows that the complexity of the m-way interaction is captured by a universal constant factor (1/β_m ≤ 2), preserving the optimality of the underlying dueling bandit algorithm regardless of the number of arms compared simultaneously.

4.1 Best-of-Both-Worlds Guarantee

Theorem 4.1 (Main Result: Best-of-Both-Worlds for Condorcet Multi-Dueling).
Under Assumption 2.1, MetaDueling (Algorithm 1) instantiated with Versatile-DB satisfies the following simultaneously for any T ≥ 1:

(i) Adversarial pseudo-regret bound. For any oblivious sequence of preference matrices P_1, ..., P_T:

E[R_T^C] ≤ (1/β_m)(4√(KT) + 1) ≤ 8√(KT) + 2.  (4)

(ii) Bound for self-bounding preference matrices. If the preference matrices satisfy the self-bounding condition

E[R̂_T^DB] ≥ (1/2) E[ Σ_{t=1}^T Σ_{i≠a⋆} (p_{+1,t}(i) + p_{−1,t}(i)) ∆̂(i) ],  (5)

where p_{+1,t} and p_{−1,t} are the arm distributions of the two players in Versatile-DB, then:

E[R_T^C] ≤ (1/β_m²) Σ_{i≠a⋆} (16 log T + 48)/∆_i + (4 log T + 12)/β_m + 8√K/β_m.  (6)

Both bounds hold simultaneously without prior knowledge of the regime.

The proof of Theorem 4.1 leverages the insight that the algorithm's observed outcomes can be rescaled to form valid dueling bandit feedback, allowing the application of existing guarantees for the Versatile-DB algorithm. In the adversarial case, this directly yields sublinear pseudo-regret bounds, while in the self-bounding setting, the rescaling of gaps enables the translation of known gap-dependent guarantees to the Condorcet setting. This approach simultaneously harnesses prior work on dueling bandits and our novel rescaling technique, producing the first best-of-both-worlds bounds for multi-dueling bandits under Condorcet objectives. For the complete proof of Theorem 4.1, see Appendix D.4.

4.2 Stochastic Condorcet Setting

In the stochastic setting with a fixed Condorcet winner, the self-bounding condition (5) is automatically satisfied, yielding instance-optimal regret.

Corollary 4.2 (Instance-Optimal Stochastic Regret). In the stochastic setting with a fixed preference matrix P_t = P for all t ∈ [T] and Condorcet winner a⋆, the self-bounding condition (5) holds with equality.
Consequently:

E[R_T^C] ≤ Σ_{i≠a⋆} (64 log T + 192)/∆_i + 8 log T + 24 + 16√K = O( Σ_{i≠a⋆} log T/∆_i ).  (7)

Proof. In the stochastic setting, ∆_t(i) = ∆(i) for all t ∈ [T], and the Condorcet winner a⋆ is fixed. The self-bounding condition holds because pulling suboptimal arms contributes to the regret in direct proportion to their gaps. The bound follows from Theorem 4.1(ii) with 1/β_m ≤ 2 and 1/β_m² ≤ 4.

Remark 4.3 (Comparison with Prior Work). Saha and Gopalan (2018) achieved O(K log T/∆_min) regret for stochastic multi-dueling bandits with Condorcet winners. Our bound O(Σ_{i≠a⋆} log T/∆_i) is instance-optimal: it depends on the individual gaps rather than just the minimum gap, providing tighter guarantees when the gaps are heterogeneous. Moreover, our algorithm simultaneously achieves O(√(KT)) adversarial regret, which Saha and Gopalan (2018) do not address.

4.3 Lower Bounds

We establish matching lower bounds that demonstrate the optimality of our guarantees.

Theorem 4.4 (Adversarial Lower Bound). For any multi-dueling bandit algorithm and any T ≥ K ≥ 4, there exists an adversarial instance satisfying Assumption 2.1 such that:

E[R_T^C] ≥ Ω(√(KT)).  (8)

Proof Sketch. The proof proceeds via a reduction from dueling bandits. First, any multi-dueling bandit algorithm is converted into a standard dueling bandit algorithm by randomly selecting a pair of arms from the chosen subset and using the observed duel outcome to simulate multi-dueling feedback. This ensures that the expected regret of the constructed dueling algorithm matches that of the original multi-dueling algorithm. Next, the standard minimax lower bound for adversarial multi-armed bandits is applied, and via known reductions, this yields a hard adversarial instance with a best arm in hindsight.
Finally, the reduction shows that if a multi-dueling algorithm could achieve lower regret, it would violate the dueling bandit lower bound, establishing the desired Ω(√(KT)) bound for the multi-dueling setting.

Theorem 4.5 (Stochastic Lower Bound). For any multi-dueling bandit algorithm, there exists a stochastic instance with Condorcet winner a⋆ such that:

E[R_T^C] ≥ Ω( Σ_{i≠a⋆} log T/∆_i ).  (9)

Proof Sketch. The proof follows the same reduction structure as Theorem 4.4, but applies the stochastic dueling bandit lower bound of Komiyama et al. (2015):

lim inf_{T→∞} E[R_T^DB]/log T ≥ Σ_{i≠a⋆} ∆_i / (2 KL(P(a⋆, i) ∥ 1/2)).

For gaps ∆_i = P(a⋆, i) − 1/2, a Taylor expansion gives KL(P(a⋆, i) ∥ 1/2) = Θ(∆_i²), yielding the Ω(Σ_{i≠a⋆} log T/∆_i) lower bound. See Appendix D.5 for the complete proof.

Remark 4.6 (Optimality). The matching upper and lower bounds establish that our algorithm achieves optimal pseudo-regret in both adversarial and stochastic settings. The adversarial bound Θ(√(KT)) matches the minimax rate for multi-armed bandits, and the stochastic bound Θ(Σ_{i≠a⋆} log T/∆_i) matches the instance-optimal rate for dueling bandits.

Remark 4.7 (Pseudo-Regret vs. High-Probability Bounds). Our bounds are in expectation (pseudo-regret). As shown in Auer and Chiang (2016), no algorithm can simultaneously achieve optimal high-probability bounds in both stochastic and adversarial regimes. Our pseudo-regret formulation is therefore the natural target for best-of-both-worlds guarantees.

5 Borda Setting: The SA-MiDEX Algorithm

We now turn to multi-dueling bandits with Borda objectives. Unlike the Condorcet setting, the Borda objective requires estimating each arm's average performance against all opponents, not just the specific arm it was compared against.
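The Taylor step in the sketch above is easy to verify numerically: for Bernoulli distributions, KL(1/2 + ∆ ∥ 1/2) = 2∆² + O(∆⁴), so each lower-bound term of the form ∆ · log T / KL(·) scales as log T/∆. A quick check (illustrative gap values only):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# KL(1/2 + gap || 1/2) = 2*gap^2 + O(gap^4), hence gap / KL = Theta(1/gap).
for gap in [0.01, 0.02, 0.05]:
    kl = kl_bernoulli(0.5 + gap, 0.5)
    assert abs(kl / (2 * gap ** 2) - 1) < 0.01
```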
We propose SA-MiDEX (Stochastic-and-Adversarial Multi-Dueling with EXploration), an algorithm based on the stochastic-and-adversarial paradigm of Auer and Chiang (2016). The algorithm initially assumes a stochastic environment, monitors for adversarial deviations, and switches to a robust adversarial strategy upon detection.

The Borda setting introduces unique challenges: estimating each arm's Borda score requires comparisons against all other arms, no Condorcet winner is guaranteed (so preferences may be cyclic or change over time), and multi-dueling feedback provides limited information for importance-weighted estimation. To address these, the approach combines decaying uniform opponent sampling with probability p_t = min(1, K log t / t) to maintain unbiased Borda estimates, a two-phase exploration-exploitation scheme, martingale-based deviation detection to identify adversarial shifts, and an EXP3 fallback to ensure robust adversarial regret.

5.1 Algorithm Description

The SA-MiDEX algorithm operates in two modes with an irreversible mode switch, leveraging the MetaDueling reduction (Algorithm 1) as a black box to convert multi-dueling feedback into pairwise comparisons.

Feedback Transformation. The MetaDueling reduction returns a raw outcome o_t ∈ {0, 1} indicating whether arm x_t won. To obtain an unbiased estimate of the true preference probability P(x_t, y_t), we apply the affine transformation

g_t = (o_t − α_m)/β_m,  (10)

where α_m and β_m are the bias and scaling constants from Lemma 3.3, which depend on the subset size m. This yields E[g_t | x_t, y_t] = P(x_t, y_t), enabling standard bandit estimation techniques despite the non-standard multi-dueling feedback model.

Stochastic Mode. The algorithm begins in stochastic mode, which operates via three mechanisms:

Stage 0: Baseline Exploration (t ≤ T_0).
For the first T_0 = 4K(K−1)⌈log(2K²T²/δ)⌉ rounds, the algorithm explores all ordered pairs via a round-robin schedule. This pure exploration phase is necessary to establish a high-confidence empirical baseline P̂_{T_0}(i, j) for the deviation detection mechanism.

Stage 1: Successive Elimination (t > T_0). The algorithm maintains an active set of candidate arms C, initialized to [K]. In each round t, to guarantee unbiased Borda estimates, the algorithm selects x_t ∈ C (e.g., via round-robin over C) and samples the opponent y_t ∼ Uniform([K] \ {x_t}). Suboptimal arms are permanently eliminated from C once sufficient evidence accumulates. Specifically, arm j ∈ C is eliminated if there exists another arm i ∈ C such that b̂_t(i) − conf_t(i) > b̂_t(j) + conf_t(j), where conf_t(i) = √(20K log(2K²T/δ) / N_t(i)). Once |C| = 1, letting C = {i⋆}, the algorithm exploits by playing x_t = y_t = i⋆.

Stage 2: Background Monitoring (t > T_0). To ensure the martingale deviation detection functions correctly in case of adversarial shifts, the algorithm forces random exploration with probability p_t = min(1, K log t / t) by sampling x_t, y_t ∼ Uniform([K]). These uniform samples feed directly into the deviation detection statistics.

Deviation Detection. Throughout exploitation, the algorithm monitors each pair (i, j) for deviation from the learned model by tracking:

D_t(i, j) := S_t^obs(i, j) − S_{T_0}^obs(i, j) − (N_t(i, j) − N_{T_0}(i, j)) · P̂_{T_0}(i, j).  (11)

A deviation triggers when D_t(i, j) > τ(√(N_t(i, j) − N_{T_0}(i, j)) + 1) for the threshold τ = √(8 log(K²T²/δ)).

Adversarial Mode. Upon detecting a deviation, the algorithm irreversibly switches to EXP3 with explicit uniform exploration, following Gajane (2024).
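As a sketch under stated assumptions (hypothetical helper and argument names, not the paper's code), the per-pair deviation test of Eq. (11) amounts to a few lines: the statistic compares post-baseline wins against what the locked-in estimate P̂_{T_0} predicts.

```python
import math

def deviation_triggered(S_t, S_T0, N_t, N_T0, P_hat_T0, K, T, delta):
    """Eq. (11) for one pair (i, j): D_t counts observed wins beyond the
    baseline prediction; trigger when D_t exceeds tau * (sqrt(n) + 1)."""
    n = N_t - N_T0                      # post-baseline comparison count
    D = (S_t - S_T0) - n * P_hat_T0     # deviation statistic D_t(i, j)
    tau = math.sqrt(8 * math.log(K ** 2 * T ** 2 / delta))
    return D > tau * (math.sqrt(n) + 1)

# With P_hat = 0.5 locked in, 5000 wins in 10000 fair comparisons is
# consistent with the baseline, while 7000 wins signals a shift.
K, T = 10, 1000
delta = 1 / T ** 2
assert not deviation_triggered(5000, 0, 10000, 0, 0.5, K, T, delta)
assert deviation_triggered(7000, 0, 10000, 0, 0.5, K, T, delta)
```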
The sampling distribution mixes a Gibbs distribution over cumulative importance-weighted scores with uniform exploration:

q_t(i) = (1 − γ) exp(η S̃_t(i)) / Σ_{j=1}^K exp(η S̃_t(j)) + γ/K,  (12)

ensuring q_t(i) ≥ γ/K for bounded inverse-probability weights. The importance-weighted Borda score update uses the transformed feedback g_t and accounts for the sampling probabilities of both arms in the pair.

5.2 Importance-Weighted Borda Estimation

In adversarial mode, we use importance-weighted estimators for the shifted Borda score.

Definition 5.1 (Importance-Weighted Borda Estimator). Given arms x_t, y_t ∼ q_t and transformed feedback g_t, the importance-weighted shifted Borda score estimate is:

ŝ_t(i) := [1(i = x_t)/(K q_t(i))] Σ_{j∈[K]} 1(j = y_t) g_t/q_t(j).  (13)

Lemma 5.2 (Unbiased Estimation). The estimator ŝ_t(i) is unbiased:

E[ŝ_t(i) | q_t] = s_t(i) := (1/K) Σ_{j∈[K]} P_t(i, j).  (14)

Algorithm 2 SA-MiDEX: Stochastic-and-Adversarial Multi-Dueling Bandits
Require: Arms K, horizon T, subset size m, confidence δ = 1/T²
1: Initialize: mode ← Stoch, C ← [K], T_0 ← 4K(K−1)⌈log(2K²T²/δ)⌉
2: Init Stats: N, S^obs, S̃ ← 0, P̂ ← 1/2
3: for t = 1, ..., T do
4:   if mode = Stoch and t ≤ T_0 then  ▷ Stage 0: Baseline Exploration
5:     (x_t, y_t) ← RoundRobin(t)
6:   else if mode = Stoch then  ▷ Stages 1 & 2: Elimination & Monitoring
7:     Sample u_t ∼ Uniform(0, 1)
8:     if u_t < min(1, K log t / t) then
9:       (x_t, y_t) ∼ Uniform([K] × [K])
10:    else if |C| > 1 then
11:      Pick x_t ∈ C (round-robin), y_t ∼ Uniform([K] \ {x_t})
12:    else
13:      (x_t, y_t) ← (i⋆, i⋆) where C = {i⋆}
14:    end if
15:  else
16:    (x_t, y_t) ← EXP3Select  ▷ Sample from q in Eq.
(12)
17:  end if
18:  o_t ← MetaDueling(x_t, y_t, m); g_t ← (o_t − α_m)/β_m; update N, S^obs, b̂
19:  if t = T_0 then P̂_{T_0}(i, j) ← S^obs(i, j)/N(i, j) for all (i, j)  ▷ Lock baseline
20:  if mode = Stoch and t > T_0 then
21:    Eliminate j from C if ∃ i ∈ C s.t. b̂_t(i) − conf_t(i) > b̂_t(j) + conf_t(j)
22:    if ∃ (i, j): D_t(i, j) > τ(√(N_t(i, j) − N_{T_0}(i, j)) + 1) then mode ← Adv; S̃ ← 0
23:  else if mode = Adv then
24:    S̃(x_t) ← S̃(x_t) + ŝ_t(x_t), where ŝ_t is the importance-weighted estimate of Eq. (13)
25:  end if
26: end for

Proof. Taking expectations over x_t, y_t drawn i.i.d. from q_t:

E[ŝ_t(i) | q_t] = Σ_{x,y} q_t(x) q_t(y) · 1(i = x) · E[g_t | x, y] / (K q_t(i) q_t(y)) = Σ_y q_t(y) P_t(i, y) / (K q_t(y)) = (1/K) Σ_y P_t(i, y) = s_t(i),

where we used E[g_t | x_t, y_t] = P_t(x_t, y_t) from the feedback transformation.

5.3 Main Results: Borda Setting

We now state our main results for the Borda setting. The proofs rely on concentration arguments for the stochastic phase and on the EXP3 analysis for the adversarial phase.

Theorem 5.3 (Stochastic Regret Bound). In the stochastic setting (P_t = P for all t), Algorithm 2 achieves:

E[R_T^B] ≤ O( K² log(KT) + K log² T + Σ_{i:∆_i^B>0} K log(KT)/(∆_i^B)² ).  (15)

Proof Sketch. Conditioned on the high-probability event E where the confidence bounds hold and no false adversarial switch triggers, we decompose the regret into four phases. The baseline exploration of T_0 rounds contributes O(K² log(KT)). During successive elimination, selecting a suboptimal arm i against a uniformly sampled opponent incurs Θ(1) instantaneous regret; by Lemma 5.11, arm i is eliminated after at most O(K log(KT)/(∆_i^B)²) pulls, contributing O( Σ_{i≠i⋆} K log(KT)/(∆_i^B)² ). Once only the optimal arm remains, exploitation yields exactly 0 regret.
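The adversarial-mode pieces, the mixed sampling distribution of Eq. (12) and the unbiasedness claimed in Lemma 5.2, can be sketched and checked by simulation. This is an illustrative sketch (hypothetical K, scores, and preference matrix; the Bernoulli draw stands in for the transformed feedback g_t, which has the same conditional mean):

```python
import numpy as np

def exp3_distribution(S_tilde, eta, gamma):
    """Eq. (12): Gibbs weights over cumulative importance-weighted scores,
    mixed with uniform exploration so that q(i) >= gamma / K."""
    z = np.exp(eta * (S_tilde - S_tilde.max()))  # max-shift for stability
    return (1 - gamma) * z / z.sum() + gamma / len(S_tilde)

rng = np.random.default_rng(0)
K = 4
# A valid preference matrix: reciprocity P(i,j) + P(j,i) = 1, diagonal 1/2.
P = np.array([[0.5, 0.8, 0.6, 0.7],
              [0.2, 0.5, 0.4, 0.6],
              [0.4, 0.6, 0.5, 0.5],
              [0.3, 0.4, 0.5, 0.5]])
s = P.mean(axis=1)  # shifted Borda scores s(i) = (1/K) sum_j P(i, j)
q = exp3_distribution(np.zeros(K), eta=0.1, gamma=0.2)

# Monte Carlo estimate of E[s_hat(i)]: draw x, y ~ q, simulate an unbiased
# pairwise outcome with mean P(x, y), apply the importance weights of Eq. (13).
n = 200_000
x = rng.choice(K, size=n, p=q)
y = rng.choice(K, size=n, p=q)
g = (rng.random(n) < P[x, y]).astype(float)
est = np.bincount(x, weights=g / (K * q[x] * q[y]), minlength=K) / n
assert np.max(np.abs(est - s)) < 0.02  # matches Lemma 5.2 up to MC noise
```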
Finally, background monitoring forces uniform sampling with probability p_t = min(1, K log t / t), adding at most Σ_{t=1}^T p_t ≤ O(K log² T). Summing these components yields the bound. See Appendix E.6 for full details.

Theorem 5.4 (Adversarial Regret Bound). Under any oblivious adversarial sequence P_1, ..., P_T, Algorithm 2 achieves:

E[R_T^B] ≤ O( K√(T log(KT)) + K^{1/3} T^{2/3} (log K)^{1/3} ).  (16)

Proof Sketch. We analyze two cases based on whether a mode switch occurs. If a switch occurs at round t_sw ≤ T, the regret comprises the O(K² log(KT)) exploration overhead, the undetected pre-switch manipulation, which is strictly bounded by the martingale detection threshold as R_pre ≤ O(K√(T log(KT))), and the post-switch EXP3 phase, which contributes E[R_post] ≤ O(K^{1/3} T^{2/3} (log K)^{1/3}) (Lemma 5.12). Alternatively, if no switch occurs, the adversary's cumulative deviation is continuously bounded by the threshold, limiting the regret to O(K√(T log(KT))). Full details are in Appendix E.6.

Corollary 5.5 (Best-of-Both-Worlds for the Borda Setting). Without prior knowledge of the environment type, Algorithm 2 simultaneously achieves:

(i) Stochastic: E[R_T^B] ≤ O( K² log(KT) + K log² T + Σ_{i:∆_i^B>0} K log(KT)/(∆_i^B)² );
(ii) Adversarial: E[R_T^B] ≤ O( K√(T log(KT)) + K^{1/3} T^{2/3} (log K)^{1/3} ).

5.4 Lower Bounds

Remark 5.6 (Optimality in the Borda Setting). 1. Stochastic Setting: Instance-dependent lower bounds imply an Ω( Σ_{i:∆_i^B>0} log T/(∆_i^B)² ) limit. Our SA-MiDEX algorithm achieves this gap-dependent rate asymptotically, though it incurs an initial O(K² log(KT)) exploration overhead necessary to unbiasedly estimate global Borda scores. 2. Adversarial Setting: As established by Gajane (2024), the expected regret of any multi-dueling bandit algorithm measured against an adversarial Borda winner is lower bounded by Ω(K^{1/3} T^{2/3}).
Our adversarial guarantee of $O(K^{1/3} T^{2/3} (\log K)^{1/3})$ therefore matches the theoretical limit of this setting up to logarithmic factors.

The proofs for these fundamental limits follow a similar structure to our Condorcet lower bounds (Theorems 4.4 and 4.5), utilizing a reduction argument that simulates multi-dueling feedback to invoke dueling bandit lower bounds.

5.5 Key Technical Lemmas

We state the key lemmas underlying our analysis. Full proofs are deferred to Appendix E.

Lemma 5.7 (Exploration Guarantees). With $T_0 = 4K(K-1)\lceil \log(2K^2T^2/\delta) \rceil$, after exploration, with probability at least $1 - \delta/2$:
(i) Each pair $(i, j)$ is observed at least $n_0 = \Theta(\log T)$ times.
(ii) $|\hat P_{T_0}(i, j) - P(i, j)| \le O(\sqrt{\log T / n_0})$ for all pairs.
(iii) $|\hat b_{T_0}(i) - b(i)| \le O(\sqrt{\log T / n_0})$ for all arms.

Lemma 5.8 (No False Alarm in Stochastic Setting). In the stochastic setting, with probability at least $1 - \delta/2$, the deviation detection never triggers throughout $T$ rounds.

Lemma 5.9 (Adversarial Detection Guarantee). If the adversary's cumulative preference shift on pair $(i, j)$ exceeds $3\gamma(\sqrt{n} + 1)$, where $n$ is the post-exploration comparison count, then deviation is detected with probability at least $1 - \delta/(K^2T^2)$.

Lemma 5.10 (Confidence Validity). With probability at least $1 - \delta/4$, for all arms $i$ and rounds $t$:
$$|\hat b_t(i) - b(i)| \le \sqrt{\frac{2K \log(2K^2T/\delta)}{N_t(i)}}. \tag{17}$$

Lemma 5.11 (Elimination Sample Complexity). Conditioned on confidence validity, each suboptimal arm $j$ is eliminated from $C$ after at most
$$N_T(j) \le \frac{32K \log(2K^2T/\delta)}{(\Delta^B_j)^2} + 1 \tag{18}$$
pulls.

Lemma 5.12 (EXP3 Regret in Adversarial Mode).
Starting from any round $t_0$ with fresh initialization, the EXP3 algorithm with explicit exploration parameters $\eta = \Theta(T^{-2/3})$ and $\gamma = \Theta(T^{-1/3})$ achieves:
$$\mathbb{E}\left[\sum_{t=t_0}^{T} s_t(i^\star) - \frac{s_t(x_t) + s_t(y_t)}{2}\right] \le O\!\left(K^{1/3} (T - t_0 + 1)^{2/3} (\log K)^{1/3}\right). \tag{19}$$

6 Extensions

Our algorithmic framework flexibly accommodates several practical extensions.

Time-Varying Subset Sizes: Theorems 4.1, 5.3, and 5.4 extend immediately to the setting where the subset size $m_t \in \{2, \ldots, K\}$ varies arbitrarily across rounds. Because the rescaling factor $\beta_{m_t}$ depends only on the current $m_t$ and the feedback transformations remain unbiased pointwise, the regret bounds carry through with identical constants. Formal statements and proofs are deferred to Appendix F.

Contextual, Batched, and High-Probability Settings: The black-box nature of the MetaDueling reduction allows it to effortlessly inherit capabilities for contexts, batched feedback, and high-probability bounds simply by swapping the base algorithm (e.g., Versatile-DB) for a variant designed for those settings. For the Borda setting, extending SA-MiDEX to contextual or high-probability domains requires modifying the global Borda score estimators and concentration thresholds, which presents an interesting direction for future work.

7 Discussion and Conclusion

We have presented the first best-of-both-worlds algorithms for multi-dueling bandits, addressing both Condorcet and Borda objectives. The $O(K^2 \log(KT))$ exploration phase in SA-MiDEX is required to learn all pairwise preferences. This overhead dominates for small $T$ or large $K$. Structured preference models (e.g., low-rank, parametric) could potentially reduce this overhead. Both algorithms construct multisets containing only two distinct arms, potentially wasting the ability to compare $m > 2$ arms simultaneously.
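For concreteness, the two-arm multiset construction used by both algorithms can be sketched as follows. This is a minimal illustrative sketch, not the paper's pseudocode: the function name and the coin-flip implementation of the odd-$m$ symmetrization are our own.

```python
import random

def build_multiset(x, y, m, rng=random):
    """Build the size-m comparison multiset over the two arms x and y.

    For even m each arm receives exactly m/2 copies; for odd m a fair
    coin decides which arm gets the extra copy (the symmetrization
    step used in the regret-equivalence analysis).
    """
    if m % 2 == 0:
        n_x = m // 2
    else:
        n_x = (m + 1) // 2 if rng.random() < 0.5 else (m - 1) // 2
    return [x] * n_x + [y] * (m - n_x)

# Example: a size-5 multiset over arms 0 and 3 contains only those two
# arms, with one of them holding the extra (third) copy.
A = build_multiset(0, 3, 5)
print(len(A), sorted(set(A)))  # 5 [0, 3]
```

Because only two distinct arms ever appear in $A_t$, the multi-way winner observation collapses to the pairwise signal analyzed in Lemma 3.3, which is exactly what enables the black-box reduction.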
A more sophisticated reduction might extract more information from multi-way comparisons, though this would likely sacrifice the black-box property. Our bounds are in expectation (pseudo-regret). As discussed in Remark 4.7, simultaneously optimal high-probability bounds in both regimes are impossible, but regime-specific high-probability guarantees remain open.

Our work opens several directions for future research. First, can we achieve best-of-both-worlds guarantees without assuming a Condorcet winner, requiring a regret notion that handles cyclic preferences while remaining sublinear in the adversarial case? Second, can multi-way comparisons ($m > 2$) be exploited more efficiently, potentially leveraging all $\binom{m}{2}$ pairwise comparisons to accelerate learning beyond the two-arm multiset approach? Third, can we compete with time-varying benchmarks, such as sequences of changing Condorcet winners, yielding dynamic regret bounds that scale with the number of changes or path length? Fourth, can structure in the preference matrix (e.g., low-rank, Plackett-Luce, or feature-based models) be leveraged to reduce exploration or regret, particularly in the Borda setting with its $O(K^2)$ overhead? Finally, can our techniques extend beyond winner-only feedback to richer feedback models (rankings or pairwise margins) or sparser observations (e.g., only whether the top choice won)?

In conclusion, this paper provides the first comprehensive treatment of best-of-both-worlds guarantees for multi-dueling bandits under Condorcet and Borda objectives. Our MetaDueling reduction demonstrates that optimal Condorcet regret is achievable via a simple, modular transformation of existing dueling bandit algorithms.
Our SA-MiDEX algorithm shows that gap-dependent stochastic rates and sublinear adversarial rates are simultaneously achievable for Borda objectives, despite the additional challenges of Borda score estimation. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature. The modularity of our approach, particularly the black-box nature of MetaDueling, ensures that future advances in dueling bandits will automatically benefit multi-dueling applications.

References

Arpit Agarwal, Nicholas Johnson, and Shivani Agarwal. Choice bandits. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 18399–18410, 2020.
Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Annual Conference on Learning Theory, pages 116–120. PMLR, 2016. URL https://proceedings.mlr.press/v49/auer16.html.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002. doi: 10.1137/S0097539701398375. URL https://doi.org/10.1137/S0097539701398375.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. Multi-dueling bandits and their application to online ranker evaluation. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM), pages 2161–2166. ACM, 2016.
Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits.
In Conference on Learning Theory (COLT), pages 42-1. JMLR, 2012.
Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 563–587, 2015.
Moein Falahatgar, Yi Hao, Alon Orlitsky, Venkatadheeraj Pichapati, and V Ravindrakumar. Maxing and ranking in linear time. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
Pratik Gajane. Adversarial multi-dueling bandits. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024.
Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38 of Proceedings of Machine Learning Research, pages 416–424. PMLR, 2015. URL https://proceedings.mlr.press/v38/jamieson15.html.
Chi Jin, Haipeng Luo, Chen-Yu Wei, and Zhuoran Yang. Best of both worlds: Reinforcement learning with competitive bounds in both stochastic and adversarial MDPs. In Conference on Learning Theory (COLT), pages 2625–2633. PMLR, 2021.
Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 1141–1154. PMLR, 2015.
Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa.
Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1235–1244. PMLR, 2016.
Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Achieving near-optimal regret in both stochastic and adversarial linear bandits. In International Conference on Machine Learning (ICML), pages 6145–6155. PMLR, 2021.
R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley, 1959.
Haipeng Luo, Chen-Yu Wei, Nima Alemazkoor, and Jialin Zheng. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory (COLT), pages 1739–1776. PMLR, 2018.
Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374, Aug 2010. ISSN 1573-7659. doi: 10.1007/s10791-009-9123-y. URL https://doi.org/10.1007/s10791-009-9123-y.
Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both-world analyses for learning from relative preferences. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 19011–19026. PMLR, 2022.
Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI), pages 805–814. AUAI Press, 2018.
Aadirupa Saha and Aditya Gopalan. PAC battling bandits in the Plackett-Luce model. In Proceedings of the 30th International Conference on Algorithmic Learning Theory (ALT), pages 700–737. PMLR, 2019.
Aadirupa Saha, Tomer Koren, and Yishay Mansour. Adversarial dueling bandits.
In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 9235–9244. PMLR, 2021.
Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning (ICML), pages 1287–1295. PMLR, 2014.
Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Random utility theory for social choice. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 126–134, Red Hook, NY, USA, 2012. Curran Associates Inc.
Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1201–1208. ACM, 2009.
Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 467–475. PMLR, 2019.
Julian Zimmert and Yevgeny Seldin. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22:1–49, 2021.
Julian Zimmert, Tor Lattimore, and Yevgeny Seldin. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning (ICML), pages 7683–7692. PMLR, 2019.
Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 10–18. PMLR, 2014.
Masrour Zoghi, Zohar Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015.

A Extended Related Work

Dueling Bandits. The dueling bandit problem was introduced by Yue and Joachims (2009) for information retrieval applications. Early work focused on stochastic settings with Condorcet winners: Yue et al. (2012) proposed the Interleaved Filter algorithm, Zoghi et al. (2014) introduced RUCB with improved regret bounds, and Komiyama et al. (2015) established instance-optimal rates. Alternative notions of optimality have also been studied: Copeland winners (Zoghi et al., 2015; Komiyama et al., 2016), Borda winners (Jamieson et al., 2015; Falahatgar et al., 2017), and von Neumann winners (Dudík et al., 2015). Saha et al. (2021) initiated the study of adversarial dueling bandits, achieving $O(K^{1/3} T^{2/3})$ regret. Most relevant to our work, Saha and Gaillard (2022) achieved best-of-both-worlds guarantees for dueling bandits using Tsallis entropy regularization, obtaining $O(\sqrt{KT})$ adversarial regret and instance-optimal $O(\sum_{i \ne a^\star} \log T / \Delta_i)$ stochastic regret simultaneously. Our MetaDueling reduction leverages their Versatile-DB algorithm as a black box.

Multi-Dueling Bandits. Multi-dueling bandits extend dueling bandits to settings where $m \ge 2$ arms are compared simultaneously. Brost et al. (2016) introduced the problem with subset-wise feedback, where the learner observes preferences among all pairs in the selected subset. Saha and Gopalan (2018) studied the more challenging winner feedback model, where only the identity of the winning arm is revealed, under stochastic preferences with Condorcet winners, achieving $O(K \log T / \Delta)$ regret. Agarwal et al. (2020) extended this to generalized Condorcet winners under multinomial logit choice models.
Saha and Gopalan (2019) studied PAC identification in multi-dueling bandits. Gajane (2024) introduced adversarial multi-dueling bandits with Borda objectives, achieving $O((K \log K)^{1/3} T^{2/3})$ regret.

Our work bridges these lines: for the Condorcet setting, we provide the first best-of-both-worlds guarantee; for the Borda setting, we improve upon purely adversarial approaches by achieving gap-dependent stochastic regret while maintaining sublinear adversarial regret.

Best-of-Both-Worlds Bandits. The quest for algorithms that achieve optimal rates in both stochastic and adversarial settings without prior knowledge of the regime has a rich history. Bubeck and Slivkins (2012) first achieved best-of-both-worlds guarantees for multi-armed bandits, though with suboptimal constants. Seldin and Slivkins (2014) and Auer and Chiang (2016) refined these results. The breakthrough came with Zimmert and Seldin (2019; 2021), who showed that Tsallis entropy regularization (specifically, the Tsallis-INF algorithm) achieves optimal $O(\sqrt{KT})$ adversarial regret and $O(\sum_i \log T / \Delta_i)$ stochastic regret simultaneously. This approach has since been extended to linear bandits (Lee et al., 2021), combinatorial bandits (Zimmert et al., 2019), and dueling bandits (Saha and Gaillard, 2022).

Our work extends this line to multi-dueling bandits. For the Condorcet setting, we achieve this via a black-box reduction that preserves the best-of-both-worlds property of the base algorithm. For the Borda setting, we adopt a different approach based on the stochastic-and-adversarial paradigm.

Stochastic-and-Adversarial Learning. An alternative approach to best-of-both-worlds guarantees is the stochastic-and-adversarial paradigm, introduced by Auer and Chiang (2016).
Rather than designing a single algorithm that simultaneously adapts to both regimes, this approach begins by assuming a stochastic environment and monitors for deviations; upon detecting adversarial behavior, the algorithm switches (potentially irreversibly) to a robust adversarial strategy. This paradigm has been applied to multi-armed bandits (Auer and Chiang, 2016), contextual bandits (Luo et al., 2018), and reinforcement learning (Jin et al., 2021).

Our SA-MiDEX algorithm for the Borda setting adopts this paradigm. The key challenge is designing a deviation detection mechanism that (i) never triggers false alarms in stochastic environments, ensuring gap-dependent regret, and (ii) detects adversarial shifts before they accumulate excessive regret. We achieve this using martingale concentration bounds calibrated to the multi-dueling feedback structure.

Choice Models and Preference Learning. The pairwise-subset choice model, under which our results hold, follows Saha and Gopalan (2018), where the probability of each arm winning is determined by averaging pairwise preferences. This generalizes the Bradley-Terry-Luce model (Bradley and Terry, 1952; Luce, 1959) and is related to Plackett-Luce models for ranking (Plackett, 1975). Alternative choice models include multinomial logit (Agarwal et al., 2020) and general random utility models (Soufiani et al., 2012).

B Relationship Between Condorcet and Borda Settings

The Condorcet and Borda criteria capture fundamentally different aspects of preference learning. A Condorcet winner $a^\star$ satisfies $P(a^\star, i) \ge 1/2$ for all $i \in [K]$. A Borda winner $i^\star$ maximizes $b(i) = \frac{1}{K-1} \sum_{j \ne i} P(i, j)$. In arbitrary preference matrices, these two notions of optimality need not coincide.
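A small numerical example makes the divergence concrete. The 3-arm preference matrix below is our own illustration, not one from the paper: arm 0 beats both rivals narrowly and is the Condorcet winner, yet arm 1's crushing win over arm 2 gives it the higher Borda score.

```python
import numpy as np

# P[i, j] = probability that arm i beats arm j; diagonal fixed at 1/2.
P = np.array([
    [0.50, 0.51, 0.51],   # arm 0: narrow wins over both rivals
    [0.49, 0.50, 0.95],   # arm 1: loses narrowly to 0, dominates 2
    [0.49, 0.05, 0.50],
])
K = len(P)

# Condorcet winner: beats every other arm with probability >= 1/2.
condorcet = [i for i in range(K)
             if all(P[i, j] >= 0.5 for j in range(K) if j != i)]

# Borda score: b(i) = (1/(K-1)) * sum over j != i of P(i, j).
borda = (P.sum(axis=1) - 0.5) / (K - 1)  # subtract the diagonal 1/2

print(condorcet, borda.round(2).tolist())  # [0] [0.51, 0.72, 0.27]
```

Here arm 0 is the unique Condorcet winner while arm 1 is the Borda winner, so an algorithm optimal for one objective can suffer linear regret under the other.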
A Condorcet winner may secure narrow pairwise victories against all opponents, while another arm might lose narrowly to the Condorcet winner but dominate all other arms by a massive margin, thereby achieving a higher Borda score. The two criteria are guaranteed to align only under structural assumptions on the preference matrix, such as the Bradley-Terry-Luce (BTL) model or Strong Stochastic Transitivity (SST).

Proposition B.1 (Condorcet and Borda Equivalence under SST). Assume the preference matrix $P$ satisfies Strong Stochastic Transitivity (SST): for any $i, j, k \in [K]$, if $P(i, j) \ge 1/2$ and $P(j, k) \ge 1/2$, then $P(i, k) \ge \max\{P(i, j), P(j, k)\}$. If $a^\star$ is a Condorcet winner, then $a^\star$ is also Borda-optimal.

Proof. Let $a^\star$ be the Condorcet winner, meaning $P(a^\star, i) \ge 1/2$ for all $i$. Let $i$ be any other arm. We compare their Borda scores:
$$b(a^\star) - b(i) = \frac{1}{K-1} \sum_{j \ne a^\star} P(a^\star, j) - \frac{1}{K-1} \sum_{j \ne i} P(i, j) = \frac{1}{K-1} \left[ \sum_{j \ne a^\star, i} \big(P(a^\star, j) - P(i, j)\big) + P(a^\star, i) - P(i, a^\star) \right].$$
For any arm $j \ne a^\star, i$, there are two cases:
• Case 1: $P(i, j) \ge 1/2$. By the SST assumption, since $P(a^\star, i) \ge 1/2$ and $P(i, j) \ge 1/2$, it follows that $P(a^\star, j) \ge P(i, j)$. Thus $P(a^\star, j) - P(i, j) \ge 0$.
• Case 2: $P(i, j) < 1/2$. Because $a^\star$ is the Condorcet winner, $P(a^\star, j) \ge 1/2 > P(i, j)$. Therefore $P(a^\star, j) - P(i, j) > 0$.
In all cases, $P(a^\star, j) \ge P(i, j)$. Furthermore, since $P(a^\star, i) \ge 1/2$, we have $P(a^\star, i) - P(i, a^\star) = 2P(a^\star, i) - 1 \ge 0$. Summing these non-negative terms yields $b(a^\star) - b(i) \ge 0$, proving that $a^\star$ maximizes the Borda score.

Because such transitivity cannot be guaranteed in adversarial environments or complex user-preference datasets (where cyclic preferences naturally arise), the Borda objective remains the most robust benchmark for general multi-dueling problems.
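Proposition B.1 can be sanity-checked numerically. A Bradley-Terry-Luce matrix satisfies SST, so its Condorcet winner should also top the Borda ranking; the utility vector below is an arbitrary illustration of our own.

```python
import numpy as np

# BTL preferences: P(i, j) = u_i / (u_i + u_j). BTL satisfies SST.
u = np.array([5.0, 3.0, 2.0, 1.0])          # illustrative utilities
K = len(u)
P = u[:, None] / (u[:, None] + u[None, :])  # diagonal is 1/2 automatically

# Arm 0 is the Condorcet winner: it beats every rival with prob >= 1/2 ...
assert all(P[0, j] >= 0.5 for j in range(1, K))

# ... and, as Proposition B.1 predicts, it also maximizes the Borda score.
borda = (P.sum(axis=1) - 0.5) / (K - 1)
assert int(borda.argmax()) == 0
print(borda.round(3).tolist())  # [0.724, 0.575, 0.451, 0.25]
```

Any strict utility ordering gives the same conclusion: under BTL (and SST more generally), the two objectives single out the same arm.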
C Validity of Rescaled Preferences

We now verify that the rescaled preference matrix $\hat P_t$ is a valid preference matrix and preserves the Condorcet winner.

Lemma C.1 (Validity of Rescaled Preferences). The rescaled matrix $\hat P_t$ satisfies:
(i) Reciprocity: $\hat P_t(i, j) + \hat P_t(j, i) = 1$ for all $i, j \in [K]$.
(ii) Self-comparison: $\hat P_t(i, i) = \frac{1}{2}$ for all $i \in [K]$.
(iii) Boundedness: $\hat P_t(i, j) \in [0, 1]$ for all $i, j \in [K]$.
(iv) Condorcet preservation: If $a^\star$ is a Condorcet winner under $P_t$, then $a^\star$ is a Condorcet winner under $\hat P_t$.

Proof. (i) Reciprocity: By linearity of the rescaling:
$$\hat P_t(i, j) + \hat P_t(j, i) = 2\alpha_m + \beta_m \big(P_t(i, j) + P_t(j, i)\big) = 2\alpha_m + \beta_m = (1 - \beta_m) + \beta_m = 1,$$
where we used $\alpha_m = (1 - \beta_m)/2$ and $P_t(i, j) + P_t(j, i) = 1$.

(ii) Self-comparison: $\hat P_t(i, i) = \alpha_m + \beta_m \cdot \frac{1}{2} = \frac{1 - \beta_m}{2} + \frac{\beta_m}{2} = \frac{1}{2}$.

(iii) Boundedness: Since $P_t(i, j) \in [0, 1]$, $\alpha_m \ge 0$, and $\beta_m > 0$:
$$\hat P_t(i, j) \in [\alpha_m, \alpha_m + \beta_m] = [\alpha_m, 1 - \alpha_m] \subseteq [0, 1].$$

(iv) Condorcet preservation: If $P_t(a^\star, i) \ge \frac{1}{2}$ for all $i$, then:
$$\hat P_t(a^\star, i) = \alpha_m + \beta_m P_t(a^\star, i) \ge \alpha_m + \frac{\beta_m}{2} = \frac{1}{2}.$$

D Proofs for Condorcet Setting

D.1 Proof of Lemma 3.3 (Unbiased Feedback)

Proof. We prove that $\mathbb{E}[o_t \mid x_t, y_t] = \hat P_t(x_t, y_t) = \alpha_m + \beta_m P_t(x_t, y_t)$. Let $n_x$ denote the number of copies of $x_t$ in $A_t$ and $n_y = m - n_x$ the number of copies of $y_t$. By the pairwise-subset choice model (Definition 2), we compute the probability that some copy of $x_t$ wins. The multiset $A_t$ contains $n_x$ copies of $x_t$ at positions $1, \ldots, n_x$ and $n_y$ copies of $y_t$ at positions $n_x + 1, \ldots, m$. The winning probability for position $i \in [n_x]$ (a copy of $x_t$) is:
$$W(i \mid A_t, P_t) = \sum_{j \ne i} \frac{2\, P_t(A_t(i), A_t(j))}{m(m-1)}.$$
(20)
For position $i \le n_x$:
• Comparisons against other copies of $x_t$ (positions $j \le n_x$, $j \ne i$): there are $n_x - 1$ such positions, each contributing $\frac{2 \cdot \frac{1}{2}}{m(m-1)} = \frac{1}{m(m-1)}$.
• Comparisons against copies of $y_t$ (positions $j > n_x$): there are $n_y$ such positions, each contributing $\frac{2 P_t(x_t, y_t)}{m(m-1)}$.
Thus:
$$W(i \mid A_t, P_t) = \frac{n_x - 1}{m(m-1)} + \frac{2 n_y P_t(x_t, y_t)}{m(m-1)} \quad \text{for } i \le n_x. \tag{21}$$
The total probability that some copy of $x_t$ wins is:
$$\Pr[o_t = 1 \mid x_t, y_t, n_x] = \sum_{i=1}^{n_x} W(i \mid A_t, P_t) = n_x \left( \frac{n_x - 1}{m(m-1)} + \frac{2 n_y P_t(x_t, y_t)}{m(m-1)} \right) = \frac{n_x(n_x - 1)}{m(m-1)} + \frac{2 n_x n_y P_t(x_t, y_t)}{m(m-1)}. \tag{22}$$

Case 1: $m$ even. When $m$ is even, $n_x = n_y = m/2$ deterministically. Substituting into (22):
$$\Pr[o_t = 1 \mid x_t, y_t] = \frac{(m/2)(m/2 - 1)}{m(m-1)} + \frac{2(m/2)(m/2)\, P_t(x_t, y_t)}{m(m-1)} = \frac{m(m-2)/4}{m(m-1)} + \frac{m^2 P_t(x_t, y_t)/2}{m(m-1)} = \frac{m - 2}{4(m-1)} + \frac{m P_t(x_t, y_t)}{2(m-1)}. \tag{23}$$
Comparing with the rescaling constants for even $m$:
$$\beta_m = \frac{m}{2(m-1)}, \qquad \alpha_m = \frac{1 - \beta_m}{2} = \frac{1}{2} - \frac{m}{4(m-1)} = \frac{2(m-1) - m}{4(m-1)} = \frac{m - 2}{4(m-1)}. \tag{24}$$
Thus (23) equals $\alpha_m + \beta_m P_t(x_t, y_t) = \hat P_t(x_t, y_t)$.

Case 2: $m$ odd. When $m$ is odd, with probability $1/2$ we have $(n_x, n_y) = \big(\frac{m+1}{2}, \frac{m-1}{2}\big)$, and with probability $1/2$ we have $(n_x, n_y) = \big(\frac{m-1}{2}, \frac{m+1}{2}\big)$. For $(n_x, n_y) = \big(\frac{m+1}{2}, \frac{m-1}{2}\big)$:
$$\Pr\big[o_t = 1 \mid n_x = \tfrac{m+1}{2}\big] = \frac{\frac{m+1}{2} \cdot \frac{m-1}{2}}{m(m-1)} + \frac{2 \cdot \frac{m+1}{2} \cdot \frac{m-1}{2} \cdot P_t(x_t, y_t)}{m(m-1)} = \frac{(m+1)(m-1)}{4m(m-1)} + \frac{(m+1)(m-1)\, P_t(x_t, y_t)}{2m(m-1)} = \frac{m+1}{4m} + \frac{(m+1)\, P_t(x_t, y_t)}{2m}.$$
(25)
For $(n_x, n_y) = \big(\frac{m-1}{2}, \frac{m+1}{2}\big)$:
$$\Pr\big[o_t = 1 \mid n_x = \tfrac{m-1}{2}\big] = \frac{\frac{m-1}{2} \cdot \frac{m-3}{2}}{m(m-1)} + \frac{2 \cdot \frac{m-1}{2} \cdot \frac{m+1}{2} \cdot P_t(x_t, y_t)}{m(m-1)} = \frac{(m-1)(m-3)}{4m(m-1)} + \frac{(m-1)(m+1)\, P_t(x_t, y_t)}{2m(m-1)} = \frac{m-3}{4m} + \frac{(m+1)\, P_t(x_t, y_t)}{2m}. \tag{26}$$
Taking the average over the symmetrization:
$$\Pr[o_t = 1 \mid x_t, y_t] = \frac{1}{2}\left(\frac{m+1}{4m} + \frac{m-3}{4m}\right) + \frac{(m+1)\, P_t(x_t, y_t)}{2m} = \frac{1}{2} \cdot \frac{2m - 2}{4m} + \frac{(m+1)\, P_t(x_t, y_t)}{2m} = \frac{m-1}{4m} + \frac{m+1}{2m}\, P_t(x_t, y_t). \tag{27}$$
Comparing with the rescaling constants for odd $m$:
$$\beta_m = \frac{m+1}{2m}, \qquad \alpha_m = \frac{1 - \beta_m}{2} = \frac{1}{2} - \frac{m+1}{4m} = \frac{2m - m - 1}{4m} = \frac{m-1}{4m}. \tag{28}$$
Thus (27) equals $\alpha_m + \beta_m P_t(x_t, y_t) = \hat P_t(x_t, y_t)$. In both cases, $\mathbb{E}[o_t \mid x_t, y_t] = \hat P_t(x_t, y_t)$.

D.2 Proofs for Gap Rescaling

Proof for Lemma 3.5. Using $\alpha_m = (1 - \beta_m)/2$:
$$\hat\Delta_t(i) = \hat P_t(a^\star, i) - \frac{1}{2} = \alpha_m + \beta_m P_t(a^\star, i) - \frac{1}{2} = \frac{1 - \beta_m}{2} + \beta_m P_t(a^\star, i) - \frac{1}{2} = \beta_m \left( P_t(a^\star, i) - \frac{1}{2} \right) = \beta_m \Delta_t(i).$$

Proof for Lemma 3.6. For even $m$: $\beta_m = \frac{m}{2(m-1)} = \frac{1}{2} \cdot \frac{m}{m-1} > \frac{1}{2}$. For odd $m$: $\beta_m = \frac{m+1}{2m} = \frac{1}{2} + \frac{1}{2m} > \frac{1}{2}$. In both cases $\beta_m \ge \frac{1}{2}$, with $\beta_m \to \frac{1}{2}$ as $m \to \infty$.

D.3 Regret Equivalence in Rescaled Preferences

Proof for Lemma 3.7. Define the instantaneous regrets:
$$r^C_t := \frac{1}{m} \sum_{j=1}^{m} \Delta_t(A_t(j)), \qquad \hat r^{DB}_t := \frac{1}{2}\big(\hat\Delta_t(x_t) + \hat\Delta_t(y_t)\big) = \frac{\beta_m}{2}\big(\Delta_t(x_t) + \Delta_t(y_t)\big).$$
Case 1: $m$ even. The multiset $A_t$ contains $m/2$ copies each of $x_t$ and $y_t$:
$$r^C_t = \frac{1}{2}\Delta_t(x_t) + \frac{1}{2}\Delta_t(y_t) = \frac{1}{\beta_m}\, \hat r^{DB}_t.$$
Case 2: $m$ odd. Taking expectation over the symmetrization $B_t$:
$$\mathbb{E}[r^C_t \mid x_t, y_t] = \frac{1}{2} \cdot \frac{\lceil m/2 \rceil \Delta_t(x_t) + \lfloor m/2 \rfloor \Delta_t(y_t)}{m} + \frac{1}{2} \cdot \frac{\lfloor m/2 \rfloor \Delta_t(x_t) + \lceil m/2 \rceil \Delta_t(y_t)}{m} = \frac{1}{2}\Delta_t(x_t) + \frac{1}{2}\Delta_t(y_t) = \frac{1}{\beta_m}\, \hat r^{DB}_t.$$
Summing over $t = 1, \ldots$
$, T$ and applying the tower property yields $\mathbb{E}[R^C_T] = \frac{1}{\beta_m}\,\mathbb{E}[\hat R^{DB}_T]$.

D.4 Proof for Theorem 4.1

Proof. By Lemma 3.3, the base algorithm receives feedback satisfying $\mathbb{E}[o_t \mid x_t, y_t] = \hat P_t(x_t, y_t) \in [0, 1]$, which constitutes valid dueling bandit feedback for the rescaled preference matrix $\hat P_t$ (Lemma C.1).

Part (i): Adversarial bound. The Versatile-DB algorithm achieves adversarial pseudo-regret $\mathbb{E}[\hat R^{DB}_T] \le 4\sqrt{KT} + 1$ on any sequence of valid preference matrices, including $\{\hat P_t\}_{t=1}^{T}$ (Saha and Gaillard, 2022, Theorem 1). By Lemma 3.7:
$$\mathbb{E}[R^C_T] = \frac{1}{\beta_m}\,\mathbb{E}[\hat R^{DB}_T] \le \frac{1}{\beta_m}\big(4\sqrt{KT} + 1\big).$$
Since $\beta_m \ge 1/2$ for all $m \ge 2$ (Lemma 3.6), we have $1/\beta_m \le 2$, establishing (4).

Part (ii): Bound for self-bounding preference matrices. The self-bounding condition (5) relates expected regret to the cumulative probability mass placed on suboptimal arms. When this condition holds, Versatile-DB satisfies (Saha and Gaillard, 2022, Theorem 2):
$$\mathbb{E}[\hat R^{DB}_T] \le \sum_{i \ne a^\star} \frac{16 \log T + 48}{\hat\Delta_i} + 4 \log T + 12 + 8\sqrt{K}.$$
By Lemma 3.5, $\hat\Delta_i = \beta_m \Delta_i$, so:
$$\mathbb{E}[\hat R^{DB}_T] \le \frac{1}{\beta_m} \sum_{i \ne a^\star} \frac{16 \log T + 48}{\Delta_i} + 4 \log T + 12 + 8\sqrt{K}.$$
Applying Lemma 3.7 (multiplication by $1/\beta_m$):
$$\mathbb{E}[R^C_T] \le \frac{1}{\beta_m^2} \sum_{i \ne a^\star} \frac{16 \log T + 48}{\Delta_i} + \frac{4 \log T + 12}{\beta_m} + \frac{8\sqrt{K}}{\beta_m}.$$
Using $1/\beta_m^2 \le 4$ and $1/\beta_m \le 2$ completes the proof.

D.5 Proof of Lower Bounds

Proof of Theorem 4.4 (Adversarial Lower Bound). We prove the lower bound by reduction from adversarial dueling bandits.

Step 1: Reduction construction. Given any multi-dueling bandit algorithm $\mathcal{A}_{\mathrm{MDB}}$, we construct a dueling bandit algorithm $\mathcal{A}_{\mathrm{DB}}$ as follows. At each round $t$:
1. Run $\mathcal{A}_{\mathrm{MDB}}$ to obtain multiset $A_t$ of size $m$.
2. Sample two distinct indices $i_t, j_t \in [m]$ uniformly without replacement.
3. Play the duel $(A_t(i_t), A_t(j_t))$ in the dueling environment.
4.
Receive outcome $w_t \in \{i_t, j_t\}$ indicating the winner.
5. Construct a winner for the multi-dueling feedback: if $w_t = i_t$, return $i_t$ as the winner index to $\mathcal{A}_{\mathrm{MDB}}$; otherwise return $j_t$.
This construction is valid: the feedback to $\mathcal{A}_{\mathrm{MDB}}$ is consistent with the pairwise-subset choice model restricted to the pair $(A_t(i_t), A_t(j_t))$.

Step 2: Regret equivalence. The instantaneous dueling regret for the constructed algorithm is:
$$r^{DB}_t = \frac{1}{2}\big(\Delta_t(A_t(i_t)) + \Delta_t(A_t(j_t))\big). \tag{29}$$
Taking expectation over the uniform sampling of $(i_t, j_t)$:
$$\mathbb{E}_{i_t, j_t}[r^{DB}_t] = \frac{1}{2} \cdot \frac{1}{\binom{m}{2}} \sum_{i < j} \big(\Delta_t(A_t(i)) + \Delta_t(A_t(j))\big) = \frac{1}{m} \sum_{j=1}^{m} \Delta_t(A_t(j)) = r^C_t, \tag{30}$$
so the two algorithms incur the same expected regret:
$$\mathbb{E}[R^{DB}_T] = \mathbb{E}[R^C_T]. \tag{31}$$

Step 3: Apply the adversarial dueling bandit lower bound. Any dueling bandit algorithm suffers pseudo-regret
$$\mathbb{E}[R^{DB}_T] \ge \Omega(\sqrt{KT}) \tag{32}$$
against some oblivious adversarial sequence of preference matrices with gaps $\Delta_t(i) > 0$. The hard instance can be constructed to admit a Condorcet winner: fix $a^\star = 1$ and set $P_t(1, i) = \frac{1}{2} + \epsilon_t(i)$ for adversarially chosen $\epsilon_t(i) \in [0, \frac{1}{2}]$. The adversary controls the gap magnitudes while maintaining the Condorcet structure.

Step 4: Transfer to multi-dueling. Suppose for contradiction that there exists a multi-dueling bandit algorithm $\mathcal{A}_{\mathrm{MDB}}$ achieving $\mathbb{E}[R^C_T] = o(\sqrt{KT})$ on all adversarial Condorcet instances. By Step 2, the constructed $\mathcal{A}_{\mathrm{DB}}$ would satisfy:
$$\mathbb{E}[R^{DB}_T] = \mathbb{E}[R^C_T] = o(\sqrt{KT}), \tag{33}$$
contradicting the lower bound of Step 3. Therefore, for any multi-dueling bandit algorithm:
$$\mathbb{E}[R^C_T] \ge \Omega(\sqrt{KT}). \tag{34}$$

Proof of Theorem 4.5 (Stochastic Lower Bound). We prove the lower bound by reduction from stochastic dueling bandits.

Step 1: Reduction construction. We use the same reduction as in Theorem 4.4: given $\mathcal{A}_{\mathrm{MDB}}$, construct $\mathcal{A}_{\mathrm{DB}}$ by sampling two indices uniformly and playing the corresponding duel.

Step 2: Regret equivalence. Since the preference matrix is now fixed ($P_t = P$ for all $t$), the same calculation as in (31) gives:
$$\mathbb{E}[R^{DB}_T] = \mathbb{E}[R^C_T]. \tag{35}$$

Step 3: Apply stochastic dueling bandit lower bound. Komiyama et al.
(2015) established that for any stochastic dueling bandit algorithm and any instance with Condorcet winner $a^\star$:
$$\liminf_{T \to \infty} \frac{\mathbb{E}[R^{DB}_T]}{\log T} \ge \sum_{i \ne a^\star} \frac{1}{2 \cdot \mathrm{KL}\big(P(a^\star, i) \,\|\, \frac{1}{2}\big)}, \tag{36}$$
where $\mathrm{KL}(p \,\|\, q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$ is the binary KL divergence.

Step 4: Simplify the KL term. For $P(a^\star, i) = \frac{1}{2} + \Delta_i$ with $\Delta_i > 0$, we compute:
$$\mathrm{KL}\Big(\tfrac{1}{2} + \Delta_i \,\Big\|\, \tfrac{1}{2}\Big) = \Big(\tfrac{1}{2} + \Delta_i\Big)\log(1 + 2\Delta_i) + \Big(\tfrac{1}{2} - \Delta_i\Big)\log(1 - 2\Delta_i). \tag{37}$$
Using the Taylor expansion $\log(1 + x) = x - \frac{x^2}{2} + O(x^3)$ for small $x$:
$$\mathrm{KL}\Big(\tfrac{1}{2} + \Delta_i \,\Big\|\, \tfrac{1}{2}\Big) = \Big(\tfrac{1}{2} + \Delta_i\Big)\big(2\Delta_i - 2\Delta_i^2 + O(\Delta_i^3)\big) + \Big(\tfrac{1}{2} - \Delta_i\Big)\big(-2\Delta_i - 2\Delta_i^2 + O(\Delta_i^3)\big) = \big(\Delta_i + \Delta_i^2\big) + \big(-\Delta_i + \Delta_i^2\big) + O(\Delta_i^3) = 2\Delta_i^2 + O(\Delta_i^3). \tag{38}$$
Thus for small $\Delta_i$:
$$\frac{1}{2 \cdot \mathrm{KL}\big(\frac{1}{2} + \Delta_i \,\|\, \frac{1}{2}\big)} = \frac{1}{4\Delta_i^2 + O(\Delta_i^3)} = \Theta\Big(\frac{1}{\Delta_i^2}\Big). \tag{39}$$
However, the regret contribution from arm $i$ is $\Delta_i$ per pull. The lower bound (36) states that the expected number of pulls of suboptimal arm $i$ must be at least $\Omega(\log T / \mathrm{KL}(\cdot))$. Since pulling arm $i$ contributes $\Delta_i$ to regret, the total regret contribution from arm $i$ is:
$$\Omega\left(\Delta_i \cdot \frac{\log T}{\mathrm{KL}\big(\frac{1}{2} + \Delta_i \,\|\, \frac{1}{2}\big)}\right) = \Omega\left(\Delta_i \cdot \frac{\log T}{2\Delta_i^2}\right) = \Omega\left(\frac{\log T}{\Delta_i}\right). \tag{40}$$

Step 5: Transfer to multi-dueling. Summing over all suboptimal arms and applying the reduction:
$$\mathbb{E}[R^C_T] = \mathbb{E}[R^{DB}_T] \ge \Omega\left(\sum_{i \ne a^\star} \frac{\log T}{\Delta_i}\right). \tag{41}$$

E Proofs for Borda Setting

E.1 Transformed Feedback Properties

Lemma E.1 (Transformed Feedback Statistics). The transformed feedback $g_t = (o_t - \alpha_m)/\beta_m$ satisfies:
(i) $\mathbb{E}[g_t \mid x_t, y_t] = P_t(x_t, y_t)$.
(ii) $g_t \in \big[-\frac{\alpha_m}{\beta_m}, \frac{1 - \alpha_m}{\beta_m}\big] \subseteq [-1, 2]$.
(iii) $\mathrm{Var}(g_t \mid x_t, y_t) \le \frac{1}{4\beta_m^2} \le 1$.

Proof. (i) By Lemma 3.3, $\mathbb{E}[o_t \mid x_t, y_t] = \alpha_m + \beta_m P_t(x_t, y_t)$. Thus:
$$\mathbb{E}[g_t \mid x_t, y_t] = \frac{\mathbb{E}[o_t \mid x_t, y_t] - \alpha_m}{\beta_m} = \frac{\alpha_m + \beta_m P_t(x_t, y_t) - \alpha_m}{\beta_m} = P_t(x_t, y_t).$$
(42) (ii) Since o t ∈ { 0 , 1 } : g t ∈ 0 − α m β m , 1 − α m β m = − α m β m , 1 − α m β m . (43) F or m ≥ 2, we hav e β m ≥ 1 / 2 and α m = (1 − β m ) / 2 ≤ 1 / 4. Thus: − α m β m ≥ − 1 / 4 1 / 2 = − 1 2 > − 1 , 1 − α m β m ≤ 1 1 / 2 = 2 . (44) (iii) The v ariance of a Bernoulli-derived random v ariable: V ar( g t | x t , y t ) = V ar( o t | x t , y t ) β 2 m ≤ 1 / 4 β 2 m ≤ 1 / 4 1 / 4 = 1 . (45) E.2 Exploration Phase Analysis Pr o of of L emma 5.7 (Explor ation Guar ante es). (i) Sample c ounts. The round-robin schedule cycles through all K ( K − 1) ordered pairs. With T 0 = 4 K ( K − 1) n ′ 0 where n ′ 0 = ⌈ log (2 K 2 T 2 /δ ) ⌉ , each ordered pair ( i, j ) is observed exactly 4 n ′ 0 = n 0 times. (ii) Pairwise c onc entr ation. Fix pair ( i, j ). Let τ 1 , . . . , τ n 0 b e the rounds where this pair is compared. Define Z k = g τ k − P ( i, j ), whic h satisfies: • E [ Z k ] = 0 (by Lemma E.1 (i)). • | Z k | ≤ 3 (since g τ k ∈ [ − 1 , 2] and P ( i, j ) ∈ [0 , 1]). • V ar( Z k ) ≤ 1 (by Lemma E.1 (iii)). The empirical estimate is: ˆ P T 0 ( i, j ) = 1 n 0 n 0 X k =1 g τ k = P ( i, j ) + 1 n 0 n 0 X k =1 Z k . (46) By Ho effding’s inequality for b ounded random v ariables: Pr h | ˆ P T 0 ( i, j ) − P ( i, j ) | > ϵ i = Pr " 1 n 0 n 0 X k =1 Z k > ϵ # ≤ 2 exp − 2 n 0 ϵ 2 9 . (47) Best-of-Both-Worlds Multi-Dueling Bandits 22 Setting ϵ = q 9 log (4 K 2 /δ ) 2 n 0 and using n 0 = 4 ⌈ log(2 K 2 T 2 /δ ) ⌉ ≥ 4 log(4 K 2 /δ ) (for T ≥ 2): Pr | ˆ P T 0 ( i, j ) − P ( i, j ) | > s 9 log (4 K 2 /δ ) 2 n 0 ≤ 2 exp( − log (4 K 2 /δ )) = δ 2 K 2 . (48) By union b ound o ver K ( K − 1) < K 2 pairs: Pr ∃ ( i, j ) : | ˆ P T 0 ( i, j ) − P ( i, j ) | > s 9 log (4 K 2 /δ ) 2 n 0 ≤ K 2 · δ 2 K 2 = δ 2 . (49) (iii) Bor da sc or e c onc entr ation. By definition: | ˆ b T 0 ( i ) − b ( i ) | = 1 K − 1 X j = i ( ˆ P T 0 ( i, j ) − P ( i, j )) ≤ 1 K − 1 X j = i | ˆ P T 0 ( i, j ) − P ( i, j ) | . 
(50) On the even t that all pairwise estimates are within ϵ of truth: | ˆ b T 0 ( i ) − b ( i ) | ≤ 1 K − 1 · ( K − 1) · ϵ = ϵ = s 9 log (4 K 2 /δ ) 2 n 0 = O r log T n 0 ! . (51) E.3 Deviation Detection Analysis Pr o of of L emma 5.8 (No F alse A larm). W e sho w that in the stochastic setting, the deviation detection nev er triggers with high probability . Fix pair ( i, j ) with i < j . F or rounds t > T 0 where this pair is compared, let τ 1 , τ 2 , . . . denote these rounds in order. Define the martingale: M n = n X k =1 ( g τ k − P ( i, j )) . (52) Step 1: Martingale prop erties. The incremen ts X k = g τ k − P ( i, j ) satisfy: • E [ X k | F k − 1 ] = 0 (martingale prop ert y). • | X k | ≤ 3 (b ounded increments). • E [ X 2 k | F k − 1 ] ≤ 1 (b ounded conditional v ariance). Step 2: F reedman’s inequalit y . By F reedman’s inequalit y , for a martingale ( M n ) with | M n − M n − 1 | ≤ c and predictable quadratic v ariation V n = P n k =1 E [ X 2 k | F k − 1 ]: Pr [ ∃ n ≤ N : | M n | ≥ x and V n ≤ v ] ≤ 2 exp − x 2 2 v + 2 cx/ 3 . (53) With c = 3, v = n (since V n ≤ n ), and x = τ ( √ n + 1): Pr | M n | > τ ( √ n + 1) ≤ 2 exp − τ 2 ( √ n + 1) 2 2 n + 2 τ ( √ n + 1) . (54) F or n ≥ 1 and τ = p 8 log ( K 2 T 2 /δ ): τ 2 ( √ n + 1) 2 2 n + 2 τ ( √ n + 1) ≥ τ 2 n 2 n + 2 τ √ n + 2 τ ≥ τ 2 4 for n ≥ τ 2 . (55) Best-of-Both-Worlds Multi-Dueling Bandits 23 F or n < τ 2 , we use a cruder b ound. In either case: Pr ∃ n ≤ T : | M n | > τ ( √ n + 1) ≤ 2 T · exp − τ 2 8 = 2 T · exp( − log( K 2 T 2 /δ )) = 2 δ K 2 T . (56) Step 3: Connect to detection statistic. The detection statistic is: D n = | S obs t ( i, j ) − S obs T 0 ( i, j ) − n · ˆ P T 0 ( i, j ) | = n X k =1 g τ k − n · ˆ P T 0 ( i, j ) . (57) In the sto c hastic setting, ˆ P T 0 ( i, j ) estimates P ( i, j ). By Lemma 5.7 (ii): | ˆ P T 0 ( i, j ) − P ( i, j ) | ≤ ϵ 0 := O r log T n 0 ! . 
(58) Th us: D n = n X k =1 g τ k − n · P ( i, j ) + n · ( P ( i, j ) − ˆ P T 0 ( i, j )) ≤ | M n | + n · | ˆ P T 0 ( i, j ) − P ( i, j ) | ≤ | M n | + nϵ 0 . (59) F or detection to trigger, w e need D n > τ ( √ n + 1). If | M n | ≤ τ ( √ n + 1) / 2, then we need: nϵ 0 > τ ( √ n + 1) 2 . (60) F or n 0 = Θ(log T ) and τ = Θ( √ log T ), we ha ve ϵ 0 = O (1 / √ log T ). The condition b ecomes: n · O (1 / p log T ) > Ω( p log T · √ n ) , (61) whic h simplifies to √ n > Ω(log T ), i.e., n > Ω(log 2 T ). F or n ≤ T and sufficiently large T , the threshold is not exceeded when | M n | ≤ τ ( √ n + 1) / 2. Step 4: Union b ound. By union b ound ov er all K 2 < K 2 / 2 pairs: Pr[an y false alarm] ≤ K 2 2 · 2 δ K 2 T = δ T ≤ δ 2 . (62) Pr o of of L emma 5.9 (A dversarial Dete ction). Supp ose the adversary shifts preferences suc h that: n X k =1 ( P τ k ( i, j ) − ˆ P T 0 ( i, j )) ≥ 3 τ ( √ n + 1) . (63) Decomp ose the observed cum ulative: n X k =1 g τ k − n · ˆ P T 0 ( i, j ) = n X k =1 ( g τ k − P τ k ( i, j )) | {z } =: M n + n X k =1 ( P τ k ( i, j ) − ˆ P T 0 ( i, j )) | {z } adversarial shift . (64) The term M n is a martingale even under adversarial preferences, since E [ g τ k | F k − 1 , P τ k ] = P τ k ( i, j ). By the same F reedman argumen t as in Lemma 5.8 : Pr[ | M n | > τ ( √ n + 1)] ≤ 2 δ K 2 T 2 . (65) Best-of-Both-Worlds Multi-Dueling Bandits 24 Under condition ( 63 ), with probability at least 1 − 2 δ K 2 T 2 : D n = n X k =1 g τ k − n · ˆ P T 0 ( i, j ) ≥ | adversarial shift | − | M n | ≥ 3 τ ( √ n + 1) − τ ( √ n + 1) = 2 τ ( √ n + 1) > τ ( √ n + 1) . (66) This exceeds the detection threshold, triggering a mo de switch. E.4 Confidence and Elimination Analysis Lemma E.2 (Borda Score Concentration) . During the elimination phase, for any arm i and any ϵ > 0 , let N min ( i ) = min j = i N n ( i, j ) b e the minimum numb er of c omp arisons arm i has against any sp e cific opp onent. Then: Pr[ | ˆ b n ( i ) − b ( i ) | > ϵ ] ≤ 2 K exp − N min ( i ) ϵ 2 5 . 
(67) Pr o of. The empirical Borda score is the un weigh ted av erage of pairwise estimates: ˆ b n ( i ) = 1 K − 1 X j = i ˆ P n ( i, j ) . (68) The error in the Borda score is b ounded by the maximum pairwise error: | ˆ b n ( i ) − b ( i ) | ≤ 1 K − 1 X j = i | ˆ P n ( i, j ) − P ( i, j ) | ≤ max j = i | ˆ P n ( i, j ) − P ( i, j ) | . (69) F or any sp ecific pair ( i, j ) observed N n ( i, j ) times, applying Ho effding’s inequality on the transformed feedbac k g t ∈ [ − 1 , 2] (range length 3) yields: Pr[ | ˆ P n ( i, j ) − P ( i, j ) | > ϵ ] ≤ 2 exp − 2 N n ( i, j ) ϵ 2 9 ≤ 2 exp − N min ( i ) ϵ 2 5 . (70) Applying a union b ound o ver all K − 1 opp onen ts j = i ensures that the maxim um error stays bounded: Pr max j = i | ˆ P n ( i, j ) − P ( i, j ) | > ϵ ≤ 2( K − 1) exp − N min ( i ) ϵ 2 5 ≤ 2 K exp − N min ( i ) ϵ 2 5 . (71) Pr o of of L emma 5.10 (Confidenc e V alidity). Define the confidence radius based on the total pulls of arm i : conf t ( i ) = s 20 K log(2 K 2 T /δ ) N t ( i ) . (72) W e show that | ˆ b t ( i ) − b ( i ) | ≤ conf t ( i ) holds for all i and all t > T 0 with high probability . Step 1: Relating N t ( i ) to N min ( i ) . Because the algorithm samples the opp onen t y t uniformly from [ K ] \ { x t } , the num b er of times arm i plays a s pecific opp onen t j follo ws a binomial distribution. By standard Chernoff b ounds, with high probabilit y , the minimum pairwise count concen trates heavily around its exp ectation: N min ( i ) ≥ N t ( i ) 2( K − 1) ≥ N t ( i ) 2 K . Substituting this into the denominator yields 1 N min ( i ) ≤ 2 K N t ( i ) . Step 2: Fixed-time bound. F or fixed n , let the confidence radius b e conf n ( i ) = q 10 log (2 K 2 T /δ ) N min ( i ) . By Lemma E.2 with ϵ = conf n ( i ): Pr h | ˆ b n ( i ) − b ( i ) | > conf n ( i ) i ≤ 2 K exp − N min ( i ) 5 · 10 log (2 K 2 T /δ ) N min ( i ) = δ K T . 
(73) Best-of-Both-Worlds Multi-Dueling Bandits 25 Using the relationship from Step 1, this radius is bounded b y q 20 K log(2 K 2 T /δ ) N t ( i ) . Step 3: Time-uniform b ound via p eeling. P artition the range of N min ( i ) in to ep o c hs: E ℓ = { t : 2 ℓ − 1 ≤ N min ( i ) < 2 ℓ } , ℓ = 1 , 2 , . . . , ⌈ log 2 T ⌉ . (74) Step 4: Union b ound. Summing o ver O (log T ) ep ochs and K arms: Pr[ E c conf ] ≤ K · log 2 T · δ K T = δ log 2 T T ≤ δ 4 . (75) Pr o of of L emma 5.11 (Elimination Sample Complexity). Conditioned on the confidence even t E conf holding (Lemma 5.10 ), the true optimal arm i ⋆ is never eliminated from the active set C . This is b ecause for any arm j ∈ C , the follo wing holds: ˆ b t ( i ⋆ ) + conf t ( i ⋆ ) ≥ b ( i ⋆ ) ≥ b ( j ) ≥ ˆ b t ( j ) − conf t ( j ) . (76) Therefore, the strict elimination condition ˆ b t ( j ) − conf t ( j ) > ˆ b t ( i ⋆ ) + conf t ( i ⋆ ) can never b e satisfied against the optimal arm. A sub optimal arm j is eliminated from C when there exists an y arm i ∈ C such that: ˆ b t ( i ) − conf t ( i ) > ˆ b t ( j ) + conf t ( j ) . (77) Since the true Borda gap is ∆ B j = b ( i ⋆ ) − b ( j ), and i ⋆ remains in C , this elimination condition is mathe- matically guaran teed to trigger against i ⋆ once: 2conf t ( i ⋆ ) + 2conf t ( j ) < ∆ B j . (78) Because the algorithm selects x t ∈ C using a round-robin schedule, all arms in the active set are pulled symmetrically , implying conf t ( i ⋆ ) ≈ conf t ( j ). Therefore, arm j is guaranteed to b e eliminated once : 4conf t ( j ) ≤ ∆ B j . (79) Substituting the definition of the confidence radius conf t ( j ) = q 20 K log(2 K 2 T /δ ) N t ( j ) : 4 s 20 K log(2 K 2 T /δ ) N t ( j ) ≤ ∆ B j . (80) Squaring both sides and solving for N t ( j ) yields the maxim um num b er of times arm j can be pulled during the elimination phase b efore it is p ermanen tly remov ed: N T ( j ) ≤ 320 K log(2 K 2 T /δ ) (∆ B j ) 2 + 1 . 
(81) E.5 Adv ersarial Mo de Analysis Pr o of of L emma 5.12 (EXP3 R e gr et). Let T ′ = T − t 0 +1 b e the num b er of adv ersarial rounds. As established b y Ga jane ( 2024 ), the primary challenge in Borda estimation under winner-only feedback is that the v ariance of the imp ortance-w eighted estimator scales with P j 1 /q t ( j ). The explicit exploration parameter γ guarantees that q t ( j ) ≥ γ /K for all j ∈ [ K ], strictly bounding the maximum v ariance. Applying the standard EXP3 regret analysis with the mixed distribution q t (see Theorem 1 of Ga jane ( 2024 )) bounds the expected shifted Borda regret o ver T ′ rounds b y: E [ R s T ′ ] ≤ O K 1 / 3 ( T ′ ) 2 / 3 (log K ) 1 / 3 . (82) Best-of-Both-Worlds Multi-Dueling Bandits 26 E.6 Main Theorem Pro ofs Pr o of of The or em 5.3 (Sto chastic R e gr et). Define the high-probability even t E = E conf ∩ E no-switch . By union b ounds ov er the confidence interv als and the martingale detection thresholds, Pr[ E ] ≥ 1 − O (1 /T ), meaning Pr[ E c ] ≤ O (1 /T ). Conditioned on E , the algorithm never switches to adversarial mode, and the true optimal arm i ⋆ is never eliminated from C . Step 1: Regret from Baseline Exploration Phase. During rounds t ≤ T 0 , the instantaneous Borda regret is at most 1. Thus, the initial exploration phase contributes: E [ R B T 0 | E ] ≤ T 0 = 4 K ( K − 1) ⌈ log (2 K 2 T 2 /δ ) ⌉ = O ( K 2 log K T ) . (83) Step 2: Regret from Elimination Phase. F or any round t > T 0 where a suboptimal arm j is selected as x t ∈ C , the opp onent y t is drawn uniformly from [ K ] \ { x t } . The exp ected instan taneous regret is: E [ r t | x t = j, E ] = b ( i ⋆ ) − b ( j ) 2 + 1 K − 1 X k = j b ( k ) 2 = ∆ B j 2 + 1 2( K − 1) X k = j ∆ B k = Θ(1) . (84) By Lemma 5.11 , arm j is pulled at most N T ( j ) ≤ O ( K log T / (∆ B j ) 2 ) times b efore elimination. 
The cumulative expected regret from pulling suboptimal arms during the elimination phase is therefore:
$$\sum_{j \neq i^\star} N_T(j) \cdot \mathbb{E}[r_t \mid x_t = j, E] = O\Bigg(\sum_{j \neq i^\star} \frac{K \log KT}{(\Delta^B_j)^2}\Bigg). \tag{85}$$

Step 3: Regret from the exploitation phase. Once all suboptimal arms are eliminated, $C = \{i^\star\}$. The algorithm plays $x_t = y_t = i^\star$, incurring expected instantaneous regret exactly 0 in all subsequent rounds not subject to background monitoring.

Step 4: Regret from background monitoring. At any round $t > T_0$, the algorithm overrides the elimination logic with probability $p_t = \min\big(1, \frac{K \log t}{t}\big)$ to randomly sample a pair and feed the deviation detection statistic. The instantaneous regret of any pair is bounded by 1, and $\min\big(1, \frac{K \log t}{t}\big) \leq \frac{K \log t}{t}$ whether or not the minimum is attained at 1, so the total expected regret from monitoring is:
$$\sum_{t=1}^{T} p_t \cdot 1 = \sum_{t=1}^{T} \min\Big(1, \frac{K \log t}{t}\Big) \leq \sum_{t=1}^{T} \frac{K \log t}{t} = O(K \log^2 T).$$

Step 5: Regret on the failure event. On the complement event $E^c$, the algorithm's confidence bounds or detection thresholds fail. In the worst case, the algorithm suffers the maximum possible instantaneous regret of 1 in every one of the $T$ rounds. Since $\Pr[E^c] \leq O(1/T)$, the expected regret contribution from this failure case is bounded:
$$\mathbb{E}\big[R^B_T \cdot \mathbb{1}[E^c]\big] \leq T \cdot O\Big(\frac{1}{T}\Big) = O(1). \tag{86}$$

Step 6: Total regret. Combining the expected regret conditioned on $E$ (Steps 1, 2, and 4) with the expected regret on the failure event (Step 5), the total expected stochastic regret is:
$$\mathbb{E}[R^B_T] \leq O\Bigg(K^2 \log KT + K \log^2 T + \sum_{i : \Delta^B_i > 0} \frac{K \log KT}{(\Delta^B_i)^2}\Bigg). \tag{87}$$

Proof of Theorem 5.4 (Adversarial Regret). We analyze two cases based on whether a mode switch occurs.

Case 1: A switch occurs at round $t_{\mathrm{sw}} \leq T$.

Exploration regret:
$$R^B_{T_0} \leq T_0 = O(K^2 \log T).$$
(88) Pr e-switch r e gr et ( T 0 < t < t sw ): During this phase, the algorithm runs Successive Elimination and Bac kground Monitoring. The adv ersary can manipulate preferences to cause regret while staying b elow the detection threshold. F or each pair ( i, j ), detection o ccurs when the cumulativ e deviation exceeds τ ( √ n + 1) where n = N t ( i, j ) − N T 0 ( i, j ). The maxim um undetected deviation p er pair is O ( τ ( √ T + 1)). The total adversarial manipulation across all K 2 pairs b efore detection is: R pre ≤ X ( i,j ) O ( τ p N t sw ( i, j )) ≤ O τ K s X ( i,j ) N t sw ( i, j ) = O ( τ K √ T ) , (89) where we used Cauch y-Sch warz and P ( i,j ) N t ( i, j ) ≤ 2 T . Since τ = O ( √ log K T ): R pre ≤ O ( K p T log K T ) . (90) Post-switch r e gr et ( t ≥ t sw ): After switching, the algorithm runs EXP3. By Lemma 5.12 : E [ R post ] ≤ O ( K 1 / 3 ( T − t sw ) 2 / 3 (log K ) 1 / 3 ) ≤ O ( K 1 / 3 T 2 / 3 (log K ) 1 / 3 ) . (91) Case 2: No switc h o ccurs ( t sw > T ). If no switch occurs throughout T rounds, the adversary’s cumulativ e deviation on each pair is b ounded b y the detection threshold. The total regret from adversarial deviations is: R adv T ≤ X ( i,j ) O ( τ ( p N T ( i, j ) + 1)) ≤ O ( τ K 2 ) + O τ K √ T = O ( K 2 p log K T + K p T log K T ) . (92) Com bined b ound. In both cases, the total adversarial regret is b ounded by the sum of the exploration o verhead, the maxim um undetected pre-switch deviation, and the p ost-switc h EXP3 regret: E [ R B T ] ≤ O ( K 2 log K T ) + O ( K p T log K T ) + O ( K 1 / 3 T 2 / 3 (log K ) 1 / 3 ) . (93) F or T ≥ K 2 , the initial exploration phase O ( K 2 log K T ) is strictly dominated b y the pre-switc h deviation term O ( K √ T log K T ). Ho wev er, neither the pre-switch deviation nor the p ost-switc h EXP3 regret strictly dominates the other across all time horizons. 
The pre-switch detection delay dominates for moderate horizons ($T < K^4 / \log^2 K$), while the EXP3 regret dominates for very large horizons ($T > K^4 / \log^2 K$). Therefore, the final bound retains both terms:
$$\mathbb{E}[R^B_T] \leq O\big(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\big). \tag{94}$$

F Proofs for Extensions

We provide the formal corollaries and proofs for the time-varying subset size extension discussed in Section 6.

Corollary F.1 (Time-Varying Subset Size: Condorcet Setting). Theorem 4.1 and Corollary 4.2 hold when the subset size $m_t \in \{2, \ldots, K\}$ varies arbitrarily across rounds.

Proof. The proof of Lemma 3.7 holds for each round independently: at round $t$, the rescaling factor $\beta_{m_t}$ depends only on the current subset size $m_t$. The key observations are:
(i) The feedback $o_t$ remains an unbiased estimate of $\hat{P}_t(x_t, y_t) = \alpha_{m_t} + \beta_{m_t} P_t(x_t, y_t)$.
(ii) The base algorithm operates on the rescaled preferences and is agnostic to the specific value of $\beta_{m_t}$.
(iii) The regret equivalence $\mathbb{E}[r^C_t] = \frac{1}{\beta_{m_t}} \mathbb{E}[\hat{r}^{DB}_t]$ holds pointwise.
Since $\beta_{m_t} \geq 1/2$ for all $m_t \geq 2$, the bounds carry through with the same constants.

Corollary F.2 (Time-Varying Subset Size: Borda Setting). Theorems 5.3 and 5.4 hold when $m_t \in \{2, \ldots, K\}$ varies arbitrarily across rounds, provided the transformed feedback $g_t$ from MetaDueling uses the current $m_t$.

Proof. The transformed feedback $g_t = (o_t - \alpha_{m_t})/\beta_{m_t}$ remains an unbiased estimate of $P_t(x_t, y_t)$ for any $m_t$. The stochastic analysis (exploration, deviation detection) and the adversarial analysis (EXP3) are unaffected by varying $m_t$.

G Additional Technical Results

G.1 Concentration Inequalities

We state the concentration inequalities used throughout the proofs.

Lemma G.1 (Hoeffding's Inequality). Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$.
Then for $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$:
$$\Pr\big[|\bar{X} - \mathbb{E}[\bar{X}]| \geq t\big] \leq 2 \exp\Bigg(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Bigg). \tag{95}$$

Lemma G.2 (Freedman's Inequality). Let $(M_n)_{n \geq 0}$ be a martingale with $M_0 = 0$ and $|M_n - M_{n-1}| \leq c$ almost surely. Let $V_n = \sum_{k=1}^{n} \mathbb{E}[(M_k - M_{k-1})^2 \mid \mathcal{F}_{k-1}]$ be the predictable quadratic variation. Then for any $x, v > 0$:
$$\Pr[\exists n : M_n \geq x \text{ and } V_n \leq v] \leq \exp\Big(-\frac{x^2}{2v + 2cx/3}\Big). \tag{96}$$
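As a numerical sanity check on Lemma E.1's de-biasing of the winner feedback, the affine transform $g = (o - \alpha_m)/\beta_m$ can be verified directly. The exact formula for $\beta_m$ is defined earlier in the paper (Lemma 3.3) and not restated in this appendix, so the sketch below leaves $\beta$ as a free parameter satisfying $\beta \geq 1/2$ with $\alpha = (1-\beta)/2$, as assumed in the lemma.

```python
import random

def transform(o: int, beta: float) -> float:
    """g = (o - alpha) / beta with alpha = (1 - beta) / 2, the Lemma E.1 setting."""
    alpha = (1 - beta) / 2
    return (o - alpha) / beta

# Range check (Lemma E.1 (ii)): for any beta >= 1/2, g lands in [-1, 2].
for beta in (0.5, 0.6, 0.75, 1.0):
    lo, hi = transform(0, beta), transform(1, beta)
    assert -1 <= lo < hi <= 2

# Unbiasedness check (Lemma E.1 (i)): if Pr[o = 1] = alpha + beta * p,
# then E[g] = p; verified here by a seeded Monte Carlo estimate.
random.seed(0)
beta, p = 0.6, 0.7
alpha = (1 - beta) / 2
n = 200_000
mean_g = sum(transform(int(random.random() < alpha + beta * p), beta)
             for _ in range(n)) / n
assert abs(mean_g - p) < 0.01
```

The variance bound of Lemma E.1 (iii) explains why the Monte Carlo estimate is tight: each $g$ has variance at most $1/(4\beta^2) \leq 1$, so the mean of $2 \times 10^5$ samples concentrates well within the $0.01$ tolerance.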