Composite Binary Losses
Authors: Mark D. Reid, Robert C. Williamson
Mark D. Reid
Australian National University and NICTA
Canberra ACT 0200, Australia
Mark.Reid@anu.edu.au

Robert C. Williamson
Australian National University and NICTA
Canberra ACT 0200, Australia
Bob.Williamson@anu.edu.au

October 25, 2018

Abstract

We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses, which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses, and give a complete characterisation of the relationship between proper losses and "classification calibrated" losses. We also consider the question of the "best" surrogate binary loss. We introduce a precise notion of "best" and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of "surrogate tuning". Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses, and show that all convex proper losses are non-robust to misclassification noise.

1 Introduction

A loss function is the means by which a learning algorithm's performance is judged. A binary loss function is a loss for a supervised prediction problem where there are two possible labels associated with the examples.
A composite loss is the composition of a proper loss (defined below) and a link function (also defined below). In this paper we study composite binary losses and develop a number of new characterisation results. Informally, proper losses are well-calibrated losses for class probability estimation, that is, for the problem of not only predicting a binary classification label, but providing an estimate of the probability that an example will have a positive label. Link functions are often used to map the outputs of a predictor to the interval [0, 1] so that they can be interpreted as probabilities. Having such probabilities is often important in applications, and there has been considerable interest in understanding how to get accurate probability estimates [36, 16, 10] and in understanding the implications of requiring that loss functions provide good probability estimates [6].

Much previous work in the machine learning literature has focussed on margin losses, which intrinsically treat positive and negative classes symmetrically. However, it is now well understood how important it is to be able to deal with the non-symmetric case [2, 12, 8, 9, 37]. A key goal of the present work is to consider composite losses in the general (non-symmetric) situation.

Having the flexibility to choose a loss function is important in order to "tailor" the solution to a machine learning problem; confer [18, 19, 9]. Understanding the structure of the set of loss functions, and having natural parametrisations of them, is useful for this purpose. Even when one is using a loss as a surrogate for the loss one would ideally like to minimise, it is helpful to have an easy-to-use parametrisation (see the discussion of "surrogate tuning" in the Conclusion).

The paper is structured as follows. In §2 we introduce the notions of a loss and the conditional and full risks, which we will make extensive use of throughout the paper.
In §3 we introduce losses for Class Probability Estimation (CPE), define some technical properties of them, and present some structural results. We introduce and exploit Savage's characterisation of proper losses and use it to characterise proper symmetric CPE losses.

In §4 we define composite losses formally and characterise when a loss is a proper composite loss in terms of its partial losses. We introduce a natural and intrinsic parametrisation of proper composite losses and characterise when a margin loss can be a proper composite loss. We also show the relationship between regret and Bregman divergences for general composite losses.

In §5 we characterise the relationship between classification calibrated losses (as studied for example by Bartlett et al. [7]) and proper composite losses.

In §6, motivated by the question of which is the best surrogate loss, we characterise when a proper composite loss is convex in terms of the natural parametrisation of such losses.

In §7 we study surrogate losses, making use of some of the earlier material in the paper. A surrogate loss function is a loss function which is not exactly what one wishes to minimise but is easier to work with algorithmically. We define a well-founded notion of "best" surrogate loss and show that some convex surrogate losses are incommensurable on some problems. We also study other notions of "best" and explicitly determine the surrogate loss that has the best surrogate regret bound in a certain sense. Finally, in §8 we draw some more general conclusions.

Appendix C builds upon some of the results in the main paper and presents some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses, and shows that all convex proper losses are non-robust to misclassification noise.
2 Losses and Risks

We write x ∧ y := min(x, y), and ⟦p⟧ = 1 if the predicate p is true and ⟦p⟧ = 0 otherwise.¹ The generalised function δ(·) is defined by ∫ₐᵇ δ(x)f(x) dx = f(0) when f is continuous at 0 and a < 0 < b. Random variables are written in sans-serif font: X, Y.

Given a set of labels Y := {−1, 1} and a set of prediction values V, we will say a loss is any function² ℓ: Y × V → [0, ∞). We interpret such a loss as giving a penalty ℓ(y, v) for predicting the value v when the observed label is y. We can always write an arbitrary loss in terms of its partial losses ℓ₁ := ℓ(1, ·) and ℓ₋₁ := ℓ(−1, ·) using

    ℓ(y, v) = ⟦y = 1⟧ ℓ₁(v) + ⟦y = −1⟧ ℓ₋₁(v).    (1)

Our definition of a loss function covers all commonly used margin losses (i.e. those which can be expressed as ℓ(y, v) = φ(yv) for some function φ: ℝ → [0, ∞)), such as the 0–1 loss ℓ(y, v) = ⟦yv ≤ 0⟧, the hinge loss ℓ(y, v) = max(1 − yv, 0), the logistic loss ℓ(y, v) = log(1 + e^{−yv}), and the exponential loss ℓ(y, v) = e^{−yv} commonly used in boosting. It also covers class probability estimation losses, where the predicted values η̂ ∈ V = [0, 1] are directly interpreted as probability estimates.³ We will use η̂ instead of v as an argument to indicate losses for class probability estimation, and use the shorthand CPE losses to distinguish them from general losses. For example, the square loss has partial losses ℓ₋₁(η̂) = η̂² and ℓ₁(η̂) = (1 − η̂)², the log loss has ℓ₋₁(η̂) = −log(1 − η̂) and ℓ₁(η̂) = −log(η̂), and the family of cost-weighted misclassification losses parametrised by c ∈ (0, 1) is given by

    ℓ_c(−1, η̂) = c ⟦η̂ ≥ c⟧  and  ℓ_c(1, η̂) = (1 − c) ⟦η̂ < c⟧.    (2)
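As a concrete sketch, the decomposition (1) and the CPE examples above translate directly into code. The function names below are ours, not notation from the paper:

```python
import math

def square_partial(y, eta_hat):
    # square loss: l_1(p) = (1 - p)^2, l_{-1}(p) = p^2
    return (1 - eta_hat) ** 2 if y == 1 else eta_hat ** 2

def log_partial(y, eta_hat):
    # log loss: l_1(p) = -log(p), l_{-1}(p) = -log(1 - p)
    return -math.log(eta_hat) if y == 1 else -math.log(1 - eta_hat)

def cost_weighted_partial(y, eta_hat, c):
    # cost-weighted misclassification loss (2), with cost parameter c in (0, 1)
    if y == 1:
        return (1 - c) * (1.0 if eta_hat < c else 0.0)
    return c * (1.0 if eta_hat >= c else 0.0)

def loss(y, eta_hat, partial):
    # the decomposition (1): select the partial loss matching the label
    return partial(1, eta_hat) if y == 1 else partial(-1, eta_hat)
```

For instance, `loss(-1, 0.25, square_partial)` evaluates ℓ₋₁(0.25) = 0.25² for the square loss.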
2.1 Conditional and Full Risks

Suppose we have random examples X with associated labels Y ∈ {−1, 1}. The joint distribution of (X, Y) is denoted P and the marginal distribution of X is denoted M. Let the observation-conditional density be η(x) := Pr(Y = 1 | X = x). Thus one can specify an experiment by either P or (η, M).

If η ∈ [0, 1] is the probability of observing the label y = 1, the point-wise risk (or conditional risk) of the estimate v ∈ V is defined as the η-average of the loss for v:

    L(η, v) := E_{Y∼η}[ℓ(Y, v)] = η ℓ₁(v) + (1 − η) ℓ₋₁(v).

Here Y ∼ η is shorthand for labels being drawn from a Bernoulli distribution with parameter η. When η: X → [0, 1] is an observation-conditional density, taking the M-average of the point-wise risk gives the (full) risk of the estimator v, now interpreted as a function v: X → V:

    𝕃(η, v, M) := E_{X∼M}[L(η(X), v(X))].

We sometimes write 𝕃(v, P) for 𝕃(η, v, M) where (η, M) corresponds to the joint distribution P. We write ℓ, L and 𝕃 for the loss, the point-wise risk and the full risk throughout this paper. The Bayes risk is the minimal achievable value of the risk and is denoted

    𝕃(η, M) := inf_{v ∈ V^X} 𝕃(η, v, M) = E_{X∼M}[L(η(X))],

where [0, 1] ∋ η ↦ L(η) := inf_{v ∈ V} L(η, v) is the point-wise or conditional Bayes risk. There has been increasing awareness of the importance of the conditional Bayes risk curve L(η), also known as "generalized entropy" [17], in the analysis of losses for probability estimation [23, 24, 1, 32].

¹ This is the Iverson bracket notation as recommended by Knuth [27].
² Restricting the output of a loss to [0, ∞) is equivalent to assuming the loss has a lower bound and then translating its output.
³ These are known as scoring rules in the statistical literature [16].
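The conditional risk and the conditional Bayes risk can be sketched numerically. For the log loss, L(η) is the Shannon entropy of η, which the grid minimisation below recovers; the grid search and function names are ours, for illustration only:

```python
import math

def conditional_risk(eta, eta_hat, l1, lm1):
    # L(eta, eta_hat) = eta * l_1(eta_hat) + (1 - eta) * l_{-1}(eta_hat)
    return eta * l1(eta_hat) + (1 - eta) * lm1(eta_hat)

def conditional_bayes_risk(eta, l1, lm1, grid=10000):
    # L(eta) = inf over predictions, approximated on a grid of midpoints in (0, 1)
    return min(
        conditional_risk(eta, (i + 0.5) / grid, l1, lm1) for i in range(grid)
    )

# log loss partials
log_l1 = lambda p: -math.log(p)
log_lm1 = lambda p: -math.log(1 - p)
```

Properness of the log loss shows up here as the fact that the grid minimiser sits at η̂ ≈ η, and the minimum value matches the entropy −η log η − (1 − η) log(1 − η).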
Below we will see how it is effectively the curvature of L that determines much of the structure of these losses.

3 Losses for Class Probability Estimation

We begin by considering CPE losses, that is, functions ℓ: {−1, 1} × [0, 1] → [0, ∞), and briefly summarise a number of important existing structural results for proper losses, a large, natural class of losses for class probability estimation.

3.1 Proper, Fair, Definite and Regular Losses

There are a few properties of losses for probability estimation that we will require. If η̂ is to be interpreted as an estimate of the true positive class probability η (i.e., the probability of the label y = 1), then it is desirable to require that L(η, η̂) be minimised by η̂ = η for all η ∈ [0, 1]. Losses that satisfy this constraint are said to be Fisher consistent and are known as proper losses [9, 16]. That is, a proper loss ℓ satisfies L(η) = L(η, η) for all η ∈ [0, 1]. A strictly proper loss is a proper loss for which the minimiser of L(η, η̂) over η̂ is unique.

We will say a loss is fair whenever

    ℓ₋₁(0) = ℓ₁(1) = 0.    (3)

That is, there is no loss incurred for perfect prediction. The main place fairness is relied upon is in the integral representation of Theorem 6, where it is used to get rid of some constants of integration.

In order to explicitly construct losses from their associated "weight functions" as shown in Theorem 7, we will require that the loss be definite, that is, its point-wise Bayes risk for deterministic events (i.e., η = 0 or η = 1) must be bounded from below:

    L(0) > −∞,  L(1) > −∞.    (4)

Since properness of a loss ensures L(η) = L(η, η), we see that a fair proper loss is necessarily definite, since L(0, 0) = ℓ₋₁(0) = 0 > −∞, and similarly for L(1, 1). Conversely, if a proper loss is definite then the finite values ℓ₋₁(0) and ℓ₁(1) can be subtracted from ℓ(−1, ·) and ℓ(1, ·) to make it fair.
Finally, for Theorem 4 to hold at the endpoints of the unit interval, we require a loss to be regular⁴; that is,

    lim_{η↘0} η ℓ₁(η) = lim_{η↗1} (1 − η) ℓ₋₁(η) = 0.    (5)

Intuitively, this condition ensures that making mistakes on events that never happen should not incur a penalty. Most of the situations we consider in the remainder of this paper involve losses which are proper, fair, definite and regular.

3.2 The Structure of Proper Losses

A key result in the study of proper losses is originally due to Shuford et al. [45], though our presentation follows that of Buja et al. [9]. It characterises proper losses for probability estimation via a constraint on the relationship between their partial losses.

Theorem 1  Suppose ℓ: {−1, 1} × [0, 1] → ℝ is a loss and that its partial losses ℓ₁ and ℓ₋₁ are both differentiable. Then ℓ is a proper loss if and only if for all η̂ ∈ (0, 1)

    −ℓ′₁(η̂)/(1 − η̂) = ℓ′₋₁(η̂)/η̂ = w(η̂)    (6)

for some weight function w: (0, 1) → ℝ₊ such that ∫_ε^{1−ε} w(c) dc < ∞ for all ε > 0.

The equalities in (6) should be interpreted in the L₁ sense. This simple characterisation of the structure of proper losses has a number of interesting implications. Observe from (6) that if ℓ is proper, given ℓ₁ we can determine ℓ₋₁, or vice versa. Also, the partial derivative of the conditional risk can be seen to be the product of a linear term and the weight function:

Corollary 2  If ℓ is a differentiable proper loss then for all η ∈ [0, 1]

    (∂/∂η̂) L(η, η̂) = (1 − η) ℓ′₋₁(η̂) + η ℓ′₁(η̂) = (η̂ − η) w(η̂).    (7)

Another corollary, observed by Buja et al. [9], is that the weight function is related to the curvature of the conditional Bayes risk L.

Corollary 3  Let ℓ be a twice differentiable⁵ proper loss with weight function w defined as in equation (6).
Then for all c ∈ (0, 1) its conditional Bayes risk L satisfies

    w(c) = −L″(c).    (8)

⁴ This is equivalent to the conditions of Savage [41] and Schervish [42].
⁵ The restriction to differentiable losses can be removed in most cases if generalised weight functions, that is, possibly infinite but defining a measure on (0, 1), are permitted. For example, the weight function for the 0–1 loss is w(c) = δ(c − ½).

One immediate consequence of this corollary is that the conditional Bayes risk for a proper loss is always concave. Along with an extra constraint, this gives another characterisation of proper losses [41, 38].

Theorem 4 (Savage)  A loss function ℓ is proper if and only if its point-wise Bayes risk L(η) is concave and for each η, η̂ ∈ (0, 1)

    L(η, η̂) = L(η̂) + (η − η̂) L′(η̂).    (9)

Furthermore, if ℓ is regular this characterisation also holds at the endpoints η, η̂ ∈ {0, 1}.

This link between losses and concave functions makes it easy to establish a connection, as Buja et al. [9] do, between the regret ΔL(η, η̂) := L(η, η̂) − L(η) for proper losses and Bregman divergences. The latter are generalisations of distances and are defined in terms of convex functions. Specifically, if f: S → ℝ is a convex function over some convex set S ⊆ ℝⁿ, then its associated Bregman divergence⁶ is

    D_f(s, s′) := f(s) − f(s′) − ⟨s − s′, ∇f(s′)⟩

for any s, s′ ∈ S, where ∇f(s′) is the gradient of f at s′. By noting that over S = [0, 1] we have ∇f = f′, these definitions lead immediately to the following corollary of Theorem 4.

Corollary 5  If ℓ is a proper loss then its regret is the Bregman divergence associated with f = −L. That is,

    ΔL(η, η̂) = D_{−L}(η, η̂).    (10)

Many of the above results can be observed graphically by plotting the conditional risk for a proper loss as in Figure 1.
Here we see that the two partial losses on the left and right sides of the figure are related, for each fixed η̂, by the linear map η ↦ L(η, η̂) = (1 − η) ℓ₋₁(η̂) + η ℓ₁(η̂). For each fixed η, the properness of ℓ requires that these convex combinations of the partial losses (each slice parallel to the left and right faces) are minimised when η̂ = η. Thus, the lines joining the partial losses are tangent to the conditional Bayes risk curve η ↦ L(η) = L(η, η) shown above the dotted diagonal. Since the conditional Bayes risk curve is the lower envelope of these tangents, it is necessarily concave. The coupling of the partial losses via the tangents to the conditional Bayes risk curve demonstrates why much of the structure of proper losses is determined by the curvature of L, that is, by the weight function w.

The relationship between a proper loss and its associated weight function is captured succinctly via the following representation of proper losses as a weighted integral of the cost-weighted misclassification losses ℓ_c defined in (2). The reader is referred to [39] for the details, proof and history of this result.

⁶ A concise summary of Bregman divergences and their properties is given by Banerjee et al. [4, Appendix A].

Figure 1: The structure of the conditional risk L(η, η̂) for a proper loss (grey surface). The loss is the log loss; its partials ℓ₋₁(η̂) = −log(1 − η̂) and ℓ₁(η̂) = −log(η̂) are shown on the side faces of the box (blue curves). The conditional Bayes risk is the (green) curve above the dotted line η̂ = η. The (red) line connecting points on the partial loss curves shows the conditional risk for a fixed prediction η̂.
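The identities (6), (8) and (10) are easy to check numerically for the log loss: both ratios in (6) give w(η̂) = 1/(η̂(1 − η̂)), (8) recovers the same w from the curvature of L(η) = −η log η − (1 − η) log(1 − η), and the regret equals the Bregman divergence of −L (here the binary KL divergence). A sketch using finite differences, with our own helper names:

```python
import math

l1 = lambda p: -math.log(p)          # log loss partials
lm1 = lambda p: -math.log(1 - p)
L_bayes = lambda p: -p * math.log(p) - (1 - p) * math.log(1 - p)

def d(f, x, h=1e-6):
    # central first difference
    return (f(x + h) - f(x - h)) / (2 * h)

def weight_via_6(p):
    # the two expressions in (6); they should agree
    return -d(l1, p) / (1 - p), d(lm1, p) / p

def weight_via_8(p, h=1e-5):
    # (8): w(c) = -L''(c), via a central second difference
    return -(L_bayes(p + h) - 2 * L_bayes(p) + L_bayes(p - h)) / h ** 2

def regret(eta, eta_hat):
    # Delta L(eta, eta_hat) = L(eta, eta_hat) - L(eta)
    return eta * l1(eta_hat) + (1 - eta) * lm1(eta_hat) - L_bayes(eta)

def bregman_negL(eta, eta_hat):
    # (10): D_f with f = -L; here f'(p) = log(p / (1 - p))
    f = lambda p: -L_bayes(p)
    return f(eta) - f(eta_hat) - (eta - eta_hat) * math.log(eta_hat / (1 - eta_hat))
```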
Table 1: Weight functions and associated partial losses.

    w(c)                  ℓ₋₁(η̂)                               ℓ₁(η̂)                            Loss
    2δ(½ − c)             ⟦η̂ > ½⟧                              ⟦η̂ ≤ ½⟧                          0–1
    δ(c − c₀)             c₀ ⟦η̂ ≥ c₀⟧                          (1 − c₀) ⟦η̂ < c₀⟧                ℓ_{c₀}, c₀ ∈ [0, 1]
    1/((1 − c)² c)        η̂/(1 − η̂)                            ln((1 − η̂)/η̂) − 1                —
    1                     η̂²/2                                 (1 − η̂)²/2                       Square
    1/((1 − c) c)         −ln(1 − η̂)                           −ln(η̂)                           Log
    1/((1 − c)² c²)       ln(η̂/(1 − η̂)) − (1 − 2η̂)/(1 − η̂)     ln((1 − η̂)/η̂) + (1 − 2η̂)/η̂       —
    1/[(1 − c) c]^{3/2}   2√(η̂/(1 − η̂))                        2√((1 − η̂)/η̂)                    Boosting

Theorem 6  Let ℓ: Y × [0, 1] → ℝ be a fair, proper loss. Then for each η̂ ∈ (0, 1) and y ∈ Y,

    ℓ(y, η̂) = ∫₀¹ ℓ_c(y, η̂) w(c) dc,    (11)

where w = −L″. Conversely, if ℓ is defined by (11) for some weight function w: (0, 1) → [0, ∞), then it is proper.

Some example losses and their associated weight functions are given in Table 1. Buja et al. [9] show that ℓ is strictly proper if and only if w(c) > 0 in the sense that w has non-zero mass on every open subset of (0, 1). The following theorem from Reid and Williamson [38] shows how to explicitly construct a loss in terms of a weight function.

Theorem 7  Given a weight function w: [0, 1] → [0, ∞), let W(t) = ∫^t w(c) dc and W̄(t) = ∫^t W(c) dc. Then the loss ℓ_w defined by

    ℓ_w(y, η̂) = −W̄(η̂) − (⟦y = 1⟧ − η̂) W(η̂)    (12)

is a proper loss. Additionally, if W̄(0) and W̄(1) are both finite, then

    ℓ_w(y, η̂) + (W̄(1) − W̄(0)) ⟦y = 1⟧ + W̄(0)    (13)

is a fair, proper loss.

Observe that if two weight functions differ only on a set of measure zero, then they lead to the same loss. A simple corollary of Theorem 6 is that the partial losses are given by

    ℓ₁(η̂) = ∫_{η̂}¹ (1 − c) w(c) dc  and  ℓ₋₁(η̂) = ∫₀^{η̂} c w(c) dc.    (14)

3.3 Symmetric Losses

We will say a loss is symmetric if ℓ₁(η̂) = ℓ₋₁(1 − η̂) for all η̂ ∈ [0, 1].
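Before turning to symmetric losses in detail, the representation (14) can be sanity-checked numerically: integrating a weight function recovers the corresponding partial losses in Table 1. A sketch using a simple midpoint rule (the quadrature settings and names are ours):

```python
def partials_from_weight(w, eta_hat, n=100000):
    # (14): l_{-1}(p) = int_0^p c w(c) dc,  l_1(p) = int_p^1 (1 - c) w(c) dc
    def midpoint(f, a, b):
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    lm1 = midpoint(lambda c: c * w(c), 0.0, eta_hat)
    l1 = midpoint(lambda c: (1 - c) * w(c), eta_hat, 1.0)
    return lm1, l1
```

With w ≡ 1 this reproduces the square-loss row η̂²/2 and (1 − η̂)²/2; with w(c) = [(1 − c)c]^{−3/2} it approximates the boosting row 2√(η̂/(1 − η̂)) (the integrand there is singular at the endpoints, so the midpoint rule converges slowly but visibly).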
We say a weight function for a proper loss, or its conditional Bayes risk, is symmetric if w(c) = w(1 − c), or L(c) = L(1 − c), for all c ∈ [0, 1]. Perhaps unsurprisingly, an immediate consequence of Theorem 1 is that these two notions are identical.

Corollary 8  A proper loss is symmetric if and only if its weight function is symmetric.

Requiring a loss to be proper and symmetric constrains the partial losses significantly. Properness alone completely specifies one partial loss from the other. Now suppose in addition that ℓ is symmetric. Combining ℓ₁(η̂) = ℓ₋₁(1 − η̂) with (6) implies

    ℓ′₋₁(1 − η̂) = ((1 − η̂)/η̂) ℓ′₋₁(η̂).    (15)

This shows that ℓ₋₁ is completely determined by ℓ₋₁(η̂) for η̂ ∈ [0, ½] (or η̂ ∈ [½, 1]). Thus, in order to specify a symmetric proper loss, one needs only to specify one of the partial losses on one half of the interval [0, 1]. Assuming ℓ₋₁ is continuous at ½ (or, equivalently, that w has no atoms at ½), by integrating both sides of (15) we can derive an explicit formula for the other half of ℓ₋₁ in terms of that which is specified:

    ℓ₋₁(η̂) = ℓ₋₁(½) + ∫_{½}^{η̂} (x/(1 − x)) ℓ′₋₁(1 − x) dx,    (16)

which works for determining ℓ₋₁ on either [0, ½] or [½, 1] when ℓ₋₁ is specified on [½, 1] or [0, ½] respectively (recalling the usual convention that ∫ₐᵇ = −∫ᵇₐ). We have thus shown:

Theorem 9  If a loss is proper and symmetric, then it is completely determined by specifying one of the partial losses on half the unit interval (either [0, ½] or [½, 1]) and using (15) and (16).

We demonstrate (16) with four examples. Suppose that ℓ₋₁(η̂) = 1/(1 − η̂) for η̂ ∈ [0, ½]. Then one can readily determine the complete partial loss to be

    ℓ₋₁(η̂) = ⟦η̂ ≤ ½⟧ (1/(1 − η̂)) + ⟦η̂ > ½⟧ (2 + log(η̂/(1 − η̂))).    (17)

Suppose instead that ℓ₋₁(η̂) = 1/(1 − η̂) for η̂ ∈ [½, 1].
In that case we obtain

    ℓ₋₁(η̂) = ⟦η̂ ≤ ½⟧ (2 + log(η̂/(1 − η̂))) + ⟦η̂ ≥ ½⟧ (1/(1 − η̂)).    (18)

Suppose ℓ₋₁(η̂) = 1/(1 − η̂)² for η̂ ∈ [0, ½]. Then one can determine that

    ℓ₋₁(η̂) = ⟦η̂ < ½⟧ (1/(1 − η̂)²) + ⟦η̂ ≥ ½⟧ (4η̂ + 2(2η̂ + η̂ log η̂ − η̂ log(1 − η̂) − 1))/η̂.

Finally, consider specifying that ℓ₋₁(η̂) = η̂ for η̂ ∈ [0, ½]. In this case we obtain

    ℓ₋₁(η̂) = ⟦η̂ ≤ ½⟧ η̂ + ⟦η̂ ≥ ½⟧ (1 − log 2 − η̂ − log(1 − η̂)).

4 Composite Losses

General loss functions are often constructed with the aid of a link function. For a particular set of prediction values V this is any continuous mapping ψ: [0, 1] → V. In this paper, our focus will be composite losses for binary class probability estimation: the composition of a CPE loss ℓ: {−1, 1} × [0, 1] → ℝ and the inverse of a link function ψ, an invertible mapping from the unit interval to some range of values. Unless stated otherwise we will assume ψ: [0, 1] → ℝ. We will denote a composite loss by

    ℓ^ψ(y, v) := ℓ(y, ψ⁻¹(v)).    (19)

The classical motivation for link functions [33] is that often in estimating η one uses a parametric representation of η̂: X → [0, 1] which has a natural scale not matching [0, 1]. Traditionally one writes η̂ = ψ⁻¹(ĥ), where ψ⁻¹ is the "inverse link" (and ψ is of course the forward link). The function ĥ: X → ℝ is the hypothesis. Often ĥ = ĥ_α is parametrised linearly in a parameter vector α. In such a situation it is computationally convenient if L(η, ψ⁻¹(ĥ)) is convex in ĥ (which implies it is convex in α when ĥ_α is linear in α). Often one will choose the loss first (tailoring its properties by the weighting given according to w(c)), and then choose the link somewhat arbitrarily to map the hypotheses appropriately.
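As a sketch of (19): composing the log loss with the logit link ψ(η̂) = log(η̂/(1 − η̂)) yields the familiar logistic margin loss log(1 + e^{−yv}) on real-valued predictions. The function names below are ours:

```python
import math

def logit(p):
    # forward link psi
    return math.log(p / (1 - p))

def inv_logit(v):
    # inverse link psi^{-1}
    return 1 / (1 + math.exp(-v))

def log_loss(y, p):
    # CPE log loss
    return -math.log(p) if y == 1 else -math.log(1 - p)

def composite_log_loss(y, v):
    # (19): l^psi(y, v) = l(y, psi^{-1}(v))
    return log_loss(y, inv_logit(v))
```

For example, `composite_log_loss(1, v)` equals log(1 + e^{−v}), the logistic loss evaluated at margin v.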
An interesting alternative perspective arises in the literature on "elicitability". Lambert et al. [28]⁷ provide a general characterisation of proper scoring rules (i.e. losses) for general properties of distributions, that is, continuous and locally non-constant functions Γ which assign a real value to each distribution over a finite sample space. In the binary case, these properties provide another interpretation of links that is complementary to the usual one, which treats the inverse link ψ⁻¹ as a way of interpreting scores as class probabilities. To see this, we first identify distributions over {−1, 1} with the probability η of observing 1. In this case properties are continuous, locally non-constant maps Γ: [0, 1] → ℝ. When a link function ψ is continuous it can therefore be interpreted as a property, since its assumed invertibility implies it is locally non-constant. A property Γ is said to be elicitable whenever there exists a strictly proper loss ℓ for it, so that the composite loss ℓ^Γ satisfies, for all η̂ ≠ η,

    L^Γ(η, η̂) := E_{Y∼η}[ℓ^Γ(Y, η̂)] > L^Γ(η, η).

Theorem 1 of [28] shows that Γ is elicitable if and only if Γ⁻¹(r) is convex for all r ∈ range(Γ). This immediately gives us a characterisation of "proper" link functions, those that are both continuous and have convex level sets in [0, 1]: they are the non-decreasing continuous functions. Thus, in Lambert's perspective, one chooses a "property" first (i.e. the invertible link) and then chooses the proper loss.

⁷ See also [15].

4.1 Proper Composite Losses

We will call a composite loss ℓ^ψ (19) a proper composite loss if ℓ in (19) is a proper loss for class probability estimation. As in the case of losses for probability estimation, the requirement that a composite loss be proper imposes some constraints on its partial losses.
Many of the results for proper losses carry over to composite losses with some extra factors to account for the link function.

Theorem 10  Let λ = ℓ^ψ be a composite loss with differentiable and strictly monotone link ψ, and suppose the partial losses λ₋₁(v) and λ₁(v) are both differentiable. Then λ is a proper composite loss if and only if there exists a weight function w: (0, 1) → ℝ₊ such that for all η̂ ∈ (0, 1)

    −λ′₁(ψ(η̂))/(1 − η̂) = λ′₋₁(ψ(η̂))/η̂ = w(η̂)/ψ′(η̂) =: ρ(η̂),    (20)

where equality is in the L₁ sense. Furthermore, ρ(η̂) ≥ 0 for all η̂ ∈ (0, 1).

Proof  This is a direct consequence of Theorem 1 for proper losses for probability estimation and the chain rule applied to ℓ_y(η̂) = λ_y(ψ(η̂)). Since ψ is assumed to be strictly monotonic we know ψ′ > 0, and so, since w ≥ 0, we have ρ ≥ 0.

As we shall see, the ratio ρ(η̂) is a key quantity in the analysis of proper composite losses. For example, Corollary 2 has a natural analogue in terms of ρ that will be of use later. It is obtained by letting η̂ = ψ⁻¹(v) and using the chain rule.

Corollary 11  Suppose ℓ^ψ is a proper composite loss with conditional risk denoted L^ψ. Then

    (∂/∂v) L^ψ(η, v) = (ψ⁻¹(v) − η) ρ(ψ⁻¹(v)).    (21)

Loosely speaking, then, ρ is a "co-ordinate free" weight function for composite losses, where the link function ψ is interpreted as a mapping from arbitrary v ∈ V to values which can be interpreted as probabilities. Another immediate corollary of Theorem 10 shows how properness is characterised by a particular relationship between the choice of link function and the choice of partial composite losses.

Corollary 12  Let λ := ℓ^ψ be a composite loss with differentiable partial losses λ₁ and λ₋₁. Then ℓ^ψ is proper if and only if the link ψ satisfies

    ψ⁻¹(v) = λ′₋₁(v)/(λ′₋₁(v) − λ′₁(v)),  ∀v ∈ V.
(22)

Proof  Substituting η̂ = ψ⁻¹(v) into (20) yields −ψ⁻¹(v) λ′₁(v) = (1 − ψ⁻¹(v)) λ′₋₁(v), and solving this for ψ⁻¹(v) gives the result.

These results give some insight into the "degrees of freedom" available when specifying proper composite losses. Theorem 10 shows that the partial losses are completely determined once the weight function w and the link ψ are fixed (up to an additive constant). Corollary 12 shows that for a given link ψ one can specify one of the partial losses λ_y, but then properness fixes the other partial loss λ_{−y}. Similarly, given an arbitrary choice of the partial losses, equation (22) gives the single link which will guarantee that the overall loss is proper. We see then that Corollary 12 provides us with a way of constructing a reference link for arbitrary composite losses specified by their partial losses. The reference link can be seen to satisfy ψ(η) = arg min_{v ∈ ℝ} L^ψ(η, v) for η ∈ (0, 1), and thus calibrates a given composite loss in the sense of [10].

We now briefly consider an application of the parametrisation of proper losses as a weight function and link. In order to implement Stochastic Gradient Descent (SGD) algorithms, one needs to compute the derivative of the loss with respect to the predictions v ∈ ℝ. Letting η̂(v) = ψ⁻¹(v) be the probability estimate associated with the prediction v, we can use (21) with η ∈ {0, 1} to obtain the update rules for positive and negative examples:

    (∂/∂v) ℓ^ψ₁(v) = (η̂(v) − 1) ρ(η̂(v)),    (23)
    (∂/∂v) ℓ^ψ₋₁(v) = η̂(v) ρ(η̂(v)).    (24)

Given an arbitrary weight function w (which defines a proper loss via Corollary 2 and Theorem 4) and link ψ, the above equations show that one could implement SGD directly parametrised in terms of ρ, without needing to explicitly compute the partial losses themselves.
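For instance, with the log-loss weight w(η̂) = 1/(η̂(1 − η̂)) and the logit link, ψ′(η̂) = 1/(η̂(1 − η̂)) as well, so ρ ≡ 1 and (23)–(24) reduce to the familiar logistic-regression gradients σ(v) − 1 and σ(v). A sketch of such a ρ-parametrised gradient (our own naming, checked against finite differences of the composite partial losses):

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

# weight function and link derivative for log loss with the logit link
w = lambda p: 1 / (p * (1 - p))
dpsi = lambda p: 1 / (p * (1 - p))   # psi(p) = log(p / (1 - p))
rho = lambda p: w(p) / dpsi(p)       # identically 1 in this case

def sgd_gradient(y, v):
    # (23) and (24): derivative of the composite partial loss at prediction v
    p = sigmoid(v)                   # eta_hat(v) = psi^{-1}(v)
    return (p - 1) * rho(p) if y == 1 else p * rho(p)
```

Swapping in a different w and ψ′ changes only the two lambdas; the gradient routine itself is untouched, which is the point of the ρ parametrisation.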
Finally, we make note of an analogue of Corollary 5 for composite losses. It shows that the regret for an arbitrary composite loss is related to a Bregman divergence via its link.

Corollary 13  Let ℓ^ψ be a proper composite loss with invertible link. Then for all η ∈ (0, 1) and v ∈ V,

    ΔL^ψ(η, v) = D_{−L}(η, ψ⁻¹(v)).    (25)

This corollary generalises the results due to Zhang [50] and Masnadi-Shirazi and Vasconcelos [32], who considered only margin losses, respectively without and with links.

4.2 Margin Losses

The margin associated with a real-valued prediction v ∈ ℝ and label y ∈ {−1, 1} is the product z = yv. Any function φ: ℝ → ℝ₊ can be used as a margin loss by interpreting φ(yv) as the penalty for predicting v for an instance with label y. Margin losses are inherently symmetric: since yv = (−y)(−v), the penalty φ(yv) for predicting v when the label is y is necessarily the same as the penalty for predicting −v when the label is −y. Margin losses have attracted a lot of attention [5] because of their central role in Support Vector Machines [11]. In this section we explore the relationship between margin losses and the more general class of composite losses and, in particular, symmetric composite losses.

Recall that a general composite loss is of the form ℓ^ψ(y, v) = ℓ(y, ψ⁻¹(v)) for a loss ℓ: Y × [0, 1] → [0, ∞) and an invertible link ψ: [0, 1] → ℝ. We would like to understand when margin losses can be understood as losses suitable for probability estimation tasks. As discussed above, proper losses are a natural class of losses over [0, 1] for probability estimation, so a natural question in this vein is the following: given a margin loss φ, can we choose a link ψ so that there exists a proper loss ℓ such that φ(yv) = ℓ^ψ(y, v)? In this case the proper loss will be ℓ(y, η̂) = φ(yψ(η̂)).
The following corollary of Theorem 10 gives necessary and sufficient conditions on the choice of link ψ to guarantee that a margin loss φ can be expressed as a proper composite loss.

Corollary 14  Suppose φ: ℝ → ℝ is a differentiable margin loss. Then φ(yv) can be expressed as a proper composite loss ℓ^ψ(y, v) if and only if the link ψ satisfies

    ψ⁻¹(v) = φ′(−v)/(φ′(−v) + φ′(v)).    (26)

Proof  Margin losses, by definition, have partial losses λ_y(v) = φ(yv), which means λ′₁(v) = φ′(v) and λ′₋₁(v) = −φ′(−v). Substituting these into (22) gives the result.

This result provides a way of interpreting predictions v as probabilities η̂ = ψ⁻¹(v) in a consistent manner, for a problem defined by a margin loss. Conversely, it also guarantees that using any other link to interpret predictions as probabilities will be inconsistent.⁸ Another immediate implication is that for a margin loss to be considered a proper loss, its link function must be symmetric in the sense that

    ψ⁻¹(−v) = φ′(v)/(φ′(v) + φ′(−v)) = 1 − φ′(−v)/(φ′(−v) + φ′(v)) = 1 − ψ⁻¹(v),

and so, by letting v = ψ(η̂), we have ψ(1 − η̂) = −ψ(η̂) and thus ψ(½) = 0. Corollary 14 can also be seen as a simplified and generalised version of the argument by Masnadi-Shirazi and Vasconcelos [32] that a concave minimal conditional risk function and a symmetric link completely determine a margin loss.⁹

We now consider a couple of specific margin losses and show how they can be associated with a proper loss through the choice of link given in Corollary 14.

⁸ Strictly speaking, if the margin loss has "flat spots", i.e. where φ′(v) = 0, then the choice of link may not be unique.
The exponential loss $\phi(v) = e^{-v}$ gives rise to a proper loss $\ell(y,\hat\eta) = \phi(y\,\psi(\hat\eta))$ via the link
$$\psi^{-1}(v) = \frac{-e^{v}}{-e^{v} - e^{-v}} = \frac{1}{1 + e^{-2v}},$$
which has non-zero denominator. In this case $\psi(\hat\eta) = \frac{1}{2}\log\frac{\hat\eta}{1-\hat\eta}$ is just the logistic link.

Now consider the family of margin losses parametrised by $\alpha \in (0,\infty)$:
$$\phi_\alpha(v) = \frac{\log\left(\exp(\alpha(1-v)) + 1\right)}{\alpha}.$$
This family of differentiable convex losses approximates the hinge loss as $\alpha \to \infty$ and was studied in the multiclass case by Zhang et al. [51]. Since these are all differentiable functions with
$$\phi'_\alpha(v) = \frac{-e^{\alpha(1-v)}}{e^{\alpha(1-v)}+1},$$
Corollary 14 and a little algebra give
$$\psi^{-1}(v) = \left[1 + \frac{e^{2\alpha} + e^{\alpha(1-v)}}{e^{2\alpha} + e^{\alpha(1+v)}}\right]^{-1}.$$
Examining this family of inverse links as $\alpha \to \infty$ gives some insight into why the hinge loss is a surrogate for classification but not probability estimation. When $\alpha$ is large, the estimate $\hat\eta = \psi^{-1}(v) \approx \frac{1}{2}$ for all but very large $v \in \mathbb{R}$. That is, in the limit all probability estimates sit infinitesimally to the right or left of $\frac{1}{2}$ depending on the sign of $v$.

5 Classification Calibration and Proper Losses

The notion of properness of a loss designed for class probability estimation is a natural one. If one is only interested in classification (rather than estimating probabilities) a weaker condition suffices. In this section we relate the weaker condition to properness.

5.1 Classification Calibration for CPE Losses

We begin by giving a definition of classification calibration for CPE losses (i.e., over the unit interval $[0,1]$) and relate it to composite losses via a link.

^9 Shen [44, Section 4.4] seems to have been the first to view margin losses from this more general perspective.

Definition 15. We say a CPE loss $\ell$ is classification calibrated at $c \in (0,1)$, and write $\ell$ is CC$_c$, if the associated conditional risk $L$ satisfies
$$\forall \eta \neq c, \quad \underline{L}(\eta) < \inf_{\hat\eta : (\hat\eta - c)(\eta - c) \le 0} L(\eta, \hat\eta).$$
(27)

The expression constraining the infimum ensures that $\hat\eta$ is on the opposite side of $c$ from $\eta$, or $\hat\eta = c$. The condition CC$_{1/2}$ is equivalent to what is called "classification calibrated" by Bartlett et al. [7] and "Fisher consistent for classification problems" by Lin [30], although their definitions were only for margin losses. One might suspect that there is a connection between classification calibration at $c$ and standard Fisher consistency for class probability estimation losses. The following theorem, which captures the intuition behind the "probing" reduction [29], characterises the situation.

Theorem 16. A CPE loss $\ell$ is CC$_c$ for all $c \in (0,1)$ if and only if $\ell$ is strictly proper.

Proof. $L$ is CC$_c$ for all $c \in (0,1)$ is equivalent to
$$\forall c \in (0,1),\ \forall \eta \neq c: \quad \begin{cases} \underline{L}(\eta) < \inf_{\hat\eta \ge c} L(\eta,\hat\eta), & \eta < c \\ \underline{L}(\eta) < \inf_{\hat\eta \le c} L(\eta,\hat\eta), & \eta > c \end{cases}$$
$$\Leftrightarrow\ \forall \eta \in (0,1),\ \forall c \neq \eta: \quad \begin{cases} \forall c > \eta, & \underline{L}(\eta) < \inf_{\hat\eta \ge c} L(\eta,\hat\eta) \\ \forall c < \eta, & \underline{L}(\eta) < \inf_{\hat\eta \le c} L(\eta,\hat\eta) \end{cases}$$
$$\Leftrightarrow\ \forall \eta \in (0,1): \quad \begin{cases} \underline{L}(\eta) < \inf_{\hat\eta \ge c > \eta} L(\eta,\hat\eta) \\ \underline{L}(\eta) < \inf_{\hat\eta \le c < \eta} L(\eta,\hat\eta) \end{cases}$$
$$\Leftrightarrow\ \forall \eta \in (0,1): \quad \underline{L}(\eta) < \inf_{(\hat\eta > \eta)\ \text{or}\ (\hat\eta < \eta)} L(\eta,\hat\eta)$$
$$\Leftrightarrow\ \forall \eta \in (0,1): \quad \underline{L}(\eta) < \inf_{\hat\eta \neq \eta} L(\eta,\hat\eta),$$
which means $L$ is strictly proper.

The following theorem is a generalisation of the characterisation of CC$_{1/2}$ for margin losses via $\phi'(0)$ due to Bartlett et al. [7].

Theorem 17. Suppose $\ell$ is a loss and suppose that $\ell'_1$ and $\ell'_{-1}$ exist everywhere. Then for any $c \in (0,1)$, $\ell$ is CC$_c$ if and only if
$$\ell'_{-1}(c) > 0 \quad\text{and}\quad \ell'_1(c) < 0 \quad\text{and}\quad c\,\ell'_1(c) + (1-c)\,\ell'_{-1}(c) = 0. \quad (28)$$

Proof. Since $\ell'_1$ and $\ell'_{-1}$ are assumed to exist everywhere,
$$\frac{\partial}{\partial\hat\eta} L(\eta,\hat\eta) = \eta\,\ell'_1(\hat\eta) + (1-\eta)\,\ell'_{-1}(\hat\eta)$$
exists for all $\hat\eta$.
$L$ is CC$_c$ is equivalent to
$$\left.\frac{\partial}{\partial\hat\eta} L(\eta,\hat\eta)\right|_{\hat\eta=c} \begin{cases} > 0, & \eta < c \\ < 0, & \eta > c \end{cases}$$
$$\Leftrightarrow\ \begin{cases} \forall \eta < c, & \eta\,\ell'_1(c) + (1-\eta)\,\ell'_{-1}(c) > 0 \\ \forall \eta > c, & \eta\,\ell'_1(c) + (1-\eta)\,\ell'_{-1}(c) < 0 \end{cases} \quad (29)$$
$$\Leftrightarrow\ c\,\ell'_1(c) + (1-c)\,\ell'_{-1}(c) = 0 \ \text{ and }\ \ell'_{-1}(c) > 0 \ \text{ and }\ \ell'_1(c) < 0, \quad (30)$$
where we have used the fact that (29) with $\eta = 0$ and $\eta = 1$ respectively substituted implies $\ell'_{-1}(c) > 0$ and $\ell'_1(c) < 0$.

If $\ell$ is proper, then by evaluating (7) at $\eta = 0$ and $\eta = 1$ we obtain $\ell'_1(\hat\eta) = -w(\hat\eta)(1-\hat\eta)$ and $\ell'_{-1}(\hat\eta) = w(\hat\eta)\,\hat\eta$. Thus (30) implies $-w(c)(1-c) < 0$ and $w(c)\,c > 0$, which holds if and only if $w(c) \neq 0$. We have thus shown the following corollary.

Corollary 18. If $\ell$ is proper with weight $w$, then for any $c \in (0,1)$, $w(c) \neq 0 \Leftrightarrow \ell$ is CC$_c$.

The simple form of the weight function for the cost-sensitive misclassification loss $\ell_{c_0}$ ($w(c) = \delta(c - c_0)$) gives the following corollary (confer Bartlett et al. [7]):

Corollary 19. $\ell_{c_0}$ is CC$_c$ if and only if $c_0 = c$.

5.2 Calibration for Composite Losses

The translation of the above results to general proper composite losses with invertible differentiable link $\psi$ is straightforward. Condition (27) becomes
$$\forall \eta \neq c, \quad \underline{L}^\psi(\eta) < \inf_{v : (\psi^{-1}(v) - c)(\eta - c) \le 0} L(\eta, \psi^{-1}(v)).$$
Theorem 16 then immediately gives:

Corollary 20. A composite loss $\ell^\psi(\cdot,\cdot) = \ell(\cdot, \psi^{-1}(\cdot))$ with invertible and differentiable link $\psi$ is CC$_c$ for all $c \in (0,1)$ if and only if the associated proper loss $\ell$ is strictly proper.

Theorem 17 immediately gives:

Corollary 21. Suppose $\ell^\psi$ is as in Corollary 20 and that the partial losses $\ell_1$ and $\ell_{-1}$ of the associated proper loss $\ell$ are differentiable. Then for any $c \in (0,1)$, $\ell^\psi$ is CC$_c$ if and only if (28) holds.
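For a concrete instance of Theorem 17, log loss (with partial losses $\ell_1(\hat\eta) = -\log\hat\eta$ and $\ell_{-1}(\hat\eta) = -\log(1-\hat\eta)$) satisfies (28) at every $c \in (0,1)$, consistent with Theorem 16, since log loss is strictly proper. A numerical sketch (our own; the function names are not from the paper):

```python
import math

# Check the calibration conditions (28) of Theorem 17 for log loss.
# Its partial derivatives are l1'(c) = -1/c and l_{-1}'(c) = 1/(1-c),
# so the stationarity term c*l1'(c) + (1-c)*l_{-1}'(c) vanishes for all c.

def l1_prime(c):     # derivative of l_1(c) = -log(c)
    return -1.0 / c

def lm1_prime(c):    # derivative of l_{-1}(c) = -log(1-c)
    return 1.0 / (1.0 - c)

def satisfies_28(c):
    stationarity = abs(c * l1_prime(c) + (1.0 - c) * lm1_prime(c)) < 1e-12
    return lm1_prime(c) > 0 and l1_prime(c) < 0 and stationarity

cs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(all(satisfies_28(c) for c in cs))
```

The same check with, say, the cost-sensitive loss in place of log loss would fail at every $c \neq c_0$, in line with Corollary 19.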
It can be shown that, in the special case of margin losses $L^\phi$ satisfying the conditions of Corollary 14 (so that they are proper composite losses), Corollary 21 leads to the condition $\phi'(0) < 0$, which is the same as that obtained by Bartlett et al. [7].

6 Convexity of Composite Losses

We have seen that composite losses are defined by the proper loss $\ell$ and the link $\psi$. We have further seen from (14) that it is natural to parametrise composite losses in terms of $w$ and $\psi'$, and combine them as $\rho$. One may wish to choose a weight function $w$ and determine which links $\psi$ lead to a convex loss; or choose a link $\psi$ and determine which weight functions $w$ (and hence proper losses) lead to a convex composite loss. The main result of this section, Theorem 29, answers these questions by characterising the convexity of composite losses in terms of $(w, \psi')$ or $\rho$. We first establish some convexity results for losses and their conditional and full risks.

Lemma 22. Let $\ell: Y \times V \to [0,\infty)$ denote an arbitrary loss. Then the following are equivalent:
1. $v \mapsto \ell(y,v)$ is convex for all $y \in \{-1,1\}$;
2. $v \mapsto L(\eta,v)$ is convex for all $\eta \in [0,1]$;
3. $v \mapsto \hat{L}(v, S) := \frac{1}{|S|}\sum_{(x,y) \in S} \ell(y, v(x))$ is convex for all finite $S \subset X \times Y$.

Proof. 1 ⇒ 2: By definition, $L(\eta,v) = (1-\eta)\,\ell(-1,v) + \eta\,\ell(1,v)$, which is just a convex combination of convex functions and hence convex. 2 ⇒ 1: Choose $\eta = 0$ and $\eta = 1$ in the definition of $L$. 1 ⇒ 3: For a fixed $(x,y)$, the function $v \mapsto \ell(y, v(x))$ is convex since $\ell$ is convex. Thus $\hat{L}$ is convex as it is a non-negative weighted sum of convex functions. 3 ⇒ 1: The convexity of $\hat{L}$ holds for every $S$, so for each $y \in \{-1,1\}$ choose $S = \{(x,y)\}$ for some $x$. In each case $v \mapsto \hat{L}(v,S) = \ell(y, v(x))$ is convex as required.

The following theorem generalises the corollary on page 12 of Buja et al.
[9] to arbitrary composite losses with invertible links. It has less practical value than the previous lemma since, in general, sums of quasi-convex functions are not necessarily quasi-convex (a function $f$ is quasi-convex if the sublevel set $\{x : f(x) \le \alpha\}$ is convex for all $\alpha \in \mathbb{R}$). Thus, assuming properness of the loss $\ell$ does not guarantee that its empirical risk $\hat{L}(\cdot, S)$ will not have local minima.

Theorem 23. If $\ell^\psi(y,v) = \ell(y, \psi^{-1}(v))$ is a composite loss where $\ell$ is proper and $\psi$ is invertible and differentiable, then $L^\psi(\eta,v)$ is quasi-convex in $v$ for all $\eta \in [0,1]$.

Proof. Since $\ell$ is proper, we know by Corollary 11 that the conditional risk satisfies
$$\frac{\partial}{\partial v} L^\psi(\eta,v) = \left(\psi^{-1}(v) - \eta\right)\rho(\psi^{-1}(v)).$$
Since $\psi$ is invertible and $\rho \ge 0$, we see that $\frac{\partial}{\partial v} L^\psi(\eta,v)$ only changes sign at $\eta = \psi^{-1}(v)$, and so $L^\psi$ is quasi-convex as required.

The following theorem characterises convexity of composite losses with invertible links.

Theorem 24. Let $\ell^\psi(y,v)$ be a composite loss comprising an invertible link $\psi$ with inverse $q := \psi^{-1}$ and a strictly proper loss with weight function $w$. Assume $q'(\cdot) > 0$. Then $v \mapsto \ell^\psi(y,v)$ is convex for $y \in \{-1,1\}$ if and only if
$$-\frac{1}{x} \le \frac{w'(x)}{w(x)} - \frac{\psi''(x)}{\psi'(x)} \le \frac{1}{1-x}, \quad \forall x \in (0,1). \quad (31)$$

This theorem suggests that a very natural parametrisation of composite losses is via $(w, \psi')$. Observe that $w, \psi': [0,1] \to \mathbb{R}_+$. (But also see the comment following Theorem 29.)

Proof. We can write the conditional composite loss as
$$L^\psi(\eta,v) = \eta\,\ell_1(q(v)) + (1-\eta)\,\ell_{-1}(q(v))$$
and by substituting $q = \psi^{-1}$ into (21) we have
$$\frac{\partial}{\partial v} L^\psi(\eta,v) = w(q(v))\,q'(v)\,[q(v) - \eta]. \quad (32)$$
A necessary and sufficient condition for $v \mapsto \ell^\psi(y,v) = L^\psi(y,v)$ to be convex for $y \in \{-1,1\}$ is that
$$\frac{\partial^2}{\partial v^2} L^\psi(y,v) \ge 0, \quad \forall v \in \mathbb{R},\ \forall y \in \{-1,1\}.$$
Using (32), the above condition is equivalent to
$$[w(q(v))\,q'(v)]'\,\left(q(v) - \llbracket y=1\rrbracket\right) + w(q(v))\,q'(v)\,q'(v) \ge 0, \quad \forall v \in \mathbb{R}, \quad (33)$$
where $[w(q(v))\,q'(v)]' := \frac{\partial}{\partial v}\,w(q(v))\,q'(v)$. Inequality (33) is equivalent to [9, equation 39]. By further manipulations, we can simplify (33) considerably. Since $\llbracket y=1 \rrbracket$ is either 0 or 1, we equivalently have the two inequalities
$$[w(q(v))\,q'(v)]'\,q(v) + w(q(v))(q'(v))^2 \ge 0, \quad \forall v \in \mathbb{R} \quad (y = -1),$$
$$[w(q(v))\,q'(v)]'\,(q(v) - 1) + w(q(v))(q'(v))^2 \ge 0, \quad \forall v \in \mathbb{R} \quad (y = 1),$$
which we shall rewrite as the pair of inequalities
$$w(q(v))(q'(v))^2 \ge -q(v)\,[w(q(v))\,q'(v)]', \quad \forall v \in \mathbb{R}, \quad (34)$$
$$w(q(v))(q'(v))^2 \ge (1 - q(v))\,[w(q(v))\,q'(v)]', \quad \forall v \in \mathbb{R}. \quad (35)$$
Observe that if $q(\cdot) = 0$ (resp. $1 - q(\cdot) = 0$) then (34) (resp. (35)) is satisfied anyway, because of the assumption on $q'$ and the fact that $w$ is non-negative. It is thus equivalent to restrict consideration to $v$ in the set
$$\{x : q(x) \neq 0 \text{ and } 1 - q(x) \neq 0\} = q^{-1}((0,1)) = \psi((0,1)).$$
Combining (34) and (35) we obtain the equivalent condition
$$\frac{(q'(v))^2}{1 - q(v)} \ge \frac{[w(q(v))\,q'(v)]'}{w(q(v))} \ge -\frac{(q'(v))^2}{q(v)}, \quad \forall v \in \psi((0,1)), \quad (36)$$
where we have used the fact that $q: \mathbb{R} \to [0,1]$ is sign-definite, so $-q(\cdot)$ is always negative; division by $q(v)$ and $1 - q(v)$ is permissible since, as argued, we can neglect the cases where these take the value zero; and division by $w(q(v))$ is permissible by the assumption of strict properness, since that implies $w(\cdot) > 0$.
Now $[w(q(\cdot))\,q'(\cdot)]' = w'(q(\cdot))\,q'(\cdot)\,q'(\cdot) + w(q(\cdot))\,q''(\cdot)$, and thus (36) is equivalent to
$$\frac{(q'(v))^2}{1 - q(v)} \ge \frac{w'(q(v))(q'(v))^2 + w(q(v))\,q''(v)}{w(q(v))} \ge -\frac{(q'(v))^2}{q(v)}, \quad \forall v \in \psi((0,1)). \quad (37)$$
Now divide all sides of (37) by $(q'(\cdot))^2$ (which is permissible by assumption). This gives the equivalent condition
$$\frac{1}{1 - q(v)} \ge \frac{w'(q(v))}{w(q(v))} + \frac{q''(v)}{(q'(v))^2} \ge -\frac{1}{q(v)}, \quad \forall v \in \psi((0,1)). \quad (38)$$
Let $x = q(v)$, so $v = q^{-1}(x) = \psi(x)$. Then (38) is equivalent to
$$\frac{1}{1 - x} \ge \frac{w'(x)}{w(x)} + \frac{q''(\psi(x))}{(q'(\psi(x)))^2} \ge -\frac{1}{x}, \quad \forall x \in (0,1). \quad (39)$$
Now $\frac{1}{q'(\psi(x))} = \frac{1}{q'(q^{-1}(x))} = (q^{-1})'(x) = \psi'(x)$. Thus (39) is equivalent to
$$\frac{1}{1 - x} \ge \frac{w'(x)}{w(x)} + \Phi_\psi(x) \ge -\frac{1}{x}, \quad \forall x \in (0,1), \quad (40)$$
where
$$\Phi_\psi(x) := q''(\psi(x))\,\psi'(x)^2. \quad (41)$$
All of the above steps are equivalences. We have thus shown that (40) holds $\Leftrightarrow$ $v \mapsto L^\psi(y,v)$ is convex for $y \in \{-1,1\}$, where the right hand side is equivalent to the assertion in the theorem by Lemma 22.

Finally we simplify $\Phi_\psi$. We first compute $q''$ in terms of $\psi = q^{-1}$. Observe that
$$q' = (\psi^{-1})' = \frac{1}{\psi'(\psi^{-1}(\cdot))}.$$
Thus
$$q''(\cdot) = (\psi^{-1})''(\cdot) = \left(\frac{1}{\psi'(\psi^{-1}(\cdot))}\right)' = -\frac{\psi''(\psi^{-1}(\cdot))\,(\psi^{-1})'(\cdot)}{(\psi'(\psi^{-1}(\cdot)))^2} = -\frac{\psi''(\psi^{-1}(\cdot))}{(\psi'(\psi^{-1}(\cdot)))^3}.$$
Thus by substitution
$$\Phi_\psi(\cdot) = -\frac{\psi''(\psi^{-1}(\psi(\cdot)))}{(\psi'(\psi^{-1}(\psi(\cdot))))^3}\,\psi'(\cdot)^2 = -\frac{\psi''(\cdot)}{(\psi'(\cdot))^3}\,\psi'(\cdot)^2 = -\frac{\psi''(\cdot)}{\psi'(\cdot)}. \quad (42)$$
Substituting the simpler expression (42) for $\Phi_\psi$ into (40) completes the proof.

Lemma 25. If $q$ is affine then $\Phi_\psi = 0$.

Proof. Using (42), this is immediate since in this case $\psi''(\cdot) = 0$.
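Condition (31) is easy to probe numerically for specific pairs $(w, \psi)$. The sketch below (our own construction, not code from the paper) evaluates the middle term $w'/w - \psi''/\psi'$ on a grid: for log loss ($w(c) = 1/(c(1-c))$) with the logit link the middle term vanishes identically, so (31) holds everywhere, while for squared loss ($w$ constant) with the logit link (31) fails for $x$ outside $[\frac{1}{3},\frac{2}{3}]$; squared error composed with a sigmoid is not convex.

```python
# Numerical probe of condition (31): -1/x <= w'/w - psi''/psi' <= 1/(1-x).
# We use the logit link, psi'(x) = 1/(x(1-x)), psi''(x) = (2x-1)/(x(1-x))^2.

logit1 = lambda x: 1.0 / (x * (1.0 - x))                  # psi'
logit2 = lambda x: (2.0 * x - 1.0) / (x * (1.0 - x))**2   # psi''

# (a) log loss: w = 1/(x(1-x)), so w'/w = (2x-1)/(x(1-x)) = psi''/psi'.
logloss_wlogd = lambda x: (2.0 * x - 1.0) / (x * (1.0 - x))
# (b) squared loss: w constant, so w'/w = 0.
sq_wlogd = lambda x: 0.0

def holds_31(w_log_deriv, x):
    middle = w_log_deriv(x) - logit2(x) / logit1(x)
    return -1.0 / x <= middle <= 1.0 / (1.0 - x)

grid = [i / 100.0 for i in range(1, 100)]
print(all(holds_31(logloss_wlogd, x) for x in grid))   # canonical pairing
print(all(holds_31(sq_wlogd, x) for x in grid))        # fails near endpoints
```

The first pairing is the canonical one (the weight equals the link derivative), which is why the middle term cancels exactly.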
Corollary 26. Composite losses with a linear link (including as a special case the identity link) are convex if and only if
$$-\frac{1}{x} \le \frac{w'(x)}{w(x)} \le \frac{1}{1-x}, \quad \forall x \in (0,1).$$

6.1 Canonical Links

Buja et al. [9] introduced the notion of a canonical link, defined by $\psi'(v) = w(v)$. The canonical link corresponds to the notion of "matching loss" as developed by Helmbold et al. [20] and Kivinen and Warmuth [26]. Note that the choice of canonical link implies $\rho(c) = w(c)/\psi'(c) = 1$.

Lemma 27. Suppose $\ell$ is a proper loss with weight function $w$ and $\psi$ is the corresponding canonical link. Then
$$\Phi_\psi(x) = -\frac{w'(x)}{w(x)}. \quad (43)$$

Proof. Substitute $\psi' = w$ into (42).

This lemma gives an immediate proof of the following result due to Buja et al. [9].

Theorem 28. A composite loss comprising a proper loss with weight function $w$ combined with its canonical link is always convex.

Proof. Substitute (43) into (31) to obtain
$$-\frac{1}{x} \le 0 \le \frac{1}{1-x}, \quad \forall x \in (0,1),$$
which holds for any $w$.

An alternative view of canonical links is given in Appendix B.

6.2 A Simpler Characterisation of Convex Composite Losses

The following theorem provides a simpler characterisation of the convexity of composite losses. Noting that loss functions can be multiplied by a scalar without affecting what a learning algorithm will do, it is convenient to normalise them. If $w$ satisfies (31) then so does $\alpha w$ for all $\alpha \in (0,\infty)$. Thus, without loss of generality, we will normalise $w$ such that $w(\frac{1}{2}) = 1$. We chose to normalise about $\frac{1}{2}$ for two reasons: symmetry, and the fact that $w$ can have non-integrable singularities at 0 and 1; see e.g. [9].

Theorem 29. Consider a proper composite loss $\ell^\psi$ with invertible link $\psi$ and (strictly proper) weight $w$ normalised such that $w(\frac{1}{2}) = 1$.
Then $\ell^\psi$ is convex if and only if
$$\frac{\psi'(x)}{2x\,\psi'(\frac{1}{2})} \mathrel{Q} w(x) \mathrel{Q} \frac{\psi'(x)}{2(1-x)\,\psi'(\frac{1}{2})}, \quad \forall x \in (0,1), \quad (44)$$
where $Q$ denotes $\le$ for $x \ge \frac{1}{2}$ and denotes $\ge$ for $x \le \frac{1}{2}$.

Observe that condition (44) is equivalent to
$$\frac{1}{2\,\psi'(\frac{1}{2})\,x} \mathrel{Q} \rho(x) \mathrel{Q} \frac{1}{2\,\psi'(\frac{1}{2})\,(1-x)}, \quad \forall x \in (0,1), \quad (45)$$
which suggests the importance of the function $\rho(\cdot)$.

Proof. Observing that $\frac{w'(x)}{w(x)} = (\log w)'(x)$, we let $g(x) := \log w(x)$. Observe that $g(v) = \int_{1/2}^{v} g'(x)\,dx + g(\frac{1}{2})$ and $g(\frac{1}{2}) = \log w(\frac{1}{2}) = 0$. Thus from (31) we obtain
$$-\frac{1}{x} - \Phi_\psi(x) \le g'(x) \le \frac{1}{1-x} - \Phi_\psi(x).$$
For $v \ge \frac{1}{2}$ we thus have
$$\int_{1/2}^{v} \left(-\frac{1}{x} - \Phi_\psi(x)\right) dx \le g(v) \le \int_{1/2}^{v} \left(\frac{1}{1-x} - \Phi_\psi(x)\right) dx.$$
Conversely, for $v \le \frac{1}{2}$ we have
$$\int_{1/2}^{v} \left(-\frac{1}{x} - \Phi_\psi(x)\right) dx \ge g(v) \ge \int_{1/2}^{v} \left(\frac{1}{1-x} - \Phi_\psi(x)\right) dx,$$
and thus
$$-\ln v - \ln 2 - \int_{1/2}^{v} \Phi_\psi(x)\,dx \mathrel{Q} g(v) \mathrel{Q} -\ln 2 - \ln(1-v) - \int_{1/2}^{v} \Phi_\psi(x)\,dx.$$
Since $\exp(\cdot)$ is monotone increasing, we can apply it to all terms and obtain
$$\frac{1}{2v} \exp\left(-\int_{1/2}^{v} \Phi_\psi(x)\,dx\right) \mathrel{Q} w(v) \mathrel{Q} \frac{1}{2(1-v)} \exp\left(-\int_{1/2}^{v} \Phi_\psi(x)\,dx\right). \quad (46)$$
Now
$$\int_{1/2}^{v} \Phi_\psi(x)\,dx = \int_{1/2}^{v} -\frac{\psi''(x)}{\psi'(x)}\,dx = -\int_{1/2}^{v} (\log \psi')'(x)\,dx = -\log \psi'(v) + \log \psi'(\tfrac{1}{2}),$$
and so
$$\exp\left(-\int_{1/2}^{v} \Phi_\psi(x)\,dx\right) = \frac{\psi'(v)}{\psi'(\frac{1}{2})}.$$
Substituting into (46) completes the proof.

Figure 2: Allowable normalised weight functions to ensure convexity of composite loss functions with identity link (left) and logistic link (right).

If $\psi$ is the identity (i.e., if $\ell^\psi$ is itself proper) we get the simpler constraints
$$\frac{1}{2x} \mathrel{Q} w(x) \mathrel{Q} \frac{1}{2(1-x)}, \quad \forall x \in (0,1), \quad (47)$$
which are illustrated as the shaded region in Figure 2. Observe that the (normalised) weight function for squared loss is $w(c) = 1$, which is indeed within the shaded region, as one would expect. Consider the link $\psi_{\mathrm{logit}}(c) := \log\frac{c}{1-c}$ with corresponding inverse link $q(c) = \frac{1}{1+e^{-c}}$. One can check that $\psi'(c) = \frac{1}{c(1-c)}$.
Thus the constraints on the weight function $w$ to ensure convexity of the composite loss are
$$\frac{1}{8x^2(1-x)} \mathrel{Q} w(x) \mathrel{Q} \frac{1}{8x(1-x)^2}, \quad \forall x \in (0,1).$$
This is shown graphically in Figure 2. One can compute similar regions for any link. Two other examples are the complementary log-log link $\psi_{\mathrm{CLL}}(x) = \log(-\log(1-x))$ (confer McCullagh and Nelder [33]), the "square link" $\psi_{\mathrm{sq}}(x) = x^2$ and the "cosine link" $\psi_{\cos}(x) = 1 - \cos(\pi x)$. All of these are illustrated in Figure 3.

Figure 3: Allowable normalised weight functions to ensure convexity of loss functions with complementary log-log, square and cosine links.

The reason for considering these last two rather unusual links is to illustrate the following fact. The allowable region in Figure 2 precludes weight functions that approach zero at the endpoints of the interval. However, in order to approximate well the behaviour of 0-1 loss (whose weight function is $w_{0\text{-}1}(c) = \delta(c - \frac{1}{2})$), one would like a weight function that does approach zero at the endpoints. It is therefore natural to ask what constraints are imposed on a link $\psi$ such that a composite loss with that link and a weight function $w$ satisfying
$$\lim_{c \searrow 0} w(c) = \lim_{c \nearrow 1} w(c) = 0 \quad (48)$$
is convex. Inspection of (44) reveals that it is necessary that $\psi'(x) \to 0$ as $x \to 0$ and $x \to 1$. Such $\psi$ necessarily have bounded range, and thus the inverse link $\psi^{-1}$ is only defined on a finite interval; furthermore the gradient of $\psi^{-1}$ will be arbitrarily large. If one wants inverse links defined on the whole real line (such as the logistic link) then one cannot obtain a convex composite loss with the associated proper loss having a weight function satisfying (48). Thus one cannot choose an effectively usable link to ensure convexity of a proper loss that is arbitrarily "close to" 0-1 loss in the sense of the corresponding weight functions.
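These constraints are easy to probe numerically. A sketch (our own, assuming the normalisation $w(\frac{1}{2}) = 1$) checking which weight functions fall inside the identity-link region of (47): the constant weight of squared loss lies inside, whereas a weight such as $w(c) = 4c(1-c)$, which vanishes at the endpoints as in (48), leaves the region.

```python
# Membership test for the allowable region (47) with the identity link
# (the shaded region in the left diagram of Figure 2).
# Q is <= for x >= 1/2 and >= for x <= 1/2.

def in_region_47(w, x):
    lo = 1.0 / (2.0 * x)
    hi = 1.0 / (2.0 * (1.0 - x))
    if x >= 0.5:
        return lo <= w(x) <= hi
    return lo >= w(x) >= hi

grid = [i / 100.0 for i in range(1, 100)]
print(all(in_region_47(lambda c: 1.0, x) for x in grid))                # squared loss
print(all(in_region_47(lambda c: 4.0 * c * (1.0 - c), x) for x in grid))  # vanishing weight
```

The second weight function is normalised ($w(\frac{1}{2}) = 1$) but falls below the lower envelope near the endpoints, illustrating why (48) is incompatible with convexity under the identity link.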
Corollary 30. If a loss is proper and convex, then it is strictly proper.

The proof of Corollary 30 makes use of the following special case of the Gronwall-style Lemma 1.1.1 of Bainov and Simeonov [3].

Lemma 31. Let $b: \mathbb{R} \to \mathbb{R}$ be continuous for $t \ge \alpha$. Let $v(t)$ be differentiable for $t \ge \alpha$ and suppose $v'(t) \le b(t)\,v(t)$ for $t \ge \alpha$ and $v(\alpha) \le v_0$. Then for $t \ge \alpha$,
$$v(t) \le v_0 \exp\left(\int_\alpha^t b(s)\,ds\right).$$

Proof (Corollary 30). Observe that the right hand side of (31) implies
$$w'(v) \le \frac{w(v)}{1-v}, \quad v \ge 0.$$
Suppose $w(0) = 0$. Then $v_0 = 0$ and, setting $\alpha = 0$, the lemma implies
$$w(t) \le v_0 \exp\left(\int_0^t \frac{1}{1-s}\,ds\right) = \frac{v_0}{1-t} = 0, \quad t \in (0,1].$$
Thus if $w(0) = 0$ then $w(t) = 0$ for all $t \in (0,1)$. Choosing any other $\alpha \in (0,1)$ leads to a similar conclusion: if $w(t) = 0$ for some $t \in [0,1)$ then $w(s) = 0$ for all $s \in [t,1]$. Hence $w(t) > 0$ for all $t \in [0,1]$, and hence, by the remark immediately following Theorem 6, $\ell$ is strictly proper.

7 Choosing a Surrogate Loss

A surrogate loss function is a loss function which is not exactly what one wishes to minimise but is easier to work with algorithmically. Convex surrogate losses are often used in place of the 0-1 loss, which is not convex. Surrogate losses have garnered increasing interest in the machine learning community [50, 7, 46, 47]. Some of the questions considered to date are: bounding the regret of a desired loss in terms of a surrogate ("surrogate regret bounds"; see [39] and references therein); the relationship between the decision-theoretic perspective and the elicitability perspective [32]; and efficient algorithms for minimising convex surrogate margin losses [35, 34]. Typically convex surrogates are used because they lead to convex, and thus tractable, optimisation problems. To date, work on surrogate losses has focussed on margin losses, which are necessarily symmetric with respect to false positives and false negatives [9].
In line with the rest of this paper, our treatment will not be so restricted.

7.1 The "Best" Surrogate Loss

There are many surrogate losses one can choose from. A natural question is thus: "which is best?". To answer this we first need to define how we evaluate losses as surrogates, which requires notation for the set of minimisers of the conditional and full risk associated with a loss. Given a loss $\ell: \{-1,1\} \times V \to \mathbb{R}$, its set of conditional minimisers at $\eta \in [0,1]$ is
$$\mathcal{H}(\ell, \eta) := \{v \in V : L(\eta, v) = \underline{L}(\eta)\}. \quad (49)$$
Given a set of hypotheses $\mathcal{H} \subseteq V^X$, the (constrained) Bayes optimal risk is
$$\underline{L}_{\mathcal{H}} := \inf_{h \in \mathcal{H}} L(h, P).$$
The set of (full) minimisers over $\mathcal{H}$ for $P$ is
$$\mathcal{H}(\ell, P) := \{h \in \mathcal{H} : L(h) = \underline{L}_{\mathcal{H}}\},$$
where $\mathcal{H} \subseteq V^X$ is some restricted set of functions, $L(h) := \mathbb{E}_{(X,Y)\sim P}[\ell(Y, h(X))]$, and the expectation is with respect to $P$. Given a reference loss $\ell_{\mathrm{ref}}$, we say the $\ell_{\mathrm{ref}}$-surrogate penalty of a loss $\ell$ over the function class $\mathcal{H}$ on a problem $(\eta, M)$ (or equivalently $P$) is
$$S_{\ell_{\mathrm{ref}}}(\ell, \eta, M) = S_{\ell_{\mathrm{ref}}}(\ell, P) := \inf_{h \in \mathcal{H}(\ell, P)} L_{\mathrm{ref}}(h),$$
where it is important to remember that $L$ is with respect to $P$. That is, $S_{\ell_{\mathrm{ref}}}(\ell, P)$ is the minimum $\ell_{\mathrm{ref}}$ risk obtainable by a function in $\mathcal{H}$ that minimises the $\ell$ risk. Given a fixed experiment $P$, if $\mathcal{L}$ is a class of losses, then the best surrogate losses in $\mathcal{L}$ for the reference loss $\ell_{\mathrm{ref}}$ are those that minimise the $\ell_{\mathrm{ref}}$-surrogate penalty. This definition is motivated by the manner in which surrogate losses are used: one minimises $L(h)$ over $h$ to obtain the minimiser $h^*$ and hopes that $L_{\mathrm{ref}}(h^*)$ is small. Clearly, if the class of losses contains the reference loss (i.e., $\ell_{\mathrm{ref}} \in \mathcal{L}$) then $\ell_{\mathrm{ref}}$ itself will be a best surrogate loss. Therefore the question of the best surrogate loss is only interesting when $\ell_{\mathrm{ref}} \notin \mathcal{L}$.
One particular case we will consider is when the reference loss is the 0-1 loss and the class of surrogates $\mathcal{L}$ is the set of convex proper losses. Since 0-1 loss is not convex, the question of which surrogate is best is non-trivial.

It would be nice if one could reason about the "best" surrogate loss using the conditional perspective (that is, working with the conditional risk $L(\eta,\hat\eta)$ instead of the full risk $L(h)$) and in a manner independent of $\mathcal{H}$. It is simple to see why this cannot be done. Since all the losses we consider are proper, the minimiser over $\hat\eta$ of $L(\eta, \hat\eta)$ is $\eta$. Thus any proper loss would lead to the same $\hat\eta \in [0,1]$. It is only the introduction of the restricted class of hypotheses $\mathcal{H}$ that prevents this reasoning being applied to the full risk: restrictions on $h \in \mathcal{H}$ prevent $h(x) = \eta(x)$ for all $x \in X$. We conclude that the problem of the best surrogate loss only makes sense when one both takes expectations over $X$ and restricts the class of hypotheses $h$ to be drawn from some set $\mathcal{H} \subsetneq [0,1]^X$. This reasoning accords with that of Nock and Nielsen [35, 34], who examined which surrogate to use and proposed a data-dependent scheme that tunes surrogates to a problem. They explicitly considered proper losses and said that "minimizing any [lower-bounded, symmetric proper] loss amounts to the same ultimate goal", concluding that "the crux of the choice of the [loss] relies on data-dependent considerations".

We demonstrate the difficulty of finding a universally best surrogate loss by constructing a simple example. One can construct experiments $(\eta_1, M)$ and $(\eta_2, M)$ and proper losses $\ell_1$ and $\ell_2$ such that
$$S_{\ell_{0\text{-}1}}(\ell_1, (\eta_1, M)) > S_{\ell_{0\text{-}1}}(\ell_2, (\eta_1, M)) \quad\text{but}\quad S_{\ell_{0\text{-}1}}(\ell_1, (\eta_2, M)) < S_{\ell_{0\text{-}1}}(\ell_2, (\eta_2, M)).$$
(The examples we construct have weight functions that "cross over" each other; the details are in Appendix A.)
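The example of Appendix A can also be reproduced numerically by brute force. The sketch below (our own code, not from the paper) approximates the full risks by midpoint-rule quadrature over $x \in [0,1]$ and minimises over $\alpha$ by grid search, so the optima are recovered only to grid and quadrature accuracy; here we reproduce the two optima for the first experiment, $\eta_1(x) = x^2$.

```python
import math

# Brute-force reproduction of part of the Appendix A example.
# eta1 is the first experiment; L1, L2 are the conditional losses for
# the weight functions w1(c) = 1/c and w2(c) = 1/(1-c).

N = 1000
XS = [(i + 0.5) / N for i in range(N)]   # midpoints avoid log(0) at x = 0

eta1 = lambda x: x * x

def L1(eta, h):   # conditional loss for weight w1(c) = 1/c
    return eta * (h - 1.0 - math.log(h)) + (1.0 - eta) * h

def L2(eta, h):   # conditional loss for weight w2(c) = 1/(1-c)
    return eta * (1.0 - h) + (1.0 - eta) * (-h - math.log(1.0 - h))

def risk(L, eta, alpha):                 # full risk under uniform M on [0,1]
    return sum(L(eta(x), alpha * x) for x in XS) / N

def argmin_alpha(L, eta):
    alphas = [a / 500.0 for a in range(1, 500)]   # 0.002 .. 0.998
    return min(alphas, key=lambda a: risk(L, eta, a))

a11 = argmin_alpha(L1, eta1)   # expect roughly 2/3
a21 = argmin_alpha(L2, eta1)   # expect roughly 0.818
print(round(a11, 2), round(a21, 2))
```

The recovered values agree with $\alpha^*_{1,1} \approx 0.6667$ and $\alpha^*_{2,1} \approx 0.8178$ quoted in Appendix A up to the grid resolution.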
However, this does not imply that there cannot exist a particular convex $\ell^*$ that minorises all proper losses in this sense. Indeed, we conjecture that, in the sense described above, there is no best proper convex surrogate loss.

Conjecture 32. Given a proper, convex loss $\ell$, there exists a second proper, convex loss $\ell^* \neq \ell$, a hypothesis class $\mathcal{H}$, and an experiment $P$ such that $S_{\ell_{0\text{-}1}}(\ell^*, P) < S_{\ell_{0\text{-}1}}(\ell, P)$ for the class $\mathcal{H}$.

To prove the above conjecture it would suffice to show that, for a fixed hypothesis class and any pair of losses, one can construct two experiments such that one loss minorises the other on one experiment and vice versa on the other. Supposing the above conjecture is true, one might then ask for a best surrogate loss for some reference loss $\ell_{\mathrm{ref}}$ in a minimax sense. Formally, we would like the loss $\ell^* \in \mathcal{L}$ such that the worst-case penalty for using $\ell^*$,
$$\Upsilon_{\mathcal{L}}(\ell^*) := \sup_P \left( S_{\ell_{\mathrm{ref}}}(\ell^*, P) - \inf_{\ell \in \mathcal{L}} S_{\ell_{\mathrm{ref}}}(\ell, P) \right),$$
is minimised. That is, $\Upsilon_{\mathcal{L}}(\ell^*) \le \Upsilon_{\mathcal{L}}(\ell)$ for all $\ell \in \mathcal{L}$.

7.2 The "Minimal" Symmetric Convex Proper Loss

Theorem 29 suggests an answer to the question "What is the proper convex loss closest to 0-1 loss?" A way of making this question precise follows. Since $\ell$ is presumed proper, it has a weight function $w$. Suppose w.l.o.g. that $w(\frac{1}{2}) = 1$, and suppose the link is the identity. The constraints in (31) imply that the weight function most similar to that of 0-1 loss meets the constraints. Thus from (47),
$$w_{\mathrm{minimal}}(c) = \frac{1}{2}\left(\frac{1}{c} \wedge \frac{1}{1-c}\right) \quad (50)$$
is the weight of the convex proper loss closest to 0-1 loss in this sense. It is the weight function that forms the lower envelope of the shaded region in the left diagram of Figure 2.
Using (14) one can readily compute the corresponding partial losses explicitly:
$$\ell^{\mathrm{minimal}}_{-1}(\hat\eta) = \frac{1}{2}\left(\llbracket \hat\eta < \tfrac{1}{2} \rrbracket\,(-\hat\eta - \ln(1-\hat\eta)) + \llbracket \hat\eta \ge \tfrac{1}{2} \rrbracket\,(\hat\eta - 1 - \ln \tfrac{1}{2})\right) \quad (51)$$
and
$$\ell^{\mathrm{minimal}}_{1}(\hat\eta) = \frac{1}{2}\left(\llbracket \hat\eta < \tfrac{1}{2} \rrbracket\,(-\hat\eta - \ln \tfrac{1}{2}) + \llbracket \hat\eta \ge \tfrac{1}{2} \rrbracket\,(\hat\eta - 1 - \ln \hat\eta)\right). \quad (52)$$
Observe that the partial losses are (in part) linear, which is unsurprising since linear functions are on the boundary of the set of convex functions. This loss is also best in another, more precise (but ultimately unsatisfactory) sense, as we now show. Surrogate regret bounds are theoretical bounds on the regret of a desired loss (say 0-1 loss) in terms of the regret with respect to a surrogate. Reid and Williamson [39] have shown the following (we quote only the simpler symmetric case here):

Theorem 33. Suppose $\ell$ is a proper loss with corresponding conditional Bayes risk $\underline{L}$ which is symmetric about $\frac{1}{2}$: $\underline{L}(\frac{1}{2} - c) = \underline{L}(\frac{1}{2} + c)$ for $c \in [0, \frac{1}{2}]$. If the regret for the $\ell_{1/2}$ loss is $\Delta L_{1/2}(\eta, \hat\eta) = \alpha$, then the regret $\Delta L$ with respect to $\ell$ satisfies
$$\Delta L(\eta, \hat\eta) \ge \underline{L}(\tfrac{1}{2}) - \underline{L}(\tfrac{1}{2} + \alpha). \quad (53)$$

Figure 4: Upper bound on the 0-1 regret in terms of $\Delta L_{\mathrm{minimal}}$ as given by (54).

The bound in the theorem can be inverted to upper-bound $\Delta L_{1/2}$ given an upper bound on $\Delta L(\eta, \hat\eta)$. Considering all symmetric proper losses normalised such that $w(\frac{1}{2}) = 1$, the right side of (53) is maximised, and thus the bound on $\Delta L_{1/2}$ in terms of $\Delta L$ is minimised, when $\underline{L}(\frac{1}{2} + \alpha)$ is maximised (over all losses so normalised). But since $w = -\underline{L}''$, that occurs for the pointwise minimiser of $w$ (subject to $w(\frac{1}{2}) = 1$). Since we are interested in convex losses, the minimising $w$ is given by (50).
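A direct implementation of the partial losses (51) and (52) lets one verify properness numerically. The sketch below (our own; the grid argmin stands in for the exact minimiser) checks that the conditional risk $L(\eta, \cdot)$ of $\ell^{\mathrm{minimal}}$ is minimised at $\hat\eta = \eta$ on a fine grid.

```python
import math

# Partial losses of ell^minimal, eqs. (51) and (52), and a numerical
# check that the loss is proper: L(eta, .) is minimised at eta.

LN_HALF = math.log(0.5)

def l_m1(p):   # ell^minimal_{-1}, eq. (51)
    if p < 0.5:
        return 0.5 * (-p - math.log(1.0 - p))
    return 0.5 * (p - 1.0 - LN_HALF)

def l_p1(p):   # ell^minimal_{1}, eq. (52)
    if p < 0.5:
        return 0.5 * (-p - LN_HALF)
    return 0.5 * (p - 1.0 - math.log(p))

def L(eta, p):  # conditional risk
    return eta * l_p1(p) + (1.0 - eta) * l_m1(p)

grid = [i / 1000.0 for i in range(1, 1000)]
proper = all(
    min(grid, key=lambda p: L(eta, p)) == eta
    for eta in (0.1, 0.25, 0.5, 0.75, 0.9)
)
print(proper)
```

The check succeeds because $\frac{\partial}{\partial \hat\eta} L(\eta,\hat\eta) = w_{\mathrm{minimal}}(\hat\eta)(\hat\eta - \eta)$ with $w_{\mathrm{minimal}} > 0$, so the conditional risk is uniquely minimised at $\hat\eta = \eta$; note also $\ell^{\mathrm{minimal}}_{-1}(0) = \ell^{\mathrm{minimal}}_{1}(1) = 0$.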
In this case the right hand side of (53) can be explicitly determined to be
$$\left(\frac{\alpha}{2} + \frac{1}{4}\right)\log(2\alpha + 1) - \frac{\alpha}{2},$$
and the bound can be inverted to obtain the result that if $\Delta L_{\mathrm{minimal}}(\eta,\hat\eta) = x$ then
$$\Delta L_{1/2}(\eta,\hat\eta) \le \frac{1}{2}\exp\left(\mathrm{LambertW}\!\left(\frac{4x-1}{e}\right) + 1\right) - \frac{1}{2}, \quad (54)$$
which is plotted in Figure 4. The above argument does not show that the loss given by (51) and (52) is the best surrogate loss. Nevertheless, it does suggest that $\ell^{\mathrm{minimal}}$ is at least worth considering as a convex proper surrogate binary loss.

8 Conclusions

Composite losses are widely used. In this paper we have characterised a number of aspects of them: their relationship to margin losses, the connection between properness and classification calibration, the constraints symmetry imposes, when composite losses are convex, and natural ways to parametrise them. We have also considered the question of the "best" surrogate loss.

The parametrisation of a composite loss in terms of $(w, \psi')$ (or $\rho$) has advantages over using $(\phi, \psi)$ or $(\underline{L}, \psi)$. As explained by Masnadi-Shirazi and Vasconcelos [32], the representation in terms of $(\phi, \psi)$ is in general not unique. The representation in terms of $\underline{L}$ is harder to intuit: whilst the Bayes risks for squared loss and 0-1 loss are indeed "close" (compare the graph of $c \mapsto c(1-c)$ with that of $c \mapsto c \wedge (1-c)$), their weight functions are seen to be very different ($w(c) = 1$ versus $w(c) = 2\delta(c - \frac{1}{2})$). We have also seen, on the basis of Theorem 24, that the parametrisation $(w, \psi')$ is perhaps the most natural: there is a pleasing symmetry between the loss and the link, as in this form both are parametrised in terms of non-negative weight functions on $[0,1]$. Recall too that the canonical link sets $\psi'$ equal to $w$.
This observation suggests an alternate inductive principle known as surrogate tuning, which seems to have been first suggested by Nock and Nielsen [35]. The idea of surrogate tuning is simple: noting that the best surrogate depends on the problem, adapt the surrogate you are using to the problem. In order to do so it is important to have a good parametrisation of the loss. The weight-function perspective does just that, especially given Theorem 29. It would be straightforward to develop low-dimensional parametrisations of $w$ that satisfy the conditions of this theorem, which would thus allow a learning algorithm to explore the space of convex losses. One could (taking due care with the subsequent multiple hypothesis testing problem) regularly evaluate the 0-1 loss of the hypotheses so obtained. The observations made in Section 4 regarding stochastic gradient descent algorithms may be of help in this regard.

Surrogate tuning differs from loss tailoring [18, 19, 9], which involves adapting the loss to what you really think is important. In the surrogate tuning setting, we have fixed on 0-1 loss as what we really want to minimise and use a surrogate solely for computational reasons.

Finally, we conjecture that $\ell^{\mathrm{minimal}}$ (equations (51) and (52)) is somehow special in the class of proper convex losses in some way other than being the pointwise minimiser of weights (and the normalised loss with the smallest regret bound with respect to $\ell_{0\text{-}1}$), but the exact nature of this specialness still eludes us. Perhaps it is optimal in some weaker (minimax) sense. The reason for this suggestion is that it is not hard to show that for reasonable $P$ there exists $\mathcal{H}$ such that $c \mapsto L_c(h, P)$ takes on all possible values within the constraints $0 \le L_c(h, P) \le \max(c, 1-c)$, which follow immediately from the definition of cost-sensitive misclassification loss.
Furthermore, the example in the appendix below seems to require loss functions whose corresponding weight functions cross over each other, and there is no weight function corresponding to a convex proper loss that crosses over $w_{\mathrm{minimal}}$.

Acknowledgements

This work was motivated in part by a question due to John Langford. Thanks to Fangfang Lu for discussions and for finding several bugs in an earlier version. Thanks to Ingo Steinwart for pointing out the $\eta_\alpha$ trick. Thanks to Tim van Erven for comments and corrections. This work was supported by the Australian Research Council and NICTA through Backing Australia's Ability.

A Example Showing Incommensurability of Two Proper Surrogate Losses

We consider $X = [0,1]$ with $M$ uniform on $X$, and consider the two problems induced by $\eta_1(x) = x^2$ and $\eta_2(x) = \frac{1}{3} + \frac{x}{3}$. We use a simple linear hypothesis class $H := \{h_\alpha(x) := \alpha x : \alpha \in [0,1]\}$ with identity link function, and consider the two proper surrogate losses $\ell^1$ and $\ell^2$ with weight functions
$$w_1(c) = \frac{1}{c}, \qquad w_2(c) = \frac{1}{1-c}.$$
These weight functions correspond to the two curves in the left diagram of Figure 2. The corresponding conditional losses can be readily calculated to be
$$L_1(\eta, h) := \eta(h - 1 - \log h) + (1-\eta)h$$
$$L_2(\eta, h) := \eta(1 - h) + (1-\eta)(-h - \log(1-h)).$$
One can numerically compute the parameters of the constrained Bayes optimal hypothesis for each problem and each surrogate loss:
$$\alpha^*_{1,1} = \arg\min_{\alpha\in[0,1]} L_1(\eta_1, h_\alpha, M) = 0.66666667$$
$$\alpha^*_{2,1} = \arg\min_{\alpha\in[0,1]} L_2(\eta_1, h_\alpha, M) = 0.81779259$$
$$\alpha^*_{1,2} = \arg\min_{\alpha\in[0,1]} L_1(\eta_2, h_\alpha, M) = 1.00000000$$
$$\alpha^*_{2,2} = \arg\min_{\alpha\in[0,1]} L_2(\eta_2, h_\alpha, M) = 0.77763472.$$
Furthermore
$$L_{0\text{-}1}(\eta_1, h_{\alpha^*_{1,1}}, M) = 0.3580272, \qquad L_{0\text{-}1}(\eta_1, h_{\alpha^*_{2,1}}, M) = 0.3033476,$$
$$L_{0\text{-}1}(\eta_2, h_{\alpha^*_{1,2}}, M) = 0.4166666, \qquad L_{0\text{-}1}(\eta_2, h_{\alpha^*_{2,2}}, M) = 0.4207872.$$
Thus for problem $\eta_1$ the surrogate loss $L_2$ has a constrained Bayes optimal hypothesis $h_{\alpha^*_{2,1}}$ with a lower 0-1 risk than the constrained Bayes optimal hypothesis $h_{\alpha^*_{1,1}}$ for the surrogate loss $L_1$: for problem $\eta_1$, surrogate $L_2$ is better than surrogate $L_1$. However, for problem $\eta_2$ the situation is reversed: surrogate $L_2$ is worse than surrogate $L_1$.

B An Alternate View of Canonical Links

This appendix contains an alternative approach to understanding canonical links using convex duality. In doing so we present an improved formulation of a result on the duality of Bregman divergences that may be of independent interest.

The Legendre-Fenchel (LF) dual $\phi^\star$ of a function $\phi \colon \mathbb{R} \to \mathbb{R}$ is the function defined by
$$\phi^\star(s^\star) := \sup_{s\in\mathbb{R}} \{\langle s, s^\star\rangle - \phi(s)\}. \qquad (55)$$
The LF dual of any function is convex. When $\phi(s)$ is a function of a real argument $s$ and the derivative $\phi'(s)$ exists, the Legendre-Fenchel conjugate $\phi^\star$ is given by the Legendre transform [40, 21]
$$\phi^\star(s) = s \cdot (\phi')^{-1}(s) - \phi\big((\phi')^{-1}(s)\big). \qquad (56)$$
Thus (writing $\partial f := f'$) $\partial f = (\partial f^\star)^{-1}$. Hence, with $w$, $W$ and $\overline{W}$ defined as above,
$$W = \big(\partial(\overline{W}^\star)\big)^{-1}, \qquad W^{-1} = \partial(\overline{W}^\star), \qquad \overline{W}^\star = \int W^{-1}. \qquad (57)$$
Let $w$, $W$, $\overline{W}$ be as in Theorem 7. Denote by $L_W$ the $w$-weighted conditional loss parametrised by $W = \int w$, and let $\Delta L_W$ be the corresponding regret (we can interchange $\Delta L$ and $D$ here by (25) since $\psi_\ell = \mathrm{id}$):
$$D_w(\eta, \hat\eta) = \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)W(\hat\eta). \qquad (58)$$
We now further consider $D_w$ as given by (58). It will be convenient to parametrise $D$ by $W$ instead of $w$. Note that the standard parametrisation of a Bregman divergence is in terms of the convex function $\overline{W}$. We will thus write $D_{\overline{W}}$, $D_W$ and $D_w$ to all represent (58). The following theorem is known (e.g. [49]) but, as will be seen, stating it in terms of $D_W$ provides some advantages.
Theorem 34. Let $w$, $W$, $\overline{W}$ and $D_W$ be as above. Then for all $x, y \in [0,1]$,
$$D_W(x, y) = D_{W^{-1}}(W(y), W(x)). \qquad (59)$$

Proof. Using (56) we have
$$\overline{W}^\star(u) = u \cdot W^{-1}(u) - \overline{W}(W^{-1}(u)) \;\Rightarrow\; \overline{W}(W^{-1}(u)) = u \cdot W^{-1}(u) - \overline{W}^\star(u). \qquad (60)$$
Equivalently (using (57)),
$$\overline{W}^\star(W(u)) = u \cdot W(u) - \overline{W}(u). \qquad (61)$$
Thus, substituting and then using (60), we have
$$D_W(x, W^{-1}(v)) = \overline{W}(x) - \overline{W}(W^{-1}(v)) - (x - W^{-1}(v)) \cdot W(W^{-1}(v))$$
$$= \overline{W}(x) + \overline{W}^\star(v) - vW^{-1}(v) - (x - W^{-1}(v)) \cdot v$$
$$= \overline{W}(x) + \overline{W}^\star(v) - x \cdot v. \qquad (62)$$
Similarly (this time using (61)) we have
$$D_{W^{-1}}(v, W(x)) = \overline{W}^\star(v) - \overline{W}^\star(W(x)) - (v - W(x)) \cdot W^{-1}(W(x))$$
$$= \overline{W}^\star(v) - xW(x) + \overline{W}(x) - v \cdot x + xW(x)$$
$$= \overline{W}^\star(v) + \overline{W}(x) - v \cdot x. \qquad (63)$$
Comparing (62) and (63) we see that $D_W(x, W^{-1}(v)) = D_{W^{-1}}(v, W(x))$. Letting $y = W^{-1}(v)$ and substituting $v = W(y)$ leads to (59).

The weight function corresponding to $D_{W^{-1}}$ is $\frac{\partial}{\partial x}W^{-1}(x) = \frac{1}{w(W^{-1}(x))}$.

Theorem 35. If the inverse link $\psi^{-1} = W^{-1}$ (and thus $\hat\eta = W^{-1}(\hat h)$) then
$$D_W(\eta, \hat\eta) = D_W(\eta, W^{-1}(\hat h)) = \overline{W}(\eta) + \overline{W}^\star(\hat h) - \eta \cdot \hat h$$
$$L_W(\eta, \hat\eta) = L_W(\eta, W^{-1}(\hat h)) = \overline{W}^\star(\hat h) - \eta \cdot \hat h + \eta\big(\overline{W}(1) - \overline{W}(0)\big) + \overline{W}(0)$$
$$\frac{\partial}{\partial \hat h} L_W(\eta, W^{-1}(\hat h)) = \hat\eta - \eta,$$
and furthermore $D_W(\eta, W^{-1}(\hat h))$ and $L_W(\eta, W^{-1}(\hat h))$ are convex in $\hat h$.

Proof. The first two expressions follow immediately from (62) and (63) by substitution. The derivative follows by calculation:
$$\frac{\partial}{\partial \hat h} L_W(\eta, W^{-1}(\hat h)) = \frac{\partial}{\partial \hat h}\big(\overline{W}^\star(\hat h) - \eta \cdot \hat h\big) = W^{-1}(\hat h) - \eta = \hat\eta - \eta.$$
Convexity follows from the fact that $\overline{W}^\star$ is convex (being the LF dual of the convex function $\overline{W}$) and the overall expression is the sum of this and a term linear in $\hat h$, and is thus convex.

Buja et al. [9] call $W$ the canonical link. We have already seen (Theorem 27) that the composite loss constructed using the canonical link is convex.
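As a numerical sanity check of Theorem 34 (ours, not part of the original development), the sketch below instantiates the duality with the log-loss weight $w(c) = \frac{1}{c(1-c)}$, for which $W$ is the logit link, $\overline{W}(c) = c\log c + (1-c)\log(1-c)$, and $\overline{W}^\star(v) = \log(1 + e^v)$; the choice of weight function and test points is our own.

```python
import math

# Check of Theorem 34, D_W(x, y) = D_{W^{-1}}(W(y), W(x)), for the
# log-loss weight w(c) = 1/(c(1-c)) (an example of our choosing). Here
#   W(c)        = logit(c) = log(c/(1-c))     (the canonical link)
#   Wbar(c)     = c*log(c) + (1-c)*log(1-c)   (so Wbar' = W)
#   Wbar*(v)    = log(1 + e^v)                (LF dual; (Wbar*)' = W^{-1})

def W(c):
    return math.log(c / (1 - c))

def W_inv(v):
    return 1 / (1 + math.exp(-v))

def Wbar(c):
    return c * math.log(c) + (1 - c) * math.log(1 - c)

def Wbar_star(v):
    return math.log(1 + math.exp(v))

def D_W(x, y):
    # Bregman divergence generated by Wbar (whose gradient is W)
    return Wbar(x) - Wbar(y) - (x - y) * W(y)

def D_W_inv(u, v):
    # Bregman divergence generated by Wbar* (whose gradient is W^{-1})
    return Wbar_star(u) - Wbar_star(v) - (u - v) * W_inv(v)

x, y = 0.3, 0.7
assert abs(D_W(x, y) - D_W_inv(W(y), W(x))) < 1e-12
```

Both sides evaluate the same regret between the probabilities 0.3 and 0.7, once in the probability domain and once in the (logit) link domain.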
C Convexity and Robustness

In this appendix we show how the characterisation of the convexity of proper losses (Theorem 29) allows one to make general, algorithm-independent statements about the robustness of convex proper losses to random misclassification noise.

Long and Servedio [31] have shown that boosting with convex potential functions (i.e., convex margin losses) is not robust to random class noise.[10] That is, such methods are susceptible to random class noise. In particular they present a very simple learning task which is "boostable" (it can be perfectly solved using a linear combination of base classifiers) but for which, in the presence of any amount of label noise, idealised, early-stopping and $L_1$-regularised boosting algorithms will learn a classifier with only 50% accuracy. This has led to the recent proposal of boosting algorithms that use non-convex margin losses, and experimental evidence suggests that these are more robust to class noise than their convex counterparts.

[10] We define exactly what we mean by robustness below. The notion that Long and Servedio [31] examine is akin to that studied for instance by Kearns [25]. There are many other meanings of "robust" which differ from the one we consider. The classical notion of robust statistics [22] is motivated by robustness to contamination of additive observation noise (some heavy-tailed noise mixed in with the Gaussian noise often assumed in designing estimators); there are some results about particular machine learning algorithms being robust in that sense [43]. "Robust" is also used to mean robustness with respect to random attribute noise [48], robustness to unknown prior class probabilities [37], or a Huber-style robustness to attribute noise ("outliers") for classification [13]. We study robustness only in the sense of random label noise.
Freund [14] recently described RobustBoost, which uses a parameterised family of non-convex surrogate losses that approximates the 0-1 loss as the number of boosting iterations increases. Experiments on a variant of the task proposed by Long and Servedio [31] show that RobustBoost is very insensitive to class noise. Masnadi-Shirazi and Vasconcelos [32] presented SavageBoost, a boosting algorithm built upon a non-convex margin function. They argued that even when the margin function is non-convex, the conditional risk may still be convex. We elucidate this via our characterisation of the convexity of composite losses.

Although all these results are suggestive, it is not clear from them whether robustness (or its absence) is a property of the loss function, of the algorithm, or of a combination of the two. We study that question by considering robustness in an algorithm-independent fashion.

For $\alpha \in (0, \frac{1}{2})$ and $\eta \in [0,1]$ we define $\eta_\alpha := \alpha(1-\eta) + (1-\alpha)\eta$ as the $\alpha$-corrupted version of $\eta$. This captures the idea that instead of drawing a positive label for the point $x$ with probability $\eta(x)$, there is a random class flip with probability $\alpha$. Since $\eta_\alpha$ is a convex combination of $\alpha$ and $1-\alpha$, it follows that $\eta_\alpha \in [\alpha, 1-\alpha]$. The effect of $\alpha$-corruption on the conditional risk of a loss can be seen as a transformation of the loss.

Lemma 36. If $\ell^\psi$ is any composite loss then its conditional risk satisfies
$$L^\psi(\eta_\alpha, v) = L^\psi_\alpha(\eta, v), \qquad \eta \in [0,1],\; v \in V,$$
where $\ell^\psi_\alpha(y, v) = (1-\alpha)\ell^\psi(y, v) + \alpha\ell^\psi(-y, v)$.

Proof. By simple algebraic manipulation we have
$$L^\psi(\eta_\alpha, v) = (1-\eta_\alpha)\ell^\psi(-1, v) + \eta_\alpha \ell^\psi(1, v)$$
$$= [(1-\alpha)(1-\eta) + \alpha\eta]\,\ell^\psi(-1, v) + [\alpha(1-\eta) + (1-\alpha)\eta]\,\ell^\psi(1, v)$$
$$= (1-\eta)[(1-\alpha)\ell^\psi(-1, v) + \alpha\ell^\psi(1, v)] + \eta[\alpha\ell^\psi(-1, v) + (1-\alpha)\ell^\psi(1, v)]$$
$$= (1-\eta)\ell^\psi_\alpha(-1, v) + \eta\,\ell^\psi_\alpha(1, v) = L^\psi_\alpha(\eta, v),$$
proving the result.
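Lemma 36 is easy to verify numerically. The sketch below (our illustration; the choice of log loss with identity link and of the parameter values is ours) computes both sides, corrupting the class probability on the left and the loss on the right:

```python
import math

# Check of Lemma 36: corrupting eta is the same as corrupting the loss,
# L(eta_alpha, v) = L_alpha(eta, v). Log loss with identity link is used
# as a concrete example.

def log_loss(y, v):
    """Log loss partial losses: l(1, v) = -log v, l(-1, v) = -log(1 - v)."""
    return -math.log(v) if y == 1 else -math.log(1 - v)

def cond_risk(eta, v, loss=log_loss):
    """Conditional risk L(eta, v) = (1-eta) l(-1, v) + eta l(1, v)."""
    return (1 - eta) * loss(-1, v) + eta * loss(1, v)

def corrupted_loss(alpha, loss=log_loss):
    """l_alpha(y, v) = (1 - alpha) l(y, v) + alpha l(-y, v)."""
    return lambda y, v: (1 - alpha) * loss(y, v) + alpha * loss(-y, v)

eta, alpha, v = 0.8, 0.1, 0.6
eta_a = alpha * (1 - eta) + (1 - alpha) * eta  # alpha-corrupted eta

lhs = cond_risk(eta_a, v)                       # L(eta_alpha, v)
rhs = cond_risk(eta, v, corrupted_loss(alpha))  # L_alpha(eta, v)
assert abs(lhs - rhs) < 1e-12
```

The same check goes through for any composite loss, since the proof of the lemma only rearranges the two partial-loss terms.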
In particular, if $\ell$ is strictly proper then $\ell_\alpha$ cannot be proper, because the minimiser of $L(\eta_\alpha, \cdot)$ is $\eta_\alpha$, and so $\eta_\alpha \ne \eta$ must also be the minimiser of $L_\alpha(\eta, \cdot)$. This suggests that strictly proper losses are not robust to any class noise.

C.1 Robustness Implies Non-convexity

We now define a general notion of robustness for losses for class probability estimation.

Definition 37. Given an $\alpha \in [0, \frac{1}{2})$, we will say a loss $\ell \colon \{-1, 1\} \times [0,1] \to \mathbb{R}$ is $\alpha$-robust at $\eta$ if the set of minimisers of the conditional risk for $\eta$ and the set of minimisers of the conditional risk for $\eta_\alpha$ have some common points.

That is, a loss is $\alpha$-robust for a particular $\eta$ if minimising the noisy conditional risk can potentially give an estimate that is also a minimiser of the non-noisy conditional risk. Formally, $\ell$ is $\alpha$-robust at $\eta$ when $H(\ell, \eta_\alpha) \cap H(\ell, \eta) \ne \emptyset$, where $H(\ell, \eta)$ is defined in (49).

Label noise is symmetric about $\frac{1}{2}$ and so the map $\eta \mapsto \eta_\alpha$ preserves the side of $\frac{1}{2}$ on which the values $\eta$ and $\eta_\alpha$ are found. That is, $\eta \le \frac{1}{2}$ if and only if $\eta_\alpha \le \frac{1}{2}$, for all $\alpha \in [0, \frac{1}{2})$. This means that 0-1 misclassification loss or, equivalently, $\ell_{\frac{1}{2}}$ is $\alpha$-robust for all $\eta$ and all $\alpha$. For other $c$, the range of $\eta$ for which $\ell_c$ is $\alpha$-robust is more limited.

Theorem 38. For each $c \in (0,1)$, the loss $\ell_c$ is $\alpha$-robust at $\eta$ if and only if $\eta \notin \left[\frac{c-\alpha}{1-2\alpha}, c\right)$ for $c < \frac{1}{2}$, or $\eta \notin \left[c, \frac{c-\alpha}{1-2\alpha}\right)$ for $c \ge \frac{1}{2}$.

Proof. By the definition of $L_c$ and $\llbracket\hat\eta < c\rrbracket = 1 - \llbracket\hat\eta \ge c\rrbracket$ we have
$$L_c(\eta, \hat\eta) = (1-\eta)c\,\llbracket\hat\eta \ge c\rrbracket + \eta(1-c)\,\llbracket\hat\eta < c\rrbracket = \eta(1-c) + (c-\eta)\llbracket\hat\eta \ge c\rrbracket.$$
Since $c - \eta$ is positive iff $c > \eta$, we see that $L_c(\eta, \hat\eta)$ is minimised for $\eta < c$ when $\hat\eta < c$, and for $\eta \ge c$ when $\hat\eta \ge c$. So $H(\ell_c, \eta) = [0, c)$ for $\eta < c$ and $H(\ell_c, \eta) = [c, 1]$ for $\eta \ge c$.
Since [0 , c ) and [ c, 1] are disjoint for all c ∈ [0 , 1] we see that H ( ` c , η ) and H ( ` c , η α ) coincide if and only if η , η α < c or η , η α ≥ c and are disjoint otherwise. W e pro ceed by cases. First, suppose c < 1 2 . F or η < c < 1 2 it is easy to sho w η α ≥ c iff η ≥ c − α 1 − 2 α and so ` c is not α -robust for η ∈ [ c − α 1 − 2 α , c ). F or c ≤ η w e see ` c m ust b e α -robust since η α < c iff η < c − α 1 − 2 α but c − α 1 − 2 α < c for c < 1 2 whic h is a contradiction. Th us, for c < 1 2 w e hav e ` c is α -robust iff η / ∈ [ c − α 1 − 2 α , c ). F or c > 1 2 the main differences are that c − α 1 − 2 α > c for c > 1 2 and η α < η for η > 1 2 . Th us, by a similar argumen t as ab o v e we see that ` c is α -robust iff η / ∈ [ c, c − α 1 − 2 α ). This theorem allo ws us to c haracterise the robustness of arbitrary proper losses b y app ealing to the in tegral representation in (11). 33 Lemma 39 If ` is a pr op er loss with weight function w then H ( `, η ) = T c : w ( c ) > 0 H ( ` c , η ) and so H ( `, η ) ∩ H ( `, η α ) = \ c : w ( c ) > 0 H ( ` c , η ) ∩ H ( ` c , η α ) . Pro of W e first sho w that H ( `, η ) ⊆ T c : w ( c ) > 0 H ( ` c , η ) by con tradiction. Assume there is an ˆ η ∈ H ( `, η ) but for which there is some c 0 suc h that w ( c 0 ) > 0 and ˆ η / ∈ H ( ` c 0 , η ). Then there is a ˆ η 0 ∈ H ( ` c 0 , η ) and ˆ η 0 ∈ H ( ` c ) for all other c for whic h w ( c ) > 0 (otherwise H ( `, η ) = { ˆ η } ). Thus, L c 0 ( η , ˆ η 0 ) < L c 0 ( η , ˆ η 0 ) and so R 1 0 L c ( η , ˆ η 0 ) w ( c ) dc < R 1 0 L c ( η , ˆ η ) w ( c ) dc since w ( c 0 ) > 0. No w supp ose ˆ η ∈ T c : w ( c ) > 0 H ( ` c , η ). That is, ˆ η is a minimiser of L c ( η , · ) for all c suc h that w ( c ) > 0 and therefore must also be a minimiser of L ( η , · ) = R 1 0 L c ( η , · ) w ( c ) dc and is therefore in H ( `, η ), proving the con verse. 
One consequence of this lemma is that if $w(c) > 0$ and $\ell_c$ is not $\alpha$-robust at $\eta$ then, by definition, $H(\ell_c, \eta) \cap H(\ell_c, \eta_\alpha) = \emptyset$ and so $\ell$ cannot be $\alpha$-robust at $\eta$. This means we have established the following theorem on the $\alpha$-robustness of an arbitrary proper loss in terms of its weight function.

Theorem 40. If $\ell$ is a proper loss with weight function $w$ then it is not $\alpha$-robust for any
$$\eta \in \bigcup_{c : w(c) > 0} \left[\frac{c-\alpha}{1-2\alpha}, c\right) \cup \left[c, \frac{c-\alpha}{1-2\alpha}\right).$$

By Corollary 30 we see that convex proper losses are strictly proper and thus have weight functions which are non-zero for all $c \in [0,1]$, and so by Theorem 40 we have the following corollary.

Corollary 41. If a proper loss is convex, then for all $\alpha \in (0, \frac{1}{2})$ it is not $\alpha$-robust at any $\eta \in [0,1]$.

At a high level, this result ("convexity implies non-robustness") appears to be logically equivalent to Long and Servedio's result that "robustness implies non-convexity". However, there are a few discrepancies that mean the two are not directly comparable. The definitions of robustness differ: we focus on the pointwise minimisation of conditional risk, as this is, ideally, what most risk minimisation approaches try to achieve. However, this means that the robustness of ERM with regularisation or restricted function classes is not directly captured by our definition, whereas Long and Servedio analyse this latter case directly. In our definition the focus is on probability estimation robustness, while the earlier work is focussed on classification accuracy; our work could be extended to this case by analysing $H(\ell, \eta) \cap H(\ell_{\frac{1}{2}}, \eta)$. Additionally, their work restricts attention to the robustness of boosting algorithms that use convex potential functions, whereas our analysis is not tied to any specific algorithm.
By restricting their attention to a specific learning task and class of functions they are able to show a very strong result: that convex losses for boosting lead to arbitrarily bad performance with arbitrarily little noise. Also, our focus on proper losses excludes some convex losses (such as the hinge loss) that are covered by Long and Servedio's results.

Finally, it is worth noting that there are non-convex loss functions that are strictly proper and so are not robust in the sense used here. That is, the converse of Corollary 41 is not true. For example, any loss with a weight function that sits above 0 but outside the shaded region in Figure 2 will be non-convex and non-robust. This suggests that the arguments made by Masnadi-Shirazi and Vasconcelos [32] and Freund [14] for the robustness of non-convex losses need further investigation.

References

[1] Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. March 2009. URL http://arxiv.org/abs/0903.5328.
[2] F.R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713-1741, 2006.
[3] Drumi Bainov and Pavel Simeonov. Integral Inequalities and Applications. Kluwer, Dordrecht, 1992.
[4] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705-1749, 2005.
[5] Peter J. Bartlett, Bernhard Schölkopf, Dale Schuurmans, and Alexander J. Smola, editors. Advances in Large-Margin Classifiers. MIT Press, 2000.
[6] P.L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775-790, 2007.
[7] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138-156, March 2006.
[8] Alina Beygelzimer, John Langford, and Bianca Zadrozny. Machine learning techniques: reductions between prediction quality metrics. In Zhen Liu and Cathy H. Xia, editors, Performance Modeling and Engineering, pages 3-28. Springer US, April 2008. URL http://hunch.net/~jl/projects/reductions/tutorial/paper/chapter.pdf.
[9] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005.
[10] Ira Cohen and Moises Goldszmidt. Properties and benefits of calibrated classifiers. Technical Report HPL-2004-22(R.1), HP Laboratories, Palo Alto, July 2004.
[11] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[12] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, volume 17, pages 973-978, 2001.
[13] S. Fidler, D. Skočaj, and A. Leonardis. Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):337-350, 2006.
[14] Yoav Freund. A more robust boosting algorithm. arXiv:0905.2138v1 [stat.ML], May 2009.
[15] Tilmann Gneiting. Evaluating point forecasts. arXiv:0912.0902v1, December 2009.
[16] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, March 2007.
[17] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367-1433, 2004.
[18] D.J. Hand.
Deconstructing statistical questions. Journal of the Royal Statistical Society, Series A (Statistics in Society), 157(3):317-356, 1994.
[19] D.J. Hand and V. Vinciotti. Local versus global models for classification problems: Fitting models where it matters. The American Statistician, 57(2):124-131, 2003.
[20] D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10:1291-1304, 1999.
[21] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer, Berlin, 2001.
[22] P.J. Huber. Robust Statistics. Wiley, New York, 1981.
[23] Y. Kalnishkan, V. Vovk, and M.V. Vyugin. Loss functions, complexities, and the Legendre transformation. Theoretical Computer Science, 313(2):195-207, 2004.
[24] Yuri Kalnishkan, Vladimir Vovk, and Michael Vyugin. Generalised entropy and asymptotic complexities of languages. In Learning Theory, volume 4539 of Lecture Notes in Computer Science, pages 293-307. Springer, 2007.
[25] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983-1006, November 1998.
[26] Jyrki Kivinen and Manfred K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45:301-329, 2001.
[27] D.E. Knuth. Two notes on notation. American Mathematical Monthly, pages 403-422, 1992.
[28] Nicolas Lambert, David Pennock, and Yoav Shoham. Eliciting properties of probability distributions. In Proceedings of the ACM Conference on Electronic Commerce, pages 129-138, 2008.
[29] John Langford and Bianca Zadrozny. Estimating class membership probabilities using classifier learners. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS'05), 2005.
[30] Yi Lin. A note on margin-based loss functions in classification.
Technical Report 1044, Department of Statistics, University of Wisconsin, Madison, February 2002.
[31] Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, pages 608-615, 2008. doi: 10.1145/1390156.1390233.
[32] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1049-1056. 2009.
[33] P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.
[34] Richard Nock and Frank Nielsen. Bregman divergences and surrogates for learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009. To appear.
[35] Richard Nock and Frank Nielsen. On the efficient minimization of classification calibrated surrogates. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1201-1208. MIT Press, 2009.
[36] J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61-71. MIT Press, 2000.
[37] F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203-231, 2001.
[38] Mark D. Reid and Robert C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the International Conference on Machine Learning, pages 897-904, 2009.
[39] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. arXiv preprint arXiv:0901.0356v1, January 2009.
[40] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[41] Leonard J. Savage.
Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783-801, 1971.
[42] M.J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 17(4):1856-1879, 1989.
[43] B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
[44] Yi Shen. Loss Functions for Binary Classification and Class Probability Estimation. PhD thesis, Department of Statistics, University of Pennsylvania, October 2005.
[45] E. Shuford, A. Albert, and H.E. Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125-145, June 1966.
[46] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, August 2007.
[47] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, 2008.
[48] T.B. Trafalis and R.C. Gilbert. Robust classification and regression using support vector machines. European Journal of Operational Research, 173(3):893-909, 2006.
[49] J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159-195, 2004.
[50] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56-134, 2004.
[51] Z. Zhang, M.I. Jordan, W.J. Li, and D.Y. Yeung. Coherence functions for multicategory margin-based classification methods. In Proceedings of the Twelfth Conference on Artificial Intelligence and Statistics (AISTATS), 2009.