Technical Note: Towards ROC Curves in Cost Space
José Hernández-Orallo (jorallo@dsic.upv.es), Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, Spain
Peter Flach (Peter.Flach@bristol.ac.uk), Intelligent Systems Laboratory, University of Bristol, United Kingdom
Cèsar Ferri (cferri@dsic.upv.es), Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, Spain

November 11, 2021

Abstract

ROC curves and cost curves are two popular ways of visualising classifier performance, finding appropriate thresholds according to the operating condition, and deriving useful aggregated measures such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this note we present some new findings and connections between ROC space and cost space, by using the expected loss over a range of operating conditions. In particular, we show that ROC curves can be transferred to cost space by means of a very natural way of understanding how thresholds should be chosen: by selecting the threshold such that the proportion of positive predictions equals the operating condition (either in the form of cost proportion or skew). We call these new curves ROC cost curves, and we demonstrate that the expected loss as measured by the area under these curves is linearly related to AUC. This opens up a series of new possibilities and clarifies the notion of cost curve and its relation to ROC analysis. In addition, we show that for a classifier that assigns its scores in an evenly-spaced way, these curves are equal to the Brier curves. As a result, this establishes the first clear connection between AUC and the Brier score.

Keywords: cost curves, ROC curves, Brier curves, classifier performance measures, cost-sensitive evaluation, operating condition, Brier score, Area Under the ROC Curve (AUC).
1 Introduction

There are many graphical representations and tools for classifier evaluation, such as ROC curves [17, 7], ROC isometrics [9], cost curves [4, 5], DET curves [13], lift charts [15], and calibration maps [3], among others. In this paper we focus on ROC curves and cost curves. These are often considered to be two sides of the same coin, where a point in ROC space corresponds to a line in cost space. However, this is only true up to a point, as a curve in ROC space has no corresponding representation in cost space. It is true that the convex hull of a ROC curve corresponds to the lower envelope of all the cost lines, but this lower envelope is not the ROC curve itself. In fact, the area under this lower envelope has no clear connection with AUC. As a result, cost space cannot be used in the same way as ROC space, and each representation has some advantages over the other.

One of the issues with this lack of full correspondence is that the definition of a cost curve has been rather vague in the literature. On some occasions, only the cost lines are formally defined [5], and the curve is simply taken to be the lower envelope of all these lines. However, this assumes that threshold choices are optimal, which is not generally the case. This curve is what we call here 'the optimal cost curve' (frequently referred to in the literature as 'the cost curve'). It is worth mentioning that Drummond and Holte [5] talk about 'selection criteria' (instead of 'threshold choice methods'); they distinguish between 'performance-independent selection criteria' and 'cost-minimizing selection criteria', and they show some curves using different 'selection criteria'. However, they do not develop these ideas further and do not use them to generalise the notion of cost curve. In previous work, we have generalised and systematically developed the concept of threshold choice method.
For instance, in [10] we explored a new instance-uniform threshold choice method, while in [12] we explored the probabilistic threshold choice method. In [8] we analysed this in general, leading to a total of six threshold choice methods and their corresponding measures.

In this paper, we are interested in how all this can be plotted in cost space and, in particular, we analyse a new threshold choice method which sets the threshold such that the proportion (or rate) of positive predictions equals the operating condition (cost proportion). This leads to a cost curve where all the segments have equal length in terms of their projection onto the x-axis. In other words, each segment covers a range of cost proportions of equal length. A first graphical analysis of this curve indicates that each segment corresponds to a point in ROC space, and its position with respect to the optimal cost curve gives virtually the same information as the ROC curve does. Consequently, we call this new curve the ROC cost curve. It can also be interpreted as a cost-based analysis of rankers. Further analysis of this curve shows that the area under the ROC cost curve is a linear function of AUC, doubly justifying the name given to this curve, its interpretation and its applications.

The paper is organised as follows. Section 2 introduces some basic notation and definitions. In Section 3 we discuss the relation between the ROC convex hull and the optimal cost curve. Section 4 introduces one of the contributions of the paper by using a threshold choice method which leads to the ROC cost curves. It also explains how these curves can be plotted easily and what their interpretation is. Section 5 shows that the area under this curve is a linear function of AUC, and demonstrates the correspondence for some typical cases (random classifier, perfect classifier, worst classifier). Section 6 analyses the case where a classifier assigns its scores in an evenly-spaced way.
In this case, it turns out that the area under the ROC cost curve is exactly the Brier score. Section 7 closes the paper with some conclusions and future work.

2 Notation and basic definitions

In this section we introduce some basic notation and the notions of ROC curves, cost curves and the way expected loss is aggregated using a threshold choice method. Most of this section is reused from [8].

2.1 Notation

We denote by $U_S(x)$ the continuous uniform distribution of variable $x$ over an interval $S \subset \mathbb{R}$. If this interval $S$ is $[0,1]$ then $S$ can be omitted. Examples (also called instances) are taken from an instance space. The instance space is denoted $X$ and the output space $Y$. Elements in $X$ and $Y$ will be referred to as $x$ and $y$ respectively. In this paper we assume binary classifiers, i.e., $Y = \{0, 1\}$. A crisp or categorical classifier is a function that maps examples to classes. A probabilistic classifier is a function $m : X \to [0,1]$ that maps examples to estimates $\hat{p}(1|x)$ of the probability that example $x$ is of class 1. A scoring classifier is a function $m : X \to \mathbb{R}$ that maps examples to real numbers on an unspecified scale, such that scores are monotonically related to $\hat{p}(1|x)$. In order to make predictions in the $Y$ domain, a probabilistic or scoring classifier can be converted to a crisp classifier by fixing a decision threshold $t$ on the scores. Given a predicted score $s = m(x)$, the instance $x$ is classified in class 1 if $s > t$, and in class 0 otherwise.

For a given, unspecified classifier and population from which data are drawn, we denote the score density for class $k$ by $f_k$ and the cumulative distribution function by $F_k$. Thus, $F_0(t) = \int_{-\infty}^{t} f_0(s)\,ds = P(s \leq t \mid 0)$ is the proportion of class 0 points correctly classified if the decision threshold is $t$, which is the sensitivity or true positive rate at $t$.
Similarly, $F_1(t) = \int_{-\infty}^{t} f_1(s)\,ds = P(s \leq t \mid 1)$ is the proportion of class 1 points incorrectly classified as 0, or the false positive rate at threshold $t$; $1 - F_1(t)$ is the true negative rate or specificity.¹

Given a data set $D \subset \langle X, Y \rangle$ of size $n = |D|$, we denote by $D_k$ the subset of examples in class $k \in \{0,1\}$, and set $n_k = |D_k|$ and $\pi_k = n_k/n$. We will use the term class proportion for $\pi_0$ (other terms such as 'class ratio' or 'class prior' have been used in the literature). The average score of class $k$ is $\bar{s}_k = \frac{1}{n_k} \sum_{\langle x, y \rangle \in D_k} m(x)$. Given any strict order for a data set of $n$ examples, we will use the index $i$ on that order to refer to the $i$-th example. Thus, $s_i$ denotes the score of the $i$-th example and $y_i$ its true class. We use $I$ to denote the set of indices, i.e. $I = 1..n$.

Given a data set and a classifier, we can define empirical score distributions, for which we will use the same symbols as the population functions. We then have $f_k(s) = \frac{1}{n_k} |\{\langle x, y \rangle \in D_k \mid m(x) = s\}|$, which is non-zero in only $n'_k$ points, where $n'_k \leq n_k$ is the number of unique scores assigned to instances in $D_k$ (when there are no ties, we have $n'_k = n_k$). Furthermore, the cumulative distribution functions $F_k(t) = \sum_{s \leq t} f_k(s)$ are piecewise constant with $n'_k + 1$ segments. $F_0$ is the sensitivity and $1 - F_1$ is the specificity: $F_0(t)$ can be seen as the proportion of examples of class 0 which are correctly classified if the threshold is set at $t$ and, conversely, $1 - F_1(t)$ as the proportion of examples of class 1 which are correctly classified if the threshold is set at $t$.

2.2 Operating conditions and overall loss

When a classification model is applied, the conditions or context might be different to those in place during its training. In fact, a classifier can be used in several contexts, with different results.
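As a concrete illustration of the empirical distributions $F_k$ just defined, the following Python sketch computes $F_0$ and $F_1$ at a threshold for the 15-example classifier used later in Figure 1 (the data are copied from that figure; the function and variable names are ours):

```python
# Empirical class-conditional score CDFs F_k(t) = P(s <= t | class k),
# illustrated on the 15-example classifier of the paper's Figure 1.
# (Data from the paper; helper names are ours.)

scores  = [0.95, 0.90, 0.90, 0.85, 0.70, 0.70, 0.70, 0.55,
           0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05]
classes = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # 0 = positive class

def F(k, t):
    """Proportion of class-k examples with score <= t (empirical CDF)."""
    sk = [s for s, y in zip(scores, classes) if y == k]
    return sum(s <= t for s in sk) / len(sk)

# At t = 0.92 all 11 class-0 examples score at or below the threshold
# (F0 = 11/11), while 3 of the 4 class-1 examples do (F1 = 3/4).
F0, F1 = F(0, 0.92), F(1, 0.92)
```

These are the true and false positive rates used throughout: predicting class 0 whenever $s \leq t$.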
A context can imply different class proportions, different costs over examples (either for the attributes, for the class, or any other kind of cost), or some other details about the effects that the application of a model might entail and the severity of its errors. In practice, an operating condition or deployment context is usually defined by a misclassification cost function and a class distribution. Clearly, there is a difference between operating when the cost of misclassifying 0 into 1 is equal to the cost of misclassifying 1 into 0 and doing so when the former is ten times the latter. Similarly, operating when classes are balanced is different from when there is an overwhelming majority of instances of one class.

One general approach to cost-sensitive learning assumes that the cost does not depend on the example but only on its class. In this way, misclassification costs are usually simplified by means of cost matrices, where we can express that some misclassification costs are higher than others [6]. Typically, the costs of correct classifications are assumed to be 0.² This means that for binary classifiers we can describe the cost matrix by two values $c_k \geq 0$, representing the misclassification cost of an example of class $k$. Additionally, we can normalise the costs by setting $b = c_0 + c_1$ and $c = c_0/b$; we will refer to $c$ as the cost proportion. Since this can also be expressed as $c = (1 + c_1/c_0)^{-1}$, it is often called 'cost ratio', even though, technically, it is a proportion ranging between 0 and 1. Because of the dependency between $b$ and $c$,³ there is just one degree of freedom, and we can hold one of them constant.

¹We use 0 for the positive class and 1 for the negative class, but scores increase with $\hat{p}(1|x)$. That is, a ranking from strongest positive prediction to strongest negative prediction has non-decreasing scores. This is the same convention as used by, e.g., [11].
Consequently, choosing $b$ constant, we see that it only affects the magnitude of the costs and is independent of the classifier. We set $b = 2$ so that loss is commensurate with error rate (which assumes $c_0 = c_1 = 1$). The loss produced at a decision threshold $t$ and a cost proportion $c$ is then given by:

$$Q_c(t; c) \triangleq c_0 \pi_0 (1 - F_0(t)) + c_1 \pi_1 F_1(t) = 2\{c \pi_0 (1 - F_0(t)) + (1 - c) \pi_1 F_1(t)\} \quad (1)$$

We are often interested in analysing the influence of class proportion and cost proportion at the same time. Since the relevance of $c_0$ increases with $\pi_0$, an appropriate way to consider both at the same time is through the definition of skew, which is a normalisation of their product:

$$z \triangleq \frac{c_0 \pi_0}{c_0 \pi_0 + c_1 \pi_1} = \frac{c \pi_0}{c \pi_0 + (1 - c)(1 - \pi_0)} \quad (2)$$

It follows that $c = \frac{z \pi_1}{z \pi_1 + (1 - z)(1 - \pi_1)}$. From Eq. (1) we obtain

$$\frac{Q_c(t; c)}{c_0 \pi_0 + c_1 \pi_1} = z(1 - F_0(t)) + (1 - z) F_1(t) \triangleq Q_z(t; z) \quad (3)$$

This gives an expression for loss at a threshold $t$ and a skew $z$. We will assume that the operating condition is defined either by the cost proportion (using a fixed class distribution) or by the skew. We then have the following simple but useful result.

Lemma 1. If $\pi_0 = \pi_1$ then $z = c$ and $Q_z(t; z) = \frac{2}{b} Q_c(t; c)$.

Proof. If classes are balanced we have $c_0 \pi_0 + c_1 \pi_1 = b/2$, and the result follows from Eq. (2) and Eq. (3).

This further justifies taking $b = 2$, which means that $Q_z$ and $Q_c$ are expressed on the same 0-1 scale and, as said above, are also commensurate with error rate, which assumes $c_0 = c_1 = 1$. The upshot of Lemma 1 is that we can transfer any expression for loss in terms of cost proportion to an equivalent expression in terms of skew by simply setting $\pi_0 = \pi_1 = 1/2$ and $z = c$.

In many real problems, when we have to evaluate or compare classifiers, we do not know the cost proportion or skew that will apply at deployment time.
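The relations in Eqs. (1)-(3) can be checked numerically. A minimal sketch (variable names are ours), using an operating point and a ROC point taken from the Figure 1 classifier, verifies that $Q_z(t;z) = Q_c(t;c)/(c_0\pi_0 + c_1\pi_1)$:

```python
# Loss at a threshold under a cost proportion c (Eq. 1, with b = 2) and under
# a skew z (Eq. 3), checking that Q_z = Q_c / (c0*pi0 + c1*pi1) numerically.
# F0t, F1t are the rates F_0(t), F_1(t) at some threshold t; names are ours.

def q_c(c, F0t, F1t, pi0):
    """Q_c(t; c) = 2{c*pi0*(1 - F0(t)) + (1 - c)*pi1*F1(t)} with b = 2."""
    return 2 * (c * pi0 * (1 - F0t) + (1 - c) * (1 - pi0) * F1t)

def q_z(z, F0t, F1t):
    """Q_z(t; z) = z*(1 - F0(t)) + (1 - z)*F1(t)."""
    return z * (1 - F0t) + (1 - z) * F1t

# Example operating point: c = 0.3, pi0 = 11/15, and a threshold giving
# F0 = 10/11, F1 = 2/4 (one of the ROC points of Figure 1).
c, pi0 = 0.3, 11 / 15
F0t, F1t = 10 / 11, 2 / 4
pi1 = 1 - pi0
z = c * pi0 / (c * pi0 + (1 - c) * pi1)   # Eq. (2)
norm = 2 * (c * pi0 + (1 - c) * pi1)      # c0*pi0 + c1*pi1 with c0=2c, c1=2(1-c)
lhs, rhs = q_z(z, F0t, F1t), q_c(c, F0t, F1t, pi0) / norm
```

Both sides evaluate to the same loss, as Eq. (3) requires.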
One general approach is to evaluate the classifier over a range of possible operating points. In order to do this, we have to set a weight or distribution on cost proportions or skews. In this paper, we will consider the continuous uniform distribution $U$.

A key issue when applying a classifier to several operating conditions is how the threshold is chosen in each of them. If we work with a crisp classifier, this question vanishes, since the threshold is already settled. However, in the general case, when we work with a soft probabilistic classifier, we have to decide how to establish the threshold. The crucial idea explored in this paper is the notion of threshold choice method: a function $T(c)$ or $T(z)$ which converts an operating condition (cost proportion or skew) into an appropriate threshold for the classifier.

There are several reasonable options for the function $T$. We can set a fixed threshold for all operating conditions; we can set the threshold by looking at the ROC curve (or its convex hull) and using the cost proportion or the skew to intersect the ROC curve (as ROC analysis does); we can set a threshold by looking at the estimated scores, especially when they represent probabilities; or we can set a threshold independently of the ranks or the scores. The way in which we set the threshold may dramatically affect performance. And, no less importantly, the performance measure used for evaluation must be in accordance with the threshold choice method.

²Not doing so, or considering only one of the correct classifications to have 0 cost, will lead to results which differ from the simplified setting by a constant term or factor, as happens with the model for cost-loss ratio used by Murphy in [14].

³Hand [11, p. 115] assumes $b$ and $c$ to be independent, and hence considers $b$ not necessarily a constant. However, in the end, he also assumes that the result is only affected by a constant factor.
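For the simplest of these options, a fixed threshold $T(c) = t$, the average of $Q_c(t;c)$ over uniformly distributed cost proportions has the closed form $\pi_0(1 - F_0(t)) + \pi_1 F_1(t)$, obtained by integrating Eq. (1) over $c$ (since $\int_0^1 2c\,dc = \int_0^1 2(1-c)\,dc = 1$). A small sketch (names ours) checks this numerically:

```python
# Average of Q_c(t; c) over c ~ U[0,1] for a *fixed* threshold t, checked
# against its closed form pi0*(1 - F0(t)) + pi1*F1(t). Names are ours.

def q_c(c, F0t, F1t, pi0):
    """Eq. (1) with b = 2."""
    return 2 * (c * pi0 * (1 - F0t) + (1 - c) * (1 - pi0) * F1t)

def avg_loss_fixed(F0t, F1t, pi0, steps=1000):
    """Midpoint-rule integration of Q_c over c in [0,1]
    (exact here, since Q_c is linear in c)."""
    return sum(q_c((i + 0.5) / steps, F0t, F1t, pi0)
               for i in range(steps)) / steps

# An operating point of the Figure 1 classifier: F0 = 10/11, F1 = 2/4.
F0t, F1t, pi0 = 10 / 11, 2 / 4, 11 / 15
numeric = avg_loss_fixed(F0t, F1t, pi0)
closed  = pi0 * (1 - F0t) + (1 - pi0) * F1t
```

The agreement of `numeric` and `closed` illustrates how a threshold choice method, once fixed, turns the loss of Eq. (1) into a single aggregate number over operating conditions.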
From this interpretation, Adams and Hand [1] suggest setting a distribution over the set of possible operating points and integrating over them. In this way, we can define the overall or average expected loss over a range of situations as follows:

$$L_c \triangleq \int_0^1 Q_c(T_c(c); c)\, w_c(c)\, dc \quad (4)$$

where $Q_c(t; c)$ is the expected cost for threshold $t$ as seen above, $T_c$ is a threshold choice method, which maps cost proportions to thresholds, and $w_c(c)$ is a distribution for costs in $[0,1]$. Clearly, any performance measure which attempts to measure average expected cost over a wide range of operating conditions depends on two things: first, the distribution $w_c(c)$ that we use to weight the range of conditions; second, the threshold choice method $T_c$. Additionally, we can define this overall or average expected cost to be independent of the class priors, by defining a similar construction for skews instead of costs:

$$L_z \triangleq \int_0^1 Q_z(T_z(z); z)\, w_z(z)\, dz \quad (5)$$

If we plot $Q_c$ or $Q_z$ against $c$ or $z$ respectively, we get a plot space known as cost plots or cost curves, as we will illustrate below. Cost curves are also known as risk curves (see, e.g., [16], where the plot can also be shown in terms of priors, i.e. class proportions). So a cost curve as a function of $z$ in our notation is simply:

$$CC_z(z) \triangleq Q_z(T(z); z) \quad (6)$$

and similarly for cost proportions. Note that it is the threshold choice method $T$ which can draw a different curve for the same classifier.

2.3 Some common plots and measures

In what follows, we introduce some common evaluation measures: the Brier score, the ROC space and the Area Under the ROC curve (AUC). In the following section we also introduce the convex hull and the optimal cost curve.

The Brier score is a well-known evaluation measure for probabilistic classifiers. It is an alternative name for the Mean Squared Error or MSE loss [2], especially for binary classification.
$BS(m, D)$ is the Brier score of classifier $m$ on data $D$; we will usually omit $m$ and $D$ when clear from the context. We define $BS_k(m, D) = BS(m, D_k)$. $BS$ is defined as follows:

$$BS \triangleq \frac{1}{n} \sum_{i=1}^{n} (s_i - y_i)^2 = \pi_0 BS_0 + \pi_1 BS_1 \quad (7)$$

where $s_i$ is the score predicted for example $i$ and $y_i$ is the true class of example $i$. The corresponding population quantities are $BS_0 = \int_0^1 s^2 f_0(s)\,ds$ and $BS_1 = \int_0^1 (1 - s)^2 f_1(s)\,ds$.

The ROC curve [17, 7] is defined as a plot of $F_1(t)$ (i.e., false positive rate at decision threshold $t$) on the x-axis against $F_0(t)$ (true positive rate at $t$) on the y-axis, with both quantities monotonically non-decreasing with increasing $t$ (remember that scores increase with $\hat{p}(1|x)$ and 1 stands for the negative class). Figure 1 (leftmost, solid line) shows a ROC curve for a classifier with 4 examples of class 1 and 11 examples of class 0. Because of ties, there are 11 distinct scores and hence 11 segments in the ROC curve. From a ROC curve, we can derive the Area Under the ROC curve (AUC) as:

$$AUC \triangleq \int_0^1 F_0(s)\, dF_1(s) = \int_{-\infty}^{+\infty} F_0(s) f_1(s)\, ds = \int_{-\infty}^{+\infty} \int_{-\infty}^{s} f_0(t) f_1(s)\, dt\, ds \quad (8)$$
$$= \int_0^1 (1 - F_1(s))\, dF_0(s) = \int_{-\infty}^{+\infty} (1 - F_1(s)) f_0(s)\, ds = \int_{-\infty}^{+\infty} \int_s^{+\infty} f_1(t) f_0(s)\, dt\, ds$$

When dealing with empirical distributions, the integrals are replaced by sums.

3 The optimal cost curve

Given a scoring (or soft) classifier, one approach for choosing a classification threshold is to assume that (1) we have complete information about the operating condition (class proportions and costs), and (2) we are able to use that information to choose the threshold that minimises the cost of the current classifier. ROC analysis is precisely based on these two assumptions and, as we have seen, using the skew and the convex hull, we can calculate the threshold which gives the smallest loss (for the training set).
This threshold choice method, denoted by $T^o_c$, is:

$$T^o_c(c) \triangleq \arg\min_t \{Q_c(t; c)\} = \arg\min_t 2\{c \pi_0 (1 - F_0(t)) + (1 - c) \pi_1 F_1(t)\} \quad (9)$$

which matches the optimal threshold for a given skew $z$:

$$T^o_z(z) \triangleq \arg\min_t \{Q_z(t; z)\} = T^o_c(c) \quad (10)$$

This threshold choice method gives the convex hull in ROC space. The convex hull of a ROC curve (ROCCH) is a construction over the ROC curve such that all the points on the ROCCH have minimum loss for some choice of $c$ or $z$. This means that we restrict attention to the optimal threshold for a given cost proportion $c$. Note that the $\arg\min$ will typically give a range (interval) of values which yield the same optimal value. The convex hull is defined by the points $(F_1(t), F_0(t))$ where $t = T^o_c(c)$ for some $c$; to make a hull, the remaining points are linearly interpolated (pairwise). All this is shown in Figure 1 (leftmost). The Area Under the ROCCH (denoted by AUCH) can be computed in the same way as AUC, with modified versions of $f_k$ and $F_k$. Obviously, $AUCH \geq AUC$, with equality implying that the ROC curve is convex.

A cost plot as defined by [5] has $Q_z(t; z)$ on the y-axis against skew $z$ on the x-axis (Drummond and Holte use the term 'probability cost' rather than skew). Since $Q_z(t; z) = z(1 - F_0(t)) + (1 - z) F_1(t)$, cost lines for a given decision threshold $t$ are straight lines $Q_z = a_0 + a_1 z$ with intercept $a_0 = F_1(t)$ and slope $a_1 = 1 - F_0(t) - F_1(t)$. A cost line visualises how the cost at that threshold changes between $F_1(t)$ for $z = 0$ and $1 - F_0(t)$ for $z = 1$. From the whole set of cost lines, we can choose line segments, and by connecting them piecewise we obtain a 'hybrid cost curve' [5]. One way of choosing these segments is by considering the optimal threshold.
Hence, the optimal or minimum cost curve is the lower envelope of all the cost lines, obtained by considering only the optimal threshold (the lowest cost line) for each skew. The cost curve for this optimal choice is given by instantiating equation (6) with the optimal threshold choice method. Namely, for skews, we have:

$$CC^o_z(z) \triangleq Q_z(T^o_z(z); z) \quad (11)$$

Figure 1: Several graphical representations for the classifier with probability estimates (0.95, 0.90, 0.90, 0.85, 0.70, 0.70, 0.70, 0.55, 0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05) and classes (1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0). Left: ROC curve (solid) and convex hull (dashed). Middle: cost lines and optimal cost curve against cost proportions. Right: cost lines and optimal cost curve against skews.

Following the classifier and the ROC curve shown in Figure 1 (leftmost), we also show the optimal cost curve (rightmost) for that classifier. We observe 7 segments in the original ROC curve on the left, and 5 segments in its convex hull. These 5 segments correspond to the 5 points in the optimal cost curve on the right. The optimal cost curve is 'constructed' as the lower envelope of the 12 cost lines (one more than the number of distinct scores).

The middle plot in Figure 1 is an alternative cost plot with cost proportion rather than skew on the x-axis. That is, here the cost lines are straight lines $Q_c = a'_0 + a'_1 c$ with intercept $a'_0 = 2\pi_1 F_1(t)$ and slope $a'_1 = 2\pi_0(1 - F_0(t)) - 2\pi_1 F_1(t)$. We can clearly observe the class imbalance.
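The lower envelope just described is easy to compute directly: evaluate every cost line at the skew of interest and take the minimum. A sketch for the Figure 1 classifier (helper names are ours; one candidate threshold per gap between distinct scores, plus one below and one at the maximum):

```python
# Optimal cost curve value at a skew z: the minimum over all cost lines
# Q_z = z*(1 - F0(t)) + (1 - z)*F1(t), one line per distinct threshold.
# Data from the paper's Figure 1; helper names are ours.

scores  = [0.95, 0.90, 0.90, 0.85, 0.70, 0.70, 0.70, 0.55,
           0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05]
classes = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

s0 = [s for s, y in zip(scores, classes) if y == 0]
s1 = [s for s, y in zip(scores, classes) if y == 1]

def rates(t):
    """(F0(t), F1(t)) for threshold t: predict class 0 iff score <= t."""
    return (sum(s <= t for s in s0) / len(s0),
            sum(s <= t for s in s1) / len(s1))

def envelope(z):
    """Minimum Q_z over the 12 candidate thresholds (one per distinct-score
    gap, plus one below all scores and one at the maximum score)."""
    ds = sorted(set(scores))
    cands = [ds[0] - 1] + [(a + b) / 2 for a, b in zip(ds, ds[1:])] + [ds[-1]]
    losses = []
    for t in cands:
        F0, F1 = rates(t)
        losses.append(z * (1 - F0) + (1 - z) * F1)
    return min(losses)
```

For instance, `envelope(0.8)` returns the minimum loss 0.15 that the worked example in the next paragraph derives by hand.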
For the classifier shown in Figure 1, if we are given an extreme skew $z = 0.8$, we know that any threshold between 0.90 and 0.95 will be optimal, since it will classify example 15 as negative (class 1) and the rest as positive (class 0). Such a cutpoint (e.g. $t = 0.92 \in [0.90, 0.95]$) gives $F_0 = 11/11$ and $F_1 = 3/4$, and minimises the loss for this skew, as given by Eq. (3): $Q_z(0.92; 0.8) = 0.8 \cdot (1 - 11/11) + (1 - 0.8) \cdot (3/4) = 0.15$. Another cutpoint, e.g. $t = 0.85$, gives $F_0 = 10/11$ and $F_1 = 2/4$, with a higher $Q_z(0.85; 0.8) = 0.8 \cdot (1 - 10/11) + (1 - 0.8) \cdot (2/4) \approx 0.17$.

We may be interested in calculating the area under this optimal cost curve. If we use skews, we can derive:

$$L^o_z \triangleq \int_0^1 Q_z(T^o_z(z); z)\, w_z(z)\, dz \quad (12)$$

This equation is exactly the TEC ('Total Expected Cost') given by Drummond and Holte ([5], page 106, bottom). Drummond and Holte use the term 'probability times cost' for skew (or simply, and somewhat misleadingly, 'probability cost'). The distribution of probability costs is denoted by $prob(x)$ ($w_z(z)$ in our notation), for which Drummond and Holte choose the uniform distribution, i.e.:

$$L^o_{U(z)} \triangleq \int_0^1 Q_z(T^o_z(z); z)\, U(z)\, dz \quad (13)$$

This expression is just the area under the optimal cost curve. In Drummond and Holte's words: "The area under a cost curve is the expected cost of the classifier assuming all possible probability cost values are equally likely, i.e. that $prob(x)$ is the uniform distribution."

The problem with all this is that we are not always given full information about the operating condition. In fact, even with that information, there are perfect techniques (namely ROC analysis) to get the optimal threshold for a data set (e.g. the training or validation data set), but this does not ensure that these choices are going to be optimal for a test set.
Consequently, evaluating classifiers in this way relies on a strong assumption. Additionally, how close the estimated optimal threshold is to the actual optimal threshold may depend on the classifier as well. One option is to consider confidence bands, but another option is simply to drop this assumption.

4 The ROC cost curve

The easiest way to choose the threshold is to set it independently of the classifier and also of the operating condition. This mechanism can set the threshold in an absolute or a relative way. The absolute way, as explored in [8], just sets $T(c) = t$ (or, for skews, $T(z) = t$), with $t$ a fixed threshold. A simple variant of the fixed threshold is to consider that it is not the absolute value of the threshold which is fixed, but a relative rate or proportion $r$ over the data set. In other words, this method fixes the proportion of positive predictions given by the threshold. For example, we could say that our threshold is fixed so as to predict 30% positives and the rest negatives. This of course involves ranking the examples by their scores and setting the threshold at the appropriate position. We develop this idea for cost proportions below.

4.1 The ROC cost curve for cost proportions

The definition of the rate-fixed threshold choice method for costs is as follows:

$$T^q_c[r](c) \triangleq \{t : P(s < t) = r\} \quad (14)$$

In other words, we choose the threshold such that the probability that a score is lower than the threshold, i.e., the positive prediction rate, is $r$. In the example in Figure 1, any value in the interval $[0.2, 0.3]$ makes the probability (or proportion) of the score being lower than that value equal to $2/6 \approx 0.33$, which approximates $r = 0.3$.

It is interesting to connect the expression for this threshold given by Eq. (14) with the cumulative distributions.

Lemma 2.
$$T^q_c[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\} \quad (15)$$

Proof.
We can rewrite:

$$P(s < t) = P(s < t \mid 0)P(0) + P(s < t \mid 1)P(1)$$

Using the definitions of $P(s < t \mid 0)$ and $P(s < t \mid 1)$ from the preliminaries in terms of the cumulative distributions, we have:

$$P(s < t) = F_0(t)P(0) + F_1(t)P(1) = F_0(t)\pi_0 + F_1(t)\pi_1$$

so, substituting into Eq. (14), we obtain the result:

$$T^q_c[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\}$$

This straightforward result shows that this criterion clearly depends on the classifier, but it only takes the ranks into account, not the magnitudes of the scores.

However, there is a natural way of setting the positive prediction rate adaptively. Instead of fixing the proportion of positive predictions, we may take the operating condition into account: if we are given an operating condition, we can use the information about the skew or cost proportion to adjust the positive prediction rate to that proportion.

Figure 2: Several graphical representations for the classifier with probability estimates (0.95, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05) and classes (1, 0, 1, 1, 0, 0, 0). Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick and solid) against cost proportions.
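The rate-fixed method and Lemma 2 can be illustrated with a short sketch (the helper names and the choice of midpoint thresholds are ours): pick $t$ between the $k$-th and $(k{+}1)$-th smallest scores with $k \approx rn$, and check that $F_0(t)\pi_0 + F_1(t)\pi_1$ recovers $r$:

```python
# Rate-fixed threshold choice (Eq. 14): pick t so that a proportion r of the
# scores fall below it, then verify Lemma 2's identity
# F0(t)*pi0 + F1(t)*pi1 = r. Data from the paper's Figure 1; names are ours.

scores  = [0.95, 0.90, 0.90, 0.85, 0.70, 0.70, 0.70, 0.55,
           0.45, 0.20, 0.20, 0.18, 0.16, 0.15, 0.05]
classes = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

def threshold_for_rate(r):
    """A threshold below which a proportion r of the scores fall:
    midway between the k-th and (k+1)-th smallest scores, k = round(r*n)."""
    n, srt = len(scores), sorted(scores)
    k = max(0, min(n, round(r * n)))
    lo = srt[k - 1] if k > 0 else srt[0] - 1
    hi = srt[k] if k < n else srt[-1] + 1
    return (lo + hi) / 2

r = 8 / 15                      # ask for 8 of the 15 examples predicted positive
t = threshold_for_rate(r)
s0 = [s for s, y in zip(scores, classes) if y == 0]
s1 = [s for s, y in zip(scores, classes) if y == 1]
pi0, pi1 = len(s0) / len(scores), len(s1) / len(scores)
F0 = sum(s < t for s in s0) / len(s0)
F1 = sum(s < t for s in s1) / len(s1)
mixed = F0 * pi0 + F1 * pi1     # equals r, by Lemma 2
```

The achievable rates are of course the multiples of $1/n$ (assuming no ties at the cut), which is why the prose above speaks of approximating $r$.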
This leads to the rate-driven threshold selection method: if we are given cost proportion $c$, we choose the threshold $t$ in such a way that we get a proportion of $c$ positive predictions:

$$T^n_c(c) \triangleq T^q_c[c](c) = \{t : P(s < t) = c\} \quad (16)$$

Given this threshold selection method, we can now derive its cost curve:

$$CC^n_c(c) \triangleq Q_c(T^n_c(c); c) \quad (17)$$

Because of Lemma 2, we can see that this is equivalent to:

$$CC^n_c(c) = Q_c(\{t : F_0(t)\pi_0 + F_1(t)\pi_1 = c\}; c) \quad (18)$$

Assuming no ties, the expression $F_0(t)\pi_0 + F_1(t)\pi_1$ only changes its value between scores. If we have $n$ examples, it only changes $n + 1$ times. So for finite populations, this has to be rewritten as follows:

$$CC^n_c(c) = Q_c\left(\left\{t : c - \tfrac{1}{n+1} < F_0(t)\pi_0 + F_1(t)\pi_1 \leq c\right\}; c\right) \quad (19)$$

This yields $n + 1$ intervals in cost space such that the threshold does not change within each interval, which means that the cost line is the same there. This leads to the following procedure.

ROC cost curve for cost proportions, $CC^n_c$. Given a classifier and a data set with $n$ examples:
1. Draw the $n + 1$ cost lines, $CL_0$ to $CL_n$.
2. From left to right, draw the curve following each cost line (from $CL_0$ to $CL_n$) with a width on the x-axis of $\frac{1}{n+1}$.

Figure 2 shows a small classifier for a data set with 4 positive examples and 3 negative examples. The ROC curve on the left has 8 points, since there are 8 cut points at which to set the threshold, leading to 8 crisp classifiers and, accordingly, 8 cost lines. These cost lines are shown in the cost space in the plot on the right. We see that the projection of each segment onto the x-axis has a length of exactly 1/8. Note that each segment uses a portion of each cost line.

It is relatively easy to understand what these curves mean and to see their correspondence to ROC curves. Following Figure 2, going from (0,0) to (1,1) in the ROC curve, the first three points are sub-optimal.
The fourth point is a good point, because it is going to be chosen for many slopes. The fifth and sixth are bad points, since they lie under the convex hull and will never be chosen. The seventh is a good point again. The eighth is a bad point. This is exactly what the ROC cost curve shows: only the fourth and seventh segments are optimal and match the optimal cost curve. So, the ROC cost curve has a segment intersecting the optimal cost curve for every point on the convex hull. All other segments correspond to sub-optimal decision thresholds.

5 The area under the ROC cost curve

If we plug the rate-driven threshold choice method $T^n_c$ (Eq. 16) into the general formula for the average expected cost over a range of cost proportions (Eq. 4), we have:

$$L^n_c \triangleq \int_0^1 Q_c(T^n_c(c); c)\, w_c(c)\, dc \quad (20)$$

Using the uniform distribution, this expected loss equals the area under the ROC cost curve. It can be linked to the area under the ROC curve as follows.

Proposition 3. $L^n_{U(c)} = 2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0$

Proof.

$$L^n_{U(c)} = \int_0^1 Q_c(T^n_c(c); c)\, U(c)\, dc = \int_0^1 2\{c\pi_0(1 - F_0(T^n_c(c))) + (1 - c)\pi_1 F_1(T^n_c(c))\}\, dc$$
$$= \int_0^1 2\{c\pi_0 - c[\pi_0 F_0(T^n_c(c)) + \pi_1 F_1(T^n_c(c))]\}\, dc + \int_0^1 2\pi_1 F_1(T^n_c(c))\, dc$$

From Lemma 2 we have that $T^q_c[r](c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = r\}$, and of course $T^n_c(c) = \{t : F_0(t)\pi_0 + F_1(t)\pi_1 = c\}$. Since this is the $t$ which makes the bracketed expression equal to $c$, we can substitute that expression by $c$. Then we have:

$$L^n_{U(c)} = \int_0^1 2\{c\pi_0 - c \cdot c\}\, dc + \int_0^1 2\pi_1 F_1(T^n_c(c))\, dc = \left[c^2\pi_0 - \frac{2c^3}{3}\right]_0^1 + \int_0^1 2\pi_1 F_1(T^n_c(c))\, dc$$
$$= \pi_0 - \frac{2}{3} + 2\pi_1 \int_0^1 F_1(T^n_c(c))\, dc$$

It remains to solve the term $\int_0^1 F_1(T^n_c(c))\, dc$.
In order to do this, note that using $T^{nc}(c)$ and integrating over $dc$ is the same as using the mixture distribution for thresholds $t$ and integrating over $dt$:

$$\int_0^1 F_1(T^{nc}(c))\, dc = \int_{-\infty}^{\infty} F_1(t)\,(\pi_0 f_0(t) + \pi_1 f_1(t))\, dt = \pi_0 \int_{-\infty}^{\infty} F_1(t) f_0(t)\, dt + \pi_1 \int_{-\infty}^{\infty} F_1(t)\, dF_1(t)$$
$$= \pi_0 \int_{-\infty}^{\infty} \left(1 - (1 - F_1(t))\right) f_0(t)\, dt + \frac{\pi_1}{2} = \pi_0 - \pi_0 \int_{-\infty}^{\infty} (1 - F_1(t)) f_0(t)\, dt + \frac{\pi_1}{2}$$
$$= \pi_0 - \pi_0\, AUC + \frac{\pi_1}{2} = \pi_0 (1 - AUC) + \frac{\pi_1}{2}$$

And now we can plug this into the expression for the expected cost:

$$L^{nc}_{U(c)} = \pi_0 - \frac{2}{3} + 2\pi_1 \left(\pi_0(1 - AUC) + \frac{\pi_1}{2}\right) = 2\pi_1\pi_0(1 - AUC) + \pi_0 - \frac{2}{3} + \pi_1^2$$
$$= 2\pi_1\pi_0(1 - AUC) + \pi_0 + \pi_1(1 - \pi_0) - \frac{2}{3} = 2\pi_1\pi_0(1 - AUC) + 1 - \pi_1\pi_0 - \frac{2}{3}$$
$$= 2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0$$

This shows that not only does this new curve have a clear correspondence to ROC curves, but also that its area is linearly related to AUC. Moving from costs to skews, we have by Lemma 1:

Corollary 4. $L^{nz}_{U(z)} = \frac{1 - AUC}{2} + \frac{1}{12}$.

Thus, the expected loss is $1/3$ for a random classifier, $1/3 - 1/4 = 1/12$ for a perfect classifier, and $1/3 + 1/4 = 7/12$ for the worst possible classifier.

The previous results are obtained for continuous curves, i.e. an infinite number of examples. For empirical curves with a limited number of examples, the result is not exact, but a good approximation. For instance, for the example in Figure 2, the AUC is 0.83333 and the area under the ROC cost curve for cost proportions is 0.1695, while the theoretical result $2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0$ gives 0.1701. It should be possible to come up with an exact formula for empirical ROC curves; we leave this as an open problem.

It is interesting to use these general results to get more insight into what the ROC cost curves mean exactly.
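The quality of this approximation can be checked numerically. The following sketch (ours; the sample sizes and Gaussian score distributions are invented for illustration) draws scores for the two classes, evaluates the rate-driven cost curve on a grid of cost proportions, and compares its area with the linear expression of Proposition 3:

```python
# Numeric sanity check of Proposition 3 (our own sketch, not from the paper).
import random

random.seed(1)
n0, n1 = 1200, 800                                   # class 0 (positives), class 1 (negatives)
pos = [random.gauss(0.0, 1.0) for _ in range(n0)]    # positives should score low
neg = [random.gauss(1.5, 1.0) for _ in range(n1)]    # negatives should score high
data = sorted([(s, 0) for s in pos] + [(s, 1) for s in neg])
n = n0 + n1
pi0, pi1 = n0 / n, n1 / n

# Empirical AUC under the paper's convention (positives ranked below negatives).
auc = sum(sp < sn for sp in pos for sn in neg) / (n0 * n1)

# Cumulative count of class-0 examples among the k lowest-scored examples.
cum0 = [0]
for _, y in data:
    cum0.append(cum0[-1] + (y == 0))

grid = 2000
total = 0.0
for i in range(grid + 1):
    c = i / grid
    k = min(n, int(c * (n + 1)))        # rate-driven: predict the k lowest scores positive
    f0 = cum0[k] / n0                   # empirical F_0 at the chosen threshold
    f1 = (k - cum0[k]) / n1             # empirical F_1 at the chosen threshold
    total += 2 * (c * pi0 * (1 - f0) + (1 - c) * pi1 * f1)
area = total / (grid + 1)

predicted = 2 * pi1 * pi0 * (1 - auc) + 1 / 3 - pi1 * pi0
print(round(area, 4), round(predicted, 4))           # the two should nearly coincide
```

As in the Figure 2 example, the empirical area and the theoretical expression differ only by a term that vanishes as $n$ grows.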
For instance, Figure 3 shows the ROC curve and the ROC cost curve for a perfect ranker and a balanced data set. We used a large number of split points in the ranking to simulate the continuous case. We see that our new threshold choice method makes optimal choices for $c = 0$, $c = 1/2$ and $c = 1$, but sub-optimal choices for other operating conditions, which explains the non-zero area under the ROC cost curve ($1/12$ in the continuous case).

Figure 3: Several graphical representations for a perfect and balanced classifier with 20 positive examples and 20 negative examples. Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick, solid) against cost proportions.

The optimal choice in this case is to ignore the operating condition altogether and always split the ranking in the middle.

Figure 4 shows what the ROC cost curve looks like for the worst possible ranker. The lower envelope of the cost lines shows that in this case the optimal choice is to always predict 0 if $c < 1/2$ and 1 if $c > 1/2$, which results in an expected loss of $1/4$.
In contrast, our new threshold choice method also takes the non-optimal split points into account and hence incurs a higher expected loss ($7/12$ in the continuous case).

Figure 5 shows what the ROC cost curve looks like for an alternating classifier (close to the diagonal in ROC space) with $AUC \approx 0.5$. Here, the expected loss approximates $4/12 = 1/3$, while the optimal choice is the same as in the previous case. It is not hard to prove that in the limiting case $n \to \infty$, the ROC cost curve for a random classifier is described by the function $y = 2c(1-c)$, which is the Gini index (the impurity measure, not to be confused with the Gini coefficient, which is $2\,AUC - 1$).

6 Evenly-spaced scores. The relation between AUC and the Brier score

An alternative threshold choice method is to choose the threshold such that $\hat{p}(1|x) = op$, where $op$ is the operating condition. This is a natural criterion, as it has been used especially when the classifier is a probability estimator. Drummond and Holte [5] say it is a common example of a "performance independent criterion". Referring to Figure 22 in their paper, which uses this threshold choice, they say: "the performance independent criterion, in this case, is to set the threshold to correspond to the operating conditions. For example, if PC(+) = 0.2, the Naive Bayes threshold is set to 0.2". The term PC(+) is equivalent to our 'skew'.

Let us give the definition of this method, which we call the probabilistic threshold choice (as presented in [12]). We first give the formulation which uses cost proportions for operating conditions:

$$T^{pc}(c) \triangleq c \qquad (21)$$

We define the same thing for skews:

$$T^{pz}(z) \triangleq z \qquad (22)$$

If we plug $T^{pc}$ into the general formula of the average expected cost (Eq. 4), we have the expected probabilistic

Figure 4: Several graphical representations for a very bad classifier (all the 0s are ranked before the 1s) on a balanced data set with 20 positive examples and 20 negative examples. Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick, solid) against cost proportions.

Figure 5: Several graphical representations for an alternating (order is 1,0,1,0, ...) and balanced classifier with 20 positive examples and 20 negative examples.
Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed) and ROC cost curve (thick, solid) against cost proportions.

cost:

$$L^{pc} \triangleq \int_0^1 Q_c(T^{pc}(c); c)\, w_c(c)\, dc = \int_0^1 Q_c(c; c)\, w_c(c)\, dc \qquad (23)$$

And if we use the uniform distribution and the definition of $Q_c$ (Eq. 1):

$$L^{pc}_{U(c)} \triangleq \int_0^1 Q_c(c; c)\, U(c)\, dc = \int_0^1 2\left\{c\,\pi_0(1 - F_0(c)) + (1-c)\,\pi_1 F_1(c)\right\} dc$$
$$= \int_0^1 2\,c\,\pi_0(1 - F_0(c))\, dc + \int_0^1 2\,(1-c)\,\pi_1 F_1(c)\, dc \qquad (24)$$

From here, it is easy to get the following:

Theorem 5 ([12]). The expected loss using a uniform distribution for cost proportions is the Brier score.

Proof. We have $BS = \pi_0 BS_0 + \pi_1 BS_1$. Using integration by parts, we have

$$BS_0 = \int_0^1 s^2 f_0(s)\, ds = \left[s^2 F_0(s)\right]_{s=0}^{1} - \int_0^1 2s F_0(s)\, ds = 1 - \int_0^1 2s F_0(s)\, ds = \int_0^1 2s\, ds - \int_0^1 2s F_0(s)\, ds$$

Similarly for the negative class:

$$BS_1 = \int_0^1 (1-s)^2 f_1(s)\, ds = \left[(1-s)^2 F_1(s)\right]_{s=0}^{1} + \int_0^1 2(1-s) F_1(s)\, ds = \int_0^1 2(1-s) F_1(s)\, ds$$

Taking their weighted average, we obtain

$$BS = \pi_0 BS_0 + \pi_1 BS_1 = \int_0^1 \left\{\pi_0\,(2s - 2s F_0(s)) + \pi_1\, 2(1-s) F_1(s)\right\} ds$$

which, after reordering of terms and a change of variable, is the same expression as Eq. (24).

In [12] we introduced the Brier curve as a plot of $Q_c(c; c)$ against $c$, so this theorem states that the area under the Brier curve is the Brier score.

Given a classifier with scores, we may use those scores with this threshold choice, or we may ignore them and use evenly-spaced scores instead: we simply assign the $n$ scores as $s_i = \frac{i-1}{n-1}$, going from 0 to 1 in steps of $\frac{1}{n-1}$. With this simple idea, the probabilistic threshold choice method reduces to $T^{nc}$, which was analysed in the previous two sections. And now we get a very interesting result.

Corollary 6. If scores are evenly spaced, then:

$$2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0 = BS \qquad (25)$$

Figure 6: Several graphical representations for a ranker with evenly-spaced scores (1, 0.957, 0.913, 0.870, 0.826, 0.782, 0.739, 0.696, 0.652, 0.609, 0.565, 0.522, 0.478, 0.435, 0.391, 0.348, 0.304, 0.261, 0.217, 0.174, 0.130, 0.087, 0.043, 0) and true classes (1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,0,1,1,0,0,0,0,1,0) (15 negative examples and 9 positive examples). Left: ROC curve (solid) and convex hull (dashed). Right: cost lines, optimal cost curve (dashed), ROC cost curve (brown, thick, solid) and Brier curve (pink, thin, solid) against cost proportions.

As far as we are aware, this is the first published connection between the area under the ROC curve and the Brier score. Of course, this is related to the Brier curves introduced in [12], so we can also say that Brier curves and ROC cost curves are closely related (they have the same area) if the classifier has evenly-spaced scores (see footnote 4). Figure 6 shows a classifier with evenly-spaced scores, so that the previous corollary holds. We can see that the Brier curve and the ROC cost curve have similar shapes, although they are not identical.
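Corollary 6 can be checked directly on the data of Figure 6 (a sketch of ours; as noted in the text, for a finite sample the identity holds only approximately):

```python
# Check of Corollary 6 on the Figure 6 data: the left-hand side of Eq. (25)
# comes close to, but does not exactly equal, the empirical Brier score.
n = 24
scores = [(n - 1 - j) / (n - 1) for j in range(n)]   # evenly-spaced, listed high to low
classes = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

pi0 = classes.count(0) / n                           # proportion of positives (class 0)
pi1 = classes.count(1) / n                           # proportion of negatives (class 1)

# AUC under the paper's convention: positives should receive the low scores.
pos = [s for s, y in zip(scores, classes) if y == 0]
neg = [s for s, y in zip(scores, classes) if y == 1]
auc = sum(sp < sn for sp in pos for sn in neg) / (len(pos) * len(neg))

# Brier score: each score estimates p(1|x), so the error is against the label.
bs = sum((s - y) ** 2 for s, y in zip(scores, classes)) / n

lhs = 2 * pi1 * pi0 * (1 - auc) + 1 / 3 - pi1 * pi0  # left-hand side of Eq. (25)
print(round(auc, 4), round(lhs, 6), round(bs, 7))    # 0.7778 0.203125 0.2047101
```

The printed values reproduce the numbers discussed in the text for this example.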
We have $AUC = 0.7777$ and $2\pi_1\pi_0(1 - AUC) + \frac{1}{3} - \pi_1\pi_0 = 0.203125$, whereas the Brier score is 0.2047101 and the area under the Brier curve is 0.2006 (see footnote 5).

Finally, Figure 7 shows a perfectly calibrated classifier with its ROC cost curve and Brier curve. The pink curve (the Brier curve) for cost proportions matches the black curve (the optimal curve). The ROC cost curve shows that the rate-driven threshold choice method sometimes makes sub-optimal choices: for example, it only switches to the second point from the left in the ROC curve when $c = 4/11 = 0.36$, whereas the optimal decision would be to switch to this point from $c = 0.25$.

7 Conclusions

The definition of cost curve in the literature has been partially elusive. While it is clear what cost lines are, it was not clear what different options we may have to draw different curves in cost space, which of them

Footnote 4: Working with skews instead of cost proportions, the derivation should lead to an equation corresponding to Corollary 6, i.e. $\frac{1}{2}(1 - AUC) + \frac{1}{12} = \frac{BS_0 + BS_1}{2}$. This exercise is left to the reader.

Footnote 5: These two latter numbers should be exactly equal; small problems when dealing with ties in the implementation of the curves cause this small difference.
were valid and which were not, and, more importantly, whether they correspond to curves or representations in ROC space.

Figure 7: Several graphical representations for a perfectly calibrated classifier with scores (1, 0.833333, 0.833333, 0.833333, 0.833333, 0.833333, 0.833333, 0.25, 0.25, 0.25, 0.25) and true classes (1,1,1,1,1,1,0,0,0,0,1) (4 positive examples and 7 negative examples). Left: convex ROC curve. Right: cost lines, optimal cost curve (dashed), ROC cost curve (brown, thick, solid) and Brier curve (pink, thin, solid) against cost proportions.

In this paper, we have clarified the relation between ROC space and cost space by finding the corresponding curves for ROC curves in cost space. These represent cost curves for rankers that do not commit to a fixed decision threshold. Cost plots have some advantages over ROC plots, and the possibility of drawing ROC cost curves may give further support to using cost plots and their ROC cost curves. In addition, we have shown that when the scores of a classifier are evenly spaced, the ROC cost curves correspond to the previously presented Brier curves, and we have the first firm connection between the Brier score and AUC.
This also suggests that there might be a way to draw Brier curves in ROC space. Given the exploratory character of this paper, there are many interesting options to follow up. Our focus will be on how to use ROC cost curves to choose among models and construct hybrid classifiers.

References

[1] N.M. Adams and D.J. Hand. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32(7):1139–1147, 1999.

[2] G.W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[3] I. Cohen and M. Goldszmidt. Properties and benefits of calibrated classifiers. In Knowledge Discovery in Databases: PKDD 2004, pages 125–136, 2004.

[4] C. Drummond and R.C. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Knowledge Discovery and Data Mining, pages 198–207, 2000.

[5] C. Drummond and R.C. Holte. Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65:95–130, 2006.

[6] C. Elkan. The foundations of cost-sensitive learning. In Bernhard Nebel, editor, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), pages 973–978, San Francisco, CA, 2001.

[7] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

[8] P. Flach, C. Ferri, and J. Hernández-Orallo. A unified view of classifier performance measures. In preparation, 2011.

[9] P.A. Flach. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), pages 194–201, 2003.

[10] P.A. Flach, J. Hernández-Orallo, and C. Ferri. A coherent interpretation of AUC as a measure of aggregated classification performance. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011.

[11] D.J. Hand.
Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1):103–123, 2009.

[12] J. Hernández-Orallo, P. Flach, and C. Ferri. Brier curves: a new cost-based visualisation of classifier performance. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011.

[13] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET curve in assessment of detection task performance. In Fifth European Conference on Speech Communication and Technology, 1997.

[14] A.H. Murphy. A note on the utility of probabilistic predictions and the probability score in the cost-loss ratio decision situation. Journal of Applied Meteorology, 5:534–536, 1966.

[15] G. Piatetsky-Shapiro and B. Masand. Estimating campaign benefits and modeling lift. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 193. ACM, 1999.

[16] M.D. Reid and R.C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.

[17] J.A. Swets, R.M. Dawes, and J. Monahan. Better decisions through science. Scientific American, 283(4):82–87, October 2000.