Loss-sensitive Training of Probabilistic Conditional Random Fields
Authors: Maksims N. Volkovs, Hugo Larochelle, Richard S. Zemel
Affiliations: Maksims N. Volkovs and Richard S. Zemel, Department of Computer Science, University of Toronto, Toronto, Canada; Hugo Larochelle, Département d'informatique, Université de Sherbrooke, Sherbrooke, Canada.

Abstract

We consider the problem of training probabilistic conditional random fields (CRFs) in the context of a task where performance is measured using a specific loss function. While maximum likelihood is the most common approach to training CRFs, it ignores the inherent structure of the task's loss function. We describe alternatives to maximum likelihood which take that loss into account. These include a novel adaptation of a loss upper bound from the structured SVM literature to the CRF context, as well as a new loss-inspired KL divergence objective which relies on the probabilistic nature of CRFs. These loss-sensitive objectives are compared to maximum likelihood using ranking as a benchmark task. This comparison confirms the importance of incorporating loss information in the probabilistic training of CRFs, with the loss-inspired KL outperforming all other objectives.

1 Introduction

Conditional random fields (CRFs) [1] form a flexible family of models for capturing the interaction between an input x and a target y. CRFs have been designed for a vast variety of problems, including natural language processing [2, 3, 4], speech processing [5], computer vision [6, 7, 8] and bioinformatics [9, 10] tasks. One reason for their popularity is that they provide a flexible framework for modeling the conditional distributions of targets constrained by some specific structure, such as chains [1], trees [11], 2D grids [7, 12], permutations [13] and many more.
While there has been a lot of work on developing appropriate CRF potentials and energy functions, as well as on deriving efficient (approximate) inference procedures for some given target structure, much less attention has been paid to the loss function under which the CRF's performance is ultimately evaluated. Indeed, CRFs are usually trained by maximum likelihood (ML) or the maximum a posteriori criterion (MAP, or regularized ML), which ignores the task's loss function. Yet, several tasks are associated with loss functions that are also structured and do not correspond to a simple 0/1 loss: labelwise error (Hamming loss) for item labeling, the BLEU score for machine translation, normalized discounted cumulative gain (NDCG) for ranking, etc. (Without loss of generality, for tasks where a performance measure is instead provided, i.e. where higher values are better, we assume it can be converted into a loss, e.g. by setting the loss to the negative of the performance measure.) Ignoring this structure can prove as detrimental to performance as ignoring the target's structure.

The inclusion of loss information into learning is an idea that has been more widely explored in the context of structured support vector machines (SSVMs) [14, 15]. SSVMs and CRFs are closely related models, both trying to shape an energy or score function over the joint input and target space to fit the available training data. However, while an SSVM attempts to satisfy margin constraints without invoking a probabilistic interpretation of the model, a CRF follows a probabilistic approach and instead aims at calibrating its probability estimates to the data. Similarly, while an SSVM relies on maximization procedures to identify the most violated margin constraints, a CRF relies on (approximate) inference or sampling procedures to estimate probabilities under its distribution and compare them to the empirical distribution.
While there are no obvious reasons to prefer one approach to the other, a currently unanswered question is whether the known methods that adapt SSVM training to some given loss (i.e., upper bounds based on margin and slack scaling [15]) can also be applied to the probabilistic training of CRFs. Another question is how such methods would compare to other loss-sensitive training objectives which rely on the probabilistic nature of CRFs and which may have no analog in the SSVM framework.

We investigate these questions in this paper. First, we describe upper bounds similar to the margin and slack scaling upper bounds of SSVMs, but that correspond to maximum likelihood training of CRFs with loss-augmented and loss-scaled energy functions. Second, we describe two other loss-inspired training objectives for CRFs which rely on their probabilistic nature: the standard average expected loss objective and a novel loss-inspired KL-divergence objective. Finally, we compare these loss-sensitive objectives on ranking benchmarks based on the NDCG performance measure. To our knowledge, this is the first systematic evaluation of loss-sensitive training objectives for probabilistic CRFs.

2 Conditional Random Fields

This work is concerned with the general problem of supervised learning, where the relationship between an input x and a target y must be learned from a training set of instantiated pairs $\mathcal{D} = \{x_t, y_t\}$. More specifically, we are interested in learning a predictive mapping from x to y. Conditional random fields (CRFs) tackle this problem by directly defining the conditional distribution p(y|x) through some energy function $E(y, x; \theta)$, as follows:

$p(y|x) = \exp(-E(y, x; \theta)) / Z(x), \qquad Z(x) = \sum_{y \in \mathcal{Y}(x)} \exp(-E(y, x; \theta))$  (1)

where $\mathcal{Y}(x)$ is the set of all possible configurations for y given the input x, and $\theta$ is the model's parameter vector.
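For a target space small enough to enumerate, the conditional distribution of Equation 1 can be computed directly. The sketch below is a toy illustration (the configuration names and energy values are ours, not from the paper):

```python
import math

def crf_distribution(energies):
    """Compute p(y|x) = exp(-E(y,x)) / Z(x) over an enumerable target space.

    `energies` maps each configuration y in Y(x) to its energy E(y, x; theta).
    """
    # Subtract the minimum energy for numerical stability (does not change p).
    e_min = min(energies.values())
    unnorm = {y: math.exp(-(e - e_min)) for y, e in energies.items()}
    Z = sum(unnorm.values())
    return {y: u / Z for y, u in unnorm.items()}

# Toy example: three configurations with hand-picked energies.
p = crf_distribution({"y1": 0.0, "y2": 1.0, "y3": 2.0})
assert abs(sum(p.values()) - 1.0) < 1e-12
# Lower energy implies higher probability.
assert p["y1"] > p["y2"] > p["y3"]
```

The same enumeration also gives the negative log-likelihood term of Equation 2 for one example, as $E(y_t, x; \theta) + \log Z(x)$.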
The parametric form of the energy function $E(y, x; \theta)$ will depend on the nature of x and y. A popular choice is a linear function of a set of features on x and y, i.e., $E(y, x; \theta) = -\sum_i \theta_i f_i(x, y)$.

2.1 Maximum Likelihood Objective

The most popular approach to training CRFs is conditional maximum likelihood. It corresponds to the minimization with respect to $\theta$ of the objective $L_{\mathrm{ML}}(\mathcal{D}; \theta)$:

$L_{\mathrm{ML}}(\mathcal{D}; \theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \log p(y_t | x_t) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ E(y_t, x_t; \theta) + \log \sum_{y \in \mathcal{Y}(x_t)} \exp(-E(y, x_t; \theta)) \Big]$  (2)

To this end, one can use any gradient-based optimization procedure, which can converge to a local optimum, or even a global optimum if the problem is convex (e.g., by choosing an energy function $E(y, x; \theta)$ linear in $\theta$). The gradients have an elegant form:

$\frac{\partial L_{\mathrm{ML}}(\mathcal{D}; \theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ \frac{\partial E(y_t, x_t; \theta)}{\partial \theta} - \mathbb{E}_{y|x_t}\Big[ \frac{\partial E(y, x_t; \theta)}{\partial \theta} \Big] \Big]$  (3)

Hence exact gradient evaluations are possible when the conditional expectation in the second term is tractable. This is the case for CRFs with a chain or tree structure, for which belief propagation can be used. When gradients are intractable, two approximate alternatives can be considered. The first is to approximate the intractable expectation using either Markov chain Monte Carlo sampling or variational inference algorithms such as mean-field or loopy belief propagation, the latter being the most popular. The second approach is to use alternative objectives such as pseudolikelihood [16] or piecewise training [17] (footnote 2).

3 Loss-sensitive Training Objectives

Unfortunately, maximum likelihood and its associated approximations all suffer from the problem that the loss function under which the performance of the CRF is evaluated is ignored. In the well-specified case and for large datasets, this would probably not be a problem, because of the asymptotic consistency and efficiency properties of maximum likelihood.
However, almost all practical problems do not fall in the well-specified setting, which justifies the exploration of alternative training objectives.

Let $\hat{y}(x_t)$ denote the prediction made by a CRF for some given input $x_t$. Most commonly (footnote 3), this prediction will be

$\hat{y}(x_t) = \arg\max_{y \in \mathcal{Y}(x_t)} p(y | x_t) = \arg\min_{y \in \mathcal{Y}(x_t)} E(y, x_t)$.

We assume that we are given some loss $l_t(\hat{y}(x_t))$ under which the performance of the CRF on some dataset $\mathcal{D}$ will be measured. We will also assume that $l_t(y_t) = 0$. The goal is then to achieve a low average loss $\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(\hat{y}(x_t))$.

Directly minimizing this average loss is hard, because $l_t(\hat{y}(x_t))$ is not a smooth function of the CRF parameters $\theta$. In fact, the loss $l_t(\hat{y}(x_t))$ is normally not a smooth function of the prediction $\hat{y}(x_t)$, and $\hat{y}(x)$ is itself not a smooth function of the model parameters $\theta$. This non-smoothness makes it impossible to apply gradient-based optimization directly. However, one could attempt to indirectly optimize the average loss by deriving smooth objectives that also depend on the loss. In the next sections, we describe three separate formulations of this approach.

3.1 Loss Upper Bounds

The loss function provides important information as to how good a potential prediction y is with respect to the ground truth $y_t$. In particular, it specifies an ordering from the best prediction ($y = y_t$) to increasingly bad predictions with increasing value of their associated loss $l_t(y)$. It might then be desirable to ensure that the CRF assigns particularly low probability (i.e., high energy) to the worst possible predictions, as measured by the loss. A first way of achieving this is to augment the energy function at a given training example $(x_t, y_t)$ by including the loss function for that example, producing a Loss-Augmented energy:

$E^{\mathrm{LA}}_t(y, x_t; \theta) = E(y, x_t; \theta) - l_t(y)$  (4)
By artificially reducing the energy of bad values of y as a function of their loss, this will force the CRF to increase even more the value of $E(y, x; \theta)$ for those values of y with high loss. This idea is similar to the concept of margin re-scaling in structured support vector machines (SSVMs) [14, 15], a similarity that has been highlighted previously by Hazan and Urtasun [18]. Moreover, as in SSVMs, it can be shown that by replacing the regular energy function with this loss-augmented energy function in the maximum likelihood objective of Equation 2, we obtain a new Loss-Augmented objective that upper bounds the average loss:

$L_{\mathrm{LA}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ E^{\mathrm{LA}}_t(y_t, x_t) + \log \sum_{y \in \mathcal{Y}(x_t)} \exp(-E^{\mathrm{LA}}_t(y, x_t)) \Big]$
$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ E^{\mathrm{LA}}_t(y_t, x_t) + \log \exp(-E^{\mathrm{LA}}_t(\hat{y}(x_t), x_t)) \Big]$
$= \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ E(y_t, x_t) - E(\hat{y}(x_t), x_t) + l_t(\hat{y}(x_t)) - l_t(y_t) \Big]$
$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(\hat{y}(x_t)).$

We see that the higher $l_t(y)$ is for some given y, the more important the energy term associated with it will be in the global objective. Hence, introducing the loss this way will indeed force the optimization to focus more on increasing the energy for configurations of y associated with high loss.

Footnote 2: Variational inference-based training can also be interpreted as training based on a different objective.
Footnote 3: For loss functions that decompose into loss terms over subsets of target variables, it may be more appropriate to use the mode of the marginals over each subset as the prediction.

As an alternative to subtracting the loss, we could further increase the weight of terms associated with high loss by also multiplying the original energy function, as follows:

$E^{\mathrm{LS}}_t(y, x_t; \theta) = l_t(y)\big(E(y, x_t; \theta) - E(y_t, x_t; \theta)\big) - l_t(y)$  (5)
The advantage of this Loss-Scaled energy is that when a configuration of y with high loss already has higher energy than the target (i.e., $E(y, x_t; \theta) - E(y_t, x_t; \theta) > 0$), then the energy is going to be further increased, reducing its weight in the optimization. In other words, focus in the optimization is put on bad configurations of y only when they have lower energy than the target. Finally, we can also show that the Loss-Scaled objective obtained from this loss-scaled energy leads to an upper bound on the average loss:

$L_{\mathrm{LS}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ E^{\mathrm{LS}}_t(y_t, x_t) + \log \sum_{y \in \mathcal{Y}(x_t)} \exp(-E^{\mathrm{LS}}_t(y, x_t)) \Big]$
$= \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \log \sum_{y \in \mathcal{Y}(x_t)} \exp\big(-l_t(y)(E(y, x_t; \theta) - E(y_t, x_t; \theta)) + l_t(y)\big)$
$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(\hat{y}(x_t))\big(1 + E(y_t, x_t; \theta) - E(\hat{y}(x_t), x_t; \theta)\big)$
$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(\hat{y}(x_t)).$

There is a connection with SSVM training objectives here as well: this loss-scaled CRF is the probabilistic equivalent of SSVM training with slack re-scaling [15].

Since both the loss-augmented and loss-scaled CRF objectives follow the general form of the maximum likelihood objective but with different energy functions, the form of the gradient is also that of Equation 3. The two key differences are that the energy function is now different, and that the conditional expectation on y given $x_t$ is taken according to the CRF distribution with the associated loss-sensitive energy. In general (particularly for the loss-scaled CRF), it will not be possible to run belief propagation to compute the expectation (footnote 4), but adapted forms of loopy belief propagation or MCMC (e.g., Gibbs sampling) could be used.
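On an enumerable toy problem, the two upper bounds of Section 3.1 can be checked numerically. The sketch below uses made-up energies and losses for a single training example (index 0 plays the role of the ground truth $y_t$), and verifies that both $L_{\mathrm{LA}}$ and $L_{\mathrm{LS}}$ bound the loss of the MAP prediction from above:

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical energies E(y, x_t) and losses l_t(y); index 0 is y_t (loss 0).
E = [0.5, 0.2, 1.5]
loss = [0.0, 1.0, 2.0]
y_hat = min(range(len(E)), key=lambda y: E[y])   # MAP prediction: arg min E
map_loss = loss[y_hat]

# Loss-augmented objective (Equation 4): E_LA(y) = E(y) - l_t(y).
E_la = [E[y] - loss[y] for y in range(len(E))]
L_la = E_la[0] + logsumexp([-e for e in E_la])

# Loss-scaled objective (Equation 5): E_LS(y) = l_t(y) * (E(y) - E(y_t)) - l_t(y).
E_ls = [loss[y] * (E[y] - E[0]) - loss[y] for y in range(len(E))]
L_ls = E_ls[0] + logsumexp([-e for e in E_ls])

# Both objectives upper bound the loss of the MAP prediction.
assert L_la >= map_loss and L_ls >= map_loss
```

Here the MAP prediction is the configuration with energy 0.2 and loss 1.0, and both objectives evaluate to values above 1.0, as the derivations guarantee.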
3.2 Expected Loss

A second approach to deriving a smooth version of the average loss is to optimize the average Expected Loss, where the expectation is taken under the CRF's distribution:

$L_{\mathrm{EL}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \mathbb{E}_{y|x_t}[l_t(y)] = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \sum_{y \in \mathcal{Y}(x_t)} l_t(y)\, p(y | x_t)$  (6)

While this objective is not an upper bound, it comes increasingly close to the average loss as the entropy of $p(y | x_t)$ becomes smaller and the distribution puts all its mass on $\hat{y}(x_t)$. The parameter gradient has the following form:

$\frac{\partial L_{\mathrm{EL}}(\mathcal{D}; \theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \Big[ \mathbb{E}_{y|x_t}[l_t(y)]\; \mathbb{E}_{y|x_t}\Big[ \frac{\partial E(y, x_t; \theta)}{\partial \theta} \Big] - \mathbb{E}_{y|x_t}\Big[ l_t(y) \frac{\partial E(y, x_t; \theta)}{\partial \theta} \Big] \Big]$  (7)

If the required expectations cannot be computed tractably, MCMC sampling can be used to approximate them. Another alternative is to use a fixed set of representative samples [13].

3.3 Loss-inspired Kullback-Leibler

Both the average expected loss and the loss upper bound objectives have in common that they are perfectly minimized when the posteriors $p(y | x_t)$ put all their mass on the targets $y_t$. In practice, this is bound not to happen, since it would likely correspond to an overfitted solution which will be avoided using additional regularization. Instead of relying on a generic regularizer such as the $\ell_2$-norm of the parameter vector, perhaps the loss function itself might provide cues as to how best to regularize the CRF. Indeed, we can think of the loss as a ranking of all potential predictions, from perfect to adequate to worse. Hence, if we are not to put all probability mass on $p(y_t | x_t)$, we could make use of the information provided by the loss in order to determine how to distribute the excess mass $1 - p(y_t | x_t)$ on other configurations of y. In particular, it would be sensible to distribute it on other values of y according to their loss $l_t(y)$.
To achieve this, we propose to first convert the loss into a distribution $q(y | t)$ over the target, and then minimize the Kullback-Leibler (KL) divergence between this target distribution and the CRF posterior:

$L_{\mathrm{KL}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} D_{\mathrm{KL}}(q(\cdot | t)\,\|\,p(\cdot | x_t)) = -\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \sum_{y \in \mathcal{Y}(x_t)} q(y | t) \log p(y | x_t) - C$  (8)

where the constant C is the average entropy $H(q(\cdot | t))$ of the target distributions, which does not depend on the parameter vector $\theta$. There are several ways of defining the target distribution $q(y | t)$. In this work, we define it as follows:

$q(y | t) = \exp(-l_t(y)/T) / Z_t, \qquad Z_t = \sum_{y \in \mathcal{Y}(x_t)} \exp(-l_t(y)/T)$  (9)

where the temperature parameter T controls how peaked this distribution is around $y_t$. The maximum likelihood objective is recovered as T approaches 0.

Footnote 4: In the loss-augmented case, one exception is if the loss decomposes into individual losses over each target variable $y_i$ and the CRF follows a tree structure in its output. In this case, the loss terms can be integrated into the CRF unary features and belief propagation will perform exact inference.

[Figure 1 appears here: bar plot of the negative derivative $-\partial L / \partial E$ for each objective at each of the five configurations.]

Figure 1: Negative derivatives of the objective with respect to energy for each of the five training objectives presented: maximum likelihood (ML), loss-augmented ML (LA), loss-scaled ML (LS), expected loss (EL) and loss-inspired Kullback-Leibler divergence (KL). For each objective we consider five different configurations: from left to right, their energies are [−1, −0.5, 0, 0.5, 1] and their losses are [5, 1, 0, 1, 5]. The middle one therefore corresponds to a ground-truth configuration; those to its left are currently more likely under the model, and loss increases with distance from this middle one.
The derivatives for each objective are normalized by their $\ell_2$ norm.

The gradient of this objective with respect to $\theta$ is simply the expectation, under the target distribution $q(y | t)$, of the gradient for maximum likelihood $L_{\mathrm{ML}}$. Here too, if the expectation is not tractable, one can use sampling to approximate it. In particular, since we have total control over the form of $q(y | t)$, it is easy to define it such that it can be sampled from exactly.

3.4 Analyzing the Behavior of the Training Objectives

Figure 1 shows how the gradient with respect to the energy changes for each objective as we consider configurations y with varying energy and loss values. From this figure we see significant differences in the behaviors of the introduced objective functions. Only the expected-loss and loss-inspired Kullback-Leibler objectives will attempt to lower the energies of configurations that have non-zero loss. The maximum likelihood objective aims to raise the energies of the non-zero-loss configurations, in proportion to how probable they are. On the other hand, the loss-augmented and loss-scaled objectives concentrate on the most probable configurations that have the highest loss (the worst violators), with the loss-scaled objective having the most extreme behavior, putting all the gradient on the worst violator. This behavior is expected, as the energies get amplified by the addition (respectively, multiplication) of the loss, which artificially raises the probability of the already probable violators. The behavior of the expected-loss objective is counter-intuitive, as it tries to lower the energy of all configurations that have low loss, including those that are already more probable than the zero-loss one. In this example, it even pushes down the energy of a non-zero-loss configuration more than that of the zero-loss (target) configuration.
The loss-inspired KL objective adjusts for this, and only lowers the energies of the zero-loss (ground-truth) configuration and of the low-loss configuration that has low probability.

4 Learning with Multiple Ground Truths

In certain applications, for some given input $x_t$, there is not a single correct target $y_t$ but several (see Section 6 for the case of ranking). This information can easily be encoded within the loss function, by setting $l_t(y) = 0$ for all such valid predictions. In this context, maximum likelihood training corresponds to the objective:

$L_{\mathrm{ML}}(\mathcal{D}; \theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x_t, r_t) \in \mathcal{D}} \sum_{y_t \in \mathcal{Y}_0(x_t)} \log p(y_t | x_t)$  (10)

where $\mathcal{Y}_0(x_t) = \{y \mid y \in \mathcal{Y}(x_t),\ l_t(y) = 0\}$. This is equivalent to maximizing the likelihood of all predictions y that are consistent with the loss, i.e. that have zero loss. The loss-augmented variant is adjusted similarly. As for loss scaling, we replace the energy at the ground truth with the average energy over all valid ground truths in the loss-scaled energy:

$E^{\mathrm{LS}}_t(y, x_t; \theta) = l_t(y)\Big(E(y, x_t; \theta) - \frac{1}{|\mathcal{Y}_0(x_t)|} \sum_{y_t \in \mathcal{Y}_0(x_t)} E(y_t, x_t; \theta)\Big) - l_t(y)$  (11)

No changes to the average expected loss and loss-inspired KL objectives are necessary, as they already consider all valid y.

In the setting of multiple ground truths, a clear distinction can be made between the average expected loss and the other objectives, in terms of the solutions they encourage. Indeed, the expected loss will be minimized as long as $\sum_{y_t \in \mathcal{Y}_0(x_t)} p(y_t | x_t) = 1$, i.e. as long as probability mass is only put on configurations of y that have zero loss. On the other hand, the maximum likelihood and loss upper bound objectives add the requirement that this mass be equally distributed amongst those configurations. As for the loss-inspired KL, it requires that the probability mass on the zero-loss configurations sum to a constant smaller than 1, leaving mass $1 - \sum_{y_t \in \mathcal{Y}_0(x_t)} q(y_t | t)$ for the remaining configurations.
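The multiple-ground-truth likelihood of Equation 10 is easy to sketch for an enumerable target space. Below, a single example's contribution is computed by summing the negative log-probabilities over the zero-loss set; the function names and the toy energy/loss values are ours, not from the paper:

```python
import math

def log_z(energies):
    # log sum_y exp(-E(y, x)), computed stably.
    m = max(-e for e in energies)
    return m + math.log(sum(math.exp(-e - m) for e in energies))

def multi_truth_nll(energies, losses):
    """Per-example term of Equation 10: -sum over the zero-loss set Y_0(x)
    of log p(y | x), with p(y | x) = exp(-E(y, x)) / Z(x)."""
    lz = log_z(energies)
    zero_loss = [y for y, l in enumerate(losses) if l == 0.0]
    return -sum(-energies[y] - lz for y in zero_loss)

# Toy example with two equally valid ground truths (indices 0 and 1).
E = [0.1, 0.1, 2.0]
L = [0.0, 0.0, 1.0]
assert multi_truth_nll(E, L) > 0.0
```

As noted above, this objective encourages the probability mass to be distributed equally among the zero-loss configurations, unlike the expected loss, which is indifferent to how that mass is split.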
5 Related Work

While maximum likelihood is the dominant approach to training CRFs in the literature, others have proposed ways of adapting the CRF training objective for specific tasks. For sequence labeling problems, Kakade et al. [19] proposed to maximize the label-wise marginal likelihood instead of the joint label sequence likelihood, to reflect the fact that the task's loss function is the sum of label-wise classification errors. Suzuki et al. [20] and Gross et al. [21] went a step further by proposing to directly optimize a smoothed version of the label-wise classification error (Suzuki et al. [20] also described how to apply this approach to optimize an F-score). Their approach is similar to the average expected loss described in Section 3.2; however, they do not discuss how to generalize it to arbitrary loss functions. The average expected loss objective for CRFs was formulated by Taylor et al. [22] and Volkovs and Zemel [13], in the context of ranking.

Work in frameworks other than CRFs for structured output prediction has looked at how to incorporate loss information into learning. Tsochantaridis et al. [15] describe how to upper bound the average loss with margin and slack scaling. McAllester et al. [23] propose a perceptron-like algorithm based on an update which, in expectation, is close to the gradient of the true expected loss (i.e., the expectation is with respect to the true generative process). Both SSVM and perceptron algorithms require procedures for computing the so-called loss-adjusted MAP assignment of the output y which, for richly structured losses, can be intractable. One advantage of CRFs is that they can instead leverage the vast MCMC literature to sample from CRFs with loss-adjusted energies. Moreover, they open the door to alternative (i.e. not necessarily upper-bounding) objectives.
Finally, while Hazan and Urtasun [18] described how margin scaling can be applied to CRFs, we give for the first time the equivalent of slack scaling for CRFs in Section 3.1.

6 Experiments

We evaluate the usefulness of the different loss-sensitive training objectives on ranking tasks. In this setting, the input $x = (q, \mathcal{D})$ corresponds to a pair made of a query vector q and a set of documents $\mathcal{D} = \{d^{(i)}\}$, and y is a vector corresponding to a ranking (footnote 5) of each document $d^{(i)}$ among the whole set of documents $\mathcal{D}$. Ranking is particularly interesting as a benchmark task for loss-sensitive training of CRFs for two reasons. The first is the complexity of the output space $\mathcal{Y}(q, \mathcal{D})$, which corresponds to all possible permutations of the documents $\mathcal{D}$, making the application of CRFs to this setting more challenging than sequential labeling problems with chain structure. The second is that learning to rank is an example of a task with multiple ground truths (see Section 4), which is a more challenging setting than the single ground truth case. Indeed, for each training input $x_t = (q_t, \mathcal{D}_t)$, we are not given a single target rank $y_t$, but a vector $r_t$ of relevance level values for each document. The higher the level, the more relevant the document is and the better its rank should be. Moreover, two documents $d^{(i)}_t$ and $d^{(j)}_t$ with the same relevance level (i.e., $r_{ti} = r_{tj}$) are indistinguishable in their ranking, meaning that they can be swapped within some ranking without affecting the quality of that ranking. The quality of a ranking is measured by the Normalized Discounted Cumulative Gain:

$\mathrm{NDCG}(y, r_t) = N_t \sum_{i=1}^{m_t} r_{ti} \frac{\log(2)}{\log(1 + y_i)}$  (12)

where $m_t = |\mathcal{D}_t|$ and $N_t$ is a normalization constant, equal to the inverse of the unnormalized sum evaluated at the ideal ranking $\mathrm{argsort}(-r_t)$; it ensures the maximum value of NDCG is 1, achieved when documents are ordered in decreasing order of their relevance levels.
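The NDCG of Equation 12 (in its linear-gain, LETOR-style form) can be sketched directly; the relevance values below are made up for illustration:

```python
import math

def dcg(ranks, rels):
    """Unnormalized sum of Equation 12: sum_i r_i * log(2)/log(1 + y_i),
    where y_i is the (1-based) rank assigned to document i."""
    return sum(r * math.log(2) / math.log(1 + y) for y, r in zip(ranks, rels))

def ndcg(ranks, rels):
    # Ideal ranking: documents sorted by decreasing relevance.
    order = sorted(range(len(rels)), key=lambda i: -rels[i])
    ideal_ranks = [0] * len(rels)
    for rank, i in enumerate(order, start=1):
        ideal_ranks[i] = rank
    return dcg(ranks, rels) / dcg(ideal_ranks, rels)

rels = [2, 0, 1]                      # relevance levels r_t
assert ndcg([1, 3, 2], rels) == 1.0   # the ideal ordering has NDCG = 1
assert ndcg([3, 1, 2], rels) < 1.0    # ranking the irrelevant doc first is worse
# The loss used in the paper is then l_t(y) = 1 - NDCG(y, r_t).
```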
Note that this is not the standard definition of NDCG; we use it here because this form was adopted for the evaluation of the baselines in Microsoft's LETOR 4.0 dataset collection [24]. To convert NDCG into a loss, we simply define $l_t(y) = 1 - \mathrm{NDCG}(y, r_t)$.

A common approach to ranking is to learn a scoring function $f(q, d^{(i)})$ which outputs, for each document $d^{(i)} \in \mathcal{D}$, a score corresponding to how relevant document $d^{(i)}$ is for query q. Here, we follow the same approach by incorporating this scoring function into the energy function of the CRF. We use an energy function linear in the scores:

$E(y, q, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \alpha_{y_i} f(q, d^{(i)})$  (13)

where $\alpha$ is a weight vector of decreasing values (i.e., $\alpha_i > \alpha_j$ for $i < j$). In our experiments, we use a weighting inspired by the NDCG measure: $\alpha_i = \log(2)/\log(i + 1)$. Using this energy function, we can show that the prediction $\hat{y}(q, \mathcal{D})$ is obtained by sorting the documents in decreasing order of their scores:

$\hat{y}(q, \mathcal{D}) = \arg\min_{y \in \mathcal{Y}(q, \mathcal{D})} E(y, q, \mathcal{D}) = \mathrm{argsort}([-f(q, d^{(1)}), \ldots, -f(q, d^{(|\mathcal{D}|)})])$  (14)

Footnote 5: For example, if $y_i = 3$, then document $d^{(i)}$ is ranked third amongst all documents $\mathcal{D}$ for the query q.

[Figure 2 appears here: two panels of NDCG@1 through NDCG@5 bars, (a) MQ2007 and (b) MQ2008.]

Figure 2: NDCG@1-5 results on the MQ2007 and MQ2008 datasets for the different learning objectives.

As for the scoring function, we use a simple linear function $f(q, d^{(i)}) = \theta^T \phi(q, d^{(i)})$ on a joint query-document feature representation $\phi(q, d^{(i)})$. A standard feature representation is provided with each of the ranking datasets we considered.

We trained CRFs according to maximum likelihood as well as the different loss-sensitive objectives described in Section 3. In all cases, stochastic gradient descent was used, iterating over queries and performing a gradient step update for each query.
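The prediction rule of Equation 14 can be checked on a toy example: with the decreasing weights $\alpha_i = \log(2)/\log(i + 1)$, brute-force minimization of the energy of Equation 13 over all permutations coincides with sorting by decreasing score. The scores below are made up, and `energy` assumes Equation 13's sign convention (good rankings get low energy):

```python
import itertools
import math

def alpha(i):
    # NDCG-inspired position weights from the paper: log(2) / log(i + 1).
    return math.log(2) / math.log(i + 1)

def energy(ranking, scores):
    # E(y, q, D) = -sum_i alpha_{y_i} f(q, d_i), where y_i is doc i's rank.
    return -sum(alpha(y) * s for y, s in zip(ranking, scores))

scores = [0.3, 1.2, -0.4]   # hypothetical f(q, d_i) values
n = len(scores)

# Brute-force minimum-energy ranking over all n! permutations...
best = min(itertools.permutations(range(1, n + 1)),
           key=lambda y: energy(y, scores))

# ...matches argsort by decreasing score: the highest-scoring doc gets rank 1.
order = sorted(range(n), key=lambda i: -scores[i])
ranks = [0] * n
for r, i in enumerate(order, start=1):
    ranks[i] = r
assert list(best) == ranks
```

The equivalence follows from the rearrangement inequality: pairing the largest weights with the largest scores maximizes the weighted sum, hence minimizes the energy.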
Moreover, because the size of $\mathcal{Y}(q, \mathcal{D})$ is factorial in the number of documents, explicit summation over that set is only tractable for a small number of documents. To avoid this problem, we use an approach similar to the one suggested by Petterson et al. [25]. Every time a query $q_t$ is visited and its associated set of documents $\mathcal{D}_t$ contains more than 6 documents, we randomly select a subset of 6 documents $\tilde{\mathcal{D}} \subset \mathcal{D}_t$, ensuring that it contains at least one document of every relevance level found for that query. The exact parameter gradients can then be computed for this reduced set by enumerating all possible permutations, and the CRF can be updated.

6.1 Datasets

In our experiments we use the LETOR [24] benchmark datasets. These datasets were chosen because they are publicly available, include several baseline results, and provide evaluation tools to ensure accurate comparison between methods. LETOR 4.0 contains two learning-to-rank datasets, MQ2007 and MQ2008. MQ2007 contains 1692 queries with 69623 documents, and MQ2008 contains 784 queries and a total of 15211 documents. Each query-document pair is assigned one of three relevance judgments: 2 = highly relevant, 1 = relevant and 0 = irrelevant. Both datasets come with five precomputed folds with 60/20/20 splits for training, validation and testing. For each model, we report the test set results averaged over the five folds.

6.2 Results

We experimented with five objective functions: maximum likelihood (ML), loss-augmented ML (LA), loss-scaled ML (LS), expected loss (EL) and loss-inspired Kullback-Leibler divergence (KL). For the loss-augmented objective we introduced an additional weight $\alpha > 0$, modifying the energy to $E_t(y, x_t; \theta) = E(y, x_t; \theta) - \alpha l_t(y)$; in this form, $\alpha$ controls the contribution of the loss to the overall energy. For all objectives we did a sweep over learning rates in [0.5, 0.01, 0.01, 0.001]. Moreover, we experimented with $\alpha$ in [1, 10, 20, 50] for the loss-augmented objective and with T in [1, 10, 20, 50] for the KL objective. For each fold, the setting that gave the best validation NDCG was chosen and the corresponding model was then tested on the test set.

The results for the five objective functions are shown in Figures 2(a) and 2(b). First, we see that in almost all cases loss-augmentation produces better results than the base maximum likelihood approach. Second, loss-scaling further improves on the loss-augmentation results and has performance similar to the expected loss objective. Finally, among all objectives, KL consistently produces the best results on both datasets. Taken together, these results strongly support our claim that incorporating the loss into the learning procedure of CRFs is important.

A comparison of the CRFs trained with the KL objective against other models is shown in Table 1, which lists the performance of linear regression and the other linear baselines reported on LETOR's website. We see that KL outperforms the baselines on the MQ2007 dataset at all truncations except 4. Moreover, on MQ2008 the performance of KL is comparable to the best baseline, AdaRank, with KL beating AdaRank on NDCG@1. We note also that KL consistently outperforms LETOR's SVM-Struct baseline.

Table 1: NDCG@1-5 results on the MQ2007 and MQ2008 datasets.

                        MQ2007: NDCG                        MQ2008: NDCG
             @1     @2     @3     @4     @5      @1     @2     @3     @4     @5
Regression  38.94  39.60  39.86  40.53  41.11   36.67  40.62  42.89  45.60  47.73
SVM-Struct  40.96  40.73  40.62  40.84  41.42   36.26  39.84  42.85  45.08  46.95
ListNet     40.02  40.63  40.91  41.44  41.70   37.54  41.12  43.24  45.68  47.47
AdaRank     38.76  39.67  40.44  40.67  41.02   38.26  42.11  44.20  46.53  48.21
KL          41.06  40.90  40.93  41.33  41.75   39.47  41.80  43.74  46.18  47.84

7 Conclusion

In this work, we explored different approaches to incorporating loss function information into the training objective of a probabilistic CRF.
We discussed how to adapt ideas from the SSVM literature to the probabilistic context of CRFs, introducing for the first time the equivalent of slack scaling for CRFs. We also described objectives that rely on the probabilistic nature of CRFs, including a novel loss-inspired KL objective. In an empirical comparison on ranking benchmarks, this new KL objective was shown to consistently outperform all other loss-sensitive objectives. To our knowledge, this is the broadest comparison of loss-sensitive training objectives for probabilistic CRFs made to date. It strongly suggests that the most popular approach to CRF training, maximum likelihood, is likely to be suboptimal. While ranking was considered as the benchmark task here, in future work we would like to extend our empirical analysis to other tasks, such as labeling tasks.

References

[1] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.
[2] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In HLT/NAACL, 2003.
[3] Sunita Sarawagi and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, NIPS, pages 1185–1192. MIT Press, Cambridge, MA, 2005.
[4] Dan Roth and Wen-tau Yih. Integer linear programming inference for conditional random fields. In Proceedings of the 22nd International Conference on Machine Learning, ICML, pages 736–743, New York, NY, USA, 2005. ACM.
[5] Asela Gunawardana, Milind Mahajan, Alex Acero, and John C. Platt. Hidden conditional random fields for phone classification. In Interspeech, pages 1117–1120, 2005.
[6] Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, pages 695–702, 2004.
[7] Sanjiv Kumar and Martial Hebert. Discriminative fields for modeling spatial dependencies in natural images. In NIPS, 2003.
[8] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional random fields for object recognition. In NIPS, 2005.
[9] Kengo Sato and Yasubumi Sakakibara. RNA secondary structural alignment with conditional random fields. Bioinformatics, 2005.
[10] Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. Protein fold recognition using segmentation conditional random fields (SCRFs). Journal of Computational Biology, 2006.
[11] Trevor Cohn and Philip Blunsom. Semantic role labelling with tree conditional random fields. In CoNLL, pages 169–172, 2005.
[12] Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2D conditional random fields for web information extraction. In ICML, pages 1044–1051. ACM Press, 2005.
[13] Maksims Volkovs and Richard Zemel. BoltzRank: Learning to maximize expected ranking gain. In Léon Bottou and Michael Littman, editors, ICML, pages 1089–1096, Montreal, June 2009. Omnipress.
[14] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, NIPS. MIT Press, Cambridge, MA, 2004.
[15] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
[16] Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.
[17] C. Sutton and A. McCallum. Piecewise training for undirected models. In UAI, 2005.
[18] Tamir Hazan and Raquel Urtasun. A primal-dual message-passing algorithm for approximated large scale structured prediction. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, NIPS, pages 838–846. MIT Press, 2010.
[19] Sham Kakade, Yee Whye Teh, and Sam T. Roweis. An alternate objective function for Markovian fields. In ICML, pages 275–282, 2002.
[20] Jun Suzuki, Erik McDermott, and Hideki Isozaki. Training conditional random fields with multivariate evaluation measures. In ICCL-ACL, pages 217–224, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
[21] Samuel S. Gross, Olga Russakovsky, Chuong B. Do, and Serafim Batzoglou. Training conditional random fields for maximum labelwise accuracy. In B. Schölkopf, J. Platt, and T. Hoffman, editors, NIPS, pages 529–536. MIT Press, Cambridge, MA, 2007.
[22] Michael J. Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: Optimizing non-smooth rank metrics. In WSDM, pages 77–86, 2008.
[23] David McAllester, Tamir Hazan, and Joseph Keshet. Direct loss minimization for structured prediction. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, NIPS, pages 1594–1602. MIT Press, 2010.
[24] T. Liu, J. Xu, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In ACM SIGIR Workshop on Learning to Rank for Information Retrieval, 2007.
[25] J. Petterson, T. S. Caetano, J. J. McAuley, and J. Yu. Exponential family graph matching and ranking. In NIPS, pages 1455–1463, 2009.