Bethe Projections for Non-Local Inference
Authors: Luke Vilnis, David Belanger, Daniel Sheldon, Andrew McCallum
Bethe Projections for Non-Local Inference

Luke Vilnis* (UMass Amherst, luke@cs.umass.edu), David Belanger* (UMass Amherst, belanger@cs.umass.edu), Daniel Sheldon (UMass Amherst, sheldon@cs.umass.edu), Andrew McCallum (UMass Amherst, mccallum@cs.umass.edu)

Abstract

Many inference problems in structured prediction are naturally solved by augmenting a tractable dependency structure with complex, non-local auxiliary objectives. This includes the mean field family of variational inference algorithms, soft- or hard-constrained inference using Lagrangian relaxation or linear programming, collective graphical models, and forms of semi-supervised learning such as posterior regularization. We present a method to discriminatively learn broad families of inference objectives, capturing powerful non-local statistics of the latent variables, while maintaining tractable and provably fast inference using non-Euclidean projected gradient descent with a distance-generating function given by the Bethe entropy. We demonstrate the performance and flexibility of our method by (1) extracting structured citations from research papers by learning soft global constraints, (2) achieving state-of-the-art results on a widely-used handwriting recognition task using a novel learned non-convex inference procedure, and (3) providing a fast and highly scalable algorithm for the challenging problem of inference in a collective graphical model applied to bird migration.

1 INTRODUCTION

Structured prediction has shown great success in modeling problems with complex dependencies between output variables. Practitioners often use undirected graphical models, which encode conditional dependency relationships via a graph. However, the tractability of exact inference in these models is limited by the graph's treewidth, often yielding a harsh tradeoff between model expressivity and tractability.

* Equal contribution.
Graphical models are good at modeling local dependencies between variables, such as the importance of surrounding context in determining the meaning of words or phrases. However, their sensitivity to cyclic dependencies often renders them unsuitable for modeling preferences for certain globally consistent states. For example, in the canonical NLP task of part-of-speech tagging, there is no clear way to enforce the constraint that every sentence have at least one verb without increasing the likelihood that every token is predicted to be a verb.

Concretely, exact marginal inference in a discrete graphical model can be posed as the following optimization problem:

μ* = argmin_{μ ∈ M} −H(μ) − ⟨θ, μ⟩,    (1)

where μ is a concatenated vector of node and clique marginals, H(μ) is the entropy, M is the marginal polytope, and θ are parameters. Here we face a tradeoff: adding long-range dependencies directly to the model increases the clique size, and thus the complexity of the problem and the size of μ, rendering inference intractable. However, the linear scoring function θ breaks down over cliques, preventing us from enforcing global regularities in any other way.

In this work, we propose to augment the inference objective (1) and instead optimize

μ* = argmin_{μ ∈ M} −H(μ) − ⟨θ, μ⟩ + L_ψ(μ).    (2)

Here, L_ψ is some arbitrary parametric function of the entire concatenated marginal vector, where ψ may depend on input features. Since L_ψ is non-linear, it can enforce many types of non-local properties. Interestingly, whenever L_ψ is convex, and whenever inference is easy in the underlying model, i.e., solving (1) is tractable, we can solve (2) using non-Euclidean projected gradient methods with the Bethe entropy as a distance-generating function. Unlike many message-passing algorithms, our procedure maintains primal feasibility across iterations, allowing its use as an anytime algorithm.
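For intuition, in the single-variable case objective (1) has a closed-form minimizer: over the simplex, argmin_μ −H(μ) − ⟨θ, μ⟩ is the softmax of θ. The toy sketch below (illustrative numbers only, not from the paper) checks this numerically:

```python
import math

def softmax(theta):
    # closed-form minimizer of -H(mu) - <theta, mu> over the simplex
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def objective(mu, theta):
    # -H(mu) - <theta, mu>, the single-variable case of equation (1)
    return sum(p * math.log(p) for p in mu) - sum(t * p for t, p in zip(theta, mu))

theta = [1.0, 0.5, -0.3]
mu_star = softmax(theta)
# any other point in the simplex interior scores strictly worse
```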
Furthermore, for non-convex L_ψ, we also show convergence to a local optimum of (2). Finally, we present algorithms for discriminative learning of the parameters ψ. In a slight abuse of terminology, we call L_ψ a non-local energy function.

Ours is not the first work to consider modeling global preferences by augmenting a tractable base inference objective with non-local terms. For example, generalized mean-field variational inference algorithms augment a tractable distribution (the Q distribution) with a non-linear, non-convex global energy function that scores terms in the full model (the P distribution) using products of marginals of Q (Wainwright & Jordan, 2008). This is one special case of our non-local inference framework, and we present algorithms for solving the problem for much more general L_ψ, with compelling applications.

Additionally, the modeling utility provided by global preferences has motivated work in dual decomposition, where inference in loopy or globally-constrained models is decomposed into repeated calls to inference in tractable independent subproblems (Komodakis et al., 2007; Sontag et al., 2011). It has seen wide success due to its ease of implementation, since it reuses existing inference routines as black boxes. However, the technique is restricted to modeling linear constraints, imposed a priori. Similarly, these types of constraints have also been imposed on expectations of the posterior distribution for use in semi-supervised learning, as in posterior regularization and generalized expectation (Ganchev et al., 2010; Mann & McCallum, 2010). In contrast, our methods are designed to discriminatively learn expressive inference procedures, with minimal domain knowledge required, rather than regularizing inference and learning.
First, we provide efficient algorithms for solving the marginal inference problem (2) and performing MAP prediction in the associated distribution, for both convex and non-convex global energy functions. After that, we provide a learning algorithm for θ and the parametrized L_ψ functions using an interpretation of (2) as approximate variational inference in a probabilistic model. All of our algorithms are easy to implement and rely on simple wrappers around black-box inference subroutines.

Our experiments demonstrate the power and generality of our approach by achieving state-of-the-art results on several tasks. We extract accurate citations from research papers by learning discriminative global regularities of valid outputs, outperforming a strong dual decomposition-based baseline (Anzaroot et al., 2014). In a benchmark OCR task (Taskar et al., 2004), we achieve state-of-the-art results with a learned non-convex, non-local energy function that guides output decodings to lie near dictionary words. Finally, our general algorithm for solving (2) provides large speed improvements for the challenging task of inference in chain-structured collective graphical models (CGMs), applied to bird migration (Sheldon & Dietterich, 2011).

2 BACKGROUND

Let y = (y_1, ..., y_n) denote a set of discrete variables and x a collection of input features. We define the conditional distribution P_θ(y | x) = exp(⟨θ(x), S(y)⟩) / Z, where S(y) is a mapping from y to a set of sufficient statistics, θ(x) is a differentiable vector-valued mapping, and Z = Σ_y exp(⟨θ, S(y)⟩). Conditional random fields (CRFs) assume that (y_1, ..., y_n) are given a graph structure and that S(y) maps y to a 0-1 vector capturing joint settings of each clique (Lafferty et al., 2001). Going forward, we often suppress the explicit dependence of θ on x. For fixed θ, the model is called a Markov random field (MRF).
Given a distribution P(y), define the expected sufficient statistics operator μ(P) = E_P[S(y)]. For the CRF statistics S(y) above, μ is a concatenated vector of node and clique marginals. Therefore, marginal inference, the task of finding the marginal distribution of P_θ(y | x) over y, is equivalent to computing the expectation μ(P_θ(y | x)).

For tree-structured graphical models, the mapping P_θ(y | x) ↔ μ(P_θ(y | x)) is a bijection, though this is not true for general graphs. Furthermore, for trees the entropy H(P_θ(y | x)) is equal to the Bethe entropy H_B(μ(P_θ(y | x))), defined, for example, in Wainwright & Jordan (2008). The marginal polytope M is the set of μ that correspond to some P_θ. As mentioned in the introduction, marginal inference can be posed as the optimization problem (1).

MAP inference finds the joint setting y with maximum probability. For CRFs, this is equivalent to

argmin_y ⟨−θ(x), S(y)⟩.    (3)

For tree-structured CRFs, marginal and MAP inference can be performed efficiently using dynamic programming. Our experiments focus on such graphs. However, the inference algorithms we present can be extended to general graphs wherever marginal inference is tractable using a convex entropy approximation and a local polytope relaxation.

3 MARGINAL INFERENCE WITH NON-LOCAL ENERGIES

We move beyond the standard inference objective (1), augmenting it with a non-local energy term as in (2):

μ* = argmin_{μ ∈ M} −H_B(μ) − ⟨θ, μ⟩ + L_ψ(μ).

Here, L_ψ is some arbitrary parametrized function of the marginals, and ψ may depend on input features x. Intuitively, we are augmenting the inference objective (1) by allowing it to optimize a broader set of tradeoffs – not only between expected node scores, clique scores, and entropy, but also global functions of the marginals.
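On trees, the Bethe entropy computed from edge and node marginals equals the exact entropy. The following sketch (toy potentials, not from the paper) verifies this for a binary 3-node chain by brute-force enumeration:

```python
import math
from itertools import product

# toy log-potentials for a binary 3-node chain y1 - y2 - y3 (made-up numbers)
theta12 = [[0.3, -0.2], [0.1, 0.5]]
theta23 = [[-0.4, 0.2], [0.0, 0.6]]

score = {y: math.exp(theta12[y[0]][y[1]] + theta23[y[1]][y[2]])
         for y in product((0, 1), repeat=3)}
Z = sum(score.values())
P = {y: s / Z for y, s in score.items()}

def H(probs):
    # Shannon entropy of a list of probabilities
    return -sum(p * math.log(p) for p in probs if p > 0)

# node marginal of y2 and edge marginals of (y1, y2) and (y2, y3)
mu2 = [sum(P[y] for y in P if y[1] == a) for a in (0, 1)]
mu12 = [P[(a, b, 0)] + P[(a, b, 1)] for a in (0, 1) for b in (0, 1)]
mu23 = [P[(0, a, b)] + P[(1, a, b)] for a in (0, 1) for b in (0, 1)]

# Bethe entropy on a tree: edge entropies minus (degree - 1) node entropies;
# only the interior node y2 has degree 2 here
H_bethe = H(mu12) + H(mu23) - H(mu2)
H_exact = H(P.values())
```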
To be concrete, in our citation extraction experiments (Section 8.1), for example, we employ the simple structure

L_ψ(μ) = Σ_j ψ_j ℓ_j(μ),    (4)

where each ℓ_j is a univariate convex function and each ψ_j is constrained to be non-negative, in order to maintain the overall convexity of L_ψ. We further employ

ℓ_j(μ) = ℓ̃_j(a_j^⊤ μ),    (5)

where a_j encodes a 'linear measurement' of the marginals and ℓ̃_j is some univariate convex function.

4 VARIATIONAL INTERPRETATION AND MAP PREDICTION

We next provide two complementary interpretations of (2) as variational inference in a class of tractable probability distributions over y. They yield precisely the same variational expression. However, both are useful because the first helps motivate a MAP prediction algorithm, while the second helps characterize our learning algorithm in Section 7 as (approximate) variational EM.

Proposition 1. For fixed θ and L_ψ, the output μ* of inference in the augmented objective (2) is equivalent to the output of standard inference (1) in an MRF with the same clique structure as our base model, but with a modified parameter θ̃ = θ − ∇L_ψ(μ*).

Proof. Forming a Lagrangian for (2), the stationarity conditions with respect to the variable μ are:

0 = −(θ − ∇L_ψ(μ*)) − ∇H_B(μ*) + ∇_μ C(μ, λ),    (6)

where C(μ, λ) collects the terms relating to the marginal polytope constraints. The proposition follows because (6) is the same as the stationarity conditions for

μ* = argmin_{μ ∈ M} −⟨θ − ∇L_ψ(μ*), μ⟩ − H_B(μ).    (7)

Therefore, we can characterize a joint distribution over y by first finding μ* by solving (2) and then defining an MRF over y with parameters θ̃. Even more conveniently, our inference technique in Section 6 iteratively estimates θ̃ on the fly, namely via the dual iterate θ_t in Algorithm 1.
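The structure in (4)–(5) is easy to implement, and its gradient is ∇L_ψ(μ) = Σ_j ψ_j ℓ̃'_j(a_j^⊤ μ) a_j. The sketch below (hypothetical numbers, a squared loss standing in for ℓ̃_j) checks the analytic gradient against finite differences:

```python
def L_psi(mu, psi, A, ell, ell_grad):
    # equation (4) with linear measurements (5): sum_j psi_j * ell(a_j^T mu),
    # plus its gradient sum_j psi_j * ell'(a_j^T mu) * a_j
    vals = [sum(ak * mk for ak, mk in zip(a, mu)) for a in A]
    value = sum(p * ell(v) for p, v in zip(psi, vals))
    grad = [sum(p * ell_grad(v) * a[k] for p, v, a in zip(psi, vals, A))
            for k in range(len(mu))]
    return value, grad

ell = lambda v: (v - 1.0) ** 2          # a univariate convex loss (toy choice)
ell_grad = lambda v: 2.0 * (v - 1.0)

psi = [0.5, 2.0]                         # non-negative weights keep L_psi convex
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # two made-up 'linear measurements'
mu = [0.3, 0.4, 0.3]
value, grad = L_psi(mu, psi, A, ell, ell_grad)
```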
Ultimately, in many prediction problems we seek a single output configuration y rather than an inferred distribution over outputs. Proposition 1 suggests a simple prediction procedure: first, find the variational distribution over y parametrized as an MRF with parameter θ̃; then, perform MAP inference in this MRF. Since we assume an available marginal inference routine for this MRF, we also assume the tractability of MAP – for example using a dynamic program. We avoid predicting y by locally maximizing node marginals, since this would not necessarily yield feasible outputs.

Instead of solving (2), we could have introduced global energy terms into the MAP objective (3) that act directly on values S(y) rather than on expectations μ, as in (2). However, this yields a difficult combinatorial optimization problem for prediction and does not yield a natural way to learn the parametrization of the global energy. Section 8.1 demonstrates that using energy terms defined on marginals, and performing MAP inference in the associated MRF, performs as well as or better than an LP technique designed to directly perform MAP subject to global penalty terms.

Our second variational interpretation characterizes μ* as a variational approximation to a complex joint distribution:

P_c(y | x) = (1 / Z_{θ,ψ}) P_θ(y | x) P_ψ(y | x).    (8)

We assume that isolated marginal inference in P_θ(y | x) is tractable, while P_ψ(y | x) is an alternative structured distribution over y for which we do not have an efficient inference algorithm. Specifically, we assume that (1) can be solved for P_θ. Furthermore, we assume that P_ψ(y | x) ∝ exp(L_ψ(S(y); x)), where L_ψ(·; x) is a convex function, conditional on input features x. Going forward, we will often suppress the dependence of L_ψ on x. Above, Z_{θ,ψ} is the normalizing constant of the combined distribution.
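The prediction recipe suggested by Proposition 1 can be sketched end-to-end for a single-variable toy model (all numbers hypothetical): find the fixed point μ = softmax(θ − ∇L_ψ(μ)) by damped iteration, form θ̃, and take an argmax in place of a full MAP dynamic program:

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

# toy single-variable model; the penalty pulls the marginals toward uniform
theta = [1.5, 0.2, -0.4]
c = 2.0
grad_L = lambda mu: [c * (m - 1.0 / len(mu)) for m in mu]

# find mu* as a fixed point of Proposition 1: mu = softmax(theta - grad_L(mu))
mu = softmax(theta)
for _ in range(200):
    new = softmax([t - g for t, g in zip(theta, grad_L(mu))])
    mu = [0.5 * a + 0.5 * b for a, b in zip(mu, new)]  # damping for stability

# modified parameter from Proposition 1, then "MAP" (argmax for one variable)
theta_tilde = [t - g for t, g in zip(theta, grad_L(mu))]
y_hat = max(range(len(theta_tilde)), key=lambda i: theta_tilde[i])
```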
Note that if L_ψ were linear, inference in both P_ψ(y | x) and P_c(y | x) would be tractable, since the distribution would decompose over the same cliques as P_θ(y | x). Not surprisingly, (8) is intractable to reason about, due to the non-local terms in (2), so we approximate it with a variational distribution Q(y). The connection between this variational approximation and Proposition 1 is derived in Appendix A. Here, we assume no clique structure on Q(y), but show that minimizing a variational approximation of KL(Q(y) || P_c(y | x)), for a given x, yields a Q that is parametrized compactly as the MRF in Proposition 1. We discuss the relationship between this and general mean-field inference in Section 5.

Although the analysis of this section assumes convexity of L_ψ, our inference techniques can be applied to non-convex L_ψ, as discussed in Section 6.3, and our learning algorithm produces state-of-the-art results even in the non-convex regime for a benchmark OCR task.

5 RELATED MODELING TECHNIQUES

Mean field variational inference in undirected graphical models is a particular application of our inference framework, with a non-convex L_ψ (Wainwright & Jordan, 2008). The technique estimates marginal properties of a complex joint distribution P using the clique marginals μ of some tractable base distribution Q, not necessarily fully factorized. This induces a partitioning of the cliques of P into those represented directly by μ and those whose clique marginals are defined as a product distribution over the relevant nodes' marginals in μ. To account for the energy terms of the full model involving cliques absent in the simple base model, the energy ⟨θ, μ⟩ of the base model is augmented with an extra function of μ:
L(μ) = − Σ_{c ∈ C} ⟨θ_c, ⊗_{n ∈ c} μ_n⟩,    (9)

where C is the set of cliques not included in the tractable sub-model, θ_c are the potentials of the original graphical model corresponding to the missing cliques, and ⊗_n μ_n denotes the repeated outer (tensor) product of the node marginals for the nodes in those cliques.

Note that L(μ) is non-linear and non-convex. Our work generalizes (9) by allowing arbitrary non-linear interaction terms between components of μ. This is very powerful – for example, in our citation extraction experiments in Section 8.1, expressing these global terms in a standard graphical model would require many factors touching all variables. Local coordinate-ascent mean field can be frustrated by these rigid global terms. Our gradient-based method avoids these issues by updating all marginals simultaneously.

Dual decomposition is a popular method for performing MAP inference in complex structured prediction models by leveraging repeated calls to MAP in tractable submodels (Komodakis et al., 2007; Sontag et al., 2011). The family of models solvable with dual decomposition is limited, however, because the terms that link the submodels must be expressible as linear constraints. Similar MAP techniques (Ravikumar et al., 2010; Martins et al., 2011; Fu & Banerjee, 2013) based on the alternating direction method of multipliers (ADMM) can be adapted for marginal inference in problems where marginal inference in submodels is tractable. However, their non-local terms are defined as linear functions on settings of graphical model nodes, while our non-linear L_ψ(μ) terms provide practitioners with an expressive means to learn and enforce regularities of the inference output.

Posterior regularization (PR) (Ganchev et al., 2010), learning from measurements (LFM) Liang et al.
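For a single missing pairwise clique, the mean-field energy term (9) reduces to the negative inner product of the clique potentials with the outer product of the two node marginals. A toy sketch with made-up numbers:

```python
# potentials theta_c for one missing pairwise clique over binary nodes (1, 2)
theta_c = [[0.5, -1.0], [0.25, 2.0]]
mu1, mu2 = [0.6, 0.4], [0.3, 0.7]

# the outer product of node marginals stands in for the clique marginal
outer = [[a * b for b in mu2] for a in mu1]

# L(mu) = -<theta_c, mu1 (x) mu2>, one term of equation (9)
L = -sum(theta_c[i][j] * outer[i][j]
         for i in range(2) for j in range(2))
```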
(2009), and generalized expectation (GE) (Mann & McCallum, 2010) are a family of closely-related techniques for performing unsupervised or semi-supervised learning of a conditional distribution P_θ(y | x) or a generative model P_θ(x | y) using expectation-maximization (EM), where the E-step for the latent variables y does not come directly from inference in the model, but instead from projection onto a set of expectations obeying global regularity properties. In PR and GE, this yields a projection objective of the form (2), where the L_ψ terms come from a Lagrangian relaxation of regularity constraints, and ψ corresponds to dual variables. Originally, PR employed linear constraints on marginals, but He et al. (2013) extend the framework to arbitrary convex differentiable functions. Similarly, in LFM such an inference problem arises because we perform posterior inference assuming that the observations y have been corrupted under some noise model. Tarlow & Zemel (2012) also present a method for learning with certain forms of non-local losses in a max-margin framework.

Our goals are very different from those of the above learning methods. We do not impose non-local terms L_ψ in order to regularize our learning process or allow it to cope with minimal annotation. Instead, we use L_ψ to increase the expressivity of our model, performing inference for every test example using a different ψ, since it depends on input features. Since we are effectively 'learning the regularizer' on fully-labeled data, our learning approach in Section 7 differs from these methods. Finally, unlike these frameworks, we employ non-convex L_ψ terms in some of our experiments. The algorithmic consequences of non-convexity are discussed in Section 6.3.
6 OPTIMIZING THE NON-LOCAL MARGINAL INFERENCE OBJECTIVE

We now present an approach to solving (2) using non-Euclidean projected gradient methods, which require access to a procedure for marginal inference in the base distribution (which we term the marginal oracle), as well as access to the gradient of the energy function L_ψ. We pose these algorithms in the composite minimization framework, which gives us access to a wide variety of algorithms that are discussed in the supplementary material.

6.1 CONVEX OPTIMIZATION BACKGROUND

Before presenting our algorithms, we review several definitions from convex analysis (Rockafellar, 1997). We call a function φ σ-strongly convex with respect to a norm ‖·‖_P if, for all x, y ∈ dom(φ),

φ(y) ≥ φ(x) + ∇φ(x)^⊤(y − x) + (σ/2)‖y − x‖_P².

Proposition 2 (e.g. Beck & Teboulle (2003)). The negative entropy function −H(x) = Σ_i x_i log x_i is 1-strongly convex with respect to the 1-norm ‖·‖_1 over the interior of the simplex Δ (restricting dom(H) to int(Δ)).

Given a smooth and strongly convex function φ, we can also define an associated generalized (asymmetric) distance measure called the Bregman divergence (Bregman, 1967) generated by φ:

B_φ(x, x′) = φ(x) − φ(x′) − ⟨∇φ(x′), x − x′⟩.

For example, the KL divergence is the Bregman divergence associated with the negative entropy function, and the squared Euclidean distance is its own associated divergence.

Algorithm 1 Bethe-RDA
Input: parameters θ, energy function L_ψ(μ)
  θ_0 = θ
  μ_0 = MARGINAL-ORACLE(θ_0)  // prox-center
  ḡ_0 = 0
  repeat
    β_t = constant ≥ 0
    ḡ_t = ((t − 1)/t) ḡ_{t−1} + (1/t) ∇L(μ_{t−1})
    θ_t = θ − (t/(t + β_t)) ḡ_t
    μ_t = MARGINAL-ORACLE(θ_t)
  until CONVERGED(μ_t, μ_{t−1})
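The identity just mentioned — the KL divergence is the Bregman divergence generated by the negative entropy — can be checked directly (a quick numeric sketch, not from the paper):

```python
import math

def neg_entropy(x):
    # phi(x) = sum_i x_i log x_i
    return sum(xi * math.log(xi) for xi in x)

def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]

def bregman(phi, grad_phi, x, y):
    # B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - sum(g * (xi - yi)
                                 for g, xi, yi in zip(grad_phi(y), x, y))

def kl(x, y):
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
# on the simplex, B_{-H}(p, q) equals KL(p || q)
```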
Composite minimization (Passty, 1979) is a family of techniques for minimizing functions of the form h = f + R, where we have an oracle that allows us to compute minimizations over R in closed form (usually R takes the form of a regularizer). Problems of this form are often solved with an algorithm called proximal gradient, which minimizes h(x) over some convex set X using:

x_{t+1} = argmin_{x ∈ X} ⟨∇f(x_t), x⟩ + (1/(2η_t))‖x − x_t‖_2² + R(x),

for some decreasing sequence of learning rates η_t. Note that because of the requirement x ∈ X, proximal gradient generalizes projected gradient descent – since unconstrained minimization might take us out of the feasible region X, computing the update requires projecting onto X. But there is no reason to use the squared Euclidean distance when computing our updates and performing the projection. In fact, the squared term can be replaced by any Bregman divergence. This family of algorithms includes the mirror descent and dual averaging algorithms (Beck & Teboulle, 2003; Nesterov, 2009).

We base our projected inference algorithms on regularized dual averaging (RDA) (Xiao, 2010). The updates are:

x_{t+1} = argmin_{x ∈ X} ⟨ḡ_t, x⟩ + (β_t/t) φ(x) + R(x),    (10)

where ḡ_t = (1/t) Σ_{k=1}^t ∇f(x_k) is the average gradient of f encountered so far. One benefit of RDA is that it does not require a learning-rate parameter (β_t = 0) when using a strongly convex regularizer. RDA can be interpreted as performing a projection onto X using the Bregman divergence generated by the strongly convex function φ + R.

6.2 OUR ALGORITHM

These non-Euclidean proximal algorithms are especially helpful when we are unable to compute a projection in terms of Euclidean distance, but can do so using a different Bregman divergence.
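As a concrete instance of swapping the Euclidean projection for an entropic Bregman divergence, mirror descent on the simplex becomes a multiplicative update that stays feasible automatically. A toy sketch (not the paper's algorithm; the objective is made up):

```python
import math

def mirror_descent_simplex(grad_f, x0, eta=0.1, iters=2000):
    # entropic mirror descent: x_{t+1}(i) proportional to x_t(i) * exp(-eta * g_i);
    # the normalization is the KL-based "projection" back onto the simplex
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]
    return x

# minimize f(x) = 0.5 * ||x - c||^2 over the simplex; c lies inside it,
# so the constrained minimizer is c itself
c = [0.2, 0.5, 0.3]
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]
x_star = mirror_descent_simplex(grad_f, [1/3.0, 1/3.0, 1/3.0])
```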
We will show that this is exactly the case for our problem of projected inference: the marginal oracle allows us to project in terms of KL divergence. However, to maintain tractability we avoid using the entropy function H on the exponentially-large simplex Δ, and instead optimize over the structured, factorized marginal polytope M and its corresponding structured Bethe entropy H_B. For tree-structured models, H and H_B have identical values, but different inputs. It remains to show the strong convexity of −H_B so that we can use it in RDA.

Proposition 3. For trees with n nodes, the negative Bethe entropy function −H_B is (1/2)(2n − 1)^{-2}-strongly convex with respect to the 2-norm over the interior of the marginal polytope M.

Proof. Consequence of Lemma 1 in Fu & Banerjee (2013).

With these definitions in hand, we present the Bethe-RDA projected inference procedure, Algorithm 1. This algorithm corresponds to instantiating (10) with R = −H_B − ⟨θ, μ⟩ and φ = −H_B. Note the simplicity of the algorithm when choosing β_t = 0. It is intuitively appealing that the algorithm amounts to no more than calling our marginal inference oracle with iteratively modified parameters.

Proposition 4. For convex energy functions and convex −H_B, the sequence of primal averages of Algorithm 1 converges to the optimum of the variational objective (2) with suboptimality O(ln(t)/t) at time t.

Proof. This follows from Theorem 3 of Xiao (2010), along with the strong convexity of −H_B.

If we have more structure in the energy functions, specifically a Lipschitz-continuous gradient, we can modify the algorithm to use Nesterov's acceleration technique and achieve a convergence rate of O(1/t²). Details can be found in Appendix D. Additionally, in practice these problems need not be solved to optimality, and they give stable results after a few iterations, as demonstrated in Figure 1.
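A minimal single-variable instantiation of Algorithm 1 makes the point concrete: the marginal oracle reduces to a softmax, and the whole loop is just repeated oracle calls with shifted parameters (the energy function and all numbers below are made up for illustration):

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def bethe_rda(theta, grad_L, iters=3000, beta=0.0):
    """Algorithm 1 sketch for a single-variable model, where the
    marginal oracle is just a softmax over the parameters."""
    n = len(theta)
    g_bar = [0.0] * n
    mu = softmax(theta)                  # prox-center
    mu_sum = [0.0] * n
    for t in range(1, iters + 1):
        g = grad_L(mu)                   # gradient of the non-local energy
        g_bar = [(t - 1) / t * gb + gi / t for gb, gi in zip(g_bar, g)]
        theta_t = [th - t / (t + beta) * gb for th, gb in zip(theta, g_bar)]
        mu = softmax(theta_t)            # marginal oracle call
        mu_sum = [s + m for s, m in zip(mu_sum, mu)]
    return [s / iters for s in mu_sum]   # primal average (Proposition 4)

# convex non-local energy L(mu) = (c/2)||mu - u||^2 pulling toward uniform u
c = 5.0
grad_L = lambda mu: [c * (m - 1.0 / len(mu)) for m in mu]
theta = [2.0, 0.0]
mu_star = bethe_rda(theta, grad_L)
```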
6.3 INFERENCE WITH NON-CONVEX, NON-LOCAL ENERGIES

An analogy can be made here to loopy belief propagation – even in the case of non-convex loss functions (and even non-convex entropy functions with associated inexact marginal oracles), the updates of our inference (and learning) algorithms are well-defined. Importantly, since one of our motivations for developing non-local inference was to generalize mean field inference, and the additional penalty terms are non-convex in that case, we would like our algorithms to work in the non-convex case as well.

Unlike loopy belief propagation, however, since we derive our algorithms in the framework of composite minimization, we have access to a wealth of theoretical guarantees. Based on results from the theory of optimization with first-order surrogate loss functions (Mairal, 2013), in Appendix C we propose a small modification to Algorithm 1 with an asymptotic convergence condition even for non-convex energies. In practice we find that the unmodified Algorithm 1 also works well for these problems, and experimentally, in Section 8.2, we see good performance in both inference and learning with non-convex energy functions.

Algorithm 2 Learning with non-local energies
Input: examples (x_i, y_i) and inference oracle MARG() for distributions with the clique structure of P_θ(y | x).
Output: parameters (θ, ψ) for P_c(y | x).
  repeat
    // E-step
    for all (x_i, y_i) do
      μ_i ← (Algorithm 1)        // using θ, ψ, and MARG()
      ρ_i ← (Proposition 5)      // using ψ, μ_i
      // note Q_i(y_i) is a CRF with potentials θ + ρ_i
    end for
    // M-step (gradient-based learning of CRF parameters)
    repeat
      m_i ← MARG(Q_i) ∀i         // standard CRF inference
      ∇θ ← Σ_i S(y_i) − m_i
      ∇ψ ← Σ_i (dρ_i/dψ)^⊤ (S(y_i) − m_i)
      θ ← Gradient-Step(θ, ∇θ)
      ψ ← Gradient-Step(ψ, ∇ψ)
    until converged
  until converged OR iter > max_iters
7 LEARNING MODELS WITH NON-LOCAL ENERGIES

We seek to learn the parameters θ and ψ of the underlying CRF base model and of L_ψ, respectively. Let S = {y_i, x_i} be n training examples. Let Q(y_i; μ_i) be the variational distribution for y_i resulting from applying Proposition 1. Namely, Q(y_i; μ_i) is an MRF with parameters

ρ_i := θ − ∇_μ L_ψ(μ_i).    (11)

We employ the notation Q(y_i; μ_i) to highlight the role of μ_i: for a given (y_i, x_i) pair, the family of variational distributions over y_i is indexed by possible values of μ_i (recall we suppress the explicit dependence of θ and ψ on x). Finally, define the shorthand M = {μ_1, ..., μ_n}.

Algorithm 3 Doubly-stochastic learning with L_ψ given by a sum of scalar functions of linear measurements (5).
Input: examples (x_i, y_i) and MARGINAL-ORACLE() for distributions with the clique structure of P_θ(y | x).
Output: parameters (θ, ψ) for P_c(y | x).
  repeat
    sample (x_i, y_i) randomly
    μ_i ← (Algorithm 1)
    ∇θ ← S(y_i) − μ_i
    ∇ψ_j ← ∇ℓ_j(μ_i) a_j^⊤ (S(y_i) − μ_i)
    θ ← Gradient-Step(θ, ∇θ)
    ψ ← Gradient-Step(ψ, ∇ψ)
  until converged OR iter > max_iters

ψ interacts with the data in a complex manner that prevents us from using standard learning techniques for the exponential family. Namely, we cannot easily differentiate a likelihood with respect to ψ, since this requires differentiating the output μ of a convex optimization procedure, and the extra L_ψ term in (2) prevents the use of the conjugate duality relationships available for the exponential family. We could have used automatic methods to differentiate the iterative inference procedure (Stoyanov et al., 2011; Domke, 2012), but found that our learning algorithm works well.
We employ a variational learning algorithm, presented in Algorithm 2, alternately updating the parameters M of our tractable CRF-structured variational distributions and updating the parameters (θ, ψ) under the following surrogate likelihood given by these CRF approximations:

L(θ, ψ; M) = Σ_i log Q(y_i; μ_i).    (12)

Given θ and ψ, we update M using Algorithm 1. Given M, we update θ and ψ by taking a single step in the direction of the gradient of the surrogate likelihood (12). We avoid taking more than one gradient step, since the gradients for θ and ψ depend on M, and an update to θ and ψ will break the property that μ(Q(y; μ_i)) = μ_i. Therefore, we recompute μ_i every time we update the parameters.

Overall, it remains to show how to compute gradients of (12). For θ, we have the standard CRF likelihood gradient (Sutton & McCallum, 2006):

∇_θ L(θ, ψ; M) = Σ_i S(y_i) − μ_i.    (13)

For ψ, we have:

∇_ψ L(θ, ψ; M) = Σ_i (dρ_i/dψ) (d/dρ_i) log Q(y_i; μ_i).    (14)

From (11), (d/dρ_i) log Q(y_i; μ_i) is also S(y_i) − μ_i, and

dρ_i/dψ = (d/dψ)(d/dμ) L_ψ(μ).    (15)

Clearly, this depends on the structure of L_ψ. Consider the parametrization (4). With this, we have:

(∂/∂ψ_j)(d/dμ) L_ψ(μ) = ∇ℓ_j(μ).    (16)

Therefore, we have (∂/∂ψ_j) log Q(y_i; μ_i) = ∇ℓ_j(μ)^⊤ (S(y_i) − μ_i). For linear measurements (5), this amounts to

∇ℓ̃_j(a_j^⊤ μ_i) (a_j^⊤ S(y_i) − a_j^⊤ μ_i).    (17)

This has a simple interpretation: the gradient with respect to ψ_j equals the gradient of the scalar loss ℓ̃_j at the current marginals μ_i, times the difference in linear measurements between the ground-truth labels and the inferred marginals.

Algorithm 2 has an expensive double-loop structure.
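As a concrete illustration of (17) — a toy sketch with entirely made-up quantities, not the paper's code — the per-example gradient for one measurement with a squared scalar loss:

```python
# toy quantities for one example: sufficient statistics of the gold labels,
# inferred marginals, one measurement vector a_j, and a squared loss
S_y = [1.0, 0.0, 1.0, 0.0]       # S(y_i), hypothetical 0-1 statistics
mu  = [0.7, 0.3, 0.6, 0.4]       # inferred marginals mu_i
a_j = [1.0, 0.0, 1.0, 0.0]       # a hypothetical 'linear measurement'

ell_tilde_grad = lambda v: 2.0 * (v - 1.0)   # derivative of (v - 1)^2

dot = lambda u, v: sum(x * y for x, y in zip(u, v))

# equation (17): scalar-loss gradient at a_j^T mu, times the measurement gap
grad_psi_j = ell_tilde_grad(dot(a_j, mu)) * (dot(a_j, S_y) - dot(a_j, mu))
```

Here a_j^⊤μ = 1.3 and a_j^⊤S(y) = 2.0, so the gradient is 2(1.3 − 1) · (2.0 − 1.3) = 0.42.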
In practice it is sufficient to employ the 'doubly-stochastic' version given in Algorithm 3, where we sample a training example (x_i, y_i) and use it to perform only a single gradient step on θ and ψ. To demonstrate the simplicity of implementing our learning algorithm, we avoid any abstract derivative notation in Algorithm 3 by specializing it to the case of (17). In our experiments, however, we sometimes do not use linear measurements. Overall, all our experiments use solely the fast doubly-stochastic approach of Algorithm 3, since it performs well. In general, our learning algorithms are not guaranteed to converge, because we approximate the complex interaction between ψ and μ with alternating updates. In practice, however, terminating after a fixed number of iterations yields models that generalize well.

Finally, recall that the notation L_ψ(μ_i) suppresses the potential dependence of ψ on x_i. We assume each ψ_j is a differentiable function of features of x_i. Therefore, in our experiments where ψ depends on x_i, we perform gradient updates for the parametrization of ψ(x) via further application of the chain rule.

8 EXPERIMENTS

8.1 CITATION EXTRACTION

Model                               F1
Our Baseline                        94.47
Non-local Energies                  95.47
Baseline (Anzaroot et al., 2014)    94.41
Soft-DD (Anzaroot et al., 2014)     95.39

Table 1: Comparison of F1 scores on the citation extraction dataset. We compare MAP inference F1 scores of our non-local energy model and the specialized dual decomposition model of Anzaroot et al. (2014). Both variants learn global regularities that significantly improve performance.

Figure 1 (plot: test accuracy, 94-95.5, vs. maximum # of inference iterations, log scale 10^0-10^2): Citation extraction F1 when limiting the maximum number of test-time inference iterations. Most of our accuracy gain is captured within the first 5-10 iterations.
We first apply our algorithm to the NLP task of text field segmentation on the UMass citation dataset (Anzaroot & McCallum, 2013), which contains citation strings from research papers, segmented into fields (author, title, etc.). Our modeling approach closely follows Anzaroot et al. (2014), who extract segmentations using a linear-chain segmentation model, to which they add a large set of 'soft' linear global regularity constraints.

Let y be a candidate labeling. Imagine, for example, that we constrain predicted segmentations to have no more predicted last names than first names. Then the numbers of first and last names can be computed by linear measurements a_first^T S(y) and a_last^T S(y), respectively. A hard constraint on y would enforce a_first^T S(y) − a_last^T S(y) = 0. This is relaxed in Anzaroot et al. (2014) to a penalty term

c ℓ_h(a_first^T S(y) − a_last^T S(y)) (18)

that is added to the MAP inference objective, where ℓ_h(x) = max(1 − x, 0) is a hinge function. For multiple soft constraints, the overall prediction problem is

argmin_y ⟨−θ, S(y)⟩ + Σ_j c_j ℓ_h(a_j^T S(y)), (19)

where θ are the parameters of the underlying linear-chain model. They use a dual decomposition style algorithm for solving (19) that crucially relies on the specific structure of the hinge terms ℓ_h. They learn the c_j for hundreds of 'soft constraints' using a perceptron-style algorithm.

We consider the same set of measurement vectors a_j, but impose non-local terms that act on marginals µ rather than specific values y. Further, we use smoothed hinge functions, which improve the convergence rate of inference (Rennie, 2005). We find the variational distribution by solving the marginal inference version of (19), an instance of our inference framework with linear measurements (5):

argmin_µ ⟨−θ, µ⟩ − H_B(µ) + Σ_j c_j ℓ_h(a_j^T µ). (20)

As in Anzaroot et al.
(2014), we first learn chain CRF parameters θ on the training set. Then, we learn the c_j parameters on the development set, using Algorithm 3, and tune hyperparameters for development set performance. At both train and test time, we ignore any terms in (20) for which c_j < 0.

We present our results in Table 1, measuring segment-level F1. Our baseline chain has slightly higher accuracy than the baseline approach of Anzaroot et al. (2014), possibly due to optimization differences. Our augmented model (Non-local Energies) matches and very slightly beats their soft dual decomposition (Soft-DD) procedure. This is especially impressive because they employ a specialized linear-programming solver and learning algorithm adapted to the task of MAP inference under hinge-loss soft constraints, whereas we simply plug in our general learning and inference algorithms for non-local structured prediction, applicable to any set of energy functions.

Our comparable performance provides experimental evidence for our intuition that preferences about MAP configurations can be expressed (and "relaxed") as functions of expectations. Anzaroot et al. (2014) solve a penalized MAP problem directly, while our prediction algorithm first finds a distribution satisfying these preferences, and then performs standard MAP inference in that distribution.

Finally, in Figure 1 we present results demonstrating that our algorithm's high performance can be obtained using only 5-10 calls per test example to inference in the underlying chain model. In Section B, we analyze the empirical convergence behavior of Algorithm 1.

8.2 HANDWRITING RECOGNITION

N-Grams   2      3      4      5      6
Accuracy  85.02  96.20  97.21  98.27  98.54

Table 2: Character-wise accuracy of Structured Prediction Cascades (Weiss et al., 2012) on the OCR dataset.
We next apply our algorithms to the widely-used handwriting recognition dataset of Taskar et al. (2004). We follow the setup of Weiss et al. (2012), splitting the data into 10 equally sized folds, using 9 for training and one for testing. We report the cross-validation results across all 10 folds.

Model                     Accuracy
2-gram (base model)       84.93
L_ψ^u                     94.01
L_ψ^u (MM)                94.96
L_ψ^w                     98.26
L_ψ^w (MM)                98.83
55-Class Classifier (MM)  86.06

Table 4: Character-wise accuracy of our baselines and of models using learned non-local energies on the handwriting recognition dataset. Note that the word classifier baseline is also given in character-wise accuracy for comparison.

The structured prediction cascades of Weiss et al. (2012) achieve high performance on this dataset by using extremely high-order cliques of characters (up to 6-grams), for which they consider only a small number of candidate outputs. Their state-of-the-art results are reproduced in Table 2. The excellent performance of these large-clique models is a consequence of the fact that the data contains only 55 unique words, written by 150 different people. Once the model has access to enough higher-order context, the problem becomes much easier to solve.

With this in mind, we design two non-convex, non-local energy functions. These energies are intended to regularize our predictions to lie close to known elements of the vocabulary. Our base model is a standard linear-chain CRF with image features on the nodes, and no features on the bigram edge potentials.
Let U(µ) = Σ_n µ_n be a function that takes the concatenated vector of node and edge marginals and sums up all of the node marginals, giving the global unigram expected sufficient statistics. Let {u_i} = {U(µ(y_i))} denote the set of all such unique vectors obtained by applying U to the empirical sufficient statistics of each training case y_i. Simply put, this gives 55 vectors u_i of length 26 containing the unigram counts for each unique word in the training set.

Our intuition is that we would like to be able to "nudge" the results of inference in our chain model by pulling the inferred U(µ) to be close to one of these global statistics vectors. We add the following non-convex non-local energy function to the model:

L_ψ^u(µ) = ψ min_i ||u_i − U(µ)||_1. (21)

We learn two variants of this model, which differently parametrize the dependence of ψ on x. The first has a single bias feature on the non-local energy. The second conditions on a global representation of the sequence: concretely, we approximate the RBF kernel mean map (MM) (Smola et al., 2007) using random Fourier features (RFF) (Rahimi & Recht, 2007). This simply involves multiplying each image feature vector in the sequence by a random matrix with ~1000 rows, applying a pointwise non-linearity, and taking ψ to be a linear function of the average vector.

Results of these experiments can be seen in Table 4. Adding the non-local energy brings our performance well above the baseline bigram chain model, and our training procedure gives substantially better performance when ψ depends on the above input features.

The energy L_ψ^u, based on unigram sufficient statistics, cannot capture the relative ordering of letters in the vocabulary words, which the structured prediction cascades models do capture. This motivates us to consider another energy function.
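The unigram energy (21) and a subgradient with respect to the node marginals can be sketched in a few lines. This is an illustrative sketch under assumed array shapes; the function and variable names are ours, not the paper's.

```python
import numpy as np

def unigram_energy(mu_nodes, vocab_unigrams, psi):
    """Energy (21): psi * min_i || u_i - U(mu) ||_1, where U(mu) sums the
    per-position node marginals into expected unigram counts.

    mu_nodes       : (T, 26) node marginals for a length-T word
    vocab_unigrams : (V, 26) unigram count vectors u_i of the vocabulary
    psi            : scalar weight on the energy
    Returns the energy value and a subgradient w.r.t. mu_nodes.
    """
    U = mu_nodes.sum(axis=0)                         # expected unigram counts
    dists = np.abs(vocab_unigrams - U).sum(axis=1)   # L1 distance to each u_i
    i_star = int(np.argmin(dists))                   # nearest vocabulary word
    energy = psi * dists[i_star]
    # Subgradient w.r.t. U; every node marginal receives the same vector,
    # since dU/dmu_n is the identity for each position n.
    g_U = psi * np.sign(U - vocab_unigrams[i_star])
    grad_nodes = np.tile(g_U, (mu_nodes.shape[0], 1))
    return energy, grad_nodes
```

The min over vocabulary words makes the energy non-convex; the subgradient above simply differentiates through the active (nearest) word.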
Let {w_i} = {µ_n(y_i)} be the set of unique vectors of concatenated node marginal statistics for the training set. This gives 55 vectors of length l_i · 26, where l_i is the length of the i-th distinct training word. Next, we define a different energy function to add to our base chain model:

L_ψ^w(µ) = ψ min_i ||w_i − µ||_1. (22)

Once again we implement featurized and non-featurized versions of this model. As noted in the structured prediction cascades work, giving the model access to this level of high-order structure in the data makes the inference problem extremely easy. Our model outperforms the best structured prediction cascades results, and we again note an improvement from using the featurized over the non-featurized ψ.

Of course, since the dataset has only 55 actual labels, and some of those are not valid for certain input sequences due to length mismatches, this is arguably a classification problem as much as a structured prediction problem. To address this, we create another baseline: a constrained 55-class logistic regression classifier (constrained to choose only output classes with appropriate lengths given the input). We use the same global mean-map features as in the L_ψ^* (MM) variants of the structured model and report these results in Table 4. We also tune the number of random Fourier features as a hyperparameter to give the classifier as much expressive power as possible. The performance is still significantly below the best structured models, indicating that the interplay between local and global structure is important.

8.3 COLLECTIVE GRAPHICAL MODELS

Next, we demonstrate that our proximal gradient-based inference framework dramatically speeds up approximate inference in collective graphical models (CGMs) (Sheldon & Dietterich, 2011). CGMs are a method for structured learning and inference with noisy aggregate observation data.
The large-scale dependency structure is represented via a graphical model, but the nodes represent not just single variables, but aggregate sufficient statistics of large sets of underlying variables, corrupted by some noise model.

s            625   10k  50k
Our Method   0.19  2.7  14
IP           2.8   93   690

Table 5: Comparison of runtime (in seconds, averaged over 10 trials) between the interior point solver (IP) of Sheldon et al. (2013) and Algorithm 1 on different CGM problem sizes s, the cardinality of the edge potentials in the underlying graphical model, where marginal inference is O(s).

In previous work, CGMs have been successfully applied to modeling bird migration. Here, the base model is a linear chain representing a time series of bird locations. Each observed variable corresponds to counts from bird watchers in different locations. These observations are assumed to be Poisson distributed with rate proportional to the true count of birds present. The CGM MAP task is to infer the underlying migration patterns. Sheldon et al. (2013) demonstrate that MAP in CGMs is NP-hard, even for trees, but that approximate MAP can be performed by solving a problem of the form (2):

µ* = argmax_µ ⟨θ, µ⟩ + H_B(µ) + Σ_{i=1}^n P_i(µ_i | y_i), (23)

where the P_i are (concave) Poisson log-likelihoods and each y_i is an observed bird count.

For the case where the underlying CGM graph is a tree, the 'hard EM' learning algorithm of Sheldon et al. (2013) is the same as Algorithm 2 specialized to their model. Therefore, Sheldon et al. (2013) provide additional experimental evidence that our alternating surrogate-likelihood optimization works well in practice. Their learning procedure is very computationally expensive because they solve instances of (23) using an interior-point solver in the inner loop. For the special case of trees, Algorithm 1 is directly applicable to (23).
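The Poisson terms in (23) are simple to write down. A sketch, folding the proportionality constant between rate and count into µ_i and dropping the constant log(y_i!) term (this is illustrative, not the authors' code):

```python
import math

def poisson_loglik(mu_i, y_i):
    """Concave Poisson log-likelihood term P_i(mu_i | y_i) from (23), with
    rate mu_i > 0 and observed count y_i; the constant log(y_i!) is dropped,
    leaving y_i * log(mu_i) - mu_i."""
    return y_i * math.log(mu_i) - mu_i

def poisson_loglik_grad(mu_i, y_i):
    """Derivative in mu_i; zero exactly when mu_i matches the count y_i."""
    return y_i / mu_i - 1.0
```

Concavity in µ_i (the second derivative is −y_i/µ_i², which is negative) is what keeps (23) a convex problem overall.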
Using synthetic data and code obtained from the authors, we compare their generic solver to Algorithm 1 for solving instances of (23). In Table 5, we see that our method achieves a large speed-up with no loss in solution accuracy (since it solves the same convex problem).

9 DISCUSSION AND FUTURE WORK

Our results show that our inference and learning framework allows for tractable modeling of non-local dependency structures that resist traditional probabilistic formulations. By approaching structured modeling not via independence assumptions, but via arbitrary penalty functions on the marginal vectors µ, we open many new modeling possibilities. Additionally, our generic gradient-based inference method can achieve substantial speedups on pre-existing problems of interest. In future work, we will apply our framework to new problems and new domains.

ACKNOWLEDGEMENTS

This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, and in part by NSF grant #CNS-0958392. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

Anzaroot, Sam and McCallum, Andrew. A new dataset for fine-grained citation field extraction. In ICML Workshop on Peer Reviewing and Publishing Models, 2013.

Anzaroot, Sam, Passos, Alexandre, Belanger, David, and McCallum, Andrew. Learning soft linear constraints with application to citation field extraction. In ACL, 2014.

Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Bregman, Lev M.
The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, 2012.

Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Tewari, Ambuj. Composite objective mirror descent. In COLT, 2010.

Fu, Qiang, Wang, Huahua, and Banerjee, Arindam. Bethe-ADMM for tree decomposition based parallel MAP inference. In UAI, 2013.

Ganchev, Kuzman, Graça, João, Gillenwater, Jennifer, and Taskar, Ben. Posterior regularization for structured latent variable models. JMLR, 99:2001-2049, 2010.

He, L., Gillenwater, J., and Taskar, B. Graph-based posterior regularization for semi-supervised structured prediction. In CoNLL, 2013.

Komodakis, Nikos, Paragios, Nikos, and Tziritas, Georgios. MRF optimization via dual decomposition: Message-passing revisited. In IEEE ICCV, 2007.

Lafferty, John, McCallum, Andrew, and Pereira, Fernando CN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Liang, Percy, Jordan, Michael I, and Klein, Dan. Learning from measurements in exponential families. In ICML, 2009.

Mairal, Julien. Optimization with first-order surrogate functions. In ICML, 2013.

Mann, Gideon S and McCallum, Andrew. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955-984, 2010.

Martins, André, Figueiredo, Mário, Aguiar, Pedro, Smith, Noah A, and Xing, Eric P. An augmented Lagrangian approach to constrained MAP inference. In ICML, 2011.

Nesterov, Yurii. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

Passty, Gregory B. Ergodic convergence to a zero of the sum of monotone operators in Hilbert space.
Journal of Mathematical Analysis and Applications, 72(2):383-390, 1979.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.

Ravikumar, Pradeep, Agarwal, Alekh, and Wainwright, Martin J. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. JMLR, 11:1043-1080, 2010.

Rennie, Jason DM. Smooth hinge classification, 2005.

Rockafellar, R Tyrrell. Convex Analysis, volume 28. Princeton University Press, 1997.

Sheldon, Daniel, Sun, Tao, Kumar, Akshat, and Dietterich, Thomas G. Approximate inference in collective graphical models. In ICML, 2013.

Sheldon, Daniel R and Dietterich, Thomas G. Collective graphical models. In NIPS, 2011.

Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pp. 13-31. Springer, 2007.

Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Introduction to dual decomposition for inference. Optimization for Machine Learning, 1:219-254, 2011.

Stoyanov, Veselin, Ropson, Alexander, and Eisner, Jason. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, pp. 93-128, 2006.

Tarlow, Daniel and Zemel, Richard S. Structured output learning with high order loss functions. In AISTATS, 2012.

Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2004.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, (1-2):1-305, 2008.

Weiss, D., Sapp, B., and Taskar, B. Structured prediction cascades. ArXiv e-prints, August 2012.

Xiao, Lin.
Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 11:2543-2596, 2010.

Supplementary Material

A Variational Approximation

During learning, reasoning about P_c(y | x) in (8) is difficult, due to the intractability of Z_{θ,ψ}. In response, we approximate it with a variational distribution:

Q(y) = argmin_{Q'} F(Q'; x, θ, ψ), (24)

where

F(Q') = KL(Q'(y) || P_c(y | x))
      = −H(Q') − E_{Q'}[⟨θ, S(y)⟩] + E_{Q'}[L_ψ(S(y))] (25)
      ≈ −H(Q') − ⟨θ, µ(Q')⟩ + L_ψ(µ(Q')). (26)

Given x, θ, and ψ, we select Q by minimizing the approximation (26). Note that the surrogate we minimize is a lower bound on (25), since E_{Q'}[L_ψ(S(y))] ≥ L_ψ(µ(Q')) by Jensen's inequality and the convexity of L. This differs from many mean-field variational inference approaches, which minimize an upper bound.

So far, we have not assumed any structure on Q. Next, we show that the minimizer of (26) is an MRF with the same clique structure as P_θ. This provides an alternative derivation of the techniques in Section 4. Let q_y denote the probability under Q of a given joint configuration y. There are exponentially many such q_y, and H(Q) is the entropy on the simplex, −Σ_y q_y log(q_y). Since Q minimizes (26), we have the following stationarity condition for every q_y:

(d/dq_y)[−H(Q) − q_y log(P_θ(y | x)) + L_ψ(µ(Q))] + λ = 0. (27)

Here, λ is a dual variable for the constraint Σ_y q_y = 1. Rearranging, we have:

Q(y) = (1/Z) P_θ(y | x) exp(−((d/dµ) L_ψ(µ(Q)))^T (d/dq_y) µ(Q)), (28)

where Z is a normalizing constant.

Proposition 5. There exists a vector ρ such that ((d/dµ) L_ψ(µ(Q)))^T (d/dq_y) µ(Q) = ρ^T S(y) for all q_y. Furthermore, ρ is a simple, closed-form function of µ(Q).

Proof. We have (d/dq_y) µ(Q) = S(y), since µ(Q) = Σ_y q_y S(y). Therefore, ρ = (d/dµ) L_ψ(µ(Q)).
Corollary 1. Since P_θ(y | x) ∝ exp⟨θ, S(y)⟩, Proposition 5 implies that Q(y) is an MRF with the same clique decomposition as P_θ(y | x).

So far, Q is implicitly defined in terms of its own marginals µ(Q). Since we assume P_θ and P_ψ have the same sufficient statistics S(y), we can use the Bethe entropy representation H(Q) = H_B(µ(Q)). This transforms (26) into the augmented inference problem (2). Therefore, we can directly solve for µ(Q), which can then be used to provide a closed-form expression for the CRF distribution Q.

B Additional Experiments

In Figure 2, we examine the convergence behavior of our algorithm on the citation dataset. This demonstrates that our inference procedure converges quite quickly except for a small number of difficult cases, where the global energy and the local evidence are in significant disagreement.

Figure 2: The number of iterations taken for inference to converge on test set citations, as a percentage of the total number of test cases. The number of iterations is capped at 40. We can see that the distribution is long-tailed. Inference converges within 40 iterations for 93.7% of examples, and each example takes an average of 9.8 iterations to converge.

Algorithm 4 Bethe-MD
Input: parameters θ, energy function L(µ), learning rate sequence {η_t}
set µ_0 to prox-center MARGINAL-ORACLE(θ)
repeat
  g_t = ∇H_B(µ_{t−1}) + η_t ∇L(µ_{t−1})
  µ_t = MARGINAL-ORACLE((1/(1+η_t)) (η_t θ − g_t))
until CONVERGED(µ_t, µ_{t−1})

C Non-Convex Energies and Composite Mirror Descent

We introduce a small modification of Algorithm 1, along with a rough proof sketch of its convergence even in the case of non-convex energy functions.
Because it leans heavily on significant prior work in optimization, it is hard to give a self-contained proof of the results in this section, and our argument takes the form of a proof sketch that appeals to these other works. The basic argument combines the strong convexity of H_B and its associated Bregman divergence with the results of Mairal (2013) for composite minimization of non-convex functions using the Euclidean Bregman divergence, together with the fact that the local updates performed using the entropy H_B as a distance-generating function have a log-barrier function for the constraint set M, effectively bounding the norm of the gradient of H_B when restricted to the set of iterates actually visited during optimization.

While Algorithm 1 was built on the framework of regularized dual averaging (RDA), we introduce a slightly different formulation based on composite mirror descent (COMID) (Duchi et al., 2010). Like RDA, COMID is a gradient method for minimizing functions of the form h = f + R. At each time step t, COMID makes the update

w_{t+1} = argmin_w ⟨∇f(w_t), w⟩ + (1/η_t) B_ϕ(w, w_t) + R(w), (30)

where ϕ is some strongly convex function and B_ϕ is its associated Bregman divergence. In Algorithm 4, we present an instantiation of composite mirror descent for our inference problem. At first glance, this seems significantly different from our original Algorithm 1, but remembering that ∇H_B(µ_t) = θ_t by conjugate duality of the exponential family, we can see that it corresponds only to a slight re-weighting of the iterates of Algorithm 1.

First, we give Algorithm 4 guarantees in the convex setting similar to those for Algorithm 1.

Proposition 6.
For convex energy functions and convex −H_B, given the learning rate sequence η_t = 1/(λt), where λ is the strong convexity constant of −H_B, the sequence of primal averages of Algorithm 4 converges to the optimum of the variational objective (2) with suboptimality O(ln(t)/t) at time t.

Proof. This follows from a standard online-to-batch conversion, along with the strong convexity of H_B and Theorem 7 of Duchi et al. (2010).

Now, having introduced composite mirror descent in (30), we will lean heavily on the framework for optimization with first-order surrogate losses of Mairal (2013) to show that these types of algorithms should converge even in the non-convex case. We now recall a few definitions from that work. First, we define the asymptotic stationary point condition, which gives us a notion of convergence in the non-convex optimization case.

Definition 1 (Asymptotic Stationary Point (Mairal, 2013)). For a sequence {θ_n}_{n≥0} and a differentiable function f, we say it satisfies an asymptotic stationary point condition if

lim_{n→+∞} ||∇f(θ_n)||_2 = 0.

We call a function L-strongly smooth if L is a bound on the largest eigenvalue of the Hessian; this tells us how the norm of the gradient changes. This is also known as having an L-Lipschitz continuous gradient. Now we recall the notion of a majorant first-order surrogate function.

Definition 2 (Majorant First-Order Surrogate (Mairal, 2013)). A function g : R^p → R is a majorant first-order surrogate of f near κ when the following conditions are satisfied:

• Majorant: we have g ≥ f.
• Smoothness: the approximation error h = g − f is differentiable, and its gradient is L-Lipschitz continuous; moreover, we have h(κ) = 0 and ∇h(κ) = 0.

We denote by S_L(f, κ) the set of such surrogates. Now we recall the majorant first-order surrogate property for the composite minimization step in the case of the Euclidean Bregman divergence (Euclidean distance).
Proposition 7 (Proximal Gradient Surrogates (Mairal, 2013)). Assume that h = f + R, where f is differentiable with an L-Lipschitz gradient. Then h admits the following majorant surrogate in S_{2L}(f, κ):

g(θ) = f(κ) + ∇f(κ)^T (θ − κ) + (L/2) ||θ − κ||_2^2 + R(θ). (31)

We can use this result to establish a majorant property for the composite mirror descent surrogate (30), given a strongly convex and strongly smooth Bregman divergence.

Proposition 8 (Composite Mirror Descent Surrogates). Assume that h = f + R, where f is differentiable with an L-Lipschitz gradient, ϕ is a σ-strongly convex and γ-strongly smooth function, and B_ϕ is its Bregman divergence. Then h admits the following majorant surrogate in S_{L + Lγ/σ}(f, κ):

g(θ) = f(κ) + ∇f(κ)^T (θ − κ) + (L/σ) B_ϕ(θ, κ) + R(θ). (32)

Proof. By the definition of strong convexity and the Bregman divergence, (32) upper bounds (31), so it is a majorant of h. Additionally, by the additive property of strong smoothness, we obtain the strong smoothness constant of the surrogate.

However, small technical conditions keep Proposition 8 from applying directly to our case. The Bethe entropy H_B, and thus its associated Bregman divergence, is not strongly smooth: its gradient norm is unbounded as we approach the corners of the marginal polytope. However, it is locally Lipschitz: every point in the domain has a neighborhood on which the function is Lipschitz. In practice, since the −H_B mirror descent updates have a barrier function for the constraint set M, our iterative algorithm never gets too close to the boundary of the polytope, and H_B is effectively strongly smooth for the purposes of our minimization algorithm. This is not a rigorous argument, but it is both intuitively plausible and borne out in experiments.
Algorithm 5 Accelerated Bethe-RDA
Input: parameters θ, energy function L(µ)
set µ_0 to prox-center MARGINAL-ORACLE(θ)
set ν_0 = µ_0, ḡ_0 = 0
repeat
  c_t = 2/(t+1)
  u_t = (1 − c_t) µ_{t−1} + c_t ν_{t−1}
  ḡ_t = (1 − c_t) ḡ_{t−1} + c_t ∇L(u_t)
  ν_t = MARGINAL-ORACLE((t(t+1)/(4L + t(t+1))) (θ − ḡ_t))
  µ_t = (1 − c_t) µ_{t−1} + c_t ν_t
until CONVERGED(µ_t, µ_{t−1})

Proposition 9. The sequence of iterates w_t from Algorithm 4, when bounded away from the corners of the marginal polytope constraint set M, and for an appropriate choice of learning rates {η_t}, convex −H_B, and L-strongly smooth (but possibly non-convex) energy function L_ψ, satisfies an asymptotic stationary point condition.

Proof. This follows from application of Proposition 8, noting that Algorithm 4 corresponds to the generalized surrogate-minimization scheme in Algorithm 1 of Mairal (2013). The asymptotic stationary point condition then follows from Proposition 2.1 of Mairal (2013). The appropriate learning rates {η_t} must be chosen according to the Lipschitz constant of the gradient of L_ψ, as well as the effective Lipschitz constant of the gradient of H_B, given how far we are bounded from the edge of the constraint set (this effective smoothness constant is determined by the norm of our parameter vector θ).

In this section we have given a rough proof sketch for the asymptotic convergence of our inference algorithms even in the case of non-convex energies. Our heuristic argument for the effective smoothness of the entropy H_B is the most pressing avenue for future work, but we believe it could be made rigorous by examining the norm of the parameter vector and how it contributes to the "sharpness" of the barrier function for the mirror descent iterates.
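As a sanity check on the updates of Algorithm 4, it can be instantiated on a toy single-variable model, where the marginal oracle is an exact softmax and the Bethe entropy coincides with the Shannon entropy. This is a sketch under those simplifying assumptions, not the authors' implementation; the helper names are ours.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def bethe_md(theta, grad_L, lr, iters=500, tol=1e-8):
    """Sketch of Algorithm 4 (Bethe-MD) for a single K-state variable,
    where MARGINAL-ORACLE is an exact softmax and H_B reduces to the
    Shannon entropy H(mu) = -sum(mu * log(mu)).

    theta  : (K,) potentials of the base model
    grad_L : callable giving the gradient of the energy L at mu
    lr     : callable t -> eta_t, the learning rate sequence
    """
    mu = softmax(theta)  # prox-center MARGINAL-ORACLE(theta)
    for t in range(1, iters + 1):
        eta = lr(t)
        # Gradient of the entropy at mu in the single-node case.
        g = -(np.log(mu) + 1.0) + eta * grad_L(mu)
        mu_new = softmax((eta * theta - g) / (1.0 + eta))
        if np.abs(mu_new - mu).sum() < tol:  # CONVERGED check
            return mu_new
        mu = mu_new
    return mu
```

At a fixed point the update implies µ ∝ exp(θ − ∇L(µ)), the stationarity condition of the entropy-regularized objective; in particular, for a linear energy ⟨v, µ⟩ the method converges to softmax(θ − v).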
D Accelerated Bethe-RDA

If we have L-strongly smooth losses (L is a bound on the largest eigenvalue of the Hessian), we can use an accelerated dual averaging procedure to obtain an even faster convergence rate of O(1/t²). Let D be the diameter of the marginal polytope as measured by the strongly convex distance-generating function H_B (using its associated Bregman divergence). Then Algorithm 5 gives a convergence rate of 4LD²/t² by Corollary 7 of Xiao (2010).
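Algorithm 5 can likewise be sketched on the same toy single-variable model with an exact softmax oracle. Again, this is an illustrative sketch under assumed simplifications (a fixed iteration budget in place of the convergence check), not the authors' code.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def accelerated_bethe_rda(theta, grad_L, smooth_L, iters=2000):
    """Sketch of Algorithm 5 (Accelerated Bethe-RDA) for a single K-state
    variable whose marginal oracle is an exact softmax.

    theta    : (K,) potentials of the base model
    grad_L   : gradient of the (smooth) energy at a marginal vector
    smooth_L : the strong-smoothness constant L of the energy
    """
    mu = softmax(theta)              # prox-center MARGINAL-ORACLE(theta)
    nu = mu.copy()                   # nu_0 = mu_0
    g_bar = np.zeros_like(theta)     # running gradient average
    for t in range(1, iters + 1):
        c = 2.0 / (t + 1)
        u = (1 - c) * mu + c * nu
        g_bar = (1 - c) * g_bar + c * grad_L(u)
        scale = t * (t + 1) / (4 * smooth_L + t * (t + 1))
        nu = softmax(scale * (theta - g_bar))
        mu = (1 - c) * mu + c * nu
    return mu
```

For a linear energy ⟨v, µ⟩ the averaged gradient ḡ_t equals v from the first step on, and the iterates converge to softmax(θ − v), the optimum of the entropy-regularized objective.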