A statistical and computational theory for robust and sparse Kalman smoothing


Authors: Aleksandr Y. Aravkin, James V. Burke, Gianluigi Pillonetto

A statistical and computational theory for robust and sparse Kalman smoothing

Aleksandr Aravkin *, James V. Burke **, Gianluigi Pillonetto ***

* Department of Earth and Ocean Sciences, University of British Columbia, Vancouver, Canada (e-mail: saravkin@eos.ubc.ca)
** Department of Mathematics, University of Washington, Seattle, USA (e-mail: burke@math.washington.edu)
*** Department of Information Engineering, University of Padova, Padova, Italy (e-mail: giapi@dei.unipd.it)

Abstract: Kalman smoothers reconstruct the state of a dynamical system from noisy output samples. While the classical estimator relies on quadratic penalization of process deviations and measurement errors, extensions that exploit Piecewise Linear Quadratic (PLQ) penalties have recently been proposed in the literature. These new formulations include smoothers that are robust with respect to outliers in the data, and smoothers that keep better track of fast system dynamics, e.g. jumps in the state values. In addition to L2, well-known examples of PLQ penalties include the L1, Huber and Vapnik losses. In this paper, we use a dual representation for PLQ penalties to build a statistical modeling framework and a computational theory for Kalman smoothing. We develop the statistical framework by establishing conditions required to interpret PLQ penalties as negative logs of true probability densities. We then present a computational framework, based on interior-point methods, that solves the Kalman smoothing problem with PLQ penalties while maintaining linear complexity in the size of the time series, just as in the L2 case. The framework presented extends the computational efficiency of the Mayne-Fraser and Rauch-Tung-Striebel algorithms to a much broader nonsmooth setting, and includes many known robust and sparse smoothers as special cases.

Keywords: Piecewise linear quadratic penalties; nonsmooth optimization; L1/Huber/Vapnik loss functions; interior point methods

1. INTRODUCTION

Consider the following discrete-time linear state-space model

x_1 = x_0 + w_1,
x_k = G_k x_{k-1} + w_k,   k = 2, 3, \ldots, N,
z_k = H_k x_k + v_k,       k = 1, 2, \ldots, N,        (1.1)

where x_k ∈ R^n is the state, x_0 is known, z_k ∈ R^m contains the noisy output samples, and G_k and H_k are known matrices. Further, {w_k} and {v_k} are mutually independent zero-mean random variables with covariances {Q_k} and {R_k}, respectively. The classical fixed-interval Kalman smoothing problem is to obtain the (unconditional) minimum-variance linear estimator of the states {x_k}_{k=1}^N as a function of {z_k}_{k=1}^N. It is well known that the structure of this estimator is related to the optimization problem

\min_{\{x_k\}} \; \sum_{k=1}^{N} \|z_k - H_k x_k\|^2_{R_k^{-1}} + \|x_k - G_k x_{k-1}\|^2_{Q_k^{-1}},        (1.2)

where G_1 denotes the identity matrix and \|a\|^2_\Sigma := a^T \Sigma a for every column vector a. When data become available, the solution can be computed by the classical Kalman smoother with a number of operations linear in N. This procedure also provides the minimum-variance estimate of the states when all the system noises are assumed to be Gaussian.
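To make (1.2) concrete, here is a minimal numerical sketch (illustrative only; the dimensions, values, and helper names below are ours, not the paper's) that evaluates the quadratic smoothing objective for a small synthetic system, implementing the weighted norm \|a\|^2_\Sigma = a^T \Sigma a directly.

```python
import numpy as np

def quadratic_smoothing_objective(x, z, G, H, Q, R, x0):
    """Evaluate the classical objective (1.2):
    sum_k ||z_k - H_k x_k||^2_{R_k^{-1}} + ||x_k - G_k x_{k-1}||^2_{Q_k^{-1}},
    with G_1 = I and x_0 given."""
    def wnorm2(a, Sigma_inv):
        return float(a @ Sigma_inv @ a)
    N = len(z)
    obj, x_prev = 0.0, x0
    for k in range(N):
        Gk = np.eye(len(x0)) if k == 0 else G[k]      # G_1 = I by convention
        obj += wnorm2(z[k] - H[k] @ x[k], np.linalg.inv(R[k]))
        obj += wnorm2(x[k] - Gk @ x_prev, np.linalg.inv(Q[k]))
        x_prev = x[k]
    return obj

# Illustrative 1-D random-walk example (all values hypothetical).
rng = np.random.default_rng(0)
N, n, m = 5, 1, 1
G = [np.eye(n) for _ in range(N)]
H = [np.eye(m) for _ in range(N)]
Q = [np.eye(n) for _ in range(N)]
R = [0.5 * np.eye(m) for _ in range(N)]
x0 = np.zeros(n)
x_true = np.cumsum(rng.normal(size=(N, n)), axis=0)
z = [x_true[k] + 0.1 * rng.normal(size=m) for k in range(N)]
print(quadratic_smoothing_objective(list(x_true), z, G, H, Q, R, x0))
```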
In many circumstances, linear estimators that rely on quadratic penalization of model deviations, such as (1.2), lead to unsatisfactory results. For instance, they are not robust with respect to the presence of outliers in the data [Huber, 1981, Aravkin et al., 2011a, Farahmand et al., 2011] and may have difficulties in reconstructing fast system dynamics, e.g. jumps in the state values [Ohlsson et al., 2011]. In addition, sparsity-promoting regularization is often used to extract, from a large measurement or parameter vector, a small subset having the greatest impact on the predictive capability of the estimate for future data. This sparsity principle permeates many well-known techniques in machine learning and signal processing, such as feature selection, selective shrinkage, and compressed sensing [Hastie and Tibshirani, 1990, Efron et al., 2004, Donoho, 2006]. In the smoothing context, it can be interpreted as a sparse non-Gaussian prior distribution on the noises entering the system. In these cases, the estimator (1.2) is often replaced by

\sum_{k=1}^{N} V(z_k - H_k x_k; R_k) + J(x_k - G_k x_{k-1}; Q_k),        (1.3)

where, for example, V can be the Huber or the Vapnik ε-insensitive loss, used in support vector regression [Vapnik, 1998, Evgeniou et al., 2000], while J may be the ℓ1-norm, as in the LASSO procedure [Tibshirani, 1996].

The interpretation of problems such as (1.3) in terms of Bayesian estimation has been extensively studied in the statistical and machine learning literature in recent years, and probabilistic approaches to the analysis of estimation and learning algorithms can be found e.g. in [MacKay, 1994, Tipping, 2001, Wipf et al., 2011]. Non-Gaussian model errors and priors leading to a great variety of loss and penalty functions are also reviewed in [Palmer et al., 2006] using convex-type and integral-type variational representations, the latter being related to Gaussian scale mixtures. The fundamental novelty of this work is that, rather than taking this approach, we start from a particular class of losses, the PLQ penalties, well known in the optimization literature [Rockafellar and Wets, 1998]. We establish conditions under which these losses can be viewed as negative logs of true densities, ensuring that w_k and v_k in (1.1) come from true distributions. This in turn allows us to interpret the solution of problem (1.3) as a MAP estimator when the loss functions V and J come from this subclass of PLQ penalties. We show that this subclass includes the four key examples: the L2, L1, Huber, and Vapnik penalties.

The density characterization of PLQ penalties is achieved using a dual representation, which also underlies the development of algorithms for fitting models of the form (1.3). In particular, in the second part of the paper we derive the conditions, complementary to those needed to set up the statistical framework, that allow the development of new and computationally efficient Kalman smoothers designed using nonsmooth penalties on the process and measurement deviations. Remarkably, the interior-point method used in [Aravkin et al., 2011a] generalizes to the entire class of PLQ densities under a simple verifiable non-degeneracy condition. Hence, the solutions of all the PLQ Kalman smoothers can be computed with a number of operations that scales linearly in N, as in the quadratic case.
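As a small numerical illustration of why (1.3) can be preferable to (1.2) (a sketch under our own simplifying assumptions: scalar states, G_k = H_k = 1, unit weights, and x_0 = 0), the Huber measurement loss V grows only linearly in a gross outlier, while the quadratic loss grows quadratically; the ℓ1 norm plays the role of the process penalty J.

```python
import numpy as np

def huber(r, K=1.0):
    """Scalar Huber loss with threshold K (see Fig. 1, left)."""
    a = np.abs(r)
    return np.where(a <= K, 0.5 * r**2, K * a - 0.5 * K**2)

def plq_objective(x, z, K=1.0):
    """Sketch of (1.3) for a scalar random walk (G_k = H_k = 1, unit weights):
    Huber loss V on measurement residuals, l1 penalty J on process deviations."""
    meas = huber(z - x, K).sum()
    proc = np.abs(np.diff(x, prepend=0.0)).sum()   # x_0 = 0 assumed
    return meas + proc

x = np.linspace(0.0, 1.0, 6)          # candidate state trajectory
z = x + 0.05                          # clean measurements
z_out = z.copy(); z_out[3] += 10.0    # one gross outlier

print(plq_objective(x, z), plq_objective(x, z_out))   # grows roughly linearly
print(((z - x)**2).sum(), ((z_out - x)**2).sum())     # grows quadratically
```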
This theoretical foundation generalizes the results recently obtained in [Aravkin et al., 2011a,b, Farahmand et al., 2011, Ohlsson et al., 2011], framing them as particular cases of the framework presented here.

The paper is organized as follows. In Section 2 we introduce the class of PLQ convex functions and provide the conditions under which they can be interpreted as negative logs of corresponding densities. In Section 3 we present a new PLQ Kalman smoother theorem that generalizes the well-known Mayne-Fraser two-filter and Rauch-Tung-Striebel algorithms [Gelb, 1974] to nonsmooth formulations. This theorem is obtained by solving the Karush-Kuhn-Tucker (KKT) system for PLQ penalties using interior-point methods, and exploiting the state-space structure to obtain the solution in linear time. The results and proofs supporting the main theorems appear in the Appendix. We end the paper with a few concluding remarks.

2. PIECEWISE LINEAR QUADRATIC PENALTIES AND DENSITIES

2.1 Preliminaries

We recall a few definitions from convex analysis.

• (Affine hull) The affine hull of any set S, denoted aff S, is the smallest affine set that contains S.
• (Cone) For any set S, cone S denotes the set {ts | s ∈ S, t ∈ R_+}.
• (Polar cone) For any cone K ⊂ R^m, the polar of K is K° := {v | ⟨v, w⟩ ≤ 0 for all w ∈ K}.
• (Horizon cone) The (convex) horizon cone C^∞ is the set of 'unbounded directions' for C: for any point w̄ ∈ C, C^∞ = {d | w̄ + τd ∈ cl C for all τ ≥ 0}.

2.2 PLQ densities

We now introduce the PLQ penalties and densities that are the focus of this paper.

Definition 2.1. (Piecewise linear-quadratic penalties) [Rockafellar and Wets, 1998]. For a nonempty polyhedral set U ⊂ R^m and a symmetric positive-semidefinite matrix M ∈ R^{m×m} (possibly M = 0), the function θ_{U,M} : R^m → R defined by

θ_{U,M}(w) := \sup_{u ∈ U} \left\{ ⟨u, w⟩ - \tfrac{1}{2}⟨u, Mu⟩ \right\}        (2.1)

is proper, convex, and piecewise linear-quadratic. When M = 0, it is piecewise linear; θ_{U,0} = σ_U, the support function of U. The effective domain of θ_{U,M}, denoted dom(θ_{U,M}), is the set of w ∈ R^m where θ_{U,M}(w) < ∞, and is given by (U^∞ ∩ Null(M))°.

In order to capture the full class of penalties of interest, we consider injective affine transformations into R^m of the form b + By. The requirements on B are therefore m ≥ n and Null(B) = {0}. The final technical requirement we impose is that b ∈ dom θ_{U,M}.

Definition 2.2. (PLQ penalties with shifts and transforms) Using (2.1), define ρ_{U,M,b,B} : R^n → R as θ_{U,M}(b + By), i.e.

ρ_{U,M,b,B}(y) := \sup_{u ∈ U} \left\{ ⟨u, b + By⟩ - \tfrac{1}{2}⟨u, Mu⟩ \right\}.        (2.2)

The following result characterizes the effective domain of ρ (see the Appendix for a proof).

Theorem 2.3. Let ρ denote ρ_{U,M,b,B}(y), and let K denote U^∞ ∩ Null(M). Suppose U ⊂ R^m is a polyhedral set, y ∈ R^n, b ∈ K°, M ∈ R^{m×m} is positive semidefinite, and B ∈ R^{m×n} is injective. Then (B^T K)° ⊂ dom(ρ) and (B^T(K ∩ -K))^⊥ = aff(dom(ρ)).

Note that the functions ρ are still piecewise linear-quadratic. All of the important examples mentioned above can be represented in this way, as shown below.

Remark 2.4. (Scalar examples) The L2, ℓ1, Huber, and Vapnik penalties are representable in the notation of Definition 2.2.
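Before the case-by-case derivations (Fig. 1 and the enumeration that follows), a brute-force numerical check of the dual representation (2.1)-(2.2) can be helpful: approximating the sup over U on a fine grid reproduces the closed-form Huber and Vapnik losses. This is a sketch with illustrative constants, not part of the paper.

```python
import numpy as np

def theta_sup(y, U, M, b, B):
    """Brute-force rho(y) = sup_{u in U} <u, b + B y> - 0.5 <u, M u>,
    with U given as a finite grid of candidate points (one per row)."""
    w = b + B @ np.atleast_1d(y)
    vals = U @ w - 0.5 * np.einsum('ij,jk,ik->i', U, M, U)
    return vals.max()

# Huber: U = [-K, K], M = 1, b = 0, B = 1 (scalar case).
K = 1.0
U_h = np.linspace(-K, K, 2001).reshape(-1, 1)
huber = lambda y: 0.5 * y**2 if abs(y) <= K else K * abs(y) - 0.5 * K**2
for y in (-3.0, 0.4, 2.5):
    print(y, theta_sup(y, U_h, np.eye(1), np.zeros(1), np.eye(1)), huber(y))

# Vapnik: U = [0,1]^2, M = 0, B = (1, -1)^T, b = (-eps, -eps)^T.
eps = 0.5
g = np.linspace(0.0, 1.0, 101)
U_v = np.array([[a, c] for a in g for c in g])
B_v = np.array([[1.0], [-1.0]]); b_v = np.array([-eps, -eps])
vapnik = lambda y: max(abs(y) - eps, 0.0)
for y in (-2.0, 0.2, 1.1):
    print(y, theta_sup(y, U_v, np.zeros((2, 2)), b_v, B_v), vapnik(y))
```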
Fig. 1. Huber (left) and Vapnik (right) penalties. The Huber penalty with parameter K is V(x) = -Kx - ½K² for x < -K, V(x) = ½x² for -K ≤ x ≤ K, and V(x) = Kx - ½K² for x > K. The Vapnik penalty with parameter ε is V(x) = -x - ε for x < -ε, V(x) = 0 for -ε ≤ x ≤ ε, and V(x) = x - ε for x ≥ ε.

(1) L2: Take U = R, M = 1, b = 0, and B = 1. We obtain ρ(y) = \sup_{u ∈ R} \{ uy - ½u² \}. The function inside the sup is maximized at u = y, whence ρ(y) = ½y².

(2) ℓ1: Take U = [-1, 1], M = 0, b = 0, and B = 1. We obtain ρ(y) = \sup_{u ∈ [-1,1]} uy. The sup is attained at u = sign(y), whence ρ(y) = |y|.

(3) Huber: Take U = [-K, K], M = 1, b = 0, and B = 1. We obtain ρ(y) = \sup_{u ∈ [-K,K]} \{ uy - ½u² \}. Taking the derivative with respect to u, consider three cases:
(a) if y < -K, take u = -K to obtain -Ky - ½K²;
(b) if -K ≤ y ≤ K, take u = y to obtain ½y²;
(c) if y > K, take u = K to obtain Ky - ½K².
This is the Huber penalty with parameter K, shown in the left panel of Fig. 1.

(4) Vapnik: Take U = [0,1] × [0,1], M = 0 ∈ R^{2×2}, B = (1, -1)^T, and b = (-ε, -ε)^T for some ε > 0. We obtain

ρ(y) = \sup_{u_1, u_2 ∈ [0,1]} \left\langle \begin{pmatrix} y - ε \\ -y - ε \end{pmatrix}, \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \right\rangle.

An explicit representation follows by considering three cases:
(a) if |y| < ε, take u_1 = u_2 = 0, so ρ(y) = 0;
(b) if y > ε, take u_1 = 1 and u_2 = 0, so ρ(y) = y - ε;
(c) if y < -ε, take u_1 = 0 and u_2 = 1, so ρ(y) = -y - ε.
This is the Vapnik penalty with parameter ε, shown in the right panel of Fig. 1. Note that the affine generalization (Definition 2.2) is already needed to express the Vapnik penalty.

In order to characterize PLQ penalties as negative logs of density functions, we need to ensure the integrability of those density functions. A function ρ(y) is called coercive if lim_{‖y‖→∞} ρ(y) = ∞, and coercivity turns out to be the key property ensuring integrability. The proof of this fact, and the characterization of coercivity for PLQ penalties in the notation of Definition 2.2, are the subject of the next two theorems (see the Appendix for proofs).

Theorem 2.5. (PLQ integrability) Suppose ρ(y) is coercive, and let n_aff denote the dimension of aff(dom ρ). Then the function f(y) = exp(-ρ(y)) is integrable on aff(dom ρ) with respect to the n_aff-dimensional Lebesgue measure.

Theorem 2.6. (Coercivity of ρ) ρ is coercive if and only if [B^T cone(U)]° = {0}.

Theorem 2.6 can be used to show the coercivity of familiar penalties.

Corollary 2.7. The L2, L1, Huber, and Vapnik penalties are all coercive.

Proof: We show that all of these penalties satisfy the hypothesis of Theorem 2.6.
(1) L2: U = R and B = 1, so [B^T cone(U)]° = R° = {0}.
(2) ℓ1: U = [-1, 1], so cone(U) = R, and B = 1, so the proof reduces to case 1.
(3) Huber: U = [-K, K], so cone(U) = R, and B = 1, so the proof reduces to case 1.
(4) Vapnik: U = [0,1] × [0,1], so cone(U) = R²_+. B = (1, -1)^T, so B^T cone(U) = R, and again we reduce to case 1. □
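As a numerical illustration of Theorem 2.5 (a sketch with illustrative constants K = 1 and ε = 0.5), the normalizing constants ∫ exp(-ρ(y)) dy that appear in the density definition below are finite for the coercive scalar penalties above; a crude Riemann sum over a wide truncated interval approximates them.

```python
import numpy as np

def huber(y, K=1.0):
    a = np.abs(y)
    return np.where(a <= K, 0.5 * y**2, K * a - 0.5 * K**2)

def vapnik(y, eps=0.5):
    return np.maximum(np.abs(y) - eps, 0.0)

# Coercivity (Theorem 2.6) makes exp(-rho) decay along every direction, so the
# integrals below converge; truncation at |y| <= 40 is ample at these scales.
y = np.linspace(-40.0, 40.0, 160001)
dy = y[1] - y[0]
c1_huber = np.sum(np.exp(-huber(y))) * dy
c1_vapnik = np.sum(np.exp(-vapnik(y))) * dy
print(c1_huber, c1_vapnik)   # both finite, as Theorem 2.5 predicts
```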
We now define a family of distributions on R^n by interpreting piecewise linear-quadratic functions ρ as negative logs of corresponding densities. Note that the support of these distributions is always contained in the affine set aff(dom ρ), characterized in Theorem 2.3.

Definition 2.8. (Piecewise linear-quadratic densities) Let ρ be any coercive piecewise linear-quadratic function on R^n of the form ρ_{U,M,b,B}(y) = θ_{U,M}(b + By). Define p(y) to be the density on R^n

p(y) = c_1^{-1} \exp(-ρ(y)) if y ∈ dom ρ, and p(y) = 0 otherwise,        (2.3)

where c_1 = \int_{y ∈ dom ρ} \exp(-ρ(y)) \, dy, and the integral is taken with respect to the Lebesgue measure of dimension dim(aff(dom ρ)).

PLQ densities are true densities on the affine hull of the domain of ρ. The proof of Theorem 2.5 is easily adapted to show that they have moments of all orders.

3. KALMAN SMOOTHING WITH PLQ PENALTIES

In this section we consider the model (1.1), but in the general case where the errors w_k and v_k can come from any of the densities introduced in the previous section. To this end, we first formulate the smoothing problem over the entire state sequence {x_k}. Given a sequence of column vectors {u_k} and matrices {T_k}, we write

\mathrm{vec}(\{u_k\}) = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{bmatrix}, \qquad \mathrm{diag}(\{T_k\}) = \begin{bmatrix} T_1 & 0 & \cdots & 0 \\ 0 & T_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & T_N \end{bmatrix}.

We make the following definitions:

x = vec{x_1, ..., x_N},  w = vec{w_1, ..., w_N},  v = vec{v_1, ..., v_N},
Q = diag{Q_1, ..., Q_N},  R = diag{R_1, ..., R_N}.

We also introduce the matrices G and H:

G = \begin{bmatrix} I & & & \\ -G_2 & I & & \\ & \ddots & \ddots & \\ & & -G_N & I \end{bmatrix}, \qquad H = \mathrm{diag}\{H_1, \ldots, H_N\}.

With this notation, model (1.1) can be written as

Gx = \tilde{x}_0 + w, \qquad z = Hx + v,        (3.1)

where x ∈ R^{nN} is the entire state sequence of interest, w is the corresponding process noise, z is the vector of all measurements, v is the measurement noise, and x̃_0 is a vector of length nN whose first n-block equals x_0 (the initial state estimate) and whose remaining blocks are zero. The general Kalman smoothing problem is described in the following proposition.

Proposition 3.1. Suppose that the noises w and v in model (3.1) have PLQ densities with mean 0 and variances Q and R (see Definition 2.8). Then, for suitable U^w, M^w, b^w, B^w and U^v, M^v, b^v, B^v, we have

p(w) ∝ \exp(-θ_{U^w,M^w}(b^w + B^w Q^{-1/2} w)),
p(v) ∝ \exp(-θ_{U^v,M^v}(b^v + B^v R^{-1/2} v)),        (3.2)

and the MAP estimator of x in model (3.1) is

\arg\min_{x ∈ R^{nN}} \; θ_{U^w,M^w}(b^w + B^w Q^{-1/2}(Gx - \tilde{x}_0)) + θ_{U^v,M^v}(b^v + B^v R^{-1/2}(Hx - z)).        (3.3)

Note that since the w_k and v_k are independent, problem (3.3) is decomposable into a sum of terms analogous to (1.2). This special structure is manifest in the block-diagonal structure of H, Q, R, B^v, B^w, the block-bidiagonal structure of G, and the product structure of the sets U^w and U^v; it is the key to the linear-complexity result derived later in this section. For our purposes it is important to recall that, when the sets U^w and U^v are polyhedral, (3.3) is an Extended Linear Quadratic Program (ELQP), described in [Rockafellar and Wets, 1998, Example 11.43].
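The stacked quantities in (3.1) are easy to assemble explicitly. The following sketch (our own illustrative code, with random data and hypothetical dimensions) builds G, H and x̃_0 from {G_k} and {H_k} via the vec/diag notation above and checks the identities Gx = x̃_0 + w and z = Hx + v on a simulated trajectory.

```python
import numpy as np

def blockdiag(blocks):
    """diag({T_k}): place the blocks T_1, ..., T_N on the (block) diagonal."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for Bk in blocks:
        out[r:r + Bk.shape[0], c:c + Bk.shape[1]] = Bk
        r += Bk.shape[0]; c += Bk.shape[1]
    return out

rng = np.random.default_rng(1)
N, n, m = 4, 2, 1                                   # illustrative sizes
Gk = [None] + [rng.normal(size=(n, n)) for _ in range(N - 1)]   # Gk[k] plays G_{k+1}
Hk = [rng.normal(size=(m, n)) for _ in range(N)]
x0 = rng.normal(size=n)

# Simulate (1.1) with noise realizations w_k, v_k.
w = [rng.normal(size=n) for _ in range(N)]
v = [rng.normal(size=m) for _ in range(N)]
x = [x0 + w[0]]
for k in range(1, N):
    x.append(Gk[k] @ x[k - 1] + w[k])
z = [Hk[k] @ x[k] + v[k] for k in range(N)]

# Assemble G (block bidiagonal), H (block diagonal), and x~_0 as in (3.1).
Gbig = np.eye(N * n)
for k in range(1, N):
    Gbig[k * n:(k + 1) * n, (k - 1) * n:k * n] = -Gk[k]
Hbig = blockdiag(Hk)
x0_tilde = np.concatenate([x0] + [np.zeros(n)] * (N - 1))

xs, ws, vs, zs = map(np.concatenate, (x, w, v, z))
print(np.allclose(Gbig @ xs, x0_tilde + ws))        # Gx = x~_0 + w
print(np.allclose(zs, Hbig @ xs + vs))              # z  = Hx + v
```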
Rather than directly solving (3.3), we work with its Karush-Kuhn-Tucker (KKT) system. We present the system in the following lemma and derive it in the Appendix.

Lemma 3.2. Suppose that the sets U^w and U^v are polyhedral, i.e. they can be written as U^w = {u | (A^w)^T u ≤ a^w} and U^v = {u | (A^v)^T u ≤ a^v}. Then the necessary first-order optimality conditions for (3.3) are

0 = (A^w)^T u^w + s^w - a^w;  0 = (A^v)^T u^v + s^v - a^v;
0 = (s^w)^T q^w;  0 = (s^v)^T q^v;
0 = \tilde{b}^w + B^w Q^{-1/2} G \bar{d} - M^w u^w - A^w q^w;
0 = \tilde{b}^v - B^v R^{-1/2} H \bar{d} - M^v u^v - A^v q^v;
0 = G^T Q^{-T/2} (B^w)^T u^w - H^T R^{-T/2} (B^v)^T u^v;
0 ≤ s^w, s^v, q^w, q^v.        (3.4)

Our approach is to solve (3.4) directly using interior point (IP) methods. IP methods apply a damped Newton iteration to a relaxed version of (3.4), specifically relaxing the complementarity conditions:

(s^w)^T q^w = 0  →  Q^w S^w 1 - μ1 = 0,
(s^v)^T q^v = 0  →  Q^v S^v 1 - μ1 = 0,

where Q^w, S^w, Q^v, S^v are diagonal matrices with diagonals q^w, s^w, q^v, s^v, respectively. The parameter μ is aggressively decreased to 0 as the IP iterations proceed. Typically, no more than 10 or 20 iterations of the relaxed system are required to obtain a solution of (3.4), and hence an optimal solution of (3.3).

The following theorem is the main result of this section. It shows that the computational effort required per IP iteration is linear in the number of time steps, regardless of which PLQ density enters the state-space model.

Theorem 3.3. (PLQ Kalman smoother theorem) Suppose that all w_k and v_k in the Kalman smoothing model (1.1) come from PLQ densities satisfying Null(M) ∩ U^∞ = {0}, i.e. their corresponding penalties are finite-valued. Then (3.3) can be solved using an IP method with computational complexity O(Nn³ + Nm) per iteration.

The proof is presented in the Appendix and shows that IP methods for solving (3.3) preserve the key block-tridiagonal structure of the standard smoother. General smoothing estimates can therefore be computed in O(Nn³) time, as long as the number of IP iterations is fixed (as it usually is in practice, to 10 or 20). It is important to observe that the motivating examples (see Remark 2.4) all satisfy the conditions of Theorem 3.3.

Corollary 3.4. The densities corresponding to the L1, L2, Huber, and Vapnik penalties all satisfy the hypotheses of Theorem 3.3.

Proof: We verify that Null(M) ∩ Null(A^T) = {0} for each of the four penalties. In the L2 case, M has full rank. For the L1, Huber, and Vapnik penalties, the respective sets U are bounded, so U^∞ = {0}. □
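A direct numerical check of Corollary 3.4 (a sketch; the A and M data below are our renderings of the scalar examples from Remark 2.4, with K = 1 and the bound vectors a omitted since only A and M matter here) verifies Null(M) ∩ Null(A^T) = {0} by confirming that the stacked matrix [M; A^T] has full column rank.

```python
import numpy as np

# Polyhedral data U = {u : A^T u <= a} and matrix M for each scalar penalty
# (constants are illustrative; columns of A are the constraint normals).
penalties = {
    "L2":     {"A": np.zeros((1, 0)),                     "M": np.eye(1)},        # U = R
    "L1":     {"A": np.array([[1.0, -1.0]]),              "M": np.zeros((1, 1))}, # U = [-1, 1]
    "Huber":  {"A": np.array([[1.0, -1.0]]),              "M": np.eye(1)},        # U = [-K, K]
    "Vapnik": {"A": np.hstack([np.eye(2), -np.eye(2)]),   "M": np.zeros((2, 2))}, # U = [0,1]^2
}

for name, d in penalties.items():
    A, M = d["A"], d["M"]
    stacked = np.vstack([M, A.T]) if A.size else M
    # Null(M) ∩ Null(A^T) = Null([M; A^T]), trivial iff the stack has full column rank.
    ok = np.linalg.matrix_rank(stacked) == M.shape[1]
    print(name, ok)
```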
4. CONCLUSIONS

We have presented a new theory for robust and sparse Kalman smoothing using nonsmooth PLQ penalties applied to the process and measurement deviations. These smoothers can be designed within a statistical framework obtained by viewing PLQ penalties as negative logs of true probability densities, and we have presented the conditions that allow this interpretation. In this regard, the coercivity condition characterized in Theorem 2.6 plays a central role. This condition is also a nice example of how the statistical framework established in the first part of the paper gives an alternative viewpoint on an idea useful in machine learning: coercivity is a fundamental prerequisite in sparse and robust estimation, since it precludes directions along which the loss and the regularizer are insensitive to large parameter/state changes. Thus, the condition for a PLQ penalty to be the negative log of a true density also ensures that the problem is well posed and that the learning machine/smoother can control model complexity.

In the second part of the paper, we have shown that solutions to PLQ Kalman smoothing formulations can be computed with a number of operations that is linear in the length of the time series, as in the quadratic case. A sufficient condition for the successful execution of the IP iterations is that the PLQ penalties be finite-valued, which implies non-degeneracy of the corresponding statistical distribution (its support cannot be contained in a lower-dimensional subspace). The statistical interpretation is thus strongly linked to the computational procedure.

The computational framework presented allows a broad application of interior-point methods to a wide class of smoothing problems of interest to practitioners. The algorithmic scheme designed here, together with the breadth of the new statistical framework, underscores the practical utility and flexibility of this approach. We believe that this perspective on model development and Kalman smoothing will be useful in a number of applications in the years ahead.

REFERENCES

A. Aravkin, B. Bell, J.V. Burke, and G. Pillonetto. An ℓ1-Laplace robust Kalman smoother. IEEE Transactions on Automatic Control, 2011a.
A. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. Learning using state space kernel machines. In Proc. IFAC World Congress 2011, Milan, Italy, 2011b.
B.M. Bell. The marginal likelihood for parameters in a discrete Gauss-Markov process. IEEE Transactions on Signal Processing, 48(3):626-636, 2000.
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.
S. Farahmand, G.B. Giannakis, and D. Angelosante. Doubly robust smoothing of dynamical processes via outlier sparsity constraints. IEEE Transactions on Signal Processing, 59:4529-4543, 2011.
A. Gelb. Applied Optimal Estimation. The M.I.T. Press, Cambridge, MA, 1974.
T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Monographs on Statistics and Applied Probability, volume 43. Chapman and Hall, London, UK, 1990.
P.J. Huber. Robust Statistics. Wiley, 1981.
D.J.C. MacKay. Bayesian non-linear modelling for the prediction competition. ASHRAE Transactions, 100(2):3704-3716, 1994.
H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. State smoothing by sum-of-norms regularization. Automatica (to appear), 2011.
J.A. Palmer, D.P. Wipf, K. Kreutz-Delgado, and B.D. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Proc. of NIPS, 2006.
R.T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, 1970.
R.T. Rockafellar and R.J.B. Wets. Variational Analysis, volume 317. Springer, 1998.
R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 1996.
M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.
V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.
D.P. Wipf, B.D. Rao, and S. Nagarajan. Latent variable Bayesian models for promoting sparsity. IEEE Transactions on Information Theory (to appear), 2011.
APPENDIX

Preliminaries

Definition 4.1. (Horizon cone, specialized to the convex setting by [Rockafellar and Wets, 1998, Theorem 3.6]) The horizon cone C^∞ of a convex set C is convex and, for any point w̄ ∈ C, consists of the vectors {d | w̄ + τd ∈ cl C for all τ ≥ 0}.

Definition 4.2. (Lineality) The lineality of a convex cone K, denoted lin(K), is K ∩ -K. Since K is a convex cone, lin(K) is the largest subspace contained in K.

Lemma 4.3. (Characterization of lineality, [Rockafellar, 1970, Theorem 14.6]) Let K be any closed convex cone containing the origin. Then lin(K) = (K°)^⊥.

Definition 4.4. (Affine hull) The affine hull of any set S, denoted aff S, is the smallest affine set that contains S.

Corollary 4.5. (Characterization of aff K°) Taking the perp of the characterization in Lemma 4.3, the affine hull of the polar of a closed convex cone K is given by aff K° = lin(K)^⊥.

Proof of Theorem 2.3

Lemma 4.6. (Polars, linear transformations, and shifts) Let K ⊂ R^n be a closed convex cone, b ∈ R^n, and B ∈ R^{n×k}. If b ∈ K°, then (B^T K)° ⊂ B^{-1}(K° - b).

Proof: Recall that a convex cone is closed under addition. Then for any b ∈ K°, we have K° + b ⊂ K°, and hence K° ⊂ K° - b. By [Rockafellar, 1970, Corollary 16.3.2] we get (B^T K)° = B^{-1}(K°) ⊂ B^{-1}(K° - b). □

Corollary 4.7. Let K be a closed convex cone, and B ∈ R^{n×k}. If b ∈ K°, then (B^T lin(K))^⊥ ⊂ aff(B^{-1}(K° - b)).

Proof: By Lemma 4.6,

aff(B^{-1}(K° - b)) ⊃ aff((B^T K)°) = (lin(B^T K))^⊥,

where the last equality is by Corollary 4.5. Since B^T is a linear transformation, lin(B^T K) = B^T lin(K). □

Lemma 4.8. Let K ⊂ R^n be a closed convex cone, b ∈ aff(K°), and B ∈ R^{n×k}. Then aff(B^{-1}(K° - b)) ⊂ B^{-1}(lin(K)^⊥) ⊂ B^{-1}(aff(K° - b)).

Proof: If w ∈ aff(B^{-1}(K° - b)), then for some finite N we can find sets {λ_i} ⊂ R and {w_i} ⊂ B^{-1}(K° - b) such that \sum_{i=1}^N λ_i = 1 and \sum_{i=1}^N λ_i w_i = w. For each w_i we have Bw_i ∈ K° - b, so b + Bw_i ∈ K°. Then

b + Bw = \sum_{i=1}^N λ_i (b + Bw_i) ∈ aff(K°) = lin(K)^⊥.

Since b ∈ lin(K)^⊥ by assumption, we have Bw ∈ lin(K)^⊥, and so w ∈ B^{-1}(lin(K)^⊥).

Next, starting with w ∈ B^{-1}(lin(K)^⊥), we have Bw ∈ lin(K)^⊥ and hence b + Bw ∈ lin(K)^⊥, since lin(K)^⊥ is a subspace and b ∈ lin(K)^⊥. Then for some finite Ñ we can find sets {λ_i} ⊂ R and {v_i} ⊂ K° such that \sum_{i=1}^{Ñ} λ_i = 1 and \sum_{i=1}^{Ñ} λ_i v_i = b + Bw. Subtracting b from both sides gives \sum_{i=1}^{Ñ} λ_i (v_i - b) = Bw, so in particular Bw ∈ aff(K° - b). Then w ∈ B^{-1}(aff(K° - b)). □

Theorem 4.9. Let K ⊂ R^n be a closed convex cone, b ∈ R^n, and B ∈ R^{n×k}. If b ∈ K°, then

(B^T lin(K))^⊥ = aff(B^{-1}(K° - b)) = B^{-1}(lin(K)^⊥).

Proof: From Corollary 4.7 and Lemma 4.8 we immediately have

(B^T lin(K))^⊥ ⊂ aff(B^{-1}(K° - b)) ⊂ B^{-1}(lin(K)^⊥).

Note that for any subspace S, S^⊥ = S°. Then by [Rockafellar, 1970, Corollary 16.3.2], (B^T lin(K))^⊥ = B^{-1}(lin(K)^⊥). □

The proof of Theorem 2.3 now follows from Lemma 4.6 and Theorem 4.9.
Proof of Theorem 2.5

Using the characterization of piecewise linear-quadratic functions in [Rockafellar and Wets, 1998, Definition 10.20], the effective domain of ρ(y) can be represented as the union of finitely many polyhedral sets U_i, relative to each of which ρ(y) is given by an expression of the form ½⟨y, A_i y⟩ + ⟨a_i, y⟩ + α_i for some scalar α_i ∈ R, vector a_i ∈ R^n, and symmetric positive-semidefinite matrix A_i ∈ R^{n×n}.

Since ρ(y) is coercive, we claim that on each unbounded U_i there must be constants N_i and β_i > 0 such that ρ(y) ≥ β_i ‖y‖ whenever y ∈ U_i and ‖y‖ ≥ N_i. Otherwise, we could find an index set J and a sequence {y_j}_{j∈J} ⊂ U_i with ρ(y_j) ≤ β_j ‖y_j‖, where β_j ↓ 0 and ‖y_j‖ ↑ ∞. Without loss of generality, suppose y_j/‖y_j‖ converges to ȳ ∈ U_i^∞, by [Rockafellar, 1970, Theorem 8.2]. By assumption ρ(y_j)/‖y_j‖ ↓ 0, and

ρ(y_j)/‖y_j‖ = (‖y_j‖/2) ⟨y_j/‖y_j‖, A_i y_j/‖y_j‖⟩ + ⟨a_i, y_j/‖y_j‖⟩ + α_i/‖y_j‖.

Taking the limit of both sides over J, we see that ‖y_j‖ ⟨y_j/‖y_j‖, A_i y_j/‖y_j‖⟩ must converge to a finite value. This is only possible if ⟨ȳ, A_i ȳ⟩ = 0, so in particular ȳ ∈ Null(A_i). Note also that ⟨a_i, ȳ⟩ ≤ 0, obtained by taking the limit over J of

ρ(y_j)/‖y_j‖ ≥ ⟨a_i, y_j/‖y_j‖⟩ + α_i/‖y_j‖.

Then for any x_0 ∈ U_i and λ > 0 we have x_0 + λȳ ∈ U_i (since ȳ ∈ U_i^∞) and ρ(x_0 + λȳ) ≤ ρ(x_0), so ρ stays bounded as λ ↑ ∞ and cannot be coercive, a contradiction.

The integrability of f(y) is now clear. First note that f(y) is bounded below by 0. Recall that the effective domain of ρ is the union of finitely many polyhedral sets U_i, and for each unbounded U_i we have shown that f(y) ≤ exp(-β_i ‖y‖) off a bounded subset of U_i. An elementary application of the bounded convergence theorem then shows that f is integrable.

Proof of Theorem 2.6

First observe that B^{-1}((cone U)°) = [B^T cone(U)]° by [Rockafellar, 1970, Corollary 16.3.2]. Suppose ŷ ∈ B^{-1}((cone U)°) with ŷ ≠ 0. Then Bŷ ∈ (cone U)°, and Bŷ ≠ 0 since B is injective, and we have

ρ(τŷ) = \sup_{u ∈ U} ⟨b + τBŷ, u⟩ - ½ u^T M u
      = \sup_{u ∈ U} ⟨b, u⟩ - ½ u^T M u + τ⟨Bŷ, u⟩
      ≤ \sup_{u ∈ U} ⟨b, u⟩ - ½ u^T M u = θ_{U,M}(b),

so ρ(τŷ) stays bounded even as τ → ∞, and ρ cannot be coercive.

Conversely, suppose ρ is not coercive. Then we can find a sequence {y_k} with ‖y_k‖ > k and a constant K̄ such that ρ(y_k) ≤ K̄ for all k > 0. Without loss of generality, we may assume y_k/‖y_k‖ → ȳ. Then by the definition of ρ, for all u ∈ U,

⟨b + By_k, u⟩ - ½ u^T M u ≤ K̄,
⟨b + By_k, u⟩ ≤ K̄ + ½ u^T M u,
⟨(b + By_k)/‖y_k‖, u⟩ ≤ K̄/‖y_k‖ + u^T M u / (2‖y_k‖).

Note that ȳ ≠ 0, so Bȳ ≠ 0. Taking the limit as k → ∞, we get ⟨Bȳ, u⟩ ≤ 0 for all u ∈ U. From this inequality we see that Bȳ ∈ (cone U)°, and so ȳ ∈ B^{-1}((cone U)°), which is therefore not {0}.

Proof of Lemma 3.2

The Lagrangian of (3.3) for feasible (x, u^w, u^v) is

L(x, u^w, u^v) = \left\langle \begin{bmatrix} \tilde{b}^w \\ \tilde{b}^v \end{bmatrix}, \begin{bmatrix} u^w \\ u^v \end{bmatrix} \right\rangle - \frac{1}{2} \begin{bmatrix} u^w \\ u^v \end{bmatrix}^T \begin{bmatrix} M^w & 0 \\ 0 & M^v \end{bmatrix} \begin{bmatrix} u^w \\ u^v \end{bmatrix} - \left\langle \begin{bmatrix} u^w \\ u^v \end{bmatrix}, \begin{bmatrix} -B^w Q^{-1/2} G \\ B^v R^{-1/2} H \end{bmatrix} x \right\rangle,        (4.1)

where \tilde{b}^w = b^w - B^w Q^{-1/2} \tilde{x}_0 and \tilde{b}^v = b^v - B^v R^{-1/2} z.
The associated optimality conditions for feasible (x, u^w, u^v) are

G^T Q^{-T/2} (B^w)^T \bar{u}^w - H^T R^{-T/2} (B^v)^T \bar{u}^v = 0,
\tilde{b}^w - M^w \bar{u}^w + B^w Q^{-1/2} G \bar{x} ∈ N_{U^w}(\bar{u}^w),
\tilde{b}^v - M^v \bar{u}^v - B^v R^{-1/2} H \bar{x} ∈ N_{U^v}(\bar{u}^v),        (4.2)

where N_C(x) denotes the normal cone to the set C at the point x (see [Rockafellar, 1970] for details). Since U^w and U^v are polyhedral, we can derive explicit representations of the normal cones N_{U^w}(\bar{u}^w) and N_{U^v}(\bar{u}^v). For a polyhedral set U ⊂ R^m and any point ū ∈ U, the normal cone N_U(ū) is polyhedral. Indeed, relative to any representation U = {u | A^T u ≤ a} and the active index set I(ū) := {i | ⟨A_i, ū⟩ = a_i}, where A_i denotes the i-th column of A, we have

N_U(ū) = { q_1 A_1 + \cdots + q_m A_m | q_i ≥ 0 for i ∈ I(ū), q_i = 0 for i ∉ I(ū) }.        (4.3)

Using (4.3), we may rewrite the optimality conditions (4.2) more explicitly as

G^T Q^{-T/2} (B^w)^T \bar{u}^w - H^T R^{-T/2} (B^v)^T \bar{u}^v = 0,
\tilde{b}^w - M^w \bar{u}^w + B^w Q^{-1/2} G \bar{d} = A^w q^w,
\tilde{b}^v - M^v \bar{u}^v - B^v R^{-1/2} H \bar{d} = A^v q^v,
q^w ≥ 0, with q^w_i = 0 for i ∉ I(\bar{u}^w),
q^v ≥ 0, with q^v_i = 0 for i ∉ I(\bar{u}^v).        (4.4)

Define slack variables s^w ≥ 0 and s^v ≥ 0 by

s^w = a^w - (A^w)^T u^w,   s^v = a^v - (A^v)^T u^v.        (4.5)

The entries of q^w and q^v are zero whenever the corresponding slack variables s^w and s^v are nonzero, so (q^w)^T s^w = (q^v)^T s^v = 0. These equations are the complementarity conditions. Together, all of these equations give the system (3.4).

4.1 Proof of Theorem 3.3

IP methods apply a damped Newton iteration to find the solution of the relaxed KKT system F_μ = 0, where

F_μ(s^w, s^v, q^w, q^v, u^w, u^v, d) =
\begin{bmatrix}
(A^w)^T u^w + s^w - a^w \\
(A^v)^T u^v + s^v - a^v \\
Q^w S^w 1 - μ1 \\
Q^v S^v 1 - μ1 \\
\tilde{b}^w + B^w Q^{-1/2} G d - M^w u^w - A^w q^w \\
\tilde{b}^v - B^v R^{-1/2} H d - M^v u^v - A^v q^v \\
G^T Q^{-T/2} (B^w)^T u^w - H^T R^{-T/2} (B^v)^T u^v
\end{bmatrix},

with Q^w, S^w, Q^v, S^v the diagonal matrices defined above. Each Newton step entails solving the system

F_μ^{(1)}(s^w, s^v, q^w, q^v, u^w, u^v, d) \begin{bmatrix} Δs^w \\ Δs^v \\ Δq^w \\ Δq^v \\ Δu^w \\ Δu^v \\ Δd \end{bmatrix} = -F_μ(s^w, s^v, q^w, q^v, u^w, u^v, d),        (4.6)

where the derivative matrix F_μ^{(1)} is given by

\begin{bmatrix}
I & 0 & 0 & 0 & (A^w)^T & 0 & 0 \\
0 & I & 0 & 0 & 0 & (A^v)^T & 0 \\
Q^w & 0 & S^w & 0 & 0 & 0 & 0 \\
0 & Q^v & 0 & S^v & 0 & 0 & 0 \\
0 & 0 & -A^w & 0 & -M^w & 0 & B^w Q^{-1/2} G \\
0 & 0 & 0 & -A^v & 0 & -M^v & -B^v R^{-1/2} H \\
0 & 0 & 0 & 0 & G^T Q^{-T/2}(B^w)^T & -H^T R^{-T/2}(B^v)^T & 0
\end{bmatrix}.        (4.7)

We now show the row operations needed to reduce the matrix F_μ^{(1)} in (4.7) to upper block-triangular form. After each operation, we show only the row that was modified.

row 3 ← row 3 - Q^w · row 1:   [ 0  0  S^w  0  -Q^w(A^w)^T  0  0 ]
row 4 ← row 4 - Q^v · row 2:   [ 0  0  0  S^v  0  -Q^v(A^v)^T  0 ]
row 5 ← row 5 + A^w (S^w)^{-1} · row 3:   [ 0  0  0  0  -T^w  0  B^w Q^{-1/2} G ]
row 6 ← row 6 + A^v (S^v)^{-1} · row 4:   [ 0  0  0  0  0  -T^v  -B^v R^{-1/2} H ]

In the above expressions,

T^w := M^w + A^w (S^w)^{-1} Q^w (A^w)^T,
T^v := M^v + A^v (S^v)^{-1} Q^v (A^v)^T,        (4.8)

where (S^w)^{-1} Q^w and (S^v)^{-1} Q^v are always full-rank diagonal matrices, since the vectors s^w, q^w, s^v, q^v remain strictly positive throughout the IP iterations.
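The structure of (4.8) is easy to exercise numerically. The sketch below (our own code, with illustrative sizes and random positive diagonals standing in for S^{-1}Q) forms T = M + A S^{-1} Q A^T for a stack of independent Vapnik components and confirms that it is block diagonal and invertible, anticipating Lemma 4.10 and Corollary 4.12 below.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3          # number of independent Vapnik components (illustrative)

# Vapnik data from Remark 2.4: M_i = 0 in R^{2x2}, U_i = [0,1]^2 = {u : A_i^T u <= a_i}
# with A_i = [I, -I]; the bound vector a_i is not needed to form T.
M_i = np.zeros((2, 2))
A_i = np.hstack([np.eye(2), -np.eye(2)])

def blkdiag(blocks):
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]; c += b.shape[1]
    return out

# Stack as in Remark 4.11; s and q are strictly positive, as in the IP iterations.
M = blkdiag([M_i] * N)
A = blkdiag([A_i] * N)
s = rng.uniform(0.1, 2.0, size=4 * N)
q = rng.uniform(0.1, 2.0, size=4 * N)
T = M + A @ np.diag(q / s) @ A.T               # T = M + A S^{-1} Q A^T, cf. (4.8)

print(np.linalg.cond(T))                       # finite: T is invertible (Lemma 4.10)
blocks = [T[2*i:2*i+2, 2*i:2*i+2] for i in range(N)]
print(np.allclose(T, blkdiag(blocks)))         # block diagonal (Corollary 4.12)
```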
The invertibility of T^w and T^v is characterized in the following lemma.

Lemma 4.10. (Invertibility of T) Let θ_{U,M}(·) be any PLQ penalty on R^k, with U = {u | A^T u ≤ a}. Let T_θ := M + A D A^T, where D is any diagonal matrix of conformal dimension with positive entries on the diagonal. Then T_θ is invertible if and only if Null(M) ∩ U^∞ = {0}, equivalently dom(θ_{U,M}) = R^k.

Proof: Note that

Null(M + A D A^T) = {w | w^T M w + w^T A D A^T w = 0}
                 = {w | w ∈ Null(M), w ∈ Null(A^T)}
                 = Null(M) ∩ Null(A^T).

The first claim now follows from the fact that U^∞ = Null(A^T). Recall that the effective domain of θ is (Null(M) ∩ U^∞)°, and it is immediate from the definition of the polar that {0}° = R^k. □

Remark 4.11. (Block diagonal structure of T in the independent case) Suppose y = (y_1, ..., y_N) is a random vector, where each y_i is itself a random vector in R^{m_i} drawn from some PLQ density p(y_i) ∝ exp[-θ_{U_i,M_i}(·)], and the y_i are independent. Let U_i = {u : A_i^T u ≤ a_i}. Then the matrix T_θ is given by T_θ = M + A D A^T, where M = diag[M_1, ..., M_N], A = diag[A_1, ..., A_N], D = diag[D_1, ..., D_N], and the D_i are diagonal with positive entries. Moreover, T_θ is block diagonal, with i-th diagonal block M_i + A_i D_i A_i^T.

Corollary 4.12. (T matrices in the Kalman smoothing context) The matrices T^w and T^v in (4.8) are block diagonal provided that {w_k} and {v_k} are independent vectors from any PLQ densities. Moreover, if these densities all satisfy the characterization in Lemma 4.10, these matrices are also invertible.

We now finish the reduction of F_μ^{(1)} to upper block-triangular form:

row 7 ← row 7 + [G^T Q^{-T/2} (B^w)^T (T^w)^{-1}] · row 5 - [H^T R^{-T/2} (B^v)^T (T^v)^{-1}] · row 6,

which yields

\begin{bmatrix}
I & 0 & 0 & 0 & (A^w)^T & 0 & 0 \\
0 & I & 0 & 0 & 0 & (A^v)^T & 0 \\
0 & 0 & S^w & 0 & -Q^w(A^w)^T & 0 & 0 \\
0 & 0 & 0 & S^v & 0 & -Q^v(A^v)^T & 0 \\
0 & 0 & 0 & 0 & -T^w & 0 & B^w Q^{-1/2} G \\
0 & 0 & 0 & 0 & 0 & -T^v & -B^v R^{-1/2} H \\
0 & 0 & 0 & 0 & 0 & 0 & Φ
\end{bmatrix},

where

Φ = Φ_G + Φ_H = G^T Q^{-T/2} (B^w)^T (T^w)^{-1} B^w Q^{-1/2} G + H^T R^{-T/2} (B^v)^T (T^v)^{-1} B^v R^{-1/2} H.        (4.9)

Note that Φ is symmetric positive definite. Note also that Φ is block tridiagonal, since (1) Φ_H is block diagonal, and (2) Q^{-T/2} (B^w)^T (T^w)^{-1} B^w Q^{-1/2} is block diagonal while G is block bidiagonal, hence Φ_G is block tridiagonal.

Solving system (4.6) requires inverting the block-diagonal matrices T^v and T^w at each iteration of the damped Newton method, as well as solving an equation of the form Φ Δd = ϱ. We have already seen that Φ is block tridiagonal, symmetric, and positive definite, so Φ Δd = ϱ can be solved in O(Nn³) time using the block tridiagonal algorithm in [Bell, 2000]. The remaining four back-substitutions required to solve (4.6) can each be done in O(nN) time.
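A generic block-tridiagonal solve of Φ Δd = ϱ can be sketched as follows (a standard block forward-elimination/back-substitution under our own naming; this is not the implementation of [Bell, 2000], but it has the same O(Nn³) complexity for a symmetric positive definite block-tridiagonal system).

```python
import numpy as np

def block_tridiag_solve(diag_blocks, lower_blocks, rhs):
    """Solve Phi x = rhs, where Phi is symmetric block tridiagonal with n x n
    diagonal blocks diag_blocks[k], sub-diagonal blocks lower_blocks[k] at
    position (k+1, k), and super-diagonal blocks lower_blocks[k].T.
    Block forward elimination / back substitution: O(N n^3)."""
    N = len(diag_blocks)
    D = [None] * N    # eliminated diagonal blocks
    r = [None] * N    # eliminated right-hand sides
    D[0], r[0] = diag_blocks[0], rhs[0]
    for k in range(1, N):
        W = np.linalg.solve(D[k - 1], lower_blocks[k - 1].T)   # D_{k-1}^{-1} L_{k-1}^T
        D[k] = diag_blocks[k] - lower_blocks[k - 1] @ W
        r[k] = rhs[k] - lower_blocks[k - 1] @ np.linalg.solve(D[k - 1], r[k - 1])
    x = [None] * N
    x[N - 1] = np.linalg.solve(D[N - 1], r[N - 1])
    for k in range(N - 2, -1, -1):
        x[k] = np.linalg.solve(D[k], r[k] - lower_blocks[k].T @ x[k + 1])
    return np.concatenate(x)

# Illustrative symmetric positive definite block tridiagonal system (values are ours).
rng = np.random.default_rng(3)
N, n = 6, 3
lower = [0.1 * rng.normal(size=(n, n)) for _ in range(N - 1)]
diag = [4.0 * np.eye(n) + 0.1 * rng.normal(size=(n, n)) for _ in range(N)]
diag = [0.5 * (Dk + Dk.T) for Dk in diag]          # symmetrize the diagonal blocks
rhs = [rng.normal(size=n) for _ in range(N)]

# Dense reference solve for the check.
Phi = np.zeros((N * n, N * n))
for k in range(N):
    Phi[k*n:(k+1)*n, k*n:(k+1)*n] = diag[k]
for k in range(N - 1):
    Phi[(k+1)*n:(k+2)*n, k*n:(k+1)*n] = lower[k]
    Phi[k*n:(k+1)*n, (k+1)*n:(k+2)*n] = lower[k].T
print(np.allclose(block_tridiag_solve(diag, lower, rhs),
                  np.linalg.solve(Phi, np.concatenate(rhs))))
```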
