Learning Hidden Markov Models using Non-Negative Matrix Factorization
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, SEPTEMBER 2008

George Cybenko, Fellow, IEEE, and Valentino Crespi, Member, IEEE

G. Cybenko is with the Thayer School of Engineering, Dartmouth College, Hanover, NH 03755 USA, e-mail: gvc@dartmouth.edu. V. Crespi is with the Department of Computer Science, California State University at Los Angeles, LA, 90032 USA, e-mail: vcrespi@calstatela.edu. Manuscript submitted September 2008.

Abstract—The Baum-Welch algorithm together with its derivatives and variations has been the main technique for learning Hidden Markov Models (HMM) from observational data. We present an HMM learning algorithm based on the non-negative matrix factorization (NMF) of higher order Markovian statistics that is structurally different from Baum-Welch and its associated approaches. The described algorithm supports estimation of the number of recurrent states of an HMM and iterates the non-negative matrix factorization (NMF) algorithm to improve the learned HMM parameters. Numerical examples are provided as well.

Index Terms—Hidden Markov Models, machine learning, non-negative matrix factorization.

I. INTRODUCTION

Hidden Markov Models (HMM) have been successfully used to model stochastic systems arising in a variety of applications ranging from biology to engineering to finance [1], [2], [3], [4], [5], [6]. Following accepted notation for representing the parameters and structure of HMMs (see [7], [8], [9], [1], [10] for example), we will use the following terminology and definitions:

1) $N$ is the number of states of the Markov chain underlying the HMM. The state space is $S = \{S_1, \ldots, S_N\}$ and the system's state process at time $t$ is denoted by $x_t$;
2) $M$ is the number of distinct observables or symbols generated by the HMM. The set of possible observables is $V = \{v_1, \ldots, v_M\}$ and the observation process at time $t$ is denoted by $y_t$. We denote by $y_{t_1}^{t_2}$ the subprocess $y_{t_1} y_{t_1+1} \ldots y_{t_2}$;
3) The joint probabilities $a_{ij}(k) = P(x_{t+1} = S_j, y_{t+1} = v_k \mid x_t = S_i)$ are the time-invariant probabilities of transitioning to state $S_j$ at time $t+1$ and emitting observation $v_k$ given that at time $t$ the system was in state $S_i$. Observation $v_k$ is emitted during the transition from state $S_i$ to state $S_j$. We use $A(k) = (a_{ij}(k))$ to denote the matrix of state transition probabilities that emit the same symbol $v_k$. Note that $A = \sum_k A(k)$ is the stochastic matrix representing the Markov chain state process $x_t$;
4) The initial state distribution, at time $t = 1$, is given by $\Gamma = \{\gamma_1, \ldots, \gamma_N\}$, where $\gamma_i = P(x_1 = S_i) \ge 0$ and $\sum_i \gamma_i = 1$.

Collectively, the matrices $A(k)$ and $\Gamma$ completely define the HMM, and we say that a model for the HMM is $\lambda = (\{A(k) \mid 1 \le k \le M\}, \Gamma)$.

We present an algorithm for learning an HMM from single or multiple observation sequences. The traditional approach for learning an HMM is the Baum-Welch algorithm [1], which has been extended in a variety of ways by others [11], [12], [13]. Recently, a novel and promising approach to the HMM approximation problem was proposed by Finesso et al. [14]. That approach is based on Anderson's HMM stochastic realization technique [15], which demonstrates that a positive factorization of a certain Hankel matrix (consisting of observation string probabilities) can be used to recover the hidden Markov model's probability matrices.
Finesso and his coauthors used recently developed non-negative matrix factorization (NMF) algorithms [16] to express those stochastic realization techniques as an operational algorithm. Earlier ideas in that vein were anticipated by Upper in 1997 [17], although that work did not benefit from HMM stochastic realization techniques or NMF algorithms, both of which were developed after 1997.

Methods based on stochastic realization techniques, including the one presented here, are fundamentally different from Baum-Welch based methods in that the algorithms use as input observation sequence probabilities as opposed to raw observation sequences. Anderson's and Finesso's approaches use system realization methods, while our algorithm is in the spirit of the Myhill-Nerode construction [18] for building automata from languages. In the Myhill-Nerode construction, states are defined as equivalence classes of pasts which produce the same futures. In an HMM, the "future" of a state is a probability distribution over future observations. Following this intuition, we derive our result in a manner that appears comparatively more concise and elementary than the aforementioned approaches of Anderson and Finesso.

At a conceptual level, our algorithm operates as follows. We first estimate the matrix of an observation sequence's high order statistics. This matrix has a natural non-negative matrix factorization (NMF) [16] which can be interpreted in terms of the probability distribution of future observations given the current state of the underlying Markov chain. Once estimated, these probability distributions can be used to directly estimate the transition probabilities of the HMM. The estimated HMM parameters can be used, in turn, to compute the NMF matrix factors as well as the underlying higher order correlation matrix from data generated by the estimated HMM.
We present a simple example in which an NMF factorization is exact but does not correspond to any HMM. This is a fact that can be established by comparing the factors computed by the NMF with the factors computed from the estimated HMM parameters. This kind of comparison is not possible with other approaches [14].

It is important to point out that the optimal non-negative matrix factorization of a positive matrix is known to be NP-hard in the general case [19], so in practice one computes only locally optimal factorizations. As we will show through examples, the repeated iteration of the factorization and transition probability estimation steps improves the factorizations and the overall model estimation. Details are provided below.

A. Preliminaries and Notation

The only input to our algorithm is an observation sequence of length $T$ of the HMM, namely $O_{1:T} = O_1 O_2 \ldots O_T$, where $O_t \in V$ is the HMM output at observation time $t$. We do not assume that observation time $t = 1$ coincides with the process' initial state, so the initial distribution of states is not necessarily governed by $\Gamma$. In fact, at present, our algorithm is capable of learning only the ergodic partition of an HMM, namely the set of states that are recurrent. Consequently, our model of an HMM refers only to the transition probability component $\lambda = \{A(k)\}_k$ that identifies this ergodic partition (see [20], [21] for some background on this concept).

Given $O_{1:T}$, we construct two summary statistics represented as matrices $R^{p,s}$ and $F^{p,s}$ for positive integers $p$ and $s$. $R^{p,s}$ is simply a histogram of contiguous prefix-suffix combinations whose rows are indexed by observation subsequences of length $p$ and whose columns are indexed by observation subsequences of length $s$.
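As an illustration, the histogram $R^{p,s}$ and its row-stochastic normalization $F^{p,s}$ can be sketched as follows. This is a minimal sketch assuming numpy; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def build_F(obs, p, s, M):
    """Histogram R^{p,s} of contiguous prefix/suffix pairs in obs, plus its
    row-stochastic normalization F^{p,s}.  Rows are indexed lexicographically
    by length-p substrings, columns by length-s substrings; obs is a sequence
    of symbols drawn from {0, ..., M-1}."""
    def index(w):                        # lexicographic string -> integer
        i = 0
        for c in w:
            i = i * M + c
        return i

    R = np.zeros((M ** p, M ** s))
    for t in range(len(obs) - p - s + 1):
        u = index(obs[t:t + p])          # prefix of length p
        v = index(obs[t + p:t + p + s])  # immediately following suffix
        R[u, v] += 1

    g = R.sum(axis=1)                    # row sums; unobserved prefixes stay zero
    F = np.divide(R, g[:, None], out=np.zeros_like(R), where=g[:, None] != 0)
    return R, F

obs = [0, 1, 1, 0, 1, 1, 1, 1, 0]        # toy binary observation sequence
R, F = build_F(obs, p=2, s=3, M=2)
assert R.sum() == len(obs) - p_plus_s + 1 if (p_plus_s := 5) else None
```

Note that, as stated above, the number of nonzero entries is bounded by the number of prefix/suffix positions in the sequence, so a sparse representation costs no more than the data itself.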
If there are $M$ symbols in the observation alphabet, then $R^{p,s}$ is an $M^p \times M^s$ matrix whose $(i,j)$th entry is the number of times the prefix substring corresponding to $i$ is immediately followed by the suffix substring corresponding to $j$. The correspondence between strings and integers is lexicographic in our examples below, although any other correspondence will do as well.

The matrix $F^{p,s}$ is simply $R^{p,s}$ normalized to be row stochastic. Specifically, if $G = (g_i)$ where $g_i = \sum_j R^{p,s}_{i,j}$, then $F^{p,s}_{i,j} = R^{p,s}_{i,j}/g_i$ for $g_i \ne 0$ and $F^{p,s}_{i,j} = 0$ for $g_i = 0$. Rows of $R^{p,s}$, and correspondingly of $F^{p,s}$, are zero if the prefix corresponding to the row label is not observed in the data. Zero rows of these matrices can be deleted, reducing the size of the matrices without affecting the algorithm described below. Accordingly, $F^{p,s}$ is constructed to be row stochastic.

Entry $F^{p,s}_{u,v}$ is essentially an estimate of $P(V \mid U)$, the probability of observing observation sequence $V$ of length $s$, indexed by $v$, following observation sequence $U$ of length $p$, indexed by $u$ (see the work of Marton and Shields [22] for a study of the accuracy of such estimates). Note that while $R$, $F$ and $G$ have exponentially many rows and columns with respect to $p$ and $s$, the actual number of nonzero entries in these matrices is bounded above by $T$, so that, stored as sparse matrices, they require no more storage than the original observation sequence. Note that Baum-Welch methods require storing and repeatedly accessing the original observation sequence.

A simple but key observation about states of an HMM is that each state of an HMM induces a probability distribution on symbol subsequences of any length $s$. Specifically, suppose an HMM, $\lambda$, is in state $S_{i_0}$ (having not yet emitted an observation in that state) and consider the symbol subsequence $V = v_{j_1} v_{j_2} \ldots v_{j_s}$.
Then
$$P(V \mid S_{i_0}, s, \lambda) = P(y_{t+1}^{t+s} = v_{j_1} v_{j_2} \ldots v_{j_s} \mid x_t = S_{i_0})$$
is independent of $t$ under the ergodic assumption and can be computed from the $A(k)$'s according to
$$P(V \mid S_{i_0}, s, \lambda) = e_{S_{i_0}}' \prod_{r=1}^{s} A(j_r)\, \mathbf{1}, \qquad (1)$$
where $e_i$ denotes the $(0,1)$-vector whose only nonzero entry is in position $i$ and $\mathbf{1} = [1\ 1\ \ldots\ 1]'$. Call this probability distribution on substrings of length $s$, $P(\cdot \mid S_i, s, \lambda)$. It is known that the distributions $P(\cdot \mid S_i, s, \lambda)$ for $p + s \ge 2N - 1$ are complete characterizations of the ergodic states of the HMM with respect to the observables of the HMM [23], [14].

We now focus attention on substrings that precede state occupancy in the HMM's underlying Markov chain. Over the course of a long observation sequence such as $O_{1:T}$, there is some probability, $P(S_i \mid U, p, \lambda)$, that the HMM is in state $i$ given that we have just observed the length-$p$ substring $U = v_{j_1} v_{j_2} \ldots v_{j_p}$. These probabilities can be computed from the $A(k)$'s according to
$$P(S_{i_0} \mid U, p, \lambda) = \frac{\pi' \prod_{r=1}^{p} A(j_r)\, e_{S_{i_0}}}{P(U \mid p, \lambda)}, \qquad (2)$$
where $\pi$ is the stationary distribution of the underlying Markov chain process and $P(U \mid p, \lambda) = \pi' \prod_{r=1}^{p} A(j_r)\, \mathbf{1}$. Note that formulas (1) and (2) are closely related to computations arising in the classical Viterbi algorithm [1].

Let $U, V$ be two strings of observations of length $p$ and $s$ respectively. Let $U$ and $V$ be identified with integers $u$ and $v$ as already explained before, so that $P(V \mid U, \lambda) = F^{p,s}_{u,v}$. Assume $V$ was emitted after time $t$ and $U$ immediately preceded $V$. We call $U$ the prefix string and $V$ the suffix string. Then by applying elementary properties of probability we can write:
$$F^{p,s}_{u,v} \sim P(y_{t+1}^{t+s} = V \mid y_{t-p+1}^{t} = U, \lambda) = \sum_{k=1}^{N} P(y_{t+1}^{t+s} = V, x_t = S_k \mid y_{t-p+1}^{t} = U, \lambda) = \sum_{k=1}^{N} P(V \mid S_k, s, \lambda)\, P(S_k \mid U, p, \lambda).$$
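Formulas (1) and (2) can be transcribed quite directly. The sketch below assumes numpy and uses the "Even Process" model of Section III-A as a concrete $\lambda$; the helper names are ours.

```python
import numpy as np

def p_suffix_given_state(A, i, V):
    """Formula (1): P(V | S_i, s, lambda) = e_i' A(v_1) ... A(v_s) 1."""
    row = np.eye(A[0].shape[0])[i]       # e_i'
    for k in V:
        row = row @ A[k]
    return row.sum()                      # multiply by the all-ones vector

def p_state_given_prefix(A, pi, i, U):
    """Formula (2): P(S_i | U, p, lambda), pi the stationary distribution."""
    row = pi.copy()                       # pi'
    for k in U:
        row = row @ A[k]
    return row[i] / row.sum()             # numerator over P(U | p, lambda)

# Even Process matrices from Section III-A
A = [np.array([[0.5, 0.0], [0.0, 0.0]]),
     np.array([[0.0, 0.5], [1.0, 0.0]])]

# stationary distribution: left eigenvector of A = sum_k A(k) for eigenvalue 1
T = A[0] + A[1]
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# each state's suffix distribution sums to one, as a distribution should
totals = [sum(p_suffix_given_state(A, i, [a, b]) for a in (0, 1) for b in (0, 1))
          for i in range(2)]
assert np.allclose(totals, 1.0)
```

For instance, `p_state_given_prefix(A, pi, 0, [0])` evaluates to 1: after observing a 0, this chain must be in state $S_1$.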
Consequently we can express the distribution $F^{p,s}_{u,:} \sim P(\cdot \mid U, \lambda)$ as a mixture
$$F^{p,s}_{u,:} \sim \sum_{k=1}^{N} P(\cdot \mid S_k, s, \lambda)\, P(S_k \mid U, p, \lambda). \qquad (3)$$
If the underlying state process $x_t$ is ergodic, then in the limit as $T \to \infty$ relation (3) becomes an equality almost surely. As a result of the above observations, for sufficiently large $p$ and $s$, the matrix $F^{p,s}$ has the following properties:

• $\mathrm{rank}(F^{p,s}) \le N$, where $N$ is the minimal number of states representing the HMM, $\lambda$;
• Each row of $F^{p,s}$ is a convex combination (mixture) of the $N$ generators, $P(\cdot \mid S_i, s, \lambda)$, for $i = 1, 2, \ldots, N$;
• Letting $D$ be the $N \times M^s$ nonnegative matrix whose rows are the distributions $P(\cdot \mid S_k, s, \lambda)$, i.e., $D_{k,:} = P(\cdot \mid S_k, s, \lambda)$ for $k = 1, 2, \ldots, N$, we can rewrite (3) as
$$F^{p,s}_{u,:} \sim [P(S_1 \mid U, p, \lambda)\ P(S_2 \mid U, p, \lambda)\ \cdots\ P(S_N \mid U, p, \lambda)] * D.$$
Consequently, if we let $C = (c_{u,k})$ be the $M^p \times N$ nonnegative matrix with $c_{u,k} = P(S_k \mid U, p, \lambda)$, we can write $F^{p,s} \sim C * D$. Observe that $C$ and $D$ are both (row) stochastic;
• The factorization depends on the model $\lambda$. Moreover, the factors $C$ and $D$ can be computed directly from $\lambda$ using (1) and (2).

Consequently, the size of the smallest model compatible with the data is equal to $\mathrm{prank}(F^{p,s})$, the positive rank of $F^{p,s}$. (The positive rank, $\mathrm{prank}(A)$, of an $m \times n$ nonnegative matrix $A$ is the smallest integer $N$ such that $A$ factors into the product of two nonnegative matrices of dimensions $m \times N$ and $N \times n$ respectively.) It is known that $\mathrm{rank}(A) \le \mathrm{prank}(A) \le \min\{m, n\}$ and that the computation of $\mathrm{prank}(A)$ is NP-hard [19], [24]. So it would appear that in general it is NP-hard to estimate $N$ given $F^{p,s}$, even in ideal conditions ($T \to \infty$), since $\mathrm{rank}(F^{p,s}) \le N$.
However, it is not obvious how difficult it is to estimate when $\mathrm{prank}(F^{p,s}) < \mathrm{rank}(F^{p,s})$ in the case that $F^{p,s}$ was built from a typical realization of an HMM. In fact, typically
$$\mathrm{rank}([P(V \mid U)]_{U,V}) \le \mathrm{rank}([P(V \mid S)]_{S,V}) \le \mathrm{prank}([P(V \mid U)]_{U,V})$$
but, in "noisy" conditions, we observe $\mathrm{rank}(D) \le \mathrm{prank}(F^{p,s}) < \mathrm{rank}(F^{p,s})$. We discuss an example at the end of this paper that illustrates the open problems and challenges. One way to circumvent the problem of guessing $N$ is to apply statistical methods directly to the observation sequence, without building any intermediate models, as done in [25].

To summarize this discussion, note that the matrix $F^{p,s}$ is based on the distribution of length-$p$ prefixes and corresponding length-$s$ suffixes and completely characterizes an HMM provided $M^p \ge N$ and $s \ge 2N - 1$. Its positive rank is, in ideal conditions, equal to the minimal number of states in the underlying Markov chain. Moreover, an appropriately constructed factorization of $F^{p,s}$ exposes the state transition and emission probabilities of the HMM. It is well known that any two $N$-state HMMs consistent with the same conditional statistics $[P(V \mid S)]_{S \in S,\, V \in V^{2N-1}}$ generate the same finite dimensional distributions and so are, in this sense, equivalent [26]. The algorithm presented below extracts the state transition matrices, $\{A(k)\}_k$, from this factorization. In turn, as shown above, the $A(k)$'s can be used to construct the probability distributions over suffixes that generate $F^{p,s}$ and so can be used to compute a new factorization. This iteration is essentially the basis for our algorithm.

In the machine learning context, we have access only to a finite amount of observation data ($T$ bounded). Consequently, $\mathrm{rank}(F^{p,s})$ will generally be higher than $N$.
This requires a decision about the HMM's order, $N$, not unlike that arising in principal component analysis (PCA) [27] to estimate the number of components.

II. THE ALGORITHM

Based on the above discussion, our algorithm is outlined below. Numerical examples with discussion follow the formal description.

1) Compute $F^{p,s}$ and $G$ from the input observation data, $O_{1:T}$, defined above.
2) Estimate the number of states, $N$, by analyzing either $F^{p,s}$ or $\mathrm{diag}(G) * F^{p,s}$, both computed in Step 1. In the cases in which $\mathrm{prank}(F^{p,s}) = \mathrm{rank}(F^{p,s})$ (e.g., when $\mathrm{rank}(F) \le 2$), one typical way to obtain this estimate is to compute the SVD (singular value decomposition) of the aforementioned matrices and then observe the rate of decrease of the singular values. For $T$ sufficiently large, a significant gap between the $N$th and the $(N+1)$th largest singular values becomes appreciable. Note that since $\mathrm{prank} \ge \mathrm{rank}$, an estimate based on the singular values is a lower bound for the order of the HMM.
3) Estimate the distributions $P(\cdot \mid S_i, s, \lambda)$, for $i = 1, 2, \ldots, N$. This step is achieved through the Nonnegative Matrix Factorization (NMF) of $F^{p,s}$. This yields $F^{p,s} \approx C * D$ with $D_{i,:} \approx P(\cdot \mid S_i, s, \lambda)$, as observed before. Note that because of the finiteness of $T$, in general $\mathrm{prank}(F^{p,s}) > N$. So it is necessary to solve the approximate NMF, which consists of determining $C$ and $D$ of dimensions $M^p \times N$ and $N \times M^s$ respectively that minimize $D_{ID}(F^{p,s} \| C * D)$, where
$$D_{ID}(K \| W) = \sum_{ij} \left( K_{ij} \log \frac{K_{ij}}{W_{ij}} - K_{ij} + W_{ij} \right)$$
is the I-divergence function [7] (observe that if $\mathbf{1}' K \mathbf{1} = \mathbf{1}' W \mathbf{1} = 1$, then $D_{ID}(K \| W) = \sum_{i,j} K_{i,j} \log K_{i,j}/W_{i,j}$, so the I-divergence function is a generalization of the Kullback-Leibler distance between probability distributions).
This optimization problem can be solved through iterative methods [16], [28] that require initial matrices $C_0$, $D_0$ and can only be guaranteed to converge to local optima. After executing this step, we have a locally optimal estimate of the true distributions $P(\cdot \mid S_i, s, \lambda)$.
4) Estimate the matrices $A(k)$, $k = 1, 2, \ldots, M$, from $D$. Let us consider $A(1) = (a_{i,j}(1))$; the other matrices are estimated in a similar manner. Let $V(s-1) = v_{j_1} v_{j_2} \cdots v_{j_{s-1}}$ be a generic sequence of $s - 1$ observations. Then by marginalization we can write
$$P(V(s-1) \mid S_i, s-1, \lambda) = \sum_{k=1}^{M} P(V(s-1) v_k \mid S_i, s, \lambda).$$
Consequently, the conditional distributions over suffixes of length $s - 1$, $P(\cdot \mid S_i, s-1, \lambda)$, can be estimated from $D$ by adding columns of $D$ appropriately. Let $H$ be the matrix thus obtained from $D$, so that $H_{i,:} \approx P(\cdot \mid S_i, s-1, \lambda)$. Those conditional distributions satisfy the following equality for any $V(s-1)$:
$$P(v_1 V(s-1) \mid S_i, s, \lambda) = \sum_{j=1}^{N} a_{i,j}(1)\, P(V(s-1) \mid S_j, s-1, \lambda).$$
Therefore $P(v_1 \cdot \mid S_i, s, \lambda) = \sum_{j=1}^{N} a_{i,j}(1)\, P(\cdot \mid S_j, s-1, \lambda)$, so we can obtain the unknown values $a_{i,j}(1)$ by solving the following systems of linear equations:
$$D_{i,\,1:M^{s-1}} = A_{i,:}(1) * H, \qquad i = 1, 2, \ldots, N,$$
where $A_{i,:}(1) = [a_{i,1}(1)\ a_{i,2}(1)\ \cdots\ a_{i,N}(1)]$. Compactly, $D_{:,\,1:M^{s-1}} = A(1) * H$. As in Step 2, because of the finiteness of $T$ and working with bounded arithmetic precision, we need to content ourselves with a solution that minimizes some distance (for example, the $L_1$ norm) between $D_{i,\,1:M^{s-1}}$ and $A_{i,:}(1) * H$, for all $i$. We have formulated these problems as linear programming problems using the $L_1$ norm.
5) Output the estimated HMM $\lambda' = \{A(k) \mid k = 1, 2, \ldots, M\}$.
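The approximate NMF of Step 3 can be sketched with the multiplicative updates of Lee and Seung [16], which monotonically decrease the I-divergence. This is a minimal sketch under our own naming, assuming numpy, not the authors' implementation; the convergence tolerance and iteration count below are illustrative choices.

```python
import numpy as np

def nmf_idiv(F, N, iters=2000, eps=1e-12, seed=None):
    """Approximate NMF F ~= C @ D of inner dimension N, minimizing the
    I-divergence D_ID(F || C D) via Lee-Seung multiplicative updates.
    Only a local optimum is guaranteed; results depend on C0, D0."""
    rng = np.random.default_rng(seed)
    m, n = F.shape
    C = rng.random((m, N)) + eps          # random positive initial factors
    D = rng.random((N, n)) + eps
    for _ in range(iters):
        W = C @ D + eps
        C *= (F / W) @ D.T / D.sum(axis=1)              # update C, D fixed
        W = C @ D + eps
        D *= C.T @ (F / W) / C.sum(axis=0)[:, None]     # update D, C fixed
    return C, D

# small test matrix of exact nonnegative rank 2: row 3 is a mixture of rows 1, 2
F = np.array([[0.20, 0.300, 0.500],
              [0.50, 0.250, 0.250],
              [0.35, 0.275, 0.375]])
C, D = nmf_idiv(F, N=2, seed=0)
assert np.abs(F - C @ D).max() < 0.05     # near-exact local optimum expected
```

In the full algorithm the factors would then be rescaled so that $C$ and $D$ are row stochastic before Step 4 extracts the $A(k)$'s.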
This algorithm can be iterated using the estimated $\lambda'$ and formula (1) to compute new matrices $C'_0$ and $D'_0$, and then restarting from Step 3 above with matrices $C'_0$ and $D'_0$ as initial factors in the approximate NMF. In particular:

6) Compute $D'_0 = [P(j \mid S_i, s, \lambda')]_{i,j}$ using formula (1).
7) Compute $C'_0$ by solving the linear programming problem $F^{p,s} = C'_0 * D'_0$, for a row stochastic $C'_0$.
8) Set $C_0 := C'_0$ and $D_0 := D'_0$.
9) Go to 3).

Another possibility for Step 7) above is to compute $C'_0$ using formulae (2) and (3) and then use the resulting $C'_0$ and $D'_0$ as initial guesses for the NMF algorithm. We have tried this variant but it does not produce significantly different final results.

III. NUMERICAL EXAMPLES

We call an HMM "deterministic" (DHMM) if for each state there exists at most one outgoing transition labeled with the same observable. We demonstrate our method on a DHMM, on an HMM that can be transformed into an equivalent DHMM, and also on an HMM for which such a transformation does not exist. We finally discuss an example that illustrates the situation when $\mathrm{rank} < \mathrm{prank}$.

It is important to note that the significant metric for learning an HMM is not the extent to which the transition probabilities are accurately learned but the extent to which the observation statistics are learned. This is a consequence of the fact that HMMs with different transition probabilities and different numbers of states can produce observation sequences with the same statistics, so that learning a specific transition probability characterization is not a well-posed problem unless additional constraints on the learning problem are imposed [29].
In our examples we measure the accuracy of our estimates by computing the I-divergence rate of the finite dimensional distributions associated with the observation process of the original model from those associated with the observation process of the estimated model. Formally, each HMM $\lambda$ induces a family of finite dimensional distributions
$$P_n(y_1^n) = \sum_{i=1}^{N} \pi_i\, P(y_1^n \mid x_1 = S_i, \lambda)$$
on sequences of observations of length $n$, where $\pi$ is the stationary distribution of the underlying state process. Let $\lambda$ and $\lambda'$ be two HMMs with $P_n$ and $Q_n$ their respective induced finite dimensional distributions. The I-divergence rate of $\lambda$ from $\lambda'$ is defined as
$$D_{ID} = \lim_{n \to \infty} \frac{1}{n} D_{ID}(P_n \| Q_n)$$
when the limit exists [14].

A. A DHMM Example

Consider the stochastic process described by the model $\lambda_1 = (\{A(0), A(1)\}, \Gamma = [0\ 1])$ with
$$A(0) = \begin{bmatrix} 0.5 & 0 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad A(1) = \begin{bmatrix} 0 & 0.5 \\ 1 & 0 \end{bmatrix}.$$
This is sometimes referred to as the "Even Process" [30], [31]. We simulated this process and produced a sequence of $T = 1000$ observations. Then we ran our algorithm with $p = 2$ and $s = 3$:

1) Build $F^{2,3}$ from the data $O$:
$$F = \begin{bmatrix} 0.14 & 0.13 & 0 & 0.26 & 0 & 0 & 0.22 & 0.26 \\ 0 & 0 & 0 & 0 & 0.25 & 0.24 & 0 & 0.5 \\ 0.13 & 0.14 & 0 & 0.23 & 0 & 0 & 0.26 & 0.24 \\ 0.08 & 0.07 & 0 & 0.17 & 0.08 & 0.08 & 0.17 & 0.33 \end{bmatrix}$$
2) Estimate $N = \mathrm{prank}(F)$. Analyze the singular values of $F$: $0.88,\ 0.48,\ 0.033,\ 0.011$. This suggests $\mathrm{rank}(F) = \mathrm{prank}(F) = 2$.
3) Estimate the distributions $P(\cdot \mid S_1, 3, \lambda_1)$ and $P(\cdot \mid S_2, 3, \lambda_1)$ by solving $\arg\min_{C,D} D_{ID}(F \| C * D)$:
$$C = \begin{bmatrix} 0.02 & 0.98 \\ 1 & 0 \\ 0 & 1 \\ 0.34 & 0.66 \end{bmatrix}, \quad D = \begin{bmatrix} 0 & 0 & 0 & 0 & 0.25 & 0.24 & 0 & 0.5 \\ 0.13 & 0.13 & 0 & 0.25 & 0 & 0 & 0.25 & 0.24 \end{bmatrix}$$
4) Estimate the matrices $A(0)$ and $A(1)$:
$$\tilde{A}(0) = \begin{bmatrix} 2.2\mathrm{e}{-18} & 0 \\ 6.9\mathrm{e}{-18} & 0.51 \end{bmatrix}, \quad \tilde{A}(1) = \begin{bmatrix} 0.0077 & 0.99 \\ 0.49 & 0 \end{bmatrix}.$$
After a second iteration of the algorithm the reconstructed matrices become:
$$\hat{A}(0) = \begin{bmatrix} 0 & 0 \\ 5.6\mathrm{e}{-17} & 0.51 \end{bmatrix}, \quad \hat{A}(1) = \begin{bmatrix} 0.0077 & 0.99 \\ 0.49 & 0 \end{bmatrix}.$$
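The simulation step behind this example can be sketched as follows, assuming numpy; the sampler and its names are ours, not the authors' code. From state $i$, a (symbol, next-state) pair is drawn with probability $a_{ij}(k)$, which is exactly the joint-probability semantics of the $A(k)$ matrices.

```python
import re
import numpy as np

def simulate_hmm(A, x0, T, seed=None):
    """Draw T observations from the model {A(k)}: from state i, the pair
    (symbol k, next state j) is chosen with probability a_ij(k)."""
    rng = np.random.default_rng(seed)
    N, M = A[0].shape[0], len(A)
    # per current state, flatten the M*N (symbol, next-state) probabilities
    probs = np.stack([np.concatenate([Ak[i] for Ak in A]) for i in range(N)])
    x, out = x0, []
    for _ in range(T):
        z = rng.choice(M * N, p=probs[x])
        k, x = divmod(z, N)              # recover emitted symbol and next state
        out.append(k)
    return out

# Even Process of this example; Gamma = [0 1] means the chain starts in S_2
A = [np.array([[0.5, 0.0], [0.0, 0.0]]),
     np.array([[0.0, 0.5], [1.0, 0.0]])]
obs = simulate_hmm(A, x0=1, T=1000, seed=0)

# sanity check: in the Even Process, every run of 1s flanked by 0s has even length
s = "".join(map(str, obs))
assert all(len(r) % 2 == 0 for r in re.findall(r"0(1+)(?=0)", s))
```

Feeding such a sequence to the prefix/suffix histogram construction of Section I-A with $p = 2$, $s = 3$ reproduces the kind of $F^{2,3}$ matrix shown above, up to sampling noise.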
The reconstructed model is essentially identical to the original one except for state reordering. This result is competitive with existing techniques specific to the machine learning of DHMMs. For example, Shalizi et al. [32], [30] demonstrated their Causal-State Splitting Reconstruction (CSSR) ε-machine reconstruction algorithm on the same Even Process, obtaining comparably accurate models.

B. An HMM that has an equivalent DHMM

Consider the model $\lambda_2 = (\{A(0), A(1)\}, \Gamma = [0\ 1])$ with
$$A(0) = \begin{bmatrix} 0.67 & 0.33 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad A(1) = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}.$$
We simulated this process and produced a sequence of $T = 10000$ observations. Then we ran our algorithm with $p = 2$ and $s = 3$:

1) Build $F^{2,3}$ from the data $O$:
$$F = \begin{bmatrix} 0.31 & 0.14 & 0.22 & 0 & 0.22 & 0.11 & 0 & 0 \\ 0.44 & 0.23 & 0.33 & 0 & 0.00 & 0 & 0 & 0 \\ 0.29 & 0.15 & 0.23 & 0 & 0.22 & 0.11 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$
2) Estimate $N$. Analyze the singular values of $F$: $0.86,\ 0.24,\ 0.02,\ 0$. This again suggests $N = 2$.
3) Estimate the distributions $P(\cdot \mid S_1, 3, \lambda_2)$ and $P(\cdot \mid S_2, 3, \lambda_2)$. Solve $\arg\min_{C,D} D_{ID}(F \| C * D)$:
$$C = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad D = \begin{bmatrix} 0.44 & 0.23 & 0.33 & 0 & 0 & 0 & 0 & 0 \\ 0.30 & 0.15 & 0.22 & 0 & 0.22 & 0.11 & 0 & 0 \end{bmatrix}$$
4) Reconstruct the matrices $\tilde{\lambda}_2 = \{\tilde{A}(0), \tilde{A}(1)\}$:
$$\tilde{A}(0) = \begin{bmatrix} 0.0033 & 0.9967 \\ 0 & 0.6691 \end{bmatrix}, \quad \tilde{A}(1) = \begin{bmatrix} 0 & 0 \\ 0.3309 & 0 \end{bmatrix}.$$
After a second iteration of the algorithm the reconstructed model becomes $\hat{\lambda}_2 = \{\hat{A}(0), \hat{A}(1)\}$:
$$\hat{A}(0) = \begin{bmatrix} 0.0039 & 0.996 \\ 0 & 0.6689 \end{bmatrix}, \quad \hat{A}(1) = \begin{bmatrix} 0 & 0 \\ 0.3311 & 0 \end{bmatrix}.$$
These computed transition probabilities are noticeably different from the transition probabilities of the original HMM used to generate the data, but the statistics of the observation sequences are very close. Figure 1 shows the accuracy of these estimates in terms of the I-divergence rate of the original model from the estimated ones. We computed $D_{ID}(P_n \| Q_n)/n$ for $n = 1, 2, \ldots, 15$, with $P_n$ being the finite dimensional probability distributions over sequences of observations of length $n$ emitted by model $\lambda_2$ and $Q_n$ those emitted by the estimates $\tilde{\lambda}_2$ and $\hat{\lambda}_2$, in stationary conditions (the dotted curve refers to $\tilde{\lambda}_2$). We can observe that this quantity, the divergence rate of $P_n$ from $Q_n$, stabilizes to a very small value (smaller than $2.5 \cdot 10^{-5}$) as expected. In fact, this example is equivalent to a DHMM model, as the reader can readily check independently.

C. An HMM that has no equivalent finite state DHMM

Consider the model $\lambda_3 = (\{A(0), A(1)\}, \Gamma = [1\ 0\ 0])$ with
$$A(0) = \begin{bmatrix} 0.5 & 0.5 & 0 \\ 0 & 0.5 & 0 \\ 0.5 & 0.5 & 0 \end{bmatrix} \quad \text{and} \quad A(1) = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0.5 \\ 0 & 0 & 0 \end{bmatrix}.$$
We simulated this process and produced a sequence of $T = 10000$ observations. Then we ran our algorithm with $p = 4$ and $s = 5$. After the first iteration we obtain $\tilde{\lambda}_3$:
$$\tilde{A}(0) = \begin{bmatrix} 0 & 0.2 & 0.8 \\ 0 & 0.35 & 0 \\ 0 & 0.4 & 0.6 \end{bmatrix}, \quad \tilde{A}(1) = \begin{bmatrix} 0 & 0 & 0 \\ 0.1 & 0 & 0.56 \\ 0 & 0 & 0 \end{bmatrix}.$$
After the second iteration we obtain $\hat{\lambda}_3$:
$$\hat{A}(0) = \begin{bmatrix} 0 & 0.2 & 0.8 \\ 0 & 0.36 & 0 \\ 0 & 0.41 & 0.59 \end{bmatrix}, \quad \hat{A}(1) = \begin{bmatrix} 0 & 0 & 0 \\ 0.09 & 0 & 0.56 \\ 0 & 0 & 0 \end{bmatrix}.$$
As before, Figure 1 (bottom) shows the accuracy of these estimates in terms of the I-divergence rate of the original model from the estimated ones. Observe that this HMM cannot be transformed into an equivalent deterministic HMM [33].

D. Discussion of Rank vs Prank

We first provide an example of a stochastic matrix whose prank differs from its rank but that does not represent the statistics of any HMM. Let
$$F = \frac{1}{16} \cdot \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \otimes [1\ 1\ 1\ 1\ 1\ 1\ 1\ 1],$$
where $\otimes$ is the Kronecker product. We can verify that $\mathrm{rank}(F) = 3$ whereas $\mathrm{prank}(F) = 4$ [28]. Moreover, $F = CD$ exactly, with
$$C = \begin{bmatrix} 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 \\ 0 & 0.5 & 0 & 0.5 \end{bmatrix} \quad \text{and} \quad D = \frac{1}{8} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \otimes [1\ 1\ 1\ 1\ 1\ 1\ 1\ 1].$$
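The rank and factorization claims for this matrix are easy to check numerically; the following is a sketch assuming numpy, with names of our own choosing:

```python
import numpy as np

# the 4x4 pattern whose Kronecker product with the all-ones row gives F
B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])
F = np.kron(B, np.ones((1, 8))) / 16.0

C = np.array([[0.5, 0.0, 0.5, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.5, 0.0, 0.5]])
P = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])
D = np.kron(P, np.ones((1, 8))) / 8.0

assert np.linalg.matrix_rank(F) == 3      # ordinary rank is 3 ...
assert np.allclose(F, C @ D)              # ... yet C D = F exactly, inner dim 4
assert np.allclose(C.sum(axis=1), 1.0)    # both factors are row stochastic
assert np.allclose(D.sum(axis=1), 1.0)
```

The gap between the numerical rank (3) and the smallest exact nonnegative factorization (inner dimension 4) is precisely the rank-versus-prank distinction discussed here; confirming prank = 4 itself requires the combinatorial argument of [28], not a rank computation.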
Assume that $F$ was obtained from a typical sequence of observations emitted by an HMM $\lambda$ with 4 states, so that $F^{2,5} = F = CD$. Then it must be that $C = [P(S_j \mid i, 2, \lambda)]_{i,j}$ and $D = [P(k \mid S_j, 5, \lambda)]_{j,k}$. Consider the following model $\lambda = \{A(0), A(1)\}$ with
$$A(0) = \begin{bmatrix} 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad A(1) = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0.5 \end{bmatrix}.$$

Fig. 1. Accuracy of HMMs $\tilde{\lambda}_2$, $\hat{\lambda}_2$ (top) and $\tilde{\lambda}_3$, $\hat{\lambda}_3$ (bottom). Here $P_n$ is the distribution over sequences of observations of fixed length $n$ induced by the original model, whereas $Q_n$ refers to the estimated models. The sequence $D_{ID}(P_n \| Q_n)/n$ is calculated for increasing values of $n$. The dotted curves refer to $\tilde{\lambda}_2$ (top) and $\tilde{\lambda}_3$ (bottom).

One can verify that $\lambda$ is the only four-state model such that $D = [P(k \mid S_j, 5, \lambda)]_{j,k}$. In fact, observe that the system of equations defining $\lambda$ in Stage 4 of the algorithm admits, in this case, only one solution. Nevertheless, using formula (2):
$$[P(S_j \mid i, 2, \lambda)]_{i,j} = \frac{1}{4} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \otimes [1\ 1\ 1\ 1] \ne C.$$
Consequently no HMM can generate $F$.

1) An example of prank > rank for an exact HMM model: The following four-state model $\lambda$ is an example of an HMM whose induced $F^{2,5}$ matrix has rank 3 but positive rank $\mathrm{prank} = 4$:
$$A(0) = \begin{bmatrix} 0.5 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \quad A(1) = \begin{bmatrix} 0 & 0.5 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}.$$
To verify the claim, we computed the factors $C = [P(S_j \mid i, \lambda)]_{i,j}$ and $D = [P(j \mid S_i, 5, \lambda)]_{i,j}$, for $N = 4$, using formulae (1) and (2), and then obtained $F^{2,5} = C * D$. Then we verified numerically that $\mathrm{rank}(F^{2,5}) = 3$.
Finally, we applied Lemma 2.4 in [28] to confirm that $\mathrm{prank}(F^{2,5}) = 4$. We also verified the character of this model by directly applying our algorithm to it in order to obtain $F^{2,5}$ empirically (for $T = 10000$). An analysis of the singular values of $F^{2,5}$, namely $[0.8530\ 0.4825\ 0.1799\ 0.0114]$, demonstrates the difficulty of this case. The fourth singular value is nonzero due to the finiteness of $T$. Consequently, it is difficult to determine whether $N = 3$ or $N = 4$.

IV. OPEN QUESTIONS AND FUTURE WORK

A crucial issue is the estimation of $N$, the size of the smallest HMM that generates the stream of data. Under ideal conditions ($T \to \infty$), we have seen that $N = \mathrm{prank}(F^{p,s})$. However, filtering out "noise" from the empirical matrix $F^{p,s}$ in order to obtain an accurate estimate of the positive rank is an open challenge. Observe that a spectral analysis of $F$ may, in general, produce only a lower bound on $N$.

A second important issue in our methodology concerns the computation of the approximate NMF. Existing methods are suboptimal due to the presence of local optima. This problem affects the accuracy of the produced estimate at each iteration of our algorithm. Consequently, it is important to investigate convergence properties when Stages 3-5 of the algorithm are iterated with new initial factors $C'_0$, $D'_0$ to seed the approximate NMF, using $C'_0$ and $D'_0$ as computed according to Steps 6-8 from the model $\lambda'$ estimated in the preceding step.

A third question concerns the properties of $F^{p,s}$ as $s \to \infty$. In other words, can the Asymptotic Equipartition Property be applied to the distributions $P(\cdot \mid S_i, s, \lambda)$, so that the distribution on the "typical" finite suffixes is uniform and the rest of the distribution is zero?

V. CONCLUSION

We have presented a new algorithm for learning an HMM from observations of the HMM's output.
The algorithm is structurally different from tr aditional Baum-W elch based ap- proach es [1], [11], [1 2], [13]. It is related to but different from recent app roaches in stochastic systems realiza tion [1 4]. W e believe this method opens a new lin e of alg orithm de velopm ent for lear ning HMM’ s and has the advantage o f a estimating the HMM o rder fr om spectral p roperties of the high o rder correlation statistics of the observation sequence. The algo- rithm effecti vely co mpresses data by summarizing it into a statistical matrix. Options for recursively compu ting th e steps of the algo rithm to ach iev e o n-line algorithms will be explored. Additionally , sparse matrix a lgorithms can be explor ed fo r space and time efficienc y when the underlyin g matric es are large and sparse. A C K N O W L E D G M E N T S This work results f rom resear ch prog rams at th e Thayer School of Engineerin g, Dartm outh College, an d at the Dep art- ment of Computer Scien ce a t the California State Un iv ersity , SUBMITTED TO IEEE TRANSA CTIONS ON INF ORMA TION THEOR Y , SE PTEMBER 2008 7 Los Angeles, suppor ted by the AFOSR Grant F A9550 -07-1 - 0421 an d by the NSF Gr ant HRD-0 93242 1. Th e Dartmou th effort was also supported by U.S. Department of Homelan d Security Gr ant A ward Num ber 200 6-CS-001 - 000001 and D ARP A Contract HR00 1-06- 1-003 3. The author s also sincerely thank Dr . Robert Savell fo r pointing o ut impo rtant related work to us. R E F E R E N C E S [1] L. R. Rabiner , “ A Tutorial on Hidden Mark ov Models and Selecte d Applica tions in Speech Recogniti on, ” Pr oceedi ng of the IEEE , vol . 77, no. 2, pp. 257–286, 1989. [2] D. X. Song, D. W agner , and X. Tian, “T iming Analysis of Ke ystrokes and Timing Attacks on SSH, ” in USENIX Security Symposium , 2001. [3] A. Apostolic o and G. Bejerano, “Optimal Am nesic Probabi listic Au- tomata or Ho w to Learn and Classify Proteins in Linear Time and Space, ” J ournal of Computati onal Biol ogy , vol . 
7, no. 3/4, pp. 381–393, 2000.
[4] D. Ron, Y. Singer, and N. Tishby, "Learning Probabilistic Automata with Variable Memory Length," in Proceedings of the Workshop on Computational Learning Theory, 1994.
[5] F. Perronnin, J.-L. Dugelay, and K. Rose, "A Probabilistic Model of Face Mapping with Local Transformations and its Application to Person Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1157–1171, July 2005.
[6] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov Models in Computational Biology: Applications to Protein Modeling," UCSC, Tech. Rep. UCSC-CRL-93-32, 1993.
[7] L. Finesso and P. Spreij, "Nonnegative Matrix Factorization and I-divergence Alternating Minimization," Linear Algebra and its Applications, vol. 416, pp. 270–287, 2006.
[8] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. Carrasco, "Probabilistic Finite-State Machines, Part I," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1013–1025, July 2005.
[9] ——, "Probabilistic Finite-State Machines, Part II," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 7, pp. 1026–1039, July 2005.
[10] Y. Ephraim, "Hidden Markov Processes," IEEE Transactions on Information Theory, vol. 48, no. 6, pp. 1518–1569, 2002.
[11] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "An HDP-HMM for Systems with State Persistence," in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008.
[12] D. Blei and M. Jordan, "Variational Methods for the Dirichlet Process," in Proceedings of the 21st International Conference on Machine Learning, 2004.
[13] Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet Processes," Technical Report 653, Department of Statistics, UC Berkeley, 2003; also in Advances in Neural Information Processing Systems, Cambridge (MA), vol. 17, 2005.
[14] L. Finesso, A. Grassi, and P.
Spreij, "Approximation of Stationary Processes by Hidden Markov Models," Math. Control Signals Systems, vol. 22, pp. 1–22, 2010.
[15] B. Anderson, "The Realization Problem for Hidden Markov Models," Math. Control Signals Systems, vol. 12, pp. 80–120, 1999.
[16] D. D. Lee and H. S. Seung, "Algorithms for Non-negative Matrix Factorization," in NIPS, 2000, pp. 556–562.
[17] D. R. Upper, Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models. Ph.D. Thesis, University of California, Berkeley, 1997. [Online]. Available: http://www.santafe.edu/cmg/compmech/pubs/TAHMMGHMM.htm
[18] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and Computation. Addison-Wesley Publishing, Reading, Massachusetts, 1979.
[19] S. Vavasis, "On the Complexity of Nonnegative Matrix Factorization," arXiv:0708.4149v2, February 2008.
[20] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences. SIAM, 1994.
[21] P. Billingsley, Probability and Measure, 3rd Edition. Wiley-Interscience, 1995.
[22] K. Marton and P. C. Shields, "Entropy and the Consistent Estimation of Joint Distributions," The Annals of Probability, vol. 22, pp. 960–977, 1994.
[23] J. Carlyle, Stochastic Finite-State System Theory, ser. Systems Theory. McGraw Hill, New York, 1969, ch. 10.
[24] J. van den Hof and J. van Schuppen, "Positive Matrix Factorization via Extremal Polyhedral Cones," Linear Algebra and its Applications, vol. 293, pp. 171–186, 1999.
[25] C. Liu and P. Narayan, "Order Estimation and Sequential Universal Data Compression of a Hidden Markov Source by the Method of Mixtures," IEEE Transactions on Information Theory, vol. 40, 1994.
[26] A. Paz, Introduction to Probabilistic Automata. Academic Press, New York and London, 1971.
[27] P. Comon, "Independent Component Analysis: A New Concept?" Signal Processing, vol. 36, pp.
287–314, 1994.
[28] J. Cohen and U. Rothblum, "Nonnegative Ranks, Decompositions and Factorizations of Nonnegative Matrices," Linear Algebra and its Applications, vol. 190, pp. 149–168, 1993.
[29] H. Ito, S.-I. Amari, and K. Kobayashi, "Identifiability of Hidden Markov Information Sources and Their Minimum Degrees of Freedom," IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 324–333, 1992.
[30] C. R. Shalizi, K. L. Shalizi, and J. P. Crutchfield, "Pattern Discovery in Time Series, Part I: Theory, Algorithm, Analysis, and Convergence," Journal of Machine Learning Research, (in preparation), 2002.
[31] B. Weiss, "Subshifts of Finite Type and Sofic Systems," Monatshefte für Mathematik, vol. 77, pp. 462–474, 1973.
[32] C. R. Shalizi and K. L. Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences," in Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), Arlington, Virginia, M. Chickering and J. Y. Halpern, Eds. AUAI Press, 2004, pp. 504–511. [Online]. Available: http://arxiv.org/abs/cs.LG/0406011
[33] J. P. Crutchfield and C. R. Shalizi, "Thermodynamic Depth of Causal States: Objective Complexity via Minimal Representations," Physical Review E, vol. 59, pp. 275–283, 1999.

PLACE PHOTO HERE

George Cybenko is the Dorothy and Walter Gramm Professor of Engineering at the Thayer School of Engineering at Dartmouth College. Prior to joining Dartmouth, he was Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His current research interests are in machine learning, signal processing and computer security. Cybenko was founding Editor-in-Chief of IEEE Computing in Science and Engineering and IEEE Security & Privacy. He is presently First Vice-President of the IEEE Computer Society. He earned his B.Sc.
(Toronto) and Ph.D. (Princeton) degrees in mathematics and is a Fellow of the IEEE.

PLACE PHOTO HERE

Valentino Crespi received his Laurea Degree and his Ph.D. Degree in Computer Science from the University of Milan, Italy, in July 1992 and July 1997, respectively. From September 1998 to August 2000 he was an Assistant Professor of Computer Science at the Eastern Mediterranean University, Famagusta, North Cyprus, and from September 2000 to August 2003 he worked at Dartmouth College, Hanover, NH, as a Research Faculty. Since September 2003 he has been on the faculty of the Department of Computer Science, California State University, Los Angeles, currently in the capacity of Associate Professor. His research interests include Distributed Computing, Tracking Systems, UAV Surveillance, Sensor Networks, Information and Communication Theory, Complexity Theory and Combinatorial Optimization. At Dartmouth College he developed the TASK project and consulted for the Process Query Systems project, directed by Prof. George Cybenko. At CSULA he has been teaching lower division, upper division and master's courses on Algorithms, Data Structures, Java Programming, Compilers, Theory of Computation and Computational Learning of Languages and Stochastic Processes. During his professional activity Dr. Crespi has published a number of papers in prestigious journals and conferences of Applied Mathematics, Computer Science and Engineering. Moreover, Dr. Crespi is currently a member of the ACM and of the IEEE.