Online Learning in Discrete Hidden Markov Models

Roberto C. Alamino* and Nestor Caticha†

*Neural Computing Research Group, Aston University, Aston Triangle, Birmingham, B4 7ET, United Kingdom
†Instituto de Física, Universidade de São Paulo, CP 66318, São Paulo, SP, CEP 05389-970, Brazil

Abstract. We present and analyse three online algorithms for learning in discrete Hidden Markov Models (HMMs) and compare them with the Baldi-Chauvin algorithm. Using the Kullback-Leibler divergence as a measure of generalisation error, we draw learning curves in simplified situations. The performance for learning drifting concepts of one of the presented algorithms is analysed and compared with the Baldi-Chauvin algorithm in the same situations. A brief discussion about learning and symmetry breaking based on our results is also presented.

Key Words: HMMs, Online Algorithm, Generalisation Error, Bayesian Algorithm.

INTRODUCTION

Hidden Markov Models (HMMs) [1, 2] are extensively studied machine learning models for time series, with several applications in fields like speech recognition [2], bioinformatics [3, 4] and LDPC codes [5]. They consist of a Markov chain of non-observable hidden states $q_t \in S$, $t = 1, \ldots, T$, $S = \{s_1, s_2, \ldots, s_n\}$, with initial probability vector $\pi_i = P(q_1 = s_i)$ and transition matrix $A_{ij}(t) = P(q_{t+1} = s_j \mid q_t = s_i)$, $i, j = 1, \ldots, n$. At each discrete time $t$, the state $q_t$ emits an observed state $y_t \in O$, $O = \{o_1, \ldots, o_m\}$, with emission probability matrix $B_{i\alpha}(t) = P(y_t = o_\alpha \mid q_t = s_i)$, $i = 1, \ldots, n$, $\alpha = 1, \ldots, m$. The emissions are the actual observations of the time series, represented, from time $t = 1$ to $t = T$, by the observed sequence $y_1^T = \{y_1, y_2, \ldots, y_T\}$. The $q_t$'s form the so-called hidden sequence $q_1^T = \{q_1, q_2, \ldots, q_T\}$.
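As an illustration of the generative process just described, the following minimal Python sketch (ours, not part of the paper; function names are hypothetical) samples an observed sequence from an HMM $\omega = (\pi, A, B)$:

```python
import random

def sample_sequence(pi, A, B, T, rng=random):
    """Sample an observed sequence y_1..y_T from an HMM (pi, A, B).
    pi: initial state probabilities, A: transition matrix (n x n),
    B: emission matrix (n x m); states and symbols are integer indices."""
    n, m = len(A), len(B[0])
    q = rng.choices(range(n), weights=pi)[0]        # q_1 ~ pi
    y = [rng.choices(range(m), weights=B[q])[0]]    # y_1 ~ B[q_1]
    for _ in range(T - 1):
        q = rng.choices(range(n), weights=A[q])[0]  # q_{t+1} ~ A[q_t]
        y.append(rng.choices(range(m), weights=B[q])[0])
    return y
```

With a deterministic model, e.g. pi = [1, 0], A = [[0, 1], [1, 0]], B = [[1, 0, 0], [0, 0, 1]], the sampled sequence is always $o_1, o_3, o_1, o_3, \ldots$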
The probability of observing a sequence $y_1^T$ given $\omega \equiv (\pi, A, B)$ is

$$P(y_1^T \mid \omega) = \sum_{q_1^T} P(q_1)\, P(y_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(y_t \mid q_t). \qquad (1)$$

In the learning process, the HMM is fed with a series of observations and adapts its parameters so as to produce similar ones. Data feeding can range from offline (all data are fed and the parameters calculated at once) to online (data are fed in parts and partial calculations are made).

We study a scenario with data generated by an HMM of unknown parameters, an extension of the student-teacher scenario from neural networks. The performance, as a function of the number of observations, is given by how far the student is from the teacher, measured by a suitable criterion. Here we use the naturally arising Kullback-Leibler (KL) divergence which, although not accessible in practice since it requires knowledge of the teacher, extends the idea of generalisation error and is very informative.

We propose three algorithms and compare them with the Baldi-Chauvin algorithm (BC) [6]: the Baum-Welch Online Algorithm (BWO), an adaptation of the offline Baum-Welch reestimation formulas (BW) [1]; and, starting from a Bayesian formulation, an approximation named the Bayesian Online Algorithm (BOnA), which can be simplified further, without noticeable loss of performance, into the Mean Posterior Approximation (MPA). BOnA and MPA, inspired by Amari [7] and Opper [8], are essentially mean-field methods [9] in which a manifold of tractable prior distributions is introduced and each new datum leads, through Bayes' theorem, to an intractable posterior. The key step is to take as the new prior not the posterior, but the closest distribution (in some sense) within the manifold.

The paper is organised as follows: first, BWO is introduced and analysed. Next, we derive BOnA for HMMs and, from it, MPA. We compare MPA and BC for drifting concepts.
Then, we discuss learning and symmetry breaking and end with our conclusions.

BAUM-WELCH ONLINE ALGORITHM

The Baum-Welch Online Algorithm (BWO) is an online adaptation of BW in which, at each iteration of BW, $y$ becomes $y^p$, the $p$-th observed sequence. Multiplying the BW increment by a learning rate $\eta_{BW}$, we get the update equation for $\omega$:

$$\hat{\omega}_{p+1} = \hat{\omega}_p + \eta_{BW} \hat{\Delta}\omega_p, \qquad (2)$$

with $\hat{\Delta}\omega_p$ the BW variation for $y^p$. The complexity of BWO is polynomial in $n$ and $T$.

In figure 1, the HMM learns sequences generated by a teacher with $n = 2$, $m = 3$ and $T = 2$, for different $\eta_{BW}$. Initial students have matrices with all entries set to the same value, which we call a symmetric initial student. We took averages over 500 random teachers; distances are given by the KL divergence between two HMMs $\omega_1$ and $\omega_2$,

$$d_{KL}(\omega_1, \omega_2) \equiv \sum_{y_1^T} P(y_1^T \mid \omega_1) \ln \frac{P(y_1^T \mid \omega_1)}{P(y_1^T \mid \omega_2)}. \qquad (3)$$

We see that after a certain number of sequences the HMM stops learning; this behaviour is particular to the symmetric initial student and disappears for a non-symmetric one.

Denoting the variation of the parameters in BC by $\Delta$, in BW by $\hat{\Delta}$ and in BWO by $\tilde{\Delta}$, and with $\gamma_t(i) \equiv P(q_t = s_i \mid y^p, \omega_p)$, we have, to first order in $\lambda$,

$$\Delta \pi_i = \frac{\lambda \eta_{BC}}{n} \hat{\Delta}\pi_i = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}} \tilde{\Delta}\pi_i, \qquad (4)$$

$$\Delta A_{ij} = \frac{\lambda \eta_{BC}}{n} \left[ \sum_{t=1}^{T-1} \gamma_t(i) \right] \hat{\Delta}A_{ij} = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}} \left[ \sum_{t=1}^{T-1} \gamma_t(i) \right] \tilde{\Delta}A_{ij},$$

$$\Delta B_{i\alpha} = \frac{\lambda \eta_{BC}}{n} \left[ \sum_{t=1}^{T} \gamma_t(i) \right] \hat{\Delta}B_{i\alpha} = \frac{\lambda}{n} \frac{\eta_{BC}}{\eta_{BW}} \left[ \sum_{t=1}^{T} \gamma_t(i) \right] \tilde{\Delta}B_{i\alpha}.$$

FIGURE 1. Log-log curves of BWO for three different $\eta_{BW}$, indicated next to the curves.

For $\eta_{BW} \approx \lambda \eta_{BC}/n$ and small $\lambda$, the variations in BC are proportional to those in BWO, but with a different effective learning rate for each matrix, depending on $y^p$. Simulations show that actual values are of the same order as the approximated ones.
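For the small models used throughout ($n = 2$, $m = 3$, $T = 2$), both the likelihood of Eq. (1) and the KL divergence of Eq. (3) can be evaluated by brute-force enumeration. A minimal self-contained sketch (our own illustration, not the authors' code):

```python
import itertools
import math

def likelihood(pi, A, B, y):
    """P(y | omega) of Eq. (1): sum over all hidden paths q_1..q_T."""
    n = len(pi)
    total = 0.0
    for q in itertools.product(range(n), repeat=len(y)):
        p = pi[q[0]] * B[q[0]][y[0]]
        for t in range(1, len(y)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][y[t]]
        total += p
    return total

def d_kl(omega1, omega2, T):
    """KL divergence of Eq. (3): sum over all observable sequences y_1^T."""
    m = len(omega1[2][0])  # number of emission symbols
    d = 0.0
    for y in itertools.product(range(m), repeat=T):
        p1 = likelihood(*omega1, y)
        if p1 > 0.0:
            d += p1 * math.log(p1 / likelihood(*omega2, y))
    return d
```

Since the likelihoods over all $m^T$ observable sequences sum to one, d_kl is non-negative and vanishes only when both models assign the same sequence probabilities; the $O(n^T m^T)$ cost of this brute-force version restricts it to the short sequences used in the experiments.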
THE BAYESIAN ONLINE ALGORITHM

The Bayesian Online Algorithm (BOnA) [8] uses Bayesian inference to adjust $\omega$ of the HMM using a data set $D_P = \{y^1, \ldots, y^P\}$. For each datum, the prior distribution is updated by Bayes' theorem. This update takes a prior from a parametric family and transforms it into a posterior which, in general, no longer has the same parametric form. The strategy of BOnA is then to project the posterior back onto the initial parametric family. To achieve this, we minimise the KL divergence between the posterior and a distribution in the parametric family; this minimisation yields the parameters of the closest parametric distribution, by which we approximate the posterior. At each step of the learning process, the student HMM parameters $\omega$ are estimated as the means of each projected distribution.

For a parametric family of the form $P(x) \propto e^{-\sum_i \lambda_i f_i(x)}$, which can be obtained from the MaxEnt principle by constraining the averages over $P(x)$ of arbitrary functions $f_i(x)$, minimising the KL divergence turns out to be equivalent to equating the averages $\langle f_i(x) \rangle$ over $P(x)$ to the averages of these functions over the unprojected posterior (the posterior distribution just after the Bayesian update for the next datum).

For HMMs, the vector $\pi$ and each $i$-th row $A_i$ of $A$ and $B_i$ of $B$ are different discrete distributions, which we assume independent in order to write the factorised distribution

$$P(\omega \mid u) \equiv P(\pi \mid \rho) \prod_{i=1}^{n} P(A_i \mid a_i)\, P(B_i \mid b_i), \qquad (5)$$

where $u = (\rho, a, b)$ represents the parameters of the distributions. As each factor is a distribution over probabilities, the natural choice is the Dirichlet distribution, which for an $N$-dimensional variable $x$ is

$$\mathcal{D}(x \mid u) = \frac{\Gamma(u_0)}{\prod_{i=1}^{N} \Gamma(u_i)} \prod_{i=1}^{N} x_i^{u_i - 1}, \qquad (6)$$

where $u_0 = \sum_i u_i$ and $\Gamma$ is the analytical continuation of the factorial to real numbers.
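The density in Eq. (6) is conveniently evaluated in log space with the log-gamma function; a small sketch (ours, with a hypothetical function name):

```python
import math

def log_dirichlet(x, u):
    """log D(x | u) of Eq. (6), for x on the probability simplex
    (sum_i x_i = 1, x_i > 0) and Dirichlet parameters u_i > 0."""
    u0 = sum(u)
    log_norm = math.lgamma(u0) - sum(math.lgamma(ui) for ui in u)
    return log_norm + sum((ui - 1.0) * math.log(xi) for xi, ui in zip(x, u))
```

For $u = (1, \ldots, 1)$ the distribution is uniform on the simplex; with $N = 3$ the density equals $\Gamma(3) = 2$ everywhere.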
The Dirichlet distributions can be obtained from MaxEnt with $f_i(x) = \ln x_i$ [13]:

$$\int d\mu\, \mathcal{D}(x) \ln x_i = \alpha_i, \qquad d\mu \equiv \delta\!\left(\sum_i x_i - 1\right) \prod_i \theta(x_i)\, dx_i. \qquad (7)$$

The function to be extremised is

$$\mathcal{L} = \int d\mu\, \mathcal{D} \ln \mathcal{D} + \lambda \left( \int d\mu\, \mathcal{D} - 1 \right) + \sum_i \lambda_i \left( \int d\mu\, \mathcal{D} \ln x_i - \alpha_i \right), \qquad (8)$$

and with $\delta\mathcal{L}/\delta\mathcal{D} = 0$ we get the Dirichlet, with normalisation $e^{\lambda+1}$ and $u_i = 1 - \lambda_i$.

Each factor distribution is separately projected by equating the averages of the logarithms in the original posterior $Q$ and in the projected distributions:

$$\psi(\rho_i) - \psi\!\left(\sum_j \rho_j\right) = \langle \ln \pi_i \rangle_Q \equiv \mu_i(\rho), \qquad (9)$$
$$\psi(a_{ij}) - \psi\!\left(\sum_k a_{ik}\right) = \langle \ln A_{ij} \rangle_Q \equiv \mu_{ij}(a),$$
$$\psi(b_{i\alpha}) - \psi\!\left(\sum_\beta b_{i\beta}\right) = \langle \ln B_{i\alpha} \rangle_Q \equiv \mu_{i\alpha}(b),$$

where $\psi(x) = d\ln\Gamma(x)/dx$ is the digamma function. We call a set of $N$ equations

$$\psi(x_i) - \psi\!\left(\sum_j x_j\right) = \mu_i, \qquad (10)$$

with $i = 1, \ldots, N$, a digamma system in the variables $x_i$ with coefficients $\mu_i$.

Let $P_p(\omega)$ be the projected distribution after observation of $y^p$, and $Q_{p+1}(\omega)$ the (not yet projected) posterior distribution after $y^{p+1}$. By Bayes' theorem,

$$Q_{p+1}(\omega) \propto P_p(\omega) \sum_{q^{p+1}} P(y^{p+1}, q^{p+1} \mid \omega). \qquad (11)$$

The calculation of the $\mu$'s in (9) leads to averages over Dirichlets of the form [10]

$$\mu_i = \left\langle \left[\prod_j x_j^{r_j}\right] \ln x_i \right\rangle = \frac{\Gamma(u_0)}{\prod_j \Gamma(u_j)} \frac{\prod_j \Gamma(u_j + r_j)}{\Gamma(u_0 + r_0)} \left[ \psi(u_i + r_i) - \psi(u_0 + r_0) \right]. \qquad (12)$$

To solve (10), we solve for $x_i$, sum over $i$ with $x_0 \equiv \sum_i x_i$, and find numerically, by iterating from an arbitrary initial point, the fixed points of the one-dimensional map

$$x_0^{n+1} = \sum_i \psi^{-1}\!\left[ \mu_i + \psi(x_0^n) \right], \qquad (13)$$

for which we found a unique solution except when $\mu_i \approx 0$, a situation rare in most applications.

FIGURE 2. Comparison in log-log scale of MPA (dashed line) and BOnA (circles).

BOnA has a common problem of Bayesian algorithms: the sum over hidden variables makes the complexity scale exponentially in $T$. Also, the calculation of several digamma functions is very time consuming.
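The digamma system of Eq. (10) can be solved through the fixed-point map of Eq. (13). A self-contained numerical sketch (our illustration, not the authors' code): $\psi$ is approximated by the standard recurrence plus asymptotic series, and $\psi^{-1}$ by Newton's method.

```python
import math

def digamma(x):
    """psi(x) for x > 0, via psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def trigamma(x):
    """psi'(x) for x > 0, needed for the Newton steps of the inversion."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + f / 2.0 + (f / x) * (1/6.0 - f * (1/30.0 - f / 42.0))

def inv_digamma(y):
    """Solve psi(x) = y by Newton's method (initialisation after Minka)."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(30):
        x -= (digamma(x) - y) / trigamma(x)
    return x

def solve_digamma_system(mu, x0=1.0, iters=2000, tol=1e-10):
    """Digamma system of Eq. (10) via the 1-d fixed-point map of Eq. (13)."""
    for _ in range(iters):
        new = sum(inv_digamma(m + digamma(x0)) for m in mu)
        if abs(new - x0) < tol:
            x0 = new
            break
        x0 = new
    return [inv_digamma(m + digamma(x0)) for m in mu]
```

As a sanity check, feeding coefficients generated from known Dirichlet parameters, $\mu_i = \psi(u_i) - \psi(u_0)$, recovers the $u_i$; near the fixed point the map contracts only mildly, so a generous iteration budget is used.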
In the following, we develop an approximation that runs faster, although still with complexity exponential in $T$. This is not a problem, for we can keep $T$ constant and the algorithm then scales polynomially in $n$.

MEAN POSTERIOR APPROXIMATION

The Mean Posterior Approximation (MPA) is a simplification of BOnA inspired by its results for Gaussians, where the projection amounts to matching the first and second moments of the posterior and projected distributions. Accordingly, instead of minimising $d_{KL}$ we match the mean and one of the variances of the posterior and projected distributions as an approximation, which gives, with hatted variables denoting reestimated values [10],

$$\hat{\rho}_i = \langle \pi_i \rangle_Q \, \frac{\langle \pi_1 \rangle_Q - \langle \pi_1^2 \rangle_Q}{\langle \pi_1^2 \rangle_Q - \langle \pi_1 \rangle_Q^2}, \qquad (14)$$
$$\hat{a}_{ij} = \langle A_{ij} \rangle_Q \, \frac{\langle A_{i1} \rangle_Q - \langle A_{i1}^2 \rangle_Q}{\langle A_{i1}^2 \rangle_Q - \langle A_{i1} \rangle_Q^2},$$
$$\hat{b}_{i\alpha} = \langle B_{i\alpha} \rangle_Q \, \frac{\langle B_{i1} \rangle_Q - \langle B_{i1}^2 \rangle_Q}{\langle B_{i1}^2 \rangle_Q - \langle B_{i1} \rangle_Q^2},$$

with complexity again of order $n^T$, but with heavily reduced actual computational time, making it better suited for practical applications.

Figure 2 compares MPA and BOnA. The initial difference decreases in time and the two curves approach each other relatively fast. We used $n = 2$, $m = 3$ and $T = 2$ and averaged over 150 random teachers with symmetric initial students. The computational time was 340 min for BOnA and 5 s for MPA on a 1 GHz processor.

Figure 3a compares MPA to BC and figure 3b compares it to BWO. In both cases MPA generalises better. We used $n = 2$, $m = 3$, $T = 2$, symmetric initial students, and averaged over 500 random teachers.

FIGURE 3. a) Comparison between MPA (dashed) and BC (continuous); values of $\lambda$ are indicated next to the curves, $\eta_{BC} = 0.5$. b) Comparison between MPA (dashed) and BWO (continuous); values of $\eta_{BW}$ are indicated next to the curves. Both scales are log-log.
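The reestimation of Eq. (14) is a one-line computation once the posterior moments are available; a minimal sketch (ours, with a hypothetical function name):

```python
def mpa_reestimate(means, m1, m2):
    """Eq. (14): Dirichlet parameters whose mean and first-component variance
    match the posterior Q. means[i] = <x_i>_Q, m1 = <x_1>_Q, m2 = <x_1^2>_Q."""
    scale = (m1 - m2) / (m2 - m1 * m1)  # equals u_0 for an exact Dirichlet
    return [mi * scale for mi in means]
```

As a sanity check, for an exact Dirichlet with parameters $u$ the moments are $\langle x_i \rangle = u_i/u_0$ and $\langle x_1^2 \rangle = u_1(u_1+1)/(u_0(u_0+1))$, and the update returns $u$ itself.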
FIGURE 4. Drifting concepts. Continuous lines correspond to MPA and dashed lines to BC. a) Abrupt changes at 500-sequence intervals. b) Small random changes at each new sequence.

LEARNING DRIFTING CONCEPTS

We tested BC and MPA with changing teachers. In figure 4a, the teacher changes at random after every 500 sequences ($\lambda = 0.01$, $\eta_{BC} = 10.0$). In figure 4b, each time a sequence is observed, a small random quantity is added to the teacher. Both cases have $n = 2$, $m = 3$ and are averaged over 200 runs. Figure 4b shows that BC adapts better, but it is not fully adaptive and we do not know how to modify it. MPA, instead, derives from Bayesian principles, and we can guess the problem by analogy with similar Bayesian algorithms [12]: the variances decrease during the process, as in the perceptron, where they play the role of learning rates, which explains the memory effect hindering learning after changes. Although not yet proved, we expect the same relationship to hold in MPA, which can be used to improve its performance.

LEARNING AND SYMMETRY BREAKING

Learning from symmetric initial students requires that the parameters separate from each other at some point, which depends on the algorithm and is an important feature in online algorithms [11]: breaking the symmetry comes with a sharp decrease in the generalisation error. Instead of taking averages, which smooth out abrupt changes, here we draw curves for only one teacher, rendering them visible. Flat lines before a symmetry breaking are called plateaux and occur when it is difficult to break the symmetry.

FIGURE 5. KL-divergence and student's parameters for a) BC and b) MPA.

Figure 5a shows BC ($\lambda = 0.01$, $\eta_{BC} = 1.0$) with two abrupt changes: at the beginning and after 1000 sequences. $\pi$ and $A$ only break the symmetry at the second point, while $B$ breaks it at both. Figure 5b shows that in MPA the second change is stronger and the symmetry breaking affects both $B$ and $A$. Figure 6 shows BWO with $\eta_{BW} = 0.01$, where only $B$ is affected. The more symmetries are broken, the better the generalisation of the algorithm. In all simulations we set $n = 2$, $m = 3$ and $T = 2$, with a teacher HMM given by

$$\pi = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (15)$$

CONCLUSIONS

We proposed and analysed three learning algorithms for HMMs: Baum-Welch Online (BWO), the Bayesian Online Algorithm (BOnA) and the Mean Posterior Approximation (MPA). We showed the superior performance of MPA for static teachers; the Baldi-Chauvin (BC) algorithm is better for drifting concepts, although the Bayesian nature of MPA suggests how to fix it. These results seem to be confirmed by initial tests on real data. The importance of symmetry breaking in learning processes is presented here in a brief discussion, where the phenomenon is shown to occur in our models.

ACKNOWLEDGEMENTS

We would like to thank Evaldo Oliveira, Manfred Opper and Lehel Csato for useful discussions. This work was done in part at the University of São Paulo, with financial support from FAPESP, and in part at Aston University, with support from the Evergrow Project.
FIGURE 6. KL-divergence and student's parameters for BWO.

REFERENCES

1. Y. Ephraim, N. Merhav, Hidden Markov Processes. IEEE Trans. Inf. Theory 48, 1518-1569 (2002).
2. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77, 257-286 (1989).
3. P. Baldi, S. Brunak, Bioinformatics: The Machine Learning Approach. MIT Press (2001).
4. R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998).
5. J. Garcia-Frias, Decoding of Low-Density Parity-Check Codes Over Finite-State Binary Markov Channels. IEEE Trans. Comm. 52, 1840-1843 (2004).
6. P. Baldi, Y. Chauvin, Smooth On-Line Learning Algorithms for Hidden Markov Models. Neural Computation 6, 307-318 (1994).
7. S. Amari, Neural learning in structured parameter spaces - Natural Riemannian gradient. NIPS'96 9, MIT Press (1996).
8. M. Opper, A Bayesian Approach to On-line Learning. In: On-line Learning in Neural Networks, edited by D. Saad, Publications of the Newton Institute, Cambridge University Press, Cambridge (1998).
9. M. Opper, D. Saad, Advanced Mean Field Methods: Theory and Practice. MIT Press (2001).
10. R. Alamino, N. Caticha, Bayesian Online Algorithms for Learning in Discrete Hidden Markov Models. Submitted to Discrete and Continuous Dynamical Systems.
11. T. Heskes, W. Wiegerinck, On-line Learning with Time-Correlated Examples. In: On-line Learning in Neural Networks, 251-278, edited by D. Saad, Cambridge University Press, Cambridge (1998).
12. R. Vicente, O. Kinouchi, N. Caticha, Statistical Mechanics of Online Learning of Drifting Concepts: A Variational Approach. Machine Learning 32, 179-201 (1998).
13. M. O. Vlad, M. Tsuchiya, P. Oefner, J. Ross, Bayesian analysis of systems with random chemical composition: Renormalization-group approach to Dirichlet distributions and the statistical theory of dilution. Phys. Rev. E 65, 011112 (2001).