Learning the Structure of Deep Sparse Graphical Models


Authors: Ryan Prescott Adams, Hanna M. Wallach, Zoubin Ghahramani

LEARNING THE STRUCTURE OF DEEP SPARSE GRAPHICAL MODELS

By Ryan P. Adams*, Hanna M. Wallach and Zoubin Ghahramani
University of Toronto, University of Massachusetts and University of Cambridge

Deep belief networks are a powerful way to model complex probability distributions. However, learning the structure of a belief network, particularly one with hidden units, is difficult. The Indian buffet process has been used as a nonparametric Bayesian prior on the directed structure of a belief network with a single infinitely wide hidden layer. In this paper, we introduce the cascading Indian buffet process (CIBP), which provides a nonparametric prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network so each unit can additionally vary its behavior between discrete and continuous representations. We provide Markov chain Monte Carlo algorithms for inference in these belief networks and explore the structures learned on several image data sets.

1. Introduction. The belief network or directed probabilistic graphical model [Pearl, 1988] is a popular and useful way to represent complex probability distributions. Methods for learning the parameters of such networks are well-established. Learning network structure, however, is more difficult, particularly when the network includes unobserved hidden units. Then, not only must the structure (edges) be determined, but the number of hidden units must also be inferred. This paper contributes a novel nonparametric Bayesian perspective on the general problem of learning graphical models with hidden variables.
Nonparametric Bayesian approaches to this problem are appealing because they can avoid the difficult computations required for selecting the appropriate a posteriori dimensionality of the model. Instead, they introduce an infinite number of parameters into the model a priori, and inference determines the subset of these that actually contributed to the observations. The Indian buffet process (IBP) [Ghahramani et al., 2007, Griffiths and Ghahramani, 2006] is one example of a nonparametric Bayesian prior, and it has previously been used to introduce an infinite number of hidden units into a belief network with a single hidden layer [Wood et al., 2006].

* http://www.cs.toronto.edu/~rpa

This paper unites two important areas of research: nonparametric Bayesian methods and deep belief networks. To date, work on deep belief networks has not addressed the general structure-learning problem. We therefore present a unifying framework for solving this problem using nonparametric Bayesian methods. We first propose a novel extension to the Indian buffet process — the cascading Indian buffet process (CIBP) — and use the Foster–Lyapunov criterion to prove convergence properties that make it tractable with finite computation. We then use the CIBP to generalize the single-layered, IBP-based, directed belief network to construct multi-layered networks that are both infinitely wide and infinitely deep, and discuss useful properties of such networks, including expected in-degree and out-degree for individual units. Finally, we combine this framework with the powerful continuous sigmoidal belief network framework [Frey, 1997]. This allows us to infer the type (i.e., discrete or continuous) of individual hidden units—an important property that is not widely discussed in previous work.
To summarize, we present a flexible, nonparametric framework for directed deep belief networks that permits inference of the number of hidden units, the directed edge structure between units, the depth of the network and the most appropriate type for each unit.

2. Finite Belief Networks. We consider belief networks that are layered directed acyclic graphs with both visible and hidden units. Hidden units are random variables that appear in the joint distribution described by the belief network but are not observed. We index layers by m, increasing with depth up to M, and allow visible units (i.e., observed variables) only in layer m = 0. We require that units in layer m have parents only in layer m + 1. Within layer m, we denote the number of units as K^{(m)} and index the units with k so that the kth unit in layer m is denoted u_k^{(m)}. We use the notation u^{(m)} to refer to the vector of all K^{(m)} units for layer m together. A binary K^{(m-1)} × K^{(m)} matrix Z^{(m)} specifies the edges from layer m to layer m − 1, so that element Z^{(m)}_{k,k'} = 1 iff there is an edge from unit u^{(m)}_{k'} to unit u^{(m-1)}_k.

A unit's activation is determined by a weighted sum of its parent units. The weights for layer m are denoted by a K^{(m-1)} × K^{(m)} real-valued matrix W^{(m)}, so that the activations for the units in layer m can be written as
$$y^{(m)} = \big(W^{(m+1)} \odot Z^{(m+1)}\big)\, u^{(m+1)} + \gamma^{(m)},$$
where γ^{(m)} is a K^{(m)}-dimensional vector of bias weights and the binary operator ⊙ indicates the Hadamard (elementwise) product. To achieve a wide range of possible behaviors for the units, we use the nonlinear Gaussian belief network (NLGBN) [Frey, 1997, Frey and Hinton, 1999] framework.
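As a concrete illustration of the masked activation $y^{(m)} = (W^{(m+1)} \odot Z^{(m+1)})\,u^{(m+1)} + \gamma^{(m)}$, here is a minimal Python sketch (ours, not the authors' implementation; the function name `activations` and the toy numbers are illustrative):

```python
def activations(W, Z, u, gamma):
    """Compute y = (W o Z) u + gamma for one layer.

    W, Z: K_below x K_above lists of lists (real weights and binary edge mask);
    u: values of the K_above parent units; gamma: biases of the K_below units.
    """
    K_below, K_above = len(W), len(W[0])
    return [gamma[k] + sum(W[k][kp] * Z[k][kp] * u[kp] for kp in range(K_above))
            for k in range(K_below)]

# Tiny example: two child units, three parents; only masked-in edges contribute.
W = [[0.5, -1.0, 2.0],
     [1.5,  0.3, 0.0]]
Z = [[1, 0, 1],   # child 0 has parents 0 and 2
     [0, 1, 0]]   # child 1 has parent 1 only
u = [1.0, -1.0, 0.5]
gamma = [0.1, -0.2]
y = activations(W, Z, u, gamma)
```

Only entries where Z is 1 contribute, so the sparsity pattern of Z^{(m+1)} alone determines which parents influence each child.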
In the NLGBN, the distribution on u_k^{(m)} arises from adding zero-mean Gaussian noise with precision ν_k^{(m)} to the activation sum y_k^{(m)}. This noisy sum is then transformed with a sigmoid function σ(·) to arrive at the value of the unit. We modify the NLGBN slightly so that the sigmoid function is from the real line to (−1, 1), i.e. σ : R → (−1, 1), via σ(x) = 2/(1 + exp{−x}) − 1. The distribution of u_k^{(m)} given its parents is then
$$p\big(u_k^{(m)} \,\big|\, y_k^{(m)}, \nu_k^{(m)}\big) = \frac{\exp\Big\{-\frac{\nu_k^{(m)}}{2}\big[\sigma^{-1}(u_k^{(m)}) - y_k^{(m)}\big]^2\Big\}}{\sigma'\big(\sigma^{-1}(u_k^{(m)})\big)\,\sqrt{2\pi/\nu_k^{(m)}}},$$
where σ′(x) = (d/dx) σ(x). As discussed in Frey [1997] and shown in Figure 1, different choices of ν_k^{(m)} yield different belief unit behaviors, from effectively discrete binary units to nonlinear continuous units. In the multilayered construction we have described here, the joint distribution over the units in a NLGBN is
$$p\big(\{u^{(m)}\}_{m=0}^{M} \,\big|\, \{Z^{(m)}, W^{(m)}\}_{m=1}^{M}, \{\gamma^{(m)}, \{\nu_k^{(m)}\}_{k=1}^{K^{(m)}}\}_{m=0}^{M}\big) = \Bigg[\prod_{k=1}^{K^{(M)}} p\big(u_k^{(M)} \,\big|\, \gamma_k^{(M)}, \nu_k^{(M)}\big)\Bigg] \prod_{m=0}^{M-1} \prod_{k=1}^{K^{(m)}} p\big(u_k^{(m)} \,\big|\, y_k^{(m)}, \nu_k^{(m)}\big). \tag{1}$$

3. Infinite Belief Networks. Conditioned on the number of layers M, the layer widths K^{(m)} and the network structures Z^{(m)}, inference in belief networks can be straightforwardly implemented using Markov chain Monte Carlo [Neal, 1992]. Learning the depth, width and structure, however, presents significant computational challenges. In this section, we present a novel nonparametric prior, the cascading Indian buffet process, for multi-layered belief networks that are both infinitely wide and infinitely deep. By using an infinite prior we avoid the need for the complex dimensionality-altering proposals that would otherwise be required during inference.

3.1. The Indian buffet process.
Section 2 used the binary matrix Z^{(m)} as a convenient way to represent the edges connecting layer m to layer m − 1. We stated that Z^{(m)} was a finite K^{(m-1)} × K^{(m)} matrix. We can use the Indian buffet process (IBP) [Griffiths and Ghahramani, 2006] to allow this matrix to have an infinite number of columns. We assume the two-parameter IBP [Ghahramani et al., 2007], and use Z^{(m)} ∼ IBP(α, β) to indicate that the matrix Z^{(m)} ∈ {0, 1}^{K^{(m-1)} × ∞} is drawn from an IBP with parameters α, β > 0.

[Fig 1: Three modes of operation for the NLGBN unit, for (a) ν = 1/2, (b) ν = 5 and (c) ν = 1000. The black solid line shows the zero-mean distribution (i.e. y = 0), the red dashed line shows a pre-sigmoid mean of +1 and the blue dash-dot line shows a pre-sigmoid mean of −1. (a) Binary behavior from small precision. (b) Roughly Gaussian behavior from medium precision. (c) Deterministic behavior from large precision.]

The eponymous metaphor for the IBP is a restaurant with an infinite number of dishes available. Each customer chooses a finite set of dishes to taste. The rows of the binary matrix correspond to customers and the columns correspond to dishes. If the jth customer tastes the kth dish, then Z_{j,k} = 1, otherwise Z_{j,k} = 0. The first customer into the restaurant samples a number of dishes that is Poisson distributed with parameter α. After that, when the jth customer enters the restaurant, she selects dish k with probability η_k/(j + β − 1), where η_k is the number of previous customers that have tried the kth dish. She then chooses a number of additional dishes to taste that is Poisson distributed with parameter αβ/(j + β − 1).
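The restaurant construction above translates directly into code. The sketch below is our own illustration (the names `sample_ibp` and `poisson` are not from the paper; Poisson draws use Knuth's method); it simulates the two-parameter IBP and returns, for each customer, the set of dish indices tasted:

```python
import math
import random

def poisson(rate, rng):
    # Knuth's method; adequate for the modest rates used here.
    threshold, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_ibp(num_customers, alpha, beta, rng):
    """Simulate the two-parameter Indian buffet process restaurant."""
    eta = []    # eta[k]: number of previous customers who tasted dish k
    rows = []
    for j in range(1, num_customers + 1):
        tasted = set()
        # Taste existing dish k with probability eta_k / (j + beta - 1).
        for k, count in enumerate(eta):
            if rng.random() < count / (j + beta - 1):
                tasted.add(k)
        # Then taste Poisson(alpha * beta / (j + beta - 1)) brand-new dishes.
        for _ in range(poisson(alpha * beta / (j + beta - 1), rng)):
            eta.append(0)
            tasted.add(len(eta) - 1)
        for k in tasted:
            eta[k] += 1
        rows.append(tasted)
    return rows

rows = sample_ibp(20, 2.0, 1.0, random.Random(0))
```

Each row of the implied binary matrix is the indicator vector of the corresponding set; note that for the first customer the new-dish rate αβ/(1 + β − 1) reduces to α, matching the description above.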
Even though each customer chooses dishes based on their popularity with previous customers, the rows and columns of the resulting matrix Z^{(m)} are infinitely exchangeable. As in Wood et al. [2006], if the model of Section 2 had only a single hidden layer, i.e. M = 1, then the IBP could be used to make that layer infinitely wide.

While a belief network with an infinitely-wide hidden layer can represent any probability distribution arbitrarily closely [Le Roux and Bengio, 2008], it is not necessarily a useful prior on such distributions. Without intra-layer connections, the hidden units are independent a priori. This "shallowness" is a strong assumption that weakens the model in practice, and the explosion of recent literature on deep belief networks (see, e.g. Hinton and Salakhutdinov [2006], Hinton et al. [2006]) speaks to the empirical success of belief networks with more hidden structure.

3.2. The cascading Indian buffet process. To build a prior on belief networks that are unbounded in both width and depth, we use an IBP-like object that provides an infinite sequence of binary matrices Z^{(0)}, Z^{(1)}, Z^{(2)}, ···. We require the matrices in this sequence to inherit the useful sparsity properties of the IBP, with the constraint that the columns from Z^{(m-1)} correspond to the rows in Z^{(m)}. We interpret each matrix Z^{(m)} as specifying the directed edge structure from layer m to layer m − 1, where both layers have a potentially-unbounded width. We propose the cascading Indian buffet process to provide a prior with these properties. The CIBP extends the vanilla IBP in the following way: each of the "dishes" in the restaurant is also a "customer" in another Indian buffet process.
The columns in one binary matrix correspond to the rows in another binary matrix. The CIBP is infinitely exchangeable in the rows of matrix Z^{(0)}. Each of the IBPs in the recursion is exchangeable in its rows and columns, so it does not change the probability of the data to propagate a permutation back through the matrices.

If there are K^{(0)} customers in the first restaurant, a surprising result is that, for finite K^{(0)}, α, and β, the CIBP recursion terminates with probability one. By "terminate" we mean that at some point the customers do not taste any dishes and all deeper restaurants have neither dishes nor customers. Here we only sketch the intuition behind this result. A proof is provided in Appendix A.

The matrices in the CIBP are constructed in a sequence, starting with m = 0. The number of nonzero columns in matrix Z^{(m+1)}, namely K^{(m+1)}, is determined entirely by K^{(m)}, the number of nonzero columns in Z^{(m)}. We require that for some matrix Z^{(m)}, there are no nonzero columns. For this purpose, we can disregard the fact that it is a matrix-valued stochastic process and instead consider the Markov chain that results on the number of nonzero columns. Figure 2a shows three traces of such a Markov chain on K^{(m)}. If we define
$$\lambda(K; \alpha, \beta) = \alpha \sum_{k'=1}^{K} \frac{\beta}{k' + \beta - 1},$$
then the Markov chain has the transition distribution
$$p\big(K^{(m+1)} = k \,\big|\, K^{(m)}, \alpha, \beta\big) = \frac{1}{k!} \exp\big\{-\lambda(K^{(m)}; \alpha, \beta)\big\}\, \lambda(K^{(m)}; \alpha, \beta)^k, \tag{2}$$
which is simply a Poisson distribution with mean λ(K^{(m)}; α, β). Clearly, K^{(m)} = 0 is an absorbing state; however, the state space of the Markov chain is countably-infinite, and to know that it will reach the absorbing state with probability one, we must know that K^{(m)} does not blow up to infinity.
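This width chain is easy to simulate. The sketch below is our illustration (function names are ours), with λ as defined in the text and a Knuth-style Poisson sampler; it runs the chain of Eqn 2 until it hits the absorbing state K^{(m)} = 0:

```python
import math
import random

def lam(K, alpha, beta):
    # lambda(K; alpha, beta) = alpha * sum_{k'=1..K} beta / (k' + beta - 1)
    return alpha * sum(beta / (kp + beta - 1) for kp in range(1, K + 1))

def poisson(rate, rng):
    # Knuth's method; fine for the modest rates produced by lam().
    threshold, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def width_chain(K0, alpha, beta, rng, max_depth=100_000):
    """Run K^{(m+1)} ~ Poisson(lambda(K^{(m)}; alpha, beta)) until absorption at zero."""
    widths = [K0]
    while widths[-1] > 0 and len(widths) < max_depth:
        widths.append(poisson(lam(widths[-1], alpha, beta), rng))
    return widths

widths = width_chain(50, 3.0, 1.0, random.Random(1))
```

With α = 3, β = 1 and K^{(0)} = 50, runs of this chain are absorbed quickly in practice, mirroring the traces in Figure 2a.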
In such a Markov chain, this requirement is equivalent to the statement that the chain has an equilibrium distribution when conditioned on nonabsorption (has a quasi-stationary distribution) [Seneta and Vere-Jones, 1966]. For countably-infinite state spaces, a Markov chain has a (quasi-) stationary distribution if it is positive-recurrent, which is the property that there is a finite expected time between consecutive visits to any state.

[Fig 2: Properties of the Markov chain on layer width for the CIBP, with α = 3, β = 1. Note that these values are illustrative and are not necessarily appropriate for a network structure. a) Example traces of a Markov chain on layer width, indexed by depth m, with K^{(0)} = 50. b) Expected K^{(m+1)} as a function of K^{(m)} is shown in blue. The Lyapunov function L(·) is shown in green. c) The drift as a function of the current width K^{(m)}. This corresponds to the difference between the two lines in (b). Note that it goes negative when the layer width is greater than eight.]

Positive recurrence can be shown by proving the Foster–Lyapunov stability criterion (FLSC) [Fayolle et al., 2008]. Taken together, satisfying the FLSC for the Markov chain with transition probabilities given by Eqn 2 demonstrates that eventually the CIBP will reach a restaurant in which the customers try no new dishes. We do this by showing that if K^{(m)} is large enough, the expected K^{(m+1)} is smaller than K^{(m)}.
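The argument can be checked numerically. With L(k) = k, the drift at width K is simply λ(K; α, β) − K; the sketch below (our illustration, not the authors' code) finds where it first turns negative for the Figure 2 setting α = 3, β = 1:

```python
def lam(K, alpha, beta):
    # lambda(K; alpha, beta) = alpha * sum_{k'=1..K} beta / (k' + beta - 1),
    # the expected next width E[K^{(m+1)} | K^{(m)} = K].
    return alpha * sum(beta / (kp + beta - 1) for kp in range(1, K + 1))

def drift(K, alpha, beta):
    # With the Lyapunov function L(k) = k, the drift is lambda(K) - K.
    return lam(K, alpha, beta) - K

# lambda grows only logarithmically in K, so the drift must eventually go negative.
threshold = next(K for K in range(1, 1000) if drift(K, 3.0, 1.0) < 0)
```

Here `threshold` comes out to 9: the drift is still slightly positive at width eight and negative from width nine onward, which matches the observation in Figure 2c that the drift goes negative once the layer width exceeds eight.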
The FLSC requires a Lyapunov function $L(k): \mathbb{N}^{+} \to \mathbb{R}_{\geq 0}$, with which we define the drift function:
$$\mathbb{E}_{k \mid K^{(m)}}\big[L(k) - L(K^{(m)})\big] = \sum_{k=1}^{\infty} p\big(K^{(m+1)} = k \,\big|\, K^{(m)}\big)\big(L(k) - L(K^{(m)})\big).$$
The drift is the expected change in L(k). If there is a K^{(m)} above which all drifts are negative, then the Markov chain satisfies the FLSC and is positive-recurrent. In the CIBP, this is satisfied for L(k) = k. That the drift eventually becomes negative can be seen from the fact that $\mathbb{E}_{k \mid K^{(m)}}[L(k)] = \lambda(K^{(m)}; \alpha, \beta)$ is O(ln K^{(m)}), while $L(K^{(m)}) = K^{(m)}$ is O(K^{(m)}). Figures 2b and 2c show a schematic of this idea.

3.3. The CIBP as a prior on the structure of an infinite belief network. The CIBP can be used as a prior on the sequence Z^{(0)}, Z^{(1)}, Z^{(2)}, ··· from Section 2, to allow an infinite sequence of infinitely-wide hidden layers. As before, there are K^{(0)} visible units. The edges between the first hidden layer and the visible layer are drawn according to the restaurant metaphor. This yields a finite number of units in the first hidden layer, denoted K^{(1)} as before. These units are now treated as the visible units in another IBP-based network. While this recursion continues infinitely deep, only a finite number of units are ancestors of the visible units.

[Fig 3: Samples from the CIBP-based prior on network structures, with five visible units: (a) α = 1, β = 1; (b) α = 1, β = 1/2; (c) α = 1/2, β = 1; (d) α = 1, β = 2; (e) α = 3/2, β = 1.]

Figure 3 shows several samples from the prior for different parameterizations. Only connected units are shown in the figure. The parameters α and β govern the expected width and sparsity of the network at each level.
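The generative process of this section can be sketched end to end: draw an IBP matrix for the visible layer, then recurse, treating each new layer's units as the customers of the next restaurant, until a layer introduces no units. The code below is our illustration (names like `sample_cibp` are not from the paper), again using a Knuth-style Poisson sampler:

```python
import math
import random

def poisson(rate, rng):
    # Knuth's method; adequate for the modest rates used here.
    threshold, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_ibp_matrix(num_rows, alpha, beta, rng):
    """One two-parameter IBP draw, returned as a dense binary matrix."""
    eta, rows = [], []
    for j in range(1, num_rows + 1):
        row = [1 if rng.random() < eta[k] / (j + beta - 1) else 0
               for k in range(len(eta))]
        for _ in range(poisson(alpha * beta / (j + beta - 1), rng)):
            eta.append(0)
            row.append(1)
        for k, v in enumerate(row):
            eta[k] += v
        rows.append(row)
    # Pad earlier rows with zeros for dishes introduced by later customers.
    return [r + [0] * (len(eta) - len(r)) for r in rows]

def sample_cibp(K0, alpha, beta, rng, max_layers=100_000):
    """Cascade IBP draws; the columns of each Z become the rows of the next."""
    matrices, K = [], K0
    while K > 0 and len(matrices) < max_layers:
        Z = sample_ibp_matrix(K, alpha, beta, rng)
        matrices.append(Z)
        K = len(Z[0])    # number of units introduced in the next layer up
    return matrices

mats = sample_cibp(5, 1.0, 1.0, random.Random(3))
```

Termination with probability one guarantees the recursion stops; only units that are ancestors of the visible layer ever need to be instantiated.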
The expected in-degree of each unit (number of parents) is α, and the expected out-degree (number of children) is
$$K \Bigg/ \sum_{k=1}^{K} \frac{\beta}{\beta + k - 1},$$
for K units in the layer below. For clarity, we have presented the CIBP results with α and β fixed at all depths; however, this may be overly restrictive. For example, in an image recognition problem we would not expect the sparsity of edges mapping low-level features to pixels to be the same as that for high-level features to low-level features. To address this, we allow α and β to vary with depth, writing α^{(m)} and β^{(m)}. The CIBP terminates with probability one as long as there exists some finite upper bound for α^{(m)} and β^{(m)} for all m.

3.4. Priors on other parameters. Other parameters in the model also require prior distributions, and we use these priors to tie parameters together according to layer. We assume that the weights in layer m are drawn independently from Gaussian distributions with mean μ_w^{(m)} and precision ρ_w^{(m)}. We assume a similar layer-wise prior for biases, with parameters μ_γ^{(m)} and ρ_γ^{(m)}. We use layer-wise gamma priors on the ν_k^{(m)}, with parameters a^{(m)} and b^{(m)}. We tie these prior parameters together with global normal-gamma hyperpriors for the weight and bias parameters, and gamma hyperpriors for the precision parameters.

4. Inference. We have so far described a prior on belief network structure and parameters, along with likelihood functions for unit activation. The inference task in this model is to find the posterior distribution over the structure and the parameters of the network, having seen N K^{(0)}-dimensional vectors {x_n ∈ (−1, 1)^{K^{(0)}}}_{n=1}^{N}.
This posterior distribution is complex, so we use Markov chain Monte Carlo (MCMC) to draw samples from p({Z^{(m)}, W^{(m)}}_{m=1}^{∞}, {γ^{(m)}, ν^{(m)}}_{m=0}^{∞}, {x_n}_{n=1}^{N}), which, for fixed {x_n}_{n=1}^{N}, is proportional to the posterior distribution. This joint distribution requires marginalizing over the states of the hidden units that led to each of the N observations. The values of these hidden units are denoted {{u_n^{(m)}}_{m=1}^{∞}}_{n=1}^{N}, and we augment the Markov chain to include these as well.

In general, one would not expect that a distribution on infinite networks would yield tractable inference. However, in our construction, conditioned on the sequence Z^{(1)}, Z^{(2)}, ···, almost all of the infinite number of units are independent. Due to this independence, they trivially marginalize out of the model's joint distribution, and we can restrict inference only to those units that are ancestors of the visible units. Of course, since this trivial marginalization only arises from the Z^{(m)} matrices, we must also have a distribution on infinite binary matrices that allows exact marginalization of all the uninstantiated edges. The row-wise and column-wise exchangeability properties of the IBP are what allow the use of infinite matrices. The bottom-up conditional structure of the CIBP allows an infinite number of these matrices.

To simplify notation, we will use Ω for the aggregated state of the model variables, i.e.
$$\Omega = \big(\{Z^{(m)}, W^{(m)}, \{u_n^{(m)}\}_{n=1}^{N}\}_{m=1}^{\infty}, \{\gamma^{(m)}, \nu^{(m)}\}_{m=0}^{\infty}, \{x_n\}_{n=1}^{N}\big).$$
Given the hyperparameters, we can then write the joint distribution as
$$p(\Omega) = \Bigg[p(\gamma^{(0)})\, p(\nu^{(0)}) \prod_{k=1}^{K^{(0)}} \prod_{n=1}^{N} p\big(x_{k,n} \,\big|\, y_{k,n}^{(0)}, \nu_k^{(0)}\big)\Bigg] \times \Bigg[\prod_{m=1}^{\infty} p(W^{(m)})\, p(\gamma^{(m)})\, p(\nu^{(m)}) \prod_{k=1}^{K^{(m)}} \prod_{n=1}^{N} p\big(u_{k,n}^{(m)} \,\big|\, y_{k,n}^{(m)}, \nu_k^{(m)}\big)\Bigg]. \tag{3}$$
Although this distribution involves several infinite sets, it is possible to sample from the relevant parts of the posterior. We do this by MCMC, updating part of the model while conditioning on the rest. In particular, conditioning on the binary matrices {Z^{(m)}}_{m=1}^{∞}, which define the structure of the network, inference becomes exactly as it would be in a finite belief network.

4.1. Sampling from the hidden unit states. Since we cannot easily integrate out the hidden units, it is necessary to explicitly represent them and sample from them as part of the Markov chain. As we are conditioning on the network structure, it is only necessary to sample the units that are ancestors of the visible units. Frey [1997] proposed a slice sampling scheme for the hidden unit states, but we have been more successful with a specialized independence-chain variant of multiple-try Metropolis–Hastings [Liu et al., 2000]. Our method proposes several (≈ 5) possible new unit states from the activation distribution and selects from among them (or rejects them all) according to the likelihood imposed by its children. As this operation can be executed in parallel by tools such as Matlab, we have seen significantly better mixing performance by wall-clock time than the slice sampler.

4.2. Sampling from the weights and biases. Given that a directed edge exists, we sample the posterior distribution over its weight. Conditioning on
the rest of the model, the NLGBN results in a convenient Gaussian form for the distribution on weights, so that we can Gibbs update them using a Gaussian with parameters
$$\mu^{w\text{-post}}_{m,k,k'} = \frac{\rho_w^{(m)} \mu_w^{(m)} + \nu_k^{(m-1)} \sum_n u_{n,k'}^{(m)} \big(\sigma^{-1}(u_{n,k}^{(m-1)}) - \xi_{n,k,k'}^{(m)}\big)}{\rho_w^{(m)} + \nu_k^{(m-1)} \sum_n (u_{n,k'}^{(m)})^2}, \tag{4}$$
$$\rho^{w\text{-post}}_{m,k,k'} = \rho_w^{(m)} + \nu_k^{(m-1)} \sum_n (u_{n,k'}^{(m)})^2, \tag{5}$$
where
$$\xi_{n,k,k'}^{(m)} = \gamma_k^{(m-1)} + \sum_{k'' \neq k'} Z_{k,k''}^{(m)} W_{k,k''}^{(m)} u_{n,k''}^{(m)}. \tag{6}$$
The bias γ_k^{(m)} can be similarly sampled from a Gaussian distribution with parameters
$$\mu^{\gamma\text{-post}}_{m,k} = \frac{\rho_\gamma^{(m)} \mu_\gamma^{(m)} + \nu_k^{(m)} \sum_{n=1}^{N} \big(\sigma^{-1}(u_{n,k}^{(m)}) - \chi_{n,k}^{(m)}\big)}{\rho_\gamma^{(m)} + N \nu_k^{(m)}}, \tag{7}$$
$$\rho^{\gamma\text{-post}}_{m,k} = \rho_\gamma^{(m)} + N \nu_k^{(m)}, \tag{8}$$
where
$$\chi_{n,k}^{(m)} = \sum_{k'=1}^{K^{(m+1)}} Z_{k,k'}^{(m+1)} W_{k,k'}^{(m+1)} u_{n,k'}^{(m+1)}. \tag{9}$$

4.3. Sampling from the activation variances. We use the NLGBN model to gain the ability to change the mode of unit behaviors between discrete and continuous representations. This corresponds to sampling from the posterior distributions over the ν_k^{(m)}. With a conjugate prior, the new value can be sampled from a gamma distribution with parameters
$$a^{\nu\text{-post}}_{m,k} = a_\nu^{(m)} + N/2, \tag{10}$$
$$b^{\nu\text{-post}}_{m,k} = b_\nu^{(m)} + \frac{1}{2} \sum_{n=1}^{N} \big(\sigma^{-1}(u_{n,k}^{(m)}) - y_{n,k}^{(m)}\big)^2. \tag{11}$$

4.4. Sampling from the structure. A model for infinite belief networks is only useful if it is possible to perform inference. The appeal of the CIBP prior is that it enables construction of a tractable Markov chain for inference. To do this sampling, we must add and remove edges from the network, consistent with the posterior equilibrium distribution. When adding a layer, we must sample additional layerwise model components.
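Before turning to the structural moves, the conjugate update of Eqns 4–6 is worth seeing in code. The sketch below is our own illustration (function and argument names are not from the paper), assuming the sigmoid σ(x) = 2/(1 + e^{−x}) − 1 of Section 2, whose inverse is σ^{−1}(u) = ln((1 + u)/(1 − u)):

```python
import math

def sigma_inv(u):
    # Inverse of sigma(x) = 2/(1 + exp(-x)) - 1, mapping (-1, 1) back to the reals.
    return math.log((1 + u) / (1 - u))

def gibbs_weight_posterior(u_above, u_below, xi, nu_below, rho_w, mu_w):
    """Posterior (mean, precision) for one weight W^{(m)}_{k,k'}, as in Eqns 4-5.

    u_above: parent values u^{(m)}_{n,k'} across the N data cases;
    u_below: child values u^{(m-1)}_{n,k}; xi: the child's activation with this
    edge's contribution removed (Eqn 6), per data case; nu_below: the child's
    precision; (mu_w, rho_w): the layer-wise Gaussian prior on weights.
    """
    rho_post = rho_w + nu_below * sum(a * a for a in u_above)
    mu_post = (rho_w * mu_w
               + nu_below * sum(a * (sigma_inv(b) - x)
                                for a, b, x in zip(u_above, u_below, xi))) / rho_post
    return mu_post, rho_post

# With all parent values zero, the posterior falls back to the prior.
mu, rho = gibbs_weight_posterior([0.0, 0.0], [0.5, -0.5], [0.0, 0.0],
                                 nu_below=1.0, rho_w=2.0, mu_w=0.7)
```

A Gibbs sweep would then draw the new weight from N(mu, 1/rho) and move on; the bias and precision updates of Eqns 7–11 follow the same conjugate pattern.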
When introducing an edge, we must draw a weight for it from the prior. If this new edge introduces a previously-unseen hidden unit, we must draw a bias for it and also draw its deeper cascading connections from the prior. Finally, we must sample the N new hidden unit states for any new unit we introduce.

We iterate over each layer that connects to the visible units. Within each layer m ≥ 0, we iterate over the connected units. Sampling the edges incident to the kth unit in layer m has two phases. First, we iterate over each connected unit in layer m + 1, indexed by k′. We calculate η^{(m)}_{−k,k′}, which is the number of nonzero entries in the k′th column of Z^{(m+1)}, excluding any entry in the kth row. If η^{(m)}_{−k,k′} is zero, we call the unit k′ a singleton parent, to be dealt with in the second phase. If η^{(m)}_{−k,k′} is nonzero, we introduce (or keep) the edge from unit u^{(m+1)}_{k′} to u^{(m)}_k with Bernoulli probability
$$p\big(Z_{k,k'}^{(m+1)} = 1 \,\big|\, \Omega \setminus Z_{k,k'}^{(m+1)}\big) = \frac{1}{\mathcal{Z}} \Bigg[\frac{\eta_{-k,k'}^{(m)}}{K^{(m)} + \beta^{(m)} - 1}\Bigg] \prod_{n=1}^{N} p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,k'}^{(m+1)} = 1, \Omega \setminus Z_{k,k'}^{(m+1)}\big),$$
$$p\big(Z_{k,k'}^{(m+1)} = 0 \,\big|\, \Omega \setminus Z_{k,k'}^{(m+1)}\big) = \frac{1}{\mathcal{Z}} \Bigg[1 - \frac{\eta_{-k,k'}^{(m)}}{K^{(m)} + \beta^{(m)} - 1}\Bigg] \prod_{n=1}^{N} p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,k'}^{(m+1)} = 0, \Omega \setminus Z_{k,k'}^{(m+1)}\big),$$
where $\mathcal{Z}$ is the appropriate normalization constant.

In the second phase, we consider deleting connections to singleton parents of unit k, or adding new singleton parents. We do this via a Metropolis–Hastings operator using a birth/death process. If there are currently K_◦ singleton parents, then with probability 1/2 we propose adding a new one by drawing it recursively from deeper layers, as above.
We accept the proposal to insert a connection to this new parent unit with M–H acceptance ratio
$$a_{\text{mh-insert}} = \frac{\alpha^{(m)} \beta^{(m)}}{(K_\circ + 1)^2 \big(\beta^{(m)} + K^{(m)} - 1\big)} \prod_{n=1}^{N} \frac{p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,j}^{(m+1)} = 1, \Omega \setminus Z_{k,j}^{(m+1)}\big)}{p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,j}^{(m+1)} = 0, \Omega \setminus Z_{k,j}^{(m+1)}\big)}.$$
If we do not propose to insert a unit and K_◦ > 0, then with probability 1/2 we select uniformly from among the singleton parents of unit k and propose removing the connection to it. We accept the proposal to remove the jth one with M–H acceptance ratio
$$a_{\text{mh-remove}} = \frac{K_\circ^2 \big(\beta^{(m)} + K^{(m)} - 1\big)}{\alpha^{(m)} \beta^{(m)}} \prod_{n=1}^{N} \frac{p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,j}^{(m+1)} = 0, \Omega \setminus Z_{k,j}^{(m+1)}\big)}{p\big(u_{n,k}^{(m)} \,\big|\, Z_{k,j}^{(m+1)} = 1, \Omega \setminus Z_{k,j}^{(m+1)}\big)}.$$
After these phases, chains of units that are not ancestors of the visible units can be discarded. Notably, this birth/death operator samples from the IBP posterior with a non-truncated equilibrium distribution, even without conjugacy. Unlike the stick-breaking approach of Teh et al. [2007], it allows use of the two-parameter IBP, which is important to this model.

4.5. Sampling from CIBP hyperparameters. When applying this model to data, it is infrequently the case that we would have a good a priori idea of what the appropriate IBP parameters should be. These control the width and sparsity of the network, and while we might have good initial guesses for the lowest layer, in general we would like to infer {α^{(m)}, β^{(m)}} as part of the larger inference procedure. This is straightforward in the fully-Bayesian MCMC procedure we have constructed, and it does not differ markedly from hyperparameter inference in standard IBP models when conditioning on Z^{(m)}. As in some other nonparametric models (e.g.
Tokdar [2006] and Rasmussen and Williams [2006]), we have found that light-tailed priors on the hyperparameters help ensure that the model stays in reasonable states.

5. Reconstructing Images. We applied the model and MCMC-based inference procedure to three image data sets: the Olivetti faces, the MNIST digits and the Frey faces. We used these data to analyze the structures and sparsity that arise in the model posterior. To get a sense of the utility of the model, we constructed a missing-data problem using held-out images from each set. We removed the bottom halves of the test images and asked the model to reconstruct the missing data, conditioned on the top half. The prediction itself was done by integrating out the parameters and structure via MCMC.

Olivetti Faces. The Olivetti faces data [Samaria and Harter, 1994] consist of 400 64 × 64 grayscale images of the faces of 40 distinct subjects. We divided these into 350 training data and 50 test data, selected randomly. This data set is an appealing test because it has few examples, but many dimensions. Figure 4a shows six bottom-half test set reconstructions on the right, compared to the ground truth on the left.

[Fig 4: Olivetti faces. a) Test images on the left, with reconstructed bottom halves on the right. b) Sixty features learned in the bottom layer, where black shows absence of an edge. Note the learning of sparse features corresponding to specific facial structures such as mouth shapes, noses and eyebrows. c) Raw predictive fantasies. d) Feature activations from individual units in the second hidden layer.]

Figure 4b shows a subset of sixty weight patterns from a posterior sample of the structure, with black indicating that no edge is present from that hidden unit to the visible unit (pixel).
The algorithm is clearly assigning hidden units to specific and interpretable features, such as mouth shapes, the presence of glasses or facial hair, and skin tone, while largely ignoring the rest of the image. Figure 4c shows ten pure fantasies from the model, easily generated in a directed acyclic belief network. Figure 4d shows the result of activating individual units in the second hidden layer, while keeping the rest unactivated, and propagating the activations down to the visible pixels. This provides an idea of the image space spanned by the principal components of these deeper units. A typical posterior network had three hidden layers, with about 70 units in each layer.

MNIST Digit Data. We used a subset of the MNIST handwritten digit data [LeCun et al., 1998] for training, consisting of 50 28 × 28 examples of each of the ten digits. We used an additional ten examples of each digit for test data. In this case, the lower-level features are extremely sparse, as shown in Figure 5b, and the deeper units are simply activating sets of blobs at the pixel level. This is shown also by activating individual units at the deepest layer, as shown in Figure 5c. Test reconstructions are shown in Figure 5a. A typical network had three hidden layers, with approximately 120 units in the first, 100 in the second and 70 in the third. The binary matrices Z^{(0)}, Z^{(1)}, and Z^{(2)} are shown in Figure 5d.

[Fig 5: MNIST digits. a) Eight pairs of test reconstructions, with the bottom half of each digit missing. The truth is the left image in each pair. b) 120 features learned in the bottom layer, where black indicates that no edge exists. c) Activations in pixel space resulting from activating individual units in the deepest layer. d) Samples from the posterior of Z^{(0)}, Z^{(1)} and Z^{(2)} (transposed).]
Fig 6: Frey faces. a) Eight pairs of test reconstructions, with the bottom half of each face missing. The truth is the left image in each pair. b) 260 features learned in the bottom layer, where black indicates that no edge exists.

Frey Faces. The Frey faces data¹ are 1965 20 × 28 grayscale video frames of a single face with different expressions. We divided these into 1865 training data and 100 test data, selected randomly. While typical posterior samples of the network again used three hidden layers, the networks for these data tended to be much wider and more densely connected. In the bottom layer, as shown in Figure 6b, a typical hidden unit would connect to many pixels. We attribute this to global correlation effects arising because every image comes from a single person. Typical widths were 260 units in the bottom layer, 120 units in the second hidden layer, and 35 units in the deepest layer.

In all three experiments, our MCMC sampler appeared to mix well and began to find reasonable reconstructions after a few hours of CPU time. It is interesting to note that the learned sparse connection patterns in Z^(0) varied from local (MNIST), through intermediate (Olivetti), to global (Frey), despite identical hyperpriors on the IBP parameters. This strongly suggests that flexible priors on structures are needed to adequately capture the statistics of different data sets.

¹ http://www.cs.toronto.edu/~roweis/data.html

6. Discussion. This paper unites two areas of research, nonparametric Bayesian methods and deep belief networks, to provide a novel nonparametric perspective on the general problem of learning the structure of directed deep belief networks with hidden units. We addressed three outstanding issues that surround deep belief networks.
First, we allowed the units to have different operating regimes and infer appropriate local representations that range from discrete binary behavior to nonlinear continuous behavior. Second, we provided a way for a deep belief network to contain an arbitrary number of hidden units arranged in an arbitrary number of layers. This structure enables the hidden units to have nontrivial joint distributions. Third, we presented a method for inferring the appropriate directed graph structure of a deep belief network. To address these issues, we introduced a novel cascading extension to the Indian buffet process, the cascading Indian buffet process (CIBP), and proved convergence properties that make it useful as a Bayesian prior distribution for a sequence of infinite binary matrices.

This work can be viewed as an infinite multilayer generalization of the density network [MacKay, 1995], and also as part of a more general literature on learning structure in probabilistic networks. With a few exceptions (e.g., Beal and Ghahramani [2006], Elidan et al. [2000], Friedman [1998], Ramachandran and Mooney [1998]), most previous work on learning the structure of belief networks has focused on the case where all units are observed [Buntine, 1991, Friedman and Koller, 2003, Heckerman et al., 1995, Koivisto and Sood, 2004]. The framework presented in this paper not only allows for an unbounded number of hidden units, but fundamentally couples the model for the number and behavior of the units with the nonparametric model for the structure of the infinite directed graph. Rather than comparing structures by evaluating marginal likelihoods of different models, our nonparametric approach makes it possible to do inference in a single model with an unbounded number of units and layers, thereby learning effective model complexity.
This approach is more appealing both computationally and philosophically. There are a variety of future research paths that can potentially stem from the model we have presented here. As we have presented it, we do not expect that our MCMC-based unsupervised inference scheme will be competitive on supervised tasks with extensively-tuned discriminative models based on variants of maximum-likelihood learning. However, we believe that this model can inform choices for network depth, layer size and edge structure in such networks and will inspire further research into flexible nonparametric network models.

Acknowledgements. The authors wish to thank Brendan Frey, David MacKay, Iain Murray and Radford Neal for valuable discussions. RPA is funded by the Canadian Institute for Advanced Research.

References.

M. J. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4):793-832, 2006.
W. Buntine. Theory refinement on Bayesian networks. In Proc. of the 7th Annual Conference on Uncertainty in Artificial Intelligence, pages 52-60, 1991.
G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: A structure-based approach. In Advances in Neural Information Processing Systems 13, 2000.
G. Fayolle, V. A. Malyshev, and M. V. Menshikov. Topics in the Constructive Theory of Countable Markov Chains. Cambridge University Press, Cambridge, UK, 2008.
B. J. Frey. Continuous sigmoidal belief networks trained using slice sampling. In Advances in Neural Information Processing Systems 9, 1997.
B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193-213, 1999.
N. Friedman. The Bayesian structural EM algorithm.
In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, 1998.
N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95-125, 2003.
Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models. In Bayesian Statistics 8, pages 201-226. Oxford University Press, Oxford, 2007.
T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 18, pages 475-482, 2006.
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.
G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, July 2006.
M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549-573, 2004.
N. Le Roux and Y. Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631-1649, 2008.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278-2324, 1998.
J. S. Liu, F. Liang, and W. H. Wong. The use of multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121-134, 2000.
D. J. C. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, Section A, 354(1):73-80, 1995.
R. M. Neal.
Connectionist learning in belief networks. Artificial Intelligence, 56:71-113, July 1992.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
S. Ramachandran and R. J. Mooney. Theory refinement of Bayesian networks with hidden variables. In Proceedings of the 15th International Conference on Machine Learning, pages 454-462, 1998.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, 1994.
E. Seneta and D. Vere-Jones. On quasi-stationary distributions in discrete-time Markov chains with a denumerable infinity of states. Journal of Applied Probability, 3(2):403-434, December 1966.
Y.-W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proc. of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
S. T. Tokdar. Exploring Dirichlet Mixture and Logistic Gaussian Process Priors in Density Estimation, Regression and Sufficient Dimension Reduction. PhD thesis, Purdue University, August 2006.
F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the 22nd Conference in Uncertainty in Artificial Intelligence, 2006.

APPENDIX A: PROOF OF GENERAL CIBP TERMINATION

In the main paper, we discussed that the cascading Indian buffet process for fixed and finite α and β eventually reaches a restaurant in which the customers choose no dishes. Every deeper restaurant also has no dishes.
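This termination behavior can also be checked empirically. Under the CIBP, the number of dishes in restaurant m+1 given K^(m) dishes in restaurant m is Poisson with mean α Σ_{k=1}^{K^(m)} β/(k + β - 1), as in Equation (12). The following simulation is our own illustrative sketch, not the authors' code; it runs the width recursion forward and checks that every sampled chain is absorbed at zero, i.e. that the sampled network is finite.

```python
import numpy as np

def simulate_cibp_widths(alpha, beta, k0, rng, max_depth=10_000):
    """Simulate the layer-width Markov chain of the CIBP.

    Each deeper 'restaurant' draws
        K_new ~ Poisson(alpha * sum_{k=1}^{K} beta / (k + beta - 1)),
    and the chain terminates when the width K hits the absorbing
    state 0. Returns the sequence of widths.
    """
    widths = [k0]
    k = k0
    for _ in range(max_depth):
        if k == 0:
            break
        rate = alpha * sum(beta / (kp + beta - 1.0)
                           for kp in range(1, k + 1))
        k = int(rng.poisson(rate))
        widths.append(k)
    return widths

rng = np.random.default_rng(2)
chains = [simulate_cibp_widths(alpha=1.0, beta=1.0, k0=5, rng=rng)
          for _ in range(200)]
# Every simulated chain reached the absorbing state: a finite network.
assert all(c[-1] == 0 for c in chains)
print("mean depth:", np.mean([len(c) for c in chains]))
```

With β = 1 the Poisson mean is roughly α log K^(m), so the chain drifts downward for large widths, which is the intuition formalized by the Foster-Lyapunov argument below.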
Here we show a more general result, for IBP parameters that vary with depth, written α^(m) and β^(m). Let there be an inhomogeneous Markov chain M with state space the natural numbers. Let m index time and let the state at time m be denoted K^(m). The initial state K^(0) is finite. The probability mass function describing the transition distribution for M at time m is given by

\[
p\big(K^{(m+1)}=k \,\big|\, K^{(m)}, \alpha^{(m)}, \beta^{(m)}\big)
= \frac{\lambda_m^{k}}{k!}\, e^{-\lambda_m},
\qquad
\lambda_m = \alpha^{(m)} \sum_{k'=1}^{K^{(m)}} \frac{\beta^{(m)}}{k' + \beta^{(m)} - 1},
\tag{12}
\]

i.e., a Poisson distribution with mean λ_m.

Theorem A.1. If there exist some \(\bar{\alpha} < \infty\) and \(\bar{\beta} < \infty\) such that \(\alpha^{(m)} < \bar{\alpha}\) and \(\beta^{(m)} < \bar{\beta}\) for all m, then \(\lim_{m\to\infty} p(K^{(m)} = 0) = 1\).

Proof. Let N₊ denote the positive integers. The states N₊ are a communicating class for the Markov chain (it is possible to eventually reach any member of the class from any other member), and each K^(m) ∈ N₊ has a nonzero probability of transitioning to the absorbing state K^(m+1) = 0, i.e. p(K^(m+1) = 0 | K^(m)) > 0 for all K^(m). If, conditioned on nonabsorption, the Markov chain has a stationary distribution (is quasi-stationary), then it reaches absorption in finite time with probability one. Heuristically, this is the requirement that, conditioned on not yet having reached a restaurant with no dishes, the number of dishes in deeper restaurants does not explode.

The quasi-stationary condition can be met by showing that the states N₊ are positive recurrent. We use the Foster-Lyapunov stability criterion (FLSC) to show positive recurrence of N₊. The FLSC is met if there exists some function \(L(\cdot): \mathbb{N}_+ \to \mathbb{R}_+\) such that for some ε > 0 and some finite B ∈ N₊,

\[
\sum_{k=1}^{\infty} p\big(K^{(m+1)}=k \,\big|\, K^{(m)}\big)\,\big[L(k) - L(K^{(m)})\big] < -\epsilon
\quad \text{for } K^{(m)} > B,
\tag{13}
\]

\[
\sum_{k=1}^{\infty} p\big(K^{(m+1)}=k \,\big|\, K^{(m)}\big)\, L(k) < \infty
\quad \text{for } K^{(m)} \le B.
\tag{14}
\]
For the Lyapunov function L(k) = k, the first condition is equivalent to

\[
\left( \alpha^{(m)} \sum_{k=1}^{K^{(m)}} \frac{\beta^{(m)}}{k + \beta^{(m)} - 1} \right) - K^{(m)} < -\epsilon.
\tag{15}
\]

We observe that

\[
\alpha^{(m)} \sum_{k=1}^{K^{(m)}} \frac{\beta^{(m)}}{k + \beta^{(m)} - 1}
< \bar{\alpha} \sum_{k=1}^{K^{(m)}} \frac{\bar{\beta}}{k + \bar{\beta} - 1}
\tag{16}
\]

for all K^(m) > 0. Thus, the first condition is satisfied for any B that satisfies the condition for \(\bar{\alpha}\) and \(\bar{\beta}\). That such a B exists for any finite \(\bar{\alpha}\) and \(\bar{\beta}\) can be seen from the equivalent condition

\[
\left( \bar{\alpha} \sum_{k=1}^{K^{(m)}} \frac{\bar{\beta}}{k + \bar{\beta} - 1} \right) - K^{(m)} < -\epsilon
\quad \text{for } K^{(m)} > B.
\tag{17}
\]

As the first term grows only roughly logarithmically in K^(m), there exists some finite B that satisfies this inequality. The second FLSC condition is trivially satisfied by the observation that Poisson distributions have a finite mean.

Ryan P. Adams
Department of Computer Science
University of Toronto
10 King's College Road
Toronto, Ontario M5S 3G4, Canada
E-mail: rpa@cs.toronto.edu

Hanna M. Wallach
Department of Computer Science
University of Massachusetts Amherst
140 Governors Drive
Amherst, MA 01003, USA
E-mail: wallach@cs.umass.edu

Zoubin Ghahramani
Department of Engineering
University of Cambridge
Trumpington Street
Cambridge CB2 1PZ, UK
E-mail: zoubin@eng.cam.ac.uk
