Hierarchical Models as Marginals of Hierarchical Models

Guido Montúfar
Max Planck Institute for Mathematics in the Sciences
montufar@mis.mpg.de

Johannes Rauh
Department of Mathematics and Statistics, York University
jarauh@yorku.ca

March 8, 2016

Abstract

We investigate the representation of hierarchical models in terms of marginals of other hierarchical models with smaller interactions. We focus on binary variables and marginals of pairwise interaction models whose hidden variables are conditionally independent given the visible variables. In this case the problem is equivalent to the representation of linear subspaces of polynomials by feedforward neural networks with soft-plus computational units. We show that every hidden variable can freely model multiple interactions among the visible variables, which allows us to generalize and improve previous results. In particular, we show that a restricted Boltzmann machine with less than $[2(\log(v)+1)/(v+1)]\,2^v - 1$ hidden binary variables can approximate every distribution of $v$ visible binary variables arbitrarily well, compared to $2^{v-1}-1$ from the best previously known result.

Keywords: hierarchical model, restricted Boltzmann machine, interaction model, connectionism, graphical model

1 Introduction

Consider a finite set $V$ of random variables. A hierarchical log-linear model is a set of joint probability distributions that can be written as products of interaction potentials, as $p(x) = \prod_{\Lambda} \psi_\Lambda(x)$, where $\psi_\Lambda(x) = \psi_\Lambda(x_\Lambda)$ only depends on the subset $\Lambda$ of variables and where the product runs over a fixed family of such subsets. By introducing hidden variables, it is possible to express the same probability distributions in terms of potentials which involve only small sets of variables, as $p(x) = \sum_y \prod_\lambda \psi_\lambda(x, y)$, with small sets $\lambda$.
Using small interactions is a central idea in the context of connectionistic models, where the sets $\lambda$ are often restricted to have cardinality two. Due to the simplicity of their local characteristics, these models are particularly well suited for Gibbs sampling [4]. The representation, or explanation, of complex interactions among observed variables in terms of hidden variables is also related to the study of common ancestors [16].

We are interested in sufficient and necessary conditions on the number of hidden variables, their values, and the interaction structures, under which the visible marginals are flexible enough to represent any distribution from a given hierarchical model. Many problems can be formulated as special cases of this general problem. One example is the problem of calculating the smallest number of layers of variables that a deep Boltzmann machine needs in order to represent any probability distribution [9].

In this article, we focus on the case that all variables are binary. For the hierarchical models with hidden variables, we restrict our attention to models involving only pairwise interactions and whose hidden variables are conditionally independent given the visible variables (no direct interactions between the hidden variables). A prominent example of this type of model is the restricted Boltzmann machine, which has full bipartite interactions between the visible and hidden variables. The representational power of restricted Boltzmann machines has been studied assiduously; see, e.g., [3, 17, 7, 10]. The free energy function of such a model is a sum of soft-plus computational units $x \mapsto \log(1 + \exp(\sum_{i \in V} w_i x_i + c))$. On the other hand, the energy function of a fully observable hierarchical model with binary variables is a polynomial, with monomials corresponding to pure interactions.
Since any function of binary variables can be expressed as a polynomial, the task is then to characterize the polynomials computable by soft-plus units.

Younes [17] showed that a hierarchical model with $N$ binary variables and a total of $M$ pure higher order interactions (among three or more variables) can be represented as the visible marginal of a pairwise interaction model with $M$ hidden binary variables. In Younes' construction, each pure interaction is modeled by one hidden binary variable that interacts pairwise with each of the involved visible variables. In fact, he shows that this replacement can be accomplished without increasing the number of model parameters, by imposing linear constraints on the coupling strengths of the hidden variable. In this work we investigate ways of squeezing more degrees of freedom out of each hidden variable. An indication that this should be possible is the fact that the full interaction model, for which $M = 2^N - \binom{N}{2} - N - 1$, can be modeled by a pairwise interaction model with $2^{N-1} - 1$ hidden variables [10]. Indeed, by controlling groups of polynomial coefficients at a time, we show that in general fewer than $M$ hidden variables are sufficient.

A special case of hierarchical models with hidden variables are mixtures of hierarchical models. The smallest mixtures of hierarchical models that contain other hierarchical models have been studied in [8]. The approach followed there is different and complementary to our analysis of soft-plus polynomials. For the necessary conditions, the idea there is to compare the possible support sets of the limit distributions of both models. For the sufficient conditions, the idea is to find a small $S$-set covering of the set of elementary events. An $S$-set of a probability model is a set of elementary events such that every distribution supported in that set is a limit distribution from the model.
Another type of hierarchical models with hidden variables are tree models. The geometry of binary tree models has been studied in [18] in terms of moments and cumulants. That analysis bears some relation to ours in that it also elaborates on Möbius inversions.

This paper is organized as follows. Section 2 introduces hierarchical models and formalizes our problem in the light of previous results. Section 3 pursues a characterization of the polynomials that can be represented by soft-plus units. Section 4 applies this characterization to study the representation of hierarchical models in terms of pairwise interaction models with hidden variables. This section addresses principally restricted Boltzmann machines. Section 5 offers our conclusions and outlook.

2 Preliminaries

This section introduces hierarchical models, with and without hidden variables, formalizes the problem that we address in this paper, and presents motivating prior results.

2.1 Hierarchical Models

Consider a finite set $V$ of variables with finitely many joint states $x = (x_i)_{i \in V} \in \mathcal{X} = \times_{i \in V} \mathcal{X}_i$. We write $v = |V|$ for the cardinality of $V$. For a given set $S \subseteq 2^V$ of subsets of $V$ let

$$V_{\mathcal{X},S} := \Big\{ g(x) = \sum_{\Lambda \in S} g_\Lambda(x) : g_\Lambda(x) = g_\Lambda(x_\Lambda) \Big\}.$$

This is the linear subspace of $\mathbb{R}^{\mathcal{X}}$ spanned by functions $g_\Lambda$ that only depend on sets of variables $\Lambda \in S$. The hierarchical model of probability distributions on $\mathcal{X}$ with interactions $S$ is the set

$$\mathcal{E}_{\mathcal{X},S} := \Big\{ p(x) = \frac{1}{Z(g)} \exp(g(x)) : g \in V_{\mathcal{X},S} \Big\}, \tag{1}$$

where $Z(g) = \sum_{x' \in \mathcal{X}} \exp(g(x'))$ is a normalizing factor. We call

$$E(x) = g(x) = \sum_{\Lambda \in S} g_\Lambda(x) \tag{2}$$

the energy function of the corresponding probability distribution. For convenience, in all that follows we assume that $S$ is a simplicial complex, meaning that $A \in S$ implies $B \in S$ for all $B \subseteq A$. Furthermore, we assume that the union of elements of $S$ equals $V$.
In the case of binary variables, $\mathcal{X}_i = \{0,1\}$ for all $i \in V$, the energy can be written as a polynomial, as

$$E(x) = \sum_{\Lambda \in S} J_\Lambda \prod_{i \in \Lambda} x_i.$$

Here, $J_\Lambda \in \mathbb{R}$, $\Lambda \in S$, are the interaction weights that parametrize the model.

2.2 Hierarchical Models with Hidden Variables

Consider an additional set $H$ of variables with finitely many joint states $y = (y_j)_{j \in H} \in \mathcal{Y} = \times_{j \in H} \mathcal{Y}_j$. We write $h = |H|$ for the cardinality of $H$. For a simplicial complex $T \subseteq 2^{V \cup H}$, let $V_{\mathcal{X} \times \mathcal{Y},T} \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be the linear subspace of functions of the form $g(x,y) = \sum_{\lambda \in T} g_\lambda(x,y)$, $g_\lambda(x,y) = g_\lambda((x,y)_\lambda)$. The marginal on $\mathcal{X}$ of the hierarchical model $\mathcal{E}_{\mathcal{X} \times \mathcal{Y},T}$ is the set

$$\mathcal{M}_{\mathcal{X} \times \mathcal{Y},T} := \Big\{ p(x) = \frac{1}{Z(g)} \sum_{y \in \mathcal{Y}} \exp(g(x,y)) : g \in V_{\mathcal{X} \times \mathcal{Y},T} \Big\}, \tag{3}$$

where $Z(g) = \sum_{x' \in \mathcal{X}, y' \in \mathcal{Y}} \exp(g(x',y'))$ is again a normalizing factor. The free energy of a probability distribution from $\mathcal{M}_{\mathcal{X} \times \mathcal{Y},T}$ is given by

$$F(x) = \log \sum_{y \in \mathcal{Y}} \exp\Big( \sum_{\lambda \in T} g_\lambda(x,y) \Big). \tag{4}$$

Here and throughout "log" denotes the natural logarithm. In the case of binary visible variables, $\mathcal{X}_i = \{0,1\}$ for all $i \in V$, the free energy (4) can be written as a polynomial, as

$$F(x) = \sum_{B \subseteq V} K_B \prod_{i \in B} x_i,$$

where the coefficients can be computed from Möbius' inversion formula as

$$K_B = \sum_{C \subseteq B} (-1)^{|B \setminus C|} \log \sum_{y \in \mathcal{Y}} \exp\Big( \sum_{\lambda \in T} g_\lambda\big((1_C, 0_{V \setminus C}), y\big) \Big), \quad B \in 2^V. \tag{5}$$

Here $(1_C, 0_{V \setminus C}) \in \{0,1\}^V$ is the vector with value 1 in the entries $i \in C$ and value 0 in the entries $i \notin C$. If there are no direct interactions between hidden variables, i.e., if $|\lambda \cap H| \le 1$ for all $\lambda \in T$, then the sum over $y$ factorizes and the free energy (4) can be written as

$$F(x) = \sum_{\lambda : \lambda \cap H = \emptyset} g_\lambda(x) + \sum_{j \in H} \log \sum_{y_j \in \mathcal{Y}_j} \exp\Big( \sum_{\lambda \in T : j \in \lambda} g_\lambda(x, y_j) \Big). \tag{6}$$

Particularly interesting are the models with full bipartite interactions between the set of visible variables and the set of hidden variables, i.e., models with $T = \{ \lambda \subseteq V \cup H : |\lambda \cap V| \le 1, |\lambda \cap H| \le 1 \}$, called restricted Boltzmann machines (with discrete variables).

2.3 Problem and Previous Results

In general the marginal of a hierarchical model is not a hierarchical model. However, one may ask which hierarchical models are contained in the marginal of another hierarchical model. To represent a hierarchical model in terms of the marginal of another hierarchical model, we need to represent (1) in terms of (3). Equivalently, we need to represent all possible energy functions in terms of free energies. Given a set of visible variables $V$ and a simplicial complex $S \subseteq 2^V$, what conditions on the set of hidden variables $H$ and the simplicial complex $T \subseteq 2^{V \cup H}$ are sufficient and necessary in order for any function $E$ of the form (2) to be representable in terms of some function $F$ of the form (4)?

We would like to arrive at a result that generalizes the following.

• A restricted Boltzmann machine with $h$ hidden binary variables can approximate any probability distribution from a binary hierarchical model $\mathcal{E}_S$ with $|\{ \Lambda \in S : |\Lambda| > 1 \}| \le h$ arbitrarily well [17].
• The restricted Boltzmann machine with $h = 2^{v-1} - 1$ hidden binary variables can approximate any probability distribution of $v$ binary variables arbitrarily well [10].

Our Theorem 11 in Section 4 improves and generalizes these statements. The basis of this result are soft-plus polynomials, which we discuss in the following section.

Figure 1: Illustration of a soft-plus computational unit.
The possible inputs $\mathcal{X} = \{0,1\}^V$, corresponding to the vertices of the unit $V$-cube, are mapped to the real line by an affine map $x \mapsto w^\top x + c$, and then the soft-plus non-linearity $f : s \mapsto \log(1 + \exp(s))$ is applied.

3 Soft-plus Polynomials

Consider a function of the form

$$\phi : \{0,1\}^V \to \mathbb{R}; \quad x \mapsto \log(1 + \exp(w^\top x + c)), \tag{7}$$

parametrized by $w = (w_i)_{i \in V} \in \mathbb{R}^V$ and $c \in \mathbb{R}$. We regard $\phi$ as a soft-plus computational unit, which integrates each input vector $x \in \{0,1\}^V$ into a scalar via $x \mapsto w^\top x + c$ and applies the soft-plus non-linearity $f : \mathbb{R} \to \mathbb{R}_+$; $s \mapsto \log(1 + \exp(s))$. See Figure 1 for an illustration of this function. In view of Equation (6), the function $\phi$ corresponds to the free energy added by one hidden binary variable interacting pairwise with the visible binary variables in $V$. The parameters $w_i$, $i \in V$, correspond to the pair interaction weights and $c$ to the bias of the hidden variable.

What kinds of polynomials on $\{0,1\}^V$ can be represented by soft-plus units? Following Equation (5), the polynomial coefficients of $\phi$ are given by

$$K_B(w, c) = \sum_{C \subseteq B} (-1)^{|B \setminus C|} \log\Big( 1 + \exp\Big( \sum_{i \in C} w_i + c \Big) \Big), \quad B \in 2^V. \tag{8}$$

For each $B \in 2^V$ this is an alternating sum of the values $\phi(x)$ of the soft-plus unit on the input vectors $x \in \{0,1\}^V$ with $\operatorname{supp}(x) \subseteq B$. In particular, $K_B$ is independent of the parameters $w_i$, $i \notin B$. We will use the shorthand notation $w_B$ for $(w_i)_{i \in B}$. Note that, if $w_i = 0$ for some $i \in V$, then $K_C = 0$ for all $C \in 2^V$ with $i \in C$.

In the following we focus on the description of the possible values of the highest degree coefficients. For example, Younes [17] showed that a soft-plus unit can represent a polynomial with an arbitrary leading coefficient:

Lemma 1 (Lemma 1 in [17]). Let $B \subseteq V$ and $w_i = 0$ for $i \notin B$. Then, for any $J_B \in \mathbb{R}$, there is a choice of $w_B \in \mathbb{R}^B$ and $c \in \mathbb{R}$ such that $K_B = J_B$.
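The coefficient formula (8) can be evaluated directly for small $V$. The following is a minimal sketch, not taken from the paper: the helper names `softplus`, `subsets`, `mobius_coefficients`, and `evaluate_polynomial` are hypothetical, and weights are passed as a dictionary from variable index to $w_i$. It computes all $K_B$ by the alternating sum and lets one check that the resulting polynomial reproduces $\phi$ on every binary input.

```python
import itertools
import math


def softplus(s):
    # Numerically stable log(1 + exp(s)).
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def subsets(items):
    # All subsets of `items`, as frozensets.
    items = list(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)


def mobius_coefficients(w, c):
    # K_B(w, c) from Equation (8) for every B contained in the index set of w.
    coeffs = {}
    for B in subsets(w):
        total = 0.0
        for C in subsets(B):
            sign = (-1) ** (len(B) - len(C))
            total += sign * softplus(sum(w[i] for i in C) + c)
        coeffs[B] = total
    return coeffs


def evaluate_polynomial(coeffs, x):
    # sum_B K_B prod_{i in B} x_i for a binary input x (dict index -> 0/1).
    return sum(K for B, K in coeffs.items() if all(x[i] == 1 for i in B))


# Example unit with three inputs; w_3 = 0, so every K_C with 3 in C vanishes.
w = {1: 0.7, 2: -1.3, 3: 0.0}
c = 0.2
coeffs = mobius_coefficients(w, c)
```

The double loop over subsets costs on the order of $3^{|V|}$ soft-plus evaluations, so this is only meant for illustration on a handful of variables.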
The idea of Younes' proof of Lemma 1 is to choose all non-zero $w_i$ of equal magnitude. This simplifies the calculations and reduces the number of free parameters to one. Our goal is to show that a soft-plus unit can actually freely model several polynomial coefficients at the same time. Our approach to simplifying the Möbius inversion formula (8) is to choose the parameters $w$ and $c$ in such a way that the function $\phi$ has many zeros. Clearly this can only be done in an approximate way, since the soft-plus function is strictly positive. Nevertheless, these approximations can be made arbitrarily accurate, since $\log(1 + \exp(s)) \le \exp(s)$ is arbitrarily close to zero for sufficiently large negative values of $s$.

Figure 2: Illustration of Lemma 2. Depicted is for each edge pair $(B, B')$ the set of coefficient pairs $(K_B, K_{B'}) \in \mathbb{R}^2$ of the polynomials $\sum_{C \subseteq V} K_C \prod_{i \in C} x_i$ expressible as $\log(1 + \exp(w^\top x + c))$. Shown is also the set of monomials of partial degree one and degree at most 4, partially ordered by variable inclusion.

We call a pair of sets $(B, B')$ an edge pair or a covering pair when $B \supsetneq B'$ and there is no set $C$ with $B \supsetneq C \supsetneq B'$. The next lemma shows that a soft-plus unit can jointly model the coefficients of an edge pair, at least in part. When the maximum degree $|B|$ is at most 3, the two coefficients are restricted by an inequality, but when $|B| \ge 4$, there are no such restrictions. The result is illustrated in Figure 2.

Lemma 2. Consider an edge pair $(B, B')$.
Depending on $|B|$, for any $\epsilon > 0$ there is a choice of $w_B \in \mathbb{R}^B$ and $c \in \mathbb{R}$ such that $\| (K_B, K_{B'}) - (J_B, J_{B'}) \| \le \epsilon$ if and only if

$$\begin{cases} J_{B'} \ge 0, -J_B, & \text{for } |B| = 1 \\ J_{B'} \ge 0, -J_B \ \text{ or } \ J_{B'} \le 0, -J_B, & \text{for } |B| = 2 \\ J_{B'} \ge 0, -J_B \ \text{ or } \ J_{B'} \le 0, -J_B, & \text{for } |B| = 3 \\ (J_B, J_{B'}) \in \mathbb{R}^2, & \text{for } |B| \ge 4. \end{cases}$$

Proof. This proof is deferred to Appendix A.

Remark 3. If $(B, B')$ is an edge pair with $|B| = 3$, then, despite having $|B| + 1 = 4$ parameters to vary ($w_i$, $i \in B$, and $c$), we can only determine the polynomial coefficients $K_B$ and $K_{B'}$ up to a certain inequality. We expect that the same is true in general: If we want to freely control $k$ polynomial coefficients, we need strictly more than $k$ parameters. Otherwise, the coefficients are restricted by some inequalities. This situation is common in models with hidden variables. In particular, mixture models often require many more parameters to eliminate such inequalities than expected from naïve parameter counting [8].

It is natural to ask whether it is possible to control other pairs of coefficients or even larger groups of coefficients. We discuss a simple example before proceeding with the analysis of this problem.

Example 4. Consider a soft-plus unit with two binary inputs. Write $f : s \mapsto \log(1 + \exp(s))$ for the soft-plus non-linearity and $f_0 = f(c)$, $f_1 = f(w_1 + c)$, $f_2 = f(w_2 + c)$, $f_{12} = f(w_1 + w_2 + c)$ for the values of the soft-plus unit on $\{0,1\}^2$. From Equation (8) it is easy to see that

$$\begin{aligned} K_\emptyset &= f_0 \ge 0 \\ K_{\{1\}} &= f_1 - f_0 \ge -K_\emptyset \\ K_{\{2\}} &= f_2 - f_0 \ge -K_\emptyset. \end{aligned}$$

Now let us investigate the quadratic coefficient $K_{\{1,2\}} = f_{12} - f_1 - f_2 + f_0$. Using the convexity of $f$ we find

$$\begin{aligned} 0 \le K_{\{1,2\}}, \quad & \text{if } K_{\{1\}}, K_{\{2\}} \ge 0 \\ 0 \le K_{\{1,2\}} \le -K_{\{1\}}, -K_{\{2\}}, \quad & \text{if } -K_{\{1\}}, -K_{\{2\}} \ge 0 \\ -K_{\{1\}} \le K_{\{1,2\}} \le 0, \quad & \text{if } K_{\{1\}}, -K_{\{2\}} \ge 0 \\ -K_{\{2\}} \le K_{\{1,2\}} \le 0, \quad & \text{if } -K_{\{1\}}, K_{\{2\}} \ge 0. \end{aligned}$$
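The inequalities of Example 4 can be spot-checked numerically. The sketch below is an illustration only (the function names, the sampling range $[-5, 5]$, and the tolerance are assumptions, not part of the paper): it draws random parameters $(w_1, w_2, c)$, computes $K_\emptyset, K_{\{1\}}, K_{\{2\}}, K_{\{1,2\}}$ from the four soft-plus values, and counts how many samples violate the stated sign cases.

```python
import math
import random


def softplus(s):
    # Numerically stable log(1 + exp(s)).
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def example4_coefficients(w1, w2, c):
    # (K_empty, K_1, K_2, K_12) for a soft-plus unit with two binary inputs.
    f0, f1 = softplus(c), softplus(w1 + c)
    f2, f12 = softplus(w2 + c), softplus(w1 + w2 + c)
    return f0, f1 - f0, f2 - f0, f12 - f1 - f2 + f0


def satisfies_example4(w1, w2, c, tol=1e-9):
    # Check the inequalities of Example 4 up to a small floating-point tolerance.
    k0, k1, k2, k12 = example4_coefficients(w1, w2, c)
    ok = k0 >= 0 and k1 >= -k0 - tol and k2 >= -k0 - tol
    if k1 >= 0 and k2 >= 0:
        ok = ok and k12 >= -tol
    if k1 <= 0 and k2 <= 0:
        ok = ok and -tol <= k12 <= min(-k1, -k2) + tol
    if k1 >= 0 and k2 <= 0:
        ok = ok and -k1 - tol <= k12 <= tol
    if k1 <= 0 and k2 >= 0:
        ok = ok and -k2 - tol <= k12 <= tol
    return ok


rng = random.Random(0)
samples = [tuple(rng.uniform(-5.0, 5.0) for _ in range(3)) for _ in range(10000)]
violations = sum(not satisfies_example4(*s) for s in samples)
```

A random search of this kind cannot prove the inequalities, which follow from the convexity and monotonicity of $f$; it only serves as a sanity check of the case distinction.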
Hence the computable polynomials have coefficient triples $(K_{\{1\}}, K_{\{2\}}, K_{\{1,2\}})$ enclosed in a polyhedral region of $\mathbb{R}^3$ as depicted in Figure 3. However, any pair $(K_{\{1\}}, K_{\{2\}}) \in \mathbb{R}^2$ is possible (for $K_\emptyset$ large enough).

The next lemma shows that a soft-plus unit can jointly model certain tuples of polynomial coefficients corresponding to $v - k + 1$ monomials of degree $k$. We call star tuple a set of the form $\{ B \cup \{j\} : j \in B' \}$, where $B, B' \subseteq V$ satisfy $B \cap B' = \emptyset$. Each element of the star tuple covers the set $B$. In the Hasse diagram of the power set $2^V$, the sets $B \cup \{j\}$, $j \in B'$, are the leaves of a star with root $B$.

Lemma 5. Consider any $B, B' \subseteq V$ with $B \cap B' = \emptyset$. Let $w_i = 0$ for $i \notin B \cup B'$. Then, for any $J_{B \cup \{j\}} \in \mathbb{R}$, $j \in B'$, and $\epsilon > 0$, there is a choice of $w_{B \cup B'} \in \mathbb{R}^{B \cup B'}$ and $c \in \mathbb{R}$ such that $|K_{B \cup \{j\}} - J_{B \cup \{j\}}| \le \epsilon$ for all $j \in B'$, and $|K_C| \le \epsilon$ for all $C \ne B, B \cup \{j\}$, $j \in B'$.

Proof of Lemma 5. Since $w_i = 0$ for $i \notin B \cup B'$, we have that $K_C = 0$ for all $C \not\subseteq B \cup B'$. We choose $c = -(|B| - \tfrac{1}{2})\omega$, $w_i = \omega$ for all $i \in B$, and $w_j = J_{B \cup \{j\}}$ for $j \in B'$. Choosing $\omega \gg \sum_{j \in B'} |w_j|$ yields $f(\sum_{i \in C} w_i + c) \approx 0$ for all $C \not\supseteq B$. In this case, $K_C \approx 0$, for all $B \not\subseteq C \subseteq B \cup B'$. Furthermore, for all $j \in B'$ we have

$$K_{B \cup \{j\}} \approx f\Big( \sum_{i \in B} w_i + w_j + c \Big) - f\Big( \sum_{i \in B} w_i + c \Big) = f(J_{B \cup \{j\}} + \tfrac{1}{2}\omega) - f(\tfrac{1}{2}\omega) \approx (J_{B \cup \{j\}} + \tfrac{1}{2}\omega) - (\tfrac{1}{2}\omega) = J_{B \cup \{j\}}.$$

Similarly, $K_{B \cup C} \approx 0$ for all $C \subseteq B'$ with $|C| \ge 2$. Note that $K_B \approx \tfrac{1}{2}\omega$.

Figure 3: Illustration of Example 4. Depicted is a region of $\mathbb{R}^3$, clipped to $[-1,1]^3$, which contains the coefficient triples $(K_{\{1\}}, K_{\{2\}}, K_{\{1,2\}}) \in \mathbb{R}^3$ of the polynomials computable by a soft-plus unit with two binary inputs. This region consists of 4 solid convex cones.

The intuition behind Lemma 5 is simple.
When $\sum_{i \in B} w_i + c \gg 1$, the values $w^\top x + c$, for $x$ with $x_i = 1$, $i \in B$, fall in a region where the soft-plus function is nearly linear. In turn, the soft-plus unit is nearly a linear function of $x_j$, $j \in B'$, with coefficients $w_j$, $j \in B'$.

Remark 6. Closely related to soft-plus units are rectified linear units, which compute functions of the form $\varphi : \{0,1\}^V \to \mathbb{R}$; $x \mapsto \max\{0, w^\top x + c\}$. In this case the non-linearity is $s \mapsto \max\{0, s\}$. This reflects precisely the zero/linear behavior of the soft-plus activation for large negative or positive values of $s$. Our polynomial descriptions are based on this behavior and hence they apply both to soft-plus and rectified linear units.

We close this section with a brief discussion of dependencies among coefficients. The next proposition gives a perspective on the possible values of the coefficient $K_B$, depending on $w_m$, once $w_{B \setminus \{m\}}$ and $c$ have been fixed.

Proposition 7. Let $(B, B')$ be an edge pair with $B' = B \setminus \{m\}$ and let $J_B \in \mathbb{R}$. For fixed $w_{B'} \in \mathbb{R}^{B'}$ and $c \in \mathbb{R}$, there is some $w_m \in \mathbb{R}$ such that $K_B = J_B$ if and only if a certain degree-$2^{|B'|-1}$ polynomial in one real variable has a positive root.

Proof of Proposition 7. Observe that $K_B(w, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c)$. Hence $K_B = J_B$ if and only if $K_{B'}(w_{B'}, c + w_m) = K_{B'}(w_{B'}, c) + J_B =: r$. We use the abbreviation $\tilde{t} = e^t$, which implies positivity. We have

$$K_{B'}(w_{B'}, c + w_m) = \sum_{C \subseteq B'} (-1)^{|B' \setminus C|} \log\Big( 1 + \exp\Big( \sum_{i \in C} w_i + c + w_m \Big) \Big) = \log\Big( \prod_{C \subseteq B'} \Big( 1 + \tilde{w}_m \tilde{c} \prod_{i \in C} \tilde{w}_i \Big)^{(-1)^{|B' \setminus C|}} \Big).$$

Now, $K_{B'}(w_{B'}, c + w_m) = r$ if and only if

$$\prod_{C \subseteq B'} \Big( 1 + \tilde{w}_m \tilde{c} \prod_{i \in C} \tilde{w}_i \Big)^{(-1)^{|B' \setminus C|}} = \tilde{r},$$

or, equivalently,

$$\prod_{\substack{C \subseteq B' : \\ |B' \setminus C| \text{ even}}} \Big( 1 + \tilde{w}_m \tilde{c} \prod_{i \in C} \tilde{w}_i \Big) - \tilde{r} \prod_{\substack{C \subseteq B' : \\ |B' \setminus C| \text{ odd}}} \Big( 1 + \tilde{w}_m \tilde{c} \prod_{i \in C} \tilde{w}_i \Big) = 0.$$
This is a polynomial of degree at most $2^{|B'|-1}$ in $\tilde{w}_m = e^{w_m}$.

This description implies various kinds of constraints. For example, by Descartes' rule of signs, a polynomial can only have positive roots if the sequence of polynomial coefficients, ordered by degree, has sign changes.

4 Conditionally Independent Hidden Variables

In the case of a bipartite graph between $V$ and $H$ with all variables binary, the hierarchical model (or rather its visible marginal) is a restricted Boltzmann machine, denoted $\mathrm{RBM}_{V,H}$. This model is illustrated in Figure 4. The free energy takes the form

$$F(x) = \sum_{i \in V} b_i x_i + \sum_{j \in H} \log\Big( 1 + \exp\Big( \sum_{i \in V} w_{ji} x_i + c_j \Big) \Big).$$

This is the sum of an arbitrary degree-one polynomial, with coefficients $b_i \in \mathbb{R}$, $i \in V$, and $h = |H|$ independent soft-plus units, with parameters $w_{ji} \in \mathbb{R}$, $j \in H$, $i \in V$, and $c_j \in \mathbb{R}$, $j \in H$. The free energy contributed by the hidden variables can be thought of as a feedforward network with soft-plus computational units. We can use each soft-plus unit to model a group of coefficients of any given polynomial, starting at the highest degrees. Using the results from Section 3 we arrive at the following representation result:

Theorem 8. Every distribution from a hierarchical model $\mathcal{E}_S$ on $\{0,1\}^V$ can be approximated arbitrarily well by distributions from $\mathrm{RBM}_{V,H}$ whenever there exist $h$ sets $\mathcal{B}_1, \ldots, \mathcal{B}_h \subseteq 2^V$ which cover $\{ \Lambda \in S : |\Lambda| \ge 2 \}$ in reverse inclusion order, where each $\mathcal{B}_j$ is a star tuple or an edge pair of sets of cardinality at least 3.

Figure 4: A restricted Boltzmann machine. The free energy contributed by the hidden units is a sum of independent soft-plus units.

Proof of Theorem 8. We need to express the possible energy functions of the hierarchical model as sums of independent soft-plus units plus linear terms.
This problem can be reduced to covering the appearing monomials of degree two or more by groups of coefficients that can be jointly controlled by soft-plus units. In view of Lemmas 2 and 5, edge pairs with sets of cardinality 3 or more and star tuples can be jointly controlled. We start with the highest degrees and cover monomials downwards, because setting the coefficients of a given group may produce uncontrolled values for the coefficients of smaller monomials. Since $S$ is a simplicial complex, we only need to cover the elements of $S$.

Finding a minimal covering is in general a hard combinatorial problem. In the following we derive upper bounds for the $k$-interaction model, which is the hierarchical model $\mathcal{E}_S$ with $S = S_k := \{ \Lambda \subseteq V : |\Lambda| \le k \}$. We will focus on star tuples and consider individual coverings of the layers $\binom{V}{j} = \{ \Lambda \subseteq V : |\Lambda| = j \}$. Let $v = |V|$. Denote by $D(v, j)$ the smallest number of star tuples that cover $\binom{V}{j}$. We use the following notion from the theory of combinatorial designs (see [1] for an overview on that subject). For integers $v \ge k > r$ denote by $C(v, k, r)$ the smallest possible number of elements of $\binom{V}{k}$ such that every element from $\binom{V}{r}$ is contained in at least one of them.

Lemma 9. For $0 < j \le v$, the minimal number of star tuples that cover $\binom{V}{j}$ is $D(v, j) = C(v, v - j + 1, v - j)$. Inserting known results for $C(v, t + 1, t)$ we obtain the exact values

$$\begin{aligned} D(v, 1) &= 1 \\ D(v, 2) &= v - 1 \\ D(v, 3) &= \Big\lceil \frac{v}{v-2} \Big\lceil \frac{v-1}{v-3} \cdots \Big\lceil \frac{4}{2} \Big\rceil \cdots \Big\rceil \Big\rceil \\ D(v, v-3) &= \Big\lceil \frac{v}{4} \Big\lceil \frac{v-1}{3} \Big\lceil \frac{v-2}{2} \Big\rceil \Big\rceil \Big\rceil \quad (v \not\equiv 7 \bmod 12) \\ D(v, v-2) &= \Big\lceil \frac{v}{3} \Big\lceil \frac{v-1}{2} \Big\rceil \Big\rceil \\ D(v, v-1) &= \Big\lceil \frac{v}{2} \Big\rceil \\ D(v, v) &= 1 \end{aligned}$$

and the general bound

$$D(v, j) \le \frac{1 + \log(v - j + 1)}{v - j + 1} \binom{v}{j}, \quad 0 < j \le v.$$

Furthermore, we have the simple bound

$$D(v, j) \le \binom{v-1}{j-1}, \quad 0 < j \le v.$$

Proof of Lemma 9. A star tuple covering of $\binom{V}{j}$ is given by a collection $B_1, \ldots, B_n$ of elements of $\binom{V}{j-1}$ such that every element of $\binom{V}{j}$ contains at least one of the $B_i$. Passing to complements, this is a collection of $(v-j+1)$-sets such that every $(v-j)$-set is contained in at least one of them. The minimal possible number of elements in such a collection is therefore precisely $D(v, j) = C(v, v - j + 1, v - j)$. The equalities follow from corresponding equalities for $C(v, v - j + 1, v - j)$ by several authors, which are listed in [13]. The inequality follows from a result by Erdős and Spencer [2] showing that $C(v, k, r) \le \big[ \binom{v}{r} \big/ \binom{k}{r} \big] \big[ 1 + \log \binom{k}{r} \big]$. The simple bound results from the fact that each set $B$ from $\binom{V}{j}$ contains a set $B'$ from $\binom{V \setminus \{1\}}{j-1}$.

Remark 10. Lemma 9 presents widely applicable bounds on the cardinality of star tuple coverings, which are naturally not always tight. For $v \le 28$, better individual bounds on $C(v, t + 1, t)$ can be found in [13, Table III]. See also [14] for a list of known exact values. In another direction [15] offers optimal asymptotic bounds on $C(v, k, r)$ for fixed $k$ and $r$.

Lemma 9 allows us to formulate the following more explicit version of Theorem 8:

Theorem 11. Let $1 \le k \le v$. Every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on $\{0,1\}^V$ can be approximated arbitrarily well by distributions from $\mathrm{RBM}_{V,H}$ whenever $h$ is at least $U(v, k) = \sum_{j=2}^{k} D(v, j)$, which is bounded above as indicated in Lemma 9. This is the case, in particular, whenever $h \ge \sum_{j=2}^{k} \binom{v-1}{j-1}$ or $h \ge \frac{\log(v-1) + 1}{v+1} \sum_{j=2}^{k} \binom{v+1}{j}$.

Proof of Theorem 11. This follows directly from Theorem 8 and Lemma 9. For the last statement we use the simple bound from the lemma, by which $D(v, j) \le \binom{v-1}{j-1}$, and the general bound, by which $D(v, j) \le \frac{\log(v - j + 1) + 1}{v+1} \binom{v+1}{j}$.

In order to provide a numerical sense of Theorem 11 we give upper bounds on $U(v, k)$, $2 \le k \le v \le 14$, in Table 1.
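The formulas of Lemma 9 translate into a short computation. The following Python sketch is an independent illustration, not the authors' Octave script; the function names are hypothetical. It uses the exact values where Lemma 9 provides them and otherwise the smaller of the general and the simple bound, rounded down because $D(v,j)$ is an integer. Since it ignores the sharper individual bounds from [13, Table III], its output can exceed the entries of Table 1 for larger $v$.

```python
import math


def ceil_div(a, b):
    # Integer ceiling of a / b for positive integers.
    return -(-a // b)


def star_cover_bound(v, j):
    # Upper bound on D(v, j) using only the formulas stated in Lemma 9.
    if not 0 < j <= v:
        raise ValueError("need 0 < j <= v")
    if j == 1 or j == v:
        return 1
    if j == 2:
        return v - 1
    if j == v - 1:
        return ceil_div(v, 2)
    if j == v - 2:
        return ceil_div(v * ceil_div(v - 1, 2), 3)
    if j == v - 3 and v % 12 != 7:
        return ceil_div(v * ceil_div((v - 1) * ceil_div(v - 2, 2), 3), 4)
    if j == 3:
        # Nested ceilings, from the innermost term 4/2 up to v/(v-2).
        value = 1
        for numerator in range(4, v + 1):
            value = ceil_div(numerator * value, numerator - 2)
        return value
    k = v - j + 1
    general = math.floor((1 + math.log(k)) / k * math.comb(v, j))
    simple = math.comb(v - 1, j - 1)
    return min(general, simple)


def hidden_unit_bound(v, k):
    # Upper bound on U(v, k) = sum_{j=2}^{k} D(v, j) from Theorem 11.
    return sum(star_cover_bound(v, j) for j in range(2, k + 1))
```

For small cases in which every layer is covered by an exact formula, such as $v = 6$, the sketch agrees with the exact values of $U(v, k)$.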
For convenience we also provide an Octave [5] script for computing such bounds in http://personal-homepages.mis.mpg.de/montufar/starcover.m.

In the special case $k = v$, the $k$-interaction model $\mathcal{E}_{S_k}$ is the full interaction model and contains all (strictly positive) probability distributions on $\{0,1\}^V$. Hence Theorem 11 entails the following universal approximation result:

Corollary 12. Every distribution on $\{0,1\}^V$ can be approximated arbitrarily well by distributions from $\mathrm{RBM}_{V,H}$ whenever $h$ is at least $U(v, v)$, which is bounded above as indicated in Lemma 9. This is the case, in particular, whenever $h \ge 2^{v-1} - 1$ or $h \ge \frac{2(\log(v-1) + 1)}{v+1} \big( 2^v - (v+1) - 1 \big) + 1$.

Corollary 12 provides a significant and unexpected improvement over the best previously known upper bound $2^{v-1} - 1$ from [10]. Whether the upper bound $2^{v-1} - 1$ was optimal or not had remained an open problem in [10] and several succeeding papers. In Table 2 we give upper bounds on $U(v, v)$, $2 \le v \le 40$, and compare these with the previous result.

Remark 13. In general an RBM can represent many more distributions than just the interaction models described above. For several small examples discussed further below, our bounds for the representation of interaction models are tight. However, Theorem 11 is based on upper bounds on a specific type of coverings and we suspect that it can be further improved, at least in some special cases, even if not reaching the hard lower bounds coming from parameter counting.
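To get a feeling for the closed-form condition in Corollary 12, one can compare it with the previous bound $2^{v-1} - 1$ and with the parameter-counting lower bound $\lceil 2^v/(v+1) \rceil - 1$. The sketch below uses hypothetical function names and assumes $v \ge 2$. It suggests that the closed-form expression only falls below $2^{v-1} - 1$ once $v$ is moderately large, while the sharper covering-based values of $U(v, v)$ improve on the previous bound much earlier.

```python
import math


def previous_bound(v):
    # Universal approximation bound 2^(v-1) - 1 from [10].
    return 2 ** (v - 1) - 1


def closed_form_bound(v):
    # Second sufficient condition in Corollary 12 (requires v >= 2).
    return 2 * (math.log(v - 1) + 1) / (v + 1) * (2 ** v - (v + 1) - 1) + 1


def parameter_counting_lower_bound(v):
    # Smallest h with (h + 1)(v + 1) - 1 >= 2^v - 1.
    return -(-(2 ** v) // (v + 1)) - 1


def compare(v):
    # Returns (closed-form upper bound, previous upper bound, lower bound).
    return closed_form_bound(v), previous_bound(v), parameter_counting_lower_bound(v)
```

Calling `compare(v)` for a range of `v` shows how the ratio between the closed-form bound and $2^{v-1} - 1$, roughly $4(\log(v-1) + 1)/(v+1)$, decays as $v$ grows.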
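| $k \backslash v$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | $1_{1}$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| 3 | - | $3_{3}$ | 5 | 8 | 11 | 15 | 19 | 24 | 29 | 35 | 41 | 48 | 55 |
| 4 | - | - | $6_{7}$ | 11 | 17 | 27 | $39^{55}$ | $54^{82}$ | $74^{117}$ | $98^{162}$ | $125^{216}$ | 268 | 341 |
| 5 | - | - | - | $12_{15}$ | 20 | 34 | $53^{69}$ | $84^{147}$ | $124^{234}$ | $182^{356}$ | $251^{520}$ | $453^{725}$ | 1002 |
| 6 | - | - | - | - | $21_{31}$ | 38 | $64^{80}$ | $109^{172}$ | $175^{343}$ | $282^{570}$ | $427^{908}$ | $750^{1385}$ | $1473^{2068}$ |
| 7 | - | - | - | - | - | $39_{63}$ | $68^{84}$ | $121^{184}$ | $205^{373}$ | $348^{742}$ | $559^{1276}$ | $1014^{2107}$ | $1944^{3389}$ |
| 8 | - | - | - | - | - | - | $69^{85}_{127}$ | $126^{189}$ | $222^{390}$ | $395^{789}$ | $672^{1534}$ | $1259^{2705}$ | $2452^{4652}$ |
| 9 | - | - | - | - | - | - | - | $127^{190}_{255}$ | $227^{395}$ | $414^{808}$ | $729^{1591}$ | $1416^{3078}$ | $2823^{5583}$ |
| 10 | - | - | - | - | - | - | - | - | $228^{396}_{511}$ | $420^{814}$ | $753^{1615}$ | $1494^{3156}$ | $3053^{6105}$ |
| 11 | - | - | - | - | - | - | - | - | - | $421^{815}_{1023}$ | $759^{1621}$ | $1520^{3182}$ | $3144^{6196}$ |
| 12 | - | - | - | - | - | - | - | - | - | - | $760^{1622}_{2047}$ | $1527^{3189}$ | $3177^{6229}$ |
| 13 | - | - | - | - | - | - | - | - | - | - | - | $1528^{3190}_{4095}$ | $3184^{6236}$ |
| 14 | - | - | - | - | - | - | - | - | - | - | - | - | $3185^{6237}_{8191}$ |

Table 1: Upper bounds on the minimal number of hidden units for which $\mathrm{RBM}_{V,H}$ can approximate every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on $\{0,1\}^V$ arbitrarily well, following from Theorem 11, for $2 \le k \le v \le 14$. Shown are upper bounds on $U(v, k) = \sum_{j=2}^{k} D(v, j)$ evaluated using Lemma 9 and some individual bounds on $D(v, j) = C(v, v - j + 1, v - j)$ from [13, Table III]. Upper scripts indicate values obtained using only Lemma 9. Lower scripts indicate the previous RBM universal approximation bound $2^{v-1} - 1$ from [10]. Entries with $v \le 9$ or $k \le 3$ are exact values of $U(v, k)$.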
| $v$ | $U(v, v) \le$ | $2^{v-1} - 1 =$ | $\lceil \frac{2^v}{v+1} - 1 \rceil =$ |
|---|---|---|---|
| 2 | 1 | 1 | 1 |
| 3 | 3 | 3 | 1 |
| 4 | 6 | 7 | 3 |
| 5 | 12 | 15 | 5 |
| 6 | 21 | 31 | 9 |
| 7 | 39 | 63 | 15 |
| 8 | 69 | 127 | 28 |
| 9 | 127 | 255 | 51 |
| 10 | 228 | 511 | 93 |
| 11 | 421 | 1023 | 170 |
| 12 | 760 | 2047 | 315 |
| 13 | 1528 | 4095 | 585 |
| 14 | 3185 | 8191 | 1092 |
| 15 | 6642 | 16,383 | 2047 |
| 16 | 14,269 | 32,767 | 3855 |
| 17 | 30,352 | 65,535 | 7281 |
| 18 | 63,431 | 131,071 | 13,797 |
| 19 | 132,195 | 262,143 | 26,214 |
| 20 | 272,160 | 524,287 | 49,932 |
| 21 | 553,195 | 1,048,575 | 95,325 |
| 22 | 1,115,207 | 2,097,151 | 182,361 |
| 23 | 2,227,484 | 4,194,303 | 349,525 |
| 24 | 4,427,830 | 8,388,607 | 671,088 |
| 25 | 8,760,826 | 16,777,215 | 1,290,555 |
| 26 | 17,265,199 | 33,554,431 | 2,485,513 |
| 27 | 33,951,316 | 67,108,863 | 4,793,490 |
| 28 | 66,656,315 | 134,217,727 | 9,256,395 |
| 29 | 132,084,407 | 268,435,455 | 17,895,697 |
| 30 | 257,962,181 | 536,870,911 | 34,636,833 |
| 31 | 504,141,876 | 1,073,741,823 | 67,108,863 |
| 32 | 985,875,453 | 2,147,483,647 | 130,150,524 |
| 33 | 1,929,093,753 | 4,294,967,295 | 252,645,135 |
| 34 | 3,776,867,237 | 8,589,934,591 | 490,853,405 |
| 35 | 7,398,516,744 | 17,179,869,183 | 954,437,176 |
| 36 | 14,500,416,431 | 34,359,738,367 | 1,857,283,155 |
| 37 | 28,433,369,622 | 68,719,476,735 | 3,616,814,565 |
| 38 | 55,779,952,400 | 137,438,953,471 | 7,048,151,460 |
| 39 | 109,476,401,847 | 274,877,906,943 | 13,743,895,347 |
| 40 | 214,954,581,277 | 549,755,813,887 | 26,817,356,775 |

Table 2: Bounds on the minimal number of hidden units for which $\mathrm{RBM}_{V,H}$ can approximate every distribution on $\{0,1\}^V$ arbitrarily well, for $2 \le v \le 40$. The first column gives upper bounds following from Corollary 12. Shown are upper bounds on $U(v, v) = \sum_{j=2}^{v} D(v, j)$ evaluated using Lemma 9 and some individual bounds on $D(v, j) = C(v, v - j + 1, v - j)$ from [13, Table III]. The second column gives the previous upper bound $2^{v-1} - 1$ from [10]. The last column gives the hard lower bound $\lceil \frac{2^v}{v+1} - 1 \rceil$ that results from parameter counting, i.e., from demanding that the model $\mathrm{RBM}_{V,H}$ has at least $(h + 1)(v + 1) - 1 \ge 2^v - 1$ parameters.

Besides RBMs we can also consider models that include interactions among the visible variables other than biases.
In this case we only need to cover the interaction sets from the simplicial complex $S$ that are not already included in the simplicial complex $T$. In Theorem 8 one just replaces $\{ \Lambda \in S : |\Lambda| \ge 2 \}$ by $S \setminus T$. We note the following special case:

Corollary 14. Every distribution from the $k$-interaction model $\mathcal{E}_{S_k}$ on $\{0,1\}^V$ can be approximated arbitrarily well by the visible marginals of a pairwise interaction model with $h = \sum_{j=3}^{k} D(v, j)$ hidden binary variables. The latter is bounded above as indicated in Lemma 9. In particular, every distribution on $\{0,1\}^V$ can be approximated arbitrarily well by the visible marginals of a pairwise interaction model with $h = 2^{v-1} - (v - 1) - 1$ or $h = \frac{2(\log(v-2) + 1)}{v+1} \big( 2^v - (v+1) - 1 - \frac{(v+1)v}{4} \big) + 1$ hidden binary variables.

Corollary 14 improves a previous result by Younes [17], which showed that a pairwise interaction model with $h = 2^v - \binom{v}{2} - v - 1$ hidden binary variables can approximate every distribution on $\{0,1\}^V$ arbitrarily well.

We close this section with a few small examples illustrating our results.

Example 15. The model $\mathrm{RBM}_{3,1}$ is the same as the two-mixture of product distributions of 3 binary variables and is also known as the tripod tree model. It has 7 parameters and the same dimension. What is the largest hierarchical model contained in the closure of this model? The closure of a model $\mathcal{M}$ is the set of all probability distributions that can be approximated arbitrarily well by probability distributions from $\mathcal{M}$. The closure of $\mathrm{RBM}_{3,1}$ contains all 3 hierarchical models on $\{0,1\}^3$ with two pairwise interactions. For example, it contains the model $\mathcal{E}_S$ with $S = \{ \{1,2\}, \{1,3\}, \{1\}, \{2\}, \{3\} \}$. Indeed, two quadratic coefficients can be jointly modeled by one soft-plus unit (Lemma 5) and the linear coefficients with the biases of the visible variables.
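The joint modeling of two quadratic coefficients by a single soft-plus unit can be checked numerically with the parameter choice from the proof of Lemma 5. The sketch below is illustrative only: the helper names, the target values $1.5$ and $-0.8$, and the choice $\omega = 40$ are assumptions. It takes root $B = \{1\}$ and leaves $B' = \{2, 3\}$, sets $w_1 = \omega$, $c = -\omega/2$, $w_j = J_{\{1, j\}}$, and evaluates the coefficients with Equation (8).

```python
import itertools
import math


def softplus(s):
    # Numerically stable log(1 + exp(s)).
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def coefficient(B, w, c):
    # K_B(w, c) from Equation (8).
    B = sorted(B)
    total = 0.0
    for r in range(len(B) + 1):
        for C in itertools.combinations(B, r):
            total += (-1) ** (len(B) - r) * softplus(sum(w[i] for i in C) + c)
    return total


def star_tuple_unit(root, targets, omega=40.0):
    # Parameters from the proof of Lemma 5: weight omega on the root B,
    # target values as weights on B', and bias -(|B| - 1/2) * omega.
    root = list(root)
    w = {i: omega for i in root}
    w.update(targets)
    c = -(len(root) - 0.5) * omega
    return w, c


# Root B = {1}; leaves {1,2} and {1,3} with hypothetical targets 1.5 and -0.8.
w, c = star_tuple_unit(root=[1], targets={2: 1.5, 3: -0.8})
K12 = coefficient({1, 2}, w, c)
K13 = coefficient({1, 3}, w, c)
K23 = coefficient({2, 3}, w, c)
K123 = coefficient({1, 2, 3}, w, c)
K1 = coefficient({1}, w, c)
```

With these values, `K12` and `K13` come out close to the targets, `K23` and `K123` are close to zero, and `K1` is close to $\omega/2$. The large linear coefficient `K1` is uncontrolled by the unit and would be compensated by the bias $b_1$ of the visible variable.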
In particular, the closure of $\mathrm{RBM}_{3,1}$ also contains the 3 hierarchical models with a single pairwise interaction. It does not contain the hierarchical model with 3 pairwise interactions, $\mathcal{E}_S$ with $S = S_2 = \{\{1,2\}, \{1,3\}, \{2,3\}, \{1\}, \{2\}, \{3\}\}$, which is known as the no-three-way interaction model. One way of proving this is by comparing the possible support sets of the two models, as proposed in [8]. The support set of a product distribution is a cylinder set. The support set of a mixture of two product distributions is a union of two cylinder sets. On the other hand, the possible support sets of a hierarchical model correspond to the faces of its marginal polytope, $\operatorname{conv}\{(\prod_{i \in \Lambda} x_i)_{\Lambda \in S} : x \in \mathcal{X}\} \subset \mathbb{R}^S$. The marginal polytope of the no-three-way interaction model is the cyclic polytope $C(N,d)$ with $N = 8$ vertices and dimension $d = 6$ (see, e.g., [8, Lemma 18]). This is a neighborly polytope, meaning that every $d/2 = 3$ or fewer vertices form a face. In turn, every subset of $\{0,1\}^3$ of cardinality $d/2 = 3$ is the support set of a distribution in the closure of the no-three-way interaction model.¹ Since the set $\{(100), (010), (001)\}$ is not a union of two cylinder sets, the closure of $\mathrm{RBM}_{3,1}$ does not contain the no-three-way interaction model.

Example 16. The closure of $\mathrm{RBM}_{3,2}$ contains the no-three-way interaction model. Two of the quadratic coefficients can be jointly modeled with one hidden unit and the third with the second hidden unit (Lemma 5). It does not contain the full interaction model. Following the ideas explained in the previous example, this can be shown by analyzing the possible support sets of the distributions in the closure of $\mathrm{RBM}_{3,2}$. For details on this we refer the reader to [12].
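The support-set argument of Example 15 can be checked by brute force. The following is a minimal sketch, not part of the original argument: it enumerates all cylinder sets of $\{0,1\}^3$ (some coordinates fixed, the others free) and tests whether a given set is a union of two of them. The function names `cylinder_sets` and `is_union_of_two_cylinders` are our own illustrative choices.

```python
from itertools import product


def cylinder_sets(n):
    """All cylinder sets of {0,1}^n: fix some coordinates, leave the others free."""
    sets = []
    for pattern in product((0, 1, None), repeat=n):  # None marks a free coordinate
        members = frozenset(
            x
            for x in product((0, 1), repeat=n)
            if all(p is None or p == xi for p, xi in zip(pattern, x))
        )
        sets.append(members)
    return sets


def is_union_of_two_cylinders(target, n):
    """True if target equals A | B for some cylinder sets A, B (A == B is allowed)."""
    target = frozenset(target)
    cyl = cylinder_sets(n)
    return any(a | b == target for a in cyl for b in cyl)


if __name__ == "__main__":
    # The set used in Example 15.
    target = {(1, 0, 0), (0, 1, 0), (0, 0, 1)}
    print(is_union_of_two_cylinders(target, 3))
    # For contrast, two antipodal points are a union of two one-point cylinders.
    print(is_union_of_two_cylinders({(0, 0, 0), (1, 1, 1)}, 3))
```

Any two of the three points in the example differ in two coordinates, so the smallest cylinder set containing both already has four elements. This is why no pair of cylinder sets covers exactly this set.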
¹ More generally, in [6] it is shown that if $S \supseteq \{\Lambda \subseteq V : |\Lambda| \le k\}$, then the marginal polytope of $\mathcal{E}_S$ is $2^k - 1$ neighborly, meaning that any $2^k - 1$ or fewer of its vertices define a face.

Example 17. The model $\mathrm{RBM}_{3,3}$ is a universal approximator. This follows immediately from the universal approximation bound $2^{v-1} - 1$ from [10]. This observation can be recovered from our results as follows. The cubic coefficient can be modeled with one hidden unit (Lemma 1). Two quadratic coefficients can be jointly modeled with one hidden unit and the third with another hidden unit (Lemma 5).

Example 18. The model $\mathrm{RBM}_{4,6}$ is a universal approximator. The quartic coefficient can be modeled with one hidden unit. The 4 cubic coefficients can be modeled with two hidden units (Lemma 5). The 6 quadratic coefficients can be grouped into 3 pairs with a shared variable in each pair. These can be modeled with 3 hidden units (Lemma 5).

5 Conclusions and Outlook

We studied the kinds of interactions that appear when marginalizing over a hidden variable that is connected by pair-interactions with all visible variables. We derived upper bounds on the minimal number of variables of a hierarchical model whose visible marginal distributions can approximate any distribution from a given fully observable hierarchical model arbitrarily well. These results generalize and improve previous results on the representational power of RBMs from [10] and [17].

Many interesting questions remain open at this point: A full characterization of soft-plus polynomials and the necessary number of hidden variables is missing. It would be interesting to look at non-binary hidden variables. This corresponds to analyzing the hierarchical models that can be represented by mixture models.
In the case of conditionally independent binary hidden variables, the partial factorization leads to soft-plus units, whereas in the case of higher-valued hidden variables, it leads to shifted logarithms of denormalized mixtures. Similarly, it would be interesting to take a look at non-binary visible variables. In this case state vectors cannot be identified in a one-to-one manner with subsets of variables. This means that the correspondence between function values and polynomial coefficients is not as direct. Our analysis could also be extended to cover the representation of conditional probability distributions from hierarchical models in terms of conditional restricted Boltzmann machines and to refine the results on this problem reported in [11]. Another interesting direction concerns models where the hidden variables are not conditionally independent given the visible variables, such as deep Boltzmann machines, which involve several layers of hidden variables. This case is more challenging, since the free energy does not decompose into independent terms.

Acknowledgments

We thank Nihat Ay for helpful remarks on the manuscript.

References

[1] C. J. Colbourn and J. H. Dinitz. Handbook of Combinatorial Designs, Second Edition (Discrete Mathematics and Its Applications). Chapman & Hall/CRC, 2006.

[2] P. Erdős and J. Spencer. Probabilistic Methods in Combinatorics. Academic Press Inc, 1974.

[3] Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 912–919. Morgan-Kaufmann, 1992. http://papers.nips.cc/paper/535-unsupervised-learning-of-distributions-on-binary-vectors-using-two-layer-networks.pdf.

[4] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721–741, 1984. http://dx.doi.org/10.1109/TPAMI.1984.4767596.

[5] J. W. Eaton, D. Bateman, S. Hauberg, and R. Wehbring. GNU Octave version 4.0.0 manual: a high-level interactive language for numerical computations. 2015. http://www.gnu.org/software/octave/doc/interpreter.

[6] T. Kahle. Neighborliness of marginal polytopes. Beiträge zur Algebra und Geometrie, 51(1):45–56, 2010. http://www.emis.de/journals/BAG/vol.51/no.1/4.html.

[7] N. Le Roux and Y. Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649, June 2008. http://dx.doi.org/10.1162/neco.2008.04-07-510.

[8] G. Montúfar. Mixture decompositions of exponential families using a decomposition of their sample spaces. Kybernetika, 49(1):23–39, 2013. http://www.kybernetika.cz/content/2013/1/23.

[9] G. Montúfar. Deep narrow Boltzmann machines are universal approximators. International Conference on Learning Representations 2015 (ICLR 2015), San Diego, CA, USA, 2015.

[10] G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011. http://dx.doi.org/10.1162/NECO_a_00113.

[11] G. Montúfar, N. Ay, and K. Ghazi-Zahedi. Geometry and expressive power of conditional restricted Boltzmann machines. Journal of Machine Learning Research, 16:2405–2436, 2015. http://jmlr.org/papers/v16/montufar15b.html.

[12] G. Montúfar and J. Morton. When does a mixture of products contain a product of mixtures? SIAM Journal on Discrete Mathematics, 29:321–347, 2015. http://dx.doi.org/10.1137/140957081.

[13] K. J. Nurmela and P. R. J. Östergård. New coverings of t-sets with (t+1)-sets. Journal of Combinatorial Designs, 7(3):217–226, 1999.
http://dx.doi.org/10.1002/(SICI)1520-6610(1999)7:3<217::AID-JCD5>3.0.CO;2-W.

[14] OEIS. The on-line encyclopedia of integer sequences, A066010 triangle of covering numbers T(n,k) = C(n,k,k-1), n >= 2, 2 <= k <= n. Published electronically at http://oeis.org, 2010.

[15] V. Rödl. On a packing and covering problem. European Journal of Combinatorics, 6(1):69–78, 1985. http://dx.doi.org/10.1016/S0195-6698(85)80023-8.

[16] B. Steudel and N. Ay. Information-theoretic inference of common ancestors. Entropy, 17(4):2304, 2015. http://dx.doi.org/10.3390/e17042304.

[17] L. Younes. Synchronous Boltzmann machines can be universal approximators. Applied Mathematics Letters, 9(3):109–113, 1996. http://dx.doi.org/10.1016/0893-9659(96)00041-9.

[18] P. Zwiernik and J. Q. Smith. Tree cumulants and the geometry of binary tree models. Bernoulli, 18:290–321, 2012. http://dx.doi.org/10.3150/10-BEJ338.

A Proofs

Proof of Lemma 2. Let $B' = B \setminus \{m\}$. The edge coefficients satisfy

$$K_{B'}(w_{B'}, c) = \sum_{C \subseteq B'} (-1)^{|B' \setminus C|} \log\Big(1 + \exp\Big(\sum_{i \in C} w_i + c\Big)\Big)$$

and

$$K_B(w_B, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c).$$

Using this structure, we now proceed with the proof of the individual cases.

The case $|B'| = 0$. We omit this simple exercise.

The case $|B'| = 1$. The if statement is as follows. The elements of the set $\{0,1\}^B$ are the vertices of the $|B|$-dimensional unit cube. We call two vectors $x, x' \in \{0,1\}^B$ adjacent if they differ in exactly one entry, in which case they are the vertices of an edge of the cube. The weights $w_B$ and $c$ can be chosen such that the affine map $\{0,1\}^B \to \mathbb{R}$; $x_B \mapsto w_B^\top x_B + c$ maps any chosen pair of adjacent vectors to any arbitrary values and all other vectors to large negative values. The soft-plus function is monotonically increasing, taking value zero at minus infinity and plus infinity at plus infinity.
Hence, for any $s, s' \in \mathbb{R}_+$, one finds weights $w$ and $c$ such that

$$\phi(x) = \begin{cases} s, & (x_{B'}, x_m) = (1, \ldots, 1, 1) \\ s', & (x_{B'}, x_m) = (1, \ldots, 1, 0) \\ \approx 0, & \text{otherwise}, \end{cases}$$

or, alternatively, such that

$$\phi(x) = \begin{cases} s, & (x_{B'}, x_m) = (1, \ldots, 1, 0, 1) \\ s', & (x_{B'}, x_m) = (1, \ldots, 1, 0, 0) \\ \approx 0, & \text{otherwise}. \end{cases}$$

This leads to $K_B \approx (s - s')$ and $K_{B'} \approx s'$ or, alternatively, $K_B \approx -(s - s')$ and $K_{B'} \approx -s'$. The approximation can be made arbitrarily precise.

The only if statement is as follows. Denote the soft-plus function by $f : \mathbb{R} \to \mathbb{R}_+$; $s \mapsto \log(1 + \exp(s))$. Since $|B'| = 1$, $C \subseteq B'$ implies $C = B'$ or $C = \emptyset$. We have that $K_{B'}(w_{B'}, c) = f(w_{B'} + c) - f(c)$ and $K_{B'}(w_{B'}, c + w_m) = f(w_{B'} + c + w_m) - f(c + w_m)$ are either both positive or both negative, depending on the sign of $w_{B'}$. If both are positive, then $K_B(w_B, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c) \ge -K_{B'}(w_{B'}, c)$, and similarly in the case that both are negative.

The case $|B'| = 2$. The if statement follows from the previous case $|B'| = 1$. Indeed, consider an edge pair $(C, C')$ with an element more than the edge pair $(B, B')$, such that $B = C \setminus \{n\}$ and $B' = C' \setminus \{n\}$. Then, for any $w_B$ and $c$, choosing $w_n$ large enough one obtains an arbitrarily accurate approximation $K_C((w_B, w_n), c - w_n) \approx K_B(w_B, c)$ and $K_{C'}((w_{B'}, w_n), c - w_n) \approx K_{B'}(w_{B'}, c)$. For the only if statement we use a similar argument as previously. We have $K_{B'}(w_{B'}, c) = f(w_1 + w_2 + c) + f(c) - f(c + w_1) - f(c + w_2)$. By convexity of $f$, this is non-negative if and only if either $w_1, w_2 \ge 0$ or $w_1, w_2 \le 0$. In other words, this is non-negative if and only if $w_1 \cdot w_2 \ge 0$. Under either of these conditions, $K_{B'}(w_{B'}, c + w_m)$ is also non-negative. Similarly, $K_{B'}(w_{B'}, c)$ is non-positive if and only if $w_1 \cdot w_2 \le 0$.
In this case, $K_{B'}(w_{B'}, c + w_m)$ is also non-positive. Now the statement follows as in the case $|B'| = 1$.

The case $|B'| = 3$. We need to show that any edge pair coefficients can be represented. Consider first $J_{B'} \ge 0$. We choose weights of the form $w_{B'} = \omega \mathbb{1}_{B'}$, where $\omega \in \mathbb{R}$ and $\mathbb{1}_{B'}$ is the vector of $|B'|$ ones. Then $K_{B'}(w_{B'}, c) = f(3\omega + c) - 3 f(2\omega + c) + 3 f(\omega + c) - f(c)$. We can choose $\omega$ and $c$ such that $3\omega + c = f^{-1}(J_{B'})$ while $2\omega + c$, $\omega + c$, and $c$ take very large negative values. This yields $K_{B'} \approx J_{B'}$. Note that the derivative of the soft-plus function is the logistic function, i.e., $f'(s) = 1/(1 + \exp(-s))$. Choosing $\omega$ large enough from the beginning, the function $w_m \mapsto K_{B'}(w_{B'}, c + w_m)$ is monotonically increasing in the interval $w_m \in [0, \omega/2]$ and surpasses the value $\frac{1}{5}\omega$. On the other hand, when $w_m$ is large enough, depending on $\omega$ and $c$, we have that $2\omega + c + w_m \ge \frac{5}{12}(3\omega + c + w_m)$ and $f(2\omega + c + w_m) \ge \frac{5}{12} f(3\omega + c + w_m)$. In this case $f(3\omega + c + w_m) - 3 f(2\omega + c + w_m) \le -\frac{1}{4}(3\omega + c + w_m) \le -\frac{1}{4}\omega$. At the same time, $\omega + c + w_m$ and $c + w_m$ are smaller than $-\frac{1}{12}\omega$ and so $f(\omega + c + w_m)$ and $f(c + w_m)$ are very small in absolute value. By the intermediate value theorem, depending on $w_m$, $K_{B'}(w_{B'}, c + w_m)$ takes any value in the interval $[-\frac{1}{5}\omega, \frac{1}{5}\omega]$, where $\omega$ is arbitrarily large. In turn, we can obtain $K_B = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c) \approx J_B$ for any $J_B \in \mathbb{R}$. For $J_{B'} \le 0$ the proof is analogous after label switching for one variable.

The case $|B'| > 3$. This follows from the previous case $|B'| = 3$ in the same way that the if part of the case $|B'| = 2$ follows from the case $|B'| = 1$.
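The identities used in the proof of Lemma 2 are easy to check numerically. The following is a minimal sketch under our own naming (`softplus`, `K`, `max_recursion_error`, and `inverse_softplus` are illustrative helpers, not notation from the paper). It evaluates the alternating soft-plus sum directly, tests the recursion $K_B(w_B, c) = K_{B'}(w_{B'}, c + w_m) - K_{B'}(w_{B'}, c)$ on random weights, and reproduces the first step of the case $|B'| = 3$ for one arbitrarily chosen target value.

```python
import math
import random
from itertools import combinations


def softplus(s):
    """Numerically stable log(1 + exp(s))."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def K(w, c):
    """Alternating sum over all subsets C of the index set B = {0, ..., len(w) - 1}:
    sum over C of (-1)^(|B| - |C|) * softplus(sum of w_i for i in C, plus c)."""
    n = len(w)
    total = 0.0
    for r in range(n + 1):
        for C in combinations(range(n), r):
            total += (-1) ** (n - r) * softplus(sum(w[i] for i in C) + c)
    return total


def max_recursion_error(trials=1000, seed=0):
    """Largest deviation from K_B(w_B, c) = K_B'(w_B', c + w_m) - K_B'(w_B', c)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        w = [rng.uniform(-3, 3) for _ in range(4)]  # last entry plays the role of w_m
        c = rng.uniform(-3, 3)
        lhs = K(w, c)
        rhs = K(w[:-1], c + w[-1]) - K(w[:-1], c)
        worst = max(worst, abs(lhs - rhs))
    return worst


def inverse_softplus(y):
    """Solve softplus(s) = y for y > 0."""
    return math.log(math.expm1(y))


if __name__ == "__main__":
    print("max recursion error:", max_recursion_error())
    # |B'| = 2: the sign of K_B' matches the sign of w_1 * w_2.
    print(K([1.0, 2.0], 0.3) > 0, K([1.0, -2.0], 0.3) < 0)
    # |B'| = 3 with equal weights omega: pick c so that 3 * omega + c = f^{-1}(J).
    J, omega = 0.7, 40.0  # assumed example values
    c = inverse_softplus(J) - 3 * omega
    print("target J:", J, "obtained K_B':", K([omega] * 3, c))
```

The values $J_{B'} = 0.7$ and $\omega = 40$ are arbitrary choices for illustration. With these values the three lower-order soft-plus terms are negligible, so the computed $K_{B'}$ agrees with the target up to floating-point error.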
