A measure of statistical complexity based on predictive information

We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximise binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.

Authors: Samer A. Abdallah, Mark D. Plumbley
Queen Mary University of London (Dated: October 25, 2018)

I. INTRODUCTION

The concepts of 'structure', 'pattern' and 'complexity' are relevant in many fields of inquiry: physics, biology, cognitive sciences, machine learning, the arts and so on; but are vague enough to resist being quantified in a single definitive manner. One approach, which we adopt here, is to attempt to characterise them in statistical terms, for distributions over configurations of some system, using the tools of information theory [1].

In this letter, we propose a measure of statistical structure based on the concept of predictive information rate (PIR) [2], which measures an aspect of temporal dependency not captured by previously proposed measures. We review a number of these earlier proposals and the PIR, and then define the binding information as the extensive counterpart of the PIR, applicable to arbitrary countable sets of random variables. After describing some of its properties, we identify some finite discrete processes that maximise the binding information.

In the following, if X is a random process indexed by a set A, and B ⊆ A, then X_B denotes the compound random variable (random 'vector') formed by taking X_α for each α ∈ B. The set of integers from M to N inclusive will be written M..N, and \ will denote the set difference operator, so, for example, X_{1..3\{2}} ≡ (X_1, X_3).

II. BACKGROUND

Suppose that (..., X_{-1}, X_0, X_1, ...) is a bi-infinite stationary sequence of random variables, and that ∀t ∈ ℤ, the random variable X_t takes values in a discrete set 𝒳. Let μ be the associated shift-invariant probability measure. Stationarity implies that the probability distribution associated with any contiguous block of N variables (X_{t+1}, ..., X_{t+N}) is independent of t, and therefore we can define a shift-invariant block entropy function:

    H(N) \triangleq H(X_1, \dots, X_N) = \sum_{x \in \mathcal{X}^N} -p^N_\mu(x) \log p^N_\mu(x),    (1)

where p^N_μ : 𝒳^N → [0,1] is the unique probability mass function for any N consecutive variables in the sequence, p^N_μ(x) ≜ Pr(X_1 = x_1 ∧ ... ∧ X_N = x_N). The entropy rate h_μ has two equivalent definitions in terms of the block entropy function [1, Ch. 4]:

    h_\mu \triangleq \lim_{N \to \infty} \frac{H(N)}{N} = \lim_{N \to \infty} H(N) - H(N-1).    (2)

The block entropy function can also be used to express the mutual information between two contiguous segments of the sequence, of lengths N and M respectively:

    I(X_{-N..-1}; X_{0..M-1}) = H(N) + H(M) - H(N+M).    (3)
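As a concrete illustration of eqs. (1)–(2) (ours, not part of the original letter), the block entropy and entropy rate can be computed exactly for a small stationary Markov chain. The two-state transition matrix below is an arbitrary assumption made for this sketch; for a first-order Markov chain the difference form of eq. (2) is already exact at N = 2.

```python
import numpy as np
from itertools import product

# Illustrative two-state Markov chain; the transition matrix T is an
# assumption made for this sketch, not an example from the letter.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def stationary(T):
    """Stationary distribution: left eigenvector of T for eigenvalue 1."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def block_entropy(T, N):
    """Block entropy H(N) of Eq. (1), in bits, summed over all length-N
    paths of the chain under its shift-invariant measure."""
    pi = stationary(T)
    H = 0.0
    for x in product(range(len(pi)), repeat=N):
        p = pi[x[0]] * np.prod([T[a, b] for a, b in zip(x, x[1:])])
        if p > 0:
            H -= p * np.log2(p)
    return H

# Both limits in Eq. (2) give the entropy rate h_mu; here the difference
# form H(N) - H(N-1) has converged long before N = 5.
print(block_entropy(T, 5) - block_entropy(T, 4))  # h_mu in bits
```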
If we let both block lengths N and M tend to infinity, we obtain what has been called the excess entropy [3] or the effective measure complexity [4]. It is the amount of information about the infinite future that can be obtained, on average, by observing the infinite past:

    E = \lim_{N \to \infty} 2H(N) - H(2N).    (4)

Bialek et al. [5] defined the predictive information I_pred(N) as the mutual information between a block of length N and the infinite future following it:

    I_{\mathrm{pred}}(N) \triangleq \lim_{M \to \infty} H(N) + H(M) - H(N+M).    (5)

They showed that even if I_pred(N) diverges as N tends to infinity, the manner of its divergence reveals something about the learnability of the underlying random process. Bialek et al. also emphasised that I_pred(N) is the sub-extensive component of the entropy: if N h_μ is the purely extensive (i.e., linear in N) component of the entropy, then I_pred(N) is the difference between the block entropy H(N) and its extensive component:

    H(N) = N h_\mu + I_{\mathrm{pred}}(N).    (6)

The multi-information [6] is defined for any collection of N random variables (X_1, ..., X_N) as

    I(X_{1..N}) \triangleq -H(X_{1..N}) + \sum_{i \in 1..N} H(X_i).    (7)

For N = 2, the multi-information reduces to the mutual information I(X_1; X_2), while for N > 2, I(X_{1..N}) continues to be a measure of dependence, being zero if and only if the variables are statistically independent.

[FIG. 1. Venn diagram representation [1, Ch. 2] of several information measures for stationary random processes: (a) predictive information I_pred(N), between the finite past X_{-N..-1} and the infinite future; (b) excess entropy E, between the infinite past and the infinite future; (c) predictive information rate b_μ. Each circle or oval represents a random variable or sequence of random variables relative to time t = 0; overlapped areas correspond to various mutual informations. In (c), the circle represents the 'present'. Its total area is H(X_0) = H(1) = ρ_μ + r_μ + b_μ, where ρ_μ is the multi-information rate, r_μ is the residual entropy rate, and b_μ is the predictive information rate. The entropy rate is h_μ = r_μ + b_μ.]

In the thermodynamic limit, the intensive multi-information rate (cf. Dubnov's information rate [7]) can be defined as

    \rho_\mu \triangleq \lim_{N \to \infty} I(X_{1..N}) - I(X_{1..N-1}).    (8)

It can easily be shown that ρ_μ = I_pred(1) = H(1) − h_μ. Erb and Ay [8] studied this quantity (they call it I) and showed that, in the present terminology,

    I(X_{1..N}) + I_{\mathrm{pred}}(N) = N \rho_\mu.    (9)

Comparing this with (6), we see that I_pred(N) is also the sub-extensive component of the multi-information. Thus, all of the measures considered so far, being linearly dependent in various ways, are closely related.
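For finite systems, the multi-information of eq. (7) is straightforward to compute from a joint distribution. The sketch below is ours, with the joint pmf assumed to be stored as an N-dimensional NumPy array, one axis per variable:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a pmf stored in an ndarray."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def multi_information(joint):
    """Multi-information I(X_1..N) of Eq. (7): the sum of the marginal
    entropies minus the joint entropy."""
    N = joint.ndim
    def marginal(i):
        return joint.sum(axis=tuple(j for j in range(N) if j != i))
    return sum(entropy_bits(marginal(i)) for i in range(N)) - entropy_bits(joint)

# For N = 2 this reduces to the ordinary mutual information I(X_1; X_2).
```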
Another class of measures, including Grassberger's true measure complexity [4] and Crutchfield et al.'s statistical complexity C_μ [9, 10], is based on the properties of stochastic automata that model the process under consideration. These have some interesting properties but are beyond the scope of this letter.

In [2], we introduced the predictive information rate (PIR), which is the average information in one observation about the infinite future given the infinite past. If ←X_t = (..., X_{t-2}, X_{t-1}) denotes the variables before time t, and →X_t = (X_{t+1}, X_{t+2}, ...) denotes those after t, the PIR is defined as a conditional mutual information:

    \mathcal{I}_t \triangleq I(X_t; \overrightarrow{X}_t \mid \overleftarrow{X}_t) = H(\overrightarrow{X}_t \mid \overleftarrow{X}_t) - H(\overrightarrow{X}_t \mid X_t, \overleftarrow{X}_t).    (10)

Equation (10) can be read as the average reduction in uncertainty about the future on learning X_t, given the past. Due to the symmetry of the mutual information, it can also be written as I_t = H(X_t | ←X_t) − H(X_t | →X_t, ←X_t). Here H(X_t | ←X_t) is the entropy rate h_μ, but H(X_t | →X_t, ←X_t) is a quantity that does not appear to have been considered by other authors yet. It is the conditional entropy of one variable given all the others in the sequence, future as well as past. We call this the residual entropy rate r_μ, and define it as a limit:

    r_\mu \triangleq \lim_{N \to \infty} H(X_{-N..N}) - H(X_{-N..-1}, X_{1..N}).    (11)

The second term, H(X_{-N..-1}, X_{1..N}), is the joint entropy of two non-adjacent blocks with a gap between them, and cannot be expressed as a function of block entropies alone. If we let b_μ denote the shift-invariant PIR, then b_μ = h_μ − r_μ (see Fig. 1).

Many of the measures reviewed above were intended as measures of 'complexity', a quality that is somewhat open to interpretation [11, 12]. It is generally agreed, however, that complexity should be low for systems that are deterministic or easy to compute or predict ('ordered'), and low for systems that are completely random and unpredictable ('disordered'). The PIR satisfies these conditions without being 'over-universal' in the sense of Crutchfield et al. [12, 13]: it is not simply a function of entropy or entropy rate that fails to distinguish between the different strengths of temporal dependency that can be exhibited by systems at a given level of entropy. In our analysis of Markov chains [2], we found that processes which maximise the PIR do not maximise the multi-information rate ρ_μ (or the excess entropy, which is the same in this case), but do have a certain kind of partial predictability that requires the observer continually to pay attention to the most recent observations in order to make optimal predictions. And so, while Crutchfield et al. make a compelling case for the excess entropy E and their statistical complexity C_μ as measures of complexity, there is still room to suggest that the PIR captures a different and nontrivial aspect of temporal dependency structure not previously examined.

III. BINDING INFORMATION

If the PIR is accumulated over successive time steps, a quantity which we call the binding information is obtained. To proceed, we first reformulate the infinite-sequence PIR (10) so that it becomes applicable to a finite sequence of random variables (X_1, ..., X_N):

    \mathcal{I}_t(X_{1..N}) = I(X_t; X_{(t+1)..N} \mid X_{1..(t-1)}).    (12)

Note that this is no longer shift-invariant and may depend on t. The binding information, then, is the sum

    B(X_{1..N}) = \sum_{t \in 1..N} \mathcal{I}_t(X_{1..N}).    (13)
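The sum (13) can be evaluated directly for a finite joint distribution by expanding each conditional mutual information (12) into joint entropies, I(X_t; X_{(t+1)..N} | X_{1..(t-1)}) = H(X_{1..t}) + H(X_{1..(t-1)}, X_{(t+1)..N}) − H(X_{1..(t-1)}) − H(X_{1..N}). A minimal sketch (ours), under the same pmf-as-array convention as above:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a pmf stored in an ndarray."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def binding_information(joint):
    """B(X_1..N) by accumulating the PIR terms of Eqs. (12)-(13)."""
    N = joint.ndim
    def H(axes):
        """Joint entropy of the marginal over the listed variable indices."""
        if not axes:
            return 0.0
        drop = tuple(j for j in range(N) if j not in axes)
        return entropy_bits(joint.sum(axis=drop)) if drop else entropy_bits(joint)
    B = 0.0
    for t in range(N):
        past, future = list(range(t)), list(range(t + 1, N))
        # I(X_t; X_future | X_past), expanded into joint entropies:
        B += (H(past + [t]) + H(past + future)
              - H(past) - H(past + [t] + future))
    return B
```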
Expanding this in terms of entropies and conditional entropies and cancelling terms yields

    B(X_{1..N}) = H(X_{1..N}) - \sum_{t \in 1..N} H(X_t \mid X_{1..N \setminus \{t\}}).    (14)

Like the multi-information, it measures dependencies between random variables, but in a different way (see Fig. 2).

[FIG. 2. Illustration of binding information as compared with multi-information for a set of four random variables: (a) H(X_{1..4}); (b) I(X_{1..4}); (c) B(X_{1..4}). In each case, the quantity is represented by the total amount of black ink, as it were, in the shaded parts of the diagram. Whereas the multi-information counts the multiply-overlapped areas multiple times, the binding information counts each overlapped area just once.]

Though the binding information was derived by accumulating the PIR sequentially, the result is permutation invariant, suggesting that the concept might be applicable to arbitrary sets of random variables regardless of their topology. Accordingly, we define the binding information as follows:

Definition 1. If {X_α | α ∈ 𝒜} is a set of random variables indexed by a countable set 𝒜, the binding information is

    B(X_\mathcal{A}) \triangleq H(X_\mathcal{A}) - \sum_{\alpha \in \mathcal{A}} H(X_\alpha \mid X_{\mathcal{A} \setminus \{\alpha\}}).    (15)

Since the binding information can be expressed as a sum of (conditional) mutual informations between sets of random variables (13), it is (a) non-negative and (b) invariant to invertible pointwise transformations of the variables; that is, if Y_𝒜 is a set of random variables such that, ∀α ∈ 𝒜, Y_α = f_α(X_α) for some invertible functions f_α, then B(Y_𝒜) = B(X_𝒜).

The binding information is zero for sets of independent random variables (the case of complete 'disorder') and zero when all variables have zero entropy, taking known values and representing a certain kind of 'order'. However, it is also possible to obtain low binding information for random systems which are nonetheless very ordered in a certain way. If each variable X_α is some function of X_{α′} for all α′ ≠ α, then the state of the entire system can be read off from any one of its component variables. In this case, it is easy to show that B(X_𝒜) = H(X_𝒜) = H(X_α) for any α ∈ 𝒜, which, as we will see, is relatively low compared with what is possible as soon as N becomes appreciably large. Thus, binding information is low for both highly 'ordered' and highly 'disordered' systems, but in this case, 'highly ordered' does not simply mean deterministic or known a priori: it means the whole is predictable from the smallest of its parts.
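As a quick numerical check (ours, not from the letter), Definition 1 can be computed via H(X_α | X_{𝒜\{α}}) = H(X_𝒜) − H(X_{𝒜\{α}}); on a random joint pmf it agrees with the sequential accumulation of eq. (13) computed by `binding_information` above, consistent with the permutation invariance just noted:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def binding_information_def1(joint):
    """B(X_A) via Eq. (15), using H(X_a | X_rest) = H(X_A) - H(X_rest)."""
    N = joint.ndim
    H_all = entropy_bits(joint)
    residual = sum(H_all - entropy_bits(joint.sum(axis=a)) for a in range(N))
    return H_all - residual

# A random joint pmf over four binary variables: Eq. (15) should match the
# Eq. (13) accumulation from the previous sketch to numerical precision.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()
print(binding_information_def1(p))
```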
[FIG. 3. Constraints on multi-information I(X_{1..N}) and binding information B(X_{1..N}) for a system of N = 6 binary random variables, plotted against the joint entropy H in bits (bounds I ≤ (N−1)H, I ≤ N−H, B ≤ H, B ≤ (N−1)(N−H)) and against each other (bounds I + B ≤ N, I ≤ (N−1)B, B ≤ (N−1)I). The labelled points represent identifiable distributions over the 2^N states that this system can occupy: (a) known state, the system is deterministically in one configuration; (b) giant bit, one of the P^6_B processes; (c) parity, the parity processes P^6_{2,0} or P^6_{2,1}; (d) independent, the system of independent unbiased random bits.]

IV. BOUNDS ON BINDING AND MULTI-INFORMATION

In this section we confine our attention to sets of discrete random variables taking values in a common alphabet containing K symbols. In this case, it is quite straightforward to derive upper bounds, as functions of the joint entropy, on both the multi-information and the binding information, and also upper bounds on multi-information and binding information as functions of each other. In [14], we prove the following results:

Theorem 1. If {X_α | α ∈ 𝒜} is a set of N = |𝒜| random variables all taking values in a discrete set of cardinality K, then the following constraints all hold:

    I(X_\mathcal{A}) \le N \log K - H(X_\mathcal{A})    (16)
    I(X_\mathcal{A}) \le (N-1) H(X_\mathcal{A})    (17)
    B(X_\mathcal{A}) \le H(X_\mathcal{A})    (18)
    B(X_\mathcal{A}) \le (N-1)(N \log K - H(X_\mathcal{A})).    (19)

Also, B(X_𝒜) and I(X_𝒜) are mutually constrained:

    I(X_\mathcal{A}) + B(X_\mathcal{A}) \le N \log K.    (20)

These bounds restrict I(X_𝒜) and B(X_𝒜) to two triangular regions of the plane when plotted against the joint entropy H(X_𝒜), and are illustrated for N = 6, K = 2 in Fig. 3. Two more linear bounds were suggested by empirical computations of binding information and multi-information:

    I(X_\mathcal{A}) \le (N-1) B(X_\mathcal{A})    (21)

and

    B(X_\mathcal{A}) \le (N-1) I(X_\mathcal{A}).    (22)

We have not found a general proof of these inequalities for all N, but we have constructed a numerical algorithm [14] that is able to find proofs for given values of N up to 37, at which point insufficient numerical precision becomes the limiting factor.

V. MAXIMISING BINDING INFORMATION

Is the absolute maximum of B(X_{1..N}) = (N−1) log K implied by Theorem 1 attainable, and by what kinds of processes? In [14] we prove the following:

Theorem 2. If {X_1, ..., X_N} is a set of discrete random variables each taking values in 0..(K−1), then B(X_{1..N}) is maximised at (N−1) log₂ K bits by the K 'modulo-K processes' P^N_{K,m} for m ∈ 0..(K−1), under which the probability of a configuration x ∈ (0..K−1)^N is

    P^N_{K,m}(x) = \begin{cases} K^{1-N} & \text{if } \left( \sum_{i=1}^N x_i \right) \bmod K = m, \\ 0 & \text{otherwise.} \end{cases}    (23)

When K = 2 (binary random variables) the maximal binding information of N−1 bits is reached by the two 'parity' processes: P^N_{2,0} is the 'even' process, which distributes uniform probability over all configurations with even parity; P^N_{2,1} is the 'odd' process, which distributes uniform probabilities over the complementary set. The multi-information of the parity processes is 1 bit. By contrast, the binary processes which maximise the multi-information at N−1 bits are the 'giant bit' processes: the indices 1..N are partitioned into a set B and its complement B̄ = 1..N \ B, and probabilities are assigned to configurations x ∈ {0,1}^N as follows:

    P^N_B(x) = \begin{cases} \tfrac{1}{2} & \text{if } \forall i \in 1..N,\ x_i = \mathbb{I}(i \in B), \\ \tfrac{1}{2} & \text{if } \forall i \in 1..N,\ x_i = \mathbb{I}(i \in \bar{B}), \\ 0 & \text{otherwise}, \end{cases}    (24)

where 𝕀(·) is 1 if its argument is true and 0 otherwise. The binding information of these processes is 1 bit. Thus we see that the processes which maximise the binding information and the multi-information are quite different in character.
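A short sketch (ours) constructing both extremal families for K = 2: with the `binding_information` and `multi_information` helpers sketched earlier, the parity process for N = 6 attains B = 5 bits with I = 1 bit, while the giant-bit process attains I = 5 bits with B = 1 bit, matching points (c) and (b) of Fig. 3.

```python
import numpy as np
from itertools import product

def parity_process(N, m=0):
    """P^N_{2,m} of Eq. (23): uniform over binary strings with parity m."""
    p = np.zeros((2,) * N)
    for x in product((0, 1), repeat=N):
        if sum(x) % 2 == m:
            p[x] = 2.0 ** (1 - N)
    return p

def giant_bit(N):
    """A giant-bit process of Eq. (24), taking B = 1..N: probability 1/2
    on each of the all-zeros and all-ones configurations."""
    p = np.zeros((2,) * N)
    p[(0,) * N] = 0.5
    p[(1,) * N] = 0.5
    return p

# With the earlier helpers: binding_information(parity_process(6)) -> 5.0
# and multi_information(parity_process(6)) -> 1.0; the giant-bit process
# gives the reverse, 1.0 and 5.0 respectively.
```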
VI. DISCUSSION AND CONCLUSIONS

As noted in §I, Bialek et al. argue that the predictive information I_pred(N), being the sub-extensive component of the entropy, is the unique measure of complexity that satisfies certain reasonable desiderata, including transformation invariance for continuous-valued variables [5, §5.3]. While lack of space precludes a full discussion, we note that transformation invariance does not, as Bialek et al. state [5, p. 2450], demand sub-extensivity: binding information is transformation invariant, since it is a sum of conditional mutual informations, and yet it can have an extensive component, since its intensive counterpart, the PIR, can have a well-defined value, e.g., in stationary Markov chains [2].

Measures of statistical dependency are discussed by Studený and Vejnarová [6, §4], who formulate a 'level-specific' measure that captures the dependency visible when fixed-size subsets of variables are examined in isolation. Studený and Vejnarová [6, p. 277] use the parity process as an example of a random process in which the dependence is only visible at the highest level, that is, amongst all N variables; if fewer than N variables are examined, they appear to be independent. They note that such processes were called 'pseudo-independent' by Xiang et al. [15], who concluded that standard algorithms for Bayesian network construction fail when applied to them. It is intriguing, then, that these are singled out as 'most complex' according to the binding information criterion.

To summarise, we have introduced binding information as a measure of statistical structure that can be applied to any countable set of random variables regardless of any topological organisation of the variables. Binding information is maximised in finite discrete-valued systems by the 'modulo processes'. Further results on binding information, and investigations of binding information in some specific random processes, are presented in [14].

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley and Sons, New York, 1991).
[2] S. A. Abdallah and M. D. Plumbley, Connection Science 21, 89 (2009).
[3] J. Crutchfield and N. Packard, Physica D: Nonlinear Phenomena 7, 201 (1983).
[4] P. Grassberger, International Journal of Theoretical Physics 25, 907 (1986).
[5] W. Bialek, I. Nemenman, and N. Tishby, Neural Computation 13, 2409 (2001).
[6] M. Studený and J. Vejnarová, in Learning in Graphical Models, edited by M. I. Jordan (MIT Press, 1998) pp. 261–297.
[7] S. Dubnov, Signal Processing Letters, IEEE 11, 698 (2004).
[8] I. Erb and N. Ay, Journal of Statistical Physics 115, 949 (2004).
[9] J. P. Crutchfield and K. Young, Physical Review Letters 63, 105 (1989).
[10] J. P. Crutchfield and D. P. Feldman, Physical Review E 55, 1239R (1997).
[11] C. H. Bennett, in Complexity, Entropy, and the Physics of Information, edited by W. H. Zurek (Addison-Wesley, 1990) pp. 137–148.
[12] D. P. Feldman and J. P. Crutchfield, Physics Letters A 238, 244 (1998).
[13] J. P. Crutchfield, D. P. Feldman, and C. R. Shalizi, Physical Review E 62, 2996 (2000).
[14] S. A. Abdallah and M. D. Plumbley, Predictive Information, Multi-information and Binding Information, Tech. Rep. C4DM-TR-10-10 (Queen Mary University of London, 2010).
[15] Y. Xiang, S. Wong, and N. Cercone, in Proc. 12th Conf. on Uncertainty in Artificial Intelligence (1996) pp. 564–571.
