Universal Prediction of Selected Bits


Authors: Tor Lattimore, Marcus Hutter, Vaibhav Gavane

Tor Lattimore (Australian National University, tor.lattimore@anu.edu.au), Marcus Hutter (Australian National University and ETH Zürich, marcus.hutter@anu.edu.au), Vaibhav Gavane (VIT University, Vellore, vaibhav.gavane@gmail.com)

20 July 2011

Abstract. Many learning tasks can be viewed as sequence prediction problems. For example, online classification can be converted to sequence prediction with the sequence being pairs of input/target data and where the goal is to correctly predict the target data given input data and previous input/target pairs. Solomonoff induction is known to solve the general sequence prediction problem, but only if the entire sequence is sampled from a computable distribution. In the case of classification and discriminative learning, though, only the targets need be structured (given the inputs). We show that the normalised version of Solomonoff induction can still be used in this case, and more generally that it can detect any recursive sub-pattern (regularity) within an otherwise completely unstructured sequence. It is also shown that the unnormalised version can fail to predict very simple recursive sub-patterns.

Contents
1 Introduction
2 Notation and Definitions
3 $M_{norm}$ Predicts Selected Bits
4 $M$ Fails to Predict Selected Bits
5 Discussion
A Table of Notation

Keywords: sequence prediction; Solomonoff induction; online classification; discriminative learning; algorithmic information theory.

1 Introduction

The sequence prediction problem is the task of predicting the next symbol $x_n$ after observing $x_1 x_2 \cdots x_{n-1}$. Solomonoff induction [Sol64a, Sol64b] solves this problem by taking inspiration from Occam's razor and Epicurus' principle of multiple explanations. These ideas are formalised in the field of Kolmogorov complexity, in particular by the universal a priori semi-measure $M$.
Let $\mu(x_n \mid x_1 \cdots x_{n-1})$ be the true (unknown) probability of seeing $x_n$ having already observed $x_1 \cdots x_{n-1}$. The celebrated result of Solomonoff [Sol64a] states that if $\mu$ is computable then

$$\lim_{n \to \infty} \left[ M(x_n \mid x_1 \cdots x_{n-1}) - \mu(x_n \mid x_1 \cdots x_{n-1}) \right] = 0 \quad \text{with } \mu\text{-probability } 1 \qquad (1)$$

That is, $M$ can learn the true underlying distribution from which the data is sampled with probability 1. Solomonoff induction is arguably the gold-standard predictor, universally solving many (passive) prediction problems [Hut04, Hut07, Sol64a].

However, Solomonoff induction makes no guarantees if $\mu$ is not computable. This would not be problematic if it were unreasonable to predict sequences sampled from an incomputable $\mu$, but this is not the case. Consider the sequence below, where every even bit is the same as the preceding odd bit, but where the odd bits may be chosen arbitrarily.

00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11   (2)

Any child will quickly learn the pattern that each even bit is the same as the preceding odd bit and will correctly predict the even bits. If Solomonoff induction is to be considered a truly intelligent predictor then it too should be able to predict the even bits. More generally, it should be able to detect any computable sub-pattern. It is this question, first posed in [Hut04, Hut09] and resisting attempts by experts for 6 years, that we address.

At first sight this appears to be an esoteric question, but consider the following problem. Suppose you are given a sequence of pairs $x_1 y_1 x_2 y_2 x_3 y_3 \cdots$ where $x_i$ is the data for an image (or feature vector) of a character and $y_i$ the corresponding ASCII code (class label) for that character. The goal of online classification is to construct a predictor that correctly predicts $y_i$ given $x_i$, based on the previously seen training pairs.
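To make the even-bit pattern concrete, here is a minimal sketch (plain Python, purely illustrative and unrelated to $M$ itself) of the predictor any child implements: copy the preceding odd bit. The sequence used is the one from Equation (2).

```python
def predict_even_bits(sequence):
    """Predict each even-positioned bit (positions 2, 4, ..., 1-based) as a
    copy of the preceding odd bit; return the fraction predicted correctly."""
    correct = 0
    total = 0
    for i in range(1, len(sequence), 2):  # 0-based odd index = even position
        prediction = sequence[i - 1]      # copy the preceding odd bit
        correct += (prediction == sequence[i])
        total += 1
    return correct / total

# The sequence from Equation (2): every even bit equals its preceding odd bit.
pairs = "00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11".split()
seq = "".join(pairs)
print(predict_even_bits(seq))  # -> 1.0 on this sub-pattern
```

On sequences of this form the simple rule is always correct, regardless of how the (possibly incomputable) odd bits are chosen; the question the paper addresses is whether Solomonoff induction matches this.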
It is reasonable to assume that there is a relatively simple pattern to generate $y_i$ given $x_i$ (humans and computers seem to find simple patterns for character recognition). However, it is not necessarily reasonable to assume there exists a simple, or even computable, underlying distribution generating the training data $x_i$. This problem is precisely what gave rise to discriminative learning [LS06].

It turns out that there exist sequences with even bits equal to preceding odd bits on which the conditional distribution of $M$ fails to converge to 1 on the even bits. On the other hand, it is known that $M$ is a defective measure, but may be normalised to a proper measure, $M_{norm}$. We show that this normalised version does converge on any recursive sub-pattern of any sequence, such as that in Equation (2). This outcome is unanticipated since (all?) other results in the field are independent of normalisation [Hut04, Hut07, LV08, Sol64a]. The proofs are completely different to the standard proofs of predictive results.

2 Notation and Definitions

We use similar notation to [Gác83, Gác08, Hut04]. For a more comprehensive introduction to Kolmogorov complexity and Solomonoff induction see [Hut04, Hut07, LV08, ZL70].

Strings. A finite binary string $x$ is a finite sequence $x_1 x_2 x_3 \cdots x_n$ with $x_i \in \mathbb{B} = \{0, 1\}$. Its length is denoted $\ell(x)$. An infinite binary string $\omega$ is an infinite sequence $\omega_1 \omega_2 \omega_3 \cdots$. The empty string of length zero is denoted $\epsilon$. $\mathbb{B}^n$ is the set of all binary strings of length $n$, $\mathbb{B}^*$ the set of all finite binary strings, and $\mathbb{B}^\infty$ the set of all infinite binary strings. Substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s, t \in \mathbb{N}$ and $s \leq t$; if $s > t$ then $x_{s:t} = \epsilon$. A useful shorthand is $x_{<t} := x_{1:t-1}$. We write $f(x) \overset{\times}{\geq} g(x)$ if there exists a $c > 0$ such that $f(x) \geq c \cdot g(x)$ for all $x$; $f(x) \overset{\times}{\leq} g(x)$ is defined similarly, and $f(x) \overset{\times}{=} g(x)$ if $f(x) \overset{\times}{\leq} g(x)$ and $f(x) \overset{\times}{\geq} g(x)$.
Definition 2 (Measures). We call $\mu : \mathbb{B}^* \to [0, 1]$ a semimeasure if $\mu(x) \geq \sum_{b \in \mathbb{B}} \mu(xb)$ for all $x \in \mathbb{B}^*$, and a probability measure if equality holds and $\mu(\epsilon) = 1$. $\mu(x)$ is the $\mu$-probability that a sequence starts with $x$. $\mu(b \mid x) := \mu(xb)/\mu(x)$ is the probability of observing $b \in \mathbb{B}$ given that $x \in \mathbb{B}^*$ has already been observed. A function $P : \mathbb{B}^* \to [0, 1]$ is a semi-distribution if $\sum_{x \in \mathbb{B}^*} P(x) \leq 1$ and a probability distribution if equality holds.

Definition 3 (Enumerable Functions). A real-valued function $f : A \to \mathbb{R}$ is enumerable if there exists a computable function $f : A \times \mathbb{N} \to \mathbb{Q}$ satisfying $\lim_{t \to \infty} f(a, t) = f(a)$ and $f(a, t+1) \geq f(a, t)$ for all $a \in A$ and $t \in \mathbb{N}$.

Definition 4 (Machines). A Turing machine $L$ is a recursively enumerable set (which may be finite) containing pairs of finite binary strings $(p_1, y_1), (p_2, y_2), (p_3, y_3), \cdots$. $L$ is a prefix machine if the set $\{p_1, p_2, p_3, \cdots\}$ is prefix-free (no program is a prefix of any other). It is a monotone machine if for all $(p, y), (q, x) \in L$ with $\ell(x) \geq \ell(y)$, $p \sqsubseteq q \implies y \sqsubseteq x$.

We define $L(p)$ to be the set of strings output by program $p$. This is different for monotone and prefix machines. For prefix machines, $L(p)$ contains only one element: $y \in L(p)$ if $(p, y) \in L$. For monotone machines, $y \in L(p)$ if there exists $(p, x) \in L$ with $y \sqsubseteq x$ and there does not exist a $(q, z) \in L$ with $q \sqsubset p$ and $y \sqsubseteq z$. For both machines, $L(p)$ represents the output of machine $L$ when given input $p$. If $L(p)$ does not exist then we say $L$ does not halt on input $p$. Note that for monotone machines it is possible for the same program to output multiple strings. For example, $(1, 1), (1, 11), (1, 111), (1, 1111), \cdots$ is a perfectly legitimate monotone Turing machine. For prefix machines this is not possible.
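As a concrete (and purely illustrative) instance of Definition 4, a finite monotone machine can be stored exactly as the text describes it, as a set of (program, output) pairs, and $L(p)$ computed directly from the membership condition; the machine below is the example from the text where one program outputs ever longer strings.

```python
def monotone_output(L, p):
    """Return L(p) for a finite monotone machine L given as a set of
    (program, output) pairs: y is in L(p) if y is a prefix of some x with
    (p, x) in L, and no (q, z) in L with q a proper prefix of p has y
    as a prefix of z (Definition 4)."""
    outputs = set()
    for q, x in L:
        if q == p:
            outputs |= {x[:i] for i in range(len(x) + 1)}  # all prefixes y of x
    for q, z in L:
        if q != p and p.startswith(q):  # q is a proper prefix of p
            outputs -= {z[:i] for i in range(len(z) + 1)}  # drop y with y a prefix of z
    return outputs

# The example from the text: program '1' outputs ever longer runs of 1s.
L = {('1', '1'), ('1', '11'), ('1', '111'), ('1', '1111')}
print(sorted(monotone_output(L, '1')))  # ['', '1', '11', '111', '1111']
```

Note that, as the text says, the single program '1' outputs multiple (prefix-ordered) strings, which is exactly what the prefix-machine condition forbids.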
Also note that if $L$ is a monotone machine and there exists an $x \in \mathbb{B}^*$ such that $x_{1:n} \in L(p)$ and $x_{1:m} \in L(p)$, then $x_{1:r} \in L(p)$ for all $n \leq r \leq m$.

Definition 5 (Complexity). Let $L$ be a prefix or monotone machine; then define

$$\lambda_L(y) := \sum_{p : y \in L(p)} 2^{-\ell(p)} \qquad C_L(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in L(p)\}$$

If $L$ is a prefix machine then we write $m_L(y) \equiv \lambda_L(y)$. If $L$ is a monotone machine then we write $M_L(y) \equiv \lambda_L(y)$.

Note that if $L$ is a prefix machine then $\lambda_L$ is an enumerable semi-distribution, while if $L$ is a monotone machine, $\lambda_L$ is an enumerable semi-measure. In fact, every enumerable semi-measure (or semi-distribution) can be represented via some machine $L$ as $\lambda_L$. For a prefix/monotone machine $L$ we write $L_t$ for the first $t$ program/output pairs in the recursive enumeration of $L$, so $L_t$ will be a finite set containing at most $t$ pairs. [1]

The set of all monotone (or prefix) machines is itself recursively enumerable [LV08], [2] which allows one to define a universal monotone machine $U_M$ as follows. Let $L_i$ be the $i$-th monotone machine in the recursive enumeration of monotone machines. Then

$$(i'p, y) \in U_M \iff (p, y) \in L_i$$

where $i'$ is a prefix coding of the integer $i$. A universal prefix machine, denoted $U_P$, is defined in a similar way. For details see [LV08].

[1] $L_t$ will contain exactly $t$ pairs unless $L$ is finite, in which case it will contain $t$ pairs until $t$ is greater than the size of $L$. This annoyance will never be problematic.
[2] Note the enumeration may include repetition, but this is unimportant in this case.

Theorem 6 (Universal Prefix/Monotone Machines). For the universal monotone machine $U_M$ and universal prefix machine $U_P$,

$$m_{U_P}(y) > c_L \, m_L(y) \text{ for all } y \in \mathbb{B}^* \qquad M_{U_M}(y) > c_L \, M_L(y) \text{ for all } y \in \mathbb{B}^*$$

where $c_L > 0$ depends on $L$ but not $y$. For a proof, see [LV08].
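Definition 5 can be illustrated on a small finite prefix machine. The machine below and its programs are made up for the example; it only shows how $\lambda_L$ is assembled from program lengths, and that a prefix-free program set keeps the total mass at most 1 (the Kraft inequality).

```python
def is_prefix_free(programs):
    """Check that no program is a proper prefix of another (prefix machine).
    After sorting, any prefix relation appears between lexicographic
    neighbours, so checking adjacent pairs suffices."""
    progs = sorted(programs)
    return all(not b.startswith(a) for a, b in zip(progs, progs[1:]))

def lam(machine, y):
    """lambda_L(y) = sum of 2^{-l(p)} over programs p with y in L(p)."""
    return sum(2.0 ** (-len(p)) for p, out in machine if out == y)

# A hypothetical finite prefix machine: (program, output) pairs.
L = {('00', '1'), ('01', '1'), ('10', '0'), ('110', '0')}
print(is_prefix_free([p for p, _ in L]))  # True

print(lam(L, '1'))  # 2^-2 + 2^-2 = 0.5
# For a prefix-free program set, the total lambda_L mass is at most 1:
print(lam(L, '0') + lam(L, '1'))  # 0.375 + 0.5 = 0.875 <= 1
```

This is the finite analogue of $\lambda_L$ being a semi-distribution; the missing mass corresponds to unused program codes.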
As usual, we will fix reference universal prefix/monotone machines $U_P$, $U_M$ and drop the subscripts by letting

$$m(y) := m_{U_P}(y) \equiv \sum_{p : y \in U_P(p)} 2^{-\ell(p)} \qquad M(y) := M_{U_M}(y) \equiv \sum_{p : y \in U_M(p)} 2^{-\ell(p)}$$

$$K(y) := C_{U_P}(y) \equiv \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_P(p)\} \qquad Km(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_M(p)\}$$

The choice of reference universal Turing machine is usually [3] unimportant, since a different choice varies $m$, $M$ by only a multiplicative constant, while $K$, $Km$ are varied by additive constants. For natural numbers $n$ we define $K(n)$ by $K(\langle n \rangle)$ where $\langle n \rangle$ is the binary representation of $n$.

$M$ is not a proper measure: $M(x) > M(x0) + M(x1)$ for all $x \in \mathbb{B}^*$, which means that $M(0 \mid x) + M(1 \mid x) < 1$, so $M$ assigns a non-zero probability that the sequence will end. This is because there are monotone programs $p$ that halt, or enter infinite loops. For this reason Solomonoff introduced a normalised version, $M_{norm}$, defined as follows.

Definition 7 (Normalisation).

$$M_{norm}(\epsilon) := 1 \qquad M_{norm}(y_{1:n}) := M_{norm}(y_{<n}) \cdot \frac{M(y_{1:n})}{M(y_{<n} 0) + M(y_{<n} 1)}$$

[...]

4 $M$ Fails to Predict Selected Bits

[...] since $p \sqsubset q$. For other values, $P(\cdot, t) = P(\cdot, t+1)$. Note that it is not possible that $p = q$, since then $x = y$ and duplicates are not added to $L$. Therefore $P$ is an enumerable semi-distribution. By Theorem 8 we have [...]. For each $\delta > 0$ there exists a $z \in \mathbb{B}^*$ such that $M(0 \mid z) + M(1 \mid z) < \delta$. This result is already known and is left as an exercise (4.5.6) with a proof sketch in [LV08]. For completeness, we include a proof.

Recall that $M(\cdot, t)$ is the function approximating $M(\cdot)$ from below. Fixing an $n$, define $z \in \mathbb{B}^*$ inductively as follows.

1. $z := \epsilon$.
2. Let $t$ be the first natural number such that $M(zb, t) > 2^{-n}$ for some $b \in \mathbb{B}$.
3. If $t$ exists then $z := z \neg b$ and repeat step 2. If $t$ does not exist then $z$ is left unchanged (forever).

Note that $z$ must be finite since each time it is extended, $M(zb, t) > 2^{-n}$.
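The three steps above can be sketched in code. This is only a simulation: the real $M(\cdot, t)$ is an enumeration of the universal semi-measure from below, while the toy semimeasure below is an assumed stand-in that is computed exactly (so the time index $t$ plays no role).

```python
def build_z(M, n, max_steps=1000):
    """Mirror steps 1-3: whenever M(zb) > 2^{-n} for some bit b, extend z
    with the negated bit (z := z not-b); stop once neither child of z has
    mass above the threshold."""
    z = ''
    threshold = 2.0 ** (-n)
    for _ in range(max_steps):
        heavy = [b for b in '01' if M(z + b) > threshold]
        if not heavy:
            return z  # t does not exist: z is left unchanged forever
        b = heavy[0]
        z += '1' if b == '0' else '0'  # z := z (not b)
    raise RuntimeError('z did not stabilise')

# Toy computable semimeasure concentrating its mass on 000...:
def M(x):
    return 2.0 ** (-len(x)) if set(x) <= {'0'} else 8.0 ** (-len(x))

z = build_z(M, n=2)
print(z)                        # '1': one negated extension suffices here
print(M(z + '0') + M(z + '1'))  # both children now have total mass <= 2^{1-n}
```

Because each extension removes at least $2^{-n}$ of mass from $M(z, t)$, the loop terminates, which is the finiteness argument the proof makes next.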
Therefore $M(z \neg b, t) < M(z, t) - 2^{-n}$, and so each time $z$ is extended, the value of $M(z, t)$ decreases by at least $2^{-n}$; eventually $M(zb, t) < 2^{-n}$ for all $b \in \mathbb{B}$. Now, once $z$ is no longer being extended ($t$ does not exist in step 3 above) we have

$$M(z0) + M(z1) \leq 2^{1-n}. \qquad (11)$$

However, we can also show that $M(z) \overset{\times}{\geq} 2^{-K(n)}$. The intuitive idea is that the process above requires only the value of $n$, which can be encoded in $K(n)$ bits. More formally, let $p$ be such that $n \in U_P(p)$ and note that the following set is recursively enumerable (but not recursive) by the process above:

$$L_p := \{(p, \epsilon), (p, z_{1:1}), (p, z_{1:2}), (p, z_{1:3}), \cdots, (p, z_{1:\ell(z)-1}), (p, z_{1:\ell(z)})\}.$$

Now take the union of all such sets, which is (a) recursively enumerable since $U_P$ is, and (b) a monotone machine because $U_P$ is a prefix machine:

$$L := \bigcup_{(p, n) \in U_P} L_p.$$

Therefore

$$M(z) \overset{\times}{\geq} M_L(z) \geq 2^{-K(n)} \qquad (12)$$

where the first inequality is from Theorem 6 and the second follows since, if $n^*$ is the program of length $K(n)$ with $U_P(n^*) = n$, then $(n^*, z_{1:\ell(z)}) \in L$. Combining Equations (11) and (12) gives

$$M(0 \mid z) + M(1 \mid z) \overset{\times}{\leq} 2^{1-n+K(n)}.$$

Since this tends to zero as $n$ goes to infinity, [5] for each $\delta > 0$ we can construct a $z \in \mathbb{B}^*$ satisfying $M(0 \mid z) + M(1 \mid z) < \delta$, as required.

For the second part of the proof, we construct $\omega$ by concatenation:

$$\omega := z_1 z_2 z_3 \cdots$$

where $z_n \in \mathbb{B}^*$ is chosen such that

$$M(0 \mid z_n) + M(1 \mid z_n) < \delta_n \qquad (13)$$

with $\delta_n$ to be chosen later. Now,

$$M(b \mid z_1 \cdots z_n) \equiv \frac{M(z_1 \cdots z_n b)}{M(z_1 \cdots z_n)} \qquad (14)$$

$$\overset{\times}{\leq} \left[ 2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})} \right] \frac{M(z_n b)}{M(z_n)} \qquad (15)$$

$$\equiv \left[ 2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})} \right] M(b \mid z_n) \qquad (16)$$

where Equation (14) is the definition of conditional probability.
Equation (15) follows by applying Lemma 13 with $x = z_1 z_2 \cdots z_{n-1}$ and $y = z_n$ or $z_n b$. Equation (16) is again the definition of conditional probability. Now let

$$\delta_n = \frac{2^{-n}}{2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})}}.$$

Combining this with Equations (13) and (16) gives

$$M(0 \mid z_1 \cdots z_n) + M(1 \mid z_1 \cdots z_n) \overset{\times}{\leq} 2^{-n}.$$

[5] An integer $n$ can easily be encoded in $2 \log n$ bits, so $K(n) \leq 2 \log n + c$ for some $c > 0$ independent of $n$.

Therefore, $\liminf_{n \to \infty} [M(0 \mid \omega$ [...] there exists a $c > 0$ such that $M(\bar{\omega}_{2n} \mid \bar{\omega}_{<2n}) > c$ for all $n \in \mathbb{N}$. In this sense $M$ can still be used to predict in the same way as $M_{norm}$, but it will never converge as in Equation (1).

5 Discussion

Summary. Theorem 10 shows that if an infinite sequence contains a computable sub-pattern then the normalised universal semi-measure $M_{norm}$ will eventually predict it. This means that Solomonoff's normalised version of induction is effective in the classification example given in the introduction. Note that we have only proven the binary case, but expect the proof will go through identically for an arbitrary finite alphabet. On the other hand, Theorem 12 shows that plain $M$ can fail to predict such structure, in the sense that the conditional distribution need not converge to 1 on the true sequence. This is because it is not a proper measure, and does not converge to one. These results are surprising since (all?) other predictive results, including Equation (1) and many others in [Hut04, Hut07, LV08, Sol64a], do not rely on normalisation.

Consequences. We have shown that $M_{norm}$ can predict recursive structure in infinite strings that are incomputable (even stochastically so). These results give hope that a Solomonoff-inspired algorithm may be effective at online classification, even when the training data is given in a completely unstructured way.
Note that while $M$ is enumerable and $M_{norm}$ is only approximable, [6] both conditional distributions are only approximable, which means it is no harder to predict using $M_{norm}$ than $M$.

Open Questions. A number of open questions were encountered in writing this paper.

1. Extend Theorem 10 to the stochastic case, where a sub-pattern is generated stochastically from a computable distribution rather than merely a computable function. It seems likely that a different approach will be required to solve this problem.

2. Another interesting question is to strengthen the result by proving a convergence rate. It may be possible to prove that, under the same conditions as Theorem 10, $\sum_{i=1}^{\infty} [1 - M_{norm}(\omega_{n_i} \mid \omega$ [...]

A Table of Notation

$f(x) \overset{\times}{\geq} g(x)$ : there exists a $c > 0$ such that $f(x) > c \cdot g(x)$ for all $x$
$f(x) \overset{\times}{\leq} g(x)$ : there exists a $c > 0$ such that $f(x) < c \cdot g(x)$ for all $x$
$f(x) \overset{\times}{=} g(x)$ : $f(x) \overset{\times}{\geq} g(x)$ and $f(x) \overset{\times}{\leq} g(x)$
$x \sqsubset y$ : $x$ is a prefix of $y$ and $\ell(x) < \ell(y)$
$x \sqsubseteq y$ : $x$ is a prefix of $y$
