Universal Prediction of Selected Bits


Authors: Tor Lattimore, Marcus Hutter, Vaibhav Gavane

Tor Lattimore (Australian National University, tor.lattimore@anu.edu.au), Marcus Hutter (Australian National University and ETH Zürich, marcus.hutter@anu.edu.au), Vaibhav Gavane (VIT University, Vellore, vaibhav.gavane@gmail.com)

20 July 2011

Abstract. Many learning tasks can be viewed as sequence prediction problems. For example, online classification can be converted to sequence prediction with the sequence being pairs of input/target data and where the goal is to correctly predict the target data given input data and previous input/target pairs. Solomonoff induction is known to solve the general sequence prediction problem, but only if the entire sequence is sampled from a computable distribution. In the case of classification and discriminative learning, though, only the targets need be structured (given the inputs). We show that the normalised version of Solomonoff induction can still be used in this case, and more generally that it can detect any recursive sub-pattern (regularity) within an otherwise completely unstructured sequence. It is also shown that the unnormalised version can fail to predict very simple recursive sub-patterns.

Contents
1 Introduction
2 Notation and Definitions
3 $M_{norm}$ Predicts Selected Bits
4 $M$ Fails to Predict Selected Bits
5 Discussion
A Table of Notation

Keywords: sequence prediction; Solomonoff induction; online classification; discriminative learning; algorithmic information theory.

1 Introduction

The sequence prediction problem is the task of predicting the next symbol $x_n$ after observing $x_1 x_2 \cdots x_{n-1}$. Solomonoff induction [Sol64a, Sol64b] solves this problem by taking inspiration from Occam's razor and Epicurus' principle of multiple explanations. These ideas are formalised in the field of Kolmogorov complexity, in particular by the universal a priori semi-measure $M$.
Let $\mu(x_n \mid x_1 \cdots x_{n-1})$ be the true (unknown) probability of seeing $x_n$ having already observed $x_1 \cdots x_{n-1}$. The celebrated result of Solomonoff [Sol64a] states that if $\mu$ is computable then

$$\lim_{n \to \infty} \left[ M(x_n \mid x_1 \cdots x_{n-1}) - \mu(x_n \mid x_1 \cdots x_{n-1}) \right] = 0 \quad \text{with } \mu\text{-probability } 1 \qquad (1)$$

That is, $M$ can learn the true underlying distribution from which the data is sampled with probability 1. Solomonoff induction is arguably the gold-standard predictor, universally solving many (passive) prediction problems [Hut04, Hut07, Sol64a].

However, Solomonoff induction makes no guarantees if $\mu$ is not computable. This would not be problematic if it were unreasonable to predict sequences sampled from an incomputable $\mu$, but this is not the case. Consider the sequence below, where every even bit is the same as the preceding odd bit, but where the odd bits may be chosen arbitrarily.

00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11   (2)

Any child will quickly learn the pattern that each even bit is the same as the preceding odd bit and will correctly predict the even bits. If Solomonoff induction is to be considered a truly intelligent predictor then it too should be able to predict the even bits. More generally, it should be able to detect any computable sub-pattern. It is this question, first posed in [Hut04, Hut09] and resisting attempts by experts for 6 years, that we address.

At first sight this appears to be an esoteric question, but consider the following problem. Suppose you are given a sequence of pairs $x_1 y_1 x_2 y_2 x_3 y_3 \cdots$ where $x_i$ is the data for an image (or feature vector) of a character and $y_i$ the corresponding ASCII code (class label) for that character. The goal of online classification is to construct a predictor that correctly predicts $y_i$ given $x_i$, based on the previously seen training pairs.
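To make the even-bit pattern concrete, here is a minimal sketch (plain Python, purely illustrative and unrelated to $M$ itself) of the predictor any child implements: copy the preceding odd bit. The sequence used is the one from Equation (2).

```python
def predict_even_bits(sequence):
    """Predict each even-positioned bit (positions 2, 4, ..., 1-based) as a
    copy of the preceding odd bit; return the fraction predicted correctly."""
    correct = 0
    total = 0
    for i in range(1, len(sequence), 2):  # 0-based odd index = even position
        prediction = sequence[i - 1]      # copy the preceding odd bit
        correct += (prediction == sequence[i])
        total += 1
    return correct / total

# The sequence from Equation (2): every even bit equals its preceding odd bit.
pairs = "00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11".split()
seq = "".join(pairs)
print(predict_even_bits(seq))  # -> 1.0 on this sub-pattern
```

On sequences of this form the simple rule is always correct, regardless of how the (possibly incomputable) odd bits are chosen; the question the paper addresses is whether Solomonoff induction matches this.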
It is reasonable to assume that there is a relatively simple pattern to generate $y_i$ given $x_i$ (humans and computers seem to find simple patterns for character recognition). However, it is not necessarily reasonable to assume there exists a simple, or even computable, underlying distribution generating the training data $x_i$. This problem is precisely what gave rise to discriminative learning [LS06].

It turns out that there exist sequences with even bits equal to preceding odd bits on which the conditional distribution of $M$ fails to converge to 1 on the even bits. On the other hand, it is known that $M$ is a defective measure, but may be normalised to a proper measure, $M_{norm}$. We show that this normalised version does converge on any recursive sub-pattern of any sequence, such as that in Equation (2). This outcome is unanticipated since (all?) other results in the field are independent of normalisation [Hut04, Hut07, LV08, Sol64a]. The proofs are completely different to the standard proofs of predictive results.

2 Notation and Definitions

We use similar notation to [Gác83, Gác08, Hut04]. For a more comprehensive introduction to Kolmogorov complexity and Solomonoff induction see [Hut04, Hut07, LV08, ZL70].

Strings. A finite binary string $x$ is a finite sequence $x_1 x_2 x_3 \cdots x_n$ with $x_i \in \mathbb{B} = \{0, 1\}$. Its length is denoted $\ell(x)$. An infinite binary string $\omega$ is an infinite sequence $\omega_1 \omega_2 \omega_3 \cdots$. The empty string of length zero is denoted $\epsilon$. $\mathbb{B}^n$ is the set of all binary strings of length $n$, $\mathbb{B}^*$ the set of all finite binary strings, and $\mathbb{B}^\infty$ the set of all infinite binary strings. Substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s, t \in \mathbb{N}$ and $s \leq t$; if $s > t$ then $x_{s:t} = \epsilon$. A useful shorthand is $x_{<t} := x_{1:t-1}$. We write $f(x) \overset{\times}{\geq} g(x)$ if there exists a $c > 0$ such that $f(x) \geq c \cdot g(x)$ for all $x$; $f(x) \overset{\times}{\leq} g(x)$ is defined similarly, and $f(x) \overset{\times}{=} g(x)$ if $f(x) \overset{\times}{\leq} g(x)$ and $f(x) \overset{\times}{\geq} g(x)$.
Definition 2 (Measures). We call $\mu : \mathbb{B}^* \to [0, 1]$ a semimeasure if $\mu(x) \geq \sum_{b \in \mathbb{B}} \mu(xb)$ for all $x \in \mathbb{B}^*$, and a probability measure if equality holds and $\mu(\epsilon) = 1$. $\mu(x)$ is the $\mu$-probability that a sequence starts with $x$. $\mu(b \mid x) := \mu(xb)/\mu(x)$ is the probability of observing $b \in \mathbb{B}$ given that $x \in \mathbb{B}^*$ has already been observed. A function $P : \mathbb{B}^* \to [0, 1]$ is a semi-distribution if $\sum_{x \in \mathbb{B}^*} P(x) \leq 1$ and a probability distribution if equality holds.

Definition 3 (Enumerable Functions). A real-valued function $f : A \to \mathbb{R}$ is enumerable if there exists a computable function $f : A \times \mathbb{N} \to \mathbb{Q}$ satisfying $\lim_{t \to \infty} f(a, t) = f(a)$ and $f(a, t+1) \geq f(a, t)$ for all $a \in A$ and $t \in \mathbb{N}$.

Definition 4 (Machines). A Turing machine $L$ is a recursively enumerable set (which may be finite) containing pairs of finite binary strings $(p_1, y_1), (p_2, y_2), (p_3, y_3), \cdots$. $L$ is a prefix machine if the set $\{p_1, p_2, p_3, \cdots\}$ is prefix-free (no program is a prefix of any other). It is a monotone machine if for all $(p, y), (q, x) \in L$ with $\ell(x) \geq \ell(y)$, $p \sqsubseteq q \implies y \sqsubseteq x$.

We define $L(p)$ to be the set of strings output by program $p$. This is different for monotone and prefix machines. For prefix machines, $L(p)$ contains only one element: $y \in L(p)$ if $(p, y) \in L$. For monotone machines, $y \in L(p)$ if there exists $(p, x) \in L$ with $y \sqsubseteq x$ and there does not exist a $(q, z) \in L$ with $q \sqsubset p$ and $y \sqsubseteq z$. For both machines, $L(p)$ represents the output of machine $L$ when given input $p$. If $L(p)$ does not exist then we say $L$ does not halt on input $p$. Note that for monotone machines it is possible for the same program to output multiple strings. For example, $(1, 1), (1, 11), (1, 111), (1, 1111), \cdots$ is a perfectly legitimate monotone Turing machine. For prefix machines this is not possible.
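As a concrete (and purely illustrative) instance of Definition 4, a finite monotone machine can be stored exactly as the text describes it, as a set of (program, output) pairs, and $L(p)$ computed directly from the membership condition; the machine below is the example from the text where one program outputs ever longer strings.

```python
def monotone_output(L, p):
    """Return L(p) for a finite monotone machine L given as a set of
    (program, output) pairs: y is in L(p) if y is a prefix of some x with
    (p, x) in L, and no (q, z) in L with q a proper prefix of p has y
    as a prefix of z (Definition 4)."""
    outputs = set()
    for q, x in L:
        if q == p:
            outputs |= {x[:i] for i in range(len(x) + 1)}  # all prefixes y of x
    for q, z in L:
        if q != p and p.startswith(q):  # q is a proper prefix of p
            outputs -= {z[:i] for i in range(len(z) + 1)}  # drop y with y a prefix of z
    return outputs

# The example from the text: program '1' outputs ever longer runs of 1s.
L = {('1', '1'), ('1', '11'), ('1', '111'), ('1', '1111')}
print(sorted(monotone_output(L, '1')))  # ['', '1', '11', '111', '1111']
```

Note that, as the text says, the single program '1' outputs multiple (prefix-ordered) strings, which is exactly what the prefix-machine condition forbids.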
Also note that if $L$ is a monotone machine and there exists an $x \in \mathbb{B}^*$ such that $x_{1:n} \in L(p)$ and $x_{1:m} \in L(p)$, then $x_{1:r} \in L(p)$ for all $n \leq r \leq m$.

Definition 5 (Complexity). Let $L$ be a prefix or monotone machine; then define

$$\lambda_L(y) := \sum_{p : y \in L(p)} 2^{-\ell(p)} \qquad C_L(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in L(p)\}$$

If $L$ is a prefix machine then we write $m_L(y) \equiv \lambda_L(y)$. If $L$ is a monotone machine then we write $M_L(y) \equiv \lambda_L(y)$.

Note that if $L$ is a prefix machine then $\lambda_L$ is an enumerable semi-distribution, while if $L$ is a monotone machine, $\lambda_L$ is an enumerable semi-measure. In fact, every enumerable semi-measure (or semi-distribution) can be represented via some machine $L$ as $\lambda_L$. For a prefix/monotone machine $L$ we write $L_t$ for the first $t$ program/output pairs in the recursive enumeration of $L$, so $L_t$ will be a finite set containing at most $t$ pairs. [1]

The set of all monotone (or prefix) machines is itself recursively enumerable [LV08], [2] which allows one to define a universal monotone machine $U_M$ as follows. Let $L_i$ be the $i$-th monotone machine in the recursive enumeration of monotone machines. Then

$$(i'p, y) \in U_M \iff (p, y) \in L_i$$

where $i'$ is a prefix coding of the integer $i$. A universal prefix machine, denoted $U_P$, is defined in a similar way. For details see [LV08].

[1] $L_t$ will contain exactly $t$ pairs unless $L$ is finite, in which case it will contain $t$ pairs until $t$ is greater than the size of $L$. This annoyance will never be problematic.
[2] Note the enumeration may include repetition, but this is unimportant in this case.

Theorem 6 (Universal Prefix/Monotone Machines). For the universal monotone machine $U_M$ and universal prefix machine $U_P$,

$$m_{U_P}(y) > c_L \, m_L(y) \text{ for all } y \in \mathbb{B}^* \qquad M_{U_M}(y) > c_L \, M_L(y) \text{ for all } y \in \mathbb{B}^*$$

where $c_L > 0$ depends on $L$ but not $y$. For a proof, see [LV08].
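Definition 5 can be illustrated on a small finite prefix machine. The machine below and its programs are made up for the example; it only shows how $\lambda_L$ is assembled from program lengths, and that a prefix-free program set keeps the total mass at most 1 (the Kraft inequality).

```python
def is_prefix_free(programs):
    """Check that no program is a proper prefix of another (prefix machine).
    After sorting, any prefix relation appears between lexicographic
    neighbours, so checking adjacent pairs suffices."""
    progs = sorted(programs)
    return all(not b.startswith(a) for a, b in zip(progs, progs[1:]))

def lam(machine, y):
    """lambda_L(y) = sum of 2^{-l(p)} over programs p with y in L(p)."""
    return sum(2.0 ** (-len(p)) for p, out in machine if out == y)

# A hypothetical finite prefix machine: (program, output) pairs.
L = {('00', '1'), ('01', '1'), ('10', '0'), ('110', '0')}
print(is_prefix_free([p for p, _ in L]))  # True

print(lam(L, '1'))  # 2^-2 + 2^-2 = 0.5
# For a prefix-free program set, the total lambda_L mass is at most 1:
print(lam(L, '0') + lam(L, '1'))  # 0.375 + 0.5 = 0.875 <= 1
```

This is the finite analogue of $\lambda_L$ being a semi-distribution; the missing mass corresponds to unused program codes.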
As usual, we will fix reference universal prefix/monotone machines $U_P$, $U_M$ and drop the subscripts by letting

$$m(y) := m_{U_P}(y) \equiv \sum_{p : y \in U_P(p)} 2^{-\ell(p)} \qquad M(y) := M_{U_M}(y) \equiv \sum_{p : y \in U_M(p)} 2^{-\ell(p)}$$

$$K(y) := C_{U_P}(y) \equiv \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_P(p)\} \qquad Km(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_M(p)\}$$

The choice of reference universal Turing machine is usually [3] unimportant, since a different choice varies $m$, $M$ by only a multiplicative constant, while $K$, $Km$ are varied by additive constants. For natural numbers $n$ we define $K(n)$ by $K(\langle n \rangle)$ where $\langle n \rangle$ is the binary representation of $n$.

$M$ is not a proper measure: $M(x) > M(x0) + M(x1)$ for all $x \in \mathbb{B}^*$, which means that $M(0 \mid x) + M(1 \mid x) < 1$, so $M$ assigns a non-zero probability that the sequence will end. This is because there are monotone programs $p$ that halt, or enter infinite loops. For this reason Solomonoff introduced a normalised version, $M_{norm}$, defined as follows.

Definition 7 (Normalisation).

$$M_{norm}(\epsilon) := 1 \qquad M_{norm}(y_{1:n}) := M_{norm}(y_{<n}) \cdot \frac{M(y_{1:n})}{M(y_{<n} 0) + M(y_{<n} 1)}$$

[...]

4 $M$ Fails to Predict Selected Bits

[...] since $p \sqsubset q$. For other values, $P(\cdot, t) = P(\cdot, t+1)$. Note that it is not possible that $p = q$, since then $x = y$ and duplicates are not added to $L$. Therefore $P$ is an enumerable semi-distribution. By Theorem 8 we have [...]. For each $\delta > 0$ there exists a $z \in \mathbb{B}^*$ such that $M(0 \mid z) + M(1 \mid z) < \delta$. This result is already known and is left as an exercise (4.5.6) with a proof sketch in [LV08]. For completeness, we include a proof.

Recall that $M(\cdot, t)$ is the function approximating $M(\cdot)$ from below. Fixing an $n$, define $z \in \mathbb{B}^*$ inductively as follows.

1. $z := \epsilon$.
2. Let $t$ be the first natural number such that $M(zb, t) > 2^{-n}$ for some $b \in \mathbb{B}$.
3. If $t$ exists then $z := z \neg b$ and repeat step 2. If $t$ does not exist then $z$ is left unchanged (forever).

Note that $z$ must be finite since each time it is extended, $M(zb, t) > 2^{-n}$.
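The three steps above can be sketched in code. This is only a simulation: the real $M(\cdot, t)$ is an enumeration of the universal semi-measure from below, while the toy semimeasure below is an assumed stand-in that is computed exactly (so the time index $t$ plays no role).

```python
def build_z(M, n, max_steps=1000):
    """Mirror steps 1-3: whenever M(zb) > 2^{-n} for some bit b, extend z
    with the negated bit (z := z not-b); stop once neither child of z has
    mass above the threshold."""
    z = ''
    threshold = 2.0 ** (-n)
    for _ in range(max_steps):
        heavy = [b for b in '01' if M(z + b) > threshold]
        if not heavy:
            return z  # t does not exist: z is left unchanged forever
        b = heavy[0]
        z += '1' if b == '0' else '0'  # z := z (not b)
    raise RuntimeError('z did not stabilise')

# Toy computable semimeasure concentrating its mass on 000...:
def M(x):
    return 2.0 ** (-len(x)) if set(x) <= {'0'} else 8.0 ** (-len(x))

z = build_z(M, n=2)
print(z)                        # '1': one negated extension suffices here
print(M(z + '0') + M(z + '1'))  # both children now have total mass <= 2^{1-n}
```

Because each extension removes at least $2^{-n}$ of mass from $M(z, t)$, the loop terminates, which is the finiteness argument the proof makes next.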
Therefore $M(z \neg b, t) < M(z, t) - 2^{-n}$, and so each time $z$ is extended, the value of $M(z, t)$ decreases by at least $2^{-n}$; eventually $M(zb, t) < 2^{-n}$ for all $b \in \mathbb{B}$. Now, once $z$ is no longer being extended ($t$ does not exist in step 3 above) we have

$$M(z0) + M(z1) \leq 2^{1-n}. \qquad (11)$$

However, we can also show that $M(z) \overset{\times}{\geq} 2^{-K(n)}$. The intuitive idea is that the process above requires only the value of $n$, which can be encoded in $K(n)$ bits. More formally, let $p$ be such that $n \in U_P(p)$ and note that the following set is recursively enumerable (but not recursive) by the process above:

$$L_p := \{(p, \epsilon), (p, z_{1:1}), (p, z_{1:2}), (p, z_{1:3}), \cdots, (p, z_{1:\ell(z)-1}), (p, z_{1:\ell(z)})\}.$$

Now take the union of all such sets, which is (a) recursively enumerable since $U_P$ is, and (b) a monotone machine because $U_P$ is a prefix machine:

$$L := \bigcup_{(p, n) \in U_P} L_p.$$

Therefore

$$M(z) \overset{\times}{\geq} M_L(z) \geq 2^{-K(n)} \qquad (12)$$

where the first inequality is from Theorem 6 and the second follows since, if $n^*$ is the program of length $K(n)$ with $U_P(n^*) = n$, then $(n^*, z_{1:\ell(z)}) \in L$. Combining Equations (11) and (12) gives

$$M(0 \mid z) + M(1 \mid z) \overset{\times}{\leq} 2^{1-n+K(n)}.$$

Since this tends to zero as $n$ goes to infinity, [5] for each $\delta > 0$ we can construct a $z \in \mathbb{B}^*$ satisfying $M(0 \mid z) + M(1 \mid z) < \delta$, as required.

For the second part of the proof, we construct $\omega$ by concatenation:

$$\omega := z_1 z_2 z_3 \cdots$$

where $z_n \in \mathbb{B}^*$ is chosen such that

$$M(0 \mid z_n) + M(1 \mid z_n) < \delta_n \qquad (13)$$

with $\delta_n$ to be chosen later. Now,

$$M(b \mid z_1 \cdots z_n) \equiv \frac{M(z_1 \cdots z_n b)}{M(z_1 \cdots z_n)} \qquad (14)$$

$$\overset{\times}{\leq} \left[ 2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})} \right] \frac{M(z_n b)}{M(z_n)} \qquad (15)$$

$$\equiv \left[ 2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})} \right] M(b \mid z_n) \qquad (16)$$

where Equation (14) is the definition of conditional probability.
Equation (15) follows by applying Lemma 13 with $x = z_1 z_2 \cdots z_{n-1}$ and $y = z_n$ or $z_n b$. Equation (16) is again the definition of conditional probability. Now let

$$\delta_n = \frac{2^{-n}}{2^{K(\ell(z_1 \cdots z_{n-1})) + K(z_1 \cdots z_{n-1})}}.$$

Combining this with Equations (13) and (16) gives

$$M(0 \mid z_1 \cdots z_n) + M(1 \mid z_1 \cdots z_n) \overset{\times}{\leq} 2^{-n}.$$

[5] An integer $n$ can easily be encoded in $2 \log n$ bits, so $K(n) \leq 2 \log n + c$ for some $c > 0$ independent of $n$.

Therefore, $\liminf_{n \to \infty} [M(0 \mid \omega$ [...] there exists a $c > 0$ such that $M(\bar{\omega}_{2n} \mid \bar{\omega}_{<2n}) > c$ for all $n \in \mathbb{N}$. In this sense $M$ can still be used to predict in the same way as $M_{norm}$, but it will never converge as in Equation (1).

5 Discussion

Summary. Theorem 10 shows that if an infinite sequence contains a computable sub-pattern then the normalised universal semi-measure $M_{norm}$ will eventually predict it. This means that Solomonoff's normalised version of induction is effective in the classification example given in the introduction. Note that we have only proven the binary case, but expect the proof will go through identically for an arbitrary finite alphabet. On the other hand, Theorem 12 shows that plain $M$ can fail to predict such structure, in the sense that the conditional distribution need not converge to 1 on the true sequence. This is because it is not a proper measure, and does not converge to one. These results are surprising since (all?) other predictive results, including Equation (1) and many others in [Hut04, Hut07, LV08, Sol64a], do not rely on normalisation.

Consequences. We have shown that $M_{norm}$ can predict recursive structure in infinite strings that are incomputable (even stochastically so). These results give hope that a Solomonoff-inspired algorithm may be effective at online classification, even when the training data is given in a completely unstructured way.
Note that while $M$ is enumerable and $M_{norm}$ is only approximable, [6] both conditional distributions are only approximable, which means it is no harder to predict using $M_{norm}$ than $M$.

Open Questions. A number of open questions were encountered in writing this paper.

1. Extend Theorem 10 to the stochastic case, where a sub-pattern is generated stochastically from a computable distribution rather than merely a computable function. It seems likely that a different approach will be required to solve this problem.

2. Another interesting question is to strengthen the result by proving a convergence rate. It may be possible to prove that, under the same conditions as Theorem 10, $\sum_{i=1}^{\infty} [1 - M_{norm}(\omega_{n_i} \mid \omega$ [...]

A Table of Notation

$f(x) \overset{\times}{\geq} g(x)$ : there exists a $c > 0$ such that $f(x) > c \cdot g(x)$ for all $x$
$f(x) \overset{\times}{\leq} g(x)$ : there exists a $c > 0$ such that $f(x) < c \cdot g(x)$ for all $x$
$f(x) \overset{\times}{=} g(x)$ : $f(x) \overset{\times}{\geq} g(x)$ and $f(x) \overset{\times}{\leq} g(x)$
$x \sqsubset y$ : $x$ is a prefix of $y$ and $\ell(x) < \ell(y)$
$x \sqsubseteq y$ : $x$ is a prefix of $y$
