Kolmogorov complexity in perspective
Authors: Marie Ferbus-Zanda (LIAFA), Serge Grigorieff (LIAFA)
Marie Ferbus-Zanda
LIAFA, CNRS & Université Paris 7
2, pl. Jussieu 75251 Paris Cedex 05 France
ferbus@logique.jussieu.fr

Serge Grigorieff
LIAFA, CNRS & Université Paris 7
2, pl. Jussieu 75251 Paris Cedex 05 France
seg@liafa.jussieu.fr

November 4, 2018

Contents

1 Three approaches to the quantitative definition of information
  1.1 Which information?
    1.1.1 About anything...
    1.1.2 Especially words
  1.2 Combinatorial approach
    1.2.1 Constant-length codes
    1.2.2 Variable-length prefix codes
    1.2.3 Entropy of a distribution of frequencies
    1.2.4 Shannon's source coding theorem for symbol codes
    1.2.5 Closer to the entropy
    1.2.6 Coding finitely many words with one word
  1.3 Probabilistic approach
  1.4 Algorithmic approach
    1.4.1 Berry's paradox
    1.4.2 The turn to computability
    1.4.3 Digression on computability theory
  1.5 Kolmogorov complexity and the invariance theorem
    1.5.1 Program size complexity or Kolmogorov complexity
    1.5.2 The invariance theorem
    1.5.3 About the constant
    1.5.4 Conditional Kolmogorov complexity
    1.5.5 Simple upper bounds for Kolmogorov complexity
  1.6 Oracular Kolmogorov complexity
2 Kolmogorov complexity and undecidability
  2.1 K is unbounded
  2.2 K is not computable
  2.3 K is computable from above
  2.4 Kolmogorov complexity and Gödel's incompleteness theorem
3 Formalization of randomness for finite objects
  3.1 Probabilities: laws about a non formalized intuition
  3.2 The 100 heads paradoxical result in probability theory
  3.3 Kolmogorov's proposal: incompressible strings
    3.3.1 Incompressibility with Kolmogorov complexity
    3.3.2 Incompressibility with length conditional Kolmogorov complexity
  3.4 Incompressibility is randomness: Martin-Löf's argument
  3.5 Randomness: a new foundation for probability theory?
4 Formalization of randomness for infinite objects
  4.1 Martin-Löf approach with topology and computability
  4.2 The bottom-up approach
    4.2.1 The naive idea badly fails
    4.2.2 Miller & Yu's theorem
    4.2.3 Kolmogorov randomness and ∅′
    4.2.4 Variants of Kolmogorov complexity and randomness
5 Application of Kolmogorov complexity to classification
  5.1 What is the problem?
  5.2 Classification via compression
    5.2.1 The normalized information distance NID
    5.2.2 The normalized compression distance NCD
  5.3 The Google classification
    5.3.1 The normalized Google distance NGD
    5.3.2 Discussing the method
  5.4 Some final remarks

Abstract

We survey the diverse approaches to the notion of information content: from Shannon entropy to Kolmogorov complexity. The main applications of Kolmogorov complexity are presented: namely, the mathematical notion of randomness (which goes back to the 60's with the work of Martin-Löf, Schnorr, Chaitin, Levin), and classification, which is a recent idea with provocative implementation by Vitanyi and Cilibrasi.

Note. Following Robert Soare's recommendations in [35], which have now gained large agreement, we shall write computable and computably enumerable in place of the old fashioned recursive and recursively enumerable.

Notation. By log(x) we mean the logarithm of x in base 2. By ⌊x⌋ we mean the "floor" of x, i.e. the largest integer ≤ x. Similarly, ⌈x⌉ denotes the "ceil" of x, i.e. the smallest integer ≥ x. Recall that the length of the binary representation of a non negative integer n is 1 + ⌊log n⌋.
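The closing remark of the Notation paragraph is easy to check mechanically. A minimal sanity check in Python (our own illustration, not from the paper):

```python
from math import floor, log2

# The binary representation of n >= 1 has length 1 + floor(log2(n)).
for n in range(1, 1000):
    assert len(format(n, "b")) == 1 + floor(log2(n))
```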
1 Three approaches to the quantitative definition of information

A title borrowed from Kolmogorov's seminal paper, 1965 [22].

1.1 Which information?

1.1.1 About anything...

About anything can be seen as conveying information. As usual in mathematical modelization, we retain only a few features of some real entity or process, and associate to them some finite or infinite mathematical objects. For instance,
• - an integer or a rational number or a word in some alphabet,
  - a finite sequence or a finite set of such objects,
  - a finite graph,...
• - a real,
  - a finite or infinite sequence of reals or a set of reals,
  - a function over words or numbers,...
This is very much as with probability spaces. For instance, to modelize the distributions of 6 balls into 3 cells (cf. Feller's book [16] §I2, II5) we forget everything about the nature of balls and cells and of the distribution process, retaining only two questions: "how many are they?" and "are they distinguishable or not?". Accordingly, the modelization considers
- either the 729 = 3^6 maps from the set of balls into the set of cells in case the balls are distinguishable and so are the cells (this is what is done in Maxwell-Boltzmann statistics),
- or the 28 = C(3+6−1, 6) triples of non negative integers with sum 6 in case the cells are distinguishable but not the balls (this is what is done in Bose-Einstein statistics),
- or the 7 sets of at most 3 integers with sum 6 in case the balls are undistinguishable and so are the cells.

1.1.2 Especially words

In information theory, special emphasis is made on information conveyed by words on finite alphabets. I.e. on sequential information as opposed to the obviously massively parallel and interactive distribution of information in real entities and processes.
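The three counts in the balls-into-cells example above can be recomputed directly. A small Python sketch (ours, including the hypothetical `partitions` helper):

```python
from math import comb

balls, cells = 6, 3

# Maxwell-Boltzmann: balls and cells both distinguishable
mb = cells ** balls                       # 3^6 = 729

# Bose-Einstein: cells distinguishable, balls not
be = comb(cells + balls - 1, balls)       # C(8, 6) = 28

def partitions(n, k, max_part=None):
    """Number of partitions of n into at most k parts (parts nonincreasing)."""
    if max_part is None:
        max_part = n
    if n == 0:
        return 1
    if k == 0:
        return 0
    return sum(partitions(n - p, k - 1, p) for p in range(1, min(n, max_part) + 1))

# Both indistinguishable: partitions of 6 into at most 3 parts
assert (mb, be, partitions(balls, cells)) == (729, 28, 7)
```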
A drastic reduction, which allows for mathematical developments (but also illustrates the Italian saying "traduttore, traditore!"). As is largely popularized by computer science, any finite alphabet with more than two letters can be reduced to one with exactly two letters. For instance, as exemplified by the ASCII code (American Standard Code for Information Interchange), any symbol used in written English – namely the lowercase and uppercase letters, the decimal digits, the diverse punctuation marks, the space, apostrophe, quote, left and right parentheses – can be coded by length 7 binary words (corresponding to the 128 ASCII codes). Which leads to a simple way to code any English text by a binary word (which is 7 times longer).¹

Though quite rough, the length of a word is the basic measure of its information content. Now a fairness issue faces us: the richer the alphabet, the shorter the word. Considering groups of k successive letters as new letters of a super-alphabet, one trivially divides the length by k. For instance, a length n binary word becomes a length ⌈n/8⌉ word over the 256-letter alphabet obtained with the usual packing of bits by groups of 8 (called bytes) which is done in computers. This is why length considerations will always be developed relative to binary alphabets. A choice to be considered as a normalization of length.

Finally, we come to the basic idea to measure the information content of a mathematical object x:

    information content of x = length of a shortest binary word which "encodes" x

What we mean precisely by "encodes" is the crucial question. Following the trichotomy pointed out by Kolmogorov [22], we survey three approaches.

¹ For other European languages which have a lot of diacritic marks, one has to consider the 256 codes of the Extended ASCII code.
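The 7-bit coding of English text just described can be sketched in a few lines of Python (our own illustration; the name `to_binary_word` is ours, and we assume the text is plain ASCII):

```python
def to_binary_word(text):
    """Code an ASCII text as a binary word, 7 bits per character."""
    assert all(ord(c) < 128 for c in text), "plain ASCII only"
    return "".join(format(ord(c), "07b") for c in text)

# The binary word is 7 times longer than the original text.
w = to_binary_word("OK")
assert len(w) == 7 * len("OK")
```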
1.2 Combinatorial approach

1.2.1 Constant-length codes

Let's consider the family A^n of length n words in an alphabet A with s letters a_1, ..., a_s. Coding the a_i's by binary words w_i's all of length ⌈log s⌉, to any word u in A^n we can associate the binary word ξ obtained by substituting the w_i's to the occurrences of the a_i's in u. Clearly, ξ has length n⌈log s⌉. Also, the map u ↦ ξ is very simple. Mathematically, it can be considered as a morphism from words in alphabet A to binary words relative to the algebraic structure (of monoid) given by the concatenation product of words.

Observing that n log s can be smaller than n⌈log s⌉, a modest improvement is possible which saves about n⌈log s⌉ − n log s bits. The improved map u ↦ ξ is essentially a change of base: looking at u as the base s representation of an integer k, the word ξ is the base 2 representation of k. Now, the map u ↦ ξ is no more a morphism. However, it is still quite simple and can be computed by a finite automaton.

We have to consider k-adic representations rather than the usual k-ary ones. The difference is simple: instead of using digits 0, 1, ..., k−1, use digits 1, ..., k. The interpretation as a sum of successive exponentials of k is unchanged and so are all usual algorithms for arithmetical operations. Also, the lexicographic ordering on k-adic representations corresponds to the natural order on integers. For instance, the successive integers 0, 1, 2, 3, 4, 5, 6, 7, written 0, 1, 10, 11, 100, 101, 110, 111 in binary (i.e. 2-ary), have as 2-adic representations the empty word (for 0) and then the words 1, 2, 11, 12, 21, 22, 111. Whereas the length of the k-ary representation of x is 1 + ⌊log x / log k⌋, its k-adic representation has length ⌊log(x+1) / log k⌋.

Let's interpret the length n word u as the s-adic representation of an integer x between t = s^(n−1) + ... + s^2 + s + 1 and t′ = s^n + ... + s^2 + s (which correspond to the length n words 11...1 and ss...s). Let ξ be the 2-adic representation of this integer x. The length of ξ is

    ≤ ⌊log(t′ + 1)⌋ = ⌊log((s^(n+1) − 1)/(s − 1))⌋ ≤ ⌊(n + 1) log s − log(s − 1)⌋ = ⌊n log s − log(1 − 1/s)⌋

which differs from n log s by at most 1.

1.2.2 Variable-length prefix codes

Instead of coding the s letters of A by binary words of length ⌈log s⌉, one can code the a_i's by binary words w_i's having different lengths so as to associate short codes to the most frequent letters and long codes to rare ones. Which is the basic idea of compression. Using these codes, the substitution of the w_i's to the occurrences of the a_i's in a word u gives a binary word ξ. And the map u ↦ ξ is again very simple. It is still a morphism from the monoid of words on alphabet A to the monoid of binary words and can also be computed by a finite automaton.

Now, we face a problem: can we recover u from ξ? I.e. is the map u ↦ ξ injective? In general the answer is no. However, a simple sufficient condition to ensure decoding is that the family w_1, ..., w_s be a so-called prefix-free code. Which means that if i ≠ j then w_i is not a prefix of w_j. This condition insures that there is a unique w_{i_1} which is a prefix of ξ. Then, considering the associated suffix ξ_1 of ξ (i.e. ξ = w_{i_1} ξ_1), there is a unique w_{i_2} which is a prefix of ξ_1, i.e. ξ is of the form ξ = w_{i_1} w_{i_2} ξ_2. And so on.

Suppose the numbers of occurrences in u of the letters a_1, ..., a_s are m_1, ..., m_s, so that the length of u is n = m_1 + ... + m_s. Using a prefix-free code w_1, ..., w_s, the binary word ξ associated to u has length m_1|w_1| + ... + m_s|w_s|. A natural question is, given m_1, ..., m_s, how to choose the prefix-free code w_1, ..., w_s so as to minimize the length of ξ? David A. Huffman, 1952 [18], found a very efficient algorithm (which has linear time complexity if the frequencies are already ordered). This algorithm (suitably modified to keep its top efficiency for words containing long runs of the same data) is nowadays used in nearly every application that involves the compression and transmission of data: fax machines, modems, networks,...

1.2.3 Entropy of a distribution of frequencies

The intuition of the notion of entropy in information theory is as follows. Given natural integers m_1, ..., m_s, consider the family F_{m_1,...,m_s} of length n = m_1 + ... + m_s words of the alphabet A in which there are exactly m_1, ..., m_s occurrences of letters a_1, ..., a_s. How many binary digits are there in the binary representation of the number of words in F_{m_1,...,m_s}? It happens (cf. Proposition 2) that this number is essentially linear in n, the coefficient of n depending solely on the frequencies m_1/n, ..., m_s/n. It is this coefficient which is called the entropy H of the distribution of the frequencies m_1/n, ..., m_s/n.

Now, H has a striking significance in terms of information content and compression. Any word u in F_{m_1,...,m_s} is uniquely characterized by its rank in this family (say relatively to the lexicographic ordering on words in alphabet A). In particular, the binary representation of this rank "encodes" u, and its length, which is bounded by nH (up to an O(log n) term), can be seen as an upper bound of the information content of u. Otherwise said, the n letters of u are encoded by nH binary digits. In terms of compression (nowadays so popularized by the zip-like softwares), u can be compressed to nH bits, i.e. the mean information content (which can be seen as the compression size in bits) of a letter of u is H.

Definition 1 (Shannon, 1948 [34]). Let f_1, ..., f_s be a distribution of frequencies, i.e. a sequence of reals in [0, 1] such that f_1 + ... + f_s = 1. The entropy of f_1, ..., f_s is the real

    H = −(f_1 log(f_1) + ... + f_s log(f_s))

Let's look at two extreme cases. If all frequencies are equal to 1/s then the entropy is log(s), so that the mean information content of a letter of u is log(s), i.e. there is no better (prefix-free) coding than that described in §1.2.1. In case one frequency is 1 and the other ones are 0, the information content of u is reduced to its length n, which, written in binary, requires log(n) bits. As for the entropy, it is 0 (with the usual convention 0 log 0 = 0, justified by the fact that lim_{x→0} x log x = 0). The discrepancy between nH = 0 and the true information content log n comes from the O(log n) term (cf. the next Proposition).

Proposition 2. Let m_1, ..., m_s be natural integers and n = m_1 + ... + m_s. Then, letting H be the entropy of the distribution of frequencies m_1/n, ..., m_s/n, the number ♯F_{m_1,...,m_s} of words in F_{m_1,...,m_s} satisfies

    log(♯F_{m_1,...,m_s}) = nH + O(log n)

where the bound in O(log n) depends solely on s and not on m_1, ..., m_s.

Proof. F_{m_1,...,m_s} contains n!/(m_1! × ... × m_s!) words. Using Stirling's approximation of the factorial function (cf. Feller's book [16]), namely x! = √(2π) x^(x+1/2) e^(−x+θ/(12x)) where 0 < θ < 1, and the equality n = m_1 + ... + m_s, we get

    log(n!/(m_1! × ... × m_s!)) = (Σ_i m_i) log(n) − (Σ_i m_i log m_i) + (1/2) log(n/(m_1 × ... × m_s)) − (s − 1) log √(2π) + α

where |α| ≤ (s/12) log e. The first two terms are exactly −n[Σ_i (m_i/n) log(m_i/n)] = nH and the remaining sum is O(log n) since n^(1−s) ≤ n/(m_1 × ... × m_s) ≤ 1.
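Proposition 2 can be illustrated numerically: for concrete occurrence counts, log ♯F differs from nH by only an O(log n) term. A small Python check (our own illustration):

```python
from math import factorial, log2

def entropy(freqs):
    """Shannon entropy of a distribution of frequencies (Definition 1)."""
    return -sum(f * log2(f) for f in freqs if f > 0)

m = [50, 30, 20]                  # occurrence counts m_1, m_2, m_3
n = sum(m)
H = entropy([mi / n for mi in m])

# sharp F = n! / (m_1! x ... x m_s!)  (multinomial coefficient)
card = factorial(n)
for mi in m:
    card //= factorial(mi)

# log2(card) = nH + O(log n): the gap is tiny compared to n = 100
gap = n * H - log2(card)
assert 0 < gap < 3 * log2(n)
```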
1.2.4 Shannon’s source co ding theorem for symb ol co des The significance of the entrop y explained a bov e has b een given a remark able and prec ise form by Claude E lw o od Shannon (1 916-200 1 ) in his celebrated 1948 pap er [3 4]. It’s ab out the leng th o f the binary word ξ asso ciated to u v ia a prefix-free co de. Shanno n pr o ved - a low er bound of | ξ | v alid whatever be the prefix-free co de w 1 , ..., w s , - an upp er b ound, quite close to the lower b ound, v alid for particula r prefix-free co des w 1 , ..., w s (those making ξ s hortest pos sible, fo r instance thos e given by Huffman’s algo r ithm). Theorem 3 (Shannon, 1 948 [34]) . S upp ose the numb ers of o c cu r re nc es in u of the letters a 1 , ..., a s ar e m 1 , ..., m s . L et n = m 1 + ... + m s . 1. F or every pr efix-fr e e se quenc e of binary wor ds w 1 , ..., w s , the binary wor d ξ obtaine d by substitut ing w i to e ach o c curr enc e of a i in u satisfies nH ≤ | ξ | 6 wher e H = − ( m 1 n log( m 1 n ) + ... + m s n log( m s n )) is the so-c al le d entr opy of t he c onsider e d distribution of fr e quencies m 1 n , ..., m s n . 2. Ther e ex ists a pr efix -fr e e se qu en c e of binary wor ds w 1 , ..., w s such that nH ≤ | ξ | < n ( H + 1) Pr o of. First, we recall tw o classical results. Theorem ( K raft’s inequalit y) . L et ℓ 1 , ..., ℓ s b e a finite se quenc e of i nte- gers. Ine quality 2 − ℓ 1 + ... + 2 − ℓ s ≤ 1 holds if and only if ther e exists a pr efix-fr e e se quenc e of binary wor ds w 1 , ..., w s such that ℓ 1 = | w 1 | , ..., ℓ s = | w s | . Theorem (Gibbs’ inequality) . L et p 1 , ..., p s and q 1 , ..., q s b e two pr ob a- bility distributions, i .e. the p i ’s (r esp. q i ’s) ar e in [0 , 1] and have sum 1 . Then − P p i log( p i ) ≤ − P p i log( q i ) with e quali ty if and only if p i = q i for al l i . Pr o of of 1 . Set p i = m i n and q i = 2 −| w i | S where S = P i 2 −| w i | . 
Then

    |ξ| = Σ_i m_i |w_i| = n[Σ_i (m_i/n)(−log(q_i) − log S)] ≥ n[−Σ_i (m_i/n) log(m_i/n) − log S] = n[H − log S] ≥ nH

The first inequality is an instance of Gibbs' inequality. For the last one, observe that S ≤ 1 and apply Kraft's inequality.

Proof of 2. Set ℓ_i = ⌈−log(m_i/n)⌉. Observe that 2^(−ℓ_i) ≤ m_i/n. Thus, 2^(−ℓ_1) + ... + 2^(−ℓ_s) ≤ 1. Applying Kraft's inequality, we see that there exists a prefix-free family of words w_1, ..., w_s with lengths ℓ_1, ..., ℓ_s. We consider the binary word ξ obtained via this prefix-free code, i.e. ξ is obtained by substituting w_i to each occurrence of a_i in u. Observe that −log(m_i/n) ≤ ℓ_i < −log(m_i/n) + 1. Summing, we get nH ≤ |ξ| < n(H + 1).

In particular cases, the lower bound nH is exactly |ξ|.

Theorem 4. In case the frequencies m_i/n are all negative powers of two (i.e. 1/2, 1/4, 1/8, ...) then the optimal ξ (given by Huffman's algorithm) satisfies |ξ| = nH.

1.2.5 Closer to the entropy

As simple as they are, prefix-free codes are not the only way to efficiently encode into a binary word ξ a word u from alphabet a_1, ..., a_s for which the numbers m_1, ..., m_s (of occurrences of the a_i's) are known. Let's go back to the encoding mentioned at the start of §1.2.3. A word u in the family F_{m_1,...,m_s} (of length n words with exactly m_1, ..., m_s occurrences of a_1, ..., a_s) can be recovered from the following data:
- the values of m_1, ..., m_s,
- the rank of u in F_{m_1,...,m_s} (relative to the lexicographic order on words).
We have seen (cf. Proposition 2) that the rank of u has a binary representation ρ of length ≤ nH + O(log n). The integers m_1, ..., m_s are encoded by their binary representations µ_1, ..., µ_s, which all have length ≤ 1 + ⌊log n⌋.
Now, to encode m_1, ..., m_s and the rank of u, we cannot just concatenate µ_1, ..., µ_s, ρ: how would we know where µ_1 stops, where µ_2 starts, ..., in the word obtained by concatenation? Several tricks are possible to overcome the problem; they are described in §1.2.6. Using Proposition 5, we set ξ = ⟨µ_1, ..., µ_s, ρ⟩ which has length

    |ξ| = |ρ| + O(|µ_1| + ... + |µ_s|) = nH + O(log n)

(Proposition 6 gives a much better bound but this is of no use here). Then, u can be recovered from ξ which is a binary word of length nH + O(log n). Thus, asymptotically, we get a better upper bound than n(H + 1), the one given by Shannon for codings with prefix-free codes. Of course, ξ is no more obtained from u via a morphism (i.e. a map which preserves concatenation of words) between the monoid of words in alphabet A and that of binary words.

1.2.6 Coding finitely many words with one word

How can we code two words u, v by one word? The simplest way is to consider u$v where $ is a fresh symbol outside the alphabet of u and v. But what if we want to stick to binary words? As said above, the concatenation of u and v does not do the job: one cannot recover the right prefix u in uv. A simple trick is to also concatenate the length |u| in unary and delimitate it by a zero: indeed, from the word 1^|u| 0 u v one can recover u and v. In other words, the map (u, v) ↦ 1^|u| 0 u v is injective from {0,1}* × {0,1}* to {0,1}*. In this way, the code of the pair (u, v) has length 2|u| + |v| + 1. This can obviously be extended to more arguments.

Proposition 5. Let s ≥ 1. There exists a map ⟨ ⟩ : ({0,1}*)^(s+1) → {0,1}* which is injective and computable and such that, for all u_1, ..., u_s, v ∈ {0,1}*, |⟨u_1, ..., u_s, v⟩| = 2(|u_1| + ... + |u_s|) + |v| + s.
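The unary-length trick behind Proposition 5 (case s = 1) is easy to implement; a minimal sketch (the function names are ours):

```python
def encode_pair(u, v):
    """Code (u, v) as 1^|u| 0 u v, a word of length 2|u| + |v| + 1."""
    return "1" * len(u) + "0" + u + v

def decode_pair(w):
    """Recover (u, v): the leading block of 1's gives |u| in unary."""
    k = w.index("0")                      # |u|, read off the unary prefix
    return w[k + 1 : 2 * k + 1], w[2 * k + 1 :]

u, v = "101", "0011"
w = encode_pair(u, v)
assert len(w) == 2 * len(u) + len(v) + 1
assert decode_pair(w) == (u, v)
```

Note that the code is self-delimiting on the left: after reading the unary prefix and u, whatever follows is v, which is what makes the extension to s arguments straightforward.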
Proposition 5 can be improved; we shall need this technical improvement in §5.2.1.

Proposition 6. There exists an injective and computable map ⟨ ⟩ : ({0,1}*)^(s+1) → {0,1}* such that, for all u_1, ..., u_s, v ∈ {0,1}*,

    |⟨u_1, ..., u_s, v⟩| = (|u_1| + ... + |u_s| + |v|) + (log |u_1| + ... + log |u_s|) + O(log log |u_1| + ... + log log |u_s|)

Proof. We consider the case s = 1, i.e. we want to code a pair (u, v). Instead of putting the prefix 1^|u| 0, let's put the binary representation β(|u|) of the number |u|, prefixed by its length. This gives the more complex code 1^|β(|u|)| 0 β(|u|) u v with length

    |u| + |v| + 2(⌊log |u|⌋ + 1) + 1 ≤ |u| + |v| + 2 log |u| + 3

The first block of ones gives the length of β(|u|). Using this length, we can get β(|u|) as the factor following this first block of ones. Now, β(|u|) is the binary representation of |u|, so we get |u| and can now separate u and v in the suffix uv.

1.3 Probabilistic approach

The abstract probabilistic approach allows for considerable extensions of the results described in §1.2. First, the restriction to fixed given frequencies can be relaxed. The probability of writing a_i may depend on what has already been written. For instance, Shannon's source coding theorem has been extended to the so-called "ergodic asymptotically mean stationary source models". Second, one can consider a lossy coding: some length n words in alphabet A are ill-treated or ignored. Let δ be the probability of this set of words. Shannon's theorem extends as follows:
- whatever close to 1 is δ < 1, one can compress u only down to nH bits;
- whatever close to 0 is δ > 0, one can achieve compression of u down to nH bits.

1.4 Algorithmic approach

1.4.1 Berry's paradox

So far, we considered two kinds of binary codings for a word u in alphabet a_1, ..., a_s.
The simplest one uses variable-length prefix-free codes (§1.2.2). The other one codes the rank of u as a member of some set (§1.2.5). Clearly, there are plenty of other ways to encode any mathematical object. Why not consider all of them? And define the information content of a mathematical object x as the shortest univoque description of x (written as a binary word). Though quite appealing, this notion is ill defined, as stressed by Berry's paradox²:

    Let β be the lexicographically least binary word which cannot be univoquely described by any binary word of length less than 1000.

This description of β contains 106 symbols of written English (including spaces) and, using ASCII codes, can be written as a binary word of length 106 × 7 = 742. Assuming such a description to be well defined would lead to a univoque description of β in 742 bits, hence less than 1000, a contradiction to the definition of β.

The solution to this inconsistency is clear: the quite vague notion of univoque description entering Berry's paradox is used both inside the sentence describing β and inside the argument to get the contradiction. A collapse of two levels:
- the would-be formal level carrying the description of β,
- and the meta level which carries the inconsistency argument.
Any formalization of the notion of description should drastically reduce its scope and totally forbid the above collapse.

² Berry's paradox is mentioned by Bertrand Russell, 1908 ([31], p. 222 or 150), who credited G.G. Berry, an Oxford librarian, for the suggestion.

1.4.2 The turn to computability

To get around the stumbling block of Berry's paradox and have a formal notion of description with wide scope, Andrei Nikolaievitch Kolmogorov (1903-1987) made an ingenious move: he turned to computability and replaced description by computation program.
This exploits the successful formalization of this a priori vague notion which was achieved in the thirties³. This approach was first announced by Kolmogorov in [21], 1963, and then developed in [22], 1965. Similar approaches were also independently developed by Ray J. Solomonoff in [36, 37], 1964, and by Gregory Chaitin in [4, 5], 1966-69.

1.4.3 Digression on computability theory

The formalized notion of computable function (also called recursive function) goes along with that of partial computable function, which should rather be called partially computable partial function (also called partial recursive function), i.e. the partial qualifier has to be distributed.⁴

So, there are two theories:
- the theory of computable functions,
- the theory of partial computable functions.
The "right" theory, the one with a cornucopia of spectacular results, is that of partial computable functions. Let's pick three fundamental results out of the cornucopia, which we state in terms of computers and programming languages. Let I and O be non empty finite products of simple countable families of mathematical objects such as N or A* (the family of words in alphabet A) where A is finite or countably infinite.

Theorem 7.
1. [Enumeration theorem] The (program, input) → output function which executes programs on their inputs is itself partial computable. Formally, this means that there exists a partial computable function U : {0,1}* × I → O such that the family of partial computable functions I → O is exactly {U_e | e ∈ {0,1}*} where U_e(x) = U(e, x). Such a function U is called universal for partial computable functions I → O.
2. [Parameter theorem (or s-m-n thm)] One can exchange input and program (this is von Neumann's key idea for computers).

³ Through the works of Alonzo Church (via lambda calculus), Alan Mathison Turing (via Turing machines), Kurt Gödel and Jacques Herbrand (via Herbrand-Gödel systems of equations) and Stephen Cole Kleene (via the recursion and minimization operators).
⁴ In French, Daniel Lacombe used the expression semi-fonction semi-récursive.
F ormal ly, this me ans that, letting I = I 1 × I 2 , universal maps U I 1 ×I 2 and U I 2 ar e such that ther e exists a c omputable total map s : { 0 , 1 } ∗ × I 1 → { 0 , 1 } ∗ such that, for al l e ∈ { 0 , 1 } ∗ , x 1 ∈ I 1 and x 2 ∈ I 2 , U I 1 ×I 2 ( e, ( x 1 , x 2 )) = U I 2 ( s ( e, x 1 ) , x 2 ) 3 Through the works of Alonzo Churc h (via l am bda calculus), Al an Mathison T uring (via T uri ng mach ines) and Kur t G¨ odel and Jacques Herbrand (via Herbrand-G¨ odel systems of equations) and Stephen Cole Kleene (via the recursi on and minimi zation op erators). 4 In F rench, Daniel Lacom be used the expression semi-fonction semi-r ´ ecursive 10 3. [Kleene fixed p oint theorem] F or any tr ansformation of pr o gr ams, ther e is a pr o gr am which do es the same input → outpu t job as its tr ansforme d pr o gr am. (Note: This is the se e d of c omputer vir olo gy... cf. [3] 2006) F ormal ly, this me ans that, f or every p artial c omputable map f : { 0 , 1 } ∗ → { 0 , 1 } ∗ , ther e exists e such that ∀ e ∈ { 0 , 1 } ∗ ∀ x ∈ I U ( f ( e ) , x ) = U ( e, x ) 1.5 Kolmogoro v complexit y and the in v ariance t heorem Note. The denotations of (plain) K olmo goro v complexity and its prefix versi on may cause some confusion. They long used to b e respectively denoted by K and H in the literature. But in their b ook [25], Li & Vitanyi resp ectiv ely denoted th em C and K . Due to the large success of this b ook, these last den otations are since used in many p apers. S o th at tw o incompatible denotations now app ear in th e litterature. Since we mainly fo cus on p lain Kolmogoro v complexity , we stick to the traditional denotations K and H . 1.5.1 Program size compl exit y or Ko l mogoro v complexity T urning to co mputabilit y , the basic idea for Kolmo gorov complexity is description = pr o gr am When we say “ program”, we mean a progra m taken fro m a family of progr ams, i.e. 
written in a pro gramming languag e o r describing a T ur ing ma chine or a system of Herbr and-G¨ odel equations or a Post system,... Since we a r e so on go ing to c onsider length of pr ograms, following what ha s b een said in § 1.1.2, w e nor malize progra ms: they will b e binary words, i.e. elements of { 0 , 1 } ∗ . So, we hav e to fix a function ϕ : { 0 , 1 } ∗ → O and consider that the o utput of a progra m p is ϕ ( p ). Whic h ϕ a r e we to consider? Since we know that there are universal par tial co m- putable functions (i.e. functions able to emulate any other partial computable function mo dulo a co mputable transformatio n of prog r ams, in other words, a compiler from one language to a nother), it is natura l to cons ide r universal pa r- tial computable functions. Which agrees with wha t has been s aid in § 1.4.3. The general definition o f the Kolmog orov complexity ass ociated to any function { 0 , 1 } ∗ → O is a s follows. Definition 8. If ϕ : { 0 , 1 } ∗ → O is a p artial function, set K ϕ : O → N K ϕ ( y ) = min {| p | : ϕ ( p ) = y } Intuition: p is a pr o gr am (with n o input ), ϕ ex e cutes pr o gr ams (i.e. ϕ is al l to gether a pr o gr amming language plus a c ompiler plus a machinery to run pr o- gr ams) and ϕ ( p ) is the output of the run of pr o gr am p . Thus, for y ∈ O , K ϕ ( y ) is the length of shortest pr o gr ams p with which ϕ c omputes y (i.e. ϕ ( p ) = y ) 11 As said ab ov e, w e shall consider this definition for partia l computable func- tions { 0 , 1 } ∗ → O . O f course, this for ces to consider a se t O endow e d with a computability structure. Hence the choice of sets that we shall call elementary which do not exhaust all pos sible ones but will suffice for the results men tioned in this pap er. Definition 9. 
The family of elementary sets is obtained as follows:
- it contains N and the A*'s where A is a finite or countable alphabet,
- it is closed under finite (non empty) product, product with any non empty finite set, and the finite sequence operator.

1.5.2 The invariance theorem

The problem with Definition 8 is that K_ϕ strongly depends on ϕ. Here comes a remarkable result, the invariance theorem, which insures that there is a smallest K_ϕ, up to a constant. It turns out that the proof of this theorem only needs the enumeration theorem and makes no use of the parameter theorem (usually omnipresent in computability theory).

Theorem 10 (Invariance theorem, Kolmogorov, [22], 1965). Let O be an elementary set. Among the K_ϕ's, where ϕ : {0,1}* → O varies in the family PC^O of partial computable functions, there is a smallest one, up to an additive constant (i.e. within some bounded interval):

∃V ∈ PC^O  ∀ϕ ∈ PC^O  ∃c  ∀y ∈ O   K_V(y) ≤ K_ϕ(y) + c

Such a V is called optimal.

Proof. Let U : {0,1}* × {0,1}* → O be a partial computable universal function for partial computable functions {0,1}* → O (cf. Theorem 7, Enumeration theorem). Let c : {0,1}* × {0,1}* → {0,1}* be a total computable injective map such that |c(e, x)| = 2|e| + |x| + 1 (cf. Proposition 5). Define V : {0,1}* → O as follows:

∀e ∈ {0,1}*  ∀x ∈ {0,1}*   V(c(e, x)) = U(e, x)

where equality means that both sides are simultaneously defined or not. Then, for every partial computable function ϕ : {0,1}* → O, for every y ∈ O, if ϕ = U_e (i.e. ϕ(x) = U(e, x) for all x, cf.
Theorem 7, Enumeration theorem) then

K_V(y) = least |p| such that V(p) = y
       ≤ least |c(e, x)| such that V(c(e, x)) = y
       = least |c(e, x)| such that U(e, x) = y
       = least |x| + 2|e| + 1 such that ϕ(x) = y   (since |c(e, x)| = |x| + 2|e| + 1 and ϕ(x) = U(e, x))
       = (least |x| such that ϕ(x) = y) + 2|e| + 1
       = K_ϕ(y) + 2|e| + 1

Using the invariance theorem, the Kolmogorov complexity K : O → N is defined as K_V where V is any fixed optimal function. The arbitrariness of the choice of V does not modify K_V drastically, merely up to a constant.

Definition 11. Kolmogorov complexity K^O : O → N is K_ϕ where ϕ is some fixed optimal function {0,1}* → O. K^O will be denoted by K when O is clear from context.
K^O is therefore minimum among the K_ϕ's, up to an additive constant.
K^O is defined up to an additive constant: if V and V′ are both optimal then

∃c  ∀x ∈ O   |K_V(x) − K_{V′}(x)| ≤ c

1.5.3 About the constant

So Kolmogorov complexity is an integer defined up to a constant...! But the constant is uniformly bounded for x ∈ O.
Let's quote what Kolmogorov said about the constant in [22]:

  Of course, one can avoid the indeterminacies associated with the [above] constants, by considering particular [... functions V], but it is doubtful that this can be done without explicit arbitrariness. One must, however, suppose that the different "reasonable" [above optimal functions] will lead to "complexity estimates" that will converge on hundreds of bits instead of tens of thousands. Hence, such quantities as the "complexity" of the text of "War and Peace" can be assumed to be defined with what amounts to uniqueness.
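The "up to a constant" phenomenon can be tried out on a toy scale. Below, phi and psi are two hypothetical miniature machines (nothing like the optimal machines of Theorem 10), and K is the brute-force complexity of Definition 8. Since psi emulates phi at the cost of a one-bit opcode prefix, K_psi ≤ K_phi + 1 always, while K_phi can exceed K_psi by much more.

```python
from itertools import product

# Two hypothetical toy machines for Definition 8.  psi emulates phi at the
# cost of a one-bit prefix, so K_psi <= K_phi + 1: a miniature instance of
# the "up to an additive constant" phenomenon discussed in Section 1.5.3.
def phi(p: str):
    return int(p, 2) if p else None          # programs are plain binary numerals

def psi(p: str):
    if not p:
        return None
    n = int(p[1:], 2) if len(p) > 1 else 0
    return n if p[0] == '0' else 2 ** n      # opcode 1 compresses powers of two

def K(machine, y: int, max_len: int = 15):
    """Brute-force K_machine(y) = min{|p| : machine(p) = y}, by length."""
    for length in range(1, max_len + 1):
        for bits in product('01', repeat=length):
            if machine(''.join(bits)) == y:
                return length
    return None                              # no short enough program found

for y in (6, 2 ** 10):
    print(y, K(phi, y), K(psi, y))           # K_psi never exceeds K_phi + 1
```

On 2^10 the gap is already large: phi needs the full 11-bit numeral, psi a 5-bit program.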
In fact, this constant is related to the multitude of models of computation: universal Turing machines, universal cellular automata, Herbrand-Gödel systems of equations, Post systems, Kleene definitions,... If we feel that one of them is canonical then we may consider the associated Kolmogorov complexity as the right one and forget about the constant. This has been developed for the Schönfinkel-Curry combinators S, K, I by Tromp [25] § 3.2.2–3.2.6.
However, this does not at all lessen the importance of the invariance theorem, since it tells us that K is smaller than any K_ϕ (up to a constant), a result which is applied again and again to develop the theory.

1.5.4 Conditional Kolmogorov complexity

In the enumeration theorem (cf. Theorem 7), we considered (program, input) → output functions. Then, in the definition of Kolmogorov complexity, we gave up the inputs, dealing with functions program → output.
Conditional Kolmogorov complexity deals with the inputs. Instead of measuring the information content of y ∈ O, we measure it given as free some object z which may help to compute y. A trivial case is z = y: the information content of y given y is null. In fact, there is an obvious program which outputs exactly its input, whatever the input may be.
Let's mention that, in computer science, inputs are also considered as the environment.
Let's give the formal definition and the adequate invariance theorem.

Definition 12. If ϕ : {0,1}* × I → O is a partial function, set K_ϕ( | ) : O × I → N,

K_ϕ(y | z) = min{|p| : ϕ(p, z) = y}

Intuition: p is a program (with one input), ϕ executes programs (i.e. ϕ is altogether a programming language plus a compiler plus a machinery to run programs) and ϕ(p, z) is the output of the run of program p on input z.
Thus, for y ∈ O and z ∈ I, K_ϕ(y | z) is the length of the shortest programs p with which ϕ computes y on input z (i.e. ϕ(p, z) = y).

Theorem 13 (Invariance theorem for conditional complexity). Among the K_ϕ( | )'s, where ϕ varies in the family PC^O_I of partial computable functions {0,1}* × I → O, there is a smallest one, up to an additive constant (i.e. within some bounded interval):

∃V ∈ PC^O_I  ∀ϕ ∈ PC^O_I  ∃c  ∀y ∈ O  ∀z ∈ I   K_V(y | z) ≤ K_ϕ(y | z) + c

Such a V is called optimal.

Proof. Simple application of the enumeration theorem for partial computable functions.

Definition 14. K^O_I : O × I → N is K_V( | ) where V is some fixed optimal function. K^O_I is defined up to an additive constant: if V and V′ are both optimal then

∃c  ∀y ∈ O  ∀z ∈ I   |K_V(y | z) − K_{V′}(y | z)| ≤ c

Again, an integer defined up to a constant...! However, the constant is uniform in y ∈ O and z ∈ I.

1.5.5 Simple upper bounds for Kolmogorov complexity

Finally, let's mention rather trivial upper bounds:
- the information content of a word is at most its length,
- conditional complexity cannot be harder than unconditional complexity.

Proposition 15. 1. There exists c such that

∀x ∈ {0,1}*  K^{{0,1}*}(x) ≤ |x| + c ,   ∀n ∈ N  K^N(n) ≤ log(n) + c

2. There exists c such that, for all x ∈ D and y ∈ E (D, E elementary sets), K(x | y) ≤ K(x) + c.
3. Let f : O → O′ be computable. Then K^{O′}(f(x)) ≤ K^O(x) + O(1).

Proof. We only prove 1. Let Id : {0,1}* → {0,1}* be the identity function. The invariance theorem insures that there exists c such that K^{{0,1}*} ≤ K^{{0,1}*}_{Id} + c. In particular, for all x ∈ {0,1}*, K^{{0,1}*}(x) ≤ |x| + c.
Let θ : {0,1}* → N be the function which associates to a word u = a_{k−1}...a_0 the integer

θ(u) = (2^k + a_{k−1} 2^{k−1} + ... + 2 a_1 + a_0) − 1

(i.e.
the predecessor of the integer with binary representation 1u). Clearly, K^N_θ(n) = ⌊log(n + 1)⌋. The invariance theorem insures that there exists c such that K^N ≤ K^N_θ + c. Hence K^N(n) ≤ log(n) + c + 1 for all n ∈ N.

The following property is a variation of an argument already used in § 1.2.5: the rank of an element in a set defines it, and if the set is computable then so is this process.

Proposition 16. Let A ⊆ N × D be computable such that A_n = A ∩ ({n} × D) is finite for all n. Then, letting ♯(X) be the number of elements of X,

∃c  ∀x ∈ A_n   K(x | n) ≤ log(♯(A_n)) + c

Intuition. An element in a set is determined by its rank. And this is a computable process.

Proof. Observe that x is determined by its rank in A_n. This rank is an integer < ♯A_n, hence it has a binary representation of length ≤ ⌊log(♯A_n)⌋ + 1.

1.6 Oracular Kolmogorov complexity

As is always the case in computability theory, everything relativizes to any oracle Z. This means that the equation given at the start of § 1.5 now becomes

description = program of a partial Z-computable function

and for each possible oracle Z there exists a Kolmogorov complexity relative to oracle Z.
Oracles in computability theory can also be considered as second-order arguments of computable or partial computable functionals. The same holds with oracular Kolmogorov complexity: the oracle Z can be seen as a second-order condition for a second-order conditional Kolmogorov complexity K(y | Z) where

K( | ) : O × P(I) → N

This has the advantage that the unavoidable constant in the "up to a constant" properties does not depend on the particular oracle. It depends solely on the considered functional.
Finally, one can mix first-order and second-order conditions, leading to a conditional Kolmogorov complexity with both first-order and second-order conditions, K(y | z, Z), where

K( | , ) : O × I × P(I) → N

We shall see in § 4.2.3 an interesting property involving oracular Kolmogorov complexity.

2 Kolmogorov complexity and undecidability

2.1 K is unbounded

Let K = K_V : O → N where V : {0,1}* → O is optimal (cf. Theorem 10). Since there are finitely many programs of size ≤ n (namely 2^{n+1} − 1 words), there are finitely many elements of O with Kolmogorov complexity ≤ n. This shows that K is unbounded.

2.2 K is not computable

Berry's paradox (cf. § 1.4.1) has a counterpart in terms of Kolmogorov complexity: namely, it gives a proof that K, which is a total function O → N, is not computable.

Proof. For simplicity of notations, we consider the case O = N. Define L : N → N as follows:

L(n) = least k such that K(k) ≥ 2n

so that K(L(n)) ≥ 2n for all n. If K were computable, so would be the function M : {0,1}* → N defined by M(p) = L(|p|). By definition of K_M, taking any program of length n, we get K_M(L(n)) ≤ n. The invariance theorem insures that there exists c such that K ≤ K_M + c. Then

2n ≤ K(L(n)) ≤ K_M(L(n)) + c ≤ n + c

A contradiction for n > c.

The undecidability of K can be seen as a version of the undecidability of the halting problem. In fact, there is a simple way to compute K when the halting problem is used as an oracle. To get the value of K(x), proceed as follows:
- enumerate the programs in {0,1}* in length-lexicographic order,
- for each program p, check whether V(p) halts (using the oracle),
- in case V(p) halts, compute its value,
- halt and output |p| when some p is obtained such that V(p) = x.
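The oracle procedure above can be sketched in Python. Everything here is a stand-in: for the hypothetical toy interpreter below the halting problem happens to be decidable, so the "oracle" is an ordinary function. For a genuine optimal V no computable oracle exists, which is exactly the point of this section.

```python
from itertools import product

# Toy stand-in for the optimal machine V: by convention, programs starting
# with '1' (and the empty program) diverge, the others output a numeral.
def V(p: str):
    if not p or p[0] == '1':
        return None                    # "diverges"
    return int(p[1:], 2) if len(p) > 1 else 0

def halts(p: str) -> bool:
    """Plays the role of the halting-problem oracle for this toy V."""
    return bool(p) and p[0] == '0'

def K_with_oracle(x: int, max_len: int = 25):
    """Enumerate programs in length-lexicographic order; use the oracle to
    skip diverging ones; output the length of the first p with V(p) = x."""
    for length in range(1, max_len + 1):
        for bits in product('01', repeat=length):
            p = ''.join(bits)
            if halts(p) and V(p) == x:
                return len(p)
    return None

print(K_with_oracle(5))
```

The enumeration itself is computable; only the `halts` test hides the non-computable part.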
The argument for the undecidability of K can be used to prove a much stronger statement: K cannot be bounded from below by an unbounded partial computable function.

Theorem 17 (Kolmogorov). There is no unbounded partial recursive function ϕ : O → N such that ϕ(x) ≤ K(x) for all x in the domain of ϕ.

Of course, K is bounded from above by a total computable function, cf. Proposition 15.

2.3 K is computable from above

Though K is not computable, it can be approximated from above. The idea is simple. Suppose O = {0,1}*. Consider all programs of length less than |x| + c, where c is the constant of Proposition 15 (so that at least one of these programs does output x), and let them be executed during t steps. If none of them converges and outputs x within t steps then the current bound is |x| + c. If some of them converge and output x then the bound is the length of the shortest such program. The limit of this process is K(x); it is obtained at some finite step which we are not able to bound.
Formally, this means that there is some F : O × N → N which is computable and decreasing in its second argument such that K(x) = lim_{n→+∞} F(x, n).

2.4 Kolmogorov complexity and Gödel's incompleteness theorem

Gödel's incompleteness theorem has a striking version, due to Chaitin, 1971–74 [6, 7], in terms of Kolmogorov complexity. In the language of arithmetic one can formalize partial computability (this is Gödel's main technical ingredient for the proof of the incompleteness theorem), hence also Kolmogorov complexity. Chaitin proved a lower bound n − O(1) on the information content of the finite families of statements, associated to an integer n, which describe finite restrictions of the halting problem or the values of K. In particular, for any formal system T, if n is bigger than the Kolmogorov complexity of T (plus some constant, independent of T), such statements cannot all be provable in T.

Theorem 18 (Chaitin, 1974 [7]). Suppose O = {0,1}*. 1.
Let V : {0,1}* → O be optimal (i.e. K = K_V). Let T_n be the family of true statements ∃p (V(p) = x) for |x| ≤ n (i.e. the halting problem for V limited to the finitely many words of length ≤ n). Then there exists a constant c such that K(T_n) ≥ n − c for all n.
2. Let T_n be the family of true statements K(x) ≥ |x| for |x| ≤ n. Then there exists a constant c such that K(T_n) ≥ n − c for all n.

Note. In the statement of the theorem, K(x) refers to the Kolmogorov complexity on O whereas K(T_n) refers to that on an adequate elementary family (cf. Definition 9).

3 Formalization of randomness for finite objects

3.1 Probabilities: laws about a non formalized intuition

Random objects (words, integers, reals,...) constitute the basic intuition for probabilities... but they are not considered per se. No formal definition of random object is given: there seems to be no need for such a formal concept. The existing formal notion of random variable has nothing to do with randomness: a random variable is merely a measurable function, which can be as non random as one likes.
It sounds strange that the mathematical theory which deals with randomness removes the natural basic questions:
- what is a random string?
- what is a random infinite sequence?
When questioned, people in probability theory agree that they skip these questions but do not feel sorry about it. As it is, the theory deals with laws of randomness and is so successful that it can do without entering this problem.
This may seem analogous to what is the case in geometry. What are points, lines, planes? No definition is given, only relations between them. Giving up the quest for an analysis of the nature of geometrical objects in favor of the axiomatic method has been a considerable scientific step. However, we contest such an analogy.
Random objects are heavily used in many areas of science and technology: sampling, cryptology,... Of course, such objects are in fact only "as random as we can get". Which means fake randomness.

  Anyone who considers arithmetical methods of producing random reals is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number — there are only methods to produce random numbers, and a strict arithmetical procedure is of course not such a method.
  John von Neumann, 1951 [29]

So, what is "true" randomness? Is there something like a degree of randomness? Presently, (fake) randomness only means passing some statistical tests. One can ask for more.
In fact, since Pierre Simon de Laplace (1749–1827), some probabilists never gave up the idea of formalizing the notion of random object. Let's cite particularly Richard von Mises (1883–1953) and Kolmogorov. In fact, it is quite impressive that, having so brilliantly and efficiently axiomatized probability theory via measure theory in 1933 [20], Kolmogorov was not fully satisfied with such foundations (5) and kept a keen interest in the quest for a formal notion of randomness initiated by von Mises in the 20's.

3.2 The 100 heads paradoxical result in probability theory

That probability theory fails to completely account for randomness is strongly witnessed by the following paradoxical fact:

  In probability theory, if we toss an unbiased coin 100 times then 100 heads are just as probable as any other outcome! Who really believes that? The axioms of probability theory, as developed by Kolmogorov, do not solve all mysteries that they are sometimes supposed to.
  Peter Gács [17]

5 Kolmogorov is one of the rare probabilists – up to now – who did not consider his axioms for probability theory to be the last word about formalizing randomness...
3.3 Kolmogorov's proposal: incompressible strings

We now assume that O = {0,1}*, i.e. we restrict to words.

3.3.1 Incompressibility with Kolmogorov complexity

Though much work had been devoted to getting a mathematical theory of random objects, notably by von Mises [38, 39], none was satisfactory up to the 60's, when Kolmogorov based such a theory on Kolmogorov complexity, hence on computability theory. The theory was, in fact, independently developed by Gregory J. Chaitin (b. 1947), 1966 [4], 1969 [5] (both papers submitted in 1965) (6).
The basic idea is as follows: the larger the Kolmogorov complexity of a text, the more random this text is, the larger its information content, and the less it can be compressed.
Thus, a theory for measuring information content is also a theory of randomness.
Recall that there exists c such that for all x ∈ {0,1}*, K(x) ≤ |x| + c (Proposition 15). Also, there is a "stupid" program of length about |x| which computes the word x: spell out the successive letters of x. The intuition of incompressibility is as follows: x is incompressible if there is no shorter way to get x.
Of course, we are not going to define absolute randomness for words, but a measure of randomness based on how far K(x) is from |x|.

Definition 19 (Measure of incompressibility). A word x is c-incompressible if K(x) ≥ |x| − c.

As is rather intuitive, most things are random. The next Proposition formalizes this idea.

Proposition 20. The proportion of c-incompressible strings of length n is ≥ 1 − 2^{−c}.

Proof. There are at most 2^{n−c} − 1 programs of length < n − c, and there are 2^n strings of length n.

3.3.2 Incompressibility with length conditional Kolmogorov complexity

We observed in § 1.2.3 that the entropy of a word of the form 000...0 is null, i.e. entropy does not take into account the information conveyed by the length.
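The counting behind Proposition 20 can be checked numerically: whatever the machine, the programs of length < n − c number 2^{n−c} − 1, so at most that many of the 2^n strings of length n can fail to be c-incompressible. A minimal sketch:

```python
# Counting behind Proposition 20: whatever the decompressor, the programs
# of length < n - c number 2**(n - c) - 1 (words of length 0 .. n-c-1), so
# at most that many of the 2**n strings of length n can be c-compressible.
def compressible_fraction_bound(n: int, c: int) -> float:
    return (2 ** (n - c) - 1) / 2 ** n

for n, c in [(10, 1), (20, 3), (30, 8)]:
    bound = compressible_fraction_bound(n, c)
    assert bound < 2 ** -c   # hence >= 1 - 2**-c strings are c-incompressible
    print(f"n={n}, c={c}: compressible fraction < {bound:.5f}")
```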
Here, with incompressibility based on Kolmogorov complexity, we can also ignore the information content conveyed by the length, by considering incompressibility based on length conditional Kolmogorov complexity.

Definition 21 (Measure of length conditional incompressibility). A word x is length conditional c-incompressible if K(x | |x|) ≥ |x| − c.

6 For a detailed analysis of who did what, and when, see [25] p. 89–92.

The same simple counting argument yields the following Proposition.

Proposition 22. The proportion of length conditional c-incompressible strings of length n is ≥ 1 − 2^{−c}.

A priori, length conditional incompressibility is stronger than mere incompressibility. However, the two notions of incompressibility are about the same... up to a constant.

Proposition 23. There exists d such that, for all c ∈ N and x ∈ {0,1}*:
1. x is length conditional c-incompressible ⇒ x is (c + d)-incompressible,
2. x is c-incompressible ⇒ x is length conditional (2c + d)-incompressible.

Proof. 1 is trivial. For 2, observe that there exists e such that, for all x,

(∗)   K(x) ≤ K(x | |x|) + 2 K(|x| − K(x | |x|)) + e

In fact, suppose K = K_ϕ and K( | ) = K_ψ( | ), and let p, q be shortest programs such that

ϕ(p) = |x| − K(x | |x|)   and   ψ(q | |x|) = x

so that |p| = K(|x| − K(x | |x|)) and |q| = K(x | |x|). From p and q, hence from the code ⟨p, q⟩ of length 2|p| + |q| (cf. Proposition 5), one can successively get:
- |x| − K(x | |x|) : this is ϕ(p),
- K(x | |x|) : this is |q|,
- |x| : just sum the two previous values,
- x : this is ψ(q | |x|).
This proves (∗). Using K^N ≤ log + c_1 and K(x) ≥ |x| − c, (∗) yields

|x| − K(x | |x|) ≤ 2 log(|x| − K(x | |x|)) + 2c_1 + c + e

Finally, observe that z ≤ 2 log z + k insures z ≤ max(16, 2k).
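The closing arithmetic fact of the proof of Proposition 23 can be checked by exhaustive search (stated here in the safe form z ≤ max(16, 2k), with log taken in base 2):

```python
from math import log2

# Exhaustive check: z <= 2*log2(z) + k implies z <= max(16, 2*k).
# Analytically: for z > 16, log2(z) < z/4, so z <= 2*log2(z) + k < z/2 + k,
# i.e. z < 2*k, a contradiction with z > max(16, 2*k).
def check(max_z: int = 5000, max_k: int = 60) -> bool:
    for k in range(max_k + 1):
        for z in range(1, max_z + 1):
            if z <= 2 * log2(z) + k and z > max(16, 2 * k):
                return False
    return True

print(check())   # prints True
```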
3.4 Incompressibility is randomness: Martin-Löf's argument

Now, if incompressibility is clearly a necessary condition for randomness, how do we argue that it is a sufficient condition? Contraposing the wanted implication, let's see that if a word fails some statistical test then it is not incompressible. We consider some spectacular failures of statistical tests.

Example 24. 1. [Constant left half length prefix] For all n large enough, a string 0^n u with |u| = n cannot be c-incompressible.
2. [Palindromes] Large enough palindromes cannot be c-incompressible.
3. [0 and 1 not equidistributed] For all 0 < α < 1, for all n large enough, a string of length n which has ≤ αn/2 zeros cannot be c-incompressible.

Proof. 1. Let c′ be such that K(x) ≤ |x| + c′. Observe that there exists c″ such that K(0^n u) ≤ K(u) + c″, hence

K(0^n u) ≤ n + c′ + c″ ≤ (1/2)|0^n u| + c′ + c″

So that K(0^n u) ≥ |0^n u| − c is impossible for n large enough.
2. Same argument: there exists c″ such that, for all palindromes x, K(x) ≤ (1/2)|x| + c″.
3. The proof follows the classical argument used to get the law of large numbers (cf. Feller's book [16]). Let's do it for α = 2/3, so that α/2 = 1/3.
Let A_n be the set of strings of length n with ≤ n/3 zeros. We estimate the number N of elements of A_n:

N = Σ_{i=0}^{n/3} C(n, i) = Σ_{i=0}^{n/3} n!/(i!(n − i)!) ≤ (n/3 + 1) n! / ((n/3)! (2n/3)!)

Use Stirling's formula (1730):

√(2nπ) (n/e)^n e^{1/(12n+1)} < n! < √(2nπ) (n/e)^n e^{1/(12n)}

so that

N < n √(2nπ) (n/e)^n / (√(2nπ/3) (n/(3e))^{n/3} √(4nπ/3) (2n/(3e))^{2n/3}) = (3/2) √(n/π) (3/∛4)^n

Using Proposition 16, for any element x of A_n, we have

K(x | n) ≤ log(N) + d ≤ n log(3/∛4) + (1/2) log n + d′

(for a slightly larger constant d′). Since 27/4 < 8, we have 3/∛4 < 2 and log(3/∛4) < 1. Hence

n − c ≤ n log(3/∛4) + (1/2) log n + d′

is impossible for n large enough. So that x cannot be c-incompressible.
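The counting step of Example 24.3 can be verified numerically for α = 2/3: the number N of length-n strings with at most n/3 zeros stays below the Stirling-based bound (3/2)√(n/π)(3/∛4)^n, and the key fact log2(3/∛4) < 1 holds since (3/∛4)^3 = 27/4 < 8:

```python
from math import comb, log2, pi, sqrt

# Example 24.3 for alpha = 2/3: exact count of length-n strings with at most
# n/3 zeros, versus the Stirling-based bound used in the proof.
base = 3 / 4 ** (1 / 3)
assert log2(base) < 1                 # since (3/cbrt(4))**3 = 27/4 < 8 = 2**3

for n in (30, 60, 90):                # multiples of 3, as in the proof
    N = sum(comb(n, i) for i in range(n // 3 + 1))
    bound = 1.5 * sqrt(n / pi) * base ** n
    assert N < bound
    print(n, N / 2 ** n)              # the fraction of such strings shrinks with n
```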
Let's give a common framework to the three above examples, so as to get some flavor of what a statistical test can be. To do this, we follow the above proofs of compressibility.

Example 25. 1. [Constant left half length prefix] Set V_m = all strings beginning with m zeros. The sequence V_0, V_1, ... is decreasing. The number of strings of length n in V_m is 0 if m > n and 2^{n−m} if m ≤ n. Thus, the proportion ♯{x : |x| = n ∧ x ∈ V_m} / 2^n of length n words which are in V_m is 2^{−m}.
2. [Palindromes] Put in V_m all strings whose length m prefix is the mirror image of their length m suffix (so that every palindrome of length ≥ 2m is in V_m). The sequence V_0, V_1, ... is decreasing. The number of strings of length n in V_m is 0 if m > n/2 and 2^{n−m} if m ≤ n/2. Thus, the proportion of length n words which are in V_m is 2^{−m}.
3. [0 and 1 not equidistributed] Put in V^α_m all strings x such that the number of zeros in x is ≤ (α + (1 − α) 2^{−m}) |x|/2. The sequence V^α_0, V^α_1, ... is decreasing. A computation analogous to that done in the proof of the law of large numbers shows that the proportion of length n words which are in V^α_m is ≤ 2^{−γm} for some γ > 0 (independent of m).

Now, what about other statistical tests? But what is a statistical test? A convincing formalization has been developed by Martin-Löf. The intuition is that illustrated in Example 25, augmented with the following feature: each V_m is computably enumerable and so is the relation {(m, x) : x ∈ V_m}. A feature which is analogous to the partial computability assumption in the definition of Kolmogorov complexity.

Definition 26 (Abstract notion of statistical test, Martin-Löf, 1964). A statistical test is a family of nested critical regions

{0,1}* ⊇ V_0 ⊇ V_1 ⊇ V_2 ⊇ ... ⊇ V_m ⊇ ...

such that {(m, x) : x ∈ V_m} is computably enumerable and the proportion ♯{x : |x| = n ∧ x ∈ V_m} / 2^n of length n words which are in V_m is ≤ 2^{−m}.
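A palindrome-style family of critical regions can be run directly. Below, V_m collects the words whose first m bits mirror their last m bits (a variant of Example 25.2 chosen so that the regions are nested and contain all long palindromes); enumeration confirms the geometric bound that Definition 26 asks for:

```python
from itertools import product

# Nested critical regions: V_m holds the words whose length-m prefix is the
# mirror image of their length-m suffix, so every palindrome of length n
# lies in V_m for m = n // 2, and V_0 ⊇ V_1 ⊇ V_2 ⊇ ...
def in_V(m: int, x: str) -> bool:
    return len(x) >= 2 * m and all(x[i] == x[-1 - i] for i in range(m))

def proportion(m: int, n: int) -> float:
    words = [''.join(b) for b in product('01', repeat=n)]
    return sum(in_V(m, x) for x in words) / len(words)

for m in range(4):
    assert proportion(m, 10) <= 2 ** -m   # the bound Definition 26 requires
    print(m, proportion(m, 10))
```

Membership in each region is plainly decidable here, a (strong) special case of the computable enumerability asked of {(m, x) : x ∈ V_m}.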
Intuition. The bound 2^{−m} is just a normalization. Any bound b(m), with b : N → Q computable, decreasing and with limit 0, could replace 2^{−m}. The significance of x ∈ V_m is that the hypothesis "x is random" is rejected with significance level 2^{−m}.

Remark 27. Instead of sets V_m, one can consider a function δ : {0,1}* → N such that ♯{x : |x| = n ∧ δ(x) ≥ m} / 2^n ≤ 2^{−m} and δ is computable from below, i.e. {(m, x) : δ(x) ≥ m} is recursively enumerable.

We have just argued on some examples that all statistical tests from practice are of the form stated by Definition 26. Now comes Martin-Löf's fundamental result about statistical tests, which is in the vein of the invariance theorem.

Theorem 28 (Martin-Löf, 1965). Up to a constant shift, there exists a largest statistical test (U_m)_{m∈N}:

∀(V_m)_{m∈N}  ∃c  ∀m   V_{m+c} ⊆ U_m

In terms of functions: up to an additive constant, there exists a largest statistical test ∆:

∀δ  ∃c  ∀x   δ(x) < ∆(x) + c

Proof. Consider ∆(x) = |x| − K(x | |x|) − 1.
∆ is a test. Clearly, {(m, x) : ∆(x) ≥ m} is computably enumerable. ∆(x) ≥ m means K(x | |x|) ≤ |x| − m − 1. So there are no more elements in {x : ∆(x) ≥ m ∧ |x| = n} than programs of length ≤ n − m − 1, i.e. at most 2^{n−m} − 1 < 2^{n−m} of them, which gives a proportion ≤ 2^{−m}.
∆ is largest. x is determined by its rank in the set V_{δ(x)} = {z : δ(z) ≥ δ(x) ∧ |z| = |x|}. Since this set has ≤ 2^{|x|−δ(x)} elements, the rank of x has a binary representation of length ≤ |x| − δ(x). Add useless zeros ahead to get a word p of length exactly |x| − δ(x). With p we get |x| − δ(x). With |x| − δ(x) and |x| we get δ(x) and construct V_{δ(x)}. With p we get the rank of x in this set, hence we get x. Thus, K(x | |x|) ≤ |x| − δ(x) + c, i.e. δ(x) < ∆(x) + c.
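The rank-coding trick used both in Proposition 16 and in the "∆ is largest" half of the proof can be made concrete: inside a known finite set, an element is fully described by its zero-padded rank, which needs only about log2 of the set's size in bits. The region below is a hypothetical toy choice, not the universal test:

```python
from itertools import product

# Rank coding: describe an element of a known finite set by its rank alone.
def critical_region(n: int, m: int):
    """A toy region: the length-n words beginning with m zeros."""
    return [''.join(b) for b in product('01', repeat=n)
            if b[:m] == ('0',) * m]

def encode(x: str, region) -> str:
    width = max(1, (len(region) - 1).bit_length())
    return format(region.index(x), f'0{width}b')   # rank, zero-padded

def decode(code: str, region) -> str:
    return region[int(code, 2)]

region = critical_region(10, 4)          # 2**6 = 64 words
x = '0000110101'
code = encode(x, region)
assert decode(code, region) == x
print(len(x), len(code))                 # 10 bits described by 6 bits
```

As in the proof, the zero-padding makes the code's length alone carry information (there, the value |x| − δ(x)).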
The importance of the previous result lies in the following corollary, which insures that, for words, incompressibility implies (hence is equivalent to) randomness.

Corollary 29 (Martin-Löf, 1965). Incompressibility passes all statistical tests. I.e., for all c, for all statistical tests (V_m)_m, there exists d such that

∀x  (x is c-incompressible ⇒ x ∉ V_{c+d})

Proof. Let x be length conditional c-incompressible. This means that K(x | |x|) ≥ |x| − c. Hence ∆(x) = |x| − K(x | |x|) − 1 ≤ c − 1, which means that x ∉ U_c.
Let now (V_m)_m be a statistical test. Then there is some d such that V_{m+d} ⊆ U_m for all m. Therefore x ∉ V_{c+d}.

Remark 30. Observe that incompressibility is a bottom-up notion: we look at the value of K(x) (or that of K(x | |x|)). On the opposite, passing statistical tests is a top-down notion. To pass all statistical tests amounts to an inclusion in an intersection of unions: namely, an inclusion in

⋂_{(V_m)_m} ⋃_c ({0,1}* ∖ V_c)

3.5 Randomness: a new foundation for probability theory?

Now that there is a sound mathematical notion of randomness (for finite objects), or more exactly a measure of randomness, is it possible/reasonable to use it as a new foundation for probability theory? Kolmogorov has been ambiguous on this question. In his first paper on the subject (1965, [22], p. 7), Kolmogorov briefly evoked that possibility:

  ... to consider the use of the [Algorithmic Information Theory] constructions in providing a new basis for Probability Theory.

However, later (1983, [23], p. 35–36), he separated both topics:

  there is no need whatsoever to change the established construction of the mathematical probability theory on the basis of the general theory of measure.
  I am not inclined to attribute the significance of necessary foundations of probability theory to the investigations [about Kolmogorov complexity] that I am now going to survey. But they are most interesting in themselves.

though stressing the role of his new theory of random objects for mathematics as a whole ([23], p. 39):

  The concepts of information theory as applied to infinite sequences give rise to very interesting investigations, which, without being indispensable as a basis of probability theory, can acquire a certain value in the investigation of the algorithmic side of mathematics as a whole.

4 Formalization of randomness for infinite objects

We shall stick to infinite sequences of zeros and ones: {0,1}^N.

4.1 Martin-Löf's approach with topology and computability

This approach is an extension to infinite sequences of the one Martin-Löf developed for finite objects, cf. § 3.4.
To prove a probability law amounts to proving that a certain set X of sequences has probability one. To do this, one has to prove that the complement set Y = {0,1}^N ∖ X has probability zero. Now, in order to prove that Y ⊆ {0,1}^N has probability zero, basic measure theory tells us that one has to include Y in open sets with arbitrarily small probability, i.e. for each n ∈ N one must find an open set U_n ⊇ Y which has probability ≤ 1/2^n.
If things were on the real line R, we would say that U_n is a countable union of intervals with rational endpoints. Here, in {0,1}^N, U_n is a countable union of sets of the form u{0,1}^N, where u is a finite binary string and u{0,1}^N is the set of infinite sequences which extend u.
In order to prove that Y has probability zero, for each n ∈ N one must thus find a family (u_{n,m})_{m∈N} such that Y ⊆ ⋃_m u_{n,m}{0,1}^N and Proba(⋃_m u_{n,m}{0,1}^N) ≤ 1/2^n for each n ∈ N.
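Since a cylinder u{0,1}^N has probability 2^{−|u|}, an upper bound on the probability of a union of cylinders is one line of arithmetic; the sketch below (with a hypothetical cover built from the prefixes 0^k 1) illustrates how a family (u_{n,m})_m can be checked against the bound 1/2^n:

```python
# Proba(u{0,1}^N) = 2**-len(u).  For a union of cylinders, subadditivity
# gives an upper bound; dropping any cylinder already covered by a shorter
# prefix in the family tightens it.
def union_measure_bound(prefixes) -> float:
    minimal = [u for u in prefixes
               if not any(u != v and u.startswith(v) for v in prefixes)]
    return min(1.0, sum(2 ** -len(u) for u in set(minimal)))

# A cover from the (pairwise disjoint) prefixes 0^k 1, k = 3..9: its total
# measure is 2**-4 + ... + 2**-10 <= 2**-3, as required of U_3 in the text.
cover = ['0' * k + '1' for k in range(3, 10)]
print(union_measure_bound(cover))   # slightly below 2**-3
```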
Now, Martin-Löf makes a crucial observation: mathematical probability laws which we can consider necessarily have some effective character, and this effectiveness should be reflected in the proof as follows: the doubly indexed sequence (u_{n,m})_{n,m∈N} is computable.
Thus, the set ⋃_m u_{n,m}{0,1}^N is a computably enumerable open set, and ⋂_n ⋃_m u_{n,m}{0,1}^N is a countable intersection of a computably enumerable family of open sets.
Now comes the essential theorem, which is completely analogous to Theorem 28.

Theorem 31 (Martin-Löf [26]). Let's call constructively null G_δ set any set of the form ⋂_n ⋃_m u_{n,m}{0,1}^N where the sequence (u_{n,m}) is computably enumerable and Proba(⋃_m u_{n,m}{0,1}^N) ≤ 1/2^n (which implies that the intersection set has probability zero). There exists a largest constructively null G_δ set.

Let's insist that the theorem says largest: up to nothing, really largest.

Definition 32 (Martin-Löf [26]). A sequence α ∈ {0,1}^N is random if it belongs to no constructively null G_δ set (i.e. if it does not belong to the largest one).

In particular, the family of random sequences, being the complement of a constructively null G_δ set, has probability 1.

4.2 The bottom-up approach

4.2.1 The naive idea badly fails

The natural naive idea is to extend randomness from finite objects to infinite ones. The obvious first approach is to consider sequences α ∈ {0,1}^N such that, for some c,

∀n  K(α ↾ n) ≥ n − c   (1)

However, Martin-Löf proved that there is no such sequence.

Theorem 33 (Martin-Löf [27]). For every α ∈ {0,1}^N there are infinitely many k such that K(α ↾ k) ≤ k − log k + O(1).

Proof. Let f(z) = z − (⌊log z⌋ + 1). First, observe that

f(z + 2) − f(z) = 2 − (⌊log(z + 2)⌋ − ⌊log z⌋) > 0

since log(z + 2) − log z = log(1 + 2/z) ≤ 1 for z ≥ 2, so that ⌊log(z + 2)⌋ − ⌊log z⌋ ≤ 1.
Fix any m and consider α ↾ m.
This word is the binary representation of some integer k such that m = ⌊log k⌋ + 1. Now consider x = α↾k and let y be the suffix of x of length k − m = f(k), so that |y| = k − m = f(k). Since f(z + 2) − f(z) > 0, there are at most two (consecutive) integers k such that f(k) = |y|; one extra bit of information tells which one in case there are two of them. So, from y (plus one bit of information) one gets k, hence the binary representation of k, which is α↾m. By concatenation with y, we recover x = α↾k. This process being effective, Proposition 15 (point 3) ensures that

K(α↾k) ≤ K(y) + O(1) ≤ |y| + O(1) = k − m + O(1) = k − log k + O(1)

The above argument can be extended to prove a much more general result.

Theorem 34 (Large oscillations, Martin-Löf, 1971 [27]). Let f : N → N be a total computable function satisfying Σ_{n∈N} 2^{−f(n)} = +∞. Then, for every α ∈ {0,1}^N, there are infinitely many k such that K(α↾k) ≤ k − f(k).

4.2.2 Miller & Yu's theorem

It took about forty years to get a characterization of randomness via plain Kolmogorov complexity, one which very simply complements Theorem 34.

Theorem 35 (Miller & Yu, 2004 [28]).
1. Let f : N → N be a total computable function satisfying Σ_{n∈N} 2^{−f(n)} < +∞. Then, for every random α ∈ {0,1}^N, there exists c such that K(α↾k | k) ≥ k − f(k) − c for all k.
2. There exists a total computable function f : N → N satisfying Σ_{n∈N} 2^{−f(n)} < +∞ such that for every non random α ∈ {0,1}^N there are infinitely many k such that K(α↾k) ≤ k − f(k).

Recently, an elementary proof of this theorem was given by Bienvenu, Merkle & Shen [2].

4.2.3 Kolmogorov randomness and ∅′

A natural question following Theorem 33 is to look at the so-called Kolmogorov random sequences, which satisfy K(α↾k) ≥ k − O(1) for infinitely many k's.
This question got a very surprising answer, involving randomness relative to the oracle ∅′, the halting problem.

Theorem 36 (Nies, Stephan & Terwijn [30]). Let α ∈ {0,1}^N. There exists c such that K(α↾k) ≥ k − c for infinitely many k (i.e. α is Kolmogorov random) if and only if α is ∅′-random.

4.2.4 Variants of Kolmogorov complexity and randomness

Bottom-up characterizations of random sequences were obtained by Chaitin, Levin and Schnorr using diverse variants of Kolmogorov complexity.

Definition 37.
1. [Schnorr, 1971 [32]] For O = {0,1}*, the process complexity S is the variant of Kolmogorov complexity obtained by restricting to partial computable functions V : {0,1}* → {0,1}* which are monotone, i.e. if p is a prefix of q and V(p), V(q) are both defined, then V(p) is a prefix of V(q).
2. [Chaitin, 1975 [8]] The prefix-free variant H of Kolmogorov complexity is obtained by restricting to partial computable functions {0,1}* → {0,1}* which have prefix-free domains.
3. [Levin, 1970 [40]] For O = {0,1}*, the monotone variant Km of Kolmogorov complexity is obtained as follows: Km(x) is the least |p| such that x is a prefix of U(p), where U is universal among monotone partial computable functions.

Theorem 38. Let α ∈ {0,1}^N. Then α is random if and only if S(α↾k) ≥ k − O(1), if and only if H(α↾k) ≥ k − O(1), if and only if Km(α↾k) ≥ k − O(1).

The main problem with these variants of Kolmogorov complexity is that there is no solid understanding of what the restrictions they involve really mean. Chaitin has introduced the idea of self-delimitation for prefix-free functions: since a program in the domain of U has no extension in the domain of U, it somehow knows where it ends.
Though interesting, this interpretation is not a definitive explanation, as Chaitin himself admits (personal communication). Nevertheless, these variants have wonderful properties. Let's cite one of the most striking ones: taking O = N, the series Σ_n 2^{−H(n)} converges and is the biggest absolutely convergent series up to a multiplicative factor.

5 Application of Kolmogorov complexity to classification

5.1 What is the problem?

Striking results have been obtained, using Kolmogorov complexity, on the problem of classifying quite diverse families of objects: be they literary texts, music pieces, examination scripts (laxly supervised) or, at a different level, natural languages and natural species (phylogeny). The authors, mainly Bennett, Vitányi and Cilibrasi, have worked out refined methods which run along the following lines.

(1) Define a specific family of objects which we want to classify. For example, a set of Russian literary texts that we want to group by authors; in this simple case all texts are written in their original Russian language. Another instance: music. In that case a common translation is necessary, i.e. a normalization of the texts of the music pieces that we want to group by composer; this is required in order to be able to compare them. An instance at a different level: the 52 main European languages. In that case one has to choose a canonical text and its representations in each one of the different languages (i.e. a corpus) that we consider, for instance the Universal Declaration of Human Rights and its translations into these languages, an example which was a basic test for Vitányi's method. As concerns natural species, the canonical object will be a DNA sequence. What has to be done is to select, define and normalize a family of objects or corpus that we want to classify.
Observe that this is not always an obvious step:
• There may be no possible normalization, for instance with artists' paintings.
• The family to be classified may be finite though ill-defined, or even of unknown size, cf. §5.3.1.

(2) In the end we are left with a family of words over some fixed alphabet, representing objects for which we want to compare and measure pairwise the common information content. This is done by defining a distance on these pairs of (binary) words, with the following intuition: the more information two words share, the closer they are and the shorter their distance; conversely, the less information they share, the more independent and uncorrelated they are, and the bigger their distance. Two identical words have a null distance. Two totally independent words (for example, words representing 100 coin tosses each) have distance about 1 (for a normalized distance bounded by 1). Observe that the authors follow Kolmogorov's approach, which was to define a numerical measure of the information content of words, i.e. a measure of their randomness, in exactly the same way a volume or a surface gets a numerical measure.

(3) Associate a classification to the objects or corpus defined in (1), using the numerical measures of the distances introduced in (2). This step is presently the least formally defined one. The authors give representations of the obtained classifications using tables, trees, graphs, ... This is indeed more a visualization of the obtained classification than a formal classification; here the authors have no powerful formal framework such as, for example, Codd's relational model of databases and its extension to object databases with trees. How are we to interpret their tables or trees? We face a problem, and a classical one,
for instance with distances between DNA sequences, or with the acyclic graph structure of Unix files in a computer. This is much like the rudimentary, not too formal, classification of words in a dictionary of synonyms. Nevertheless, with these methods Vitányi et al. obtained a classification tree for the 52 European languages which is the one obtained by linguists, a remarkable success. And the phylogenetic trees relative to parenthood are precisely those obtained via DNA sequence comparisons by biologists.

(4) An important problem remains in order to use a distance to obtain a classification as in (3). Let's cite Cilibrasi [9]:

Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than tiny objects that differ by the same amount. For example, two whole mitochondrial genomes of 18,000 bases that differ by 9,000 are very different, while two whole nuclear genomes of 3 × 10^9 bases that differ by only 9,000 bases are very similar. Thus, absolute difference between two objects does not govern similarity, but relative difference seems to.

As we shall see, this problem is easy to fix by some normalization of distances.

(5) Finally, all these methods rely on Kolmogorov complexity, which is a non computable function (cf. §2.2). The remarkable idea introduced by Vitányi is as follows:
• consider the Kolmogorov complexity of an object as the ultimate and ideal value of the compression of that object,
• and compute approximations of this ideal compression using usual efficient compressors such as gzip, bzip2, PPM, ...
Observe that the quality and speed of such compressors is largely due to heavy use of statistical tools. For example, PPM (Prediction by Partial Matching) uses a pleasing mix of statistical models arranged along trees, suffix trees or suffix arrays.
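This compression-as-approximation idea is easy to see in action. A minimal sketch (ours, not the authors' code; zlib stands in for gzip, and the byte thresholds are merely indicative): the length of a compressor's output is a computable upper bound on Kolmogorov complexity, tiny for regular words and close to the length itself for incompressible ones.

```python
import random
import zlib

def approx_K(x: bytes) -> int:
    """Computable upper bound on K(x): the length of the zlib-compressed form."""
    return len(zlib.compress(x, 9))

regular = b"01" * 5000                 # a highly regular word of 10 000 bytes
random.seed(1)                         # a pseudorandom word of the same length
incompressible = bytes(random.randrange(256) for _ in range(10_000))

# approx_K(regular) is a few dozen bytes, while approx_K(incompressible)
# stays close to 10 000: the compressor "sees" the regularity of the first word.
```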
The remarkable efficiency of these tools is of course due to several dozen years of research in data compression, and as time goes on, they improve and better approximate Kolmogorov complexity. Replacing the "pure" but non computable Kolmogorov complexity by a banal compression algorithm such as gzip is quite a daring step, taken by Vitányi!

5.2 Classification via compression

5.2.1 The normalized information distance NID

We now formalize the notions described above. The idea is to measure the information content shared by two binary words representing some objects in a family we want to classify. The first such attempt goes back to the 90's [1]: Bennett et al. define a notion of information distance between two words x, y as the size of the shortest program which maps x to y and y to x. These considerations rely on the notion of reversible computation. A possible formal definition for such a distance is

ID(x, y) = least |p| such that U(p, x) = y and U(p, y) = x

where U : {0,1}* × {0,1}* → {0,1}* is optimal for K( | ). An alternative definition is as follows:

ID′(x, y) = max{K(x | y), K(y | x)}

The intuition for these definitions is that the shortest program which computes x from y takes into account all similarities between x and y. Observe that the two definitions do not coincide (even up to logarithmic terms), but they lead to similar developments and efficient applications.

Note. In the above formula, K can be plain Kolmogorov complexity or its prefix version. In fact, this does not matter, for a simple reason: all properties involving this distance will be true up to an O(log(|x|), log(|y|)) term, and the difference between K(z | t) and H(z | t) is bounded by 2 log(|z|). For conceptual simplicity, we stick to plain Kolmogorov complexity.
ID and ID′ satisfy the axioms of a distance up to a logarithmic term. The strict axioms for a distance d are

d(x, x) = 0    (identity)
d(x, y) = d(y, x)    (symmetry)
d(x, z) ≤ d(x, y) + d(y, z)    (triangle inequality)

The up-to-a-log-term axioms which are satisfied by ID and ID′ are as follows:

ID(x, x) = O(1)
ID(x, y) = ID(y, x)
ID(x, z) ≤ ID(x, y) + ID(y, z) + O(log(ID(x, y) + ID(y, z)))

Proof. Let e be such that U(e, x) = x for all x. Then ID(x, x) ≤ |e| = O(1). No better upper bound is possible (except if we assume that the empty word is such an e). Let now p, p′, q, q′ be shortest programs such that U(p, y) = x, U(p′, x) = y, U(q, z) = y, U(q′, y) = z. Thus, K(x | y) = |p|, K(y | x) = |p′|, K(y | z) = |q|, K(z | y) = |q′|. Consider the injective computable pairing function ⟨ , ⟩ of Proposition 6, which is such that |⟨r, s⟩| = |r| + |s| + O(log |r|). Let ϕ : {0,1}* × {0,1}* → {0,1}* be such that ϕ(⟨r, s⟩, x) = U(s, U(r, x)). Then ϕ(⟨q, p⟩, z) = U(p, U(q, z)) = U(p, y) = x, so that, by the invariance theorem,

K(x | z) ≤ K_ϕ(x | z) + O(1) ≤ |⟨q, p⟩| + O(1) = |q| + |p| + O(log |q|) = K(y | z) + K(x | y) + O(log K(y | z))

and similarly for the other terms, which proves the stated approximations of the axioms.

It turns out that such approximations of the axioms are enough for the development of the theory. As said in §5.1, to avoid scale distortion, the distance ID is normalized to NID (normalized information distance) as follows:

NID(x, y) = max(K(x | y), K(y | x)) / max(K(x), K(y))

The remaining problem is that this distance is not computable, since K is not.
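The intuition of step (2) in §5.1 can be checked directly on the NID formula, as a back-of-the-envelope computation using two standard facts: K(x | x) = O(1), and for independent random (incompressible) words x, y of length n, K(x | y) = n − O(1) while K(x) = n − O(1):

```latex
% Identical words: all information is shared, the distance tends to 0
NID(x,x) \;=\; \frac{\max(K(x\mid x),K(x\mid x))}{\max(K(x),K(x))}
         \;=\; \frac{O(1)}{K(x)} \;\longrightarrow\; 0 \quad (K(x)\to\infty)

% Independent random words of length n: nothing is shared, distance tends to 1
NID(x,y) \;=\; \frac{n-O(1)}{n-O(1)} \;\longrightarrow\; 1 \quad (n\to\infty)
```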
Here comes Vitányi's daring idea: consider this NID as an ideal distance, which is to be approximated by replacing the Kolmogorov function K by computable compression algorithms which keep on improving.

5.2.2 The normalized compression distance NCD

Approximating K(x) by C(x), where C is a compressor, does not suffice: we also have to approximate the conditional Kolmogorov complexity K(y | x). Vitányi chooses the following approximation:

C(y | x) = C(xy) − C(x)

The authors explain their intuition as follows: to compress the word xy (x concatenated with y),
- the compressor first compresses x,
- then it compresses y, but skips all information from y which was already in x.
Thus, the output is not a compression of y but a compression of y with all the information from x removed, i.e. this output is a conditional compression of y knowing x. Now, the assumption that the compressor first compresses x is questionable: how does the compressor recover x in xy? One can argue positively in case x, y are random (i.e. incompressible) and in case x = y. And between these two extreme cases? But it works... The miracle of modeling? Or something not completely understood? With this approximation, plus the assumption that C(xy) = C(yx) (also questionable), we get the following approximation of NID, called the normalized compression distance NCD:

NCD(x, y) = max(C(x | y), C(y | x)) / max(C(x), C(y))
          = max(C(yx) − C(y), C(xy) − C(x)) / max(C(x), C(y))
          = (C(xy) − min(C(x), C(y))) / max(C(x), C(y))

Clustering according to NCD and, more generally, classification via compression, is a kind of black box: words are grouped together according to features that are not explicitly known to us.
Moreover, there is no reasonable hope that an analysis of the computation done by the compressor would shed light on the obtained clusters. For example, what makes a text by Tolstoy so characteristic? What differentiates the styles of Tolstoy and Dostoevsky? But it works: Russian texts are grouped by author by a compressor which ignores everything about Russian literature. When dealing with a classification obtained by compression, one should already have some idea of this classification: this is semantics, whereas the compressor is purely syntactic and does not understand anything. This is very much like machines which, given some formal deduction system, are able to prove quite complex statements: these theorems are proved with no explicit semantical idea, so how are we to interpret them? There is no hope that the machine gives any hint.

5.3 The Google classification

Though it does not use Kolmogorov complexity, we now present another recent approach by Vitányi and Cilibrasi [11] to classification, which leads to a very effective tool.

5.3.1 The normalized Google distance NGD

This quite original method is based on the huge data bank constituted by the world wide web and on the Google search engine, which allows for basic queries using conjunctions of keywords. Observe that the web is not a database, merely a data bank, since the data on the web are not structured as the data of a database are. Citing [15], the idea of the method is as follows:

When the Google search engine is used to search for the word x, Google displays the number of hits that word x has. The ratio of this number to the total number of webpages indexed by Google represents the probability that word x appears on a webpage [...]
If word y has a higher conditional probability to appear on a webpage, given that word x also appears on the webpage, than it does by itself, then it can be concluded that words x and y are related.

Let's cite an example from Cilibrasi and Vitányi [10], for which we complete and update the figures. Searches for the index terms "horse", "rider" and "molecule" respectively return 156, 62.2 and 45.6 million hits. Searches for the pairs of words "horse rider" and "horse molecule" respectively return 2.66 and 1.52 million hits. These figures stress a stronger relation between the words "horse" and "rider" than between "horse" and "molecule". Another example, with famous paintings: "Déjeuner sur l'herbe", "Moulin de la Galette" and "La Joconde"; let us refer to them as a, b, c. Google searches for a, b and c respectively give 446 000, 278 000 and 1 310 000 hits. As for the searches for the pairs a+b, a+c and b+c, they respectively give 13 700, 888 and 603 hits. Clearly, the two paintings by Renoir are more often cited together than either of them is with the painting by da Vinci. In this way, the method regroups paintings by artist, using what is said about these paintings on the web. But it does not associate the painters to the groups of paintings. Formally, Cilibrasi and Vitányi [10, 11] define the normalized Google distance as follows:

NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log M − min(log f(x), log f(y)))

where f(z_1, ...) is the number of hits for the conjunctive query z_1, ... and M is the total number of webpages that Google indexes.

5.3.2 Discussing the method

1. The number of objects in a future classification, and that of canonical representatives of the different corpus, is not chosen in advance, is not even boundable in advance, and is constantly changing. This dynamical and uncontrolled feature is a totally new experience.
2.
Domains a priori completely resistant to classification, such as the pictorial domain (no normalization of paintings is possible), can now be considered, because we are no longer dealing with the paintings themselves but with what is said about them on the web. And whereas the "pictorial language" is merely a metaphor, this is a true "language" which deals with keywords and their relations in the texts written by web users.

3. However, there is a big limitation to the method, that of a closed world: the World according to Google, Information according to Google... If Google finds something, one can check its pertinence. Otherwise, what does it mean? The sole certainty is that of uncertainty. When failing to get hits with several keywords, we give up the original query and modify it up to the point where Google gives some pertinent answers. So failure works like negation in Prolog, which is much weaker than classical negation. It is reasonable to give up a query and accordingly consider the related conjunction as meaningless. However, one should keep in mind that this is relative to the closed - and relatively small - world of data on the web, the sole world accessible to Google. When succeeding with a query, the risk is to stop on this succeeding query and
- forget that previous queries have been tried which failed,
- omit going on with some other queries which could possibly lead to more pertinent answers.
There is a need to formalize information on the web and the relations ruling the data it contains, and also the notion of pertinence. A mathematical framework is badly needed. This remarkably innovative approach is still in its infancy.

5.4 Some final remarks

These approaches to classification via compression and Google search of the web are really provocative.
They allow for classification of diverse corpus along a top-down operational mode, as opposed to bottom-up grouping. Top-down, since there is no prerequisite of any a priori knowledge of the content of the texts under consideration: one gets information on the texts without entering into their semantics, simply by compressing them or counting hits with Google. This bears much resemblance to statistical methods which detect correlations in order to group objects; indeed, compressors and Google make heavy use of statistical expertise. In contrast, a bottom-up approach uses keywords which have to be previously known, so that we already have in mind what the groups of the classification should be.

Let's illustrate this top-down versus bottom-up opposition by contrasting three approaches related to the classical comprehension schema.

Mathematical approach. This is a global, intrinsically deterministic approach along a fundamental dichotomy: true/false, provable/inconsistent. A quest for absoluteness based on certainty. This is reflected in the classical comprehension schema

∀y ∃Z  Z = {x ∈ y | P(x)}

where P is a property fixed in advance.

Probabilistic approach. In this pragmatic approach uncertainty is taken into consideration; it is bounded and treated mathematically. This can be related to a probabilistic version of the comprehension schema where the truth of P(x) is replaced by some limitation of the uncertainty: the probability that x satisfies P lies in a given interval. This asks for a two-argument property P:

∀y ∃Z  Z = {x ∈ y | µ({ω ∈ Ω | P(x, ω)}) ∈ I}

where µ is a probability measure on some space Ω and I is some interval of [0, 1].

The above mathematical and probabilistic approaches are bottom-up: one starts with a given P to group objects.

Google approach. Now, there is no idea of the interval of uncertainty.
Google may give anywhere from 0% to 100% pertinent answers. This seems much harder to put into a mathematical framework, but it is quite an exciting approach, one of the few top-down ones, together with the compression approach and those based on statistical inference. This Google approach reveals properties, regularity laws.

References

[1] C. Bennett, P. Gács, M. Li and W. Zurek. Information distance. IEEE Trans. on Information Theory, 44(4):1407–1423, 1998.
[2] L. Bienvenu, W. Merkle and A. Shen. A simple proof of Miller-Yu theorem. To appear.
[3] G. Bonfante, M. Kaczmarek and J-Y. Marion. On abstract computer virology: from a recursion-theoretic perspective. Journal of Computer Virology, 3-4, 2006.
[4] G. Chaitin. On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547–569, 1966.
[5] G. Chaitin. On the length of programs for computing finite binary sequences: statistical considerations. Journal of the ACM, 16:145–159, 1969.
[6] G. Chaitin. Computational complexity and Gödel incompleteness theorem. ACM SIGACT News, 9:11–12, 1971.
[7] G. Chaitin. Information theoretic limitations of formal systems. Journal of the ACM, 21:403–424, 1974.
[8] G. Chaitin. A theory of program size formally identical to information theory. Journal of the ACM, 22:329–340, 1975.
[9] R. Cilibrasi. Clustering by compression. IEEE Trans. on Information Theory, 51(4):1523–1545, 2005.
[10] R. Cilibrasi and P. Vitányi. Google teaches computers the meaning of words. ERCIM News, 61, April 2005.
[11] R. Cilibrasi and P. Vitányi. The Google similarity distance. IEEE Trans. on Knowledge and Data Engineering, 19(3):370–383, 2007.
[12] J.-P. Delahaye. Information, complexité, hasard. Hermès, 1999 (2nd edition).
[13] J.-P. Delahaye. Classer musiques, langues, images, textes et génomes. Pour La Science, 316:98–103, 2004.
[14] J.-P. Delahaye. Complexités : aux limites des mathématiques et de l'informatique. Pour La Science, 2006.
[15] A. Evangelista and B. Kjos-Hanssen. Google distance between words. Frontiers in Undergraduate Research, Univ. of Connecticut, 2006.
[16] W. Feller. Introduction to Probability Theory and Its Applications, volume 1. John Wiley, 1968 (3rd edition).
[17] P. Gács. Lecture notes on descriptional complexity and randomness. Boston University, pages 1–67, 1993. http://cs-pub.bu.edu/faculty/gacs/Home.html.
[18] D.A. Huffman. A method for construction of minimum-redundancy codes. Proceedings IRE, 40:1098–1101, 1952.
[19] D. Knuth. The Art of Computer Programming. Volume 2: Seminumerical Algorithms. Addison-Wesley, 1981 (2nd edition).
[20] A.N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer-Verlag, 1933. English translation: Foundations of the Theory of Probability, Chelsea, 1956.
[21] A.N. Kolmogorov. On tables of random numbers. Sankhya, The Indian Journal of Statistics, ser. A, 25:369–376, 1963.
[22] A.N. Kolmogorov. Three approaches to the quantitative definition of information. Problems Inform. Transmission, 1(1):1–7, 1965.
[23] A.N. Kolmogorov. Combinatorial foundation of information theory and the calculus of probability. Russian Math. Surveys, 38(4):29–40, 1983.
[24] M. Li, X. Chen, X. Li, B. Ma and P. Vitányi. The similarity metric. 14th ACM-SIAM Symposium on Discrete Algorithms, 2003.
[25] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2nd edition, 1997.
[26] P. Martin-Löf. The definition of random sequences. Information and Control, 9:602–619, 1966.
[27] P. Martin-Löf. Complexity of oscillations in infinite binary sequences. Z. Wahrscheinlichkeitstheorie verw. Geb., 19:225–230, 1971.
[28] J. Miller and L. Yu. On initial segment complexity and degrees of randomness. Trans. Amer. Math. Soc., to appear.
[29] J. von Neumann. Various techniques used in connection with random digits. Monte Carlo Method, A.S. Householder, G.E. Forsythe and H.H. Germond, eds., National Bureau of Standards Applied Mathematics Series (Washington, D.C.: U.S. Government Printing Office), 12:36–38, 1951.
[30] A. Nies, F. Stephan and S.A. Terwijn. Randomness, relativization and Turing degrees. To appear.
[31] B. Russell. Mathematical logic as based on the theory of types. Amer. J. Math., 30:222–262, 1908. Reprinted in 'From Frege to Gödel: A Source Book in Mathematical Logic, 1879–1931', J. van Heijenoort ed., p. 150–182, 1967.
[32] P. Schnorr. A unified approach to the definition of random sequences. Math. Systems Theory, 5:246–258, 1971.
[33] P. Schnorr. Process complexity and effective random tests. J. of Computer and System Sc., 7:376–388, 1973.
[34] C.E. Shannon. The mathematical theory of communication. Bell System Tech. J., 27:379–423, 1948.
[35] R. Soare. Computability and recursion. Bulletin of Symbolic Logic, 2:284–321, 1996.
[36] R. Solomonoff. A formal theory of inductive inference, part I. Information and Control, 7:1–22, 1964.
[37] R. Solomonoff. A formal theory of inductive inference, part II. Information and Control, 7:224–254, 1964.
[38] R. von Mises. Grundlagen der Wahrscheinlichkeitsrechnung. Mathemat. Zeitsch., 5:52–99, 1919.
[39] R. von Mises. Probability, Statistics and Truth. Macmillan, 1939. Reprinted: Dover, 1981.
[40] A. Zvonkin and L. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Math. Surveys, 25(6):83–124, 1970.