Some information-theoretic computations related to the distribution of prime numbers

We illustrate how elementary information-theoretic ideas may be employed to provide proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary and fairly short proof of the following asymptotic result: the sum of (log p)/p over all primes p not exceeding n is asymptotic to log n, as n → ∞.

Authors: Ioannis Kontoyiannis

Some Information-Theoretic Computations Related to the Distribution of Prime Numbers

Submitted to Jorma Rissanen's Festschrift volume

I. Kontoyiannis∗

November 8, 2018

Abstract

We illustrate how elementary information-theoretic ideas may be employed to provide proofs for well-known, nontrivial results in number theory. Specifically, we give an elementary and fairly short proof of the following asymptotic result,
$$\sum_{p \le n} \frac{\log p}{p} \sim \log n, \quad \text{as } n \to \infty,$$
where the sum is over all primes p not exceeding n. We also give finite-n bounds refining the above limit. This result, originally proved by Chebyshev in 1852, is closely related to the celebrated prime number theorem.

∗Department of Informatics, Athens University of Economics and Business, Patission 76, Athens 10434, Greece. Email: yiannis@aueb.gr. Web: http://pages.cs.aueb.gr/users/yiannisk/.

1 Introduction

The significant depth of the connection between information theory and statistics appears to have been recognized very soon after the birth of information theory [17] in 1948; a book-length exposition was provided by Kullback [12] already in 1959. In subsequent decades much was accomplished, and in the 1980s the development of this connection culminated in Rissanen's celebrated work [14][15][16], laying the foundations for the notion of stochastic complexity and the Minimum Description Length principle, or MDL.

Here we offer a first glimpse of a different connection, this time between information theory and number theory. In particular, we will show that basic information-theoretic arguments combined with elementary computations can be used to give a new proof for a classical result concerning the distribution of prime numbers.
The problem of understanding this "distribution" (including the issue of exactly what is meant by that statement) has, of course, been at the heart of mathematics since antiquity, and it has led, among other things, to the development of the field of analytic number theory; e.g., Apostol's text [1] offers an accessible introduction and [2] gives a more historical perspective. A major subfield is probabilistic number theory, where probabilistic tools are used to derive results in number theory. This approach, pioneered by, among others, Mark Kac and Paul Erdős from the 1930s on, is described, e.g., in Kac's beautiful book [11], Billingsley's review [3], and Tenenbaum's more recent text [18].

The starting point in much of the relevant literature is the following setup: For a fixed, large integer n, choose a random integer N from {1, 2, ..., n}, and write it in its unique prime factorization,
$$N = \prod_{p \le n} p^{X_p}, \qquad (1)$$
where the product runs over all primes p not exceeding n, and X_p is the largest power k ≥ 0 such that p^k divides N. Through this representation, the uniform distribution of N induces a joint distribution on the {X_p ; p ≤ n}, and the key observation is that, for large n, the random variables {X_p} are distributed approximately like independent geometrics. Indeed, since there are exactly ⌊n/p^k⌋ multiples of p^k between 1 and n,
$$\Pr\{X_p \ge k\} = \Pr\{N \text{ is a multiple of } p^k\} = \frac{1}{n}\left\lfloor \frac{n}{p^k} \right\rfloor \approx \left(\frac{1}{p}\right)^k, \quad \text{for large } n, \qquad (2)$$
so the distribution of X_p is approximately geometric. Similarly, for the joint distribution of the {X_p} we find,
$$\Pr\{X_{p_i} \ge k_i \text{ for primes } p_1, p_2, \ldots, p_m \le n\} = \frac{1}{n}\left\lfloor \frac{n}{p_1^{k_1} p_2^{k_2} \cdots p_m^{k_m}} \right\rfloor \approx \left(\frac{1}{p_1}\right)^{k_1} \left(\frac{1}{p_2}\right)^{k_2} \cdots \left(\frac{1}{p_m}\right)^{k_m},$$
showing that the {X_p} are approximately independent.
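The quality of the approximation in (2) is easy to see numerically. The following is a small Python sketch of our own (the values of n, p and k below are arbitrary illustrative choices, not from the paper); since 0 ≤ n/p^k − ⌊n/p^k⌋ < 1, the exact probability is always within 1/n of the geometric value (1/p)^k:

```python
# Numerical check of approximation (2): for N uniform on {1, ..., n},
# Pr{X_p >= k} = floor(n / p^k) / n, which is within 1/n of (1/p)^k.

def prob_xp_ge_k(n: int, p: int, k: int) -> float:
    """Exact probability that p^k divides a uniformly chosen N in {1, ..., n}."""
    return (n // p**k) / n

n = 10**6
for p in (2, 3, 5, 7):
    for k in (1, 2, 3):
        exact = prob_xp_ge_k(n, p, k)
        approx = (1 / p) ** k
        # The rounding error of the floor is below 1 in n.
        assert abs(exact - approx) < 1 / n
```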
This elegant approximation is also mathematically powerful, as it makes it possible to translate standard results about collections of independent random variables into important properties that hold for every "typical" integer N. Billingsley in his 1973 Wald Memorial Lectures [3] gives an account of the state-of-the-art of related results up to that point, but he also goes on to make a further, fascinating connection with the entropy of the random variables {X_p}.

Billingsley's argument essentially begins with the observation that, since the representation (1) is unique, the value of N and the values of the exponents {X_p} are in a one-to-one correspondence; therefore, the entropy of N is the same as the entropy of the collection {X_p},¹
$$\log n = H(N) = H(X_p \,;\, p \le n).$$
And since the random variables {X_p} are approximately independent geometrics, we should expect that,
$$\log n = H(X_p \,;\, p \le n) \approx \sum_{p \le n} H(X_p) \approx \sum_{p \le n} \left[ \frac{\log p}{p-1} - \log\left(1 - \frac{1}{p}\right) \right], \qquad (3)$$
where in the last equality we simply substituted the well-known expression for the entropy of a geometric random variable (see Section 2 for details on the definition of the entropy and its computation). For large p, the above summands behave like (log p)/p to first order, leading to the asymptotic estimate,
$$\sum_{p \le n} \frac{\log p}{p} \approx \log n, \quad \text{for large } n.$$
Our main goal in this paper is to show that this approximation can indeed be made rigorous, mostly through elementary information-theoretic arguments; we will establish:

Theorem 1. As n → ∞,
$$C(n) := \sum_{p \le n} \frac{\log p}{p} \sim \log n, \qquad (4)$$
where the sum is over all primes p not exceeding n.²
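The convergence in Theorem 1 is slow but already visible numerically. The following Python sketch is our own illustration (the sieve and the cut-off values are arbitrary choices); it computes the ratio C(n)/log n, which creeps toward 1:

```python
import math

def primes_upto(n: int) -> list:
    """Primes p <= n via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def C(n: int) -> float:
    """Chebyshev's sum C(n) = sum_{p <= n} (log p) / p, as in (4)."""
    return sum(math.log(p) / p for p in primes_upto(n))

for n in (10**2, 10**4, 10**6):
    print(f"n = {n:>9}: C(n)/log n = {C(n) / math.log(n):.4f}")
```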
As described in more detail in the following section, the fact that the joint distribution of the {X_p} is asymptotically close to the distribution of independent geometrics is not sufficient to turn Billingsley's heuristic into an actual proof; at least, we were not able to make the two "≈" steps in (3) rigorous directly. Instead, we provide a proof in two steps. We modify Billingsley's heuristic to derive a lower bound on C(n) in Theorem 2, and in Theorem 3 we use a different argument, again going via the entropy of N, to compute a corresponding upper bound. These two combined prove Theorem 1, and they also give finer, finite-n bounds on C(n).

In Section 2 we state our main results and describe the intuition behind their proofs. We also briefly review some other elegant information-theoretic arguments connected with bounds on the number of primes up to n. The appendix contains the remaining proofs.

Before moving on to the results themselves, a few words about the history of Theorem 1 are in order. The relationship (4) was first proved by Chebyshev [7][6] in 1852, where he also produced finite-n bounds on C(n), with explicit constants. Chebyshev's motivation was to prove the celebrated prime number theorem (PNT), stating that π(n), the number of primes not exceeding n, grows like,
$$\pi(n) \sim \frac{n}{\log n}, \quad \text{as } n \to \infty.$$
This was conjectured by Gauss around 1792, and it was only proved in 1896; Chebyshev was not able to produce a complete proof, but he used (4) and his finer bounds on C(n) to show that π(n) is of order n/log n.

¹For definiteness, we take log to denote the natural logarithm to base e throughout, although the choice of the base of the logarithm is largely irrelevant for our considerations.

²As usual, the notation "a_n ∼ b_n as n → ∞" means that lim_{n→∞} a_n/b_n = 1.
Although we will not pursue this direction here, it is actually not hard to see that the asymptotic behavior of C(n) is intimately connected with that of π(n). For example, a simple exercise in summation by parts shows that π(n) can be expressed directly in terms of C(n):
$$\pi(n) = \frac{n+1}{\log(n+1)}\, C(n) - \sum_{k=2}^{n} \left( \frac{k+1}{\log(k+1)} - \frac{k}{\log k} \right) C(k), \quad \text{for all } n \ge 3. \qquad (5)$$
For the sake of completeness, this is proved in the appendix.

The PNT was finally proved in 1896 by Hadamard and by de la Vallée-Poussin. Their proofs were not elementary; both relied on the use of Hadamard's theory of integral functions applied to the Riemann zeta function ζ(s); see [2] for some details. In fact, for quite some time it was believed that no elementary proof would ever be found, and G.H. Hardy in a famous lecture to the Mathematical Society of Copenhagen in 1921 [4] went as far as to suggest that "if anyone produces an elementary proof of the PNT ... he will show that ... it is time for the books to be cast aside and for the theory to be rewritten." It is, therefore, not surprising that Selberg and Erdős' announcement in 1948 that they had produced such an elementary proof caused a great sensation in the mathematical world; see [9] for a survey. In our context, it is interesting to note that Chebyshev's result is again used explicitly in one of the steps of this elementary proof.

Finally we remark that, although the simple arguments in this work fall short of giving estimates precise enough for an elementary information-theoretic proof of the PNT, it may not be entirely unreasonable to hope that such a proof may exist.

2 Primes and Bits: Heuristics and Results

2.1 Preliminaries

For a fixed (typically large) n ≥ 2, our starting point is the setting described in the introduction. Take N to be a uniformly distributed integer in {1, 2, ...
, n} and write it in its unique prime factorization as in (1),
$$N = \prod_{p \le n} p^{X_p} = p_1^{X_{p_1}} \cdot p_2^{X_{p_2}} \cdots p_{\pi(n)}^{X_{p_{\pi(n)}}},$$
where π(n) denotes the number of primes p_1, p_2, ..., p_{π(n)} up to n, and X_p is the largest integer power k ≥ 0 such that p^k divides N. As noted in (2) above, the distribution of X_p can be described by,
$$\Pr\{X_p \ge k\} = \frac{1}{n}\left\lfloor \frac{n}{p^k} \right\rfloor, \quad \text{for all } k \ge 1. \qquad (6)$$
This representation also gives simple upper and lower bounds on its mean E(X_p),
$$\mu_p := E(X_p) = \sum_{k \ge 1} \Pr\{X_p \ge k\} \le \sum_{k \ge 1} \left(\frac{1}{p}\right)^k = \frac{1/p}{1 - 1/p} = \frac{1}{p-1}, \qquad (7)$$
and
$$\mu_p \ge \Pr\{X_p \ge 1\} \ge \frac{1}{p} - \frac{1}{n}. \qquad (8)$$
Recall the important observation that the distribution of each X_p is close to a geometric. To be precise, a random variable Y with values in {0, 1, 2, ...} is said to have a geometric distribution with mean µ > 0, denoted Y ∼ Geom(µ), if Pr{Y = k} = µ^k/(1 + µ)^{k+1}, for all k ≥ 0. Then Y of course has mean E(Y) = µ and its entropy is,
$$h(\mu) := H(\mathrm{Geom}(\mu)) = -\sum_{k \ge 0} \Pr\{Y = k\} \log \Pr\{Y = k\} = (\mu + 1)\log(\mu + 1) - \mu \log \mu. \qquad (9)$$
See, e.g., [8] for the standard properties of the entropy.

2.2 Billingsley's Heuristic and Lower Bounds on C(n)

First we show how Billingsley's heuristic can be modified to yield a lower bound on C(n). Arguing as in the introduction,
$$\log n \overset{(a)}{=} H(N) \overset{(b)}{=} H(X_p \,;\, p \le n) \overset{(c)}{\le} \sum_{p \le n} H(X_p) \overset{(d)}{\le} \sum_{p \le n} H(\mathrm{Geom}(\mu_p)) \overset{(e)}{=} \sum_{p \le n} h(\mu_p), \qquad (10)$$
where (a) is simply the entropy of the uniform distribution, (b) comes from the fact that N and the {X_p} are in a one-to-one correspondence, (c) is the well-known subadditivity of the entropy, (d) is because the geometric has maximal entropy among all distributions on the non-negative integers with a fixed mean, and (e) is the definition of h(µ) in (9).
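Every quantity in the chain (10) is exactly computable for moderate n, so the end-to-end inequality can be confirmed directly. A Python sketch of our own (test values of n are arbitrary choices), computing the exact means µ_p from (6) and the geometric entropies h(µ_p) from (9):

```python
import math

def primes_upto(n: int) -> list:
    """Primes p <= n via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def h(mu: float) -> float:
    """Entropy (in nats) of a Geom(mu) random variable, formula (9)."""
    return (mu + 1) * math.log(mu + 1) - mu * math.log(mu)

def mu_p(n: int, p: int) -> float:
    """Exact mean of X_p via (6): sum over k >= 1 of floor(n/p^k)/n."""
    total, pk = 0.0, p
    while pk <= n:
        total += (n // pk) / n
        pk *= p
    return total

for n in (10, 100, 10**4):
    upper = sum(h(mu_p(n, p)) for p in primes_upto(n))
    assert math.log(n) <= upper  # the end-to-end inequality in (10)
```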
Noting that h(µ) is nondecreasing in µ and recalling the upper bound on µ_p in (7) gives,
$$\log n \le \sum_{p \le n} h(\mu_p) \le \sum_{p \le n} h(1/(p-1)) = \sum_{p \le n} \left[ \frac{p}{p-1} \log\left(\frac{p}{p-1}\right) - \frac{1}{p-1} \log\left(\frac{1}{p-1}\right) \right]. \qquad (11)$$
Rearranging the terms in the sum proves:

Theorem 2. For all n ≥ 2,
$$T(n) := \sum_{p \le n} \left[ \frac{\log p}{p-1} - \log\left(1 - \frac{1}{p}\right) \right] \ge \log n.$$

Since the summands above behave like (log p)/p for large p, it is not difficult to deduce the following lower bounds on C(n) = Σ_{p≤n} (log p)/p:

Corollary 1. [Lower Bounds on C(n)]
(i) lim inf_{n→∞} C(n)/log n ≥ 1;
(ii) C(n) ≥ (86/125) log n − 2.35, for all n ≥ 16.

Corollary 1 is proved in the appendix. Part (i) proves half of Theorem 1, and (ii) is a simple evaluation of the more general bound derived in equation (15) in the proof: For any N_0 ≥ 2, we have,
$$C(n) \ge \left(1 - \frac{1}{N_0}\right)\left(1 - \frac{1}{1 + \log N_0}\right) \log n + C(N_0) - T(N_0), \quad \text{for all } n \ge N_0.$$

2.3 A Simple Upper Bound on C(n)

Unfortunately, it is not clear how to reverse the inequalities in equations (10) and (11) to get a corresponding upper bound on C(n), especially inequality (c) in (10). Instead we use a different argument, one which is less satisfying from an information-theoretic point of view, for two reasons. First, although again we do go via the entropy of N, it is not necessary to do so; see equation (13) below. And second, we need to use an auxiliary result, namely, the following rough estimate on the sum ϑ(n) := Σ_{p≤n} log p:
$$\vartheta(n) := \sum_{p \le n} \log p \le (2 \log 2)\, n, \quad \text{for all } n \ge 2. \qquad (12)$$
For completeness, it is proved at the end of this section.

To obtain an upper bound on C(n), we note that the entropy of N, H(N) = log n, can be expressed in an alternative form: Let Q denote the probability mass function of N, so that Q(k) = 1/n for all 1 ≤ k ≤ n.
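Theorem 2 and the explicit bound of Corollary 1(ii) can both be checked numerically. A Python sketch of our own (the test values of n are arbitrary choices):

```python
import math

def primes_upto(n: int) -> list:
    """Primes p <= n via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def T(n: int) -> float:
    """T(n) = sum_{p <= n} [log p/(p-1) - log(1 - 1/p)], as in Theorem 2."""
    return sum(math.log(p) / (p - 1) - math.log(1 - 1 / p)
               for p in primes_upto(n))

def C(n: int) -> float:
    """C(n) = sum_{p <= n} (log p)/p."""
    return sum(math.log(p) / p for p in primes_upto(n))

for n in (2, 16, 100, 10**4):
    assert T(n) >= math.log(n)  # Theorem 2
    if n >= 16:
        assert C(n) >= (86 / 125) * math.log(n) - 2.35  # Corollary 1(ii)
```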
Since N ≤ n = 1/Q(N) always, we have,
$$H(N) = E[-\log Q(N)] \ge E[\log N] = E\left[\log \prod_{p \le n} p^{X_p}\right] = \sum_{p \le n} E(X_p) \log p. \qquad (13)$$
Therefore, recalling (8) and using the bound (12),
$$\log n \ge \sum_{p \le n} \left(\frac{1}{p} - \frac{1}{n}\right) \log p = \sum_{p \le n} \frac{\log p}{p} - \frac{\vartheta(n)}{n} \ge \sum_{p \le n} \frac{\log p}{p} - 2\log 2,$$
thus proving:

Theorem 3. [Upper Bound] For all n ≥ 2,
$$\sum_{p \le n} \frac{\log p}{p} \le \log n + 2\log 2.$$

Theorem 3 together with Corollary 1 prove Theorem 1. Of course the use of the entropy could have been avoided entirely: Instead of using that H(N) = log n in (13), we could simply use that n ≥ N by definition, so log n ≥ E[log N], and proceed as before.

Finally (paraphrasing from [10, p. 341]) we give an elegant argument of Erdős that employs a cute, elementary trick to prove the inequality on ϑ(n) in (12). First observe that we can restrict attention to odd n, since ϑ(2n) = ϑ(2n − 1), for all n ≥ 2 (as there are no even primes other than 2). Let n ≥ 2 be arbitrary; then every prime n + 1 < p ≤ 2n + 1 divides the binomial coefficient,
$$B := \binom{2n+1}{n} = \frac{(2n+1)!}{n!\,(n+1)!},$$
since it divides the numerator but not the denominator, and hence the product of all these primes also divides B. In particular, their product must be no greater than B, i.e.,
$$\prod_{n+1 < p \le 2n+1} p \le B.$$
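Both the divisibility at the heart of Erdős' trick and the resulting estimate (12) can be verified directly for small n. A Python sketch of our own (the ranges below are arbitrary choices):

```python
import math
from math import comb

def primes_upto(n: int) -> list:
    """Primes p <= n via a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

# The product of the primes in (n+1, 2n+1] divides B = binom(2n+1, n),
# and in particular is at most B.
for n in range(2, 80):
    B = comb(2 * n + 1, n)
    product = 1
    for p in primes_upto(2 * n + 1):
        if p > n + 1:
            product *= p
    assert B % product == 0 and product <= B

# The resulting Chebyshev-type estimate (12): theta(n) <= (2 log 2) n.
ps = primes_upto(3000)
theta, i = 0.0, 0
for n in range(2, 3001):
    while i < len(ps) and ps[i] <= n:
        theta += math.log(ps[i])
        i += 1
    assert theta <= (2 * math.log(2)) * n  # inequality (12)
```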

