Maximum Entropy on Compact Groups


Authors: Peter Harremoës

Submitted to Entropy. Pages 1–16. OPEN ACCESS. ISSN 1099-4300. www.mdpi.com/journal/entropy

Peter Harremoës, Centrum Wiskunde & Informatica, Science Park 123, 1098 GB Amsterdam, Noord-Holland, The Netherlands. E-mail: P.Harremoes@cwi.nl

Version November 8, 2018, submitted to Entropy. Typeset by LaTeX using class file mdpi.cls.

Abstract: On a compact group the Haar probability measure plays the role of the uniform distribution. The entropy and rate distortion theory of this uniform distribution is studied. New results and simplified proofs on convergence of convolutions on compact groups are presented, and they can be formulated as entropy increasing to its maximum. Information theoretic techniques and Markov chains play a crucial role. The convergence results are also formulated via rate distortion functions. The rate of convergence is shown to be exponential.

Keywords: compact group; convolution; Haar measure; information divergence; maximum entropy; rate distortion function; rate of convergence; symmetry.

Classification: MSC 94A34, 60B15.

1. Introduction

It is a well-known and celebrated result that the uniform distribution on a finite set can be characterized as having maximal entropy. Jaynes used this idea as a foundation of statistical mechanics [1], and the Maximum Entropy Principle has become a popular principle for statistical inference [2–8]. Often it is used as a method to obtain prior distributions. On a finite set, for any distribution $P$ we have

$$H(P) = H(U) - D(P\|U),$$

where $H$ is the Shannon entropy, $D$ is information divergence, and $U$ is the uniform distribution. Thus, maximizing $H(P)$ is equivalent to minimizing $D(P\|U)$. Minimization of information divergence can be justified by the conditional limit theorem of Csiszár [9, Theorem 4].
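The identity $H(P) = H(U) - D(P\|U)$ on a finite set is easy to verify numerically. The following Python snippet (an illustration, not part of the paper, with an arbitrarily chosen distribution on six points) checks it in nats:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def divergence(p, q):
    """Information divergence D(P || Q) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# A distribution on a six-element set and the uniform distribution on it.
P = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]
U = [1 / 6] * 6

# H(P) = H(U) - D(P || U): maximizing entropy minimizes divergence from uniform.
assert abs(entropy(P) - (entropy(U) - divergence(P, U))) < 1e-12
```

Since $D(P\|U) \ge 0$ with equality only for $P = U$, the identity immediately exhibits $U$ as the unique entropy maximizer on a finite set.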
So if we have a good reason to use the uniform distribution as prior distribution, we automatically get a justification of the Maximum Entropy Principle. The conditional limit theorem cannot justify the use of the uniform distribution itself, so we need something else. Here we shall focus on symmetry.

Example 1. A die has six sides that can be permuted via rotations of the die. We note that not all permutations can be realized as rotations, and not all rotations give permutations. Let G be the group of permutations that can be realized as rotations. We shall consider G as the symmetry group of the die and observe that the uniform distribution on the six sides is the only distribution that is invariant under the action of the symmetry group G.

Example 2. $G = \mathbb{R}/2\pi\mathbb{Z}$ is a commutative group that can be identified with the group SO(2) of rotations in 2 dimensions. This is the simplest example of a group that is compact but not finite.

For an object with symmetries the symmetry group defines a group action on the object, and any group action on an object defines a symmetry group of the object. A special case of a group action of the group G is left translation of the elements of G. Instead of studying distributions on objects with symmetries, in this paper we shall focus on distributions on the symmetry groups themselves. This is no serious restriction because a distribution on the symmetry group of an object induces a distribution on the object itself.

Convergence of convolutions of probability measures was studied by Stromberg [10], who proved weak convergence of convolutions of probability measures. An information theoretic approach was introduced by Csiszár [11]. Classical methods involving characteristic functions have been used to give conditions for uniform convergence of the densities of convolutions [12]. See [13] for a review of the subject and further references.
Finally it is shown that convergence in information divergence corresponds to uniform convergence of the rate distortion function, and that weak convergence corresponds to pointwise convergence of the rate distortion function. In this paper we shall mainly consider convolutions as Markov chains. This gives us a tool that allows us to prove convergence of i.i.d. convolutions, and the rate of convergence is proved to be exponential.

The rest of the paper is organized as follows. In Section 2 we establish a number of simple results on distortion functions on compact sets. These results will be used in Section 4. In Section 3 we define the uniform distribution on a compact group as the uniquely determined Haar probability measure. In Section 4 it is shown that the uniform distribution is the maximum entropy distribution on a compact group in the sense that it maximizes the rate distortion function at any positive distortion level. Convergence of convolutions of a distribution to the uniform distribution is established in Section 5 using Markov chain techniques, and the rate of convergence is discussed in Section 6. The group SO(2) is used as our running example. We finish with a short discussion.

2. Distortion on compact groups

Let G be a compact group where $*$ denotes the composition. The neutral element will be denoted $e$ and the inverse of the element $g$ will be denoted $g^{-1}$.

[Figure 1. Squared Euclidean distance between the rotation angles x and y.]

We shall start with some general comments on distortion functions on compact sets. Assume that the group plays the role of both source alphabet and reproduction alphabet. A distortion function $d : G \times G \to \mathbb{R}$ is given, and we will assume that $d(x,y) \ge 0$ with equality if and only if $x = y$. We will also assume that the distortion function is continuous.

Example 3.
As distortion function on SO(2) we use the squared Euclidean distance between the corresponding points on the unit circle, i.e.

$$d(x,y) = 4\sin^2\left(\frac{x-y}{2}\right) = 2 - 2\cos(x-y).$$

This is illustrated in Figure 1.

The distortion function might be a metric, but even if the distortion function is not a metric, the relation between the distortion function and the topology is the same as if it were a metric. One way of constructing a distortion function on a group is to use the squared Hilbert-Schmidt norm in a unitary representation of the group.

Theorem 4. If C is a compact set and $d : C \times C \to \mathbb{R}$ is a non-negative continuous distortion function such that $d(x,y) = 0$ if and only if $x = y$, then the topology on C is generated by the distortion balls $\{x \in C \mid d(x,y) < r\}$, where $y \in C$ and $r > 0$.

Proof. We have to prove that a subset $B \subseteq C$ is open if and only if any $y \in B$ is contained in a ball that is a subset of B. Assume that $B \subseteq C$ is open and that $y \in B$. Then the complement $\complement B$ is compact. Hence the function $x \mapsto d(x,y)$ has a minimum $r$ on $\complement B$, and $r$ must be positive because $r = d(x,y) = 0$ would imply that $x = y \in B$. Therefore $\{x \in C \mid d(x,y) < r\} \subseteq B$. Continuity of $d$ implies that the balls $\{x \in C \mid d(x,y) < r\}$ are open. If any point in B is contained in an open ball that is a subset of B, then B is a union of open sets and hence open.

The following lemma may be considered as a kind of uniform continuity of the distortion function, or as a substitute for the triangle inequality when $d$ is not a metric.

Lemma 5. If C is a compact set and $d : C \times C \to \mathbb{R}$ is a non-negative continuous distortion function such that $d(x,y) = 0$ if and only if $x = y$, then there exists a continuous function $f_1$ satisfying $f_1(0) = 0$ such that

$$|d(x,y) - d(z,y)| \le f_1(d(z,x)) \quad \text{for } x, y, z \in C. \tag{1}$$

Proof. Assume that the lemma does not hold.
Then there exist $\varepsilon > 0$ and a net $(x_\lambda, y_\lambda, z_\lambda)_{\lambda \in \Lambda}$ such that $d(x_\lambda,y_\lambda) - d(z_\lambda,y_\lambda) > \varepsilon$ and $d(z_\lambda,x_\lambda) \to 0$. A net in a compact set has a convergent subnet, so without loss of generality we may assume that the net $(x_\lambda,y_\lambda,z_\lambda)_{\lambda\in\Lambda}$ converges to some triple $(x_\infty,y_\infty,z_\infty)$. By continuity of the distortion function we get $d(x_\infty,y_\infty) - d(z_\infty,y_\infty) \ge \varepsilon$ and $d(z_\infty,x_\infty) = 0$, which implies $z_\infty = x_\infty$, and we have a contradiction.

We note that if a distortion function satisfies (1) then it defines a topology in which the distortion balls are open. In order to define the weak topology on probability distributions we extend the distortion function from $C \times C$ to $M^1_+(C) \times M^1_+(C)$ via

$$d(P,Q) = \inf E[d(X,Y)],$$

where X and Y are random variables with values in C and the infimum is taken over all joint distributions on (X, Y) such that the marginal distribution of X is P and the marginal distribution of Y is Q. The distortion function is continuous, so $(x,y) \mapsto d(x,y)$ has a maximum that we denote $d_{\max}$.

Theorem 6. If C is a compact set and $d : C \times C \to \mathbb{R}$ is a non-negative continuous distortion function such that $d(x,y) = 0$ if and only if $x = y$, then

$$|d(P,Q) - d(S,Q)| \le f_2(d(S,P)) \quad \text{for } P, Q, S \in M^1_+(C)$$

for some continuous function $f_2$ satisfying $f_2(0) = 0$.

Proof. According to Lemma 5 there exists a function $f_1$ satisfying (1). We use that

$$\begin{aligned}
E[|d(X,Y) - d(Z,Y)|] &\le E[f_1(d(Z,X))] \\
&= E[f_1(d(Z,X)) \mid d(Z,X) \le \delta]\cdot P(d(Z,X) \le \delta) + E[f_1(d(Z,X)) \mid d(Z,X) > \delta]\cdot P(d(Z,X) > \delta) \\
&\le f_1(\delta)\cdot 1 + f_1(d_{\max})\cdot\frac{E[d(Z,X)]}{\delta} \\
&\le f_1(\delta) + f_1(d_{\max})\cdot\frac{d(S,P)}{\delta}.
\end{aligned}$$

This holds for all $\delta > 0$, and in particular for $\delta = (d(S,P))^{1/2}$, which proves the theorem.
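As a sanity check on the distortion function of Example 3, the following Python snippet (an illustration, not part of the paper) verifies on a grid of angles that the squared chordal distance on the unit circle equals both $4\sin^2((x-y)/2)$ and $2 - 2\cos(x-y)$, and that it is unchanged when both angles are rotated by the same amount:

```python
import math

def d(x, y):
    """Squared Euclidean distance between the points on the unit circle at angles x and y."""
    return (math.cos(x) - math.cos(y)) ** 2 + (math.sin(x) - math.sin(y)) ** 2

angles = [2 * math.pi * k / 17 for k in range(17)]
for x in angles:
    for y in angles:
        # d(x, y) = 4 sin^2((x - y)/2) = 2 - 2 cos(x - y)
        assert abs(d(x, y) - 4 * math.sin((x - y) / 2) ** 2) < 1e-12
        assert abs(d(x, y) - (2 - 2 * math.cos(x - y))) < 1e-12
        # rotating both points by the same angle leaves the distortion unchanged
        for z in (0.3, 1.7):
            assert abs(d(x + z, y + z) - d(x, y)) < 1e-12
```

The last assertion is exactly the invariance property that is formalized for general groups in the next section.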
The theorem can be used to construct the weak topology on $M^1_+(C)$ with the sets

$$\{P \in M^1_+(C) \mid d(P,Q) < r\}, \quad Q \in M^1_+(C),\ r > 0,$$

as open balls that generate the topology. We note without proof that this definition is equivalent to the quite different definition of the weak topology that one finds in most textbooks.

For a group G we assume that the distortion function is right invariant in the sense that for all $x, y, z \in G$ the distortion function d satisfies

$$d(x*z, y*z) = d(x,y).$$

A right invariant distortion function satisfies $d(x,y) = d(x*y^{-1}, e)$, so right invariant continuous distortion functions on a group can be constructed from non-negative functions with a minimum at $e$.

3. The Haar measure

We use $*$ to denote convolution of probability measures on G. For $g \in G$ we shall use $g*P$ to denote the g-translation of the measure P or, equivalently, the convolution with a measure concentrated in g. The n-fold convolution of a distribution P with itself will be denoted $P^{*n}$. For random variables with values in G one can formulate an analog of the central limit theorem. We recall some facts about probability measures on compact groups and their Haar measures.

Definition 7. Let G be a group. A measure P is said to be a left Haar measure if $g*P = P$ for any $g \in G$. Similarly, P is said to be a right Haar measure if $P*g = P$ for any $g \in G$. A measure is said to be a Haar measure if it is both a left Haar measure and a right Haar measure.

Example 8. The uniform distribution on SO(2) or $\mathbb{R}/2\pi\mathbb{Z}$ has density $1/2\pi$ with respect to the Lebesgue measure on $[0; 2\pi[$. The function

$$f(x) = 1 + \sum_{n=1}^{\infty} a_n \cos(n(x + \phi_n)) \tag{2}$$

is a density of a probability distribution P on SO(2) if the Fourier coefficients $a_n$ are sufficiently small so that f is non-negative.
A sufficient condition for f to be non-negative is that $\sum_{n=1}^{\infty} |a_n| \le 1$. Translation by y gives a distribution with density

$$f(x-y) = 1 + \sum_{n=1}^{\infty} a_n \cos(n(x - y + \phi_n)).$$

The distribution P is invariant if and only if f equals 1 or, equivalently, all Fourier coefficients $(a_n)_{n\in\mathbb{N}}$ are 0.

A measure P on G is said to have full support if the support of P is G, i.e. $P(A) > 0$ for any non-empty open set $A \subseteq G$. The following theorem is well known [14–16].

Theorem 9. Let U be a probability measure on the compact group G. Then the following five conditions are equivalent.

- U is a left Haar measure.
- U is a right Haar measure.
- U has full support and is idempotent in the sense that $U*U = U$.
- There exists a probability measure P on G with full support such that $P*U = U$.
- There exists a probability measure P on G with full support such that $U*P = U$.

In particular a Haar probability measure is unique.

In [14–16] one can find the proof that any locally compact group has a Haar measure. The unique Haar probability measure on a compact group will be called the uniform distribution and denoted U.

For probability measures P and Q the information divergence from P to Q is defined by

$$D(P\|Q) = \begin{cases} \int \log\frac{dP}{dQ}\,dP, & \text{if } P \ll Q; \\ \infty, & \text{otherwise.} \end{cases}$$

We shall often calculate the divergence from a distribution to the uniform distribution U, and introduce the notation $D(P) = D(P\|U)$. For a random variable X with values in G we will sometimes write $D(X\|U)$ instead of $D(P\|U)$ when X has distribution P.

Example 10. The distribution P with density f given by (2) has

$$D(P) = \frac{1}{2\pi}\int_0^{2\pi} f(x)\log(f(x))\,dx \approx \frac{1}{2\pi}\int_0^{2\pi} f(x)(f(x)-1)\,dx = \frac{1}{2}\sum_{n=1}^{\infty} a_n^2.$$

Let G be a compact group with uniform distribution U and let F be a closed subgroup of G.
Then the subgroup has a Haar probability measure $U_F$ and

$$D(U_F) = \log([G:F]) \tag{3}$$

where $[G:F]$ denotes the index of F in G. In particular $D(U_F)$ is finite if and only if $[G:F]$ is finite.

4. The rate distortion theory

We will develop aspects of the rate distortion theory of a compact group G. Let P be a probability measure on G. We observe that compactness of G implies that any covering of G by distortion balls of radius $\delta > 0$ contains a finite covering. If k is the number of balls in a finite covering then $R_P(\delta) \le \log(k)$, where $R_P$ is the rate distortion function of the probability measure P. In particular the rate distortion function is upper bounded.

The entropy of a probability distribution P is given by $H(P) = R_P(0)$. If the group is finite then the uniform distribution maximizes the Shannon entropy $R_P(0)$, but if the group is not finite then in principle there is no entropy maximizer. As we shall see, the uniform distribution still plays the role of entropy maximizer in the sense that it maximizes the value $R_P(\delta)$ of the rate distortion function for any positive distortion level $\delta > 0$.

The rate distortion function $R_P$ can be studied using its convex conjugate $R_P^*$ given by

$$R_P^*(\beta) = \sup_\delta\ \beta\cdot\delta - R_P(\delta).$$

The rate distortion function is then recovered by the formula

$$R_P(\delta) = \sup_\beta\ \beta\cdot\delta - R_P^*(\beta).$$

The techniques are pretty standard [17].

Theorem 11. The rate distortion function of the uniform distribution is given via its convex conjugate $R_U^*(\beta) = \log(Z(\beta))$, where Z is the partition function defined by

$$Z(\beta) = \int_G \exp(\beta\cdot d(g,e))\,dUg.$$

The rate distortion function of an arbitrary distribution P satisfies

$$R_U - D(P\|U) \le R_P \le R_U. \tag{4}$$

Proof. First we prove a Shannon type lower bound on the rate distortion function of an arbitrary distribution P on the group.
Let X be a random variable with values in G and distribution P, and let $\hat X$ be a random variable coupled with X such that the mean distortion $E[d(X,\hat X)]$ equals $\delta$. Then

$$\begin{aligned}
I(X;\hat X) &= D(X\|U \mid \hat X) - D(X\|U) & (5) \\
&= D(X*\hat X^{-1}\|U \mid \hat X) - D(X\|U) & (6) \\
&\ge D(X*\hat X^{-1}\|U) - D(X\|U). & (7)
\end{aligned}$$

Now $E[d(X,\hat X)] = E[d(X*\hat X^{-1}, e)]$ and $D(X*\hat X^{-1}\|U) \ge D(P_\beta\|U)$, where $P_\beta$ is the distribution that minimizes divergence under the constraint $E[d(Y,e)] = \delta$ when Y has distribution $P_\beta$. The distribution $P_\beta$ is given by the density

$$\frac{dP_\beta}{dU}(g) = \frac{\exp(\beta\cdot d(g,e))}{Z(\beta)},$$

where $\beta$ is determined by the condition $\delta = Z'(\beta)/Z(\beta)$.

If P is uniform then a joint distribution is obtained by choosing $\hat X$ uniformly distributed, and choosing Y distributed according to $P_\beta$ and independent of $\hat X$. Then $X = Y*\hat X$ is distributed according to $P_\beta * U = U$, and we have equality in (7). Hence the rate determined by the lower bound (7) is achievable for the uniform distribution, which proves the first part of the theorem and the left inequality in (4).

The joint distribution on $(X,\hat X)$ that achieves the rate distortion function when X has a uniform distribution defines a Markov kernel $\Psi : X \to \hat X$ that is invariant under translations in the group. For any distribution P the joint distribution on $(X,\hat X)$ determined by P and $\Psi$ gives an achievable pair of distortion and rate that lies on the rate distortion curve of the uniform distribution. This proves the right inequality in Equation (4).

Example 12. For the group SO(2) the rate distortion function can be parametrized using the modified Bessel functions $I_j$, $j \in \mathbb{N}_0$. The partition function is given by

$$\begin{aligned}
Z(\beta) &= \int_G \exp(\beta\cdot d(g,e))\,dUg = \frac{1}{2\pi}\int_0^{2\pi}\exp(\beta\cdot(2-2\cos x))\,dx \\
&= \exp(2\beta)\cdot\frac{1}{\pi}\int_0^{\pi}\exp(-2\beta\cos x)\,dx = \exp(2\beta)\cdot I_0(-2\beta).
\end{aligned}$$
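The closed form of this partition function can be checked numerically. The Python sketch below (an illustration, not part of the paper) evaluates $I_0$ through its power series and compares $\exp(2\beta)\,I_0(-2\beta)$ with a rectangle-rule approximation of the defining integral for a few negative values of $\beta$:

```python
import math

def bessel_i0(z):
    """Modified Bessel function I_0(z) via its power series: sum_k (z^2/4)^k / (k!)^2."""
    term, total = 1.0, 1.0
    for k in range(1, 60):
        term *= (z * z / 4) / (k * k)
        total += term
    return total

def partition_function(beta, steps=4096):
    """Z(beta) = (1/2pi) * integral_0^{2pi} exp(beta * (2 - 2 cos x)) dx, rectangle rule."""
    h = 2 * math.pi / steps
    return sum(math.exp(beta * (2 - 2 * math.cos(k * h))) for k in range(steps)) / steps

# Z(beta) = exp(2 beta) * I_0(-2 beta); the rule converges very fast for periodic integrands.
for beta in (-2.0, -0.5, -0.1):
    closed_form = math.exp(2 * beta) * bessel_i0(-2 * beta)
    assert abs(partition_function(beta) - closed_form) < 1e-9
```

Negative $\beta$ corresponds to the decreasing branch of the rate distortion function, where small distortion is penalized.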
Hence

$$R_U^*(\beta) = \log(Z(\beta)) = 2\beta + \log(I_0(-2\beta)).$$

The distortion $\delta$ corresponding to $\beta$ is given by

$$\delta = 2 - 2\,\frac{I_1(-2\beta)}{I_0(-2\beta)}$$

and the corresponding rate is

$$R_U(\delta) = \beta\cdot\delta - (2\beta + \log(I_0(-2\beta))) = -\beta\cdot\frac{2 I_1(-2\beta)}{I_0(-2\beta)} - \log(I_0(-2\beta)).$$

These joint values of distortion and rate can be plotted with $\beta$ as parameter, as illustrated in Figure 2.

The minimal rate of the uniform distribution is achieved when X and $\hat X$ are independent. In this case the distortion is $E[d(X,\hat X)] = \int_G d(x,e)\,dUx$. This distortion level will be called the critical distortion and will be denoted $d_{\mathrm{crit}}$. On the interval $]0; d_{\mathrm{crit}}]$ the rate distortion function is decreasing, and the distortion rate function is the inverse $R_P^{-1}$ of the rate distortion function $R_P$ on this interval. The distortion rate function satisfies:

Theorem 13. The distortion rate function of an arbitrary distribution P satisfies

$$R_U^{-1}(\delta) - f_2(d(P,U)) \le R_P^{-1}(\delta) \le R_U^{-1}(\delta) \quad \text{for } \delta \le d_{\mathrm{crit}} \tag{8}$$

for some increasing continuous function $f_2$ satisfying $f_2(0) = 0$.

[Figure 2. The rate distortion region of the uniform distribution on SO(2) is shaded. The rate distortion function is the lower bounding curve. In the figure the rate is measured in nats. The critical distortion d_crit equals 2, and the dashed line indicates d_max = 4.]

Proof. The right-hand side follows because $R_U$ is decreasing on the interval $[0; d_{\mathrm{crit}}]$. Let X be a random variable with distribution P and let Y be a random variable coupled with X. Let Z be a random variable with distribution U coupled with X such that $E[d(X,Z)] = d(P,U)$. The couplings between X and Y, and between X and Z, can be extended to a joint distribution on X, Y and Z such that Y and Z are independent given X.
For this joint distribution we have $I(Z;Y) \le I(X;Y)$ and $|E[d(Z,Y)] - E[d(X,Y)]| \le f_2(d(P,U))$. We have to prove that

$$E[d(X,Y)] \ge R_U^{-1}(I(X;Y)) - f_2(d(P,U)),$$

but $I(Z;Y) \le I(X;Y)$, so it is sufficient to prove that

$$E[d(X,Y)] \ge R_U^{-1}(I(Z;Y)) - f_2(d(P,U)),$$

and this follows because $E[d(Z,Y)] \ge R_U^{-1}(I(Z;Y))$.

5. Convergence of convolutions

We shall prove that under certain conditions the n-fold convolutions $P^{*n}$ converge to the uniform distribution.

Example 14. The function

$$f(x) = 1 + \sum_{n=1}^{\infty} a_n\cos(n(x+\phi_n))$$

is a density of a probability distribution P on G if the Fourier coefficients $a_n$ are sufficiently small. If $(a_n)$ and $(b_n)$ are Fourier coefficients of P and Q, then the convolution has density

$$\begin{aligned}
&\frac{1}{2\pi}\int_0^{2\pi}\left(1+\sum_{n=1}^{\infty} a_n\cos(n(x-y+\phi_n))\right)\left(1+\sum_{n=1}^{\infty} b_n\cos(n(y+\psi_n))\right)dy \\
&\quad= 1 + \frac{1}{2\pi}\sum_{n=1}^{\infty} a_n b_n\int_0^{2\pi}\cos(n(x-y+\phi_n))\cos(n(y+\psi_n))\,dy \\
&\quad= 1 + \frac{1}{2\pi}\sum_{n=1}^{\infty} a_n b_n\int_0^{2\pi}\cos\big(n(x+\phi_n+\psi_n) - n(y+\psi_n)\big)\cos(n(y+\psi_n))\,dy \\
&\quad= 1 + \sum_{n=1}^{\infty} \frac{a_n b_n\cos(n(x+\phi_n+\psi_n))}{2}.
\end{aligned}$$

Therefore the n-fold convolution has density

$$1 + \sum_{k=1}^{\infty} \frac{a_k^n\cos(k(x+n\phi_k))}{2^{n-1}} = 1 + \sum_{k=1}^{\infty}\left(\frac{a_k}{2}\right)^n 2\cos(k(x+n\phi_k)).$$

Therefore each of the Fourier coefficients decreases exponentially.

Clearly, if P is uniform on a proper subgroup then convergence does not hold. In several papers on this topic [13,18, and references therein] it is claimed and "proved" that if convergence does not hold then the support of P is contained in the coset of a proper normal subgroup.
The proofs therefore contain errors that seem to have been copied from paper to paper. To avoid this problem and make this paper more self-contained, we shall reformulate and reprove some already known theorems.

In the theory of finite Markov chains it is well known that there exists an invariant probability measure. Certain Markov chains exhibit periodic behavior where a certain distribution is repeated after a number of transitions. All distributions in such a cycle will lie at a fixed distance from any (fixed) measure, where the distance is given by information divergence or total variation (or any other Csiszár f-divergence). It is also well known that finite Markov chains without periodic behavior are convergent. In general a Markov chain will converge to a "cyclic" behavior, as stated in the following theorem [19].

Theorem 15. Let $\Phi$ be a transition operator on a state space A with an invariant probability measure $Q_{\mathrm{in}}$. If $D(S\|Q_{\mathrm{in}}) < \infty$ then there exists a probability measure Q such that $D(\Phi^n S\|\Phi^n Q) \to 0$ and $D(\Phi^n Q\|Q_{\mathrm{in}})$ is constant.

We shall also use the following proposition, which has a purely computational proof [20].

Proposition 16. Let $P_x$, $x \in X$, be distributions and let Q be a probability distribution on X. Then

$$\int D(P_x\|Q)\,dQx = D\left(\int P_x\,dQx\,\Big\|\,Q\right) + \int D\left(P_x\,\Big\|\,\int P_t\,dQt\right)dQx.$$

We denote the set of probability measures on G by $M^1_+(G)$.

Theorem 17. Let P be a distribution on a compact group G and assume that the support of P is not contained in any coset of a proper closed subgroup of G. If $D(S\|U)$ is finite, then $D(P^{*n}*S\|U) \to 0$ for $n \to \infty$.

Proof. Let $\Psi : G \to M^1_+(G)$ denote the Markov kernel $\Psi(g) = P*g$. Then $P^{*n}*S = \Psi^n(S)$. Thus there exists a probability measure Q on G such that $D(\Psi^n(S)\|\Psi^n(Q)) \to 0$ for $n \to \infty$ and such that $D(\Psi^n(Q))$ is constant. We shall prove that Q = U.
First we note that

$$D(Q) = D(P*Q) = \int_G \big(D(g*Q) - D(g*Q\|P*Q)\big)\,dPg = D(Q) - \int_G D(g*Q\|P*Q)\,dPg.$$

Therefore $g*Q = P*Q$ for P-almost every $g \in G$. Thus there exists at least one $g_0 \in G$ such that $g_0*Q = P*Q$. Then $Q = \tilde P*Q$ where $\tilde P = g_0^{-1}*P$. Let $\tilde\Psi : G \to M^1_+(G)$ denote the Markov kernel $g \to \tilde P*g$. Put

$$P_n = \frac{1}{n}\sum_{i=1}^{n}\tilde P^{*i} = \frac{1}{n}\sum_{i=1}^{n}\tilde\Psi^{i-1}\big(\tilde P\big).$$

According to [19] this ergodic mean will converge to a distribution T such that $\tilde\Psi(T) = T$, so that $\tilde P*T = T$. Hence we also have that $T*T = T$, i.e. T is idempotent and therefore supported by a subgroup of G. We know that the support of $\tilde P$ is not contained in any proper closed subgroup of G, so the support of T must be G. We also get $Q = T*Q$, which together with Theorem 9 implies that Q = U.

By choosing S = P we get the following corollary.

Corollary 18. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and that $D(P\|U)$ is finite. Then $D(P^{*n}\|U) \to 0$ for $n \to \infty$.

Corollary 18 together with Theorem 11 implies the following result.

Corollary 19. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and that $D(P\|U)$ is finite. Then the rate distortion function of $P^{*n}$ converges uniformly to the rate distortion function of the uniform distribution.

We also get weak versions of these results.

Corollary 20. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G. Then $P^{*n}$ converges to U in the weak topology, i.e. $d(P^{*n},U) \to 0$ for $n \to \infty$.

Proof.
If we take $S = P_\beta$ then $D(P_\beta)$ is finite and $D(P^{*n}*P_\beta\|U) \to 0$ for $n \to \infty$. We have

$$d(P^{*n}*P_\beta, U) \le d_{\max}\,\|P^{*n}*P_\beta - U\| \le d_{\max}\big(2 D(P^{*n}*P_\beta\|U)\big)^{1/2},$$

implying that $d(P^{*n}*P_\beta, U) \to 0$ for $n \to \infty$. Now

$$|d(P^{*n},U) - d(P^{*n}*P_\beta,U)| \le f_2(d(P^{*n}*P_\beta, P^{*n})) \le f_2(d(P_\beta, e)).$$

Therefore $\limsup_{n\to\infty} d(P^{*n},U) \le f_2(d(P_\beta,e))$ for all $\beta$, which implies that $\limsup_{n\to\infty} d(P^{*n},U) = 0$.

Corollary 21. Let P be a probability measure on the compact group G with Haar probability measure U. Assume that the support of P is not contained in any coset of a proper subgroup of G and that $D(P\|U)$ is finite. Then $R_{P^{*n}}$ converges to $R_U$ pointwise on the interval $]0; d_{\max}[$ for $n \to \infty$.

Proof. Corollary 20 together with Theorem 13 implies uniform convergence of the distortion rate function for distortion less than $d_{\mathrm{crit}}$. This implies pointwise convergence of the rate distortion function on $]0; d_{\mathrm{crit}}[$ because rate distortion functions are convex functions. The same argument works in the interval $]d_{\mathrm{crit}}; d_{\max}[$. Pointwise convergence at $d_{\mathrm{crit}}$ must also hold because of continuity.

6. Rate of convergence

Normally the rate of convergence will be exponential. If the density is lower bounded this is well known. We give a simplified proof of this.

Lemma 22. Let P be a probability distribution on the compact group G with Haar probability measure U. If $dP/dU \ge c > 0$ and $D(P)$ is finite, then

$$D(P^{*n}) \le (1-c)^{n-1} D(P).$$

Proof. First we write $P = (1-c)\cdot S + c\cdot U$, where S denotes the probability measure

$$S = \frac{P - cU}{1-c}.$$

For any distribution Q on G we have

$$D(Q*P) = D((1-c)\cdot Q*S + c\cdot Q*U) \le (1-c)\cdot D(Q*S) + c\cdot D(Q*U) \le (1-c)\cdot D(Q) + c\cdot D(U) = (1-c)\cdot D(Q).$$

Here we have used convexity of divergence.
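Lemma 22 is easy to illustrate numerically on a finite group, where convolution of densities is a circular convolution of probability vectors. The following Python sketch (an illustration, not part of the paper, with an arbitrarily chosen distribution on the cyclic group $\mathbb{Z}_8$) checks the bound $D(P^{*n}) \le (1-c)^{n-1} D(P)$, where $c$ is the minimum of the density $dP/dU = m\cdot P$:

```python
import math

def convolve(p, q):
    """Convolution of two distributions on the cyclic group Z_m."""
    m = len(p)
    return [sum(p[i] * q[(k - i) % m] for i in range(m)) for k in range(m)]

def divergence_from_uniform(p):
    """D(P || U) in nats, U uniform on a set of size len(p)."""
    m = len(p)
    return sum(x * math.log(x * m) for x in p if x > 0)

p = [0.25, 0.05, 0.2, 0.1, 0.05, 0.15, 0.1, 0.1]
m = len(p)
c = m * min(p)            # lower bound on the density dP/dU
d1 = divergence_from_uniform(p)

q = p[:]                  # q holds P^{*n}
for n in range(1, 12):
    # Lemma 22: D(P^{*n}) <= (1 - c)^(n-1) * D(P)
    assert divergence_from_uniform(q) <= (1 - c) ** (n - 1) * d1 + 1e-12
    q = convolve(q, p)
```

Because the density here is bounded below by $c = 0.4$, the divergence is forced to zero at least as fast as $0.6^{n-1}$, which is the exponential rate promised by the lemma.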
If a distribution P has support in a proper closed subgroup F then

$$D(P) \ge D(U_F) = \log([G:F]) \ge \log(2) = 1\ \text{bit}.$$

Therefore $D(P) < 1$ bit implies that P cannot be supported by a proper subgroup, but it implies more.

Proposition 23. If P is a distribution on the compact group G and $D(P) < 1$ bit, then $\frac{d(P*P)}{dU}$ is lower bounded by a positive constant.

Proof. The condition $D(P) < 1$ bit implies that $U\big(\frac{dP}{dU} > 0\big) > 1/2$. Hence there exists $\varepsilon > 0$ such that $U\big(\frac{dP}{dU} > \varepsilon\big) > 1/2$. Write $A = \big\{\frac{dP}{dU} > \varepsilon\big\}$ and $B_y = \big\{x \mid \frac{dP}{dU}(x^{-1}*y) > \varepsilon\big\}$. We have

$$\frac{d(P*P)}{dU}(y) = \int_G \frac{dP}{dU}(x)\cdot\frac{dP}{dU}(x^{-1}*y)\,dUx \ge \int_A \varepsilon\cdot\frac{dP}{dU}(x^{-1}*y)\,dUx \ge \varepsilon^2\cdot U(A\cap B_y).$$

Using the inclusion-exclusion inequalities and $U(B_y) = U(A)$, which follows from invariance of U, we get

$$U(A\cap B_y) = U(A) + U(B_y) - U(A\cup B_y) \ge 2\cdot U(A) - 1.$$

Hence

$$\frac{d(P*P)}{dU}(y) \ge 2\varepsilon^2\left(U\left(\frac{dP}{dU} > \varepsilon\right) - 1/2\right) > 0 \quad\text{for all } y \in G.$$

Combining Theorem 17, Lemma 22, and Proposition 23 we get the following result.

Theorem 24. Let P be a probability measure on a compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and $D(P\|U)$ is finite, then the rate of convergence of $D(P^{*n}\|U)$ to zero is exponential.

As a corollary we get the following result, which was first proved by Kloss [21] for total variation.

Corollary 25. Let P be a probability measure on the compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and $D(P\|U)$ is finite, then $P^{*n}$ converges to U in variation and the rate of convergence is exponential.

Proof. This follows directly from Pinsker's inequality [22,23]:

$$\tfrac{1}{2}\|P^{*n} - U\|^2 \le D(P^{*n}\|U).$$
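Pinsker's inequality itself is straightforward to check numerically on a finite set. The Python snippet below (an illustration, not part of the paper) verifies $\tfrac{1}{2}\|P-U\|^2 \le D(P\|U)$ for a few distributions, with $\|\cdot\|$ the total variation norm $\sum_i |p_i - q_i|$ and divergence in nats:

```python
import math

def divergence(p, q):
    """Information divergence D(P || Q) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def variation(p, q):
    """Total variation norm ||P - Q|| = sum_i |p_i - q_i|."""
    return sum(abs(x - y) for x, y in zip(p, q))

U = [1 / 6] * 6
for P in ([0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
          [0.3, 0.3, 0.2, 0.1, 0.05, 0.05],
          [1 / 6] * 6):
    # Pinsker: (1/2) * ||P - U||^2 <= D(P || U)
    assert 0.5 * variation(P, U) ** 2 <= divergence(P, U) + 1e-12
```

Combined with the exponential decay of $D(P^{*n}\|U)$ from Theorem 24, this inequality immediately converts divergence bounds into total variation bounds.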
Corollary 26. Let P be a probability measure on the compact group G with Haar probability measure U. If the support of P is not contained in any coset of a proper subgroup of G and $D(P\|U)$ is finite, then the density $\frac{dP^{*n}}{dU}$ converges to 1 pointwise almost surely as n tends to infinity.

Proof. The variation norm can be written as

$$\|P^{*n} - U\| = \int_G\left|\frac{dP^{*n}}{dU} - 1\right|dU.$$

Thus

$$U\left(\left|\frac{dP^{*n}}{dU} - 1\right| \ge \varepsilon\right) \le \frac{\|P^{*n} - U\|}{\varepsilon}.$$

The result follows from the exponential rate of convergence of $P^{*n}$ to U in total variation combined with the Borel-Cantelli lemma.

7. Discussion

In this paper we have assumed the existence of the Haar measure by referring to the literature. With the Haar measure we have then proved convergence of convolutions using Markov chain techniques. The Markov chain approach can also be used to prove the existence of the Haar measure by simply referring to the fact that a homogeneous Markov chain on a compact set has an invariant distribution. The problem with this approach is that the proof that a Markov chain on a compact set has an invariant distribution is not easier than the proof of the existence of the Haar measure, and is less known.

We have shown that the Haar probability measure maximizes the rate distortion function at any distortion level. The usual proofs of the existence of the Haar measure use a kind of covering argument that is very close to the techniques found in rate distortion theory. There is a chance that one can get an information theoretic proof of the existence of the Haar measure. It seems obvious to use concavity arguments as one would do for Shannon entropy, but, as proved by Ahlswede [24], the rate distortion function at a given distortion level is not a concave function of the underlying distribution, so some more refined technique is needed.
As noted in the introduction, for any algebraic structure A the group Aut(A) can be considered as a symmetry group; if it has a compact subgroup, the results of this paper apply to that subgroup. It would be interesting to extend the information theoretic approach to the algebraic object A itself, but in general there is no known equivalent of the Haar measure for other algebraic structures. Algebraic structures are used extensively in channel coding theory and cryptography, so although the theory may become more involved, extensions of the results presented in this paper are definitely worthwhile.

Acknowledgement

The author wants to thank Ioannis Kontoyiannis for stimulating discussions.

References

1. Jaynes, E.T. Information Theory and Statistical Mechanics, I and II. Physical Review 1957, 106 and 108, 620–630 and 171–190.
2. Topsøe, F. Game Theoretical Equilibrium, Maximum Entropy and Minimum Information Discrimination. In Maximum Entropy and Bayesian Methods; Mohammad-Djafari, A.; Demoments, G., Eds.; Kluwer Academic Publishers: Dordrecht, Boston, London, 1993; pp. 15–23.
3. Jaynes, E.T. Clearing up Mysteries – The Original Goal. In Maximum Entropy and Bayesian Methods; Skilling, J., Ed.; Kluwer: Dordrecht, 1989.
4. Kapur, J.N. Maximum Entropy Models in Science and Engineering, revised ed.; Wiley: New York, 1993.
5. Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Annals of Statistics 2004, 32, 1367–1433.
6. Topsøe, F. Information Theoretical Optimization Techniques. Kybernetika 1979, 15, 8–27.
7. Harremoës, P.; Topsøe, F. Maximum Entropy Fundamentals. Entropy 2001, 3, 191–226.
8. Jaynes, E.T. Probability Theory – The Logic of Science; Cambridge University Press: Cambridge, 2003.
9. Csiszár, I. Sanov Property, Generalized I-Projection and a Conditional Limit Theorem. Ann. Probab. 1984, 12, 768–793.
10. Stromberg, K. Probabilities on compact groups. Trans. Amer. Math. Soc. 1960, 94, 295–309.
11. Csiszár, I. A note on limiting distributions on topological groups. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1964, 9, 595–598.
12. Schlosman, S. Limit theorems of probability theory for compact groups. Theory Probab. Appl. 1980, 25, 604–609.
13. Johnson, O. Information Theory and the Central Limit Theorem; Imperial College Press: London, 2004.
14. Haar, A. Der Massbegriff in der Theorie der kontinuierlichen Gruppen. Ann. Math. 1933, 34.
15. Halmos, P. Measure Theory; D. Van Nostrand and Co., 1950.
16. Conway, J. A Course in Functional Analysis; Springer-Verlag: New York, 1990.
17. Vogel, P.H.A. On the Rate Distortion Function of Sources with Incomplete Statistics. IEEE Trans. Inform. Theory 1992, 38, 131–136.
18. Johnson, O.T.; Suhov, Y.M. Entropy and convergence on compact groups. J. Theoret. Probab. 2000, 13, 843–857.
19. Harremoës, P.; Holst, K.K. Convergence of Markov Chains in Information Divergence. Journal of Theoretical Probability 2009, 22, 186–202.
20. Topsøe, F. An Information Theoretical Identity and a Problem Involving Capacity. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 291–292.
21. Kloss, B. Probability distributions on bicompact topological groups. Theory Probab. Appl. 1959, 4, 237–270.
22. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318.
23. Fedotov, A.; Harremoës, P.; Topsøe, F. Refinements of Pinsker's Inequality. IEEE Trans. Inform. Theory 2003, 49, 1491–1498.
24. Ahlswede, R.F. Extremal Properties of Rate-Distortion Functions. IEEE Trans. Inform. Theory 1990, 36, 166–171.
c  Nov ember 8, 2018 by the author; submitted to Entr opy for open access under the terms and conditions of the Creati ve Commons Attribution license ( http://creati vecommons.or g/licenses/by/3.0/ ).
