Maximum Probability and Relative Entropy Maximization. Bayesian Maximum Probability and Empirical Likelihood


Author: M. Grendár∗

Abstract. Works, briefly surveyed here, are concerned with two basic methods: Maximum Probability and Bayesian Maximum Probability; as well as with their asymptotic instances: Relative Entropy Maximization and Maximum Non-parametric Likelihood. Parametric and empirical extensions of the latter methods – Empirical Maximum Maximum Entropy and Empirical Likelihood – are also mentioned. The methods are viewed as tools for solving certain ill-posed inverse problems, called the Π-problem and the Φ-problem, respectively. Within the two classes of problems, probabilistic justification and interpretation of the respective methods are discussed.

Keywords. Π-problem, Φ-problem, Large Deviations, Bayesian Law of Large Numbers, Nonparametric Maximum Likelihood, Estimating Equations, Maximum A-Posteriori Probability, Empirical Maximum Entropy.

1 Φ-problem, MAP, MNPL

The Φ-problem can be loosely stated as follows: there is a prior distribution over a non-parametric set Φ of data-sampling distributions, and a sample from an unknown data-sampling distribution. The objective is to select a data-sampling distribution from the set Φ, called the model.

More formally: let $P$ be the set of all probability mass functions¹ (pmf's) with finite support $\mathcal{X}$. The set $P$ is endowed with the usual topology. Let $\Phi \subseteq P$. Let $X_1^n \triangleq X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a pmf $r \in P$. The 'true' sampling distribution $r$ need not be in $\Phi$; in other words, the model $\Phi$ might be misspecified. A strictly positive prior $\pi(\cdot)$ is put over $\Phi$. The objective in the Φ-problem is to select a sampling distribution $q$ from $\Phi$, when the information summarized by $\{\mathcal{X}, X_1^n, \pi(\cdot), \Phi\}$ and nothing else is available.
The Bayesian Maximum Probability method selects the Maximum A-Posteriori Probable (MAP) data-sampling distribution(s) $\hat{q}_{\mathrm{MAP}} \triangleq \arg\sup_{q \in \Phi} \pi_n(q \mid X_1^n)$; there the posterior probability $\pi_n(q \mid X_1^n) \propto e^{-l_n(q)}\,\pi(q)$, and $l_n(q)$ denotes $-\sum_{i=1}^n \log q(x_i)$; $\log$ is meant with base $e$. Hence the standard abbreviation, MAP, for the method.

∗ Dept. of Mathematics, Bel University, Tajovského 40, 974 01 Banská Bystrica, Slovakia. E-mail: marian.grendar@savba.sk. Inst. of Measurement Science, SAS, Bratislava. Inst. of Mathematics and CS, SAS, Banská Bystrica. Date: Apr 9, 2008. To appear in Proc. of Intnl. Workshop on Applied Probability (IWAP) 2008, Compiègne, France, July 7-10, 2008.

¹ For the sake of simplicity the presentation is restricted to the discrete case. The continuous case is treated in [15].

The Bayesian Sanov Theorem (BST), through its corollary – the Bayesian Law of Large Numbers (BLLN) – provides a strong case for MAP as the correct method for solving the Φ-problem. The theorems are Bayesian counterparts of the well-known Large Deviations (LD) theorems for empirical measures: the Sanov Theorem and the Conditional Law of Large Numbers (cf. [4] and Sect. 2). In order to state the theorems it is necessary to introduce the $L$-divergence $L(q \,\|\, p)$ of $q \in P$ with respect to $p \in P$: $L(q \,\|\, p) \triangleq -\sum_{\mathcal{X}} p \log q$. The $L$-projection $\hat{q}$ of $p$ on $Q \subseteq P$ is $\hat{q} \triangleq \arg\inf_{q \in Q} L(q \,\|\, p)$. The value of the $L$-divergence at an $L$-projection of $p$ on $Q$ is denoted by $L(Q \,\|\, p)$.

Thm 1 (BST). Let $X_1^n$ be i.i.d. $r$. Let $Q \subset \Phi \subseteq P$; $L(Q \,\|\, r) < \infty$. Then, for $n \to \infty$, $\frac{1}{n} \log \pi_n(q \in Q \mid X_1^n) = -\{L(Q \,\|\, r) - L(\Phi \,\|\, r)\}$, a.s. $r^\infty$.

The posterior probability $\pi_n(Q \mid X_1^n)$ decays exponentially fast (a.s. $r^\infty$), with the decay rate $L(Q \,\|\, r) - L(\Phi \,\|\, r)$. For a proof see [13].
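As a toy numerical illustration of these definitions (the support, the candidate set Φ, the prior, and $r$ below are made up, not from the paper), the following sketch computes the posterior $\pi_n(q \mid X_1^n)$ over a small finite Φ and checks that, for a sample whose counts equal $n \cdot r$ exactly, the MAP distribution coincides with the $L$-projection of $r$ on Φ — since for such a sample $l_n(q) = n\,L(q \,\|\, r)$ term by term:

```python
import math

# Hypothetical setup (illustration only): 3-point support, 'true' pmf r,
# and a finite, misspecified model Phi with a strictly positive prior.
r = [0.5, 0.3, 0.2]
Phi = [
    [0.6, 0.2, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
]
prior = [1 / 3, 1 / 3, 1 / 3]

def L_div(q, p):
    """L-divergence L(q||p) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

n = 1000
counts = [int(n * ri) for ri in r]      # 'typical' sample: counts = n * r

def l_n(q):
    """Negative log-likelihood l_n(q) = -sum_i log q(x_i)."""
    return -sum(c * math.log(qi) for c, qi in zip(counts, q))

# Posterior pi_n(q | X_1^n) proportional to exp(-l_n(q)) * prior(q),
# computed in log-space to avoid underflow of exp(-1000).
log_post = [-l_n(q) + math.log(p) for q, p in zip(Phi, prior)]
shift = max(log_post)
w = [math.exp(lp - shift) for lp in log_post]
post = [wi / sum(w) for wi in w]

map_idx = max(range(len(Phi)), key=lambda i: post[i])
lproj_idx = min(range(len(Phi)), key=lambda i: L_div(Phi[i], r))
print(map_idx, lproj_idx, round(post[map_idx], 4))
```

For this typical sample the two selections agree, and the posterior already puts almost all of its mass on the $L$-projection, in line with the BST.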
To the best of our knowledge, Ben-Tal, Brown and Smith [1] were the first to use LD reasoning in the Bayesian nonparametric setting. Ganesh and O'Connell [8] proved the BST for the well-specified special case, i.e., $r \in \Phi$, by means of formal LD.

Thm 2 (BLLN). Let $\Phi \subseteq P$ be a convex, closed set. Let $B(\hat{q}, \epsilon)$ be a closed $\epsilon$-ball defined by the total variation metric, centered at the $L$-projection $\hat{q}$ of $r$ on $\Phi$. Then, $\lim_{n \to \infty} \pi_n(q \in B(\hat{q}, \epsilon) \mid X_1^n) = 1$, a.s. $r^\infty$.

The BLLN is an extension of Freedman's Bayesian nonparametric consistency theorem [7] to the case of a misspecified model. It shows that the posterior probability concentrates (a.s. $r^\infty$) on the $L$-projection of the 'true' sampling distribution $r$ on $\Phi$. For a book-length treatment of Bayesian non-parametric consistency see [9].

MAP satisfies the BLLN. To see this, note that by the Strong Law of Large Numbers (SLLN), conditions for the supremum of the posterior probability asymptotically turn into conditions for the supremum of the negative of the $L$-divergence. This also permits viewing the $L$-projections as asymptotic instances of the MAP distributions $\hat{q}_{\mathrm{MAP}}$.

There is also another method which satisfies the BLLN: Maximum Non-parametric Likelihood (MNPL). This can be shown by the above-mentioned recourse to the SLLN. MNPL selects $\hat{q}_{\mathrm{MNPL}} \triangleq \arg\sup_{q \in \Phi} -l_n(q)$. These two (up to trivial transformations) are the only methods for solving the Φ-problem which comply with the BLLN; hence they are consistent in the well-specified as well as in the misspecified case. Selecting a sampling distribution by some other conceivable method would, in general, asymptotically select a sampling distribution which is a posteriori zero-probable. In this sense, selection of, say, the posterior mean, or selection of $\arg\sup_{q \in \Phi} -\sum_{\mathcal{X}} q \log \frac{q}{r}$, is ruled out.
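The SLLN step of this argument is easy to check numerically. In the sketch below (a made-up example, not from the paper; here $r$ happens to lie in the finite Φ, i.e., the well-specified case), $\frac{1}{n} l_n(q)$ is seen to approach $L(q \,\|\, r)$ for a seeded i.i.d. sample, so MNPL and MAP — which differ only by the $O(1)$ prior term $\log \pi(q)$ — select the same element of Φ:

```python
import math
import random

random.seed(0)

support = [0, 1, 2]
r = [0.5, 0.3, 0.2]                       # 'true' sampling pmf
# Hypothetical finite model and a strictly positive, non-uniform prior:
Phi = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.2, 0.3, 0.5]]
prior = [0.2, 0.7, 0.1]

n = 20000
sample = random.choices(support, weights=r, k=n)

def l_n(q):
    """Negative log-likelihood of the sample under q."""
    return -sum(math.log(q[x]) for x in sample)

def L_div(q, p):
    """L-divergence L(q||p) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# SLLN: (1/n) l_n(q) -> L(q||r), a.s.
for q in Phi:
    assert abs(l_n(q) / n - L_div(q, r)) < 0.05

# MNPL maximizes -l_n(q); MAP additionally adds the O(1) term log prior(q).
mnpl = min(range(len(Phi)), key=lambda i: l_n(Phi[i]))
map_ = min(range(len(Phi)), key=lambda i: l_n(Phi[i]) - math.log(prior[i]))
print(mnpl, map_)
```

For $n$ this large both select the index of the $L$-projection; the prior is asymptotically irrelevant, which is the sense in which MAP and MNPL coincide.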
The Φ-problem becomes more interesting when turned into a parametric setting. To this end, let $X$ be a random variable with pmf $r(x; \theta)$ parametrized by $\theta \in \Theta \subseteq \mathbb{R}^K$. Assume that a researcher is not willing to specify a parametric family $q(x; \theta)$ of data-sampling distributions, but is only willing to specify some of its underlying features. These features, i.e., the model $\Phi$, can be characterized by Estimating Equations (EE): $\Phi \triangleq \bigcup_{\Theta} \Phi(\theta)$, where $\Phi(\theta) \triangleq \{q(x;\theta) : \sum_{\mathcal{X}} q(x;\theta)\,u_j(x;\theta) = 0,\ 1 \le j \le J\}$, $\theta \in \Theta \subseteq \mathbb{R}^K$. In the EE-theory parlance, the $u(\cdot)$ are the estimating functions, the number $J$ of which is in general different from the number $K$ of parameters $\theta$. The 'true' data-sampling distribution $r(x;\theta)$ need not belong to $\Phi$.

A Bayesian puts a positive prior $\pi$ over $\Phi$, which in turn induces a prior $\pi(\theta)$ over $\Theta$; cf. [6]. By the BLLN, the posterior $\pi_n(\cdot \mid X_1^n)$ concentrates on a weak neighborhood of the $L$-projection $\hat{q}$ of $r(x;\theta)$ on $\Phi$:

$\hat{q}(x; \hat{\theta}) = \arg\inf_{\theta \in \Theta} \inf_{q(x;\theta) \in \Phi(\theta)} L(q(x;\theta) \,\|\, r(x;\theta))$.

This provides a probabilistic justification for using $\hat{\theta}$ as an estimator of $\theta$. Thanks to convex duality, the estimator $\hat{\theta}$ can also be obtained as $\hat{\theta} = \arg\sup_{\theta \in \Theta} \inf_{\lambda(\theta) \in \mathbb{R}^J} -\sum_{i=1}^m r(x_i) \log(1 - \sum_j \lambda_j(\theta)\,u_j(x_i;\theta))$. Since $r$ is in practice not known, following [19], one can estimate the convex-dual objective function by $-\sum_{l=1}^n \log(1 - \sum_j \lambda_j(\theta)\,u_j(x_l;\theta))$. The resulting estimator is just the Empirical Likelihood (EL) estimator (cf. [25], [24], [21]). It can easily be seen that EL satisfies the BLLN. The same is true for the Bayesian MAP estimator $\hat{q}_{\mathrm{MAP}}(x; \hat{\theta}_{\mathrm{MAP}}) = \arg\sup_{\theta \in \Theta} \sup_{q(x;\theta) \in \Phi(\theta)} \pi_n(q(x;\theta) \mid X_1^n)$. For further results and discussion see [15], [16].

2 Π-problem, MaxProb, REM

Unlike the Φ-problem, the Π-problem is not a statistical problem.
In the Π-problem, the sampling distribution $q$ is known, and there is a set $\Pi \subseteq P$ into which an unavailable empirical pmf, drawn from $q$, is assumed to belong. The objective is to select an empirical pmf (also known as a type, cf. [4]) from the set $\Pi$. Thus, the Φ- and Π-problems are opposite to each other.

More formally: let $\mathcal{X}$ be a set of $m$ elements. A type $\nu^n \triangleq [n_1, n_2, \ldots, n_m]/n$, where $n_i$ is the number of occurrences of the $i$-th element (i.e., outcome) of $\mathcal{X}$, $i = 1, 2, \ldots, m$, in a sample of size $n$ drawn from the sampling distribution $q$. The objective in the Π-problem is to select a type (or types) $\nu^n$ from $\Pi$, when the information summarized by $\{\mathcal{X}, q, n, \Pi\}$ and nothing else is available.

The Maximum Probability (MaxProb) method (cf. [2], [29], [10]) selects the type $\hat{\nu}^n = \arg\sup_{\nu^n \in \Pi} \pi(\nu^n; q)$ which can be generated by the sampling distribution $q$ with the highest probability. If the sampling is i.i.d., then $\pi(\nu^n; q) = n! \prod_{i=1}^m \frac{q_i^{n_i}}{n_i!}$. Niven [22] expanded MaxProb into non-i.i.d. and combinatorial settings; see also [23], [29], [14].

The Sanov Theorem (ST) (cf. [26], [3]), through its corollary – the Conditional Law of Large Numbers (CLLN) (cf. [28], [27], [3]) – provides a probabilistic justification for the application of MaxProb in the i.i.d. instance of the Π-problem. The ST identifies the exponential decay rate function as the $I$-divergence $I(p \,\|\, q) \triangleq \sum p \log \frac{p}{q}$, $p, q \in P$. The $I$-projection $\hat{p}$ of $q$ on $\Pi \subseteq P$ is $\hat{p} \triangleq \arg\inf_{p \in \Pi} I(p \,\|\, q)$. The value of the $I$-divergence at an $I$-projection of $q$ on $\Pi$ is denoted by $I(\Pi \,\|\, q)$.

Thm 3 (ST). Let $\Pi$ be an open set; $I(\Pi \,\|\, q) < \infty$. Then, for $n \to \infty$, $\frac{1}{n} \log \pi(\nu^n \in \Pi; q) = -I(\Pi \,\|\, q)$.

The rate of the exponential convergence of the probability $\pi(\nu^n \in \Pi; q)$ towards zero is determined by the information divergence at (any of) the $I$-projection(s) of $q$ on $\Pi$.
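A small self-contained check of these claims (the alphabet, $q$, and the constraint below are illustrative, not from the paper): the exact type probability $\pi(\nu^n; q) = n! \prod_i q_i^{n_i}/n_i!$ is computed via log-gamma, the method-of-types gap $|-\frac{1}{n}\log \pi(\nu^n; q) - I(\nu^n \,\|\, q)|$ is seen to shrink as $n$ grows, and for a simple constraint set the MaxProb type coincides with the $I$-projection:

```python
import math

def log_type_prob(counts, q):
    """log pi(nu^n; q) = log[ n! * prod_i q_i^{n_i} / n_i! ] for i.i.d. sampling."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for ni, qi in zip(counts, q):
        out += ni * math.log(qi) - math.lgamma(ni + 1)
    return out

def I_div(p, q):
    """I-divergence I(p||q) = sum_i p_i log(p_i/q_i), with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q = [1 / 3, 1 / 3, 1 / 3]
nu = [0.5, 0.3, 0.2]                      # a fixed type

def gap(n):
    """|-(1/n) log pi(nu^n; q) - I(nu||q)|; O(log n / n) by the method of types."""
    counts = [round(n * v) for v in nu]
    return abs(-log_type_prob(counts, q) / n - I_div(nu, q))

assert gap(600) < gap(60)                 # the Sanov rate emerges as n grows

# Constraint set Pi = {types with nu_1 >= 0.7}, binary alphabet, q = (1/2, 1/2):
n, q2 = 20, [0.5, 0.5]
feasible = [(k, n - k) for k in range(14, n + 1)]       # k/n >= 0.7
maxprob = max(feasible, key=lambda c: log_type_prob(c, q2))
iproj = min(feasible, key=lambda c: I_div([c[0] / n, c[1] / n], q2))
print(maxprob, iproj)
```

MaxProb and the $I$-projection both pick the feasible type closest to $q$, namely (14, 6), in the spirit of the CLLN below.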
Thm 4 (CLLN). Let $\Pi$ be a convex, closed set that does not contain $q$. Let $B(\hat{p}, \epsilon)$ be a closed $\epsilon$-ball defined by the total variation metric, centered at the $I$-projection $\hat{p}$ of $q$ on $\Pi$. Then, $\lim_{n \to \infty} \pi(\nu^n \in B(\hat{p}, \epsilon) \mid \nu^n \in \Pi; q) = 1$.

Given that a type from $\Pi$ was observed, it is asymptotically zero-probable that the type was different from the $I$-projection of the sampling distribution $q$ on $\Pi$. It is straightforward to see that MaxProb satisfies the CLLN. Indeed, the set of MaxProb types converges to the set of $I$-projections as $n \to \infty$; cf. [11], [10]. The Relative Entropy Maximization method (REM/MaxEnt), which maximizes, with respect to $p$, the negative of the $I$-divergence (a.k.a. the relative entropy), can thus be viewed as an asymptotic form of the MaxProb method.

Still, it is possible to solve the Π-problem by selecting the type(s) with the highest value of the relative entropy; in other words, to view REM as a self-standing method for solving the Π-problem, rather than as an asymptotic instance of MaxProb. Obviously, REM satisfies the CLLN. MaxProb and REM/MaxEnt are the only two methods which satisfy the CLLN. Selection of the mean type, which was proposed under the name ExpOc in [10], or selection of, say, the type with the highest value of the Tsallis entropy, would in general violate the CLLN.

The Π-problem originated in Statistical Physics, where $\Pi$ is formed by a mean-energy constraint; see [5]. In [12] a feasible set of types formed by interval observations was considered.

Estimating Equations can be used to expand the Π-problem into a parametric setting. This time, the EE define a feasible set $\Pi$ into which an unobserved parametrized type $\nu^n(\theta)$ is supposed to belong: $\Pi \triangleq \bigcup_{\Theta} \Pi(\theta)$, where $\Pi(\theta) \triangleq \{p(x;\theta) : \sum_{\mathcal{X}} p(x;\theta)\,u_j(x;\theta) = 0,\ 1 \le j \le J\}$, $\theta \in \Theta \subseteq \mathbb{R}^K$. The true data-sampling distribution $r(x;\theta)$ need not belong to $\Pi$.
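For such moment-type feasible sets, convex duality gives the $I$-projection an exponentially tilted (Gibbs) form. The sketch below (a made-up single-constraint example; the support values $a$, the pmf $q$, and the level $c$ are illustrative, not from the paper) computes the $I$-projection of $q$ on $\Pi = \{p : \sum_i p_i a_i = c\}$ — i.e., one estimating function $u(x) = a(x) - c$ — via the tilted family $\hat{p}_i \propto q_i e^{\lambda a_i}$, solving for $\lambda$ by bisection, and checks optimality against other feasible pmfs:

```python
import math

# Made-up example: support values a(x), sampling pmf q, and a mean constraint
# Pi = { p : sum_i p_i a_i = c }.
a = [1.0, 2.0, 3.0]
q = [1 / 3, 1 / 3, 1 / 3]
c = 2.5

def tilt(lam):
    """Exponentially tilted pmf: p_i proportional to q_i * exp(lam * a_i)."""
    w = [qi * math.exp(lam * ai) for qi, ai in zip(q, a)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(pi * ai for pi, ai in zip(p, a))

# The tilted mean is strictly increasing in lam; solve mean(tilt(lam)) = c
# by bisection.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean(tilt(mid)) < c:
        lo = mid
    else:
        hi = mid
p_hat = tilt((lo + hi) / 2)

def I_div(p):
    """I(p||q), with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

assert abs(mean(p_hat) - c) < 1e-9            # p_hat is feasible
# p_hat beats other feasible pmfs, as the I-projection must:
for p in ([0.1, 0.3, 0.6], [0.25, 0.0, 0.75]):
    assert abs(mean(p) - c) < 1e-9
    assert I_div(p_hat) < I_div(p)
print([round(x, 4) for x in p_hat], round(I_div(p_hat), 4))
```

The same inner minimization over $\lambda$ is what the convex-dual MaxMaxEnt objective in the next paragraph performs at each fixed $\theta$.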
The parametric Π-problem is framed by the information $\{\mathcal{X}, r, n, \Pi(\theta), \Theta\}$, and the objective is now to select a parametric type $\nu^n(\theta)$ from $\Pi$. The CLLN implies (cf. [20]) that the parametric Π-problem should be (for $n \to \infty$) solved by selecting

$\hat{p}(x; \hat{\theta}) = \arg\inf_{\theta \in \Theta} \inf_{p(x;\theta) \in \Pi(\theta)} I(p(x;\theta) \,\|\, r(x;\theta))$.

Thanks to convex duality, the estimator $\hat{\theta}$ can equivalently be obtained as $\hat{\theta} = \arg\sup_{\theta \in \Theta} \inf_{\lambda(\theta) \in \mathbb{R}^J} \log \sum_{i=1}^m r(x_i;\theta) \exp(-\sum_{j=1}^J \lambda_j(\theta)\,u_j(x_i;\theta))$. The estimator is known as the Maximum Maximum Entropy (MaxMaxEnt) estimator.

The parametric Π-problem can be made more realistic by assuming that a sample of size $N$ is available to a modeler. Kitamura and Stutzer [19] suggested using the sample to estimate the convex-dual objective function by its sample analogue $\log \sum_{l=1}^N \exp(-\sum_{j=1}^J \lambda_j u_j(x_l;\theta))$. The resulting method is known as the Empirical Maximum Maximum Entropy (EMME) method, or Maximum Entropy Empirical Likelihood (cf. [17], [19], [21], [18]).

3 Acknowledgements

Valuable discussions with George Judge and Robert Niven, and feedback from Valérie Girardin, are gratefully acknowledged. Supported by VEGA 1/3016/06 and APVV RPEU-0008-06 grants.

References

[1] Ben-Tal, A., Brown, D. E. and Smith, R. L. (1988): Relative Entropy and the Convergence of the Posterior and Empirical Distributions under Incomplete and Conflicting Information. Tech. rep. 88-12, U. of Michigan.

[2] Boltzmann, L. (1877): Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das Wärmegleichgewicht. Wiener Berichte, 2, 373–435.

[3] Csiszár, I. (1984): Sanov Property, Generalized I-projection and a Conditional Limit Theorem. Ann. Probab., 12, 768–793.

[4] Csiszár, I. (1998): The Method of Types. IEEE Trans. IT, 44, 2505–2523.

[5] Ellis, R. S. (2005): Entropy, Large Deviations and Statistical Mechanics. Springer.

[6] Florens, J.-P. and Rolin, J.-M. (1994): Bayes, Bootstrap, Moments. Discussion paper 94.13, Institut de Statistique, Université catholique de Louvain.

[7] Freedman, D. A. (1963): On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case. Ann. Math. Statist., 34, 1386–1403.

[8] Ganesh, A. and O'Connell, N. (1999): An Inverse of Sanov's Theorem. Stat. & Prob. Letters, 42, 201–206.

[9] Ghosh, J. K. and Ramamoorthi, R. V. (2002): Bayesian Nonparametrics. Springer.

[10] Grendár, M., Jr. and Grendár, M. (2001): What Is the Question that MaxEnt Answers? A Probabilistic Interpretation. In A. Mohammad-Djafari (Ed.): Bayesian Inference and Maximum Entropy Methods in Science and Engineering, AIP, Melville, 83–94.

[11] Grendár, M. (2004): Asymptotic Identity of Mu-projections and I-projections. Acta U. Belii, Ser. Math., 11, 3–6.

[12] Grendár, M., Jr. and Grendár, M. (2005): Maximum Probability/Entropy Translating of Contiguous Categorical Observations into Frequencies. Appl. Math. Comp., 161, 347–351.

[13] Grendár, M. (2006): L-divergence Consistency for a Discrete Prior. Jour. Stat. Research, 40, 73–76. Corrected at: arXiv:math.PR/0610824.

[14] Grendár, M. and Niven, R. K. (2006): The Pólya Urn: Limit Theorems, Pólya Divergence, Maximum Entropy and Maximum Probability. On-line at: arXiv:cond-mat/0612697.

[15] Grendár, M. and Judge, G. (2008a): Consistency of Empirical Likelihood and Maximum A-Posteriori Probability under Misspecification. CUDARE Working Paper 1052, Feb. 2008. On-line: repositories.cdlib.org/are_ucb/1052.

[16] Grendár, M. and Judge, G. (2008b): Large Deviations Theory and Empirical Estimator Choice. Econometric Rev., 27(4-6), 1–13.

[17] Imbens, G., Spady, R. and Johnson, P. (1998): Information Theoretic Approaches to Inference in Moment Condition Models. Econometrica, 66, 333–357.

[18] Judge, G. and Mittelhammer, R. (2007): Estimation and Inference in the Case of Competing Sets of Estimating Equations. J. Econometrics, 138, 513–531.

[19] Kitamura, Y. and Stutzer, M. (1997): An Information-Theoretic Alternative to Generalized Method of Moments Estimation. Econometrica, 65, 861–874.

[20] Kitamura, Y. and Stutzer, M. (2002): Connections Between Entropic and Linear Projections in Asset Pricing Estimation. J. Econometrics, 107, 159–174.

[21] Mittelhammer, R., Judge, G. and Miller, D. (2000): Econometric Foundations. CUP.

[22] Niven, R. K. (2005): Combinatorial Information Theory: I. Philosophical Basis of Cross-Entropy and Entropy. On-line at: arXiv:cond-mat/0512017.

[23] Niven, R. K. (2007): Origins of the Combinatorial Basis of Entropy. In K. H. Knuth et al. (Eds.): Bayesian Inference and Maximum Entropy Methods in Science and Engineering, AIP, Melville, 133–142.

[24] Owen, A. (2001): Empirical Likelihood. Chapman-Hall/CRC, New York.

[25] Qin, J. and Lawless, J. (1994): Empirical Likelihood and General Estimating Equations. Ann. Statist., 22, 300–325.

[26] Sanov, I. N. (1957): On the Probability of Large Deviations of Random Variables. Mat. Sbornik, 42, 11–44. (In Russian.)

[27] van Campenhout, J. M. and Cover, T. M. (1981): Maximum Entropy and Conditional Probability. IEEE IT, 27, 483–489.

[28] Vasicek, O. (1980): A Conditional Law of Large Numbers. Ann. Probab., 8, 142–147.

[29] Vincze, I. (1972): On the Maximum Probability Principle in Statistical Physics. Coll. Math. Soc. J. Bolyai, 9, 869–893.
