Filtering Additive Measurement Noise with Maximum Entropy in the Mean
Authors: Henryk Gzyl, Enrique ter Horst
Abstract. The purpose of this note is to show how the method of maximum entropy in the mean (MEM) may be used to improve parametric estimation when the measurements are corrupted by a large level of noise. The method is developed in the context of a concrete example: the estimation of the parameter of an exponential distribution. We compare the performance of our method with the Bayesian and maximum likelihood approaches.

1. Introduction

Suppose that you want to measure the half-life of a decaying nucleus or the lifetime of some elementary particle, or some other random variable modeled by an exponential distribution describing, say, a decay time or the lifetime of a process. Assume as well that the noise in the measurement process can be modeled by a centered Gaussian random variable whose variance may be of the same order of magnitude as that of the decay rate to be measured. To make things worse, assume that you can collect only very few measurements. That is, if x_i denotes the realized value of the variable, one can only measure

y_i = x_i + e_i,  i = 1, 2, ..., n,

where n is a small number, say 2 or 3, and e_i denotes the additive measurement noise. In other words, assume that you know that the sample comes from a specific parametric distribution but is contaminated by additive noise. What to do?

One possible approach is to apply small-sample statistical estimation procedures. But these are designed for problems where the variability is due only to the random nature of the quantity measured, and there is no other noise in the measurement. Still another possibility, the one we want to explore here, is to apply a maxentropic filtering method to estimate both the unknown variable and the noise level. For this we recast the problem as a typical inverse problem consisting of solving for x in

(1)   y = Ax + e,  x ∈ K,

where K is a convex set in R^d, y ∈ R^k for some d and k, and A is a k × d matrix which depends on how we rephrase our problem. We could, for example, consider the following problem: find x̂ ∈ [0, ∞) such that

(2)   ŷ = x̂ + ê.

In our case K = [0, ∞), and we set ŷ = (1/n) Σ_j y_j. Or we could consider a collection of n such problems, one for every measurement, and then proceed to carry out the estimation. Once we have solved the generic problem (1), the variations on the theme are easy to write down. What is important to keep in mind is that the output of the method is a filtered estimator x̂* of x̂, which itself is an estimator of the unknown parameter. The novelty, then, is to filter out the noise in (2).

The method of maximum entropy in the mean is rather well suited for solving problems like (1). See Navaza (1986) for an early development and Dacunha-Castelle and Gamboa (1990) for a full mathematical treatment. Below we briefly review what the method is about and then apply it to obtain an estimator from (2). In Section 3 we obtain the maxentropic estimator, and in Section 4 we examine some of its properties; in particular, we examine what the results would be if either the noise level were small or the number of measurements were large. We devote Section 5 to some simulations in which the method is compared with the Bayesian and maximum likelihood approaches.
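To fix ideas, here is a minimal sketch of the measurement model in Python/NumPy. This is an illustration, not part of the paper's method; the variable names and the values θ = 1, δ = 0.5 are ours, chosen to match the simulations of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Measurement model of the introduction: y_i = x_i + e_i, with x_i
# exponential and e_i centered Gaussian noise; n is deliberately tiny.
n, theta, delta = 3, 1.0, 0.5                    # illustrative values only
x = rng.exponential(scale=1.0 / theta, size=n)   # unobserved exponential variables
e = rng.normal(loc=0.0, scale=delta, size=n)     # additive measurement noise
y = x + e                                        # what is actually observed

y_bar = y.mean()                                 # the naive estimator ŷ of equation (2)
print(y_bar)
```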
2. The basics of MEM

MEM is a technique for transforming a possibly ill-posed linear problem with convex constraints into a simpler (usually unconstrained) but non-linear minimization problem, the number of variables in the auxiliary problem being equal to the number of equations in the original problem, k in the case of example (1). To carry out the transformation, one thinks of the x there as the expected value of a random variable X with respect to some measure P to be determined. The basic datum is a sample space (Ω_s, F_s) on which X is to be defined. In our setup the natural choice is to take Ω_s = K, F_s = B(K), the Borel subsets of K, and X = id_K, the identity map. Similarly, we think of e as the expected value of a random variable V taking values in R^k. The natural choice of sample space here is Ω_n = R^k and F_n = B(R^k), the Borel subsets. To continue, we need to select two prior measures dQ_s(ξ) and dQ_n(v) on (Ω_s, F_s) and (Ω_n, F_n). The only restriction that we impose on them is that the closure of the convex hull of supp(Q_s) (resp. of supp(Q_n)) is K (resp. R^k). These prior measures embody knowledge that we may have about x and e, but they are not priors in the Bayesian sense. Actually, the model for the noise component describes the characteristics of the measurement device or process, and it is a datum.

The two pieces are put together by setting Ω = Ω_s × Ω_n, F = F_s ⊗ F_n, and dQ(ξ, v) = dQ_s(ξ) dQ_n(v). To get going, we define the class

(3)   P = { P | P << Q ; A E_P[X] + E_P[V] = y }.

Note that any P ∈ P having a strictly positive density ρ = dP/dQ satisfies E_P[X] ∈ int(K). For this standard result in analysis, see Rudin (1973). The procedure to explicitly produce such P's is known as the maximum entropy method. Its first step is to assume that P ≠ ∅, which amounts to saying that our inverse problem (1) has a solution, and to define S_Q : P → [−∞, ∞) by the rule

(4)   S_Q(P) = − ∫_Ω ln(dP/dQ) dP

whenever the function ln(dP/dQ) is P-integrable, and S_Q(P) = −∞ otherwise. This entropy functional is concave on the convex set P. To guess the form of the density of the measure P* that maximizes S_Q, one considers the class of exponential measures on Ω defined by

(5)   dP_λ = ( e^{−⟨λ, Aξ⟩ − ⟨λ, v⟩} / Z(λ) ) dQ,

where the normalization factor is Z(λ) = E_Q[ e^{−⟨λ, Aξ⟩ − ⟨λ, v⟩} ] and λ ∈ R^k. Define the dual entropy function Σ : D(Q) → (−∞, ∞] by the rule

(6)   Σ(λ) = ln Z(λ) + ⟨λ, y⟩,

with Σ(λ) = ∞ whenever λ ∉ D(Q) ≡ { μ ∈ R^k | Z(μ) < ∞ }. It is easy to prove that Σ(λ) ≥ S_Q(P) for any λ ∈ D(Q) and any P ∈ P. Thus, if we were able to find a λ* ∈ D(Q) such that P_{λ*} ∈ P, we would be done. To find such a λ* it suffices to minimize (the convex function) Σ(λ) over (the convex set) D(Q). We leave it to the reader to verify that if the minimum is reached in the interior of D(Q), then P_{λ*} ∈ P. We direct the reader to Borwein and Lewis (2000) for all about this, and much more.
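The recipe above — minimize the dual entropy Σ(λ) = ln Z(λ) + ⟨λ, y⟩ and read the estimates off the minimizer — is easy to carry out numerically. Below is a minimal sketch in Python/SciPy for the one-dimensional case k = 1, A = 1, using the Gamma/Gaussian priors that Section 3 introduces (so the closed form derived there can serve as a check). The function names, bounds and illustrative values are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sigma(lam, y_bar, n, alpha, delta):
    """Dual entropy Σ(λ) = ln Z(λ) + λ·ŷ for the priors of Section 3:
    Q_s = Γ(n, α/n) with mean 1/α and Q_n = N(0, δ²/n), so that
    ln Z(λ) = λ²δ²/(2n) − n ln(1 + λ/(nα)), defined for λ > −nα."""
    if lam <= -n * alpha:
        return np.inf                          # outside D(Q)
    log_z = lam**2 * delta**2 / (2 * n) - n * np.log1p(lam / (n * alpha))
    return log_z + lam * y_bar

n, alpha, delta, y_bar = 3, 0.8, 0.5, 1.2      # illustrative values
res = minimize_scalar(sigma, args=(y_bar, n, alpha, delta),
                      bounds=(-n * alpha + 1e-9, 50.0), method="bounded")
lam_star = res.x
x_star = n / (lam_star + n * alpha)            # E_{P_{λ*}}[X], the filtered estimator
e_star = -delta**2 * lam_star / n              # E_{P_{λ*}}[V], the residual noise
print(x_star, e_star, x_star + e_star - y_bar) # last value ≈ 0, i.e. ŷ = x̂* + ê*
```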
3. Entropic Estimators

Let us now turn our attention to equation (2). Since our estimator is a sample mean of exponentials (of unknown parameter), it is natural, for the method described in Section 2, to assume that the prior Q_s for X is a Γ(n, α/n) distribution, where α > 0 is our best (or prior) guess of the unknown parameter. Below we propose a criterion for the best choice of α. Similarly, we choose Q_n, the prior for the noise component, to be the distribution of a N(0, δ²/n) random variable. Things are rather easy under these assumptions. To begin with, note that

Z(λ) = e^{λ²δ²/(2n)} (1 + λ/(nα))^{−n},

and the typical member dP_λ(ξ, v) of the exponential family is now

(7)   dP_λ(ξ, v) = [ (λ + nα)^n ξ^{n−1} / Γ(n) ] e^{−(λ+nα)ξ} · [ e^{−n(v + δ²λ/n)²/(2δ²)} / (2πδ²/n)^{1/2} ] dξ dv.

It is also easy to verify that the dual entropy function is given by

Σ(λ) = λ²δ²/(2n) − n ln(1 + λ/(nα)) + λŷ,

whose minimum value is reached at the λ* satisfying

(8)   λ*δ²/n − (1/α)/(1 + λ*/(nα)) + ŷ = 0.

Discarding one of the two solutions of the resulting quadratic equation (because it leads to a negative estimator of a positive quantity), we are left with

λ*/(nα) = (1/2) [ −(1 + ŷ/(αδ²)) + ( (1 − ŷ/(αδ²))² + 4/(α²δ²) )^{1/2} ],

from which we obtain

(9)   λ*/(nα) + 1 = (1/2) [ (1 − ŷ/(αδ²)) + ( (1 − ŷ/(αδ²))² + 4/(α²δ²) )^{1/2} ],

as well as

(10)  x̂* = E_{P_{λ*}}[X] = n/(λ* + nα) = { (α/2) [ (1 − ŷ/(αδ²)) + ( (1 − ŷ/(αδ²))² + 4/(α²δ²) )^{1/2} ] }^{−1},
      ê* = E_{P_{λ*}}[V] = −δ²λ*/n.

Comment 1. Clearly, from (8) it follows that ŷ = x̂* + ê*. Thus it makes sense to think of x̂* as the estimator with the noise filtered out, and of ê* as the residual noise.
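For convenience, here is a direct transcription of the closed form (10) as a small Python function (the name mem_estimator is ours); the residual noise is recovered through Comment 1's identity ŷ = x̂* + ê*.

```python
import numpy as np

def mem_estimator(y_bar, alpha, delta):
    """Closed-form entropic estimator of equation (10).

    Returns (x_star, e_star): the filtered estimate and the residual
    noise, which satisfy y_bar = x_star + e_star (Comment 1)."""
    b = y_bar / (alpha * delta**2)
    root = np.sqrt((1.0 - b)**2 + 4.0 / (alpha**2 * delta**2))
    x_star = 1.0 / (0.5 * alpha * ((1.0 - b) + root))
    return x_star, y_bar - x_star

x_star, e_star = mem_estimator(y_bar=1.2, alpha=0.8, delta=0.5)
print(x_star, e_star)
```

With α = 1/ŷ the function returns (ŷ, 0), in agreement with Lemma 1 below, and as α → 0 it approaches the expressions collected in (12).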
4. Properties of x̂*

Let us now spell out some of the notation underlying the probabilistic model behind (1). We shall assume that the x_i and the e_i of the first section are values of random variables X_i and ε_i defined on a sample space (W, 𝒲). For each θ > 0 we assume given a probability law P(θ) on (W, 𝒲), with respect to which the sequences {X_k | k = 1, 2, ...} and {ε_k | k = 1, 2, ...} are both i.i.d. and independent of each other, and under which X_k ∼ exp(θ) and ε_k ∼ N(0, δ²). That is, we take the underlying model for the noise as our prior model for it: minimal consistency is all right. From the above, the following basic results are easy to obtain. From (9) and (10) it is clear that

Lemma 1. If we take α = 1/ŷ, then λ* = 0, x̂* = ŷ and ê* = 0.

Comment 2. Actually, it is easy to verify that the solution of x̂*(α) = 1/α is α = 1/ŷ.

To examine the case in which large data sets are available, let us add a superscript n and write ŷ^(n) to emphasize the size of the sample. If x̂^(n) denotes the arithmetic mean of an i.i.d. sequence of random variables having exp(θ) as common law, it will follow from the LLN that

Lemma 2. As n → ∞,

(11)  (x̂^(n))* → x̃(α) ≡ { (α/2) [ (1 − θ/(αδ²)) + ( (1 − θ/(αδ²))² + 4/(α²δ²) )^{1/2} ] }^{−1}.

Proof. Start from (10), invoke the LLN to conclude that ŷ^(n) tends to θ, and obtain (11).

Corollary 1. The true parameter is the solution of x̃(α) − 1/α = 0.

Proof. Just look at the right-hand side of (11) to conclude that x̃(1/θ) = θ.

Comment 3. What this asserts is that when the number of measurements is large, to find the right value of the parameter it suffices to solve x̃(α) − 1/α = 0.

And when the noise level goes to zero, we have

Lemma 3. With the notations introduced above, x̂* → ŷ as δ → 0.

Proof. When δ → 0, dQ_n(v) → ε_0(dv), the Dirac point mass at 0. In this case we just set δ = 0 in (8), and the conclusion follows.

When we choose α = 1/ŷ, the estimator x̂* happens to be unbiased.

Lemma 4. Let θ denote the true but unknown parameter of the exponential, and let P_θ(dy) have density

f_θ(y) = ∫_{−∞}^{y} [ θ^n (y − s)^{n−1} e^{−θ(y−s)} / Γ(n) ] · [ e^{−s²/(2δ²)} / (√(2π) δ) ] ds

for y > 0, and 0 otherwise. With the notations introduced above, we have E_{P(θ)}[(x̂^(n))*] = 1/θ whenever the prior α for the maxent procedure is 1/ŷ.

Proof. It drops out easily from Lemma 1, from (2), and from the fact that the joint density f_θ of ŷ is a convolution.

But the right choice of the parameter α is still a pending issue. To settle it, we consider once more the identity |ŷ − x̂*| = |ê*|. In our particular case we shall see that α = 0 minimizes the right-hand side of this identity; thus we propose to choose α so as to minimize the residual, or reconstruction, error.

Lemma 5. With the same notations as above, ê* is a monotone function of α, with

ê*(α = 0) = (1/2)( ŷ − (ŷ² + 4δ²)^{1/2} )  and  ê*(α → ∞) = ŷ.

In the first case x̂*(α = 0) = (1/2)( ŷ + (ŷ² + 4δ²)^{1/2} ), whereas in the second x̂*(α → ∞) = 0.

Proof. Recall from Lemma 1 that when αŷ = 1, ê* = 0. A simple algebraic manipulation shows that ê* > 0 when αŷ > 1, and ê* < 0 when αŷ < 1. To compute the limit of ê* as α → ∞, note that for large α we can neglect the term 4/(α²δ²) under the square-root sign, and the result drops out. It is also easy to check the positivity of the derivative of ê* with respect to α. Clearly also |ê*(0)| < |ê*(∞)|.

To sum up, with the choice α = 0 the entropic estimator and the residual error are

(12)  x̂*(0) = (1/2)( ŷ + (ŷ² + 4δ²)^{1/2} ),  ê*(0) = (1/2)( ŷ − (ŷ² + 4δ²)^{1/2} ).

5. Simulation and comparison with the Bayesian and Maximum Likelihood approaches

In this section we compare the proposed maximum entropy in the mean procedure with the Bayesian and maximum likelihood estimation procedures. We do so by simulating data, carrying out the three procedures, and plotting the corresponding histograms. First, we generate histograms that describe the statistical nature of x̂* as a function of the parameter α. For that we generate a data set of 1000 samples, and for each of them we obtain x̂* from (12). Also, for each data point we apply both a Bayesian estimation method and a maximum likelihood method, and plot the resulting histograms.

5.1. The maxentropic estimator. The simulated data process goes as follows. For n = 3, the data points y_1, y_2, y_3 are obtained in the following way (a code sketch of this loop appears below, after Figure 1):

• Simulate a value for x_i from an exponential distribution with parameter θ (= 1).
• Simulate a value for e_i from a normal distribution N(0, δ²) with δ = 0.5.
• Sum x_i and e_i to get y_i; if y_i < 0, repeat the first two steps until y_i > 0.
• Do this for i = 1, 2, 3.
• Compute the maximum entropy estimator given by equation (10).

We then display the resulting histogram in Figure 1.

[Figure 1. Histogram for E(x) with the Maximum Entropy Method.]
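A minimal sketch of this simulation loop, assuming NumPy and Matplotlib; it uses the α = 0 form (12) of the estimator, matching the description above, and the seed and bin count are our choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n, theta, delta, n_rep = 3, 1.0, 0.5, 1000

estimates = np.empty(n_rep)
for r in range(n_rep):
    y = np.empty(n)
    for i in range(n):
        # Redraw until the noisy observation is positive, as in the text.
        while True:
            y_i = rng.exponential(1.0 / theta) + rng.normal(0.0, delta)
            if y_i > 0:
                break
        y[i] = y_i
    y_bar = y.mean()
    # Entropic estimator with alpha = 0, equation (12).
    estimates[r] = 0.5 * (y_bar + np.sqrt(y_bar**2 + 4 * delta**2))

plt.hist(estimates, bins=40)
plt.title("Histogram for E(x) with the maximum entropy method")
plt.show()
```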
5.2. The Bayesian estimator. In this section we derive the algorithm for Bayesian inference in the model given by y_i = x + e_i, i = 1, 2, ..., n. The classical likelihood estimator of x is given by ŷ = (1/n) Σ_{i=1}^n y_i. Since the unknown mean x has an exponential probability distribution with parameter θ (x ∼ E(θ)), the joint density of the y_i and x is proportional to

(13)  [ ∏_{i=1}^n (2πδ²)^{−1/2} exp( −(y_i − x)²/(2δ²) ) ] · θ exp(−θx) · π(θ),

where θ exp(−θx) is the density of the unknown mean x, and π(θ) ∝ θ^{−1} is Jeffreys' noninformative prior distribution for the parameter θ (Berger 1985). To derive the Bayesian estimator we need the posterior probability distribution of θ, which we sample with the following Gibbs scheme, described in Robert and Casella (2005) and sketched in code after Figure 2:

• Draw x ∼ N( ŷ − θδ²/n , δ²/n ) restricted to x > 0.
• Draw θ ∼ E(x).

Repeat this algorithm many times to obtain a large sample from the posterior distribution of θ, and from it the posterior distribution of E(x) = 1/θ. For our application we simulate data with θ = 1, which gives an expected value for x equal to E(x) = 1. The histogram of the estimates of E(x) after 1000 iterations, when simulating data for θ = 1, is displayed in Figure 2.

[Figure 2. Histogram for E(x) with the Bayes Method.]
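A sketch of the sampler in Python, assuming SciPy's truncated normal for the first conditional; the burn-in, iteration counts and function name are our choices.

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_posterior_mean(y, delta, n_iter=5000, burn_in=500, seed=0):
    """Gibbs sampler for the model y_i = x + e_i of Section 5.2.

    Alternates the two full conditionals stated above:
      x | θ, y ~ N(ŷ − θδ²/n, δ²/n) truncated to x > 0,
      θ | x    ~ Exponential with rate x (from θ e^{−θx} · θ^{−1}).
    Returns posterior draws of E(x) = 1/θ."""
    rng = np.random.default_rng(seed)
    n, y_bar = len(y), np.mean(y)
    sd = delta / np.sqrt(n)
    theta = 1.0                          # arbitrary starting point
    draws = []
    for t in range(n_iter):
        mu = y_bar - theta * delta**2 / n
        a = (0.0 - mu) / sd              # truncation boundary in standard units
        x = truncnorm.rvs(a, np.inf, loc=mu, scale=sd, random_state=rng)
        theta = rng.exponential(1.0 / x) # exponential with rate x
        if t >= burn_in:
            draws.append(1.0 / theta)
    return np.array(draws)

post = gibbs_posterior_mean(np.array([0.9, 1.7, 0.4]), delta=0.5)
print(post.mean())                       # posterior mean of E(x) = 1/θ
```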
5.3. The Maximum Likelihood estimator. The problem of obtaining an ML estimator is complicated in this setup because the data points are distributed like

f_θ(t) = ∫_{−∞}^{t} θ e^{−θ(t−s)} e^{−s²/(2δ²)} ds / (2πδ²)^{1/2} = θ e^{−θt + (θδ)²/2} P(S < t),

where S ∼ N(θδ², δ²). Therefore, after observing t_1, t_2 and t_3, we get the following likelihood, which we maximize numerically (see the sketch below):

(14)  θ³ e^{−θ Σ_{i=1}^3 t_i + 3(θδ)²/2} ∏_{i=1}^3 P(S < t_i).

If we attempted to obtain the ML estimator analytically, we would need to solve

n/θ − Σ_{j=1}^n [ ∫_{−∞}^{t_j} (t_j − s) θ e^{−θ(t_j−s)} e^{−s²/(2δ²)} ds ] / [ ∫_{−∞}^{t_j} θ e^{−θ(t_j−s)} e^{−s²/(2δ²)} ds ] = 0.

Notice that as δ → 0 this equation tends to n/θ − Σ_{j=1}^n t_j = 0, as expected. We can move forward a bit and integrate each numerator by parts; after some calculations we arrive at

n/θ − Σ_{j=1}^n t_j + nδ²θ − Σ_{j=1}^n δ e^{−t_j²/(2δ²)} / [ ∫_{−∞}^{t_j} θ e^{−θ(t_j−s)} e^{−s²/(2δ²)} ds / (2πδ²)^{1/2} ] = 0.

Trying to solve this equation for θ is rather hopeless; that is the reason why we carried out a numerical maximization of (14). To understand what happens when the noise is small, we drop the last term in the last equation and are left with

n/θ − Σ_{j=1}^n t_j + nδ²θ = 0,

the solution of which is

1/θ* = (1/2)( ŷ + (ŷ² − 4δ²)^{1/2} ),  that is,  θ* = 2 ( ŷ + (ŷ² − 4δ²)^{1/2} )^{−1},

and we see that the effect of the noise is to increase the ML estimator. In Figure 3 we plot the histogram of 1/θ* obtained by numerically maximizing (14) for each simulated data point.

[Figure 3. Histogram for E(x) with the Maximum Likelihood Method.]

When simulating data with θ = 1, the MEM, maximum likelihood and Bayesian histograms are all skewed to the right, and the means of the three simulated histograms are close to 1. The MEM method yields a sample mean of 1.3252 with a sample standard deviation of 0.5, the Bayesian method yields a sample mean of 1.045 with a sample standard deviation of 0.5529, and the maximum likelihood method yields a sample mean of 1.81 with a sample standard deviation of 2.29. All three methods produce right-skewed histograms for E(x).
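A sketch of the numerical maximization of (14), working with the log-likelihood for numerical stability and assuming SciPy; the observations and the bounded search interval are illustrative choices of ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def neg_log_likelihood(theta, t, delta):
    """Negative log of the likelihood (14): each observation contributes
    log θ − θ t_i + (θδ)²/2 + log P(S < t_i), with S ~ N(θδ², δ²)."""
    if theta <= 0:
        return np.inf
    log_tail = norm.logcdf(t, loc=theta * delta**2, scale=delta)
    return -np.sum(np.log(theta) - theta * t + 0.5 * (theta * delta)**2 + log_tail)

t = np.array([0.7, 1.4, 2.1])      # illustrative observations t_1, t_2, t_3
delta = 0.5
res = minimize_scalar(neg_log_likelihood, args=(t, delta),
                      bounds=(1e-6, 50.0), method="bounded")
theta_ml = res.x
print(theta_ml, 1.0 / theta_ml)    # the ML estimate of θ and of E(x) = 1/θ
```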
The MEM and Bayesian methods provide similar results, and both are more accurate than the maximum likelihood method.

6. Concluding remarks

On the one hand, MEM backs up the intuitive belief according to which, if the y_i are all the data that you have, it is all right to compute your estimator of the mean with α = 0. The MEM and Bayesian methods yield results closer to the true parameter value than the maximum likelihood estimator.

On the other hand, and this depends on your choice of priors, MEM provides us with a way of modifying those priors to obtain representations like ŷ = x̂* + ê*, where of course x̂* = x̂*(ŷ). What we saw above is that there is a choice of prior distributions such that x̂* = ŷ and ê* = 0. The important thing is that this is actually true regardless of what the "true" probability describing the x_i is.

References

[1] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer-Verlag, Berlin.
[2] Borwein, J. and Lewis, A. (2000). Convex Analysis and Nonlinear Optimization. Springer-Verlag, Berlin.
[3] Navaza, J. (1986). The use of non-local constraints in maximum-entropy electron density reconstruction. Acta Cryst. A42, pp. 212–223.
[4] Dacunha-Castelle, D. and Gamboa, F. (1990). Maximum d'entropie et problème des moments. Ann. Inst. Henri Poincaré 26, pp. 567–596.
[5] Robert, C. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer Texts in Statistics, Springer-Verlag, Berlin.
[6] Rudin, W. (1973). Functional Analysis. McGraw-Hill, New York.

Facultad de Ciencias, Universidad Central de Venezuela; Instituto de Estudios Superiores de Administración IESA, Caracas, DF, Venezuela.
E-mail addresses: hgzyl@reacciun.ve, enrique.terhorst@iesa.edu.ve