The use of the logarithm of the variate in the calculation of differential entropy among certain related statistical distributions

Author: Thomas M. Eccardt

Abstract: This paper demonstrates that basic statistics (mean, variance) of the logarithm of the variate itself can be used in the calculation of differential entropy among random variables known to be multiples and powers of a common underlying variate. For the same set of distributions, the variance of the differential self-information is shown also to be a function of statistics of the logarithmic variate. Then entropy and its "variance" can be estimated using only statistics of the logarithmic variate plus constants, without reference to the traditional parameters of the variate.

Although the entropy, or information content, of a statistical distribution is an average of the logarithm of its probability density, the logarithm of the variate itself can occasionally play a role in the calculation of entropy. Likewise, the variance of the logarithmic probability density, which we shall abbreviate here as "entropy variance," can be calculated through the use of the logarithmic variate.

Jones (1979:137) has shown that if a random variable X has the probability density function f(x), if the random variable Y is a function of that random variable, i.e. y = g(x), and if H[X] represents the entropy of X, then

H[Y] = H[X] + ∫ f(x) ln(g′(x)) dx,   (1)

or

H[Y] = H[X] + E[ln(g′(x))],   (2)

where E denotes expectation with respect to the distribution of X. This paper will use (2), along with an analogous formula for the variance of the differential self-information, to show how entropy and entropy variance change with the multiples and powers of a given distribution. And since the mean and variance of the logarithmic variate are functions of the multipliers and exponents used to create the distributions, formulas can be derived to calculate entropy and entropy variance based only on the mean and variance of the logarithmic variate. In particular, the formula for entropy involves only one constant, whose range appears to be quite narrow among common statistical distributions. So there may be applications where the notion of an underlying distribution may be disregarded, and where entropy may be estimated directly from statistics of the logarithmic variate.

If the function of a random variable is g(x) = aX^b, then g′(x) = abX^(b-1), and

ln(g′(x)) = ln(ab) + (b-1) ln(X).

So, by (2),

H[Y] = H[X] + E[ln(ab) + (b-1) ln(X)].   (3)

And since E[a + bX] = a + bE[X] and ln(ab) = ln(a) + ln(b),

H[aX^b] = H[X] + ln(b) + (b-1)E[ln(X)] + ln(a).   (4)

Clearly, then, the expectation of the logarithm of the variate itself, E[ln(X)], can be used in the calculation of the differential entropy of a power of a random variable. Of course, the variate must be restricted to values greater than zero, otherwise its logarithm and powers may be undefined or complex. However, if X has a minimum, then a constant can be added to the variate so as to make it always positive, without affecting its entropy. Incidentally, when b = 1 equation (4) reduces to the well-known formula for the entropy of a multiple of a random variable (see, for example, Cover and Thomas (1991:233)):

H[aX] = H[X] + ln(a).   (5)

The next step is to replace the multiplier and exponent with statistics of the logarithmic variate.
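Before taking that step, equation (4) can be checked numerically. The sketch below is illustrative only and not part of the original derivation: it assumes NumPy and SciPy are available, the values of a and b are arbitrary, and X is taken to be a standard exponential variate, for which H[X] = 1 nat and E[ln(X)] = -γ. The transformed variable aX^b is then Weibull-distributed with shape 1/b and scale a, so its entropy is available in closed form for comparison.

```python
import numpy as np
from scipy.stats import weibull_min

# Illustrative check of equation (4) with X ~ Exponential(1),
# for which H[X] = 1 nat and E[ln X] = -gamma (Euler-Mascheroni constant).
a, b = 2.5, 3.0                      # arbitrary positive multiplier and exponent
H_X = 1.0
E_lnX = -np.euler_gamma

# Right-hand side of (4): H[aX^b] = H[X] + ln(b) + (b-1)E[ln(X)] + ln(a)
rhs = H_X + np.log(b) + (b - 1.0) * E_lnX + np.log(a)

# Y = aX^b follows a Weibull distribution with shape 1/b and scale a,
# so SciPy can supply its differential entropy directly.
lhs = weibull_min(c=1.0 / b, scale=a).entropy()

print(rhs, lhs)   # both are about 1.8605 nats for this choice of a and b
```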
To carry out that replacement, note that the expectation of the logarithm of the transformed variate is E[ln(aX^b)] = ln(a) + bE[ln(X)], and its variance is V[ln(aX^b)] = b²V[ln(X)]. Solving these for ln(a) and b, we get

ln(a) = E[ln(aX^b)] - bE[ln(X)]   (6)

and

b = √( V[ln(aX^b)] / V[ln(X)] ).   (7)

Equation (4) can be restated as the change in entropy due to the application of the exponent b and the multiplier a to the variate X:

H[aX^b] - H[X] = (b-1)E[ln(X)] + ln(a) + ln(b).   (8)

Substituting (6) into (8) gives

H[aX^b] - H[X] = (b-1)E[ln(X)] + E[ln(aX^b)] - bE[ln(X)] + ln(b),   (9)

or, with simplification,

H[aX^b] - H[X] = -E[ln(X)] + E[ln(aX^b)] + ln(b).   (10)

Then if (7) is substituted into (10), the result is

H[aX^b] - H[X] = E[ln(aX^b)] - E[ln(X)] + ln(V[ln(aX^b)])/2 - ln(V[ln(X)])/2.   (11)

Clearly both sides of (11) represent a function as applied to aX^b, less the same function applied to X. This also proves that the entropy of any family of statistical distributions, X, related by the transformation aX^b is equal to a function of the logarithmic variate, plus a constant, K, which varies with the type of underlying distribution:

H[X] = E[ln(X)] + ln(V[ln(X)])/2 + K.   (12)

In some distributions (e.g. Weibull and lognormal), a and b are either parameters of the distribution or functions of the parameters. For those distributions, if the entropy of the distribution in general is known, and the mean and variance of the logarithmic variate are known (either through mathematical proof or through simulation), then K can be calculated. So if a set of data is known to come from such a distribution, then its entropy can be calculated exclusively through the mean and variance of the logarithmic variate, without reference to its traditional parameters, which may be difficult to calculate. Entropy has often been difficult to estimate directly from statistical data, especially in continuous distributions, since the probability density of each value that the variate assumes must be reckoned. This usually results in downwardly biased estimates, which increase as the size of the data set increases. Using only (a function of) the variate eliminates this problem. The fact that this new estimator uses only the logarithm of the variate should also reduce the bias which arises from data collection methods that tend to truncate or censor high-valued data points.

On the other hand, exact calculation of K may not always be essential. Kapur (1989:68) has shown that when E[ln(X)] and E[(ln(X))²] are both prescribed, it is the lognormal distribution that has the highest entropy. This implies that the lognormal distribution has the highest K, at about 1.42. Though there may be no theoretical minimum for K, it appears that the most common positive distributions have a K greater than or equal to unity. A range of 1-1.42 in nats rescales to a range of less than 2/3 bit. So if the required accuracy for an entropy estimate is a little less than one bit, then K could be assumed to be equal to about 1.2, and the type of underlying distribution ignored. Another occasion where K may be ignored is when comparing the entropies of unknown or intractable distributions that can be shown to be related through the transformation aX^b. Such distribution functions are easily recognized, having the same shape when graphed on a logarithmic scale. If only the relative entropies are sought, then the value of K is irrelevant.
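To illustrate how (12) can serve as an estimator, the sketch below (not part of the original text; the lognormal parameters and the generic value K = 1.2 are assumptions made in the spirit of the discussion above) estimates entropy from nothing but the sample mean and variance of the log-data and compares the result with the exact lognormal entropy, for which the true constant is K ≈ 1.42.

```python
import numpy as np

def entropy_from_log_stats(data, K=1.2):
    """Estimate differential entropy (nats) via equation (12):
    H ≈ E[ln X] + ln(V[ln X]) / 2 + K, using only the log of the data."""
    log_x = np.log(data)
    return log_x.mean() + 0.5 * np.log(log_x.var()) + K

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.75                                   # arbitrary lognormal parameters
sample = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

exact_H = mu + np.log(sigma) + 0.5 * np.log(2 * np.pi * np.e)   # closed-form lognormal entropy
print(entropy_from_log_stats(sample), exact_H)
# estimate ≈ 1.91 with the generic K = 1.2; exact value ≈ 2.13. The gap of about 0.22 nats
# is just the difference between the generic K = 1.2 and the lognormal's own K ≈ 1.42.
```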
The generalized gamma (or Stacy) distribution provides a useful example to examine how equation (12) works in practice. This is due to the generalized gamma's self-reproductive property when multiplied or exponentiated. In other words, some of its parameters correspond to a and b in the transformed random variable aX^b. However, it has another parameter, v, whose values translate to the typical range of K, 1-1.42. Another advantage is that some of the commonest statistical distributions are special cases of the generalized gamma.

The probability density function of the generalized gamma distribution is as follows:

GG[x | a, b, v] = b / (a^(bv) Γ(v)) · x^(bv-1) · exp{-(x/a)^b}.   (13)

Dadpay et al. (2007:571) give its entropy as

H[X_GG] = ln(a) + ln(Γ(v)) + v - ln(b) + (1/b - v)Ψ(v),   (14)

where Γ(v) is the gamma function of v, and Ψ(v) is its digamma function, the derivative of the logarithm of the gamma function. Using the logarithm of its Mellin transform, two cumulants (mean and variance) of the logarithm of the generalized gamma distribution can be derived as

E[ln(X_GG)] = ln(a) + (1/b)Ψ(v)   (15)

and

V[ln(X_GG)] = (1/b²)Ψ′(v),   (16)

where Ψ′(v) is the first derivative of the digamma function. Substituting (15) and (16) into (12) gives

H[X_GG] = ln(a) + (1/b)Ψ(v) + ln((1/b²)Ψ′(v))/2 + K.   (17)

Setting (17) equal to (14) and solving for K yields

K = ln(Γ(v)) - vΨ(v) + v - ln(Ψ′(v))/2.   (18)

The multiplier a and the exponent b are not present in (18), because K is a constant for all powers and multiples of a given underlying distribution.

v       -vΨ(v)       ln(Γ(v))    -ln(Ψ′(v))/2    K         Common distribution(s)
30      -101.5331    71.2570      1.69224        1.41614   ≈ Lognormal
10      -22.5175     12.8018      1.12610        1.4104
2       -0.84556     0.0          0.21931        1.37375   Chi square
1       +0.57721     0.0         -0.2488         1.32841   Exponential, Weibull
1/2     +0.981755    0.572365    -0.7981561      1.25596   Half-normal, Chi
0.125   +1.0485      2.0194      -2.0901         1.1028
0.001   +1.0006      6.9072      -6.90775        1.00105

Table 1. Calculation of K for various forms of the generalized gamma distribution. In each row, K is the sum of the v, -vΨ(v), ln(Γ(v)), and -ln(Ψ′(v))/2 columns, per equation (18).

Each row of Table 1 lists a different form of the generalized gamma distribution based on its v parameter, the other corresponding terms of (18), and their total, K, plus some of the common names of the resulting distributions that are special cases of it.
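Equation (18) is easy to evaluate with standard special-function routines. The following sketch (an illustrative addition, assuming SciPy's gammaln, digamma, and polygamma are available) computes K for the values of v used in Table 1; the printed values agree with the K column, with the last decimal place occasionally differing because the table's displayed entries are rounded.

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

def K(v):
    """Constant K of equation (18): ln Γ(v) - vΨ(v) + v - ln(Ψ'(v)) / 2."""
    return gammaln(v) - v * digamma(v) + v - 0.5 * np.log(polygamma(1, v))

for v in (30, 10, 2, 1, 0.5, 0.125, 0.001):
    print(f"v = {v:7.3f}   K = {float(K(v)):.5f}")
# K falls from about 1.42 at v = 30 toward 1 as v approaches 0, as in the K column of Table 1.
```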
The first row, where v equals 30, approximates the lognormal distribution; Lawless (1982:26) has demonstrated that the generalized gamma distribution approaches lognormality as v approaches infinity. When v equals ½ (fifth row), the generalized gamma is equivalent to the standardized half-normal distribution, with the proviso that b equals 2 and a equals √2. Substituting these values into (15) and (16) yields E[ln(X)] + ln(V[ln(X)])/2 = -0.530168. Adding this to the value of K = 1.25596 from Table 1 gives a total entropy of 0.725792 for the half-normal distribution with parameters μ = 0 and σ² = 1.

Half of the support of the true standard normal distribution involves negative values, so its entropy cannot be calculated through equation (12). But because its probability density function is symmetrical about zero, and because its absolute value can be generated by the half-normal distribution, the standard normal can be simulated by the half-normal, if only its sign is assigned by the tossing of a fair coin. The coin toss involves one bit of information, or 0.69315 nats, which, added to the half-normal entropy, 0.725792, gives 1.41894, or about 1.42, for the standard normal, agreeing with Jones' (1979:138) derivation, ½ln(2πeσ²).
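The same chain of reasoning can be reproduced numerically. The sketch below (illustrative only; it assumes SciPy's digamma and polygamma, and takes the value of K for v = ½ from Table 1) evaluates (15) and (16) for the standardized half-normal, applies (12), and adds ln 2 for the coin toss to recover ½ln(2πe).

```python
import numpy as np
from scipy.special import digamma, polygamma

# Half-normal as a generalized gamma with v = 1/2, b = 2, a = sqrt(2).
v, b, a = 0.5, 2.0, np.sqrt(2.0)

E_lnX = np.log(a) + digamma(v) / b          # equation (15)
V_lnX = polygamma(1, v) / b**2              # equation (16)
K_half = 1.25596                            # from Table 1 (v = 1/2)

H_half_normal = E_lnX + 0.5 * np.log(V_lnX) + K_half      # equation (12)
H_std_normal = H_half_normal + np.log(2.0)                # add one bit for the sign

print(H_half_normal)                    # ≈ 0.7258
print(H_std_normal)                     # ≈ 1.4189
print(0.5 * np.log(2 * np.pi * np.e))   # ≈ 1.4189, Jones' closed form with sigma = 1
```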
The last rows of Table 1 are supplied to show how K approaches unity as v approaches zero. Euler's infinite product formula for the gamma function clearly shows that as v approaches zero, Γ(v) approaches 1/v. Ψ(v) approaches -1/v, and Ψ′(v) approaches 1/v². So the third and fourth columns of Table 1 cancel, the second column approaches unity, and K approaches unity as v approaches zero. Furthermore, it can be proven that K equals unity for the Pareto and uniform distributions.

It is gratifying to realize that several of the items in Table 1 are (limiting) distributions that result from combinations of independent distributions. The lognormal is the limiting distribution for the product of distributions. The gamma is the resulting distribution for the sum of gamma/exponential distributions, and the chi square results when the squares of normal distributions are added. And the Weibull results from repeatedly picking the minimum value of a fixed number of Weibull/exponential distributions. All of these distributions have a K that lies between 1.32841 and the lognormal value of about 1.42. So if it can be assumed that a set of data results from a combination of independent factors as just described, then its entropy can be estimated with reasonable accuracy exclusively through statistics of the logarithmic variate, without any of the traditional parameters. In practice, if the data set contains negative values, the lowest negative value might be excluded, and its absolute value added to all other data points.

If it is surprising that the entropy (or the mean of the self-information) of a random variable can be estimated through the mean and variance of the logarithmic variate itself, it is just as interesting, if not as useful, that the second central moment of the self-information can also be calculated using only the variance of the logarithmic variate. In sum, there is a surprisingly strong relation between the first two cumulants of the self-information and the first two cumulants of the logarithm of the variate itself, at least among random variables that are multiples and powers of each other.

In order to investigate the second central moment of the self-information, or the "entropy variance," we need to derive an equation analogous to (2). If f(x) represents a probability density function, then ordinary entropy, the first moment of self-information, can be represented as

H[X] = -E[ln(f(x))].   (19)

As a second central moment, the entropy variance can be calculated similarly to the variance of any variate:

V[ln(f(x))] = E[ln²(f(x))] - E²[ln(f(x))].   (20)

Following Jones' (1979:137) treatment of entropy, we shall derive a formula for the entropy variance of a function, Y = g(X), of the random variable X. We note in advance that the second term on the right side of (20) will simply be replaced by the square of the formula that Jones ultimately derives, namely

H[Y] = H[X] + ∫ f(x) ln(g′(x)) dx.   (21)

Now, the first term of (20) is equivalent to

E[ln²(f(y))] = ∫ f(y) · ln²(f(y)) · dy.   (22)

We can follow Jones and substitute f(x)/g′(x) for f(y) and g′(x)·dx for dy, to begin the derivation of a formula for the first term:

E[ln²(f(y))] = ∫ (f(x)/g′(x)) · ln[f(x)/g′(x)] · ln[f(x)/g′(x)] · g′(x) dx.   (23)

The two g′(x) factors cancel, and after the quotient law of logarithms is applied, restating the resulting equation as an expectation gives

E[ln²(f(y))] = E[(ln(f(x)) - ln(g′(x)))²].   (24)

Carrying out the square, and rewriting the terms as separate expectations, gives

E[ln²(f(y))] = E[ln²(f(x))] + E[-2·ln(f(x))·ln(g′(x))] + E[ln²(g′(x))],   (25)

which is analogous to (2), except that it represents the pure second moment (about zero) of the self-information for a function of the random variable X. Rewriting (21) as expectations and squaring it yields

H²[Y] = H²[X] + 2·H[X]·E[ln(g′(x))] + E²[ln(g′(x))].   (26)

If we denote the entropy variance of the random variable Y as HV[Y], then clearly

HV[Y] = E[ln²(f(y))] - H²[Y].   (27)

Now (27) suggests that we subtract (26) from (25). But before we do, we can recognize that the parallel combinations of their first terms, and then of their last terms, separately represent (entropy) variances of their own:

HV[X] = E[ln²(f(x))] - H²[X]   (28)

and

V[ln(g′(x))] = E[ln²(g′(x))] - E²[ln(g′(x))].   (29)

So the resulting formula for the entropy variance of a function, Y, of the random variable X is

HV[Y] = HV[X] + V[ln(g′(x))] + E[-2·ln(f(x))·ln(g′(x))] - 2·H[X]·E[ln(g′(x))],   (30)

which is reminiscent of (2), though certainly more complicated. Again it is the original function plus or minus some other terms. It will be convenient to treat the ln(g′(x)) terms of (30) separately when substituting ln(g′(x)) = ln(ab) + (b-1)ln(X). The first term gives

V[ln(g′(x))] = V[ln(ab) + (b-1)ln(X)].   (31)

Applying a well-known rule about the variance of a function of a random variable, namely V[a + bX] = b²V[X], gives

V[ln(g′(x))] = (b² - 2b + 1)·V[ln(X)].   (32)

Replacing ln(g′(x)) in the next term of (30) yields

E[-2·ln(f(x))·ln(g′(x))] = -2E[ln(f(x))·ln(ab) + ln(f(x))·(b-1)·ln(X)],   (33)

or, after applying some familiar rules about expectations,

E[-2·ln(f(x))·ln(g′(x))] = -2E[ln(f(x))·ln(ab)] - 2(b-1)E[ln(f(x))·ln(X)].   (34)

Since H[X] = -E[ln(f(x))], the last term of (30) can be rewritten as

-2H[X]·E[ln(g′(x))] = 2·E[ln(f(x))]·E[ln(g′(x))],   (35)

and when ln(g′(x)) is replaced it becomes

-2H[X]·E[ln(g′(x))] = 2E[ln(f(x))·ln(ab)] + 2(b-1)E[ln(f(x))]·E[ln(X)].   (36)

When (32), (34) and (36) are recombined, the 2E[ln(f(x))·ln(ab)] terms cancel, and grouping the terms into factors of either b² or b, we get

HV[Y] = HV[X] + b²·V[ln(X)] + b·(-2·V[ln(X)] + 2E[ln(f(x))]·E[ln(X)] - 2E[ln(f(x))·ln(X)]) + V[ln(X)] - 2E[ln(f(x))]·E[ln(X)] + 2E[ln(f(x))·ln(X)],   (37)

which is the analog of (4). This formula for entropy variance is again more complicated than that for ordinary entropy. However, it is simpler in at least one way: the variable a is missing. This means that if one multiplies a random variable by a constant, its entropy variance is unaffected.
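That observation is easy to verify by simulation. In the sketch below (an illustrative addition; the multiplier is arbitrary and SciPy is assumed), X is exponential, for which the self-information -ln f(X) = -ln λ + λX has variance exactly 1, and the sample variance of the self-information is unchanged when the variate is multiplied by a constant.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)   # X ~ Exponential(1)
a = 7.0                                        # arbitrary positive multiplier
y = a * x                                      # Y = aX ~ Exponential(scale = a)

# Self-information of each observation under its own distribution.
info_x = -expon(scale=1.0).logpdf(x)
info_y = -expon(scale=a).logpdf(y)

print(info_x.var(), info_y.var())   # both ≈ 1: the multiplier a drops out of (37)
```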
Since b is a function only of the variance of the logarithmic variate, i.e. b = √( V[ln(X^b)] / V[ln(X)] ), only the variance of the logarithmic variate will appear in the formula for the entropy variance of a multiple and power of a random variable:

HV[aX^b] = HV[X] + V[ln(X^b)] + V^½[ln(X^b)]·V^(-½)[ln(X)]·(-2V[ln(X)] + 2E[ln(f(x))]·E[ln(X)] - 2E[ln(f(x))·ln(X)]) + V[ln(X)] - 2E[ln(f(x))]·E[ln(X)] + 2E[ln(f(x))·ln(X)].   (38)

The result (38) is not nearly as simple as equation (12). The coefficients of V[ln(X^b)] and V^½[ln(X^b)] are complicated, and even include such unfamiliar statistics as E[ln(f(x))·ln(X)], the average of the product of the logarithmic probability density and the logarithmic variate, which would ordinarily be difficult to calculate. Unlike the simple constant coefficients of (12), the coefficients of (38) vary with the type of distribution, so there is no reasonable way to estimate entropy variance without regard to the distribution. Nevertheless, the coefficients are constant for multiples and powers of a given distribution, and the gist of (38) is that for any such distributions, entropy variance changes only with the variance and standard deviation of the logarithm of the variate itself.

But what is the meaning or use of entropy variance? If data from a given statistical distribution (continuous or discrete) are coded using an optimally efficient algorithm (with respect to code length), then the average length of the code is proportional to the entropy of the given data distribution. The variance of the code lengths will then be the entropy variance of the distribution. Now if the encoded data are statistically independent of each other, then not only are the entropies additive, but the entropy variances ought to be additive as well. For example, the average length of three consecutive code words is three times the length of one code word, and the variance of the length of three consecutive code words is three times that of one.
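This additivity can be seen in a small discrete example. The sketch below (illustrative only; the four probabilities are arbitrary) takes -log2(p) as the ideal code length of each event and confirms that both the mean and the variance of the code length triple when three independent symbols are encoded together.

```python
import numpy as np
from itertools import product

p = np.array([0.4, 0.3, 0.2, 0.1])            # an arbitrary four-event distribution
lengths = -np.log2(p)                          # ideal code lengths in bits

H = (p * lengths).sum()                        # entropy = mean code length
HV = (p * (lengths - H) ** 2).sum()            # "entropy variance" = variance of code length

# Three independent symbols: joint probabilities multiply, code lengths add.
p3 = np.array([p[i] * p[j] * p[k] for i, j, k in product(range(4), repeat=3)])
len3 = -np.log2(p3)
H3 = (p3 * len3).sum()
HV3 = (p3 * (len3 - H3) ** 2).sum()

print(H3 / H, HV3 / HV)   # both ratios are 3 (up to floating-point error)
```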
Although equations (4) and (37) apply strictly to continuous distributions, they represent approximations when applied to discrete distributions, if the probabilities are arranged by magnitude and the variate is defined as the rank in the arrangement. For example, if the events of a space have probabilities 0.4, 0.2, 0.1 and 0.3, then the variates are assigned as 1, 3, 4, and 2, to rank the probabilities. Let us assume that the events are now sorted by probability, and that each is assigned an optimal code. But now let a new code be created by concatenating any one of, say, four equiprobable fixed-length prefixes to each code word, effectively quadrupling the cardinality of the probability space. For each old code word, there will be four code words, and if the new code remains optimal, when sorted, all the words with different prefixes but the same ending (and therefore equal probability) should have neighboring rankings. The rankings would have been equal, except for the need to disambiguate the equiprobable events. On the other hand, all words with any given prefix will be about four ranks away from each other in the new code. Now, since the new probability distribution is simply a subdividing (by 4) of the old distribution, the cumulative probability up to each use of any prefix in the distribution should match the cumulative probability of the code without that prefix. Therefore the ranks (variates) of the new code will be about 4 times higher than the ranks of the old code.

In other words, the essential difference between the two distributions is that the variate of the new system is four times higher than that of the old. Now the entropy of the new system will certainly be higher than that of the old system, since there are more events to code, and of course the code length is longer by the length of the prefix. However, the entropy variance of the rank distribution will be unchanged, since adding a constant to the random variable of the code length will not alter its variance. As already mentioned, equation (37) tells us that multiplying the data in a continuous distribution by a constant will not affect its entropy variance.

References:

Cover, Thomas M. and Joy A. Thomas. 1991. Elements of Information Theory. New York: John Wiley & Sons.

Dadpay, A., et al. 2007. "Information measures for generalized gamma family." Journal of Econometrics 138: 568-585.

Jones, D.S. 1979. Elementary Information Theory. Oxford: Oxford University Press.

Kapur, J.N. 1989. Maximum-Entropy Models in Science and Engineering. New York: John Wiley & Sons.

Lawless, J.F. 1982. Statistical Models and Methods for Lifetime Data. New York: John Wiley & Sons.
