Exploring Maximum Entropy Distributions with Evolutionary Algorithms

Raul Rojas
Freie Universitaet Berlin
February 7, 2020

Abstract

This paper shows how to evolve numerically the maximum entropy probability distributions for a given set of constraints, which is a variational calculus problem. An evolutionary algorithm can obtain approximations to some well-known analytical results, but is even more flexible and can find distributions for which a closed formula cannot be readily stated. The numerical approach handles distributions over finite intervals. We show that there are two ways of conducting the procedure: by direct optimization of the Lagrangian of the constrained problem, or by optimizing the entropy among the subset of distributions which fulfill the constraints. An incremental evolutionary strategy easily obtains the uniform, the exponential, the Gaussian, the log-normal, and the Laplace distributions, among others, once the constrained problem is solved with either of the two methods. Solutions for mixed ("chimera") distributions can also be found. We explain why many of the distributions are symmetrical and continuous, but some are not.

1 Maximum Entropy Distributions

The principle of maximum entropy had been used implicitly by statisticians for many years until it became formalized in the mid 1950s. Today it is used in information theory [Cover 06], as well as in machine learning [Rojas 96]. Maximum entropy distributions play an important role in many applications. Statistical classifiers, for example, try to capture regularities in data sets by keeping a description of the "data cloud", which is summarized by a minimal number of parameters. The extreme and opposite case is when the data itself provides its own model, for example in nearest neighbor classifiers (k-NN).
In a k-NN classifier, new data is matched with the closest point in the data set, and the new point is assigned the class of its nearest neighbor (or of k of them, if we decide to classify by taking a majority vote). The other extreme approach is to summarize a complete cloud of data points by storing only its centroid. Given several classes, each represented by a centroid, we can compute which centroid is closest to a new data point in order to obtain its classification.

Therefore, given the data, the problem we generally have is deciding how many parameters of the "data cloud" we want to store (for example, mean value, covariance matrix, and so on). Once we have decided which parameters we want to use, we model the probability distribution of each class making the least number of assumptions about the shape of the distribution. We apply the principle of "maximum ignorance": only the chosen parameters describe the data set, and everything else must be as general as possible. That is, we apply the principle of maximum entropy.

The Gaussian distribution is very popular in this context because it is well known that, given only the mean value and covariance matrix of a data set, the distribution of maximum entropy is a multivariate Gaussian. But there are other probability distributions that can be used: each one of them is the result of applying the principle of maximum entropy to a different set of parameters that we want to store.

In this paper we show, with a few examples, that maximum entropy distributions can be easily found using an evolutionary algorithm that samples from the set of possible probability distributions (defined on a given support interval) constrained by a choice of statistical parameters. The algorithm progresses by selecting distributions with higher and higher entropy, until the search settles on a maximum. This approach is simple but powerful.
Discrete distributions can be handled in a straightforward manner. Generally, the maximum entropy property is proved making assumptions about the integrability and differentiability of the distributions, assumptions that do not need to be made in the computational approach illustrated here. We show further down that there are two ways of sampling the space of distributions during the evolutionary optimization. The continuity and symmetry of some of the maximum entropy distributions that we can find in this way do not have to be assumed in advance, nor do they have to be enforced during the computation. They arise as emergent properties of the optimal distributions, as we will see further down.

2 The uniform distribution

The simplest case we can handle at the beginning is that of a uniform distribution. Given a random variable X which takes real values in the interval [a, b], and if we do not have any further information about X, then the most general distribution describing an experiment which produces values of X is the uniform distribution with support in the interval [a, b]. This makes sense intuitively.

The same result can be obtained for a discrete distribution f(x_i), for i = 1, ..., n, where each x_i is a discrete value that the random variable X can assume. The entropy of the distribution is defined as

E = -\sum_i f(x_i) \log f(x_i)

Maximizing this function, without constraints, leads to the uniform distribution. The message is that if we do not have any reason to assume that any point x_i is more relevant than any other point, then we should assign each one of them the same probability of being selected.
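This discrete entropy is easy to check numerically. The following sketch (plain NumPy; the function name and grid size are our choices, not the paper's) confirms that the uniform distribution on n points, whose entropy is log n, is never beaten by a random distribution:

```python
import numpy as np

def entropy(f):
    """Discrete entropy E = -sum_i f(x_i) log f(x_i), with 0 log 0 := 0."""
    f = np.asarray(f, dtype=float)
    nz = f > 0
    return -np.sum(f[nz] * np.log(f[nz]))

n = 100
uniform = np.full(n, 1.0 / n)            # the candidate maximum entropy distribution
assert abs(entropy(uniform) - np.log(n)) < 1e-12

rng = np.random.default_rng(0)
for _ in range(1000):                    # random distributions never exceed log n
    g = rng.dirichlet(np.ones(n))
    assert entropy(g) <= entropy(uniform) + 1e-12
```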
Figure 1: The uniform distribution in the interval [-4, 4].

The analytical approach for obtaining this result consists in optimizing the Lagrangian of the continuous entropy of the distribution f. The entropy of f is the negative expected value of the logarithm of f:

E = -\int f(x) \log f(x) \, dx

Additionally, we have the constraint \int f = 1, so that f represents a probability distribution. The Lagrangian with this constraint is

L(f) = -\int f(x) \log f(x) \, dx + \lambda \int (f(x) - 1) \, dx

In variational calculus such Lagrangians are optimized by solving the Euler-Lagrange equation

\frac{\partial F}{\partial f} - \frac{\partial}{\partial x} \frac{\partial F}{\partial \dot{f}} = 0

where F is the function, or sum of functions, inside the integral sign. Since the derivative of f is not present in our Lagrangian, we only have to solve the equation \partial F / \partial f = 0. In that case we obtain

-\log f(x) - 1 + \lambda = 0

Since, according to the expression above, the logarithm of f is constant, f itself must be constant. It is precisely the uniform distribution. In an interval [a, b], the value of f for any x in [a, b] is 1/(b - a). With this value of f, the integral of f over the interval [a, b] is 1.

Fig. 1 shows the computational result obtained by optimizing the Lagrangian numerically for a discrete distribution with 100 points in the interval [-4, 4], following the approach explained in the next section. It is a discrete uniform distribution.

3 General Optimization Approach

In general, given the constraints for the optimization problem, a Lagrangian is defined and we then find its extremal values [Lisman 72].
The constraints can be of many types, but usually we have to deal with equality constraints: the mean has to have a certain value, or the variance, or both. Stating the Lagrangian is straightforward in such cases. For example, if we require the distribution f to have mean value μ and variance σ², then the complete Lagrangian, including the constraint \int f = 1, is given by:

L(f) = -\int f(x) \log f(x) \, dx + \lambda_1 \int (f(x) - 1) \, dx + \lambda_2 \int (x f(x) - \mu) \, dx + \lambda_3 \int ((x - \mu)^2 f(x) - \sigma^2) \, dx

Taking the derivative of the functions inside the integrals and setting the result equal to zero, we obtain

-\log f(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0

This means that the logarithm of f is, in general, a quadratic function of x, and therefore f is an exponential function with a quadratic function of x in the exponent. Rearranging the Lagrange multipliers, we find that the distribution of maximum entropy, given the mean and variance, is a Gaussian distribution.

If we want to optimize the above Lagrangian using a numerical approach, there are two alternatives. On the one hand, we can start with a distribution which fulfills the constraints (in the case above, having a given mean and variance). We then stochastically generate a new distribution, or even several alternative distributions. We pick the one with the highest entropy and continue optimizing. When we generate the new distributions, we enforce the constraints by scaling the distribution in an appropriate way. For example, if the variance is too high, we can "compress" the distribution around the mean value in order to reduce the variance. In this way the optimization procedure never leaves the region of admissible distributions. The worst that can happen is that the evolution of the distributions becomes trapped in a local maximum.
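As a cross-check of this Gaussian result, the same constrained problem can be handed to an off-the-shelf constrained optimizer. A sketch assuming SciPy is available (the grid and the values μ = 0, σ² = 1 are our example choices; this is not the evolutionary procedure itself):

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-4.0, 4.0, 41)   # support grid
mu, sigma2 = 0.0, 1.0            # constrained mean and variance

def neg_entropy(f):
    return np.sum(f * np.log(np.maximum(f, 1e-300)))

constraints = [
    {"type": "eq", "fun": lambda f: f.sum() - 1.0},               # normalization
    {"type": "eq", "fun": lambda f: f @ x - mu},                  # mean
    {"type": "eq", "fun": lambda f: f @ (x - mu) ** 2 - sigma2},  # variance
]
f0 = np.full(x.size, 1.0 / x.size)
res = minimize(neg_entropy, f0, method="SLSQP",
               bounds=[(1e-12, 1.0)] * x.size, constraints=constraints,
               options={"maxiter": 1000})
f = res.x

assert res.success
assert abs(f @ x - mu) < 1e-5 and abs(f @ (x - mu) ** 2 - sigma2) < 1e-5
# symmetry about the mean emerges without being imposed
assert np.allclose(f, f[::-1], atol=2e-3)
# the solution tracks a discretized Gaussian shape
g = np.exp(-0.5 * x ** 2); g /= g.sum()
assert np.corrcoef(f, g)[0, 1] > 0.99
```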
The second alternative is to evolve a given random distribution, generating distorted versions, also in a random way. We then pick the best function f in terms of maximizing the Lagrangian, normalizing the result (so that the function is a distribution). Over many iterations, the distribution that numerically maximizes the Lagrangian fulfills all constraints and has maximum entropy. Of course, in some cases the maximum entropy distribution may not exist for a given set of constraints. In that case the numerical approach will not produce a sensible result, or will not converge. We have tested both methods numerically and they produce essentially the same results for the examples presented here.

4 Numerical explorations

In this section we discuss the discrete distributions obtained by an evolutionary strategy. We start from a completely random distribution, and at each step the probability density is perturbed at a single point. We experimented with perturbations at several points, without obtaining any significant advantage in terms of convergence speed.

Gaussian distribution

As we explained above, the Gaussian distribution is obtained when we constrain the value of the mean and of the variance. The constraints are E[x] = μ and E[(x - μ)²] = σ². Including them in the Lagrangian, we obtain the distribution shown in Fig. 2.

Figure 2: The Gaussian distribution with mean 1 and variance 1.

Notice that the evolutionary procedure is agnostic. The symmetry and continuity of the distribution are not included explicitly in the Lagrangian; we obtain both. Intuitively, this is what we would expect.
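A minimal version of this single-point perturbation strategy can be sketched as follows. Here the mean and variance constraints are folded into the score as quadratic penalties (a simplification of the full Lagrangian; the penalty weight, step size, and iteration count are our choices), and a perturbation is kept only if the score improves:

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 50)
mu, sigma2, w = 1.0, 1.0, 50.0       # target mean, target variance, penalty weight

def score(f):
    """Entropy minus quadratic penalties for the constraint violations."""
    h = -np.sum(f * np.log(f))
    m = f @ x
    v = f @ (x - m) ** 2
    return h - w * ((m - mu) ** 2 + (v - sigma2) ** 2)

rng = np.random.default_rng(42)
f = rng.dirichlet(np.ones(x.size))   # completely random initial distribution
best = score(f)
start = best
for _ in range(20000):
    g = f.copy()
    i = rng.integers(x.size)
    g[i] *= np.exp(rng.normal(0.0, 0.2))   # perturb the density at a single point
    g /= g.sum()                           # renormalize to a distribution
    s = score(g)
    if s > best:                           # greedy selection
        f, best = g, s

assert best > start                  # the evolution improved the score
assert abs(f.sum() - 1.0) < 1e-9
assert abs(f @ x - mu) < 0.3         # the mean constraint is approximately met
```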
Given the mean value μ, an asymmetrical distribution would be too special, given that we do not have any information that would give more weight to points to the right or to the left of the mean. The constraint over the variance means that we would expect the distribution to concentrate around the mean value, and that the tails of the distribution should go down asymptotically. Both features of the Gaussian are easy to explain.

A little more difficult to explain is the fact that we obtain a continuous distribution. The reason for this is that in the Lagrangian the function f appears multiplied by constant terms, or by powers of x. The derivative ∂F/∂f will produce a function involving log f(x), powers of x, and some constants (some of them the Lagrange multipliers). It is then clear that we can obtain a closed solution, that is, an arithmetical expression for f(x) in terms of powers of x. In the numerical optimization, the continuity and symmetry of the function f have not been presupposed. Their appearance is a confirmation that the solution obtained is the distribution of maximum entropy being searched for.

Exponential distribution

Another interesting result is obtained with the numerical approach described above when the only two constraints over f are: a) being a distribution, and b) having a given mean. Fig. 3 shows the distribution obtained when the mean value has been constrained to be 1. It is interesting to see that we do not obtain a piecewise uniform distribution. We could think that a uniform distribution between -4 and 1, and another, at another level, between 1 and 4 could be optimal. However, the result shows that the distribution tries to cover the two subintervals, but in a continuous manner. Continuity has not been included in the constraints in an explicit way, but arises in the way explained in the previous section.
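On a finite support the closed form of this solution is f(x) ∝ e^{λx}, with λ determined by the mean constraint. A sketch that computes λ numerically (assuming SciPy; the interval [-4, 4] and mean 1 are the example values from above):

```python
import numpy as np
from scipy.optimize import brentq

a, b, mu = -4.0, 4.0, 1.0
x = np.linspace(a, b, 2001)
dx = x[1] - x[0]

def mean_of(lam):
    """Mean of the normalized tilted density f(x) ~ exp(lam * x) on [a, b]."""
    w = np.exp(lam * (x - x.mean()))   # shift the exponent for numerical stability
    w /= w.sum() * dx
    return (x * w).sum() * dx

# lam = 0 gives the uniform density, whose mean is the midpoint 0;
# a positive lam pushes the mean to the right, so bracket the root there.
lam = brentq(lambda l: mean_of(l) - mu, 1e-9, 10.0)

f = np.exp(lam * (x - x.mean()))
f /= f.sum() * dx
assert abs((x * f).sum() * dx - mu) < 1e-9
```

Note that when μ is exactly the midpoint of the support, the root is λ = 0 and the density degenerates into the uniform one, as described in the text.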
Figure 3: The exponential distribution.

We obtain an exponential distribution because the Euler-Lagrange differential equation produces an equation involving log f(x). Solving the equation, we obtain an exponential solution for f(x). If the mean value μ is located in the middle of the support interval, the exponential distribution becomes "flat" and degenerates into a uniform distribution. Therefore, if for a given data set we only keep the mean value of the data, and the mean is not in the middle of the support interval, our best guess for the form of the distribution is an exponential function with the shape shown in Fig. 3.

Laplace distribution

The Laplace distribution is also useful in machine learning. It is obtained when the dispersion of the distribution is measured not by the variance but by the expected value of the absolute deviation from the mean. It is, in some sense, a measure like the variance, but without the square function. In regression we can measure the deviation of the data points from the regression line using the sum of squared differences. But we could also use the sum of absolute values of the deviations. In that case we obtain a different regression line (the line of least absolute deviation). We use one or the other approach depending on the statistics of the regression error terms.

The Lagrangian for the Laplace distribution is very similar to the Lagrangian for the Gaussian distribution. We just have to substitute the square function in the Lagrangian with the absolute value function. We should expect to get a symmetrical and continuous function, as in the case of the Gaussian.
Fig. 4 shows the numerical result obtained.

Figure 4: The Laplace distribution.

The Laplace distribution has been used in machine learning applications. The LASSO regression analysis method can be interpreted as standard regression with a Laplace prior.

Log-normal distribution

The log-normal distribution is very important in biology because, in many natural phenomena, successive random effects act multiplicatively instead of additively. A random variable is log-normal if its logarithm has the normal distribution. Multiplicative effects can be transformed into additive effects by taking the logarithm. One example of a process where the log-normal distribution could have an application are growth processes depending on multiple genes. The effect of those genes could be explained by assuming that each gene slightly scales an organism up or down. Histograms of the sizes of individuals in a species fit log-normal distributions well. Fig. 5 shows the result obtained by constraining the mean and the variance of log(x).

Figure 5: The log-normal distribution (mean 0.25, variance 0.5).

Median constrained distribution

The next example is interesting from the point of view that the direct use of the Euler-Lagrange equation is not possible.
If we constrain a distribution over a support interval to have a given median, there is no straightforward closed formula that we can insert into the Lagrangian. In this case it is easier to optimize the entropy, selecting only distributions which fulfill the constraint. Given the median, we know that half of the weight of the distribution must be to the left of the median, and the other half to the right. Perturbed distributions can be scaled in order to fulfill the constraint.

Figure 6: The median constrained distribution (median = 2).

Fig. 6 shows the result of the optimization: we obtain two uniform distributions, one to the left and one to the right of the median. This makes sense, since we do not have any other special assumptions about the distribution, other than half of the population being on one side and the other half being on the other side of the median.

Median constrained Laplace and Gaussian distribution

The next experiment can be easily solved numerically, but an analytical solution would be too convoluted. Let us assume that we look for the maximum entropy distributions constrained by a given median, and then either by the expected value of the absolute values of the deviations from the median, or by the expected value of the squares of the deviations from the median. I call the first distribution the "median constrained Laplace" distribution, and the second the "median constrained Gaussian".
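The scaling step used for the plain median constraint above is easy to state in code: after each perturbation, the left and right halves of the density are rescaled so that each carries exactly half of the probability mass, and the perturbation is kept only if the entropy grows. A sketch (grid size, median position, and iteration count are our choices):

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 40)
k = np.searchsorted(x, 2.0)          # grid index splitting the support at the median

def enforce_median(f):
    """Rescale each half so that exactly half the mass lies on each side."""
    g = f.copy()
    g[:k] *= 0.5 / g[:k].sum()
    g[k:] *= 0.5 / g[k:].sum()
    return g

def entropy(f):
    return -np.sum(f * np.log(f))

rng = np.random.default_rng(7)
f = enforce_median(rng.dirichlet(np.ones(x.size)))
h0 = entropy(f)
for _ in range(30000):
    g = f.copy()
    g[rng.integers(x.size)] *= np.exp(rng.normal(0.0, 0.2))
    g = enforce_median(g)            # the constraint holds exactly at every step
    if entropy(g) > entropy(f):      # greedy selection by entropy alone
        f = g

assert abs(f[:k].sum() - 0.5) < 1e-9 and abs(f[k:].sum() - 0.5) < 1e-9
assert entropy(f) > h0               # entropy increased during the evolution
# the limit is piecewise uniform: 0.5/k on the left, 0.5/(n-k) on the right
```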
Figure 7: The median constrained Laplace distribution (median = 1, mean absolute deviation = 1.5).

Fig. 7 shows the shape of the distribution with a given median and a given expected value of the distance to the median. Now we lose the continuity of the complete distribution, because the median constraint has the effect of dividing the support interval into two disconnected compartments. Half of the total probability is on one side of the median, half of the probability on the other side. The constraint over the expected absolute deviation can be fulfilled with maximum entropy by solving two actually disjoint problems. The distribution curves do not touch at the median. The same happens in the case of Fig. 8, where we see the "median Gaussian". The result is equivalent to looking for maximum entropy distributions with a given variance to the left and right of the median.

It would be interesting to think about applications where such median constrained distributions could make sense. Here, they are mentioned as interesting examples of cases in which direct optimization of the Lagrangian for a subset of distributions that fulfill all or some of the constraints leads directly to the solution. In the two cases presented in this section, the median constraint was enforced directly on the generated distributions, while the maximum entropy and the expected value of the deviations were left in the Lagrangian.
Figure 8: The median constrained Gaussian (median = 2, mean squared deviation = 1).

Chimera distributions

We can become bolder now and investigate "chimera distributions", that is, combinations of two different distributions. An example could be a generalization of skewed distributions with tails of different shapes around the mean. We can, for example, look for the maximum entropy distribution with a given mean, but where on the left of the distribution we constrain the expected absolute deviation from the mean, while on the right we constrain the expected squared deviation from the mean. The results are shown in Fig. 9. In the first case, on the left side we have a Gaussian, while we have a Laplace distribution on the right. In the second case, the Laplacian and Gaussian sides have been transposed.

As can be seen in the figures, the chimera distributions show some kind of continuity at the interface at the mean. This is not necessarily so. Additional experiments with different numerical constraints show that the two pieces of the distribution can disconnect at the mean. It would be interesting to investigate under which circumstances the resulting chimera distribution can be continuous.

Cauchy distribution

Another example is the Cauchy distribution, which does not have finite moments of order greater than one, but which nevertheless is the distribution of maximum entropy when the expected value of log(1 + x²) is constrained to be a certain constant. The Cauchy distribution describes the distribution of the ratio of two normally distributed random variables and has applications when describing spinning objects.
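For the standard Cauchy density f(x) = 1/(π(1 + x²)), that constant is E[log(1 + x²)] = 2 log 2, a standard result stated here for orientation (it is not given in the text above). It can be checked by numerical integration, assuming SciPy:

```python
import numpy as np
from scipy.integrate import quad

# E[log(1 + x^2)] under the standard Cauchy density 1 / (pi (1 + x^2))
val, err = quad(lambda t: np.log1p(t * t) / (np.pi * (1.0 + t * t)),
                -np.inf, np.inf)
assert abs(val - 2.0 * np.log(2.0)) < 1e-4
```

Maximizing the entropy under this constraint gives, via the Euler-Lagrange equation, log f = const + λ log(1 + x²), i.e. f ∝ (1 + x²)^{-λ}, which is the Cauchy family.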
Fig. 10 shows the shape of the Cauchy distribution obtained numerically.

Figure 9: A half Laplace, half normal distribution (mean = 2).

Chi-squared distribution

The Chi-squared distribution is the distribution of the sum of the squares of k independent standard normal variables. In the example handled here, the support has been constrained to the interval [0, 8], while usually it is unbounded to the right. Fig. 11 shows the shape of the Chi-squared distribution obtained numerically. It differs from the Chi-squared distribution over the unbounded interval [0, ∞) because of the compact support.

5 Conclusions

This paper has shown that maximum entropy distributions can be readily found once the constraints over the distribution are inserted into a Lagrangian to be optimized. For the optimization, an approach that evolves distributions numerically from an initial, randomly chosen distribution can be used. We showed that some classical analytical results can be reproduced for discrete distributions, but also that, in the case that an analytical approach is not possible or just too difficult, the numerical approach can be a good exploratory tool. It can be used directly in some applications [Buck 91].
In the numerical examples illustrated here, we can use a Lagrangian which only includes the entropy function, enforcing the constraints during the evolutionary process, or we can have some constraints in the Lagrangian while others are enforced in the evolutionary process. In the case of distributions with a constraint over the median, it is easier to enforce the median constraint in the evolutionary process and leave the other constraints in the Lagrangian.

Figure 10: The Cauchy distribution.

Figure 11: The Chi-squared distribution (k = 2).

We have also defined "chimera" distributions in this paper as those subject to different constraints over the support interval. In particular, we have presented the Laplace-Gaussian skewed distribution. It would be interesting to investigate in which applications such chimera distributions could be useful.

For educational purposes, it is also interesting that the evolutionary process can be visualized while the constraints are introduced step by step. We can start from a distribution with no constraints and then displace the mean to one side of the support interval. The distribution gradually transforms into an exponential. We can then add constraints over the variance, and the distribution has to bend down in order to keep the variance in check.
Movies of this gradual procedure can give students a good feeling for the way in which the maximum entropy principle constrains the shape of the distribution.

References

[Cover 06] Th. Cover, Elements of Information Theory, Wiley, 2006.

[Rojas 96] R. Rojas, Neural Networks, Springer-Verlag, 1996.

[Lisman 72] Lisman, J. H. C., van Zuylen, M. C. A., "Note on the generation of most probable frequency distributions", Statistica Neerlandica, 26 (1): 19-23, 1972.

[Buck 91] Brian Buck, Vincent A. Macaulay (eds.), Maximum Entropy in Action: A Collection of Expository Essays, Oxford University Press, Oxford, 1991.