Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

We present simple and computationally efficient nonparametric estimators of Rényi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over R^d. The estimators are calculated as the s…

Authors: Dávid Pál, Barnabás Póczos, Csaba Szepesvári
Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

Dávid Pál
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
dpal@cs.ualberta.ca

Barnabás Póczos
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
poczos@ualberta.ca

Csaba Szepesvári
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
szepesva@ualberta.ca

Abstract

We present simple and computationally efficient nonparametric estimators of Rényi entropy and mutual information based on an i.i.d. sample drawn from an unknown, absolutely continuous distribution over R^d. The estimators are calculated as the sum of p-th powers of the Euclidean lengths of the edges of the 'generalized nearest-neighbor' graph of the sample and of the empirical copula of the sample, respectively. For the first time, we prove the almost sure consistency of these estimators and upper bounds on their rates of convergence, the latter under the assumption that the density underlying the sample is Lipschitz continuous. Experiments demonstrate their usefulness in independent subspace analysis.

1 Introduction

We consider the nonparametric problem of estimating the Rényi α-entropy and mutual information (MI) based on a finite sample drawn from an unknown, absolutely continuous distribution over R^d. There are many applications that make use of such estimators, of which we list a few to give the reader a taste: Entropy estimators can be used for goodness-of-fit testing (Vasicek, 1976; Goria et al., 2005), parameter estimation in semi-parametric models (Wolsztynski et al., 2005), studying fractal random walks (Alemany and Zanette, 1994), and texture classification (Hero et al., 2002b,a).
Mutual information estimators have been used in feature selection (Peng and Ding, 2005), clustering (Aghagolzadeh et al., 2007), causality detection (Hlaváčková-Schindler et al., 2007), optimal experimental design (Lewi et al., 2007; Póczos and Lőrincz, 2009), fMRI data processing (Chai et al., 2009), prediction of protein structures (Adami, 2004), and boosting and facial expression recognition (Shan et al., 2005). Both entropy and mutual information estimators have been used for independent component and subspace analysis (Learned-Miller and Fisher, 2003; Póczos and Lőrincz, 2005; Hulle, 2008; Szabó et al., 2007) and for image registration (Kybic, 2006; Hero et al., 2002b,a). For further applications, see Leonenko et al. (2008); Wang et al. (2009a).

In a naïve approach to Rényi entropy and mutual information estimation, one could use the so-called "plug-in" estimates. These are based on the obvious idea that since entropy and mutual information are determined solely by the density f (and its marginals), it suffices to first estimate the density using one's favorite density estimator, which is then "plugged in" to the formulas defining entropy and mutual information. The density is, however, a nuisance parameter that we do not want to estimate. Density estimators have tunable parameters, and we may need cross-validation to achieve good performance.

The entropy estimation algorithm considered here is direct: it does not build on density estimators. It is based on k-nearest-neighbor (NN) graphs with a fixed k. A variant of these estimators, where each sample point is connected to its k-th nearest neighbor only, was recently studied by Goria et al. (2005) for Shannon entropy estimation (i.e., the special case α = 1) and by Leonenko et al. (2008) for Rényi α-entropy estimation. They proved the weak consistency of their estimators under certain conditions.
However, their proofs contain some errors, and it is not obvious how to fix them. Namely, Leonenko et al. (2008) apply the generalized Helly-Bray theorem, while Goria et al. (2005) apply the inverse Fatou lemma, under conditions when these theorems do not hold. This latter error originates from the article of Kozachenko and Leonenko (1987), and the same mistake can also be found in Wang et al. (2009b).

The first main contribution of this paper is to give a correct proof of consistency of these estimators. Employing a very different proof technique than the papers mentioned above, we show that these estimators are, in fact, strongly consistent, provided that the unknown density f has bounded support and α ∈ (0, 1). At the same time, we allow for more general nearest-neighbor graphs: as opposed to connecting each point only to its k-th nearest neighbor, we allow each point to be connected to an arbitrary subset of its k nearest neighbors. Besides adding generality, our numerical experiments seem to suggest that connecting each sample point to all its k nearest neighbors improves the rate of convergence of the estimator.

The second major contribution of our paper is a finite-sample high-probability bound on the error (i.e., the rate of convergence) of our estimator, provided that f is Lipschitz. To the best of our knowledge, this is the first result that gives a rate for the estimation of Rényi entropy. The closest to our result in this respect is the work of Tsybakov and van der Meulen (1996), who proved the root-n consistency of an estimator of the Shannon entropy, and only in one dimension.

The third contribution is a strongly consistent estimator of Rényi mutual information that is based on NN graphs and the empirical copula transformation (Dedecker et al., 2007). This result is proved for d ≥ 3¹ and α ∈ (1/2, 1). It builds upon and extends the previous work of Póczos et al. (2010), where instead of NN graphs, the minimum spanning tree (MST) and the shortest tour through the sample (i.e., the traveling salesman problem, TSP) were used, but it was only conjectured that NN graphs can be applied as well.

There are several advantages of using the k-NN graph over MST and TSP (besides the obvious conceptual simplicity of k-NN): On a serial computer, the k-NN graph can be computed somewhat faster than the MST and much faster than the TSP tour. Furthermore, in contrast to MST and TSP, the computation of k-NN can easily be parallelized. Secondly, for different values of α, MST and TSP need to be recomputed, since the distance between two points is the p-th power of their Euclidean distance, where p = d(1 − α). However, the k-NN graph does not change for different values of p, since the p-th power is a monotone transformation; hence, the estimates for multiple values of α can be calculated without the extra penalty incurred by recomputing the graph. This can be advantageous, e.g., in intrinsic dimension estimators of manifolds (Costa and Hero, 2003), where p is a free parameter, and thus one can calculate the estimates efficiently for a few different parameter values.

The fourth major contribution is a proof of a finite-sample high-probability error bound (i.e., the rate of convergence) for our mutual information estimator, which holds under the assumption that the copula of f is Lipschitz. To the best of our knowledge, this is the first result that gives a rate for the estimation of Rényi mutual information.

The toolkit for proving our results derives from the deep literature on Euclidean functionals; see (Steele, 1997; Yukich, 1998).
In particular, our strong consistency result uses a theorem due to Redmond and Yukich (1996) that essentially states that any quasi-additive power-weighted Euclidean functional can be used as a strongly consistent estimator of Rényi entropy (see also Hero and Michel, 1999). We also make use of a result due to Koo and Lee (2007), who proved a rate-of-convergence result that holds under more stringent conditions. Thus, the main thrust of the present work is showing that these conditions hold for p-power weighted nearest-neighbor graphs. Curiously enough, up to now, no one has shown this, except for the case p = 1, which is studied in Section 8.3 of (Yukich, 1998). However, the condition p = 1 gives results only for α = 1 − 1/d.

All proofs and supporting lemmas can be found in the appendix. In the main body of the paper, we focus on a clear explanation of the Rényi entropy and mutual information estimation problems, the estimation algorithms, and the statements of our convergence results.

Additionally, we report on two numerical experiments. In the first experiment, we compare the empirical rates of convergence of our estimators with our theoretical results and with plug-in estimates. Empirically, the NN methods are the clear winner. The second experiment is an illustrative application of mutual information estimation to an Independent Subspace Analysis (ISA) task.

The paper is organized as follows: In the next section, we formally define Rényi entropy and Rényi mutual information and the problem of their estimation. Section 3 explains the 'generalized nearest-neighbor' graph, which is then used in Section 4 to define our Rényi entropy estimator. In the same section, we state a theorem containing our convergence results for this estimator (strong consistency and rates).

¹ Our result for Rényi entropy estimation holds for d = 1 and d = 2, too.
In Section 5, we explain the copula transformation, which connects Rényi entropy with Rényi mutual information. The copula transformation together with the Rényi entropy estimator from Section 4 is used to build an estimator of Rényi mutual information. We conclude that section with a theorem stating the convergence properties of the estimator (strong consistency and rates). Section 6 contains the numerical experiments. We conclude the paper with a detailed discussion of further related work in Section 7 and a list of open problems and directions for future research in Section 8.

2 The Formal Definition of the Problem

The Rényi entropy and Rényi mutual information of d real-valued random variables² X = (X^1, X^2, ..., X^d) with joint density f : R^d → R and marginal densities f_i : R → R, 1 ≤ i ≤ d, are defined for any real parameter α, assuming the underlying integrals exist. For α ≠ 1, Rényi entropy and Rényi mutual information are defined respectively as³

\[ H_\alpha(X) = H_\alpha(f) = \frac{1}{1-\alpha}\,\log \int_{\mathbb{R}^d} f^\alpha(x^1, x^2, \dots, x^d)\,\mathrm{d}(x^1, x^2, \dots, x^d), \tag{1} \]

\[ I_\alpha(X) = I_\alpha(f) = \frac{1}{\alpha-1}\,\log \int_{\mathbb{R}^d} f^\alpha(x^1, x^2, \dots, x^d) \left( \prod_{i=1}^{d} f_i(x^i) \right)^{1-\alpha} \mathrm{d}(x^1, x^2, \dots, x^d). \tag{2} \]

For α = 1, they are defined by the limits H_1 = lim_{α→1} H_α and I_1 = lim_{α→1} I_α. In fact, Shannon (differential) entropy and Shannon mutual information are just the special cases of Rényi entropy and Rényi mutual information with α = 1.

The goal of this paper is to present estimators of Rényi entropy (1) and Rényi mutual information (2) and to study their convergence properties. To be more explicit, we consider the problem where we are given i.i.d. random variables X_{1:n} = (X_1, X_2, ..., X_n), where each X_j = (X_j^1, X_j^2, ..., X_j^d) has density f : R^d → R and marginal densities f_i : R → R, and our task is to construct an estimate Ĥ_α(X_{1:n}) of H_α(f) and an estimate Î_α(X_{1:n}) of I_α(f) using the sample X_{1:n}.

3 Generalized Nearest-Neighbor Graphs

The basic tool used to define our estimators is the generalized nearest-neighbor graph and, more specifically, the sum of the p-th powers of the Euclidean lengths of its edges. Formally, let V be a finite set of points in a Euclidean space R^d and let S be a finite non-empty set of positive integers; we denote by k the maximum element of S. We define the generalized nearest-neighbor graph NN_S(V) as a directed graph on V. The edge set of NN_S(V) contains, for each i ∈ S, an edge from each vertex x ∈ V to its i-th nearest neighbor. That is, if we sort V \ {x} = {y_1, y_2, ..., y_{|V|−1}} according to the Euclidean distance to x (breaking ties arbitrarily),

\[ \|x - y_1\| \le \|x - y_2\| \le \dots \le \|x - y_{|V|-1}\|, \]

then y_i is the i-th nearest neighbor of x, and for each i ∈ S there is an edge from x to y_i in the graph. For p ≥ 0, let us denote by L_p(V) the sum of the p-th powers of the Euclidean lengths of its edges. Formally,

\[ L_p(V) = \sum_{(x,y) \in E(NN_S(V))} \|x - y\|^p, \tag{3} \]

where E(NN_S(V)) denotes the edge set of NN_S(V). We intentionally hide the dependence on S in the notation L_p(V). For the rest of the paper, the reader should think of S as a fixed but otherwise arbitrary finite non-empty set of integers, say, S = {1, 3, 4}.

The following is a basic result about L_p. The proof can be found in the appendix.

² We use superscripts for indexing dimension coordinates.
³ The base of the logarithms in the definition is not important; any base strictly bigger than 1 is allowed. As with Shannon entropy and mutual information, one traditionally uses either base 2 or base e. In this paper, for definiteness, we stick to base e.

Theorem 1 (Constant γ).
Let X_{1:n} = (X_1, X_2, ..., X_n) be an i.i.d. sample from the uniform distribution over the d-dimensional unit cube [0, 1]^d. For any p ≥ 0 and any finite non-empty set S of positive integers, there exists a constant γ > 0 such that

\[ \lim_{n \to \infty} \frac{L_p(X_{1:n})}{n^{1-p/d}} = \gamma \quad \text{a.s.} \tag{4} \]

The value of γ depends on d, p, and S, and, except for special cases, an analytical formula for its value is not known. This causes a minor problem, since the constant γ appears in our estimators. A simple and effective way to deal with this problem is to generate a large i.i.d. sample X_{1:n} from the uniform distribution over [0, 1]^d and to estimate γ by the empirical value of L_p(X_{1:n})/n^{1−p/d}.

4 An Estimator of Rényi Entropy

We are now ready to present an estimator of Rényi entropy based on the generalized nearest-neighbor graph. Suppose we are given an i.i.d. sample X_{1:n} = (X_1, X_2, ..., X_n) from a distribution μ over R^d with density f. We estimate the entropy H_α(f) for α ∈ (0, 1) by

\[ \hat H_\alpha(X_{1:n}) = \frac{1}{1-\alpha}\,\log \frac{L_p(X_{1:n})}{\gamma\, n^{1-p/d}}, \qquad \text{where } p = d(1-\alpha), \tag{5} \]

and L_p(·) is the sum of the p-th powers of the Euclidean lengths of the edges of the nearest-neighbor graph NN_S(·) for some finite non-empty S ⊂ N⁺, as defined by equation (3). The constant γ is the same as in Theorem 1.

The following theorem is our main result about the estimator Ĥ_α. It states that Ĥ_α is strongly consistent and gives upper bounds on its rate of convergence. The proof of the theorem is in the appendix.

Theorem 2 (Consistency and Rate for Ĥ_α). Let α ∈ (0, 1). Let μ be an absolutely continuous distribution over R^d with bounded support and let f be its density. If X_{1:n} = (X_1, X_2, ..., X_n) is an i.i.d. sample from μ, then

\[ \lim_{n \to \infty} \hat H_\alpha(X_{1:n}) = H_\alpha(f) \quad \text{a.s.} \tag{6} \]
Moreover, if f is Lipschitz, then for any δ > 0, with probability at least 1 − δ,

\[ \left| \hat H_\alpha(X_{1:n}) - H_\alpha(f) \right| \le \begin{cases} O\!\left( n^{-\frac{d-p}{d(2d-p)}} \,(\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } 0 < p < d-1; \\[4pt] O\!\left( n^{-\frac{d-p}{d(d+1)}} \,(\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } d-1 \le p < d. \end{cases} \tag{7} \]

5 Copulas and an Estimator of Mutual Information

Estimating mutual information is slightly more complicated than estimating entropy. We start with a basic property of mutual information, which we call rescaling. It states that if h_1, h_2, ..., h_d : R → R are arbitrary strictly increasing functions, then

\[ I_\alpha(h_1(X^1), h_2(X^2), \dots, h_d(X^d)) = I_\alpha(X^1, X^2, \dots, X^d). \tag{8} \]

A particularly clever choice is h_j = F_j for all 1 ≤ j ≤ d, where F_j is the cumulative distribution function (c.d.f.) of X^j. With this choice, the marginal distribution of h_j(X^j) is the uniform distribution over [0, 1], assuming that F_j, the c.d.f. of X^j, is continuous. Looking at the definitions of H_α and I_α, we see that

\[ I_\alpha(X^1, X^2, \dots, X^d) = I_\alpha(F_1(X^1), F_2(X^2), \dots, F_d(X^d)) = -H_\alpha(F_1(X^1), F_2(X^2), \dots, F_d(X^d)). \]

In other words, the calculation of mutual information can be reduced to the calculation of entropy, provided that the marginal c.d.f.'s F_1, F_2, ..., F_d are known. The problem is, of course, that these are not known and need to be estimated from the sample. We will use the empirical c.d.f.'s (F̂_1, F̂_2, ..., F̂_d) as their estimates. Given an i.i.d. sample X_{1:n} = (X_1, X_2, ..., X_n) from a distribution μ with density f, the empirical c.d.f.'s are defined as

\[ \hat F_j(x) = \frac{1}{n}\,\bigl|\{\, i : 1 \le i \le n,\ X_i^j \le x \,\}\bigr| \quad \text{for } x \in \mathbb{R},\ 1 \le j \le d. \]

Introduce the compact notation F : R^d → [0, 1]^d and F̂ : R^d → [0, 1]^d,

\[ F(x^1, x^2, \dots, x^d) = (F_1(x^1), F_2(x^2), \dots, F_d(x^d)) \quad \text{for } (x^1, x^2, \dots, x^d) \in \mathbb{R}^d; \tag{9} \]

\[ \hat F(x^1, x^2, \dots, x^d) = (\hat F_1(x^1), \hat F_2(x^2), \dots, \hat F_d(x^d)) \quad \text{for } (x^1, x^2, \dots, x^d) \in \mathbb{R}^d. \tag{10} \]

Let us call the maps F and F̂ the copula transformation and the empirical copula transformation, respectively. The joint distribution of F(X) = (F_1(X^1), F_2(X^2), ..., F_d(X^d)) is called the copula of μ, and the sample (Ẑ_1, Ẑ_2, ..., Ẑ_n) = (F̂(X_1), F̂(X_2), ..., F̂(X_n)) is called the empirical copula (Dedecker et al., 2007). Note that the j-th coordinate of Ẑ_i equals

\[ \hat Z_i^j = \tfrac{1}{n}\,\mathrm{rank}\bigl(X_i^j,\ \{X_1^j, X_2^j, \dots, X_n^j\}\bigr), \]

where rank(x, A) is the number of elements of A less than or equal to x. Also, observe that the random variables Ẑ_1, Ẑ_2, ..., Ẑ_n are not even independent! Nonetheless, the empirical copula (Ẑ_1, Ẑ_2, ..., Ẑ_n) is a good approximation of an i.i.d. sample (Z_1, Z_2, ..., Z_n) = (F(X_1), F(X_2), ..., F(X_n)) from the copula of μ. Hence, we estimate the Rényi mutual information I_α by

\[ \hat I_\alpha(X_{1:n}) = -\hat H_\alpha(\hat Z_1, \hat Z_2, \dots, \hat Z_n), \tag{11} \]

where Ĥ_α is defined by (5).

The following theorem is our main result about the estimator Î_α. It states that Î_α is strongly consistent and gives upper bounds on its rate of convergence. The proof of this theorem can be found in the appendix.

Theorem 3 (Consistency and Rate for Î_α). Let d ≥ 3 and α = 1 − p/d ∈ (1/2, 1). Let μ be an absolutely continuous distribution over R^d with density f. If X_{1:n} = (X_1, X_2, ..., X_n) is an i.i.d. sample from μ, then

\[ \lim_{n \to \infty} \hat I_\alpha(X_{1:n}) = I_\alpha(f) \quad \text{a.s.} \]
Moreover, if the density of the copula of μ is Lipschitz, then for any δ > 0, with probability at least 1 − δ,

\[ \left| \hat I_\alpha(X_{1:n}) - I_\alpha(f) \right| \le \begin{cases} O\!\left( \max\{ n^{-\frac{d-p}{d(2d-p)}},\ n^{-p/2 + p/d} \}\,(\log(1/\delta))^{1/2} \right), & \text{if } 0 < p \le 1; \\[4pt] O\!\left( \max\{ n^{-\frac{d-p}{d(2d-p)}},\ n^{-1/2 + p/d} \}\,(\log(1/\delta))^{1/2} \right), & \text{if } 1 \le p \le d-1; \\[4pt] O\!\left( \max\{ n^{-\frac{d-p}{d(d+1)}},\ n^{-1/2 + p/d} \}\,(\log(1/\delta))^{1/2} \right), & \text{if } d-1 \le p < d. \end{cases} \]

6 Experiments

In this section we present two numerical experiments: the first supports our theoretical results about the convergence rates, and the second demonstrates the applicability of the proposed Rényi mutual information estimator Î_α.

6.1 The Rate of Convergence

In our first experiment (Fig. 1), we demonstrate that the derived rate is indeed an upper bound on the convergence rate. Figures 1a-1c show the estimation error of Î_α as a function of the sample size. Here, the underlying distribution was a 3D uniform, a 3D Gaussian, and a 20D Gaussian with randomly chosen nontrivial covariance matrices, respectively. In these experiments, α was set to 0.7. For the estimation we used the sets S = {3} (kth) and S = {1, 2, 3} (knn). Our results also indicate that these estimators achieve better performance than the histogram-based plug-in estimators (hist). The number and the sizes of the bins were determined by the rule of Scott (1979). The histogram-based estimator is not shown in the 20D case, as in such a high dimension it is not applicable in practice. The figures are based on averaging 25 independent runs, and they also show the theoretical upper bound (Theoretical) on the rate derived in Theorem 3. It can be seen that the theoretical rates are rather conservative. We think that this is because the theory allows for quite irregular densities, while the densities considered in this experiment are very nice.
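To make the estimators concrete, the following is a minimal sketch (our own illustration, not the authors' code) of equations (3), (5), (10), and (11): L_p is computed by brute force, the constant γ is calibrated empirically on a uniform sample as suggested after Theorem 1, and Î_α applies Ĥ_α to the empirical copula. All helper names are ours. For independent coordinates the true I_α is 0, so the estimate should come out near 0.

```python
import numpy as np

def L_p(V, S, p):
    """Sum of p-th powers of edge lengths of the generalized
    nearest-neighbor graph NN_S(V), as in equation (3)."""
    V = np.asarray(V, dtype=float)
    # pairwise squared distances via the Gram matrix (brute force, O(n^2))
    sq = (V * V).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * V @ V.T, 0.0)
    np.fill_diagonal(D2, np.inf)       # a point is not its own neighbor
    D2.sort(axis=1)                    # row i: squared distances, ascending
    cols = [i - 1 for i in sorted(S)]  # i-th nearest neighbor -> column i-1
    return float((D2[:, cols] ** (p / 2.0)).sum())

def entropy_estimate(X, alpha, S, gamma):
    """Renyi entropy estimator (5), with p = d(1 - alpha)."""
    n, d = X.shape
    p = d * (1.0 - alpha)
    return np.log(L_p(X, S, p) / (gamma * n ** (1.0 - p / d))) / (1.0 - alpha)

def empirical_copula(X):
    """Empirical copula transformation (10): coordinate-wise normalized ranks."""
    n = X.shape[0]
    # argsort twice gives 0-based ranks along each column; +1 makes them 1-based
    return (X.argsort(axis=0).argsort(axis=0) + 1.0) / n

def mutual_information_estimate(X, alpha, S, gamma):
    """Renyi mutual information estimator (11)."""
    return -entropy_estimate(empirical_copula(X), alpha, S, gamma)

rng = np.random.default_rng(0)
d, alpha, S = 3, 0.7, {1, 2, 3}
p = d * (1.0 - alpha)
# calibrate gamma on a large uniform sample over [0,1]^d (Theorem 1)
U = rng.random((2000, d))
gamma = L_p(U, S, p) / 2000 ** (1.0 - p / d)
# independent coordinates => I_alpha = 0, so the estimate should be near 0
X = rng.standard_normal((1000, d))
mi = mutual_information_estimate(X, alpha, S, gamma)
print(round(float(mi), 2))
```

The O(n²) distance matrix is only for clarity; in practice a k-d tree query would replace it.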
[Figure 1: Error of the estimated Rényi information as a function of the number of samples (log-log scale), comparing kth, knn, hist, and the theoretical bound. (a) 3D uniform; (b) 3D Gaussian; (c) 20D Gaussian.]

6.2 Application to Independent Subspace Analysis

An important application of dependence estimators is the Independent Subspace Analysis (ISA) problem (Cardoso, 1998). This problem is a generalization of Independent Component Analysis (ICA), where we assume the independent sources are multidimensional vector-valued random variables. The formal description of the problem is as follows. We have S = (S_1; ...; S_m) ∈ R^{dm}, m independent d-dimensional sources, i.e., S_i ∈ R^d and I(S_1, ..., S_m) = 0.⁴ In the ISA statistical model, we assume that S is hidden, and only n i.i.d. samples from X = AS are available for observation, where A ∈ R^{q×dm} is an unknown invertible matrix with full rank and q ≥ dm. Based on the n i.i.d. observations of X, our task is to estimate the hidden sources S_i and the mixing matrix A. Let the estimate of S be denoted by Y = (Y_1; ...; Y_m) ∈ R^{dm}, where Y = WX. The goal of ISA is to calculate argmin_W I(Y_1, ..., Y_m), where W ∈ R^{dm×q} is a matrix with full rank. Following the ideas of Cardoso (1998), this ISA problem can be solved by first preprocessing the observed quantities X by a traditional ICA algorithm, which provides an estimated separation matrix W_ICA,⁵ and then simply grouping the estimated ICA components into ISA subspaces by maximizing the sum of the MI within the estimated subspaces; that is, we have to find a permutation matrix P ∈ {0, 1}^{dm×dm} which solves

\[ \max_{P} \sum_{j=1}^{m} I\bigl(Y_j^1, Y_j^2, \dots, Y_j^d\bigr), \tag{12} \]

where Y = P W_ICA X. We used the proposed copula-based information estimator Î_α with α = 0.99 to approximate the Shannon mutual information, and we chose S = {1, 2, 3}.
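The grouping step in (12) amounts to partitioning the dm one-dimensional ICA components into m blocks of size d so that the total within-block dependence is maximal. The following toy sketch (our own illustration, not the authors' code) does this by exhaustive search over partitions; a simple |correlation| score stands in for the copula-based Î_α estimator, and the helper names are ours.

```python
import itertools

import numpy as np

def equal_partitions(items, d):
    """Yield all partitions of `items` into blocks of size d (order ignored)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for others in itertools.combinations(rest, d - 1):
        block = (first,) + others
        remaining = [x for x in rest if x not in others]
        for tail in equal_partitions(remaining, d):
            yield [block] + tail

def group_components(Y, d, dep):
    """Grouping step of objective (12): among all partitions of the 1-D
    components (columns of Y) into blocks of size d, pick the one that
    maximizes the total within-block dependence.  `dep` scores one block;
    the search is exhaustive, so this is feasible only for small d*m."""
    cols = range(Y.shape[1])
    return max(equal_partitions(cols, d),
               key=lambda part: sum(dep(Y[:, list(b)]) for b in part))

# Toy check: components 0 and 2 are dependent, as are 1 and 3.
rng = np.random.default_rng(1)
s = rng.standard_normal((500, 2))
noise = 0.1 * rng.standard_normal((500, 2))
Y = np.column_stack([s[:, 0], s[:, 1],
                     s[:, 0] + noise[:, 0], s[:, 1] + noise[:, 1]])
dep = lambda block: abs(np.corrcoef(block, rowvar=False)[0, 1])
groups = sorted(map(sorted, group_components(Y, 2, dep)))
print(groups)  # recovers the dependent pairs: [[0, 2], [1, 3]]
```

In the actual experiment, the dependence score would be Î_α evaluated on each candidate block, and the partition search corresponds to the search over permutation matrices P in (12).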
Our experiment shows that this ISA algorithm using the proposed MI estimator can indeed provide a good estimation of the ISA subspaces. We used a standard ISA benchmark dataset from Szabó et al. (2007): we generated 2,000 i.i.d. sample points from 6 different 3D geometric wireframe distributions, independently from each other. These sampled points can be seen in Fig. 2a; they represent the sources S. Then we mixed these sources by a randomly chosen invertible matrix A ∈ R^{18×18}. The six 3-dimensional projections of the observed quantities X = AS are shown in Fig. 2b. Our task was to estimate the original sources S using the sample of the observed quantity X only. By estimating the MI in (12), we could recover the original subspaces, as can be seen in Fig. 2c. The successful subspace separation is also shown in the form of a Hinton diagram, i.e., the product of the estimated ISA separation matrix W = P W_ICA and A. This product is a block permutation matrix if and only if the subspace separation is perfect (Fig. 2d).

[Figure 2: ISA experiment for six 3-dimensional sources. (a) Original; (b) Mixed; (c) Estimated; (d) Hinton diagram.]

⁴ Here we need the generalization of MI to multidimensional quantities, but that is obvious: simply replace the 1D marginals by d-dimensional ones.
⁵ For simplicity, we used the FastICA algorithm in our experiments (Hyvärinen et al., 2001).

7 Further Related Work

As pointed out earlier, in this paper we build heavily on results known from the theory of Euclidean functionals (Steele, 1997; Redmond and Yukich, 1996; Koo and Lee, 2007). However, now we can be more precise about earlier work concerning nearest-neighbor-based Euclidean functionals: The closest to our work is Section 8.3 of Yukich (1998), where the case of NN_S-graph-based p-power weighted Euclidean functionals with S = {1, 2, ..., k} and p = 1 was investigated.
Nearest-neighbor graphs were first proposed for Shannon entropy estimation by Kozachenko and Leonenko (1987). In that work, only the case of NN_S graphs with S = {1} was considered. More recently, Goria et al. (2005) generalized this approach to S = {k} and proved the resulting estimator's weak consistency under some conditions on the density. Their estimator has a form quite similar to ours:

\[ \tilde H_1 = \log(n-1) - \psi(k) + \log \frac{2\pi^{d/2}}{d\,\Gamma(d/2)} + \frac{d}{n} \sum_{i=1}^{n} \log \|e_i\|. \]

Here ψ stands for the digamma function, and e_i is the directed edge pointing from X_i to its k-th nearest neighbor. Comparing this with (5), unsurprisingly, we find that the main difference is the use of the logarithm function instead of |·|^p and the different normalization.

As mentioned before, Leonenko et al. (2008) proposed an estimator that uses the NN_S graph with S = {k} for the purpose of estimating the Rényi entropy. Their estimator takes the form

\[ \tilde H_\alpha = \frac{1}{1-\alpha}\,\log \left( \frac{n-1}{n}\, V_d^{1-\alpha}\, C_k^{1-\alpha} \sum_{i=1}^{n} \frac{\|e_i\|^{d(1-\alpha)}}{(n-1)^{\alpha}} \right), \]

where Γ stands for the Gamma function, C_k = [Γ(k)/Γ(k+1−α)]^{1/(1−α)}, V_d = π^{d/2}/Γ(d/2 + 1) is the volume of the d-dimensional unit ball, and again e_i is the directed edge in the NN_S graph starting from node X_i and pointing to the k-th nearest node. Comparing this estimator with (5), it is apparent that it is (essentially) a special case of our NN_S-based estimator. From the results of Leonenko et al. (2008), it follows that the constant γ in (5) can be found in analytical form when S = {k}. However, we warn the reader again that the proofs of the last three cited articles (Kozachenko and Leonenko, 1987; Goria et al., 2005; Leonenko et al., 2008) contain a few errors, just like the Wang et al. (2009b) paper on KL divergence estimation from two samples. Kraskov et al. (2004) also proposed a k-nearest-neighbor-based estimator for Shannon mutual information estimation, but the theoretical properties of their estimator are unknown.

8 Conclusions and Open Problems

We have studied Rényi entropy and mutual information estimators based on NN_S graphs. The estimators were shown to be strongly consistent. In addition, we derived upper bounds on their convergence rates under some technical conditions. Several open problems remain unanswered:

An important open problem is to understand how the choice of the set S ⊂ N⁺ affects our estimators. Perhaps there exists a way to choose S as a function of the sample size n (and d, p) which strikes the optimal balance between the bias and the variance of our estimators.

Our method can be used for the estimation of Shannon entropy and mutual information by simply using α close to 1. The open problem is to come up with a way of choosing α, approaching 1, as a function of the sample size n (and d, p) such that the resulting estimator is consistent and converges as rapidly as possible. An alternative is to use the logarithm function in place of the power function. However, the theory would need to be changed significantly to show that the resulting estimator remains strongly consistent.

In the proof of consistency of our mutual information estimator Î_α, we used the Dvoretzky-Kiefer-Wolfowitz theorem to handle the effect of the inaccuracy of the empirical copula transformation. Our particular use of the theorem seems to restrict α to the interval (1/2, 1) and the dimension to values larger than 2. Is there a better way to estimate the error caused by the empirical copula transformation and to prove consistency of the estimator for a larger range of α's and for d = 1, 2?

Finally, it is an important open problem to prove bounds on convergence rates for densities that have higher-order smoothness (i.e., β-Hölder smooth densities).
A related open problem, in the context of the theory of Euclidean functionals, is stated in Koo and Lee (2007).

Acknowledgements

This work was supported in part by AICML, AITF (formerly iCore and AIF), NSERC, the PASCAL2 Network of Excellence under EC grant no. 216886, and by the Department of Energy under grant number DESC0002607. Cs. Szepesvári is on leave from SZTAKI, Hungary.

References

C. Adami. Information theory in molecular biology. Physics of Life Reviews, 1:3-22, 2004.
M. Aghagolzadeh, H. Soltanian-Zadeh, B. Araabi, and A. Aghagolzadeh. A hierarchical clustering based on mutual information maximization. In IEEE ICIP, pages 277-280, 2007.
P. A. Alemany and D. H. Zanette. Fractal random walks from a variational formalism for Tsallis entropies. Phys. Rev. E, 49(2):R956-R958, Feb 1994.
N. Alon and J. Spencer. The Probabilistic Method. John Wiley & Sons, 2nd edition, 2000.
J. Cardoso. Multidimensional independent component analysis. In Proc. ICASSP'98, Seattle, WA, 1998.
B. Chai, D. B. Walther, D. M. Beck, and L. Fei-Fei. Exploring functional connectivity of the human brain using multivariate information analysis. In NIPS, 2009.
J. A. Costa and A. O. Hero. Entropic graphs for manifold learning. In IEEE Asilomar Conf. on Signals, Systems, and Computers, 2003.
J. Dedecker, P. Doukhan, G. Lang, J. R. Leon, S. Louhichi, and C. Prieur. Weak Dependence: With Examples and Applications, volume 190 of Lecture Notes in Statistics. Springer, 2007.
L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.
D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277-297, 2005.
A. O. Hero and O. J. Michel.
Asymptotic theory of greedy approximations to minimal k-point random graphs. IEEE Trans. on Information Theory, 45(6):1921-1938, 1999.
A. O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval, 2002a. Communications and Signal Processing Laboratory Technical Report CSPL-328.
A. O. Hero, B. Ma, O. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85-95, 2002b.
D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459-460, 1891.
K. Hlaváčková-Schindler, M. Paluš, M. Vejmelka, and J. Bhattacharya. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441:1-46, 2007.
M. M. Van Hulle. Constrained subspace ICA based on mutual information optimization directly. Neural Computation, 20:964-973, 2008.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, New York, 2001.
Y. Koo and S. Lee. Rates of convergence of means of Euclidean functionals. Journal of Theoretical Probability, 20(4):821-841, 2007.
L. F. Kozachenko and N. N. Leonenko. A statistical estimate for the entropy of a random vector. Problems of Information Transmission, 23:9-16, 1987.
A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, 2004.
J. Kybic. Incremental updating of nearest neighbor-based high-dimensional entropy estimation. In Proc. Acoustics, Speech and Signal Processing, 2006.
E. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271-1295, 2003.
N. Leonenko, L. Pronzato, and V. Savani. A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153-2182, 2008.
J. Lewi, R. Butera, and L. Paninski.
Real-time adaptiv e information-theoretic optimization of neurophysiology experiments. In Advances in Neural Information Pr ocessing Systems , volume 19, 2007. S. C. Milne. Peano curves and smoothness of functions. Advances in Mathematics , 35:129–157, 1980. G. Peano. Sur une qui remplit toute une aire plane. Mathematische Annalen , 36(1):157–160, 1890. H. Peng and C. Ding. Feature selection based on mutual information: Criteria of max-dependency , max- relev ance, and min-redundancy . IEEE T rans On P attern Analysis and Machine Intelligence , 27, 2005. B. P ´ oczos and A. L ˝ orincz. Independent subspace analysis using geodesic spanning trees. In ICML , pages 673–680, 2005. B. P ´ oczos and A. L ˝ orincz. Identification of recurrent neural networks by Bayesian interrogation techniques. Journal of Mac hine Learning Research , 10:515–554, 2009. B. P ´ oczos, S. Kirshner , and Cs. Szepesv ´ ari. REGO: Rank-based estimation of R ´ enyi information using Eu- clidean graph optimization. In AISTA TS 2010 , 2010. C. Redmond and J. E. Y ukich. Asymptotics for Euclidean functionals with power -weighted edges. Stochastic pr ocesses and their applications , 61(2):289–304, 1996. D. W . Scott. On optimal and data-based histograms. Biometrika , 66:605–610, 1979. C. Shan, S. Gong, and P . W . Mcowan. Conditional mutual information based boosting for facial expression recognition. In British Machine V ision Confer ence (BMVC) , 2005. J. M. Steele. Pr obability Theory and Combinatorial Optimization . Society for Industrial and Applied Mathe- matics, 1997. Z. Szab ´ o, B. P ´ oczos, and A. L ˝ orincz. Undercomplete blind subspace decon volution. Journal of Machine Learning Resear ch , 8:1063–1095, 2007. M. T alagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Math- ematiques de l’IHES , 81(1):73–205, 1995. A. B. Tsybakov and E. C. van der Meulen. Root- n consistent estimators of entropy for densities with unbounded support. 
Scandinavian Journal of Statistics , 23:75–83, 1996. O. V asicek. A test for normality based on sample entropy . Journal of the Royal Statistical Society , Series B , 38:54–59, 1976. Q. W ang, S. R. Kulkarni, and S. V erd ´ u. Univ ersal estimation of information measures for analog sources. F oundations and T rends in Communications and Information Theory , 5(3):265–352, 2009a. Q. W ang, S. R. Kulkarni, and S. V erd ´ u. Div ergence estimation for multidimensional densities via k -nearest- neighbor distances. IEEE Tr ansactions on Information Theory , 55(5):2392–2405, 2009b. E. W olsztynski, E. Thierry , and L. Pronzato. Minimum-entropy estimation in semi-parametric models. Signal Pr ocess. , 85(5):937–949, 2005. J. E. Y ukich. Probability Theory of Classical Euclidean Optimization Pr oblems . Springer, 1998. 9 A Quasi-Additive and V ery Str ong Euclidean Functionals The basic tool to prove con vergence properties of our estimators is the theory of quasi-additiv e Euclidean functionals dev eloped by Y ukich (1998); Steele (1997); Redmond and Y ukich (1996); K oo and Lee (2007) and others. W e apply this machinery to the nearest neighbor functional L p defined in equation (3). In particular , we use the axiomatic definition of a quasi-additiv e Euclidean functional from Y ukich (1998) and the definition of a very strong Euclidean functional from K oo and Lee (2007) who add two extra axioms. W e then use the results of Redmond and Y ukich (1996) and Koo and Lee (2007) which hold for these kinds of functionals. These results determine the limit behavior of the func- tionals on a set of points chosen i.i.d. from an absolutely continuous distribution over R d . As we show in the follo wing sections, the nearest neighbor functional L p defined by equation (3) is a very strong Euclidean functional and thus both of these results apply to it. 
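Throughout the appendix we work with the nearest neighbor functional $L_p$ of equation (3): the sum of the $p$-th powers of the Euclidean lengths of the edges of the generalized nearest-neighbor graph $NN_S(V)$, which has an edge from each point to its $i$-th nearest neighbor for every $i \in S$. As a concrete reference, here is a minimal brute-force sketch of this quantity in Python (the function name `nn_functional` is our own; an efficient implementation would replace the $O(|V|^2)$ scan with a k-d tree):

```python
import math

def nn_functional(points, S, p):
    """L_p(V): sum of p-th powers of the edge lengths of the generalized
    nearest-neighbor graph NN_S(V).  For every vertex x and every i in S
    there is a directed edge from x to its i-th nearest neighbor in V."""
    total = 0.0
    for i, x in enumerate(points):
        # distances from x to every other point, in increasing order
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        for s in S:
            total += dists[s - 1] ** p  # edge to the s-th nearest neighbor
    return total
```

For example, for the three corners $V = \{(0,0), (1,0), (0,1)\}$ with $S = \{1\}$ each point's nearest neighbor is at distance $1$, so $L_1(V) = 3$.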
Technically, a quasi-additive Euclidean functional is a pair of real non-negative functionals $(L_p(V), L^*_p(V, B))$, where $B \subset \mathbb{R}^d$ is a $d$-dimensional cube and $V \subset B$ is a finite set of points. Here, a $d$-dimensional cube is a set of the form $\prod_{i=1}^d [a_i, a_i + s]$, where $(a_1, a_2, \dots, a_d) \in \mathbb{R}^d$ is its "lower-left" corner and $s > 0$ is its side length. The functional $L^*_p$ is called the boundary functional. Common practice is to neglect $L^*_p$ and refer to the pair $(L_p(V), L^*_p(V, B))$ simply as $L_p$. We provide a boundary functional $L^*_p$ for the nearest neighbor functional $L_p$ in the next section.

Definition 4 (Quasi-additive Euclidean functional). $L_p$ is a quasi-additive Euclidean functional of power $p$ if it satisfies axioms (A1)–(A7) below.

Definition 5 (Very strong Euclidean functional). $L_p$ is a very strong Euclidean functional of power $p$ if it satisfies axioms (A1)–(A9) below.

Axioms. For all cubes $B \subseteq \mathbb{R}^d$, any finite $V \subseteq B$, all $y \in \mathbb{R}^d$, and all $t > 0$:
$$L_p(\emptyset) = 0; \qquad L^*_p(\emptyset, B) = 0; \tag{A1}$$
$$L_p(y + V) = L_p(V); \qquad L^*_p(y + V, y + B) = L^*_p(V, B); \tag{A2}$$
$$L_p(tV) = t^p L_p(V); \qquad L^*_p(tV, tB) = t^p L^*_p(V, B); \tag{A3}$$
$$L_p(V) \ge L^*_p(V, B). \tag{A4}$$
For all finite $V \subseteq [0,1]^d$ and any partition $\{Q_i : 1 \le i \le m^d\}$ of $[0,1]^d$ into $m^d$ subcubes of side $1/m$,
$$L_p(V) \le \sum_{i=1}^{m^d} L_p(V \cap Q_i) + O(m^{d-p}), \qquad L^*_p(V, [0,1]^d) \ge \sum_{i=1}^{m^d} L^*_p(V \cap Q_i, Q_i) - O(m^{d-p}). \tag{A5}$$
For all finite $V, V' \subseteq [0,1]^d$,
$$|L_p(V') - L_p(V)| \le O(|V' \triangle V|^{1-p/d}); \qquad |L^*_p(V', [0,1]^d) - L^*_p(V, [0,1]^d)| \le O(|V' \triangle V|^{1-p/d}). \tag{A6}$$
For a set $U_n$ of $n$ points drawn i.i.d. from the uniform distribution over $[0,1]^d$,
$$|\mathbb{E}\, L_p(U_n) - \mathbb{E}\, L^*_p(U_n, [0,1]^d)| \le o(n^{1-p/d}); \tag{A7}$$
$$|\mathbb{E}\, L_p(U_n) - \mathbb{E}\, L^*_p(U_n, [0,1]^d)| \le O(\max(n^{1-p/d-1/d}, 1)); \tag{A8}$$
$$|\mathbb{E}\, L_p(U_n) - \mathbb{E}\, L_p(U_{n+1})| \le O(n^{-p/d}). \tag{A9}$$

Axiom (A2) is translation invariance and axiom (A3) is scaling. The first part of (A5) is subadditivity of $L_p$ and the second part is superadditivity of $L^*_p$. Axiom (A6) is smoothness, and we call (A7) quasi-additivity. Axiom (A8) is a strengthening of (A7) with an explicit rate, and axiom (A9) is the add-one bound. The axioms in Koo and Lee (2007) are slightly different; however, it is routine to check that they are implied by our set of axioms.

We will use two fundamental results about Euclidean functionals. The first is (Redmond and Yukich, 1996, Theorem 2.2) and the second is essentially (Koo and Lee, 2007, Theorem 4).

Theorem 6 (Redmond–Yukich). Let $L_p$ be a quasi-additive Euclidean functional of power $0 < p < d$. Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous distribution over $[0,1]^d$ with common probability density function $f : [0,1]^d \to \mathbb{R}$. Then,
$$\lim_{n \to \infty} \frac{L_p(V_n)}{n^{1-p/d}} = \gamma \int_{[0,1]^d} f^{1-p/d}(x)\,dx \quad \text{a.s.},$$
where $\gamma := \gamma(L_p, d)$ is a constant depending only on the functional $L_p$ and $d$.

Theorem 7 (Koo–Lee). Let $L_p$ be a very strong Euclidean functional of power $0 < p < d$. Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous distribution over $[0,1]^d$ with common probability density function $f : [0,1]^d \to \mathbb{R}$. If $f$ is Lipschitz^6, then
$$\left| \frac{\mathbb{E}\, L_p(V_n)}{n^{1-p/d}} - \gamma \int_{[0,1]^d} f^{1-p/d}(x)\,dx \right| \le \begin{cases} O\!\left(n^{-\frac{d-p}{d(2d-p)}}\right), & \text{if } 0 < p < d-1; \\[4pt] O\!\left(n^{-\frac{d-p}{d(d+1)}}\right), & \text{if } d-1 \le p < d, \end{cases}$$
where $\gamma$ is the constant from Theorem 6.

Theorem 7 differs from its original statement (Koo and Lee, 2007, Theorem 4) in two ways. First, our version is restricted to Lipschitz densities. Koo and Lee prove a generalization of Theorem 7 for $\beta$-Hölder smooth density functions, and the coefficient $\beta$ then appears in the exponent of $n$ in the rate. However, their result holds only for $\beta$ in the interval $(0, 1]$, which limits its scope.
The case $\beta = 1$ corresponds to Lipschitz densities and is perhaps the most important one in this range. Second, Theorem 7 slightly improves the rate: Koo and Lee have an extraneous $\log(n)$ factor, which we remove by "correcting" their axiom (A8).

In the next section, we prove that the nearest neighbor functional $L_p$ defined by (3) is a very strong Euclidean functional. First, in Section B, we provide a boundary functional $L^*_p$ for $L_p$. Then, in Section C, we verify that $(L_p, L^*_p)$ satisfy axioms (A1)–(A9). Once the verification is done, Theorem 1 follows from Theorem 6. Theorem 2 will follow from Theorem 7 and a concentration result; we prove the concentration result in Section D and finish that section with the proof of Theorem 2. The proof of Theorem 3 requires more work, since we need to deal with the effect of the empirical copula transformation. We handle this in Section E by employing the classical Dvoretzky–Kiefer–Wolfowitz theorem.

B The Boundary Functional $L^*_p$

We start by constructing the nearest neighbor boundary functional $L^*_p$. For that we will need an auxiliary graph, which we call the nearest-neighbor graph with boundary. This graph is related to $NN_S$ and will be useful later. Let $B$ be a $d$-dimensional cube, $V \subset B$ be finite, and $S \subset \mathbb{N}_+$ be non-empty and finite. We define the nearest-neighbor graph with boundary $NN^*_S(V, B)$ to be a directed graph, with possibly parallel edges, on the vertex set $V \cup \partial B$, where $\partial B$ denotes the boundary of $B$. Roughly speaking, for every vertex $x \in V$ and every $i \in S$ there is an edge to its "$i$-th nearest neighbor" in $V \cup \partial B$. More precisely, we define the edges from $x \in V$ as follows. Let $b \in \partial B$ be the boundary point closest to $x$. (If there are multiple closest boundary points, we choose one arbitrarily.) If $(x, y) \in E(NN_S(V))$ and $\|x - y\| \le \|x - b\|$, then $(x, y)$ also belongs to $E(NN^*_S(V, B))$. For each $(x, y) \in E(NN_S(V))$ such that $\|x - y\| > \|x - b\|$, we create in $NN^*_S(V, B)$ one copy of the edge $(x, b)$. In other words, there is a bijection between the edge sets $E(NN_S(V))$ and $E(NN^*_S(V, B))$. An example of a graph $NN_S(V)$ and a corresponding graph $NN^*_S(V, B)$ is shown in Figure 3. Analogously to $L_p(V)$, we define $L^*_p(V, B)$ as the sum of the $p$-th powers of the edge lengths of $NN^*_S(V, B)$. Formally,
$$L^*_p(V, B) = \sum_{(x,y) \in E(NN^*_S(V, B))} \|x - y\|^p. \tag{13}$$

^6 Recall that a function $f$ is Lipschitz if there exists a constant $C > 0$ such that $|f(x) - f(y)| \le C \|x - y\|$ for all $x, y$ in the domain of $f$.

Figure 3: Figure (a) shows an example of a nearest neighbor graph $NN_S(V)$ in two dimensions, and a corresponding boundary nearest neighbor graph $NN^*_S(V, B)$ is shown in Figure (b). We have used $S = \{1\}$, $B = [0,1]^2$ and a set $V$ consisting of 13 points in $B$.

We will need some basic geometric properties of $NN^*_S(V, B)$ and $L^*_p(V, B)$. By construction, the edges of $NN^*_S(V, B)$ are no longer than the corresponding edges of $NN_S(V)$. As an immediate consequence we get the following proposition.

Proposition 8 (Upper Bound). For any cube $B$, any $p \ge 0$ and any finite set $V \subset B$, $L^*_p(V, B) \le L_p(V)$.

C Verification of Axioms (A1)–(A9) for $(L_p, L^*_p)$

It is easy to see that the nearest neighbor functional $L_p$ and its boundary functional $L^*_p$ satisfy axioms (A1)–(A3). Axiom (A4) is verified by Proposition 8. It thus remains to verify axioms (A5)–(A9), which we do in subsections C.1, C.2 and C.3. We start with two simple lemmas.

Lemma 9 (In-Degree). For any finite $V \subseteq \mathbb{R}^d$, the in-degree of any vertex in $NN_S(V)$ is $O(1)$.

Proof. Fix a vertex $x \in V$. We show that the in-degree of $x$ is bounded by a constant that depends only on $d$ and $k = \max S$.
For any unit vector $u \in \mathbb{R}^d$, consider the open convex cone $Q(x, u)$ with apex $x$, rotationally symmetric about its axis $u$, with half-angle $30^\circ$:
$$Q(x, u) = \left\{ y \in \mathbb{R}^d : u \cdot (y - x) > \tfrac{\sqrt{3}}{2}\, \|y - x\| \right\}.$$
As is well known, $\mathbb{R}^d \setminus \{x\}$ can be written as a union of finitely many, possibly overlapping, cones $Q(x, u_1), Q(x, u_2), \dots, Q(x, u_B)$, where $B$ depends only on the dimension $d$. We show that the in-degree of $x$ is at most $kB$. Suppose, for contradiction, that the in-degree of $x$ is larger than $kB$. Then, by the pigeonhole principle, there is a cone $Q(x, u)$ containing $k + 1$ vertices of the graph with an incoming edge to $x$. Denote these vertices $y_1, y_2, \dots, y_{k+1}$, indexed so that $\|x - y_1\| \le \|x - y_2\| \le \dots \le \|x - y_{k+1}\|$. A simple calculation verifies that $\|x - y_{k+1}\| > \|y_i - y_{k+1}\|$ for all $1 \le i \le k$. Indeed, by the law of cosines,
$$\|y_i - y_{k+1}\|^2 = \|x - y_i\|^2 + \|x - y_{k+1}\|^2 - 2(x - y_i) \cdot (x - y_{k+1}) < \|x - y_i\|^2 + \|x - y_{k+1}\|^2 - \|x - y_i\|\, \|x - y_{k+1}\| \le \|x - y_{k+1}\|^2,$$
where the strict inequality follows from the fact that $y_i, y_{k+1} \in Q(x, u)$, so the angle between the vectors $(x - y_i)$ and $(x - y_{k+1})$ is strictly less than $60^\circ$, and the second inequality follows from $\|x - y_i\| \le \|x - y_{k+1}\|$. Thus, $x$ cannot be among the $k$ nearest neighbors of $y_{k+1}$, which contradicts the existence of the edge $(y_{k+1}, x)$. □

Lemma 10 (Growth Bound). For any $p \ge 0$ and finite $V \subset [0,1]^d$, $L_p(V) \le O(\max(|V|^{1-p/d}, 1))$.

Proof. An elegant way to prove the lemma is with the use of space-filling curves.^7 Since Peano (1890) and Hilbert (1891), it has been known that there exists a continuous function $\psi$ from the unit interval $[0,1]$ onto the cube $[0,1]^d$ (i.e. a surjection). For obvious reasons, $\psi$ is called a space-filling curve. Moreover, there are space-filling curves that are $(1/d)$-Hölder; see Milne (1980). In other words, we can assume that there exists a constant $C > 0$ such that
$$\|\psi(x) - \psi(y)\| \le C\, |x - y|^{1/d} \qquad \forall x, y \in [0,1]. \tag{14}$$
Since $\psi$ is surjective, we can consider a right inverse $\psi^{-1} : [0,1]^d \to [0,1]$, i.e. a function such that $\psi(\psi^{-1}(x)) = x$, and we let $W = \psi^{-1}(V)$. Let $0 \le w_1 < w_2 < \dots < w_{|V|} \le 1$ be the points of $W$ sorted in increasing order. We construct a "nearest neighbor" graph $G$ on $W$: for every $1 \le j \le |V|$ and every $i \in S$ we create a directed edge $(w_j, w_{j+i})$, where the index $j + i$ is taken modulo $|V|$. It is not hard to see that the total length of the edges of $G$ satisfies
$$\sum_{(x,y) \in E(G)} |x - y| \le O(k^2) = O(1). \tag{15}$$
To see why (15) holds, note that every line segment $[w_i, w_{i+1}]$, $1 \le i < |V|$, belongs to at most $O(k^2)$ edges, and the total length of the line segments is $\sum_{i=1}^{|V|-1} (w_{i+1} - w_i) \le 1$. Let $H$ be a graph on $V \subset [0,1]^d$ isomorphic to $G$, where for each edge $(w_i, w_j) \in E(G)$ there is a corresponding edge $(\psi(w_i), \psi(w_j)) \in E(H)$. By the construction of $H$,
$$L_p(V) \le \sum_{(x,y) \in E(H)} \|x - y\|^p = \sum_{(x,y) \in E(G)} \|\psi(x) - \psi(y)\|^p. \tag{16}$$
The Hölder property of $\psi$ implies that
$$\sum_{(x,y) \in E(G)} \|\psi(x) - \psi(y)\|^p \le C^p \sum_{(x,y) \in E(G)} |x - y|^{p/d}. \tag{17}$$
If $p \ge d$ then $|x - y|^{p/d} \le |x - y|$ since $|x - y| \in [0,1]$, and thus
$$\sum_{(x,y) \in E(G)} |x - y|^{p/d} \le \sum_{(x,y) \in E(G)} |x - y|.$$
Chaining the last inequality with (16), (17) and (15), we obtain that $L_p(V) \le O(1)$ for $p \ge d$. If $0 < p < d$, we use the inequality between the arithmetic mean and the $(p/d)$-mean. It states that for positive numbers $a_1, a_2, \dots, a_n$,
$$\left( \frac{\sum_{i=1}^n a_i^{p/d}}{n} \right)^{d/p} \le \frac{\sum_{i=1}^n a_i}{n}, \quad \text{or equivalently} \quad \sum_{i=1}^n a_i^{p/d} \le n^{1-p/d} \left( \sum_{i=1}^n a_i \right)^{p/d}.$$
In our case the $a_i$'s are the edge lengths of $G$ and $n \le k|V|$, and we have
$$\sum_{(x,y) \in E(G)} |x - y|^{p/d} \le (k|V|)^{1-p/d} \left( \sum_{(x,y) \in E(G)} |x - y| \right)^{p/d}.$$
Combining the last inequality with (16), (17) and (15), we get that $L_p(V) \le O(|V|^{1-p/d})$ for $0 < p < d$. Finally, for $p = 0$, $L_p(V) \le k|V| = O(|V|)$. □

^7 There is an elementary proof, too, based on a discretization argument. However, that proof introduces an extraneous logarithmic factor when $p = d$.

C.1 Smoothness

In this section, we verify axiom (A6).

Lemma 11 (Smoothness of $L_p$). For $p \ge 0$ and finite disjoint $V, V' \subset [0,1]^d$,
$$|L_p(V' \cup V) - L_p(V')| \le O(\max(|V|^{1-p/d}, 1)).$$

Proof. For $p \ge d$ the lemma trivially follows from the growth bound: $L_p(V') = O(1)$ and $L_p(V' \cup V) = O(1)$. For $0 \le p < d$, we need to prove two inequalities:
$$L_p(V' \cup V) \le L_p(V') + O(|V|^{1-p/d}) \quad \text{and} \quad L_p(V') \le L_p(V' \cup V) + O(|V|^{1-p/d}).$$
We start with the first inequality. We use the obvious property of $L_p$ that $L_p(V' \cup V) \le L_p(V') + L_p(V) + O(1)$. Combined with the growth bound (Lemma 10) for $V$, we get
$$L_p(V' \cup V) \le L_p(V') + L_p(V) + O(1) \le L_p(V') + O(|V|^{1-p/d}) + O(1) \le L_p(V') + O(|V|^{1-p/d}).$$
The second inequality is a bit trickier to prove. We introduce a generalized nearest-neighbor graph $NN_S(W, W')$ for any pair of finite sets $W \subseteq W' \subset \mathbb{R}^d$: we define $NN_S(W, W')$ as the subgraph of $NN_S(W')$ in which all edges from $W' \setminus W$ are deleted. Similarly, we define $L_p(W, W')$ as the sum of the $p$-th powers of the lengths of the edges of $NN_S(W, W')$:
$$L_p(W, W') = \sum_{(x,y) \in E(NN_S(W, W'))} \|x - y\|^p.$$
We will use two obvious properties of $L_p(W, W')$, valid for any finite $W \subseteq W' \subset \mathbb{R}^d$:
$$L_p(W, W) = L_p(W) \quad \text{and} \quad L_p(W, W') \le L_p(W) + O(1). \tag{18}$$
Let $U \subseteq V'$ be the set of vertices $x$ such that in $NN_S(V' \cup V)$ there exists an edge from $x$ to a vertex in $V$.
Using the two observations and the growth bound, we have
$$L_p(V') = L_p(V', V') = L_p(U, V') + L_p(V' \setminus U, V') \le L_p(U) + O(1) + L_p(V' \setminus U, V') \le O(|U|^{1-p/d}) + L_p(V' \setminus U, V').$$
The term $L_p(V' \setminus U, V')$ can be upper bounded by $L_p(V' \cup V)$ since, by the choice of $U$, the graph $NN_S(V' \setminus U, V')$ is a subgraph of $NN_S(V' \cup V)$. The term $O(|U|^{1-p/d})$ is at most $O(|V|^{1-p/d})$ since $|U|$ is upper bounded by the number of edges of $NN_S(V' \cup V)$ ending in $V$ and, in turn, the number of these edges is at most $O(|V|)$ by the in-degree lemma. □

Corollary 12 (Smoothness of $L_p$). For $p \ge 0$ and finite $V, V' \subset [0,1]^d$,
$$|L_p(V') - L_p(V)| \le O(\max(|V' \triangle V|^{1-p/d}, 1)),$$
where $V' \triangle V$ denotes the symmetric difference.

Proof. Applying the previous lemma twice,
$$|L_p(V') - L_p(V)| \le |L_p(V') - L_p(V' \cup V)| + |L_p(V' \cup V) - L_p(V)| = |L_p(V') - L_p(V' \cup (V \setminus V'))| + |L_p(V \cup (V' \setminus V)) - L_p(V)|$$
$$\le O(\max(|V \setminus V'|^{1-p/d}, 1)) + O(\max(|V' \setminus V|^{1-p/d}, 1)) = O(\max(|V' \triangle V|^{1-p/d}, 1)). \qquad \square$$

Lemma 13 (Smoothness of $L^*_p$). For $p \ge 0$ and finite disjoint $V, V' \subset [0,1]^d$,
$$|L^*_p(V' \cup V, [0,1]^d) - L^*_p(V', [0,1]^d)| \le O(\max(|V|^{1-p/d}, 1)).$$

Proof. The proof is identical to the proof of Lemma 11 if we replace $L_p(\cdot)$ by $L^*_p(\cdot, [0,1]^d)$, $NN_S(\cdot)$ by $NN^*_S(\cdot, [0,1]^d)$, $L_p(\cdot, \cdot)$ by $L^*_p(\cdot, \cdot, [0,1]^d)$ and $NN_S(\cdot, \cdot)$ by $NN^*_S(\cdot, \cdot, [0,1]^d)$. We, of course, need to explain what $NN^*_S(V, W, [0,1]^d)$ and $L^*_p(V, W, [0,1]^d)$ mean. For $V \subseteq W$, we define $NN^*_S(V, W, [0,1]^d)$ as the subgraph of $NN^*_S(W, [0,1]^d)$ in which the edges starting in $W \setminus V$ are removed, and $L^*_p(V, W, [0,1]^d)$ as the sum of the $p$-th powers of the Euclidean lengths of the edges of $NN^*_S(V, W, [0,1]^d)$. □

Corollary 14 (Smoothness of $L^*_p$). For $p \ge 0$ and finite $V, V' \subset [0,1]^d$,
$$|L^*_p(V', [0,1]^d) - L^*_p(V, [0,1]^d)| \le O(\max(|V' \triangle V|^{1-p/d}, 1)),$$
where $V' \triangle V$ denotes the symmetric difference.

Proof. The corollary is proved in exactly the same way as Corollary 12, with $L_p(\cdot)$ replaced by $L^*_p(\cdot, [0,1]^d)$. □

C.2 Subadditivity and Superadditivity

In this section, we verify axiom (A5).

Lemma 15 (Subadditivity). Let $p \ge 0$. For $m \in \mathbb{N}_+$, consider the partition $\{Q_i : 1 \le i \le m^d\}$ of the cube $[0,1]^d$ into $m^d$ disjoint subcubes^8 of side $1/m$. For any finite $V \subset [0,1]^d$,
$$L_p(V) \le \sum_{i=1}^{m^d} L_p(V \cap Q_i) + O(\max(m^{d-p}, 1)). \tag{19}$$

Proof. Consider a subcube $Q_i$ that contains at least $k + 1$ points. Using the "$L_p(W, W')$ notation" from the proof of Lemma 11,
$$L_p(V \cap Q_i, V) \le L_p(V \cap Q_i, V \cap Q_i) = L_p(V \cap Q_i).$$
Let $R$ be the union of the subcubes that contain at most $k$ points. Clearly, $|V \cap R| \le k m^d$. Then
$$L_p(V) = L_p(V, V) = L_p(V \cap R, V) + \sum_{\substack{1 \le i \le m^d \\ |V \cap Q_i| \ge k+1}} L_p(V \cap Q_i, V) \le L_p(V \cap R) + O(1) + \sum_{i=1}^{m^d} L_p(V \cap Q_i),$$
where we have used the second part of (18). The proof is finished by applying the growth bound: $L_p(V \cap R) \le O(\max(|V \cap R|^{1-p/d}, 1)) \le O(\max(m^{d-p}, 1))$. □

Lemma 16 (Superadditivity of $L^*_p$). Let $p \ge 0$. For $m \in \mathbb{N}_+$, consider a partition $\{Q_i : 1 \le i \le m^d\}$ of $[0,1]^d$ into $m^d$ disjoint subcubes of side $1/m$. For any finite $V \subset [0,1]^d$,
$$\sum_{i=1}^{m^d} L^*_p(V \cap Q_i, Q_i) \le L^*_p(V, [0,1]^d).$$

Proof. We construct a new graph $\hat{G}$ by modifying the graph $NN^*_S(V, [0,1]^d)$. Consider any edge $(x, y)$ such that $x \in Q_i$ and $y \notin Q_i$ for some $1 \le i \le m^d$. Let $z$ be the point where $\partial Q_i$ and the line segment from $x$ to $y$ intersect. In $\hat{G}$, we replace $(x, y)$ by $(x, z)$.
Note that all edges of $\hat{G}$ lie completely in one of the subcubes $Q_i$ and are no longer than the corresponding edges in $NN^*_S(V, [0,1]^d)$. Let $\hat{L}_{i,p}$ be the sum of the $p$-th powers of the Euclidean lengths of the edges of $\hat{G}$ lying in $Q_i$. Since the edges of $\hat{G}$ are no longer than those of $NN^*_S(V, [0,1]^d)$, we have $\sum_{i=1}^{m^d} \hat{L}_{i,p} \le L^*_p(V, [0,1]^d)$. To finish the proof it remains to show that $L^*_p(V \cap Q_i, Q_i) \le \hat{L}_{i,p}$ for all $1 \le i \le m^d$. For any edge $(x, z)$ in $\hat{G}$ from $x \in V \cap Q_i$ to $z \in \partial Q_i$, the point $z$ is not necessarily the point of $\partial Q_i$ closest to $x$. Therefore, any edge in $NN^*_S(V \cap Q_i, Q_i)$ is no longer than the corresponding edge in $\hat{G}$. □

^8 In order for the subcubes to be pairwise disjoint, most of them need to be semi-open and some of them closed.

C.3 Uniformly Distributed Points

Axiom (A7) is a direct consequence of axiom (A8). Hence, we are left with verifying axioms (A8) and (A9). In this section, $U_n$ denotes a set of $n$ points chosen independently and uniformly at random from $[0,1]^d$.

Lemma 17 (Average Edge Length). Assume $X_1, X_2, \dots, X_n$ are chosen i.i.d. uniformly at random from $[0,1]^d$. Let $k$ be a fixed positive integer and let $Z$ be the distance from $X_1$ to its $k$-th nearest neighbor in $\{X_2, X_3, \dots, X_n\}$. For any $p \ge 0$, $\mathbb{E}[Z^p \mid X_1] \le O(n^{-p/d})$.

Proof. We denote by $B(x, r) = \{y \in \mathbb{R}^d : \|x - y\| \le r\}$ the ball of radius $r \ge 0$ centered at a point $x \in \mathbb{R}^d$. Since $Z$ takes values in the interval $[0, \sqrt{d}]$,
$$\mathbb{E}[Z^p \mid X_1] = \int_0^\infty \Pr[Z^p > t \mid X_1]\,dt = p \int_0^{\sqrt{d}} u^{p-1} \Pr[Z > u \mid X_1]\,du = p \int_0^{\sqrt{d}} u^{p-1} \Pr\big[|\{X_2, \dots, X_n\} \cap B(X_1, u)| < k \;\big|\; X_1\big]\,du$$
$$= p \int_0^{\sqrt{d}} \sum_{j=0}^{k-1} \binom{n-1}{j} u^{p-1}\, \operatorname{Vol}(B(X_1, u) \cap [0,1]^d)^j \left(1 - \operatorname{Vol}(B(X_1, u) \cap [0,1]^d)\right)^{n-1-j} du$$
$$\le p \int_0^{\sqrt{d}} \sum_{j=0}^{k-1} \binom{n-1}{j} u^{p-1}\, \operatorname{Vol}(B(X_1, u))^j \left[1 - \left(\frac{u}{2\sqrt{d}}\right)^{\!d}\,\right]^{n-1-j} du.$$
The last inequality follows from the obvious bound $\operatorname{Vol}(B(X_1, u) \cap [0,1]^d) \le \operatorname{Vol}(B(X_1, u))$ and from the fact that for $u \in [0, \sqrt{d}]$ the intersection $B(X_1, u) \cap [0,1]^d$ contains a cube of side at least $u/(2\sqrt{d})$. To simplify this integral, we note that $\operatorname{Vol}(B(X_1, u)) = \operatorname{Vol}(B(X_1, 1))\, u^d$ and make the substitution $s = (u/(2\sqrt{d}))^d$. The integral can then be bounded by a constant multiple of
$$\sum_{j=0}^{k-1} \binom{n-1}{j} \int_0^1 s^{p/d + j - 1} (1 - s)^{n-1-j}\,ds.$$
Since $\binom{n-1}{j} = O(n^j)$ and the sum consists of only a constant number of terms, it remains to show that the inner integral is $O(n^{-p/d - j})$. We can express the inner integral using the gamma function and then use the asymptotic relation $\binom{n+\beta}{\beta} = \Theta(n^\beta)$ for generalized binomial coefficients $\binom{a}{b} = \frac{\Gamma(a+1)}{\Gamma(b+1)\Gamma(a-b+1)}$ to upper bound the result:
$$\int_0^1 s^{p/d + j - 1} (1 - s)^{n-1-j}\,ds = \frac{\Gamma(p/d + j)\,\Gamma(n - j)}{\Gamma(n + p/d)} = \frac{1}{(p/d + j)\binom{n + p/d - 1}{p/d + j}} = O(n^{-p/d - j}). \qquad \square$$

Lemma 18 (Add-One Bound). For any $p \ge 0$, $|\mathbb{E}[L_p(U_n)] - \mathbb{E}[L_p(U_{n+1})]| \le O(n^{-p/d})$.

Proof. Let $X_1, X_2, \dots, X_n, X_{n+1}$ be i.i.d. points from the uniform distribution over $[0,1]^d$. We couple $U_n$ and $U_{n+1}$ in the obvious way: $U_n = \{X_1, \dots, X_n\}$ and $U_{n+1} = \{X_1, \dots, X_{n+1}\}$. Let $Z$ be the distance from $X_{n+1}$ to its $k$-th closest neighbor in $U_n$. The inequality $L_p(U_{n+1}) \le L_p(U_n) + |S|\, Z^p$ holds since $|S|\, Z^p$ accounts for the edges from $X_{n+1}$ and since the edges from $U_n$ are no longer in $NN_S(U_{n+1})$ than the corresponding edges in $NN_S(U_n)$. Taking expectations and using Lemma 17, we get $\mathbb{E}[L_p(U_{n+1})] \le \mathbb{E}[L_p(U_n)] + O(n^{-p/d})$. To show the other direction of the inequality, let $Z_i$ be the distance from $X_i$ to its $(k+1)$-th nearest point in $U_{n+1}$. (Recall that $k = \max S$.) Let $N(j) = \{X_i : (X_i, X_j) \in E(NN_S(U_{n+1}))\}$ be the incoming neighborhood of $X_j$.
Now, if we remove $X_j$ from $NN_S(U_{n+1})$, the vertices in $N(j)$ lose $X_j$ as a neighbor and need to be connected to a new neighbor in $U_{n+1} \setminus \{X_j\}$. This new neighbor is not farther than their $(k+1)$-th nearest neighbor in $U_{n+1}$. Therefore,
$$L_p(U_{n+1} \setminus \{X_j\}) \le L_p(U_{n+1}) + \sum_{X_i \in N(j)} Z_i^p.$$
Summing over all $j = 1, 2, \dots, n+1$, we have
$$\sum_{j=1}^{n+1} L_p(U_{n+1} \setminus \{X_j\}) \le (n+1)\, L_p(U_{n+1}) + \sum_{j=1}^{n+1} \sum_{X_i \in N(j)} Z_i^p.$$
The double sum on the right-hand side is simply a sum over all edges of $NN_S(U_{n+1})$, and so we can write
$$\sum_{j=1}^{n+1} L_p(U_{n+1} \setminus \{X_j\}) \le (n+1)\, L_p(U_{n+1}) + |S| \sum_{i=1}^{n+1} Z_i^p.$$
Taking expectations and using Lemma 17 to bound $\mathbb{E}[Z_i^p]$, we arrive at
$$(n+1)\, \mathbb{E}[L_p(U_n)] \le (n+1)\, \mathbb{E}[L_p(U_{n+1})] + (n+1)\, O(n^{-p/d}).$$
The proof is finished by dividing through by $n + 1$. □

Lemma 19 (Quasi-additivity). For any $p \ge 0$,
$$0 \le \mathbb{E}[L_p(U_n)] - \mathbb{E}[L^*_p(U_n, [0,1]^d)] \le O(\max(n^{1-p/d-1/d}, 1)).$$

Proof. The first inequality follows from Proposition 8 by taking expectations. The proof of the second inequality is much more involved. Consider the (random) subset of points $\hat{U}_n \subseteq U_n$ that are connected to the boundary in $NN^*_S(U_n, [0,1]^d)$ by at least one edge. We use the notation $L_p(W, W')$ for any $W \subseteq W'$, its two properties expressed by Eq. (18), and a third obvious property, $L_p(W, W') \le L_p(W')$. We have
$$L_p(U_n) = L_p(U_n, U_n) = L_p(\hat{U}_n, U_n) + L_p(U_n \setminus \hat{U}_n, U_n) \le L_p(\hat{U}_n) + O(1) + L^*_p(U_n, [0,1]^d),$$
where in the last step we have used that $L_p(U_n \setminus \hat{U}_n, U_n) \le L^*_p(U_n, [0,1]^d)$, which holds since the edges from the vertices of $U_n \setminus \hat{U}_n$ are the same in both graphs $NN_S(U_n)$ and $NN^*_S(U_n, [0,1]^d)$. Taking expectations, we get
$$\mathbb{E}[L_p(U_n)] - \mathbb{E}[L^*_p(U_n, [0,1]^d)] \le \mathbb{E}[L_p(\hat{U}_n)] + O(1),$$
and we see that it remains to show that $\mathbb{E}[L_p(\hat{U}_n)] \le O(\max(n^{1-p/d-1/d}, 1))$.

Figure 4: The left drawing shows the box $B = [n^{-1/d}, 1 - n^{-1/d}]^d \subset [0,1]^d$ in gray. The right drawing shows the partition of $B$ into rectangles $R_1, R_2, \dots, R_m$. The projection of each rectangle $R_i$ on the face $F$ has diameter at most $n^{-1/d}$. In each rectangle $R_i$, at most $k$ points are connected to $F$ by an edge.

To do that, we start by showing that
$$\mathbb{E}[|\hat{U}_n|] \le O(n^{1-1/d}). \tag{20}$$
Consider the box $B = [n^{-1/d}, 1 - n^{-1/d}]^d$. We bound $\mathbb{E}[|\hat{U}_n \cap B|]$ and $\mathbb{E}[|\hat{U}_n \cap ([0,1]^d \setminus B)|]$ separately. The latter is easily bounded by $O(n^{1-1/d})$ since there are $n$ points and the probability that a point lies in $[0,1]^d \setminus B$ is $\operatorname{Vol}([0,1]^d \setminus B) \le O(n^{-1/d})$. We now bound $|\hat{U}_n \cap B|$. Consider a face $F$ of the cube $[0,1]^d$. Partition $B$ into $m = \Theta(n^{1-1/d})$ rectangles $R_1, R_2, \dots, R_m$ such that the perpendicular projection of any rectangle $R_i$, $1 \le i \le m$, on $F$ has diameter at most $n^{-1/d}$ and its $(d-1)$-dimensional volume is $\Theta(n^{1/d - 1})$; see Figure 4. It is not hard to see that, in $U_n \cap R_i$, only the $k$ points closest to $F$ can be connected to $F$ by an edge in $NN^*_S(U_n, [0,1]^d)$. There are $2d$ faces and $m$ rectangles, and hence $|\hat{U}_n \cap B| \le 2dk m = O(n^{1-1/d})$. We have thus proved (20).

The second key component we need is that the expected sum of the $p$-th powers of the lengths of the edges of $NN^*_S(U_n, [0,1]^d)$ that connect points in $U_n$ to $\partial [0,1]^d$ is "small". More precisely, for any point $x \in [0,1]^d$, let $\widehat{x} \in \partial [0,1]^d$ be the boundary point closest to $x$. We show that
$$\mathbb{E} \sum_{X \in \hat{U}_n} \|X - \widehat{X}\|^p \le O(n^{1-p/d-1/d}). \tag{21}$$
We decompose the task as
$$\mathbb{E} \sum_{X \in \hat{U}_n} \|X - \widehat{X}\|^p = \mathbb{E} \sum_{X \in \hat{U}_n \cap B} \|X - \widehat{X}\|^p + \mathbb{E} \sum_{X \in \hat{U}_n \cap ([0,1]^d \setminus B)} \|X - \widehat{X}\|^p.$$
Clearly, the second term is bounded by $n^{-p/d}\, \mathbb{E}[|\hat{U}_n \cap ([0,1]^d \setminus B)|] = O(n^{1-p/d-1/d})$, since every point outside $B$ is within distance $n^{-1/d}$ of the boundary. To bound the first term, consider a face $F$ of the cube $[0,1]^d$ and a rectangle $R_i$ in the decomposition of $B$ into $R_1, \dots, R_m$ mentioned above. Let $Z$ be the distance from $F$ of the $k$-th closest point of $\hat{U}_n \cap R_i$ to $F$. (If $\hat{U}_n \cap R_i$ contains fewer than $k$ points, we define $Z$ to be $1 - n^{-1/d}$.) Recall that only the $k$ points of $\hat{U}_n \cap R_i$ closest to $F$ can be connected to $F$, and the length of each such edge is bounded by $Z$. There are $2d$ faces, $m = O(n^{1-1/d})$ rectangles, and at most $k$ points in each rectangle connected to a face. Hence, if we can show that $\mathbb{E}[Z^p] = O(n^{-p/d})$, we can upper bound the first term by $2dk m \cdot O(n^{-p/d}) = O(n^{1-p/d-1/d})$, from which (21) follows.

We now prove that $\mathbb{E}[Z^p] = O(n^{-p/d})$. Let $Y = Z - n^{-1/d}$. Since $\mathbb{E}[Z^p] \le 2^p\, \mathbb{E}[Y^p] + 2^p n^{-p/d}$, it suffices to show that $\mathbb{E}[Y^p] = O(n^{-p/d})$. Let $q$ be the $(d-1)$-dimensional volume of the projection of $R_i$ to $F$; recall that $q = \Theta(n^{1/d-1})$. Since $Y \in [0, 1 - 2n^{-1/d}]$, we have
$$\mathbb{E}[Y^p] = p \int_0^{1-2n^{-1/d}} t^{p-1} \Pr[Y > t]\,dt = p \int_0^{1-2n^{-1/d}} t^{p-1} \sum_{j=0}^{k-1} \binom{n}{j} (qt)^j (1 - qt)^{n-j}\,dt \le p q^{-p} \int_0^1 x^{p-1} \sum_{j=0}^{k-1} \binom{n}{j} x^j (1-x)^{n-j}\,dx$$
$$= p q^{-p} \sum_{j=0}^{k-1} \binom{n}{j} \frac{\Gamma(p+j)\,\Gamma(n-j+1)}{\Gamma(n+p+1)} = p q^{-p} \sum_{j=0}^{k-1} \binom{n}{j} \frac{1}{(p+j)\binom{n+p}{p+j}} = \Theta(q^{-p} n^{-p}) = \Theta(n^{-p/d}).$$
We now use (20) and (21) to show that $\mathbb{E}[L_p(\hat{U}_n)] \le O(\max(n^{1-p/d-1/d}, 1))$, which will finish the proof. For any point $X \in \hat{U}_n$, consider the point $\widehat{X}$ lying on the boundary. Let $\hat{V}_n = \{\widehat{X} : X \in \hat{U}_n\}$ and let $NN_S(\hat{V}_n)$ be its nearest-neighbor graph. Since $\hat{V}_n$ lies in a union of $(d-1)$-dimensional faces, the growth bound gives
$$L_p(\hat{V}_n) \le O(\max(|\hat{V}_n|^{1-p/(d-1)}, 1)).$$
Thus, if 0 ≤ p < d − 1 we use that x 7→ x 1 − p/ ( d − 1) is concav e and (20), and we have E [ L p ( b V n )] ≤ O E h | b V n | 1 − p/ ( d − 1) i = O E h | b U n | 1 − p/ ( d − 1) i ≤ O E h | b U n | i 1 − p/ ( d − 1) ≤ O ( n 1 − 1 /d ) 1 − p/ ( d − 1) ≤ O ( n 1 − p/d − 1 /d ) . If p ≥ d − 1 then L p ( b V n ) = O (1) . Therefore, for any p ≥ 0 E [ L p ( b V n )] ≤ O (max( n 1 − p/d − 1 /d , 1)) (22) W e construct a nearest-neighbor graph b G on b U n by lifting N N S ( b V n ) . For ev ery edge, ( b X , b Y ) in N N S ( b V n ) we create an edge ( X , Y ) . Clearly , L p ( b U n ) is at most the sum of p -the powers of the edges lengths of b G . By triangle inequality , for any p > 0 k X − Y k p ≤ ( k X − b X k + k b X − b Y k + k b Y − Y k ) p ≤ 3 p ( k X − b X k p + k b X − b Y k p + k b Y − Y k p ) . In-degrees and out-degrees of b G are O (1) and so if we sum over all edges of ( X , Y ) of b G and take expectation, we get E [ L p ( b U n )] ≤ E [ L p ( b V n )] + O E X X ∈ b U n k X − b X k p . T o upper the right hand side we use (21) and (22), which proves that E [ L p ( b U n )] ≤ O (max( n 1 − p/d − 1 /d , 1)) and finishes the proof. 19 D Concentration and Estimator of Entropy In this section, we show that if V n is a set of n points drawn i.i.d. from any distribution over [0 , 1] d then L p ( V n ) is tightly concentrated. That is, we show that with high probability L p ( V n ) is within O ( n 1 / 2 − p/ (2 d ) ) its expected v alue. W e use this result at the end of this section to giv e a proof of Theorem 2. It turns out that in order to derive the concentration result, the properties of the distribution gener- ating the points are irrelev ant (ev en the existence of density is not necessary). The only property that we exploit is smoothness of L p . As a technical tool, we use the isoperimetric inequality for Hamming distance and product measures. This inequality is, in turn, a simple consequence of T ala- grand’ s isoperimetric inequality , see e.g. 
Dubhashi and Panconesi (2009); Alon and Spencer (2000); Talagrand (1995). To phrase the isoperimetric inequality, we use the Hamming distance $H(x_{1:n}, y_{1:n})$ between two tuples $x_{1:n} = (x_1, x_2, \ldots, x_n)$ and $y_{1:n} = (y_1, y_2, \ldots, y_n)$, which is defined as the number of coordinates in which $x_{1:n}$ and $y_{1:n}$ disagree.

Theorem 20 (Isoperimetric Inequality). Let $A \subset \Omega^n$ be a subset of an $n$-fold product of a probability space equipped with a product measure. For any $t \ge 0$ let
\[
A_t = \{ x_{1:n} \in \Omega^n : \exists\, y_{1:n} \in A \text{ s.t. } H(x_{1:n}, y_{1:n}) \le t \}
\]
be the $t$-expansion of $A$. Then, for any $t \ge 0$,
\[
\Pr[A]\,\Pr[\overline{A_t}] \le \exp\!\left(-\frac{t^2}{4n}\right),
\]
where $\overline{A_t}$ denotes the complement of $A_t$ with respect to $\Omega^n$.

Theorem 21 (Concentration Around the Median). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, and let $0 \le p \le d$. For any $t > 0$,
\[
\Pr\!\left[\, |L_p(V_n) - M(L_p(V_n))| > t \,\right] \le e^{-\Theta(t^{2d/(d-p)}/n)},
\]
where $M(\cdot)$ denotes the median of a random variable.

Proof. Let $\Omega = [0,1]^d$ and $V_n = \{X_1, X_2, \ldots, X_n\}$, where $X_1, X_2, \ldots, X_n$ are independent. To emphasize that we are working in a product space, we use the notation $L_p(x) := L_p(\{x_1, x_2, \ldots, x_n\})$, $L_p(X_{1:n}) := L_p(V_n) = L_p(\{X_1, X_2, \ldots, X_n\})$ and $M := M(L_p(X_{1:n}))$. Let $A = \{x \in \Omega^n : L_p(x) \le M\}$. By the smoothness of $L_p$ there exists a constant $C > 0$ such that $L_p(x) \le L_p(y) + C \cdot H(x, y)^{1-p/d}$. Therefore, $L_p(x) > M + t$ implies that $x \in \overline{A_{(t/C)^{d/(d-p)}}}$. Hence, for a random $X_{1:n} = (X_1, X_2, \ldots, X_n)$,
\[
\Pr[L_p(X_{1:n}) > M + t] \le \Pr\!\left[X_{1:n} \in \overline{A_{(t/C)^{d/(d-p)}}}\right] \le \frac{1}{\Pr[A]}\, e^{-\Theta(t^{2d/(d-p)}/n)}
\]
by the isoperimetric inequality. Similarly, we set $B = \{x \in \Omega^n : L_p(x) \ge M\}$ and note that by smoothness we also have the reversed inequality $L_p(y) \le L_p(x) + C \cdot H(x, y)^{1-p/d}$.
Therefore, $L_p(x) < M - t$ implies that $x \in \overline{B_{(t/C)^{d/(d-p)}}}$. By the same argument as before,
\[
\Pr[L_p(X_{1:n}) < M - t] \le \Pr\!\left[X_{1:n} \in \overline{B_{(t/C)^{d/(d-p)}}}\right] \le \frac{1}{\Pr[B]}\, e^{-\Theta(t^{2d/(d-p)}/n)} .
\]
The theorem follows by the union bound and the fact that $\Pr[A] \ge 1/2$ and $\Pr[B] \ge 1/2$, both events containing the median.

Corollary 22 (Deviation of the Mean and the Median). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, let $0 \le p \le d$, and let $S \subset \mathbb{N}^+$ be a finite set. Then
\[
|\mathbb{E}[L_p(V_n)] - M(L_p(V_n))| \le O(n^{1/2 - p/(2d)}) .
\]

Proof. For conciseness, let $L_p = L_p(V_n)$ and $M = M(L_p(V_n))$. We have
\[
|\mathbb{E}[L_p] - M| \le \mathbb{E}|L_p - M| = \int_0^\infty \Pr[|L_p - M| > t]\,\mathrm{d}t \le \int_0^\infty e^{-\Theta(t^{2d/(d-p)}/n)}\,\mathrm{d}t = \Theta(n^{1/2 - p/(2d)}) .
\]

Putting these pieces together, we arrive at what we wanted to prove:

Corollary 23 (Concentration). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, let $0 \le p \le d$, and let $S \subset \mathbb{N}^+$ be finite. For any $\delta > 0$, with probability at least $1 - \delta$,
\[
|\mathbb{E}[L_p(V_n)] - L_p(V_n)| \le O\!\left((n \log(1/\delta))^{1/2 - p/(2d)}\right) . \tag{23}
\]

Proof of Theorem 2. By scaling and translation, we can assume that the support of $\mu$ is contained in the unit cube $[0,1]^d$. The first part of the theorem follows immediately from Theorem 6. To prove the second part, observe from (23) that for any $\delta > 0$, with probability at least $1 - \delta$,
\[
\left| \frac{\mathbb{E}[L_p(V_n)]}{\gamma\, n^{1-p/d}} - \frac{L_p(V_n)}{\gamma\, n^{1-p/d}} \right| \le O\!\left(n^{-1/2 + p/(2d)} (\log(1/\delta))^{1/2 - p/(2d)}\right) . \tag{24}
\]
It is easy to see that if $0 < p \le d-1$ then $-1/2 + p/(2d) < -\frac{d-p}{d(2d-p)} < 0$, and if $d-1 \le p < d$ then $-1/2 + p/(2d) < -\frac{d-p}{d(d+1)} < 0$.
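The last integral in the proof of Corollary 22 can be evaluated explicitly by substitution (a routine step we spell out; write $\beta = 2d/(d-p)$ and let $c > 0$ be the constant hidden in the $\Theta$):

```latex
\int_0^\infty e^{-c\,t^{\beta}/n}\,\mathrm{d}t
  \;\overset{s = (c/n)^{1/\beta}\,t}{=}\;
  \left(\frac{n}{c}\right)^{1/\beta} \int_0^\infty e^{-s^{\beta}}\,\mathrm{d}s
  = \Theta\!\left(n^{1/\beta}\right)
  = \Theta\!\left(n^{(d-p)/(2d)}\right)
  = \Theta\!\left(n^{1/2 - p/(2d)}\right),
% since \int_0^\infty e^{-s^\beta}\,\mathrm{d}s = \Gamma(1 + 1/\beta)
% is a finite constant for every fixed \beta > 0.
```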
Now using (24), Theorem 7 and the triangle inequality, we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[
\left| \frac{L_p(V_n)}{\gamma\, n^{1-p/d}} - \int_{[0,1]^d} f^{1-p/d}(x)\,\mathrm{d}x \right|
\le \left| \frac{\mathbb{E}[L_p(V_n)]}{\gamma\, n^{1-p/d}} - \frac{L_p(V_n)}{\gamma\, n^{1-p/d}} \right|
+ \left| \frac{\mathbb{E}[L_p(V_n)]}{\gamma\, n^{1-p/d}} - \int_{[0,1]^d} f^{1-p/d}(x)\,\mathrm{d}x \right|
\le \begin{cases}
O\!\left(n^{-\frac{d-p}{d(2d-p)}} (\log(1/\delta))^{1/2 - p/(2d)}\right), & \text{if } 0 < p < d-1 ; \\
O\!\left(n^{-\frac{d-p}{d(d+1)}} (\log(1/\delta))^{1/2 - p/(2d)}\right), & \text{if } d-1 \le p < d .
\end{cases}
\]
To finish the proof of (7), exploit the fact that $\log(1 \pm x) = \pm O(x)$ as $x \to 0$.

E Copulas and Estimator of Mutual Information

The goal of this section is to prove Theorem 3 on the convergence of the estimator $\widehat I_\alpha$. The main additional difficulty in the proof is handling the effect of the empirical copula transformation. A version of the classical Kiefer–Dvoretzky–Wolfowitz theorem due to Massart gives a convenient way to do this; see e.g. Devroye and Lugosi (2001).

Theorem 24 (Kiefer–Dvoretzky–Wolfowitz). Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}$ with c.d.f. $F : \mathbb{R} \to [0,1]$. Define the empirical c.d.f.
\[
\widehat F(x) = \frac{1}{n} \left| \{ i : 1 \le i \le n,\ X_i \le x \} \right| \qquad \text{for } x \in \mathbb{R} .
\]
Then, for any $t \ge 0$,
\[
\Pr\!\left[ \sup_{x \in \mathbb{R}} |F(x) - \widehat F(x)| > t \right] \le 2 e^{-2nt^2} .
\]

As a simple consequence of the Kiefer–Dvoretzky–Wolfowitz theorem, we can derive that $\widehat F$ is a good approximation of $F$.

Lemma 25 (Convergence of Empirical Copula). Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \ldots, F_d$. Let $F$ be the copula defined by (9) and let $\widehat F$ be the empirical copula transformation defined by (10). Then, for any $t \ge 0$,
\[
\Pr\!\left[ \sup_{x \in \mathbb{R}^d} \| F(x) - \widehat F(x) \|_2 > t \right] \le 2d\, e^{-2nt^2/d} .
\]

Proof.
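Theorem 24 is easy to probe numerically. The sketch below (helper names are ours) computes the exact sup-distance between a continuous c.d.f. and the empirical c.d.f. from the order statistics, together with the tail bound of the theorem:

```python
import numpy as np

def dkw_tail(n, t):
    """Two-sided tail bound of Theorem 24 (with Massart's constant 2)."""
    return 2.0 * np.exp(-2.0 * n * t * t)

def sup_dist(sample, cdf):
    """Exact sup_x |F(x) - F_hat(x)| for a continuous c.d.f. F.
    The supremum is attained at an order statistic, just before or
    just after a jump of the empirical c.d.f."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = cdf(x)
    i = np.arange(1, n + 1)
    return float(max(np.max(i / n - F), np.max(F - (i - 1) / n)))

# Uniform sample: the deviation should sit well below the threshold t
# at which the tail bound equals 1%.
rng = np.random.default_rng(0)
n = 1000
dev = sup_dist(rng.random(n), lambda x: np.clip(x, 0.0, 1.0))
t = np.sqrt(np.log(2 / 0.01) / (2 * n))
print(dev, "<=", t)
```

The threshold `t` is obtained by solving `dkw_tail(n, t) = 0.01` for `t`, the same inversion used to pass from Lemma 25 to Corollary 26.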
Using $\| \cdot \|_2 \le \sqrt d\, \| \cdot \|_\infty$ in $\mathbb{R}^d$ and the union bound, we have
\[
\Pr\!\left[ \sup_{x \in \mathbb{R}^d} \| F(x) - \widehat F(x) \|_2 > t \right]
\le \Pr\!\left[ \sup_{x \in \mathbb{R}^d} \| F(x) - \widehat F(x) \|_\infty > \frac{t}{\sqrt d} \right]
= \Pr\!\left[ \sup_{x \in \mathbb{R}} \max_{1 \le j \le d} |F_j(x) - \widehat F_j(x)| > \frac{t}{\sqrt d} \right]
\le \sum_{j=1}^d \Pr\!\left[ \sup_{x \in \mathbb{R}} |F_j(x) - \widehat F_j(x)| > \frac{t}{\sqrt d} \right]
\le 2d\, e^{-2nt^2/d} .
\]

The following corollary is an obvious consequence of this lemma:

Corollary 26 (Convergence of Empirical Copula). Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \ldots, F_d$. Let $F$ be the copula defined by (9), and let $\widehat F$ be the empirical copula transformation defined by (10). Then, for any $\delta > 0$,
\[
\Pr\!\left[ \max_{1 \le i \le n} \| F(X_i) - \widehat F(X_i) \| < \sqrt{\frac{d \log(2d/\delta)}{2n}} \right] \ge 1 - \delta . \tag{25}
\]

Proposition 27 (Order Statistics). Let $a_1, a_2, \ldots, a_m$ and $b_1, b_2, \ldots, b_m$ be real numbers. Let $a_{(1)} \le a_{(2)} \le \ldots \le a_{(m)}$ and $b_{(1)} \le b_{(2)} \le \ldots \le b_{(m)}$ be the same numbers sorted in ascending order. Then
\[
|a_{(i)} - b_{(i)}| \le \max_j |a_j - b_j| \qquad \text{for all } 1 \le i \le m .
\]

Proof. The proof is left as an exercise for the reader.

Lemma 28 (Perturbation). Consider points $x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n \in \mathbb{R}^d$ such that $\| x_i - y_i \| < \varepsilon$ for all $1 \le i \le n$. Then,
\[
| L_p(\{x_1, \ldots, x_n\}) - L_p(\{y_1, \ldots, y_n\}) | \le
\begin{cases}
O(n \varepsilon^p), & \text{if } 0 < p < 1 ; \\
O(n \varepsilon), & \text{if } 1 \le p .
\end{cases}
\]

Proof. Let $k = \max S$, $A = \{x_1, x_2, \ldots, x_n\}$ and $B = \{y_1, y_2, \ldots, y_n\}$. Let $w_A(i,j) = \| x_i - x_j \|^p$ and $w_B(i,j) = \| y_i - y_j \|^p$ be the edge weights defined by $A$ and $B$, respectively. Let $a_i(j)$ be the $p$-th power of the distance from $x_i$ to its $j$-th nearest neighbor in $A$, for $1 \le i \le n$, $1 \le j \le n-1$. Similarly, let $b_i(j)$ be the $p$-th power of the distance from $y_i$ to its $j$-th nearest neighbor in $B$. Note that for any $i$, if we sort the real numbers $w_A(i,1), \ldots, w_A(i,i-1), w_A(i,i+1), \ldots$
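If, as in Theorem 24, the empirical marginals are defined by $\widehat F_j(x) = \frac1n |\{i : X_{ij} \le x\}|$, then evaluating the empirical copula transformation at the sample points amounts to taking coordinatewise normalized ranks. A minimal sketch (the function name is ours; ties are ignored, which is harmless for absolutely continuous distributions):

```python
import numpy as np

def empirical_copula(X):
    """Map each sample point X_i to Z_hat_i = F_hat(X_i), where
    F_hat_j(x) = (1/n) * #{i : X_ij <= x} is the empirical marginal
    c.d.f.  On the sample itself this is the coordinatewise rank
    divided by n (assuming no ties, which holds a.s. for absolutely
    continuous distributions)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # argsort of argsort yields 0-based ranks; +1 converts to the
    # "number of sample points <= x" count used by F_hat.
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return ranks / n
```

The transformed points $\widehat Z_i$ live on the grid $\{1/n, \ldots, 1\}^d$, one point per coordinate level, so the copula estimate is supported on a permutation pattern of the original sample.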
$\ldots, w_A(i,n)$, then we get $a_i(1) \le a_i(2) \le \ldots \le a_i(n-1)$; similarly for the $w_B$'s and the $b_i(j)$'s. Using this notation we can write
\[
| L_p(A) - L_p(B) | = \left| \sum_{i=1}^n \sum_{j \in S} \left( a_i(j) - b_i(j) \right) \right|
\le \sum_{i=1}^n \sum_{j \in S} \left| a_i(j) - b_i(j) \right|
\le \sum_{i=1}^n \sum_{j \in S} \max_{1 \le i,j \le n} \left| a_i(j) - b_i(j) \right|
\le \sum_{i=1}^n \sum_{j \in S} \max_{i,j} | w_A(i,j) - w_B(i,j) |
\le k n \max_{1 \le i,j \le n} | w_A(i,j) - w_B(i,j) | .
\]
The third inequality follows from Proposition 27. It remains to bound $| w_A(i,j) - w_B(i,j) |$. We consider two cases.

Case $0 < p < 1$. Using $| u^p - v^p | \le | u - v |^p$, valid for any $u, v \ge 0$, and the triangle inequality
\[
\big|\, \| a - b \| - \| c - d \| \,\big| \le \| a - c \| + \| b - d \| , \tag{26}
\]
valid for any $a, b, c, d \in \mathbb{R}^d$, we have
\[
| w_A(i,j) - w_B(i,j) | = \big| \| x_i - x_j \|^p - \| y_i - y_j \|^p \big|
\le \big| \| x_i - x_j \| - \| y_i - y_j \| \big|^p
\le ( \| x_i - y_i \| + \| x_j - y_j \| )^p \le 2^p \varepsilon^p .
\]

Case $p \ge 1$. Consider the function $f(u) = u^p$ on the interval $[0, \sqrt d]$. On this interval $| f'(u) | \le p d^{(p-1)/2}$, and so $f$ is Lipschitz with constant $p d^{(p-1)/2}$. In other words, for any $u, v \in [0, \sqrt d]$, $| u^p - v^p | \le p d^{(p-1)/2} | u - v |$. Thus
\[
| w_A(i,j) - w_B(i,j) | = \big| \| x_i - x_j \|^p - \| y_i - y_j \|^p \big|
\le p d^{(p-1)/2} \big| \| x_i - x_j \| - \| y_i - y_j \| \big|
\le p d^{(p-1)/2} ( \| x_i - y_i \| + \| x_j - y_j \| ) \le 2 p d^{(p-1)/2} \varepsilon ,
\]
where the second inequality follows from (26).

Corollary 29 (Copula Perturbation). Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \ldots, F_d$. Let $F$ be the copula defined by (9) and let $\widehat F$ be the empirical copula transformation defined by (10). Let $Z_i = F(X_i)$ and $\widehat Z_i = \widehat F(X_i)$. Then for any $\delta > 0$, with probability at least $1 - \delta$,
\[
\left| \frac{L_p(Z_{1:n})}{\gamma\, n^{1-p/d}} - \frac{L_p(\widehat Z_{1:n})}{\gamma\, n^{1-p/d}} \right| \le
\begin{cases}
O\!\left(n^{p/d - p/2} (\log(1/\delta))^{p/2}\right), & \text{if } 0 < p < 1 ; \\
O\!\left(n^{p/d - 1/2} (\log(1/\delta))^{1/2}\right), & \text{if } 1 \le p .
\end{cases}
\]

Proof.
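Both elementary inequalities used in the case analysis above are easy to sanity-check numerically (the script and its parameter choices are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 4, 10_000
u, v = rng.uniform(0.0, np.sqrt(d), size=(2, trials))

# Case 0 < p < 1: |u^p - v^p| <= |u - v|^p (subadditivity of t -> t^p).
p = 0.5
ok_concave = bool(np.all(np.abs(u**p - v**p) <= np.abs(u - v)**p + 1e-12))

# Case p >= 1: t -> t^p is Lipschitz on [0, sqrt(d)] with constant
# p * d^((p-1)/2), since |f'(t)| = p * t^(p-1) <= p * d^((p-1)/2) there.
p = 2.5
lip = p * d ** ((p - 1) / 2)
ok_lipschitz = bool(np.all(np.abs(u**p - v**p) <= lip * np.abs(u - v) + 1e-9))

print(ok_concave, ok_lipschitz)
```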
It follows immediately from Corollary 26 and Lemma 28 that, with probability at least $1 - \delta$,
\[
| L_p(Z_{1:n}) - L_p(\widehat Z_{1:n}) | \le
\begin{cases}
O\!\left(n^{1 - p/2} (\log(1/\delta))^{p/2}\right), & \text{if } 0 < p < 1 ; \\
O\!\left(n^{1/2} (\log(1/\delta))^{1/2}\right), & \text{if } 1 \le p .
\end{cases}
\]

We are now ready to give the proof of Theorem 3.

Proof of Theorem 3. Let $g$ denote the density of the copula of $\mu$. The first part follows from (6), Corollary 29 and a standard Borel–Cantelli argument with $\delta = 1/n^2$; Corollary 29 imposes the restrictions $d \ge 3$ and $1/2 < \alpha < 1$. The second part can be proved along the same lines. From (7) we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[
\left| \frac{L_p(Z_{1:n})}{\gamma\, n^{1-p/d}} - \int_{[0,1]^d} g^{1-p/d}(x)\,\mathrm{d}x \right| \le
\begin{cases}
O\!\left(n^{-\frac{d-p}{d(2d-p)}} (\log(1/\delta))^{1/2 - p/(2d)}\right), & \text{if } 0 < p < d-1 ; \\
O\!\left(n^{-\frac{d-p}{d(d+1)}} (\log(1/\delta))^{1/2 - p/(2d)}\right), & \text{if } d-1 \le p < d .
\end{cases}
\]
Hence, using the triangle inequality again and exploiting that $(\log(1/\delta))^{1/2 - p/(2d)} < (\log(1/\delta))^{1/2}$ for $0 < p$ and small enough $\delta$, we have that with probability at least $1 - \delta$,
\[
\left| \frac{L_p(\widehat Z_{1:n})}{\gamma\, n^{1-p/d}} - \int_{[0,1]^d} g^{1-p/d}(x)\,\mathrm{d}x \right| \le
\begin{cases}
O\!\left(\max\{ n^{-\frac{d-p}{d(2d-p)}},\, n^{-p/2 + p/d} \} \sqrt{\log(1/\delta)}\right), & \text{if } 0 < p \le 1 ; \\
O\!\left(\max\{ n^{-\frac{d-p}{d(2d-p)}},\, n^{-1/2 + p/d} \} \sqrt{\log(1/\delta)}\right), & \text{if } 1 \le p \le d-1 ; \\
O\!\left(\max\{ n^{-\frac{d-p}{d(d+1)}},\, n^{-1/2 + p/d} \} \sqrt{\log(1/\delta)}\right), & \text{if } d-1 \le p < d .
\end{cases}
\]
To finish the proof, exploit that $\log(1 \pm x) = \pm O(x)$ as $x \to 0$.
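Putting the pieces of this section together, the copula-based estimator analyzed in Theorem 3 can be sketched end to end: transform the sample by the empirical copula, evaluate the nearest-neighbor functional $L_p$ with $p = d(1-\alpha)$, and plug into the Rényi formula. Everything below is an illustrative sketch under these assumptions: the function name is ours, and the normalizing constant $\gamma$ (which depends on $d$, $p$ and $S$ and is typically not known in closed form) is left as a placeholder, so values are meaningful only up to an additive constant.

```python
import numpy as np

def renyi_mi_estimate(X, alpha=0.75, gamma=1.0, S=(1,)):
    """Copula-based Renyi mutual information estimate (sketch).
    With p = d*(1 - alpha), the estimate is
        (1/(alpha - 1)) * log( L_p(Z_hat) / (gamma * n^(1 - p/d)) ),
    i.e. the negated Renyi entropy estimate of the empirical copula."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    p = d * (1.0 - alpha)
    # Empirical copula transform: coordinatewise normalized ranks.
    Z = (X.argsort(axis=0).argsort(axis=0) + 1) / n
    # Brute-force j-th nearest-neighbor distances in the copula sample.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    D.sort(axis=1)
    Lp = sum((D[:, j] ** p).sum() for j in S)
    return float(np.log(Lp / (gamma * n ** (1.0 - p / d))) / (alpha - 1.0))

# Strong dependence should score higher than independence, even with
# the placeholder gamma (the unknown offset cancels in comparisons).
rng = np.random.default_rng(1)
x = rng.random(400)
mi_dep = renyi_mi_estimate(np.column_stack([x, x + 0.01 * rng.random(400)]))
mi_ind = renyi_mi_estimate(rng.random((400, 2)))
```

With the placeholder $\gamma$ the absolute values are shifted by a constant, but the ordering `mi_dep > mi_ind` is already informative, which is what applications such as independent subspace analysis rely on.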