Barnes-Hut-SNE

Laurens van der Maaten
Pattern Recognition and Bioinformatics Group, Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
lvdmaaten@gmail.com

Abstract

The paper presents an $O(N \log N)$ implementation of t-SNE, an embedding technique that is commonly used for the visualization of high-dimensional data in scatter plots and that normally runs in $O(N^2)$. The new implementation uses vantage-point trees to compute sparse pairwise similarities between the input data objects, and it uses a variant of the Barnes-Hut algorithm to approximate the forces between the corresponding points in the embedding. Our experiments show that the new algorithm, called Barnes-Hut-SNE, leads to substantial computational advantages over standard t-SNE, and that it makes it possible to learn embeddings of data sets with millions of objects.

1 Introduction

Data-visualization techniques are an essential tool for any data analyst, as they allow the analyst to visually explore the data and generate hypotheses. One of the key limitations of traditional visualization techniques such as histograms, scatter plots, and parallel coordinate plots (see, e.g., [10] for an overview) is that they only facilitate the visualization of one or a few data variables at a time. In order to get an idea of the structure of all variables in the data, it is therefore necessary to perform an automatic analysis of the data before making the visualization, for instance, by learning a low-dimensional embedding of the data. In such an embedding, each data object is represented by a low-dimensional point in such a way that nearby points correspond to similar data objects and distant points correspond to dissimilar data objects. The low-dimensional embedding can readily be visualized in, e.g., a scatter plot or a parallel coordinate plot. A plethora of embedding techniques have been proposed over the last decade, e.g., [5, 15, 20, 23, 25, 26].
For creating two- or three-dimensional embeddings that can be readily visualized in a scatter plot, a family of techniques based on stochastic neighbor embedding (SNE; [11]) has recently become very popular. These techniques compute an $N \times N$ similarity matrix in both the original data space and the low-dimensional embedding space; the similarities take the form of a probability distribution over pairs of points in which high probabilities correspond to similar objects or points. The probabilities are generally defined as normalized Gaussian or Student-t kernels, which makes SNE focus on preserving local data structure. The embedding is learned by minimizing the Kullback-Leibler divergence between the probability distributions in the original data space and the embedding space with respect to the locations of the points in the embedding. As the resulting cost function is non-convex, this minimization is typically performed using first- or second-order gradient-descent techniques [5, 11, 27]. The gradient of the Kullback-Leibler divergence may be interpreted as an $N$-body system in which all of the $N$ points exert forces on each other.

One of the key limitations of SNE (and of its variants) is that its computational and memory complexity scales quadratically in the number of data objects $N$. In practice, this limits the applicability of SNE to data sets with only a few thousand points. To visualize larger data sets, landmark implementations of SNE may be used [25], but this is hardly a satisfactory solution.

In this paper, we develop a new algorithm for (t-)SNE that requires only $O(N \log N)$ computation and $O(N)$ memory.
Our new algorithm computes a sparse approximation of the similarities between the original data objects using vantage-point trees [31], and subsequently approximates the forces between the points in the embedding using a Barnes-Hut algorithm [1], an algorithm commonly used by astronomers to perform $N$-body simulations. The Barnes-Hut algorithm reduces the number of pairwise forces that need to be computed by exploiting the fact that the forces exerted by a group of points on a point that is relatively far away are all very similar.

2 Related work

A large body of previous work has focused on decreasing the computational complexity of algorithms that scale quadratically in the amount of data when implemented naively. Most of these studies focus on speeding up nearest-neighbor searches using space-partitioning (metric) trees (e.g., B-trees [2], cover trees [3], and vantage-point trees [31]) or using locality sensitive hashing approaches (e.g., [12, 29]). Motivated by the strong performance of metric trees reported in earlier work [17], we opt to use metric trees to approximate the similarities of the input objects in our algorithm.

Several prior studies have also developed algorithms to speed up $N$-body computations. Most prominently, [7, 8] developed a dual-tree algorithm that is similar in spirit to the Barnes-Hut algorithm we use in this work. Unlike the Barnes-Hut algorithm, the dual-tree algorithm does not consider interactions between single points and groups of points; it only considers interactions between groups of points. In preliminary experiments (see appendix), we found the dual-tree and Barnes-Hut algorithms to perform on par when used in the context of t-SNE; we opt for the Barnes-Hut algorithm here because it is conceptually simpler. Prior work [6] has also used the fast Gaussian transform [9, 30] (a special case of a fast multipole method [19]) to speed up the computation of Gaussian $N$-body interactions.
Since the forces exerted on the bodies in t-SNE are non-Gaussian, such an approach cannot readily be applied here.

3 t-Distributed Stochastic Neighbor Embedding

t-Distributed Stochastic Neighbor Embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities between the original data objects and a distribution that measures pairwise similarities between the corresponding points in the embedding. Suppose we are given a data set of objects $D = \{x_1, x_2, \ldots, x_N\}$ and a function $d(x_i, x_j)$ that computes a distance between a pair of objects, e.g., their Euclidean distance. Our aim is to learn an $s$-dimensional embedding in which each object is represented by a point, $E = \{y_1, y_2, \ldots, y_N\}$ with $y_i \in \mathbb{R}^s$. To this end, t-SNE defines joint probabilities $p_{ij}$ that measure the pairwise similarity between objects $x_i$ and $x_j$ by symmetrizing two conditional probabilities as follows:

$$p_{j|i} = \frac{\exp\left(-d(x_i, x_j)^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d(x_i, x_k)^2 / 2\sigma_i^2\right)}, \qquad p_{i|i} = 0, \tag{1}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}. \tag{2}$$

Herein, the bandwidth of the Gaussian kernels, $\sigma_i$, is set such that the perplexity of the conditional distribution $P_i$ equals a predefined perplexity $u$. The optimal value of $\sigma_i$ varies per object, and is found using a simple binary search; see [11] for details. A heavy-tailed distribution is used to measure the similarity $q_{ij}$ between the two corresponding points $y_i$ and $y_j$ in the embedding:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \qquad q_{ii} = 0. \tag{3}$$

In the embedding, a normalized Student-t kernel is used to measure similarities rather than a normalized Gaussian kernel to account for the difference in volume between high- and low-dimensional spaces [25].
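The definitions in Equations 1 to 3 can be illustrated with a short sketch. The Python code below is not the implementation used in the paper; it is a minimal, quadratic-time illustration that assumes Euclidean distances, searches over $\beta_i = 1/(2\sigma_i^2)$ instead of $\sigma_i$ directly, and measures perplexity as the exponential of the entropy in nats. The function names `conditional_row`, `joint_probabilities`, and `embedding_similarities` are hypothetical.

```python
import math

def conditional_row(sq_dists_i, i, perplexity, tol=1e-5, max_iter=200):
    """Return the row p_{.|i} of Equation 1, tuning sigma_i by binary search."""
    n = len(sq_dists_i)
    others = [j for j in range(n) if j != i]          # enforces p_{i|i} = 0
    d_min = min(sq_dists_i[j] for j in others)        # shift for numerical stability
    lo, hi, beta = 0.0, math.inf, 1.0                 # beta = 1 / (2 * sigma_i^2)
    for _ in range(max_iter):
        w = [math.exp(-(sq_dists_i[j] - d_min) * beta) for j in others]
        total = sum(w)
        p = [v / total for v in w]
        entropy = -sum(v * math.log(v) for v in p if v > 0.0)
        diff = math.exp(entropy) - perplexity         # perplexity = exp(entropy in nats)
        if abs(diff) < tol:
            break
        if diff > 0:                                  # too flat: narrow the kernel
            lo = beta
            beta = beta * 2.0 if math.isinf(hi) else 0.5 * (lo + hi)
        else:                                         # too peaked: widen the kernel
            hi = beta
            beta = 0.5 * (lo + hi)
    row = [0.0] * n
    for j, v in zip(others, p):
        row[j] = v
    return row

def joint_probabilities(X, perplexity):
    """Symmetrized p_ij of Equation 2 for a list of coordinate lists X."""
    n = len(X)
    sq = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
          for i in range(n)]
    cond = [conditional_row(sq[i], i, perplexity) for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]

def embedding_similarities(Y):
    """Normalized Student-t similarities q_ij of Equation 3."""
    n = len(Y)
    num = [[0.0 if k == l else
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(Y[k], Y[l])))
            for l in range(n)] for k in range(n)]
    Z = sum(map(sum, num))                            # sum over all pairs k != l
    return [[v / Z for v in row] for row in num]
```

Both returned matrices sum to one over all pairs, and each row of conditional probabilities has (up to the tolerance `tol`) the requested perplexity, provided the requested perplexity is smaller than $N - 1$.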
The locations of the embedding points $y_i$ are learned by minimizing the Kullback-Leibler divergence between the joint distributions $P$ and $Q$:

$$C(E) = KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \tag{4}$$

This cost function is non-convex; it is typically minimized by descending along the gradient:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \, q_{ij} Z \, (y_i - y_j), \tag{5}$$

where we defined the normalization term $Z = \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}$. The evaluation of both joint distributions $P$ and $Q$ is $O(N^2)$, because their respective normalization terms sum over all $N^2$ pairs of points. Since t-SNE scales quadratically in the number of objects $N$, its applicability is limited to data sets with only a few thousand data objects; beyond that, learning becomes very slow.

4 Barnes-Hut-SNE

Barnes-Hut-SNE uses metric trees to approximate $P$ by a sparse distribution in which only $O(uN)$ values are non-zero, and approximates the gradients $\frac{\partial C}{\partial y_i}$ using a Barnes-Hut algorithm.

4.1 Approximating Input Similarities

As the input similarities are computed using a (normalized) Gaussian kernel, probabilities $p_{ij}$ corresponding to dissimilar input objects $i$ and $j$ are (nearly) infinitesimal. Therefore, we can use a sparse approximation to the probabilities $p_{ij}$ without a substantial negative effect on the quality of the final embeddings. In particular, we compute the sparse approximation by finding the $\lfloor 3u \rfloor$ nearest neighbors of each of the $N$ data objects, and redefining the pairwise similarities $p_{ij}$ as:

$$p_{j|i} = \begin{cases} \dfrac{\exp\left(-d(x_i, x_j)^2 / 2\sigma_i^2\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(-d(x_i, x_k)^2 / 2\sigma_i^2\right)}, & \text{if } j \in \mathcal{N}_i \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}. \tag{7}$$

Herein, $\mathcal{N}_i$ represents the set of the $\lfloor 3u \rfloor$ nearest neighbors of $x_i$, and $\sigma_i$ is set such that the perplexity of the conditional distribution equals $u$. The nearest neighbor sets are found in $O(uN \log N)$ time by building a vantage-point tree on the data set.

Vantage-point tree.
In a vantage-point tree, each node stores a data object and the radius of a (hyper)ball that is centered on this object [31]. All non-leaf nodes have two children: data objects that are located inside the ball are stored under the left child of the node, whereas data objects that are located outside the ball are stored under the right child. The tree is constructed by presenting the data objects one-by-one, traversing the tree based on whether the current data object lies inside or outside a ball, and creating a new leaf node in which the object is stored. The radius of the new leaf node is set to the median distance between its object and all other objects that lie inside the ball represented by its parent node. To construct a vantage-point tree, the objects need not necessarily be points in a high-dimensional feature space; the availability of a metric $d(x_i, x_j)$ suffices. (In our experiments, however, we use $x_i \in \mathbb{R}^D$ and $d(x_i, x_j) = \|x_i - x_j\|$.)

A nearest-neighbor search is performed using a depth-first search on the tree that computes the distance of the objects stored in the nodes to the target object, whilst maintaining i) a list of the current nearest neighbors and ii) the distance $\tau$ to the furthest nearest neighbor in the current neighbor list. The value of $\tau$ determines whether or not a node should be explored: if there can still be objects inside the ball whose distance to the target object is smaller than $\tau$, the left node is searched, and if there can still be objects outside the ball whose distance to the target object is smaller than $\tau$, the right node is searched. The order in which children are searched depends on whether the target object lies inside or outside the current node ball: the left child is examined first if the object lies inside the ball, because the odds are that the nearest neighbors of the target object are also located inside the ball.
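The search procedure can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: it assumes a recursive batch construction that takes the first remaining object as the vantage point and the median distance as the radius (instead of the one-by-one insertion described above), and the names `VPNode`, `build_vp_tree`, and `knn` are hypothetical.

```python
import heapq
import math

class VPNode:
    """Node holding a vantage object (by index) and the radius of its ball."""
    def __init__(self, index, radius, left, right):
        self.index, self.radius = index, radius
        self.left, self.right = left, right      # left: inside ball, right: outside

def build_vp_tree(points, dist, indices=None):
    """Recursively build a vantage-point tree over `indices` into `points`."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    vp, rest = indices[0], indices[1:]           # assumption: first object is the vantage point
    if not rest:
        return VPNode(vp, 0.0, None, None)
    dists = sorted((dist(points[vp], points[j]), j) for j in rest)
    radius = dists[len(dists) // 2][0]           # median distance to the vantage point
    inside = [j for d, j in dists if d < radius]
    outside = [j for d, j in dists if d >= radius]
    return VPNode(vp, radius,
                  build_vp_tree(points, dist, inside),
                  build_vp_tree(points, dist, outside))

def knn(root, points, dist, target, k):
    """Return the k nearest (distance, index) pairs to `target`, closest first."""
    heap = []                                    # max-heap via negated distances
    tau = math.inf                               # distance to the furthest current neighbor

    def search(node):
        nonlocal tau
        if node is None:
            return
        d = dist(points[node.index], target)
        if d < tau:
            heapq.heappush(heap, (-d, node.index))
            if len(heap) > k:
                heapq.heappop(heap)              # drop the furthest candidate
            if len(heap) == k:
                tau = -heap[0][0]
        if d < node.radius:                      # target inside the ball: left child first
            search(node.left)
            if d + tau >= node.radius:           # a closer object may still lie outside
                search(node.right)
        else:                                    # target outside the ball: right child first
            search(node.right)
            if d - tau <= node.radius:           # a closer object may still lie inside
                search(node.left)

    search(root)
    return sorted((-neg_d, i) for neg_d, i in heap)
```

The two pruning tests follow from the triangle inequality, which is why any metric `dist` suffices. When the target is itself a data object, it is returned as its own nearest neighbor at distance zero, so a caller that needs $\lfloor 3u \rfloor$ neighbors would request one extra and discard the first.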
The right child is searched first whenever the target object lies outside of the ball.

4.2 Approximating t-SNE Gradients

To approximate the t-SNE gradient, we start by splitting the gradient into two parts as follows:

$$\frac{\partial C}{\partial y_i} = 4 \left( F_{\text{attr}} - F_{\text{rep}} \right) = 4 \left( \sum_{j \neq i} p_{ij} q_{ij} Z \, (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z \, (y_i - y_j) \right), \tag{8}$$

where $F_{\text{attr}}$ denotes the sum of all attractive forces (the left sum), whereas $F_{\text{rep}}$ denotes the sum of all repulsive forces (the right sum). Computing the sum of all attractive forces, $F_{\text{attr}}$, is computationally efficient; it can be done by summing over all non-zero elements of the sparse distribution $P$ in $O(uN)$. (Note that the term $q_{ij} Z = \left(1 + \|y_i - y_j\|^2\right)^{-1}$ can be computed in $O(1)$.) However, a naive computation of the sum of all repulsive forces, $F_{\text{rep}}$, is $O(N^2)$. We now develop a Barnes-Hut algorithm to approximate $F_{\text{rep}}$ efficiently in $O(N \log N)$.

Consider three points $y_i$, $y_j$, and $y_k$ with $\|y_i - y_j\| \approx \|y_i - y_k\| \gg \|y_j - y_k\|$. In this situation, the contributions of $y_j$ and $y_k$ to $F_{\text{rep}}$ will be roughly equal. The Barnes-Hut algorithm [1] exploits this by i) constructing a quadtree on the current embedding, ii) traversing the quadtree using a depth-first search, and iii) at every node in the quadtree, deciding whether the corresponding cell can be used as a "summary" for the gradient contributions of all points in that cell.

Figure 1: Quadtree constructed on a two-dimensional t-SNE embedding of 500 MNIST digits (the colors of the points correspond to the digit classes). Note how the quadtree adapts to the local point density in the embedding.

Quadtree. A quadtree is a tree in which each node represents a rectangular cell with a particular center, width, and height.
Non-leaf nodes have four children that split up the cell into four smaller cells (quadrants) that lie "northwest", "northeast", "southwest", and "southeast" of the center of the parent node (see Figure 1 for an illustration). Leaf nodes represent cells that contain at most one point of the embedding; the root node represents the cell that contains the complete embedding. In each node, we store the center-of-mass of the embedding points that are located inside the corresponding cell, $y_{\text{cell}}$, and the total number of points that lie inside the cell, $N_{\text{cell}}$. A quadtree has $O(N)$ nodes and can be constructed in $O(N)$ time by inserting the points one-by-one, splitting a leaf node whenever a second point is inserted in its cell, and updating $y_{\text{cell}}$ and $N_{\text{cell}}$ of all visited nodes.

Approximating the gradient. To approximate the repulsive part of the gradient, $F_{\text{rep}}$, we note that if a cell is sufficiently small and sufficiently far away from point $y_i$, the contributions $q_{ij}^2 Z \, (y_i - y_j)$ to $F_{\text{rep}}$ will be roughly similar for all points $y_j$ inside that cell. We can, therefore, approximate these contributions by $N_{\text{cell}} \, q_{i,\text{cell}}^2 Z \, (y_i - y_{\text{cell}})$, where we define $q_{i,\text{cell}} Z = \left(1 + \|y_i - y_{\text{cell}}\|^2\right)^{-1}$. We first approximate $F_{\text{rep}} Z = \sum_{j \neq i} q_{ij}^2 Z^2 \, (y_i - y_j)$ by performing a depth-first search on the quadtree, assessing at each node whether or not that node may be used as a "summary" for all the embedding points that are located in the corresponding cell. During this search, we construct an estimate of $Z = \sum_{i \neq j} \left(1 + \|y_i - y_j\|^2\right)^{-1}$ in the same way. The two approximations thus obtained are then used to compute $F_{\text{rep}}$ via $F_{\text{rep}} = \frac{F_{\text{rep}} Z}{Z}$.

We use the condition proposed by [1] to decide whether a cell may be used as a "summary" for all points in that cell.
The condition compares the size of the cell with its distance to the target point:

$$\frac{r_{\text{cell}}}{\|y_i - y_{\text{cell}}\|^2} < \theta, \tag{9}$$

where $r_{\text{cell}}$ represents the length of the diagonal of the cell under consideration and $\theta$ is a threshold that trades off speed and accuracy (higher values of $\theta$ lead to poorer approximations). In preliminary experiments, we also explored various other conditions that take into account the rapid decay of the Student-t tail, but we did not find these alternative conditions to lead to a better accuracy-speed trade-off. (The problem of more complex conditions is that they require expensive computations at each cell. By contrast, the condition in Equation 9 can be evaluated very rapidly.)

Figure 2: Computation time (in seconds) required to embed 70,000 MNIST digits using Barnes-Hut-SNE (left panel, "Computation time") and the 1-nearest neighbor errors of the corresponding embeddings (right panel, "Nearest neighbor error") as a function of the trade-off parameter $\theta$.

Dual-tree algorithms. Whilst the Barnes-Hut algorithm considers point-cell interactions, further speed-ups may be obtained by computing only cell-cell interactions. This can be done using a dual-tree algorithm [7] that simultaneously traverses the quadtree twice, and for every pair of nodes decides whether the interaction between the corresponding cells can be used as a "summary" for the interactions between all points inside these two cells. Perhaps surprisingly, we found such an approach to perform only on par with the Barnes-Hut algorithm in preliminary experiments. The computational advantages of the dual-tree algorithm evaporate because after computing an interaction between two cells, one still needs to determine to which set of points the interaction applies. This can be done by searching the cell or by storing a list of children in each node during tree construction. Both these approaches are computationally costly.
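As an illustration of the point-cell approximation of Section 4.2, the following Python sketch builds a quadtree and accumulates the estimates of $F_{\text{rep}} Z$ and $Z$ in one depth-first traversal per point. It is a minimal sketch under stated assumptions, not the paper's code: cells are square rather than rectangular, all embedding points are assumed distinct, a cell is accepted as a summary when its diagonal divided by the squared distance to its center-of-mass is below $\theta$, and the names `Cell`, `repulsion`, and `barnes_hut_repulsive_forces` are hypothetical.

```python
import math

class Cell:
    """Square quadtree cell storing N_cell and the center-of-mass y_cell."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.n = 0                         # N_cell
        self.com = [0.0, 0.0]              # y_cell
        self.point = None                  # the single point held by a leaf
        self.children = None

    def _child(self, p):
        # Index matches the construction order of the four quadrants below.
        return self.children[2 * (p[0] >= self.cx) + (p[1] >= self.cy)]

    def insert(self, p):
        # Update the running center-of-mass and count of every visited node.
        self.com = [(self.com[0] * self.n + p[0]) / (self.n + 1),
                    (self.com[1] * self.n + p[1]) / (self.n + 1)]
        self.n += 1
        if self.children is None:
            if self.point is None:         # empty leaf: store the point
                self.point = p
                return
            old, self.point = self.point, None
            h = self.half / 2.0            # second point arrives: split the leaf
            self.children = [Cell(self.cx + sx * h, self.cy + sy * h, h)
                             for sx in (-1, 1) for sy in (-1, 1)]
            self._child(old).insert(old)
        self._child(p).insert(p)

def repulsion(cell, y, theta):
    """Return this cell's contribution to (F_rep * Z, Z) for the point y."""
    if cell.n == 0 or (cell.children is None and cell.point is y):
        return [0.0, 0.0], 0.0             # empty cell, or the point itself
    dx, dy = y[0] - cell.com[0], y[1] - cell.com[1]
    d2 = dx * dx + dy * dy
    diag = 2.0 * math.sqrt(2.0) * cell.half            # r_cell
    if cell.children is None or diag < theta * d2:     # summary condition
        qz = 1.0 / (1.0 + d2)                          # q_{i,cell} * Z
        return [cell.n * qz * qz * dx, cell.n * qz * qz * dy], cell.n * qz
    f, z = [0.0, 0.0], 0.0
    for child in cell.children:
        cf, cz = repulsion(child, y, theta)
        f[0] += cf[0]
        f[1] += cf[1]
        z += cz
    return f, z

def barnes_hut_repulsive_forces(Y, theta=0.5):
    """Approximate F_rep for every point of a 2-D embedding Y."""
    xs, ys = [p[0] for p in Y], [p[1] for p in Y]
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2.0 + 1e-9
    root = Cell((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0, half)
    for p in Y:
        root.insert(p)
    parts = [repulsion(root, p, theta) for p in Y]
    Z = sum(z for _, z in parts)                       # estimate of Z
    return [[f[0] / Z, f[1] / Z] for f, _ in parts]    # F_rep = (F_rep Z) / Z
```

With `theta=0.0` no cell is ever accepted as a summary, so the traversal visits every leaf and the result matches the exact quadratic-time sum; larger values trade accuracy for speed.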
(It should be noted that the dual-tree algorithm is, however, much faster in approximating the value of the t-SNE cost function.) The results of our experiments with dual-tree algorithms are presented in the appendix.

5 Experiments

We performed experiments on four large data sets to evaluate the performance of Barnes-Hut-SNE. Code for our algorithm is available from http://homepage.tudelft.nl/19j49/tsne .

Data sets. We performed experiments on four data sets: i) the MNIST data set, ii) the CIFAR-10 data set, iii) the NORB data set, and iv) the TIMIT data set. The MNIST data set contains $N = 70{,}000$ grayscale handwritten digit images of size $D = 28 \times 28 = 784$ pixels, each of which corresponds to one of ten classes. The CIFAR-10 data set [14] is an annotated subset of the 80 million tiny images data set [24] that contains $N = 70{,}000$ RGB images of size $32 \times 32$ pixels, leading to $D = 32 \times 32 \times 3 = 3{,}072$-dimensional input objects; each image corresponds to one of ten classes. The (small) NORB data set [16] contains grayscale images of toys from five different classes, rendered on a uniform background under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 degrees every 20 degrees). All images contain $D = 96 \times 96 = 9{,}216$ pixels. The TIMIT data set contains speech data from which MFCC, delta, and delta-delta features were extracted, leading to $D = 39$-dimensional features [22]; each frame in the data has one of 39 phone labels. We used the TIMIT training set of $N = 1{,}105{,}455$ frames in our experiments.

Experimental setup. In all our experiments, we follow the experimental setup of [25] as closely as possible. In particular, we initialize the embedding points by sampling from a Gaussian with a variance of $10^{-4}$. We run a gradient-descent optimizer for 1,000 iterations, setting the initial step size to 200; the step size is updated during the optimization using the scheme of [13].
We use an additional momentum term that has weight 0.5 during the first 250 iterations, and 0.8 afterwards. The perplexity $u$ is fixed to 30. Following [25], all data sets with a dimensionality $D$ larger than 50 were preprocessed using PCA to reduce their dimensionality to 50.

During the first 250 learning iterations, we multiplied all $p_{ij}$-values by a user-defined constant $\alpha > 1$. As explained in [25], this trick enables t-SNE to find a better global structure in the early stages of the optimization. In preliminary experiments, we found that this trick becomes increasingly important to obtain good embeddings when the data set size increases, as it becomes harder for the optimization to find a good global structure when there are more points in the embedding because there is less space for clusters to move around. In our experiments, we fix $\alpha = 12$ (by contrast, [25] used $\alpha = 4$).

Figure 3: Computation time (in seconds) required to embed MNIST digits (left panel, "Computation time") and the 1-nearest neighbor errors of the corresponding embeddings (right panel, "Nearest neighbor error") as a function of data set size $N$ for both standard t-SNE and Barnes-Hut-SNE. Note that the required computation time, which is shown on the $y$-axis of the left panel, is plotted on a logarithmic scale.

We present the results of three sets of experiments. In the first experiment, we investigate the effect of the trade-off parameter $\theta$ on the speed and the quality of embeddings produced by Barnes-Hut-SNE on the MNIST data set. In the second experiment, we investigate the computation time required to run Barnes-Hut-SNE as a function of the number of data objects $N$ (also on the MNIST data set). In the third experiment, we construct and visualize embeddings of all four data sets.

Results. Figure 2 presents the results of an experiment in which we varied the speed-accuracy trade-off parameter $\theta$ used to learn the embedding.
The figure shows the computation time required to construct embeddings of all 70,000 MNIST digit images, as well as the 1-nearest neighbor error (computed based on the digit labels) of the corresponding embeddings. The results presented in the figure show that the trade-off parameter $\theta$ may be increased to a value of approximately 0.5 without negatively affecting the quality of the embedding. At the same time, increasing the value of $\theta$ to 0.5 leads to very substantial improvements in terms of the amount of computation required: the time required to embed all 70,000 MNIST digits is reduced to just 645 seconds when $\theta = 0.5$. (Note that the special case $\theta = 0$ corresponds to standard t-SNE [25]; we did not run an experiment with $\theta = 0$ because standard t-SNE would take days to complete on the full MNIST data set.)

In Figure 3, we compare standard t-SNE and Barnes-Hut-SNE in terms of i) the computation time required for the embedding of MNIST digit images as a function of the data set size $N$ and ii) the 1-nearest neighbor errors of the corresponding embeddings. (Note that the $y$-axis of the left panel, which represents the required computation time in seconds, uses a logarithmic scale.) In the experiments, we fixed the parameter $\theta$ that trades off speed and accuracy to 0.5. The results presented in the figure show that Barnes-Hut-SNE is orders of magnitude faster than standard t-SNE, whilst the difference in quality of the constructed embeddings (which is measured by the nearest-neighbor errors) is negligible. Most prominently, the computational advantages of Barnes-Hut-SNE rapidly increase as the number of objects in the data set $N$ increases.

Figure 4 presents embeddings of all four data sets constructed using Barnes-Hut-SNE. The colors of the points indicate the classes of the corresponding objects; the titles of the plots indicate the computation time that was used to construct the corresponding embeddings. As before, we fixed $\theta = 0.5$ in all four experiments. The results in the figure show that Barnes-Hut-SNE can construct high-quality embeddings of, e.g., the 70,000 MNIST handwritten digit images in just over 10 minutes. (Although our MNIST embedding contains many more points, it may be compared with that in [25]. Visually, the structure of the two embeddings is very similar.) The results also show that Barnes-Hut-SNE makes it practical to embed data sets with more than a million data points: the TIMIT embedding shows all 1,105,455 data points, and was constructed in less than four hours.

A version of the MNIST embedding in which the original digit images are shown is presented in Figure 5. The results show that, like standard t-SNE, Barnes-Hut-SNE is very good at preserving local structure of the data in the embedding: for instance, the visualization clearly shows that orientation is one of the main sources of variation within the cluster of ones.

Figure 4: Barnes-Hut-SNE visualizations of four data sets: MNIST handwritten digits (top-left), CIFAR-10 tiny images (top-right), NORB object images (bottom-left), and TIMIT speech frames (bottom-right). The colors of the points indicate the classes of the corresponding objects. The titles of the figures indicate the computation time that was used to construct the corresponding embeddings. Figure best viewed in color.

6 Conclusion and Future Work

We presented a new t-SNE algorithm [25], called Barnes-Hut-SNE, that i) constructs a sparse approximation to the similarities between input objects using vantage-point trees, and ii) approximates the t-SNE gradient using a variant of the Barnes-Hut algorithm. The new algorithm runs in $O(N \log N)$ rather than $O(N^2)$, and requires only $O(N)$ memory. Our experimental evaluation of Barnes-Hut-SNE shows that it is substantially faster than standard t-SNE, and that it facilitates the visualization of data sets with millions of data objects in scatter plots.
A drawback of Barnes-Hut-SNE is that it does not provide any error bounds [21]. Indeed, there exist alternative algorithms that do provide such error bounds (e.g., [28]); we aim to explore these alternatives in future work to see whether they can be used to bound the error made in our t-SNE gradient computations, and to bound the error in the final embedding. Another limitation of Barnes-Hut-SNE is that it can only be used to embed data in two or three dimensions. Generalizations to higher dimensions are infeasible because the size of the tree grows exponentially in the dimensionality of the embedding space. Having said that, this limitation is not very severe since t-SNE is mainly used for visualization (i.e., for embedding in two or three dimensions). Moreover, it is relatively straightforward to replace the quadtree by metric trees that scale better to high-dimensional spaces.

Figure 5: Barnes-Hut-SNE visualization of all 70,000 MNIST handwritten digit images (constructed in 10 minutes and 45 seconds). Zoom in on the visualization for more detailed views.

In future work, we plan to further scale up our algorithm by developing parallelized implementations that can run on data sets that are too large to be fully stored in memory. We also aim to investigate the effect of varying the value of $\theta$ during the optimization. In addition, we plan to explore to what extent adapted versions of our algorithm (that use metric trees instead of quadtrees) can be used to speed up techniques for relational embedding (e.g., [4, 18]).

Acknowledgments

The author is supported by EU-FP7 Social Signal Processing (SSPNet) and by the Netherlands Institute for Advanced Study (NIAS). The author thanks Geoffrey Hinton for many helpful discussions, and two anonymous reviewers for their helpful comments.

References

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, 1986.
[2] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173–189, 1972.
[3] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the International Conference on Machine Learning, pages 97–104, 2006.
[4] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. In Proceedings of the 25th Conference on Artificial Intelligence (AAAI), 2011.
[5] M. Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In Proceedings of the 27th International Conference on Machine Learning, pages 167–174, 2010.
[6] N. DeFreitas, Y. Wang, M. Mahdaviani, and D. Lang. Fast Krylov methods for N-body learning. In Advances in Neural Information Processing Systems, volume 18, pages 251–258, 2006.
[7] A.G. Gray and A.W. Moore. N-body problems in statistical learning. In Advances in Neural Information Processing Systems, pages 521–527, 2001.
[8] A.G. Gray and A.W. Moore. Rapid evaluation of multiple density models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2003.
[9] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.
[10] J. Heer, M. Bostock, and V. Ogievetsky. A tour through the visualization zoo. Communications of the ACM, 53:59–67, 2010.
[11] G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15, pages 833–840, 2003.
[12] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of 30th Symposium on Theory of Computing, 1998.
[13] R.A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[15] N.D. Lawrence. Spectral dimensionality reduction via maximum entropy. In Proceedings of the International Conference on Artificial Intelligence and Statistics, JMLR W&CP, volume 15, pages 51–59, 2011.
[16] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 97–104, 2004.
[17] T. Liu, A.W. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems, volume 17, pages 825–832, 2004.
[18] A. Paccanaro and G.E. Hinton. Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232–244, 2001.
[19] V. Rokhlin. Rapid solution of integral equations of classic potential theory. Journal of Computational Physics, 60:187–207, 1985.
[20] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
[21] J.K. Salmon and M.S. Warren. Skeletons from the treecode closet. Journal of Computational Physics, 111(1):136–155, 1994.
[22] F. Sha and L.K. Saul. Large margin hidden Markov models for automatic speech recognition. In Advances in Neural Information Processing Systems, volume 19, pages 1249–1456, 2007.
[23] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[24] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[25] L.J.P. van der Maaten and G.E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2431–2456, 2008.
[26] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11(Feb):451–490, 2010.
[27] M. Vladymyrov and M. Á. Carreira-Perpiñán. Partial-Hessian strategies for fast learning of nonlinear embeddings. In Proceedings of the International Conference on Machine Learning, pages 345–352, 2012.
[28] M.S. Warren and J.K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of the ACM/IEEE Conference on Supercomputing, pages 12–21, 1993.
[29] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760, 2008.
[30] C. Yang, R. Duraiswami, N.A. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 664–671, 2003.
[31] P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 311–321, 1993.

A Experiments with Dual-Tree t-SNE

We also performed experiments with a dual-tree implementation [7] of t-SNE. Dual-tree t-SNE differs from Barnes-Hut-SNE in that it considers only cell-cell instead of point-cell interactions. It simultaneously traverses the quadtree twice, and decides for each pair of nodes whether the interaction between these nodes can be used as a "summary" for all points in the cells corresponding to these two nodes.
We use the following condition to check whether the interaction between a pair of nodes may be used as a "summary" interaction:

$$\frac{\max\left(r_{\text{cell}_1}, r_{\text{cell}_2}\right)}{\|y_{\text{cell}_1} - y_{\text{cell}_2}\|^2} < \rho, \tag{10}$$

where $y_{\text{cell}_1}$ and $y_{\text{cell}_2}$ represent the centers-of-mass of the two cells, $r_{\text{cell}_1}$ and $r_{\text{cell}_2}$ represent the diameters of the two cells, and $\rho$ is a speed-accuracy trade-off parameter (similar to $\theta$ in Barnes-Hut-SNE).

Figure 6 presents the results of an experiment in which we investigate the influence of the trade-off parameter $\rho$ on the learning time and the quality of the embedding on the MNIST data set. The results in the figure may be readily compared to those in Figure 2. They show that, whilst the dual-tree algorithm provides additional speed-ups compared to the Barnes-Hut algorithm, the quality of the embedding also deteriorates much faster as the trade-off parameter $\rho$ increases. The quality of the embedding obtained with a dual-tree algorithm with $\rho = 0.25$ roughly equals that of a Barnes-Hut embedding with $\theta = 0.5$, and these two embeddings are constructed in roughly the same time (viz. in approximately 650–700 seconds).

Figure 7 shows the performance of dual-tree t-SNE with $\rho = 0.25$ as a function of the number of MNIST digits $N$. The results in Figure 7 can be readily compared to those in Figure 3. Again, the results show that dual-tree t-SNE performs roughly on par with Barnes-Hut-SNE, irrespective of the size of the data set $N$.

Figure 6: Computation time (in seconds) required to embed 70,000 MNIST digits using dual-tree t-SNE (left panel, "Computation time") and the 1-nearest neighbor errors of the corresponding embeddings (right panel, "Nearest neighbor error") as a function of the trade-off parameter $\rho$. The results may be compared to those in Figure 2.

Figure 7: Computation time (in seconds) required to embed MNIST digits (left panel, "Computation time") and the 1-nearest neighbor errors of the corresponding embeddings (right panel, "Nearest neighbor error") as a function of data set size $N$ for both standard t-SNE and dual-tree t-SNE. The results may be compared to those in Figure 3.
