A Unified Algorithmic Framework for Multi-Dimensional Scaling

A Uniﬁed Algo rithmic F ramew o rk fo r Multi-Dimensional Scaling Arvind Agarwal ∗ Jeff M. Phillips † Suresh V enkatasubramanian ‡ Abstract In this paper , we propose a uniﬁed algorithmic framework for solving many known variants of MDS . Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular ; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily . In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time. W e expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower di- mensional sphere, preserving geodesic distances. As a compliment to this result, we also extend the Johnson- Lindenstrauss Lemma to this spherical setting, where projecting to a random O (( 1 /" 2 ) log n ) -dimensional sphere causes " -distortion. 1 Intro duction Multidimensional scaling ( MDS ) [ 23, 10, 3 ] is a widely used method for embedding a general distance matrix into a low dimensional Euclidean space, used both as a preprocessing step for many problems, as well as a visualization tool in its own right. MDS has been studied and used in psychology since the 1930s [ 35, 33, 22 ] to help visualize and analyze data sets where the only input is a distance matrix. More recently MDS has become a standard dimensionality reduction and embedding technique to manage the complexity of dealing with large high dimensional data sets [ 8, 9, 31, 6 ] . In general, the problem of embedding an arbitrary distance matrix into a ﬁxed dimensional Euclidean space with minimum error is nonconvex (because of the dimensionality constraint). Thus, in addition to the stan- dard formulation [ 12 ] , many variants of MDS have been proposed, based on changing the underlying error function [ 35, 8 ] . There are also applications where the target space, rather than being a Euclidean space, is a manifold (e.g. a low dimensional sphere), and various heuristics for MDS in this setting have also been proposed [ 13, 6 ] . Each such variant is typically addressed by a different heuristic, including majorization, the singular value de- composition, semideﬁnite programming, subgradient methods, and standard Lagrange-multipler-based meth- ods (in both primal and dual settings). Some of these heuristics are efﬁcient, and others are not; in general, every new variant of MDS seems to require different ideas for efﬁcient heuristics. 1.1 Our W o rk In this paper , we present a uniﬁed algorithmic framework for solving many variants of MDS . Our approach is based on an iterative local improvement method, and can be summarized as follows: “Pick a point and move it so that the cost function is locally optimal. Repeat this process until convergence.” The improvement step reduces to a well-studied and efﬁcient family of iterative minimization techniques, where the speciﬁc algorithm depends on the variant of MDS . A central result of this paper is a single general convergence result for all variants of MDS that we examine. This single result is a direct consequence of the way in which we break down the general problem into an ∗ P artially supported by NSF IIS-0712764 † Supported by a subaward to the University of Utah under NSF award 0937060 to the Computing Research Association ‡ P artially supported by NSF CCF -0953066 1 iterative algorithm combined with a point-wise optimization scheme. Our approach is generic, efﬁcient, and simple. The high level framework can be written in 10-12 lines of MA TLAB code, with individual function- speciﬁc subroutines needing only a few more lines each. Further , our approach compares well with the best methods for all the variants of MDS . In each case our method is consistently either the best performer or is close to the best, regardless of the data proﬁle or cost function used, while other approaches have much more variable performance. Another useful feature of our method is that it is parameter-free, requiring no tuning parameters or Lagrange multipliers in order to perform at its best. Spherical MDS. An important application of our approach is the problem of performing spherical MDS . Spherical MDS is the problem of embedding a matrix of distances onto a (low-dimensional) sphere. Spheri- cal MDS has applications in texture mapping and image analysis [ 6 ] , and is a generalization of the spherical dimensionality reduction problem, where the goal is to map points from a high dimensional sphere onto a low- dimensional sphere. This latter problem is closely related to dimensionality reduction for ﬁnite dimensional dis- tributions. A well-known isometric embedding takes a distribution represented as a point on the d -dimensional simplex to the d -dimensional sphere while preserving the Hellinger distance between distributions. A spherical dimensionality reduction result is an important step to representing high dimensional distributions in a lower- dimensional space of distributions, and will have considerable impact in domains that represent data natively as histograms or distributions, such as in document processing [ 29, 18, 2 ] , image analysis [ 25, 11 ] and speech recognition [ 16 ] . Our above framework applies directly to this setting, where for the local improvement step we adapt a technique ﬁrst developed by Karcher for ﬁnding geodesic means on a manifold. In addition, we prove a Johnson-Lindenstrauss-type result for the sphere; namely , that n points lying on a d -dimensional sphere can be embedded on a O (( 1 /" 2 ) log n ) -dimensional sphere while approximately preserving the geodesic distances between pairs of points, that is, no distance changes by more than a relative ( 1 + " ) -factor . This latter result can be seen as complementary to the local improvement scheme; the formal embedding result guarantees the error while being forced to use log n dimensions, while the local improvement strategy generates a mapping into any k dimensional hypersphere but provides no formal guarantees on the error . Summa ry of contributions. The main contributions of this paper can be summarized as follows: • In Section 4 we present our iterative framework, illustrate how it is applied to speciﬁc MDS variants and prove a convergence result. • In Section 5 we present a comprehensive experimental study that compares our approach to the prior best known methods for different MDS variants. • In Section 6 we prove a formal dimensionality reduction result that embeds a set of n points on a high- dimensional sphere into a sphere of dimension O (( 1 /" 2 ) log n ) while preserving all distances to within relative error of ( 1 + " ) for any " > 0. 2 Background and Existing Metho ds Multidimensional scaling is a family of methods for embedding a distance matrix into a low-dimensional Eu- clidean space. There is a general taxonomy of MDS methods [ 10 ] ; in this paper we will focus primarily the metric and generalized MDS problems. The traditional formulation of MDS [ 23 ] assumes that the distance matrix D arises from points in some d - dimensional Euclidean space. Under this assumption, a simple transformation takes D to a matrix of similarities S , where s i j = 〈 x i , x j 〉 . These similarities also arise from many psychology data sets directly [ 33, 35 ] . The 2 problem then reduces to ﬁnding a set of points X in k -dimensional space such that X X T approximates S . This can be done optimally using the top k singular values and vectors from the singular value decomposition of S . A more general approach called SMACOF that drops the Euclidean assumption uses a technique known as stress majorization [ 27, 12, 13 ] . It has been adapted to many other MDS variants as well including restrictions of data to lie on quadratic surfaces and speciﬁcally spheres [ 13 ] . Since the sum-of -squares error metric is sensitive to outliers, Cayton and Dasgupta [ 8 ] proposed a robust variant based on an ` 1 error metric. They separate the rank and cost constraints, solving the latter using either semideﬁnite programming or a subgradient heuristic, followed by a singular value decomposition to enforce the rank constraints. Many techniques have been proposed for performing spherical MDS . Among them are majorization methods ( [ 30 ] and SMACOF -Q [ 13 ] ), a multiresolution approach due to Elad, K eller , and Kimmel [ 14 ] and an approach based on computing the classical MDS and renormalizing [ 31 ] . Emb eddings that guarantee b ounded error. A complementary line of work in dimensionality reduction ﬁxes an error bound for every pair of distances (rather than computing an average error), and asks for the minimum dimension a data set can be embedded in while maintaining this error . The Johnson-Lindenstrauss Lemma [ 19 ] states that any collection of n points in a Euclidean space can be embedded in a O (( 1 /" 2 ) log n ) dimensional Euclidean space that preserves all distances within a relative error of " . If the points instead deﬁne an abstract metric space, then the best possible result is an embedding into O ( log n ) -dimensional Euclidean space that preserves distances up to a factor of O ( log n ) . An exhaustive survey of the different methods for dimensionality reduction is beyond the scope of this paper - the reader is directed to the survey by Indyk and Matousek for more information [ 17 ] . The Johnson-Lindenstrauss Lemma can be extended to data lying on manifolds. Any manifold M with “lin- earization dimension” k (a measure of its complexity) can be embedded into a O (( 1 /" 2 ) k log ( k n )) dimensional space so that all pairwise Euclidean distances between points on M are distorted by at most a relative ( 1 + " ) - factor [ 1, 32, 26 ] . A k -dimensional sphere has linearization dimension O ( k ) , so this bound applies directly for preserving the chordal (i.e. Euclidean) distance between points on a sphere. The geodesic distance between points on a sphere can be interpreted as the angle between the points in radians, and a result by Magen [ 26 ] show that O (( 1 /" 2 ) log n ) dimensions preserve angles to within a relative factor of 1 + p " (which is weaker than our result preserving the geodesic distance to within a relative factor of ( 1 + " ) ). 3 Deﬁnitions Let D = ( d i j ) be an n × n matrix representing distances between all pairs of points in a set Y = { y 1 , . . . y n } . In general, we assume that D is symmetric d i j = d j i , although our method does not formally require this. The multidimensional scaling problem takes as input Y , D and k , and asks for a mapping µ : Y → X from Y to a set of points X in a k -dimensional space T such that the difference between the original and resulting distances is minimized. There are many different ways to measure the difference between the sets of distances, and these can be captured by the following general function C ( X , D ) = X i X j Err ( f ( x i , x j ) − d i j ) where Err measures the discrepancy between the source and target distances, and f denotes the function that measures distance in the target space. • T = R k , Err ( δ ) = δ 2 , f ( x , x 0 ) = k x − x 0 k 2 : This is a general form of the MDS problem, which we refer to as f MDS . 3 • T = R k , Err ( δ ) = | δ | , f ( x , x 0 ) = k x − x 0 k 2 : This is a robust variant of MDS called r MDS , ﬁrst suggested by Cayton and Dasgupta [ 8 ] . • T = S ¯ k , Err ( δ ) = | δ | or δ 2 , f ( x , x 0 ) is either chordal (c) or geodesic distance (g) on S ¯ k . W e refer to this family of problems as { c,g } - { 1,2 } -s MDS . It will be convenient to split the expression into component terms. W e deﬁne C i ( X , D , x i ) = X j Err ( f ( x i , x j ) − d i j ) which allows us to write C ( X , D ) = P i C i ( X , D , x i ) . Notes. The actual measure studied by Cayton and Dasgupta [ 8 ] is not r MDS . It is a variant which takes the absolute difference of the squared distance matrices. W e call this measure r 2 MDS . Also, classical MDS does not appear in this list since it tries to minimize the error between similarities rather than distances. W e refer to this measure as c MDS . 4 Algo rithm W e now present our algorithm P L A C E C E N T E R ( X , D ) that ﬁnds a mapping Y → X minimizing C ( X , D ) . F or now we assume that we are given an initial embedding X 1 ∈ R ¯ k to seed our algorithm. Our experiments indicate the SVD -based approach [ 35 ] is almost always the optimal way to seed the algorithm, and we use it unless speciﬁcally indicated otherwise. Algorithm 1 P L A C E C E N T E R (D) R un any MDS strategy to obtain initial seed X . repeat " ← C ( X , D ) for i = 1 to n do x i ← P L A C E i ( X , D ) { this updates x i ∈ X } end for until ( " − C ( X , D ) < t ) { for a ﬁxed threshold t } return X P L A C E C E N T E R operates by employing a technique from the block-relaxation class of heuristics. The cost function can be expressed as a sum of costs for each point x i , and so in each step of the inner loop we ﬁnd the best placement for x i while keeping all other points ﬁxed, using the algorithm P L A C E i ( X , D ) . A key insight driving our approach is that P L A C E i ( X , D ) can be implemented either iteratively or exactly for a wide class of distance functions. The process terminates when over all i , invoking P L A C E i ( X , D ) does not reduce the cost C ( X , D ) by more than a threshold t . The algorithm takes O ( n 2 ) for each iteration, since P L A C E i ( X , D ) will take O ( n ) time and computing C ( X , D ) takes O ( n 2 ) time. 4.1 A Geometric P ersp ective On Place i ( X , D ) The routine P L A C E i ( X , D ) is the heart of our algorithm. This routine ﬁnds the optimal placement of a ﬁxed point x i with respect to the cost function C i ( X , D , x i ) = P j Err ( f ( x i , x j ) − d i j ) . Set r j = d i j . Then the optimal placement of x i is given by the point x ∗ minimizing the function g ( x ) = X j Err ( f ( x , x j ) − r j ) . 4 x 1 x 2 x ˆ x 1 ˆ x 2 r 1 r 2 Figure 1: A geometric interpretation of the erro r term g ( x ) . Note that the terms f ( x , x i ) and r i = d i i are zero, so we can ignore their presence in the summation for ease of notation. There is a natural geometric interpretation of g ( x ) , illustrated in Figure 1. Consider a sphere around the point x j of radius r j . Let ˆ x j be the point on this sphere that intersects the ray from x j towards x . Then the distance f ( x , ˆ x j ) = | f ( x , x j ) − r j | . Thus, we can rewrite g ( x ) as g ( x ) = X j Err ( f ( x , ˆ x j )) . This function is well-known in combinatorial optimization as the min-sum problem. F or Err ( δ ) = δ 2 , g ( x ) ﬁnds the point minimizing the sum-of-squared distances from a collection of ﬁxed points (the 1-mean), which is the centroid x ∗ = 1 n P j ˆ x j . F or Err ( δ ) = | δ | , g ( x ) ﬁnds the 1-median, the point minimizing the sum of distances from a collection of ﬁxed points. Although there is no closed form expression for the 1-median, there are numerous algorithms for solving this problem both exactly [ 34 ] and approximately [ 4 ] . Methods that converge to the global optimum exist for any Err ( δ ) = | δ | p , p ≤ 2; it is known that if p is sufﬁciently larger than 2, then convergent methods may not exist [ 5 ] . While g ( x ) can be minimized optimally for error functions Err of interest, the location of the points ˆ x j depends on the location of the solution x ∗ , which is itself unknown! This motivates an alternating optimization procedure, where the current iterate x is used to compute ˆ x j , and then these ˆ x j are used as input to the min-sum problem to solve for the next value of x . Algorithm 2 P L A C E i ( X , D ) repeat " ← g ( x i ) for j = 1 to n do ˆ x j ← intersection of sphere of radius r j around x j with ray from x j towards x i end for x i ← R E C E N T E R ( { ˆ x 1 , ˆ x 2 , . . . , ˆ x n } ) until ( " − g ( x i ) < t ) { for a ﬁxed threshold t } return x i 4.2 Implementing Recenter Up to this point, the description of P L A C E C E N T E R and P L A C E has been generic, requiring no speciﬁcation of Err and f . In fact, all the domain-speciﬁcity of the method appears in R E C E N T E R , which solves the min-sum 5 problem. W e now demonstrate how different implementations of R E C E N T E R allow us to solve the different variants of MDS discussed above. 4.2.1 The original MDS: f MDS R ecall from Section 3 that the f MDS problem is deﬁned by Err ( δ ) = δ 2 and f ( x , x 0 ) = k x − x 0 k 2 . Thus, g ( x ) = P j k x − ˆ x j k 2 . As mentioned earlier , the minimum of this function is attained at x ∗ = ( 1 / n ) P j ˆ x j . Thus, R E C E N T E R ( { ˆ x 1 , ˆ x 2 , . . . , ˆ x n } ) merely outputs ( 1 / n ) P j ˆ x j , and takes O ( n ) time per invocation. 4.2.2 Robust MDS: r MDS The robust MDS problem r MDS is deﬁned by Err ( δ ) = | δ | and f ( x , x 0 ) = k x − x 0 k 2 . Minimizing the resulting function g ( x ) yields the famous F ermat- W eber problem, or the 1-median problem as it is commonly known. An exact iterative algorithm for solving this problem was given by W eiszfeld [ 34 ] , and works as follows. At each step of P L A C E i the value x i is updated by x i ← X j ˆ x j k x i − ˆ x j k , X j 1 k x i − ˆ x j k . This algorithm is guaranteed to converge to the optimal solution [ 24, 28 ] , and in most settings converges quadratically [ 21 ] . Other no rms and distances. If Err ( δ ) = | δ | p , 1 < p < 2, then an iterative algorithm along the same lines as the W eiszfeld algorithm can be used to minimize g ( x ) optimally [ 5 ] . In practice, this is the most interesting range of values for p . It is also known that for p sufﬁciently larger than 2, this iterative scheme may not converge. W e also can tune P L A C E C E N T E R to the r 2 MDS problem (using squared distances) by setting r j = d 2 i j . 4.2.3 Spherical MDS Spherical MDS poses special challenges for the implementation of R E C E N T E R . Firstly , it is no longer obvious what the deﬁnition of ˆ x j should be, since the “spheres” surrounding points must also lie on the sphere. Secondly , consider the case where Err ( δ ) = δ 2 , and f ( x , x 0 ) is given by geodesic distance on the sphere. Unlike in the case of R k , we no longer can solve for the minimizer of g ( x ) by computing the centroid of the given points, because this centroid will not in general lie on the sphere, and even computing the centroid followed by a projection onto the sphere will not guarantee optimality . The ﬁrst problem can be solved easily . Rather than draw spheres around each x j , we draw geodesic spheres , which are the set of points at a ﬁxed geodesic distance from x j . On the sphere, this set of points can be easily described as the intersection of an appropriately chosen halfplane with the sphere. Next, instead of computing the intersection of this geodesic sphere with the ray from x j towards the current estimate of x i , we compute the intersection with a geodesic ray from x j towards x i . The second problem can be addressed by prior work on computing min-sums on manifolds. Karcher [ 20 ] proposed an iterative scheme for the geodesic sum-of-squares problem that always converges as long as the points do not span the entire sphere. His work extends (for the same functions Err , f ) to points deﬁned on more general Riemannian manifolds satisfying certain technical conditions. It runs in O ( n ) time per iteration. F or the robust case ( Err ( δ ) = | δ | ), the Karcher scheme no longer works. F or this case, we make use of a W eiszfeld-like adaption [ 15 ] that again works on general Riemannian manifolds, and on the sphere in particu- lar . Like the W eiszfeld scheme, this approach takes O ( n ) time per iteration. 6 4.3 Convergence Pro ofs Here we prove that each step of P L A C E C E N T E R converges as long as the recursively called procedures reduce the relevant cost functions. Convergence is deﬁned with respect to a cost function κ , so that an algorithm converges if at each step κ decreases until the algorithm terminates. Theorem 4.1. If each call to ˜ x i ← P L A C E i ( X , D ) decreases the cost C i ( X , D , x i ) , then P L A C E C E N T E R ( D ) converges with respect to C ( · , D ) . Proof. Let ˜ X ← P L A C E i ( X , D ) result from running an iteration of P L A C E i ( X , D ) . Let ˜ X = { x 1 , . . . , x i − 1 , ˜ x i , x i + 1 , . . . , x n } . Then we can argue C ( X , D ) − C ( ˜ X , D ) = 2 X j = 1 Err ( f ( x i , x j ) − d i , j ) − 2 X j = 1 Err ( f ( ˜ x i , x j ) − d i , j ) = 2 C i ( X , D , x i ) − 2 C i ( ˜ X , D , ˜ x i ) > 0. The last line follows because X and ˜ X only differ at x i versus ˜ x i , and by assumption on P L A C E i ( X , D ) , this sub-cost function must otherwise decrease. Theorem 4.2. If each call x i ← R E C E N T E R ( ˆ X ) reduces P n j = 1 f ( x i , ˆ x j ) p , then P L A C E i ( X , D , x i ) converges with respect to C i ( X , D , · ) . Proof. First we can rewrite C i ( X , D , x i ) = n X j = 1 Err ( f ( x i , x j ) − d i , j ) = n X j = 1 Err (( f ( x i , ˆ x j ) + d i , j ) − d i , j ) = n X j = 1 Err ( f ( x i , ˆ x j )) . Since Err ( f ( x i , ˆ x j )) measures the distance to the sphere ◦ j , choosing x 0 i to minimize (or decrease) P n j = 1 Err ( f ( x 0 i , ˆ x j )) must decrease the sum of distances to each point ˆ x j on each sphere ◦ j . Now let ˆ x 0 j be the closest point to x 0 i on ◦ j . Err ( f ( x 0 i , ˆ x 0 j )) ≤ Err ( f ( x 0 i , x j )) and thus C i ( X , D , x 0 i ) = n X j = 1 Err ( f ( x 0 i , ˆ x 0 j )) ≤ n X j = 1 Err ( f ( x 0 i , ˆ x 0 j )) ≤ n X j = 1 Err ( f ( x i , ˆ x j )) = C i ( X , D , x i ) where equality only holds if x i = x 0 i , in which case the algorithm terminates. 5 Exp eriments In this section we evaluate the performance of P L A C E C E N T E R (PC). Since PC generalizes to many different cost functions, we compare it with the best known algorithm for each cost function, if one exists. F or the f MDS problem the leading algorithm is SMACOF [ 13 ] ; for the r 2 MDS problem the leading algorithm is by Cayton and Dasgupta (CD) [ 8 ] . W e know of no previous scalable algorithm designed for r MDS . W e note that the Cayton-Dasgupta algorithm REE does not exactly solve the r 2 MDS problem. Instead, it takes a non-Euclidean distance matrix and ﬁnds a Euclidean distance matrix that minimizes the error without any rank restrictions. Thus, as suggested by the authors [ 8 ] , to properly compare the algorithms, we let CD refer to running REE and then projecting the result to a k -dimensional subspace using the SVD technique [ 35 ] (our plots show this projection after each step). With regards to each of these Euclidean measures we compare our algorithm with SMACOF and CD. W e also compare with the popular SVD -based method [ 35 ] , which solves the related c MDS problem based on similarities, by seeding all three iterative techniques with the results of the closed-form SVD -based solution. 7 Then we consider the family of spherical MDS problems { c,g } - { 1,2 } -s MDS . W e compare against a version of SMACOF -Q [ 13 ] that is designed for data restricted to a low dimensional sphere, speciﬁcally for the c-2- SMDS measure. W e compare this algorithm to ours under the c-2- SMDS measure (for a fair comparison with SMACOF -Q) and under the g-1- SMDS measure which is the most robust to noise. The subsections that follow focus on individual cost measures. W e then discuss the overall behavior of our algorithm in Section 5.5. Data sets, co de, and setup. T est inputs for the algorithms are generated as follows. W e start with input consisting of a random point set with n = 300 points in R ¯ d for d = 200, with the target space T = R ¯ k with k = 10. Many data sets in practice have much larger parameters n and d , but we limit ourselves to this range because for larger values CD becomes prohibitively slow . The data is generated to ﬁrst lie on a k -dimensional subspace, and then (full-dimensional) P oisson noise is applied to all points up to a magnitude of 30% of the variation in any dimension. Finally , we construct the Euclidean distance matrix D which is provided as input to the algorithms. These data sets are Euclidean, but “close” to k -dimensional. T o examine the behavior of the algorithms on distance matrices that are non-Euclidean, we generate data as before in a k -dimensional subspace and generate the resulting distance matrix D . Then we perturb a fraction of the elements of D (rather than perturbing the points) with P oisson noise. The fraction perturbed varies in the set ( 2%, 10%, 30%, 90% ) . All algorithms were implemented in MA TLAB. F or SMACOF , we used the implementation provided by Bron- stein [ 7 ] , and built our own implementation of SMACOF -Q around it. F or all other algorithms, we used our own implementation 1 . In all cases, we compare performance in terms of the error function Err as a function of clock time. 5.1 The r MDS Problem Figure 2(a) shows the cost function Err associated with r MDS plotted with respect to runtime. P L A C E C E N T E R always reaches the best local minimum, partially because only P L A C E C E N T E R can be adjusted for the r MDS problem. W e also observe that the runtime is comparable to SMACOF and much faster than CD in order to get to the same Err value. Although SMACOF initially reaches a smaller cost that PC, it later converges to a larger cost because it optimizes a different cost function (f MDS ). W e repeat this experiment in Figure 2(b) for different values of k (equal to { 2, 20, 50, 150 } ) to analyze the performance as a function of k . Note that PC performs even better for lower k in relation to CD. This is likely as a result of CD’s reliance on the SVD technique to reduce the dimension. At smaller k , the SVD technique has a tougher job to do, and optimizes the wrong metric. Also for k = 150 note that CD oscillates in its cost; this is again because the REE part ﬁnds a nearby Euclidean distance matrix which may be inherently very high dimensional and the SVD projection is very susceptible to changes in this matrix for such large k . W e observe that SMACOF is the fastest method to reach a low cost, but does not converge to the lowest cost value. The reason it achieves a cost close to that of PC is that for this type of data the r MDS and f MDS cost functions are fairly similar . In Figure 3 we evaluate the effect of changing the amount of noise added to the input distance matrix D , as described above. W e consider two variants of the CD algorithm, one where it is seeded with an SVD -based seed (marked CD + SVD) and one where it is seeded with a random projection to a k -dimensional subspace (marked CD + rand). In both cases the plots show the results of the REE algorithm after SVD -type projections back to a k -dimensional space. The CD + SVD technique consistently behaves poorly and does not improve with further iterations. This probably is because the REE component ﬁnds the closest Euclidean distance matrix which may correspond to points in a much high dimensional space, after which it is difﬁcult for the SVD to help. The CD + rand 1 All of our code may be found at http://www.cs.utah.edu/~arvind/smds.html . 8 0 5 10 15 20 25 30 10 4.2 10 4.3 10 4.4 10 4.5 10 4.6 Time in sec Cost function n=300 d=200 k=10 CD−SVD PC−SVD SMACOF−SVD (a) 0 20 40 10 5 Time in sec Cost function n=300 d=200 k=2 0 50 100 10 4.1 10 4.2 Time in sec Cost function n=300 d=200 k=20 CD−SVD PC−SVD SMACOF−SVD 0 20 40 10 4 Time in sec Cost function n=300 d=200 k=50 0 50 100 10 3 Time in sec Cost function n=300 d=200 k=150 (b) Figure 2: (a) r MDS: T ypical b ehavior of PC, CD and SMA COF. (b) r MDS: Va riation with k = 2, 20, 50, 150 . approach does much better , likely because the random projection initializes the procedure in a reasonably low dimensional space so REE can ﬁnd a relatively low dimension Euclidean distance matrix that is nearby . SMACOF is again the fastest algorithm, but with more noise, the difference between f MDS and r MDS is larger , and thus SMACOF converges to a conﬁguration with much higher cost than PC. W e reiterate that PC consistently converges to the lowest cost solution among the different methods, and consistently is either the fastest or is comparable to the fastest algorithm. W e will see this trend repeated with other cost measures as well. 5.2 The f MDS Problem W e next evaluate the algorithms PC, SMACOF , and CD under the f MDS distance measure. The results are very similar to the r MDS case except now both SMACOF and PC are optimizing the correct distance measure and converge to the same local minimum. SMACOF is still slightly faster that PC, but since they both run very fast, the difference is of the order of less than a second even in the very worst part of the cost / time tradeoff curve shown in Figure 4(a). Note that CD performs poorly under this cost function here except when k = 50. F or smaller values of k , the SVD step does not optimize the correct distance and for larger k the REE part is likely ﬁnding an inherently very high dimensional Euclidean distance matrix, making the SVD projection very noisy . F or the f MDS measure, SMACOF and PC perform very similarly under different levels of noise, both con- verging to similar cost functions with SMACOF running a bit faster , as seen in Figure 4(b). CD consistently runs slower and converges to a higher cost solution. 5.3 The r 2 MDS Problem In this setting we would expect CD to perform consistently as well as PC because both minimize the same cost function. However , this is not always the case because CD requires the SVD step to generate a point set in R ¯ k . As seen in Figure 5 this becomes a problem when k is small ( k = 2, 10). F or medium values of k , CD converges slightly faster than PC and sometimes to a slightly lower cost solution, but again for large k ( = 150), the REE 9 0 20 40 60 80 10 5 10 6 10 7 Time in sec Cost function n=300 d=200 k=10 Noise=2% 0 20 40 60 10 5 10 6 10 7 Time in sec Cost function n=300 d=200 k=10 Noise=10% 0 20 40 60 10 5 10 6 10 7 Time in sec Cost function n=300 d=200 k=10 Noise=30% 0 20 40 60 80 10 6 10 7 Time in sec Cost function n=300 d=200 k=10 Noise=90% CD−SVD PC−SVD SMACOF−SVD CD−rand Figure 3: r MDS: Va riation with noise = 2, 10, 30, 90 . part has trouble handling the amount of error and the solution cost oscillates. SMACOF is again consistently the fastest to converge, but unless k is very large (i.e. k = 150) then it converges to a signiﬁcantly worse solution because the f MDS and r 2 MDS error functions are different. 5.4 The Spherical MDS Problem F or the spherical MDS problem we compare PC against SMACOF -Q, an adaptation of SMACOF to restrict data points to a low-dimensional sphere, and a technique of Elad, Keller and Kimmel [ 14 ] . It turns out that the Elad et.al. approach consistently performs poorly compared to both other techniques, and so we do not display it in our reported results. SMACOF -Q basically runs SMACOF on the original data set, but also adds one additional point p 0 at the center of the sphere. The distance d 0, i between any other point p i and p 0 is set to be 1 thus encouraging all other points to be on a sphere, and this constraint is controlled by a weight factor κ , a larger κ implying a stronger emphasis on satisfying this constraint. Since the solution produced via this procedure may not lie on the sphere, we normalize all points to the sphere after each step for a fair comparison. Here we compare PC against SMACOF -Q in the g-1- SMDS (Figure 6(a)) and the c-2- SMDS (Figure 6(b)) problem. F or g-1- SMDS , PC does not converge as quickly as SMACOF -Q with small κ , but it reaches a better cost value. However , when SMACOF -Q is run with a larger κ , then PC runs faster and reaches nearly the same cost value. F or our input data, the solution has similar g-1- MDS and c-1- MDS cost. When we compare SMACOF -Q with PC under c-2- MDS (Figure 6(b)) then for an optimal choice of κ in SMACOF -Q, both PC and SMACOF -Q perform very similarly , converging to the same cost function and in about the same time. But for larger choices of κ SMACOF -Q does much worse than PC. In both cases, it is possible to ﬁnd a value of κ that allows SMACOF -Q to match PC. However , this value is different for different settings, and varies from input to input. The key observation here is that since PC is parameter-free , it can be run regardless of the choice of input or cost function, and consistently performs well. 5.5 Summa ry Of Results In summary , here are the main conclusions that can be drawn from this experimental study . Firstly , PC is consistently among the top performing methods, regardless of the choice of cost function, the nature of the input, or the level of noise in the problem. Occasionally , other methods will converge faster , but will not in general return a better quality answer , and different methods have much more variable behavior with changing 10 0 1 2 3 4 5 10 5 Time in sec Cost function n=300 d=200 k=2 0 1 2 3 4 5 10 3 10 4 10 5 Time in sec Cost function n=300 d=200 k=20 CD−SVD PC−SVD SMACOF−SVD 0 1 2 3 4 5 10 2 10 3 10 4 10 5 Time in sec Cost function n=300 d=200 k=50 0 1 2 3 4 5 10 0 10 5 Time in sec Cost function n=300 d=200 k=150 (a) 0 10 20 30 10 4 Time in sec Cost function n=300 d=200 k=10 Noise=2% 0 10 20 30 10 4 Time in sec Cost function n=300 d=200 k=10 Noise=10% 0 20 40 60 10 4 10 5 Time in sec Cost function n=300 d=200 k=10 Noise=30% 0 10 20 30 10 5 Time in sec Cost function n=300 d=200 k=10 Noise=90% CD−rand PC−SVD SMACOF−SVD (b) Figure 4: (a) f MDS: V a riation with k = 2, 20, 50, 150 . (b) f MDS: Va riation with noise = 2, 10, 30, 90 . 0 20 40 60 10 5.1 10 5.8 Time in sec Cost function n=300 d=200 k=2 10 20 30 10 5 Time in sec Cost function n=300 d=200 k=20 CD−SVD PC−SVD SMACOF−SVD 0 20 40 10 5 Time in sec Cost function n=300 d=200 k=50 0 20 40 10 3 10 4 10 5 10 6 Time in sec Cost function n=300 d=200 k=150 Figure 5: r 2 MDS: Va riation with k = 2, 20, 50, 150 . inputs and noise levels. 6 A JL Lemma fo r Spherical Data In this section we present a Johnson-Lindenstrauss-style bound for mapping data from a high dimensional sphere to a low-dimensional sphere while preserving the distances to within a multiplicative error of ( 1 + " ) . Consider a set Y ⊂ S ¯ d ⊂ R ¯ d + 1 of n points, deﬁning a distance matrix D where the element d i , j represents the geodesic distance between y i and y j on S ¯ k . W e seek an embedding of Y into S ¯ d that preserves pairwise distances as much as possible. F or a set Y ∈ S ¯ d and a projection π ( Y ) = X ⊂ S ¯ k we say the X has γ -distortion 11 0 10 20 30 40 50 60 1000 1500 2000 2500 3000 3500 4000 4500 Time in sec Cost function n=300 d=200 k=10 PC SMACOF−Q κ =1e2 SMACOF−Q κ =1e5 (a) 0 1 2 3 4 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 x 10 −11 Time in sec Cost function n=300 d=50 k=5 c−2−smds SMACOF−Q κ =1e2 SMACOF−Q κ =1e4 SMACOF−Q κ =1e5 (b) Figure 6: (a) g-1-SMDS: Comparing PC with SMA COF-Q fo r diﬀerent values of p enalt y parameter κ . (b) c-2-SMDS: Comparing PC with SMACOF-Q fo r diﬀerent values of penalty pa rameter κ . from Y if these exists a constant c such that for all x i , x j ∈ X ( 1 − γ ) f ( y i , y j ) ≤ c f ( x i , x j ) ≤ ( 1 + γ ) f ( y i , y j ) . F or a subspace H = R ¯ k , let π H ( Y ) be the projection of Y ∈ R ¯ d onto H and then scaled by d / k . F or X ∈ R ¯ k , let S ( X ) be the projection to S ¯ k − 1 , that is for all x ∈ X , the corresponding point in S ( X ) is x / || x || . When f ( y i , y j ) = || y i − y j || , and Y ∈ R ¯ d , then the Johnson-Lindenstrauss ( JL ) Lemma [ 19 ] says that if H ⊂ R ¯ d is a random k -dimensional linear subspace with k = O (( 1 /" 2 ) log ( n /δ )) , then X = π H ( Y ) has " - distortion from Y with probability at least 1 − δ . W e now present the main result of this section. W e note that recent results [ 1 ] have shown similar results for point on a variety of manifolds (including spheres) where projections preserve Euclidean distances. W e reiterate that our results extend this to geodesic distances on spheres which can be seen as angle ∠ x , y between the vectors to points x , y ∈ S ¯ k . Another recent result [ 26 ] shows that k = O (( 1 /" 2 ) log ( n /δ )) dimensions preserves p " -distortion in angles, which is weaker than the following result. Theorem 6.1. Let Y ⊂ S ¯ d ⊂ R ¯ d + 1 , and let H = R ¯ k + 1 be a random subspace of R ¯ d with k = O (( 1 /" 2 ) log ( n /δ )) with " ∈ ( 0, 1 / 4 ] . Let f ( y i , y j ) measure the geodesic distance on S ¯ d (or S ¯ k as appropriate). Then S ( π H ( Y )) has " -distortion from Y with probability at least 1 − δ . This implies that if we project n data points that lie on any high-dimensional sphere to a low-dimensional sphere S ¯ k with k ∼ log n , then the pairwise distances are each individually preserved. Before we proceed with the proof , we require a key technical lemma. Lemma 6.1. F or " ∈ [ 0, 0.5 ] and x ∈ [ 0, 0.7 ] , (1) sin (( 1 − 2 " ) x ) ≤ ( 1 − " ) sin ( x ) , and (2) sin (( 1 + 2 " ) x ) ≥ ( 1 + " ) s i n ( x ) . 12 origin y i y j x max i x max j x min j x min i Figure 7: Illustration of the b ounds on ∠ x i , x j when || y i − y j || ≤ 1 / 2 . The angle ∠ x max i , x max j is the largest when || x max i || = || x max i || is as small as p ossible (lies on inner circle) and || x max i − x max j || is as large as p ossible (on the outer edges of the disks of diameter " / 8 shifted do wn from dashed line of length || y i − y j || . Bounds for x min i and x min j a re derived symmetrically . Proof. Let g " ( x ) = ( 1 − " ) sin x − sin (( 1 − 2 " ) x ) . W e will show that for x ∈ [ 0, 1 ] and " ∈ [ 0, 0.5 ] , g " ( x ) is concave. This implies that it achieves its minimum value at the boundary . Now g " ( 0 ) = 0 for all " , and it can be easily shown that g " ( 0.7 ) ≥ 0 for " ∈ [ 0, 0.5 ] . This will therefore imply that g " ( x ) ≥ 0 in the speciﬁed range. It remains to show that g " ( x ) is concave in [ 0, 0.7 ] . g 00 " ( x ) = ( 1 − 2 " ) 2 sin (( 1 − 2 " ) x ) − ( 1 − " ) sin x ≤ ( 1 − " )( sin (( 1 − 2 " ) x ) − sin x ) which is always negative for " ∈ [ 0, 0.5 ] and since sin x is increasing in the range [ 0, 0.7 ] . This proves the ﬁrst part of the lemma. F or the second part, observe that h " ( x ) = sin (( 1 + 2 " ) x ) − ( 1 + " ) s i n ( x ) can be rewritten as h " ( x ) = g − " ( − x ) . The rest of the argument follows along the same lines, by showing that h " ( x ) is concave in the desired range using that h 00 " ( x ) = g 00 − " ( − x ) . While the upper bound of 0.7 on x is not tight, it is close. The actual bound (evaluated by direct calculation) is slightly over 0.72. Proof of Theorem 6.1. Let X = π H ( Y ) . W e consider two cases, ( Short Case ) when k y i − y j k ≤ 1 / 2 and ( Long Case ) when k y i − y j k ∈ ( 1 / 2, 2 ] . Short Case: First consider points y i , y j ∈ S ¯ d such that || y i − y j || ≤ 1 / 2. Note that || y i − y j || = 2 sin ( ∠ y i , y j / 2 ) , since || y i || = || y j || = 1. By JL , we know that there exists a constant c such that ( 1 − " / 8 ) || y i − y j || ≤ c || x i − x j || ≤ ( 1 + " / 8 ) || y i − y j || . W e need to compare the angle ∠ x i , x j with that of ∠ y i , y j . The largest ∠ x i , x j can be is when c || x i || = c || x j || = ( 1 − " / 8 ) is as small as possible, and so || c x i − c x j || = ( 1 + " / 8 ) || y i − y j || is as large as possible. See Figure 7. In this case, we have ( || c x i || + || c x j || ) sin ( ∠ x i , x j / 2 ) ≤ || c x i − c x j || 2 ( 1 − " / 8 ) sin ( ∠ x i , x j / 2 ) ≤ ( 1 + " / 8 ) || y i − y j || 2 ( 1 − " / 8 ) sin ( ∠ x i , x j / 2 ) ≤ ( 1 + " / 8 ) 2 sin ( ∠ y i , y j / 2 ) sin ( ∠ x i , x j / 2 ) ≤ 1 + " / 8 1 − " / 8 sin ( ∠ y i , y j / 2 ) , which for " < 4 implies sin ( ∠ x i , x j / 2 ) ≤ ( 1 + " / 2 ) sin ( ∠ y i , y j / 2 ) . 13 Similarly , we can show when ∠ x i , x j is as small as possible (when c || x i || = c || x j || = ( 1 + " ) and || c x i − c x j || = ( 1 − " ) || y i − y j || ), then ( 1 − " / 2 ) sin ( ∠ y i , y j / 2 ) ≤ sin ( ∠ x i , x j / 2 ) . W e can also show (via Lemma 6.1) that since || y i − y j || ≤ 1 / 2 implies ∠ y i , y j < 0.7 we have sin (( 1 − " ) ∠ y i , y j ) ≤ ( 1 − " / 2 ) sin ( ∠ y i , y j ) and ( 1 + " / 2 ) sin ( ∠ y i , y j ) ≤ sin (( 1 + " ) ∠ y i , y j ) . Thus, we have sin (( 1 − " ) ∠ y i , y j / 2 ) ≤ sin ( ∠ x i , x j / 2 ) ≤ sin (( 1 + " ) ∠ y i , y j / 2 ) ( 1 − " ) ∠ y i , y j / 2 ≤ ∠ x i , x j / 2 ≤ ( 1 + " ) ∠ y i , y j / 2 ( 1 − " ) ∠ y i , y j ≤ ∠ x i , x j ≤ ( 1 + " ) ∠ y i , y j . Long Case: F or || y i − y j || ∈ ( 1 / 2, 2 ] , we consider 6 additional points y ( h ) i , j ∈ S ¯ d + 1 (for h ∈ [ 1 : 6 ] ) equally spaced between y i and y j on the shortest great circle connecting them. Let ˆ Y be the set Y plus all added points { y ( h ) i , j } h =[ 1:6 ] . Note that | ˆ Y | = O ( n 2 ) , so by JL we have that ( 1 − " / 8 ) || y i − ˆ y i , j || ≤ c || x i − ˆ x i , j || ≤ ( 1 + " / 8 ) || y i − ˆ y i , j || . F or notational convenience let y i = y ( 0 ) i , j and y j = y ( 7 ) i , j . Since for k y i − y j k ∈ ( 1 / 2, 2 ] then k y ( h ) i , j − y ( h + 1 ) i , j k ≤ 1 / 2, for h ∈ [ 0 : 6 ] . This follows since the geodesic length of the great circular arc through y i and y j is at most π , and π/ 7 < 1 / 2. Then the chordal distance for each pair k y ( h ) i , j − y ( h + 1 ) i , j k is upper bounded by the geodesic distance. Furthermore, by invoking the short case, for any pair ( 1 − " ) ∠ y ( h ) i , j , y ( h + 1 ) i , j ≤ ∠ x ( h ) i , j , x ( h + 1 ) i , j ≤ ( 1 + " ) ∠ y ( h ) i , j , y ( h ) i , j . Then since projections preserve coplanarity (speciﬁcally , the points 0 and y ( h ) i , j for h ∈ [ 0 : 7 ] are coplanar , hence 0 and x ( h ) i , j for h ∈ [ 0 : 7 ] are coplanar), we can add together the bounds on angles which all lie on a single great circle. ( 1 − " ) ∠ y i , y j ≤ ( 1 − " ) 6 X h = 0 ∠ y ( h ) i , j , y ( h + 1 ) i , j ≤ P 6 h = 0 ∠ x ( h ) i , j , x ( h + 1 ) i , j ≤ ( 1 + " ) 6 X h = 0 ∠ y ( h ) i , j , y ( h + 1 ) i , j ≤ min { π , ( 1 + " ) ∠ y i , y j } ( 1 − " ) ∠ y i , y j ≤ ∠ x i , x j ≤ min { π , ( 1 + " ) ∠ y i , y j } . References [ 1 ] P . K. Agarwal, S. Har-P eled, and H. Y u. On embeddings of moving points in Euclidean space. In Proceedings 23rd Symposium on Computational Geometry , 2007. [ 2 ] D. M. Blei, A. Y . Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res. , 3:993–1022, 2003. [ 3 ] I. Borg and P . J. F . Groenen. Modern Multidimensional Scaling . Springer , 2005. [ 4 ] P . Bose, A. Maheshwari, and P . Morin. F ast approximations for sums of distances, clustering and the Fermat–Weber problem. Comput. Geom. Theory Appl. , 24(3):135–146, 2003. [ 5 ] J. Brimberg and R. F . Love. Global convergence of a generalized iterative procedure for the minisum location problem with ` p distances. Operations R esearch , 41(6):1153–1163, 1993. 14 [ 6 ] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Numerical Geometry of Non-Rigid Shapes . Springer , 2008. [ 7 ] M. Bronstein. Accelerated MDS. http://tosca.cs.technion.ac.il/book/resources_ sw.html . [ 8 ] L. Cayton and S. Dasgupta. R obust Euclidean embedding. In ICML ’06: Proceedings of the 23rd Interna- tional Conference on Machine Learning , pages 169–176, 2006. [ 9 ] L. Chen and A. Buja. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the Americal Statistical Association , 104:209–219, 2009. [ 10 ] T . F . Cox and M. A. A. Cox. Multidimensional Scaling, Second Edition . Chapman & Hall / CRC, September 2000. [ 11 ] N. Dalal and B. T riggs. Histograms of oriented gradients for human detection. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and P attern R ecognition , pages 886–893, 2005. [ 12 ] J. de Leeuw. Applications of convex analysis to multidimensional scaling. In J. Barra, F . Brodeau, G. Romier , and B. V an Custem, editors, R ecent Developments in Statistics , pages 133–146. North Holland Publishing Company , 1977. [ 13 ] J. de Leeuw and P . Mair . Multidimensional scaling using majorization: SMACOF in R. T echnical R eport 537, UCLA Statistics Preprints Series, 2009. [ 14 ] A. E. Elad, Y . Keller , and R. Kimmel. T exture mapping via spherical multi-dimensional scaling. In R. Kim- mel, N. A. Sochen, and J. W eickert, editors, Scale-Space , volume 3459 of Lecture Notes in Computer Science , pages 443–455. Springer , 2005. [ 15 ] P . T . Fletcher , S. V enkatasubramanian, and S. Joshi. The Geometric Median on Riemannian Manifolds with Application to R obust Atlas Estimation. Neuroimage (invited to special issue) , 45(1):S143–S152, March 2009. [ 16 ] R. Gray , A. Buzo, A. Gray Jr, and Y . Matsuyama. Distortion measures for speech processing. Acoustics, Speech and Signal Processing, IEEE T ransactions on , 28(4):367–376, Aug 1980. [ 17 ] P . Indyk and J. Matousek. Low-distortion embeddings of ﬁnite metric spaces. In Handbook of Discrete and Computational Geometry , pages 177–196. CRC Press, 2004. [ 18 ] T . Joachims. Learning to Classify T ext Using Support V ector Machines – Methods, Theory , and Algorithms . Kluwer / Springer , 2002. [ 19 ] W . B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into Hilbert space. Contemporary Mathematics , 26:189–206, 1984. [ 20 ] H. Karcher . Riemannian center of mass and molliﬁer smoothing. Comm. on Pure and Appl. Math. , 30:509– 541, 1977. [ 21 ] I. N. Katz. Local convergence in Fermat’s problem. Mathematical Programming , 6:89–104, 1974. [ 22 ] J. B. Kruskal. Multidimensional scaling by optimizing goodness of ﬁt to nonmetric hypothesis. P sychome- trika , 29:1–27, 1964. [ 23 ] J. B. Kruskal and M. Wish. Multidimensional scaling. In E. M. Uslander , editor , Quantitative Applications in the Social Sciences , volume 11. Sage Publications, 1978. 15 [ 24 ] H. W . K uhn. A note on Fermat’s problem. Mathematical Programming , 4:98–107, 1973. [ 25 ] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision , 60(2), 2004. [ 26 ] A. Magen. Dimensionality reductions that preserve volumes and distance to afﬁne spaces, and their algorithmic applications. In Proceedings of the 6th International W orkshop on R andomization and Approx- imation T echniques , Lecture Notes In Computer Science; V ol. 2483, 2002. [ 27 ] A. W . Marshall and I. Olkin. Inequalities: Theory of Majorization and Its Applications . Academic Press, 1979. [ 28 ] L. M. Ostresh. On the convergence of a class of iterative methods for solving the Weber location problem. Operations Research , 26:597–609, 1978. [ 29 ] F . P ereira, N. Tishby , and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics , pages 183–190, 1993. [ 30 ] R. Pietersz and P . J. F . Groenen. Rank reduction of correlation matrices by majorization. T echnical R eport 519086, SSRN, 2004. [ 31 ] R. Pless and I. Simon. Embedding images in non-ﬂat spaces. In Proc. of the International Conference on Imaging Science, Systems, and T echnology , 2002. [ 32 ] T . Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings 47th Annual IEEE Symposium on F oundations of Computer Science , 2006. [ 33 ] W . S. T orgerson. Multidimensional scaling: I. theory and method. Psychometrika , 17:401–419, 1952. [ 34 ] E. W eiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. T ohoku Math. J. , 43:355–386, 1937. [ 35 ] G. Y oung and A. S. Householder . Discussion of a set of points in terms of their mutual distances. P sy - chometrika , 3:19–22, 1938. 16

A Unified Algorithmic Framework for Multi-Dimensional Scaling

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment