An Approximation Ratio for Biclustering

The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal bic…

Authors: Kai Puolamäki, Sami Hanhijärvi, Gemma C. Garriga

An Approximation Ratio for Biclustering*

Kai Puolamäki†, Sami Hanhijärvi‡, Gemma C. Garriga§
Helsinki Institute for Information Technology HIIT, Helsinki University of Technology
P.O. Box 5400, FI-02015 TKK, Finland

22 August 2008

Abstract

The problem of biclustering consists of the simultaneous clustering of rows and columns of a matrix such that each of the submatrices induced by a pair of row and column clusters is as uniform as possible. In this paper we approximate the optimal biclustering by applying one-way clustering algorithms independently on the rows and on the columns of the input matrix. We show that such a solution yields a worst-case approximation ratio of 1 + √2 under the L1-norm for 0–1 valued matrices, and of 2 under the L2-norm for real valued matrices.

Keywords: Approximation algorithms; Biclustering; One-way clustering

* Cite as arXiv:0712.2682v2 [cs.DS]. This is a revised version of the preprint originally published as arXiv:0712.2682v1 [cs.DS] on 17 December 2007. To be published in Information Processing Letters 108 (2008) 45–49 [doi:10.1016/j.ipl.2008.03.013].
† Kai.Puolamaki@tkk.fi  ‡ Sami.Hanhijarvi@tkk.fi  § Gemma.Garriga@tkk.fi

1 Introduction

The standard clustering problem [8] consists of partitioning a set of input vectors such that the vectors in each partition (cluster) are close to one another according to some predefined distance function. This formulation is the objective of the popular K-means algorithm (see, for example, [9]), where K denotes the final number of clusters and the distance function is defined by the L2-norm. Another similar example of this formulation is the K-median algorithm (see, for example, [3]), where the distance function is given by the L1-norm. Clustering a set of input vectors is a well-known NP-hard problem even for K = 2 clusters [4]. Several approximation guarantees have been shown for this formulation of the standard clustering problem (see [3, 9, 2] and references therein).

Intensive recent research has focused on the discovery of homogeneous substructures in large matrices. This is also one of the goals in the problem of biclustering. Given a matrix X of N rows and M columns, a biclustering algorithm identifies subsets of rows exhibiting similar behavior across a subset of columns, or vice versa. Note that the optimal solution for this problem necessarily requires clustering the N vectors and the M dimensions simultaneously, hence the name biclustering. Each submatrix of X induced by a pair of row and column clusters is typically referred to as a bicluster. See Figure 1 for a simple toy example.

Figure 1: (a) An example binary data matrix X of dimensions 4 × 6, with rows and columns labeled with numbers and characters. (b) The optimal biclustering of X consists of the row clusters {R*_1, R*_2} = {{1}, {2, 3, 4}} and the column clusters {C*_1, C*_2, C*_3} = {{b, f}, {a, d, e}, {c}} when using the L1-norm. (c) Biclusters of the data matrix returned by our scheme, that is, applying an optimal one-way clustering algorithm twice, once on the 4 row vectors and once on the 6 column vectors, with the L1-norm. The resulting clusterings are {R_1, R_2} = {{1, 3, 4}, {2}} for rows and {C_1, C_2, C_3} = {{b, f}, {a, e}, {c, d}} for columns. For visual clarity, the rows and columns of the original matrix in (a) have been permuted in (b) and (c) by making the rows (and columns) of a single cluster adjacent.

The main challenge of a biclustering algorithm lies in the dependency between the row and column partitions, which makes it difficult to identify the optimal biclusters.
A change in a row clustering affects the cost of the induced submatrices (biclusters) and, as a consequence, the column clustering may also need to be changed to improve the solution.

Finding an optimal solution for the biclustering problem is NP-hard. This observation follows directly from the reduction of the standard clustering problem (known to be NP-hard) to the biclustering problem by fixing the number of column clusters to M. To the best of our knowledge, no algorithm exists that can efficiently approximate biclustering with a proven approximation ratio. The goal of this paper is to propose such an approximation guarantee by means of a very simple scheme.

Our approach consists of relieving the requirement of simultaneous clustering of rows and columns, and instead performing them independently. In other words, our final biclusters correspond to the submatrices of X induced by pairs of row and column clusters found independently with a standard clustering algorithm. We sometimes refer to this standard clustering algorithm as one-way clustering. The simplicity of the solution frees us from the inconvenient dependency between rows and columns. More importantly, the solution obtained with this approach, despite not being optimal, allows for the study of approximation guarantees on the obtained biclusters. Here we prove that our solution achieves a worst-case approximation ratio of 1 + √2 under the L1-norm for 0–1 valued matrices, and of 2 under the L2-norm for real valued matrices.

Finally, note that our final solution is constructed on top of a standard clustering algorithm (applied twice, once on the row vectors and once on the column vectors); therefore, it is necessary to multiply our ratio by the approximation ratio achieved by the standard clustering algorithm used (such as [3, 9]). For clarity, we will lift this restriction in the following proofs by assuming that the applied one-way clustering algorithm directly provides an optimal solution to the standard clustering problem.

1.1 Related work

This basic algorithmic problem and several variations were initially presented in [6] under the name of direct clustering. The same problem and its variations have also been referred to as two-way clustering, co-clustering or subspace clustering. In practice, finding highly homogeneous biclusters has important applications in biological data analysis (see [10] for a review and references), where a bicluster may, for example, correspond to an activation pattern common to a group of genes only under specific experimental conditions.

An alternative definition of the basic biclustering problem described in the introduction consists of finding the maximal bicluster in a given matrix. A well-known connection of this alternative formulation is its reduction to the problem of finding a biclique in a bipartite graph [7]. Algorithms for detecting bicliques enumerate them in the graph by using the monotonicity property that a subset of a biclique is also a biclique [1, 5]. These algorithms usually have a high order of complexity.

2 Definitions

We assume given a matrix X of size N × M, and integers K_r and K_c, which define the number of clusters partitioning the rows and the columns, respectively. The goal is to approximate the optimal biclustering of X by means of a one-way row clustering into K_r clusters and a one-way column clustering into K_c clusters.

For any T ∈ ℕ we denote [T] = {1, ..., T}. We use X(R, C), where R ⊆ [N] and C ⊆ [M], to denote the submatrix of X induced by the subset of rows R and the subset of columns C. Let Y denote an induced submatrix of X, that is, Y = X(R, C) for some R ⊆ [N] and C ⊆ [M].
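In 0-based array terms, the notation X(R, C) is plain cross-product indexing. A minimal sketch (numpy assumed; note that the paper's [T] = {1, ..., T} is 1-based, while the code below is 0-based):

```python
import numpy as np

# X(R, C): submatrix of X induced by row subset R and column subset C.
# With numpy, np.ix_ builds exactly this cross-product index.
X = np.arange(12).reshape(3, 4)   # a 3 x 4 example matrix
R = [0, 2]                        # subset of rows
C = [1, 3]                        # subset of columns
Y = X[np.ix_(R, C)]               # the induced submatrix Y = X(R, C)

assert Y.tolist() == [[1, 3], [9, 11]]
```

The quantities median(Y) and mean(Y) used below are then simply `np.median(Y)` and `np.mean(Y)` over all elements of the block.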
When required by the context, we will also refer to Y = X(R, C) as a bicluster of X, and denote the size of Y by n × m, where n ≤ N and m ≤ M. We use median(Y) and mean(Y) to denote the median and mean of all elements of Y, respectively.

The scheme for approximating the optimal biclustering is defined as follows.

    Input: matrix X, number of row clusters K_r, number of column clusters K_c
        R = kcluster(X, K_r)
        C = kcluster(X^T, K_c)
    Output: a set of biclusters X(R, C), for each R ∈ R, C ∈ C

The function kcluster(X, K_r) denotes here an optimal one-way clustering algorithm that partitions the row vectors of matrix X into K_r clusters. We have used X^T to denote the transpose of matrix X.

Instead of fixing a specific norm for the formulas, we use the dissimilarity measure V() to absorb the norm-dependent part. For the L1-norm, V() would be defined as V(Y) = \sum_{y \in Y} |y - \mathrm{median}(Y)|, and for the L2-norm as V(Y) = \sum_{y \in Y} (y - \mathrm{mean}(Y))^2. Given Y of size n × m, we further use a special row norm, V_R(Y) = \sum_{j=1}^{m} V(Y([n], j)), and a special column norm, V_C(Y) = \sum_{i=1}^{n} V(Y(i, [m])).

We define the one-way row clustering, given by kcluster above, as a partition of the rows [N] into K_r clusters R = {R_1, ..., R_{K_r}} such that the cost function

    L_R = \sum_{R \in \mathcal{R}} \sum_{j=1}^{M} V(X(R, j))    (1)

is minimized. Analogously, the one-way clustering of the columns [M] into K_c clusters C = {C_1, ..., C_{K_c}} is defined such that the cost function

    L_C = \sum_{i=1}^{N} \sum_{C \in \mathcal{C}} V(X(i, C))    (2)

is minimized. The cost of the biclustering induced by the two one-way clusterings above is

    L = \sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} V(X(R, C)).    (3)

Notice that we are assuming that the one-way clusterings above, denoted R on the rows and C on the columns, correspond to optimal one-way partitionings of the rows and the columns, respectively. Finally, the optimal biclustering of X is given by the simultaneous row and column partitions R* = {R*_1, ..., R*_{K_r}} and C* = {C*_1, ..., C*_{K_c}} that minimize the cost

    L^* = \sum_{R^* \in \mathcal{R}^*} \sum_{C^* \in \mathcal{C}^*} V(X(R^*, C^*)).    (4)

3 Approximation ratio

Given the definitions above, our main result reads as follows.

Theorem 1. There exists an approximation ratio α such that L ≤ αL*, where α = 1 + √2 ≈ 2.41 for the L1-norm and X ∈ {0, 1}^{N×M}, and α = 2 for the L2-norm and X ∈ ℝ^{N×M}.

We use the following intermediate result to prove the theorem.

Lemma 2. There exists an approximation ratio of at most α, that is, L ≤ αL*, if for any X and for any partitionings R and C of X, all biclusters Y = X(R, C), with R ∈ R and C ∈ C, satisfy

    V(Y) \le \frac{\alpha}{2} \left( V_R(Y) + V_C(Y) \right).    (5)

Proof. First we note that the cost of the optimal biclustering L* cannot increase when we increase the number of row (or column) clusters. For example, consider the special case where K_r = N (or K_c = M). In that case, each row (or column) is assigned to its own cluster, and the cost of the optimal biclustering equals the cost of the optimal one-way clustering on the columns, L_C (or on the rows, L_R). Hence, the optimal biclustering solution is bounded from below by

    L^* \ge \max(L_R, L_C) \ge \frac{1}{2} (L_R + L_C).    (6)

Summing both sides of Equation (5) over all biclusters Y = X(R, C), with R ∈ R and C ∈ C,

    \sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} V(X(R, C)) \le \frac{\alpha}{2} \sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} \left( V_R(X(R, C)) + V_C(X(R, C)) \right),

and using Equations (1), (2) and (3) gives L ≤ (α/2)(L_R + L_C), which together with Equation (6) implies the approximation ratio L ≤ αL*. □

Theorem 1 is proven separately in Sections 3.1 and 3.2 using Lemma 2. Section 3.1 deals with the case of a 0–1 valued matrix X and the L1-norm distance function, while Section 3.2 deals with a real valued matrix X and the L2-norm.
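As a concrete sanity check (ours, not part of the paper), the scheme and the cost functions above can be sketched in Python. Here kcluster is realized as a brute-force search over all partitions, matching the paper's assumption of an optimal one-way clusterer; this is feasible only for tiny matrices, and all function names are ours:

```python
import itertools
import numpy as np

def V(Y):
    # L1 dissimilarity: sum of absolute deviations from the block median
    return float(np.abs(Y - np.median(Y)).sum())

def partitions(n, k):
    # Enumerate all ways to split items 0..n-1 into at most k groups
    # (brute force; feasible only for very small n).
    for labels in itertools.product(range(k), repeat=n):
        part = [[i for i in range(n) if labels[i] == g] for g in range(k)]
        yield [grp for grp in part if grp]

def kcluster(X, k):
    # Optimal one-way row clustering under the cost of Eq. (1).
    def cost(rows):
        return sum(V(X[np.ix_(R, [j])]) for R in rows for j in range(X.shape[1]))
    return min(partitions(X.shape[0], k), key=cost)

def bicluster_cost(X, rows, cols):
    # Eq. (3)/(4): summed V over all induced biclusters
    return sum(V(X[np.ix_(R, C)]) for R in rows for C in cols)

# The scheme: one-way clustering applied independently to rows and columns.
rng = np.random.default_rng(0)
X = (rng.random((5, 5)) < 0.5).astype(float)
Kr = Kc = 2
R = kcluster(X, Kr)
C = kcluster(X.T, Kc)
L = bicluster_cost(X, R, C)

# Optimal biclustering cost L*, by brute force over joint partitions.
L_star = min(bicluster_cost(X, Rp, Cp)
             for Rp in partitions(X.shape[0], Kr)
             for Cp in partitions(X.shape[1], Kc))

assert L_star <= L <= (1 + 2 ** 0.5) * L_star + 1e-9
```

The final assertion checks Theorem 1's bound L ≤ (1 + √2)L* on this 0–1 instance. For real data, kcluster would be replaced by an approximate K-median or K-means solver, and the obtained guarantee multiplies accordingly, as noted above.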
3.1 L1-norm and 0–1 valued matrix

Figure 2: Examples of swaps performed within a bicluster Y for the technical part of the proof in Section 3.1. For clarity, the rows and columns of the bicluster Y have been ordered such that the blocks A, B, C and D are contiguous.

Consider a 0–1 valued matrix X and the L1-norm. To prove Theorem 1 it suffices to show that Equation (5) holds for each of the biclusters Y = X(R, C) of X, where R ∈ R and C ∈ C. Therefore, in the following we concentrate on one single bicluster Y ∈ {0, 1}^{n×m}. Without loss of generality, we consider only the case where the bicluster Y has at least as many 0's as 1's. In that case, the median of Y can safely be taken to be zero, and the cost V(Y) ≤ nm/2 is then fixed to the number of 1's in the matrix.

To get the worst-case scenario towards the tightest upper bound on α in Equation (5), we should first find a configuration of 1's such that, given V(Y), the sum V_R(Y) + V_C(Y) is minimized. Denote by O_R and O_C the sets of rows and columns of Y which have more 1's than 0's, respectively. Denote A = Y(O_R, O_C), B = Y(O_R, [m] \ O_C), C = Y([n] \ O_R, O_C), D = Y([n] \ O_R, [m] \ O_C), n' = |O_R| and m' = |O_C|. Note that A, B, C and D are simply blocks of the bicluster Y, which we need to make explicit in our notation for the proof.

Changing a 0 to 1 in A, or a 1 to 0 in D, decreases V_R(Y) + V_C(Y) by two, while changing a 0 to 1 or a 1 to 0 in B or C changes V_R(Y) + V_C(Y) by at most one. It follows that swapping a 1 in B or C with a 0 in A (see Figure 2a), or swapping a 1 in D with a 0 in A, B or C (see Figure 2b), decreases V_R(Y) + V_C(Y) while V(Y) remains unchanged. In other words, in a solution that minimizes V_R(Y) + V_C(Y) no such swaps can be made. In the remainder of this subsection, we assume that the bicluster Y satisfies this property.
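The effect of such a swap can be checked numerically on a small example (ours; the matrix below and the positions of the swapped entries are an illustration, not taken from the paper):

```python
import numpy as np

def V(Y):
    # L1 cost of a 0-1 block around its median
    return float(np.abs(Y - np.median(Y)).sum())

def VR(Y):
    # special row norm: per-column costs summed
    return sum(V(Y[:, j]) for j in range(Y.shape[1]))

def VC(Y):
    # special column norm: per-row costs summed
    return sum(V(Y[i, :]) for i in range(Y.shape[0]))

# A 5 x 5 bicluster with majority-1 rows/columns O_R = O_C = {0, 1, 2},
# a stray 0 inside block A (position (0, 0)) and a stray 1 inside
# block D (position (4, 4)).
Y = np.array([[0, 1, 1, 1, 0],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1]], dtype=float)

# Swap the 1 in D with the 0 in A: V(Y) is unchanged, while
# V_R(Y) + V_C(Y) strictly decreases, as claimed above.
Y_swapped = Y.copy()
Y_swapped[0, 0], Y_swapped[4, 4] = 1.0, 0.0

assert V(Y) == V(Y_swapped) == 11.0
assert VR(Y) + VC(Y) == 16.0
assert VR(Y_swapped) + VC(Y_swapped) == 12.0
```

Here the swap touches one majority-1 row and column and one majority-0 row and column, so V_R(Y) + V_C(Y) drops by four while the total number of 1's, and hence V(Y), stays fixed.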
It follows that (i) A, B and C are blocks of 1's, (ii) A is a block of 1's and D is a block of 0's, or (iii) B, C and D are blocks of 0's. Denote by o() the number of 1's in a given block. It follows that V(Y) = o(A) + o(B) + o(C) + o(D) ≤ nm/2, V_R(Y) = nm' − o(A) + o(B) − o(C) + o(D) and V_C(Y) = n'm − o(A) − o(B) + o(C) + o(D). We denote x = n'/n, y = m'/m, a = o(A)/(nm), b = o(B)/(nm), c = o(C)/(nm) and d = o(D)/(nm), and rewrite Equation (5) as

    \alpha = \sup \frac{2\, V(Y)}{V_R(Y) + V_C(Y)} = 2 \sup \frac{a + b + c + d}{x + y - 2a + 2d},

with constraints a + b + c + d ∈ [0, 1/2], x ∈ [0, 1], y ∈ [0, 1], as well as (i) a = xy, b = x(1 − y), c = (1 − x)y and d ∈ [0, (1 − x)(1 − y)]; (ii) a = xy, b ∈ [0, x(1 − y)], c ∈ [0, (1 − x)y] and d = 0; or (iii) a ∈ [0, xy] and b = c = d = 0.

The optimization problem has two solutions: (i) x = y = 1 − \sqrt{1/2}, a = xy, b = x(1 − y), c = (1 − x)y and d = 0; and (ii) x = y = \sqrt{1/2}, a = xy and b = c = d = 0. Both solutions yield α = 1 + √2, attained when exactly half of the entries in the bicluster Y are 1's. This proves Theorem 1 for 0–1 valued matrices and the L1-norm.

Notice that the above proof relies on the fact that the input matrix X has only two types of values; therefore, the proof does not generalize to real valued matrices. An example of a matrix with an approximation ratio of 2 is given by the 4 × (4q − 1) matrix

    X = \begin{pmatrix}
        0 \cdots 0 & 1 \cdots 1 & 0 \cdots 0 \\
        0 \cdots 0 & 1 \cdots 1 & 1 \cdots 1 \\
        1 \cdots 1 & 0 \cdots 0 & 0 \cdots 0 \\
        1 \cdots 1 & 0 \cdots 0 & 1 \cdots 1
    \end{pmatrix}

with q columns in the first column group, q columns in the second column group and 2q − 1 columns in the third column group, clustered into K_r = 2 row clusters and K_c = 1 column cluster, in the limit of large q.
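The clustering costs quoted in the text for this family of matrices are easy to check numerically. A small sketch (ours; numpy assumed, and build_X and cost are our names); since K_c = 1, each row cluster induces a single bicluster, so the biclustering cost is the sum of V over the two row blocks:

```python
import numpy as np

def V(Y):
    # L1 cost of a block around its median
    return float(np.abs(Y - np.median(Y)).sum())

def build_X(q):
    # The 4 x (4q - 1) example: column groups of width q, q and 2q - 1.
    g1 = [0, 0, 1, 1]
    g2 = [1, 1, 0, 0]
    g3 = [0, 1, 0, 1]
    cols = [g1] * q + [g2] * q + [g3] * (2 * q - 1)
    return np.array(cols, dtype=float).T

def cost(X, row_clusters):
    # Biclustering cost with a single column cluster (Kc = 1).
    return sum(V(X[rows, :]) for rows in row_clusters)

for q in (2, 10, 100):
    X = build_X(q)
    L = cost(X, [[0, 1], [2, 3]])       # one-way row clustering {1,2},{3,4}
    L_star = cost(X, [[0, 2], [1, 3]])  # biclustering rows {1,3},{2,4}
    assert L == 8 * q - 2 and L_star == 4 * q
```

The ratio L/L* = (8q − 2)/(4q) approaches 2 as q grows, matching the limit claimed for this example.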
The optimal one-way clustering of the rows is given by R = {{1, 2}, {3, 4}}, with L = 8q − 2, and the optimal biclustering of the rows by R* = {{1, 3}, {2, 4}}, with L* = 4q.

3.2 L2-norm and real valued matrix

Consider now a real valued matrix X and the L2-norm. We want to prove Theorem 1 for the real valued biclusters Y of X. To find the approximation ratio, it suffices to show that Equation (5) holds for each bicluster Y ∈ ℝ^{n×m}, determined by Y = X(R, C), where R ∈ R and C ∈ C. Using the definitions of V(Y), V_R(Y) and V_C(Y), we can write

    V(Y) = V_R(Y) + V_C(Y) - \sum_{i=1}^{n} \sum_{j=1}^{m} \left( Y(i, j) - \hat{Y}(i, j) \right)^2 \le V_R(Y) + V_C(Y),

where \hat{Y}(i, j) = \mathrm{mean}(Y([n], j)) + \mathrm{mean}(Y(i, [m])) - \mathrm{mean}(Y). Hence, Equation (5) is satisfied for the L2-norm and real valued matrices when α = 2. □

4 Conclusions

We have shown that approximating the optimal biclustering with independent row- and column-wise standard clusterings achieves a good approximation guarantee. In practice, however, standard one-way clustering algorithms (such as K-means or K-median) are themselves approximate, and therefore it is necessary to multiply our ratio by the approximation ratio achieved by the standard clustering algorithm (such as those presented in [3, 9]) to obtain the true approximation ratio of our scheme. Still, our contribution shows that in many practical applications of biclustering, it may be sufficient to use a more straightforward standard clustering of rows and columns instead of applying heuristic algorithms without performance guarantees.

5 Acknowledgments

We thank Nikolaj Tatti for reading through the manuscript and giving useful comments.

References

[1] G. Alexe, S. Alexe, Y. Crama, S. Foldes, P. L. Hammer, and B. Simeone. Consensus algorithms for the generation of all maximal bicliques. Discrete Appl. Math., 145(1):11–21, 2004.
[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035, 2007.
[3] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004.
[4] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Mach. Learn., 56(1-3):9–33, 2004.
[5] D. Eppstein. Arboricity and bipartite subgraph listing algorithms. Inf. Process. Lett., 51(4):207–211, 1994.
[6] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
[7] D. S. Hochbaum. Approximating clique and biclique problems. J. Algorithms, 29(1):174–200, 1998.
[8] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[9] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2–3):89–112, 2004.
[10] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
