A New Heuristic for Feature Selection by Consistent Biclustering
Authors: Antonio Mucherino, Sonia Cafieri
A. Mucherino (INRIA, Lille Nord Europe, France. Email: antonio.mucherino@inria.fr)
S. Cafieri (ENAC, Toulouse, France. Email: sonia.cafieri@enac.fr)

Abstract—Given a set of data, biclustering aims at finding simultaneous partitions in biclusters of its samples and of the features which are used for representing the samples. Consistent biclusterings allow one to obtain correct classifications of the samples from the known classification of the features, and vice versa, and they are very useful for performing supervised classifications. The problem of finding consistent biclusterings can be seen as a feature selection problem, where the features that are not relevant for classification purposes are removed from the set of data, while the total number of features is maximized in order to preserve information. This feature selection problem can be formulated as a linear fractional 0–1 optimization problem. We propose a reformulation of this problem as a bilevel optimization problem, and we present a heuristic algorithm for an efficient solution of the reformulated problem. Computational experiments show that the presented algorithm is able to find better solutions than those obtained by employing previously presented heuristic algorithms.

I. INTRODUCTION

Data mining techniques are nowadays widely studied, because of the growing amount of data that is available and needs to be analyzed. In particular, clustering techniques aim at finding suitable partitions of a set of samples in clusters, where data are grouped following different criteria. The focus of this paper is biclustering, where samples and features in a given set of data are partitioned simultaneously. Given a set of samples, each sample in the set can be represented by a sequence of features, which are supposed to be relevant for the samples.
If a set of data contains n samples represented by m features, then the whole set can be represented by an m × n matrix A, where the samples are organized column by column, and the features are organized row by row. A bicluster is a submatrix of A, which can be equivalently defined as a pair of subsets (S_r, F_r), where S_r is a cluster of samples and F_r is a cluster of features. A biclustering is then a partition of A in k biclusters:

B = {(S_1, F_1), (S_2, F_2), ..., (S_k, F_k)},

such that the following conditions are satisfied:

∪_{r=1}^{k} S_r ≡ A,   S_ζ ∩ S_ξ = ∅,  1 ≤ ζ ≠ ξ ≤ k,   (1)

∪_{r=1}^{k} F_r ≡ A,   F_ζ ∩ F_ξ = ∅,  1 ≤ ζ ≠ ξ ≤ k,   (2)

where k ≤ min(n, m) is the number of biclusters [2], [6]. Note that conditions (1) ensure that B_S = {S_1, S_2, ..., S_k} is a partition of the samples in disjoint clusters, while conditions (2) ensure that B_F = {F_1, F_2, ..., F_k} is a partition of the features in disjoint clusters.

We focus on the problem of finding biclusterings of the set of samples and of the set of features. When such biclusterings can be found, not only are clusters of samples obtained (as in standard clustering), but, in addition, the features causing the partition of the samples in these clusters are also identified. This information is very interesting in many real-life applications. In particular, biclustering techniques are widely applied for analyzing gene expression data, where samples represent particular conditions (for example, the presence or absence of a disease), and each sample is represented by a sequence of gene expressions. In this case, finding out which features (genes) are related to the samples can help in discovering information about diseases [3], [9]. The concept of consistent biclustering is very important in this domain [2].
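Conditions (1)-(2) simply say that B_S and B_F must each be a partition. As a quick illustration (a minimal pure-Python sketch with invented helper names, not part of the paper), a candidate biclustering can be validated as follows:

```python
def is_valid_biclustering(B, n, m):
    """Check conditions (1)-(2): the sample clusters S_r must partition
    the sample indices {0,...,n-1} and the feature clusters F_r must
    partition the feature indices {0,...,m-1}.
    B is a list of (S_r, F_r) pairs, each a set of indices."""
    samples = [S for S, _ in B]
    features = [F for _, F in B]

    def is_partition(clusters, size):
        union = set().union(*clusters)
        total = sum(len(c) for c in clusters)
        # disjoint (no index counted twice) and covering the whole range
        return total == len(union) and union == set(range(size))

    return is_partition(samples, n) and is_partition(features, m)

# A toy 4x3 data set (m = 4 features, n = 3 samples), k = 2 biclusters:
B = [({0, 1}, {0, 1}), ({2}, {2, 3})]
print(is_valid_biclustering(B, n=3, m=4))  # True: disjoint and covering
```

The check mirrors the two parts of each condition: disjointness (no cluster overlap) and coverage (every row and every column of A belongs to some bicluster).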
Let us consider a set of samples, and let us suppose that a certain classification is assigned to them. In other words, we know a partition in clusters of these samples: B_S = {S_1, S_2, ..., S_k}. A classification for the corresponding features, i.e. for the features used for representing these samples, can be obtained from B_S (see Section II for details). Let us refer to this partition of the features as B_F = {F_1, F_2, ..., F_k}. Then the procedure can be inverted, and from the obtained classification B_F of the features, another classification for the samples can be computed: B̂_S = {Ŝ_1, Ŝ_2, ..., Ŝ_k}. In general, B_S and B̂_S differ. When they instead coincide, the biclustering B = {(S_1, F_1), (S_2, F_2), ..., (S_k, F_k)} is referred to as a consistent biclustering.

Consistent biclusterings can be used for classification purposes. Let us suppose that a training set is available for a certain classification problem, i.e. a set of samples whose classification is known. From the classification of the samples, a classification of the features can be found, and hence a certain biclustering, as explained above. If this biclustering is consistent, then the original classification of the samples in the training set can be reconstructed from the classification of its features. Therefore, the classification of these features can also be exploited for classifying other samples which originally have no known classification.

Unfortunately, sets of data allowing for consistent biclusterings are quite rare. There are usually features that are not relevant for the classification of the samples, and which can easily lead to misclassifications.
Because of experimental errors or noise, these features could be assigned to one bicluster or another, and this uncertainty causes errors in the classifications. To avoid this, all the features that are not relevant must be removed. Therefore, we are interested in selecting a certain subset of features for which a consistent biclustering can be found. Since it is preferable to keep the loss of information as low as possible, the number of selected features has to be the maximum possible.

The feature selection problem related to consistent biclustering is NP-hard [8]. It can be formulated as a 0–1 linear fractional optimization problem, which can be very difficult to solve. In particular, for large (real-life) sets of data, the corresponding optimization problem is also large, and therefore there are no examples in the literature in which deterministic techniques have been employed. In [2], [12], two heuristic algorithms have been proposed for solving the 0–1 linear fractional optimization problem arising in the context of feature selection by biclustering.

In this paper, we propose a new heuristic algorithm for solving this feature selection problem. We reformulate the optimization problem as a bilevel optimization problem in which the inner problem is linear. Therefore, we use a deterministic algorithm for solving the inner problem, which is nested into a general framework where a heuristic strategy is employed. Our computational experiments show that the proposed heuristic algorithm is able to find subsets of features allowing for consistent biclusterings. The obtained results are compared to the ones reported in other publications [2], [12]: in general, the heuristic algorithm that we propose is able to find consistent biclusterings in which the number of selected features is larger.

The remainder of the paper is organized as follows.
In Section II, we develop the concept of consistent biclustering in more detail, and we present the corresponding feature selection problem. In Section III, we reformulate this feature selection problem as a bilevel optimization problem and we introduce a heuristic algorithm for an efficient solution of the problem. Computational experiments on real-life sets of data are presented in Section IV, as well as a comparison to another heuristic algorithm. Conclusions are given in Section V.

II. CONSISTENT BICLUSTERING

Let A be an m × n matrix related to a certain set of data, where samples are organized column by column, and their features are organized row by row. If a classification of the samples is known, then the centroid of each cluster, computed as the mean among all the members of the same cluster, can be computed. Let C^S be the matrix containing all these centroids, organized column by column, where its generic element c^S_ir refers to the i-th feature of the centroid of the r-th cluster of samples. Analogously, a matrix C^F containing the centroids of the clusters related to a known classification of the features can be defined. The generic element c^F_jr of the matrix C^F refers to the j-th sample related to the centroid of the r-th cluster of features. Finally, the symbol a_i refers to the i-th row of the matrix A, i.e. to a feature, and the symbol a^j refers to the j-th column of A, i.e. to a sample. In the following discussion, k represents the number of biclusters (known a priori), and r ∈ {1, 2, ..., k} refers to the generic bicluster. The symbols r̂ and ξ are used for referring to biclusters having particular properties.

Let us suppose that a classification for the samples in A is known. In other words, the following partition in k clusters is available: B_S = {S_1, S_2, ..., S_k}.
Starting from this classification, the matrix C^S of centroids can be computed. Given a feature a_i, we can check the value of c^S_ir for all the clusters. If, for a certain cluster S_r̂, the element c^S_ir̂ is the largest over all possible r, then S_r̂ is the cluster in which the feature a_i is mostly expressed. Therefore, it is reasonable to give to this feature the same classification as the samples in S_r̂. Formally, it is imposed that:

a_i ∈ F_r̂  ⟺  c^S_ir̂ > c^S_iξ,  ∀ ξ ∈ {1, 2, ..., k}, ξ ≠ r̂.   (3)

Note that a complete classification of all the features can be obtained by imposing the equivalence (3) for all a_i. Let B_F = {F_1, F_2, ..., F_k} be the computed classification of the features. Starting from this classification, the matrix C^F can be computed. In a similar way, a classification of the samples can be obtained by imposing the following equivalence:

a^j ∈ Ŝ_r̂  ⟺  c^F_jr̂ > c^F_jξ,  ∀ ξ ∈ {1, 2, ..., k}, ξ ≠ r̂.   (4)

Let B̂_S = {Ŝ_1, Ŝ_2, ..., Ŝ_k} be the computed classification of the samples. In general, the two classifications B_S and B̂_S are different from each other. If they coincide, then the partition in biclusters B = {(S_1, F_1), (S_2, F_2), ..., (S_k, F_k)} is, by definition, a consistent biclustering. As already remarked in the Introduction, the classification of the features obtained from consistent biclusterings can be exploited for classifying samples with an unknown classification [2].

If a consistent biclustering exists for a certain set of data, then the set is said to be biclustering-admitting. However, sets of data admitting consistent biclusterings are very rare. Therefore, features must be removed from the set of data to make it become biclustering-admitting [2].
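Rules (3)-(4) can be sketched in a few lines of code. The following illustration (toy data and function names are our own, not the authors' implementation) computes C^S from a known sample classification, classifies the features by rule (3), reclassifies the samples by rule (4), and reports whether the two sample classifications coincide, i.e. whether the biclustering is consistent:

```python
def consistent(A, sample_cls, k):
    """A is an m x n matrix (list of rows); sample_cls[j] in {0..k-1}
    is the known cluster of sample j (clusters assumed non-empty).
    Returns the feature classification obtained by rule (3) and whether
    rule (4) reproduces sample_cls (consistency)."""
    m, n = len(A), len(A[0])
    clusters = [[j for j in range(n) if sample_cls[j] == r] for r in range(k)]
    # c^S_ir: mean of feature i over the samples in cluster r
    cS = [[sum(A[i][j] for j in clusters[r]) / len(clusters[r])
           for r in range(k)] for i in range(m)]
    # rule (3): each feature goes to the cluster where it is most expressed
    feat_cls = [max(range(k), key=lambda r: cS[i][r]) for i in range(m)]
    fclusters = [[i for i in range(m) if feat_cls[i] == r] for r in range(k)]
    # c^F_jr: mean of sample j over the features in cluster r
    cF = [[sum(A[i][j] for i in fclusters[r]) / len(fclusters[r])
           for r in range(k)] for j in range(n)]
    # rule (4): reclassify each sample from the feature centroids
    new_cls = [max(range(k), key=lambda r: cF[j][r]) for j in range(n)]
    return feat_cls, new_cls == list(sample_cls)

# Toy block-structured data: two biclusters stand out clearly
A = [[9, 8, 1],
     [8, 9, 2],
     [1, 2, 9],
     [2, 1, 8]]
print(consistent(A, [0, 0, 1], k=2))  # ([0, 0, 1, 1], True)
```

Note that (3) and (4) use strict inequalities; in this sketch ties are broken arbitrarily by `max`, whereas the formal definition leaves a tied feature unclassified.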
During this process, it is very important to remove the least possible number of features, in order to preserve the information in the set of data. In practice, a maximal subset of good features must be extracted from the initial set. The problem of finding the maximal consistent biclustering can therefore be seen as a feature selection problem.

Let f_ir be a binary parameter which indicates whether the generic feature a_i belongs to the generic cluster F_r (f_ir = 1) or not (f_ir = 0). Let x ≡ {x_1, x_2, ..., x_m} be a binary vector of variables, where x_i is 1 if the feature a_i is selected, and 0 otherwise. The problem of finding a consistent biclustering considering the maximum possible number of features can be formulated as follows:

max_x f(x) = Σ_{i=1}^{m} x_i   (5)

subject, ∀ r̂, ξ ∈ {1, 2, ..., k}, r̂ ≠ ξ, j ∈ S_r̂, to:

( Σ_{i=1}^{m} a_ij f_ir̂ x_i ) / ( Σ_{i=1}^{m} f_ir̂ x_i )  >  ( Σ_{i=1}^{m} a_ij f_iξ x_i ) / ( Σ_{i=1}^{m} f_iξ x_i ).   (6)

The generic constraint (6) ensures that each sample j ∈ S_r̂ is mostly expressed in the bicluster (S_r̂, F_r̂) it belongs to. Note that the two fractions are used for computing the centroids of the clusters of features, and that the sums (at the numerators and at the denominators) only consider the selected features (each unselected feature is automatically discarded because x_i = 0). The reader is referred to [2] for additional details.

In this context, two other optimization problems have also been introduced [12]. They are extensions of the problem (5)-(6), which have been proposed in order to overcome some problems related to data affected by noise. If a partition in clusters for the samples is available, then we can find a partition in clusters for the features: each feature a_i is assigned to the cluster F_r̂ if c^S_ir̂ is the centroid with the largest value.
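The fractional constraints (6) are easy to evaluate for a fixed selection vector x. The sketch below (hypothetical helper and toy data; the paper itself solves the problem with AMPL/CPLEX, not Python) checks the feasibility of a candidate selection:

```python
def feasible(A, f, sample_cls, x, k):
    """Check the fractional constraints (6) for a 0-1 selection vector x.
    A: m x n data matrix (list of rows); f[i][r] = 1 iff feature i belongs
    to cluster F_r (known in advance); sample_cls[j] gives S_rhat for j.
    Assumes each cluster keeps at least one selected feature."""
    m, n = len(A), len(A[0])

    def centroid(j, r):
        # centroid of the SELECTED features of cluster r, at sample j:
        # unselected features drop out of both sums because x_i = 0
        num = sum(A[i][j] * f[i][r] * x[i] for i in range(m))
        den = sum(f[i][r] * x[i] for i in range(m))
        return num / den

    for j in range(n):
        r_hat = sample_cls[j]
        for xi in range(k):
            # strict inequality of (6), for every competing cluster xi
            if xi != r_hat and not centroid(j, r_hat) > centroid(j, xi):
                return False
    return True

# Toy data: f assigns features {0,1} to F_0 and {2,3} to F_1
A = [[9, 8, 1], [8, 9, 2], [1, 2, 9], [2, 1, 8]]
f = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(feasible(A, f, [0, 0, 1], [1, 1, 1, 1], 2))  # True
```

The optimization problem (5)-(6) then asks for the 0-1 vector x of maximum weight among those for which such a check succeeds.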
Let us suppose that the following condition holds for a certain feature a_i:

min_{ξ ≠ r̂} { c^S_ir̂ − c^S_iξ } ≤ ε,

where ε is a small positive real number. If this is the case, small changes (i.e. noise) in the data can lead to different partitions of the features, because the margin between c^S_ir̂ and the other centroids is very small. In order to overcome this problem, the concepts of α-consistent biclustering and β-consistent biclustering have been introduced in [12]. They lead to the formulation of the following two optimization problems. The problem of finding an α-consistent biclustering with a maximal number of features is equivalent to solving the optimization problem:

max_x f(x) = Σ_{i=1}^{m} x_i   (7)

subject, ∀ r̂, ξ ∈ {1, 2, ..., k}, r̂ ≠ ξ, j ∈ S_r̂, to:

( Σ_{i=1}^{m} a_ij f_ir̂ x_i ) / ( Σ_{i=1}^{m} f_ir̂ x_i )  >  α_j + ( Σ_{i=1}^{m} a_ij f_iξ x_i ) / ( Σ_{i=1}^{m} f_iξ x_i ),   (8)

where each α_j > 0. Similarly, the problem of finding a β-consistent biclustering with a maximal number of features is equivalent to solving the optimization problem:

max_x f(x) = Σ_{i=1}^{m} x_i   (9)

subject, ∀ r̂, ξ ∈ {1, 2, ..., k}, r̂ ≠ ξ, j ∈ S_r̂, to:

( Σ_{i=1}^{m} a_ij f_ir̂ x_i ) / ( Σ_{i=1}^{m} f_ir̂ x_i )  >  β_j × ( Σ_{i=1}^{m} a_ij f_iξ x_i ) / ( Σ_{i=1}^{m} f_iξ x_i ),   (10)

where each β_j > 1.

All the presented optimization problems are NP-hard [8]. The reader who is interested in more information on the formulation of these optimization problems can refer to [2], [12], [14]. For a general discussion on biclustering, refer to [11].

The three optimization problems (5)-(6), (7)-(8) and (9)-(10) are linear fractional 0–1 optimization problems. In [2], a possible linearization of the problem has been studied.
However, the authors noted that currently available solvers for mixed integer programming are not able to solve the considered linearization, due to the large number of variables which are usually involved when dealing with real-life data. Therefore, they presented a heuristic algorithm for the solution of these problems, which is based on the solution of a sequence of linear 0–1 (non-fractional) optimization problems. Subsequently, in [12], another heuristic algorithm has been proposed, where a sequence of continuous linear optimization problems needs to be solved. The heuristic algorithm we propose is able to provide better solutions than these two.

III. AN IMPROVED HEURISTIC

In the following discussion, only the optimization problem (5)-(6) will be considered, because similar observations can be made for the other two problems. The computational experiments reported in Section IV, however, will be related to all three optimization problems.

We propose a reformulation of the problem (5)-(6) as a bilevel optimization problem. To this aim, we substitute the denominators in the constraints (6) with new variables y_r, r = 1, 2, ..., k, where each y_r is related to the generic bicluster. Then, we can rewrite the constraints (6) as follows:

(1/y_r̂) Σ_{i=1}^{m} a_ij f_ir̂ x_i  >  (1/y_ξ) Σ_{i=1}^{m} a_ij f_iξ x_i.   (11)

The constraints (11) must be satisfied for all r̂, ξ ∈ {1, 2, ..., k}, r̂ ≠ ξ, and for all j ∈ S_r̂.

Let us consider a set of values ȳ_r of y_r, and also another proportional set of values y̆_r = δ ȳ_r, with δ > 0. It is easy to see that, given certain values for the variables x_i, with i = 1, 2, ..., m, the constraints (11) are satisfied with ȳ_r if and only if they are satisfied with y̆_r.
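The scaled constraints (11) can be checked directly for candidate values of x and y. The sketch below (pure Python, with toy data and helper names invented for illustration) accumulates the positive violations of (11); a total of zero means that all the scaled constraints hold:

```python
def total_violation(A, f, sample_cls, x, y, k):
    """Sum of the positive violations of the scaled constraints (11).
    A: m x n data matrix (list of rows); f[i][r] = 1 iff feature i is in
    cluster F_r; x: 0-1 selection vector; y: positive scaling variables."""
    m, n = len(A), len(A[0])

    def scaled(j, r):
        # (1 / y_r) * sum_i a_ij * f_ir * x_i : the scaled centroid of (11)
        return sum(A[i][j] * f[i][r] * x[i] for i in range(m)) / y[r]

    total = 0.0
    for j in range(n):
        r_hat = sample_cls[j]              # sample j belongs to S_rhat
        for xi in range(k):
            if xi != r_hat:
                # only positive violations contribute (the |.|_+ idea)
                total += max(0.0, scaled(j, xi) - scaled(j, r_hat))
    return total

# Toy data: two clearly separated biclusters, all features selected
A = [[9, 8, 1], [8, 9, 2], [1, 2, 9], [2, 1, 8]]
f = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(total_violation(A, f, [0, 0, 1], [1, 1, 1, 1], [0.5, 0.5], 2))  # 0.0
```

Multiplying every y_r by the same δ > 0 rescales both sides of each comparison by 1/δ, which is why only the proportions among the y_r matter.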
As an example, if k = 3 and there is a consistent biclustering in which 20, 30 and 50 features are selected in the k biclusters, then the constraints (11) are also satisfied if 0.20, 0.30 and 0.50, respectively, replace the actual numbers of features (in this example, the proportionality factor δ is 0.01). For this reason, the variables y_r can be used for representing the proportions among the cardinalities of the clusters of features. In the previous example, 20% of the selected features are in the first bicluster, 30% in the second one, and 50% in the last one. The variables y_r can be bounded in the real interval [0, 1], and the following constraint can be included in the optimization problem:

Σ_{r=1}^{k} y_r = 1.   (12)

We introduce the function:

c(x, y_r̂, y_ξ) = Σ_{j ∈ S_r̂} | (1/y_ξ) Σ_{i=1}^{m} a_ij f_iξ x_i − (1/y_r̂) Σ_{i=1}^{m} a_ij f_ir̂ x_i |₊,

where the symbol |·|₊ represents the function which returns its argument if it is positive, and 0 otherwise. As a consequence, the value of this function is positive if and only if the corresponding constraints (11) are not satisfied. Finally, we reformulate the optimization problem (5)-(6) as the bilevel optimization problem:

min_y g(x, y) = Σ_{r̂=1}^{k} Σ_{ξ ≠ r̂} c(x, y_r̂, y_ξ)   (13)

subject to:

x = arg max_x f(x) = Σ_{i=1}^{m} x_i  subject to constraints (11),
Σ_{r=1}^{k} y_r = 1.   (14)

The objective function g of the outer problem is the sum of several terms which correspond to the function c(x, y_r̂, y_ξ) for each r̂ and ξ ∈ {1, 2, ..., k}, with ξ ≠ r̂. The minimization of all the terms of g leads to the identification of biclusterings in which the constraints (11) are all satisfied. If this is the case, the found biclustering is consistent.

Algorithm 1 is a sketch of our heuristic algorithm for feature selection by consistent biclustering.
At the beginning, the variables x_i are all set to 1, and the variables y_r are set so that they represent the distribution of all the m features among the k clusters. Therefore, if the biclustering is already consistent, then the function g is 0 with this choice for the variables, and all the features can be selected. In this case, the condition in the while loop is not satisfied and the algorithm ends.

Algorithm 1 A heuristic algorithm for feature selection.
  let iter = 0;
  let x_i = 1, ∀ i ∈ {1, 2, ..., m};
  let y_r = Σ_i f_ir / m, ∀ r ∈ {1, 2, ..., k};
  let range = starting_range;
  while (g(x, y) > 0 and range ≤ max_range) do
    let iter = iter + 1;
    solve the inner optimization problem (linear & continuous);
    if (g(x, y) > 0) then
      increase range;
      if (g(x, y) has improved) then
        range = starting_range;
      end if
      let r′ = random in {1, 2, ..., k};
      choose randomly y_r′ in [y_r′ − range, y_r′ + range];
      let r′′ = random in {1, 2, ..., k} such that r′ ≠ r′′;
      set y_r′′ so that Σ_r y_r = 1;
    end if
  end while

At each step of the algorithm, the inner optimization problem is solved. It is a linear 0–1 optimization problem, and we consider its continuous relaxation, i.e. we allow the variables x to take any real value in the interval [0, 1]. Therefore, after a solution has been obtained, we substitute the fractional value of each x_i with 0 if x_i ≤ 1/2, or with 1 if x_i > 1/2. Moreover, in the experiments, the strict inequality of the constraints (11) is relaxed, so that the domains defined by the constraints are closed. Under these hypotheses, the optimization problem can be solved by commonly used solvers for mixed integer linear programming (MILP). In our experiments, we employ the ILOG CPLEX solver (version 11) [7]. After the solution of the inner problem, the function g is evaluated.
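The control flow of Algorithm 1 can be sketched as follows. Here `solve_inner` stands for the continuous LP relaxation plus rounding (solved with AMPL/CPLEX in the paper) and `evaluate_g` for the outer objective; both are passed in as callables, so this is only an illustration of the outer loop under a hypothetical interface, not the authors' implementation:

```python
import random

def heuristic(f, m, k, solve_inner, evaluate_g,
              starting_range=0.05, max_range=0.5):
    """Sketch of the outer loop of Algorithm 1.
    f[i][r] = 1 iff feature i belongs to cluster F_r;
    solve_inner(y) -> x returns a rounded inner solution;
    evaluate_g(x, y) -> float is the outer objective (13)."""
    x = [1] * m                                    # select every feature
    # y_r: initial proportion of the m features falling in cluster r
    y = [sum(f[i][r] for i in range(m)) / m for r in range(k)]
    rng, best = starting_range, evaluate_g(x, y)
    while best > 0 and rng <= max_range:
        x = solve_inner(y)
        val = evaluate_g(x, y)
        if val == 0:
            break                                  # consistent: outer solved
        rng *= 2                                   # widen the neighborhood
        if val < best:
            best, rng = val, starting_range        # improvement: refocus
        # perturb one randomly chosen y_{r'} within the current range...
        r1 = random.randrange(k)
        y[r1] = min(1.0, max(0.0, y[r1] + random.uniform(-rng, rng)))
        # ...and fix a second component r'' so that sum_r y_r = 1 again
        # (the paper keeps `range` small enough for y_{r''} to stay feasible)
        r2 = random.choice([r for r in range(k) if r != r1])
        y[r2] = 1.0 - sum(y[r] for r in range(k) if r != r2)
    return x, y
```

For instance, with a stub inner solver that returns the all-ones vector and an objective that is already zero, the loop terminates immediately with every feature selected and y equal to the initial feature proportions.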
If the obtained values for the variables x_i, together with the used values for the variables y_r, correspond to a value of g equal to 0, then the outer problem is also solved and the algorithm stops. Otherwise, some parameters and variables are modified in order to get ready for the next iteration of the algorithm.

The heuristic part of this algorithm takes inspiration from Variable Neighborhood Search (VNS) [5], [10], which is one of the most successful meta-heuristic searches for global optimization [15]. The variables y_r are randomly modified during the algorithm: at each step, two such variables y_r′ and y_r′′ are chosen randomly so that r′ ≠ r′′. Then, y_r′ is perturbed, and its new value is chosen randomly in the interval centered in the previous value of y_r′ and with length 2 × range. As in VNS, the considered interval is relatively small during the first iterations, in order to focus the search in neighborhoods of the current variable values. Then, the interval is progressively increased. However, it is set back to its starting size when better solutions are found. By employing this strategy borrowed from VNS, every time there is a new improvement on the objective function value, the search is initially focused in neighborhoods of the current solution, and then it is extended to the whole search domain. When the considered interval gets too large (max_range), the search is stopped, because there is a low probability of finding better solutions.

TABLE I
Computational experiments on a set of samples from normal and cancer tissues. The features are selected by finding an α-consistent or β-consistent biclustering.

  α     |    0 |    1 |    2 |    5 |   10
  f(x)  | 7450 | 7448 | 7444 | 7413 | 7261

  β     |    1 | 1.01 | 1.50 | 2.00 | 3.00
  f(x)  | 7450 | 7450 | 7107 | 6267 | 5365
After having chosen a value for y_r′, a new value for y_r′′ is computed so that the constraint on all the variables y_r is satisfied. Note that, for values of range large enough, the randomly computed y_r′ could be such that

Σ_{r ≠ r′′} y_r > 1.

In this case, there are no possible values for y_r′′ in [0, 1] for which the constraint (12) can be satisfied. In order to overcome this issue, too large values for range are avoided.

By its nature, the proposed heuristic algorithm can provide different solutions if it is executed more than once (with different seeds for the generator of random numbers). Therefore, the algorithm can be executed a given number of times and the best obtained solution can be taken into consideration.

IV. COMPUTATIONAL EXPERIMENTS

We implemented the presented heuristic algorithm for feature selection in AMPL [1], from which the ILOG CPLEX 11 solver is invoked for the solution of the inner optimization problem. Experiments are carried out on an Intel Core 2 CPU 6400 @ 2.13 GHz with 4GB RAM, running Linux.

The first set of data that we consider is a set of gene expressions related to human tissues from healthy and sick (affected by cancer) patients [13]. This set of data is available on the web site of Princeton University (see the paper for the web link). It contains 36 samples classified as normal or cancer, and each sample is specified through 7457 features. We applied our heuristic algorithm for finding a consistent biclustering for the samples and the features contained in this set of data. Table I shows some computational experiments. We found α-consistent biclusterings and β-consistent biclusterings, with different values for α or β. Note that, even though for each sample a different α_j or β_j can be considered, we use one unique value for α and β in each experiment.
In Table I, the number of selected features f(x) is given in correspondence with each experiment. When α = 0 or β = 1 (consistent biclustering), after only 4 iterations (41 seconds of CPU time), our heuristic algorithm is able to provide the list of selected features, and thus to identify the (few) features to be removed in order to have a consistent biclustering. In particular, 7 features out of 7457 need to be removed (and therefore 7450 features are selected). The bilevel optimization problem to be solved gets harder in the case of α-consistent and β-consistent biclustering. As expected, fewer features are selected when larger α or β values are chosen, because the constraints (11) are more difficult to satisfy. However, using larger values for α and β allows for identifying the features that are actually important for the classification of the samples. The computational cost of our heuristic algorithm increases when larger α or β values are used: some of the presented experiments need some minutes of CPU time to be performed.

TABLE II
Computational experiments on a set of samples from patients diagnosed with ALL or AML diseases. The features are selected by finding an α-consistent or β-consistent biclustering.

           Alg. 1         Alg. in [12]
  α       f(x)   err      f(x)   err
  0       7081    2       7024    2
  10      7076    2       7024    2
  20      7075    2       7018    2
  30      7072    2       7014    2
  40      7068    2       7010    1
  50      7061    1       6959    1
  60      7046    1       6989    1
  70      6954    1       6960    1

           Alg. 1         Alg. in [12]
  β       f(x)   err      f(x)   err
  1.00    7081    2       7024    2
  1.05    7075    2       7017    2
  1.10    7068    2       7010    1
  1.20    7020    1       6937    1
  1.50    6590    1       6508    1
  2.00    5987    1       5905    1
  3.00    5527    2       5458    1
  5.00    5238    2       5173    2
The second real-life set of data we consider consists of samples from patients diagnosed with acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML) [4] (to download the set of data, follow the link given in the reference). This set of data is divided in a training set, which we use for finding consistent biclusterings, and a validation set, which can be used for checking the quality of the classifications performed by using the previously selected features. The training set contains 38 samples: 27 ALL samples and 11 AML samples. The validation set contains 34 samples: 20 ALL samples and 14 AML samples. The total number of features in both sets of data is 7129. Since, in this case, a validation set is also available, we are able to validate the quality of the obtained biclusterings in correspondence with different values of the chosen parameter α or β.

The results of our experiments are in Table II. The total number of features selected in each experiment is reported, together with the number err of misclassifications that occur when the samples of the validation set are classified according to the classification of the features in the α-consistent or β-consistent biclusterings. When α = 0 or β = 1, our heuristic algorithm is able to find a consistent biclustering, but the selected features are not able to provide a correct classification for all the samples of the validation set (err = 2). This is probably due to the fact that the used data are noisy, because they have been obtained from an experimental technique. However, the number err of misclassifications decreases when α or β increase. For example, for α ≥ 50, there is only one misclassification for the samples of the validation set. In Table II, we also compare the obtained results to the ones reported in [12].
Our heuristic algorithm is able to provide better-quality solutions in the majority of the cases. In particular, for given choices of α or β, our heuristic algorithm is able to find biclusterings in which the total number of selected features is larger, except for only one experiment (α = 70). These biclusterings allow one to perform good-quality classifications (err = 1 or 2), while a larger number of features in the set of data is preserved.

V. CONCLUSIONS

We proposed a reformulation of the linear fractional 0–1 optimization problem for feature selection by consistent biclustering. Our reformulation transforms the problem into a bilevel optimization problem, in which the inner problem is linear. We presented a heuristic algorithm for the solution of the reformulated problem, where the continuous relaxation of the inner problem is solved exactly at each iteration of the algorithm. Computational experiments showed that the proposed algorithm can solve feature selection problems by finding consistent, α-consistent and β-consistent biclusterings of a given set of data. The results also showed that this algorithm is able to find better solutions than the ones obtained by previously proposed heuristic algorithms. Future work will be devoted to suitable strategies for improving the efficiency of the proposed algorithm.

REFERENCES

[1] AMPL, http://www.ampl.com/
[2] S. Busygin, O.A. Prokopyev, P.M. Pardalos, Feature Selection for Consistent Biclustering via Fractional 0-1 Programming, Journal of Combinatorial Optimization 10, 7-21, 2005.
[3] S. Busygin, O.A. Prokopyev, P.M. Pardalos, Biclustering in Data Mining, Computers & Operations Research 35, 2964-2987, 2008.
[4] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, 531-537, 1999.
[5] P. Hansen, N. Mladenovic, Variable Neighborhood Search: Principles and Applications, European Journal of Operational Research 130 (3), 449-467, 2001.
[6] J. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
[7] ILOG, CPLEX, http://www.ilog.com/products/cplex/
[8] O.E. Kundakcioglu, P.M. Pardalos, The Complexity of Feature Selection for Consistent Biclustering, In: Clustering Challenges in Biological Networks, S. Butenko, P.M. Pardalos, W.A. Chaovalitwongse (Eds.), World Scientific Publishing, 2009.
[9] S.C. Madeira and A.L. Oliveira, Biclustering Algorithms for Biological Data Analysis: a Survey, IEEE Transactions on Computational Biology and Bioinformatics 1 (1), 24-44, 2004.
[10] N. Mladenovic, P. Hansen, Variable Neighborhood Search, Computers and Operations Research 24, 1097-1100, 1997.
[11] A. Mucherino, P. Papajorgji, P.M. Pardalos, Data Mining in Agriculture, Springer, 2009.
[12] A. Nahapetyan, S. Busygin, and P.M. Pardalos, An Improved Heuristic for Consistent Biclustering Problems, In: Mathematical Modelling of Biosystems, R.P. Mondaini and P.M. Pardalos (Eds.), Applied Optimization 102, Springer, 185-198, 2008.
[13] D.A. Notterman, U. Alon, A.J. Sierk, A.J. Levine, Transcriptional Gene Expression Profiles of Colorectal Adenoma, Adenocarcinoma, and Normal Tissue Examined by Oligonucleotide Arrays, Cancer Research 61, 3124-3130, 2001.
[14] P.M. Pardalos, O.E. Kundakcioglu, Classification via Mathematical Programming, Journal of Computational and Applied Mathematics 8 (1), 23-35, 2009.
[15] E.-G. Talbi, Metaheuristics: From Design to Implementation, John Wiley & Sons, 2009.