Correcting for selection bias via cross-validation in the classification of microarray data
There is increasing interest in the use of diagnostic rules based on microarray data. These rules are formed by considering the expression levels of thousands of genes in tissue samples taken on patients of known classification with respect to a number of classes, representing, say, disease status or treatment strategy. As the final versions of these rules are usually based on a small subset of the available genes, there is a selection bias that has to be corrected for in the estimation of the associated error rates. We consider the problem using cross-validation. In particular, we present explicit formulae that are useful in explaining the layers of validation that have to be performed in order to avoid improperly cross-validated estimates.

Authors: G. J. McLachlan, J. Chevelu and J. Zhu (University of Queensland and University of Rennes)

IMS Collections, Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, Vol. 1 (2008) 364-376. (c) Institute of Mathematical Statistics, 2008. DOI: 10.1214/193940307000000284.

1. Introduction

Microarray experiments were first described in the mid-1990s as a means of measuring the expression levels of thousands of genes simultaneously (Lipshutz et al. [4] and Schena et al. [8]). They were quickly adopted by the research community for the study of a wide range of problems in the biological and medical sciences (Quackenbush [7]). One of the most important emerging clinical applications of microarray technology concerns the diagnosis of diseases, particularly in the context of cancer diagnosis. For example, accurate diagnosis of breast cancer can spare a significant number of breast cancer patients from receiving unnecessary adjuvant systemic treatment.
There are two basic approaches to generating microarray data. In a two-colour array, two samples of ribonucleic acid (RNA), each labelled with a different dye, are simultaneously hybridized to the array. Cyanine 3 (Cy3) and cyanine 5 (Cy5) are fluorescent dyes that are commonly used. The sample of interest, for example, a breast cancer sample, is labelled with one dye (say, Cy5), and a reference sample, for example, normal breast tissue, is labelled with another dye (say, Cy3). With this approach, the level of gene expression is estimated by the logarithm of the ratio of RNA in the sample of interest to that in the (control) reference sample. For single-colour arrays, such as the Affymetrix GeneChip, each sample is labelled and individually incubated with an array. The level of expression for each gene is represented by a single fluorescence intensity.

There is increasing interest in the use of diagnostic rules based on microarray data. In this context, there are available tissue samples of known classification with respect to a number of classes representing various conditions; for example, presence or absence of a disease or different types of treatments. The aim is to form a prediction rule based on the gene-signature vectors of these classified tissue samples.

*Supported by the Australian Research Council.
1 Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Australia, e-mail: gjm@maths.uq.edu.au; justin.zhu@dsto.defence.gov.au
2 IFSIC, Université Rennes 1 and Ecole Normale Supérieure, Bretagne, France, e-mail: jonathan.chevelu@eleves.bretagne.ens-cachan.fr
AMS 2000 subject classifications: Primary 62H30; secondary 62H12.
Keywords and phrases: cross-validation, discriminant analysis, error rate estimation, gene expression data, selection bias.
As these samples contain the expression levels on thousands of genes, some form of variable (gene) selection is usually employed before or during the formation of the prediction rule. As the final form of the prediction rule adopted will typically depend on a much smaller subset of the genes relative to the full set, care must be exercised in estimating the error rates to ensure that the consequent selection bias is corrected for. This selection bias is often overlooked in the estimation of the error rates in the bioinformatics literature, so that it is common to read of a rule based on only a few genes with a zero or near-zero assessed error rate.

We focus on a nonparametric prediction rule, namely the support vector machine, and the nonparametric approach of cross-validation for the estimation of its error rates. In particular, we present explicit formulae that show and explain the extra layers of validation that need to be carried out to correct satisfactorily for the selection bias or biases in those situations where there is more than one source.

2. Notation

2.1. Prediction rules

Before we proceed to discuss the selection bias problem in estimating the accuracy of a prediction rule, we need to introduce some notation. Also, we need to define formally the prediction rule to be considered and the method of error-rate estimation to be adopted.

Although biological experiments vary considerably in their design, the data generated by microarray experiments can be viewed as a matrix of expression levels. For M microarray experiments (corresponding to M tissue samples), where we measure the expression levels of N genes in each experiment, the results can be represented by an N x M matrix. For each tissue, we can consider the expression levels of the N genes, called its expression signature.
Conversely, for each gene, we can consider its expression levels across the different tissue samples, called its expression profile. The M tissue samples might correspond to each of M different patients.

In the present context, the problem is to construct a prediction (discriminant) rule r(y) that can accurately predict the class of origin of a tumour tissue with feature vector (gene-signature vector) y, which is unclassified with respect to a known number g (>= 2) of distinct tissue classes, denoted here by C_1, ..., C_g. Here the signature vector y contains the expression levels of a very large number p of genes. If r(y) = i, then it implies that y should be assigned to the ith class C_i (i = 1, ..., g). In applications concerned with the diagnosis of cancer, one class (C_1) may correspond to cancer and the other (C_2) to benign tumours. In applications concerned with patient survival following treatment for cancer, one class (C_1) may correspond to the good-prognosis class and the other (C_2) to the poor-prognosis class.

Also, there is interest in the identification of "marker" genes that characterize the different tissue classes. This is the feature selection problem.

In the situation where the intention is limited to making an outright assignment to one of the possible classes, it is perhaps more appropriate to use the term prediction rather than discriminant to describe the rule. However, we shall use either nomenclature regardless of the underlying situation. In the pattern recognition jargon, such a rule is referred to as a classifier.

In order to train the prediction rule, there are available training data t consisting of n tissue samples of known classification.
These data are obtained from n microarrays, where the jth microarray experiment gives the expression levels of the p genes in the jth tissue sample y_j of the training set. The vector

(2.1)  t = (y_1^T, z_1^T, ..., y_n^T, z_n^T)^T

denotes the training data, where z_j = (z_{1j}, ..., z_{gj})^T is the class-indicator vector, and z_{ij} is one or zero according as y_j comes from the ith class C_i or not (i = 1, ..., g; j = 1, ..., n). We shall write the sample rule formed from the training data t as r(y; t) to show its dependence on the training data t.

2.2. Different types of error rates

For a given realization t of the training data T, it is the conditional or actual allocation rates of a sample prediction rule r(y; t) that are of central interest. They are given by

(2.2)  ec_{ij} = pr{ r(Y; t) = j | Y \in C_i, t }   (i, j = 1, ..., g).

That is, ec_{ij} is the probability, conditional on t, that a randomly chosen observation from C_i is assigned to C_j by r(y; t). The unconditional or expected allocation rates of r(y; t) are given by

eu_{ij} = pr{ r(Y; T) = j | Y \in C_i } = E{ ec_{ij} }   (i, j = 1, ..., g).

The unconditional rates are useful in providing a guide to the performance of the rule before it is actually formed from the training data.

Concerning the error rates specific to a class, the conditional probability of misallocating a randomly chosen member from C_i is

ec_i = \sum_{j \neq i}^{g} ec_{ij}   (i = 1, ..., g).

The overall conditional error rate for an entity drawn randomly from a mixture G of C_1, ..., C_g in proportions \pi_1, ..., \pi_g, respectively, is

ec = \sum_{i=1}^{g} \pi_i ec_i.

The individual class and overall unconditional error rates, eu_i and eu, are defined similarly. It is noted that the overall error rate may give a misleading summary of the error rates when the class-sample sizes are disparate.
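As a concrete illustration of the class-specific and overall rates just defined, the following sketch estimates ec_i and the overall rate ec from a g x g table of allocation counts, in which entry (i, j) counts the class-C_i samples assigned to C_j by the rule. The counts and mixing proportions here are hypothetical.

```python
# A minimal sketch: class-specific and overall error rates from a g x g table
# of allocation counts; count[i][j] = number of C_i samples assigned to C_j.

def class_error_rates(count):
    """Return the conditional error rate ec_i for each class C_i."""
    rates = []
    for i, row in enumerate(count):
        n_i = sum(row)
        misallocated = n_i - row[i]          # all assignments with j != i
        rates.append(misallocated / n_i)
    return rates

def overall_error_rate(count, priors):
    """Overall rate ec = sum_i pi_i * ec_i for mixing proportions pi_i."""
    return sum(p * e for p, e in zip(priors, class_error_rates(count)))

# Hypothetical two-class counts: 44 and 34 samples per class.
count = [[40, 4],     # class C_1: 4 of 44 misallocated
         [10, 24]]    # class C_2: 10 of 34 misallocated
ec = class_error_rates(count)                       # [4/44, 10/34]
ec_overall = overall_error_rate(count, [0.5, 0.5])  # equal priors
```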
However, it is often reasonable to take the prior probabilities to be the same, where the latter are now interpreted as the class-prior probabilities adjusted (multiplicatively) by the relative importance of the costs of misallocation. For example, in the case of two classes, the cost of misallocation is often much greater for the rarer class; see McLachlan [5], Page 9.

If r(y; t) is constructed from t in a consistent manner with respect to the Bayes rule r_o(y), then

\lim_{n \to \infty} eu = e_o,

where e_o denotes the optimal error rate. Interest in the optimal error rate in practice is limited to the extent that it represents the error of the best obtainable version of the sample-based rule r(y; t). The Bayes or optimal rule r_o(y) is defined by

(2.3)  r_o(y) = \arg\max_i \tau_i(y),

where

(2.4)  \tau_i(y) = pr{ Y \in C_i | y } = \pi_i f_i(y) / f(y)

is the posterior probability that y belongs to the ith class C_i (i = 1, ..., g). Here f_i(y) denotes the density of y in class C_i and

(2.5)  f(y) = \sum_{i=1}^{g} \pi_i f_i(y)

is the unconditional density of y.

3. Support vector machine

3.1. Definition

In the sequel, we shall focus on the use of a nonparametric classifier, namely the support vector machine (SVM), as introduced by Vapnik [12]. Advantages of an SVM in the present context, where the number of feature variables (genes) p is so large relative to the sample size n, are that it is able to be fitted to all the genes and that its performance appears not to be too affected by using the full set of genes. However, in practice, some form of gene selection would generally be contemplated. Another advantage of the SVM (with a linear kernel) is that gene selection can be undertaken fairly simply using the vector of weights as the criterion.
For an SVM with linear kernel, the rule r(y; t) in the case of g = 2 classes is equal to 1 or 2, according as the sign of

(3.1)  \hat{\beta}_0 + \hat{\beta}^T y

is positive or negative. In (3.1), \hat{\beta}_v = (\hat{\beta})_v denotes the (fitted) coefficient of the expression level y_v for gene v.

The SVM learning algorithm (with linear kernel) aims to find the separating hyperplane (3.1) that is maximally distant from the training data of the two classes. When the classes are linearly separable, the hyperplane is located so that it has maximal margin (that is, so that there is maximal distance between the hyperplane and the nearest point in any of the classes). When the data are not separable, there is no separating hyperplane; in this case, the aim is still to try to maximize the margin but allow some classification errors, subject to the constraint that the total error (distance from the hyperplane on the wrong side) is less than a constant (Vapnik [12]).

3.2. Recursive feature elimination (RFE)

As the number of genes is much greater than the number of tissues, consideration is usually given initially to reducing the number of genes. Frequently, this is undertaken in some ad hoc manner before a more formal method of feature selection is adopted in conjunction with the choice of prediction rule. For example, one such commonly used method in the case of two classes is to rank the features on the basis of the magnitude of the (pooled) two-sample t-statistic. For an SVM with a linear kernel, Guyon et al. [3] have shown that a good guide to the relative importance of the variables (genes) is given by the relative size of the absolute values of their fitted coefficients \hat{\beta}_v (that is, the weights).
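The linear rule (3.1) and the weight-based ranking just described can be sketched as follows. The coefficients below are hypothetical stand-ins for the fitted SVM weights.

```python
# A minimal sketch of applying the linear rule (3.1) and of ranking genes by
# the absolute values of their fitted weights. The coefficients are
# hypothetical; in practice they come from fitting the SVM to the training data.

def linear_rule(beta0, beta, y):
    """Return class 1 or 2 according as beta0 + beta^T y is positive or not."""
    score = beta0 + sum(b * v for b, v in zip(beta, y))
    return 1 if score > 0 else 2

def rank_genes_by_weight(beta):
    """Gene indices ordered from most to least important by |beta_v|."""
    return sorted(range(len(beta)), key=lambda v: abs(beta[v]), reverse=True)

beta0, beta = -0.5, [0.8, -0.3, 0.1]       # one fitted weight per gene
assert linear_rule(beta0, beta, [1.0, 0.2, 0.5]) == 1   # score 0.29 > 0
assert linear_rule(beta0, beta, [0.1, 1.0, 0.2]) == 2   # score -0.70 < 0
assert rank_genes_by_weight(beta) == [0, 1, 2]
```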
Hence a ranking of the discriminatory power of the genes can be given by ranking the genes from top to bottom on the basis of the absolute values of the weights \hat{\beta}_v. This is what is called a wrapper method, as the gene reduction method is embedded in the prediction rule.

In applying the SVM in this study, we adopt the selection procedure of Guyon et al. [3], who used a backward selection procedure, which they termed recursive feature elimination (RFE). It considers initially all the available genes, which are ranked according to their weights, and the bottom-ranked genes are discarded. The SVM is then refitted to the remaining genes, which are then reranked according to their new weights. Again, the bottom-ranked genes are discarded, and so on. In the applications to follow on microarray data, we first discarded enough bottom-ranked genes so that the number retained was the greatest power of 2 (less than the original number of genes). We then proceeded sequentially to discard half the current number of genes on each subsequent step. Initially, the error rate usually falls as genes are deleted, but generally, it will start to rise once a sufficiently large number of genes have been deleted. As noted by Guyon et al. [3], the process can be refined at the expense of greater computational time by deleting fewer variables on each step. One suggested procedure is to apply RFE by removing chunks of features in the first few iterations and then removing one feature at a time once the feature set size reaches a few hundred.

4. Estimates of error rates based on cross-validation

In practice, the error rates of a prediction rule r(y; t) are unknown as they depend on the unknown class-conditional distributions. Hence they must be estimated. In the above, we have used the notation r(y; t) to denote a prediction rule based on the training data t.
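The RFE halving schedule of Section 3.2 can be sketched as follows, with a hypothetical `fit_weights` standing in for refitting the linear SVM and returning one weight per retained gene.

```python
# A sketch of RFE with halving (Section 3.2). `fit_weights` is a hypothetical
# stand-in for refitting the linear SVM on the retained genes and returning a
# mapping from each gene to its fitted weight.

def rfe_halving(genes, fit_weights):
    """Yield the retained subsets: first cut to the greatest power of 2 below
    the starting number of genes, then halve the subset on each step."""
    size = 1
    while size * 2 < len(genes):
        size *= 2                        # greatest power of 2 below len(genes)
    current = list(genes)
    while size >= 1:
        weights = fit_weights(current)
        ranked = sorted(current, key=lambda g: abs(weights[g]), reverse=True)
        current = ranked[:size]          # discard the bottom-ranked genes
        yield list(current)
        size //= 2

# Toy run with fixed "weights" (gene v has weight v), so high-index genes survive.
subsets = list(rfe_halving(range(10), lambda genes: {g: g for g in genes}))
# retained subset sizes: 8, 4, 2, 1
```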
It is implicitly assumed that all the p available genes are being used in the formation and application of this rule. In the case where only a subset of the genes is used, we write the rule as r{y; t, s_d(t)} to denote that it is formed using the subset s_d(t). The latter denotes a subset of d genes that has been selected by the adopted method of gene selection employed on the training data t. Here, with the SVM rule, we are using recursive feature elimination (RFE).

The apparent error rate of r{y; t, s_d(t)} is given by the proportion of the tissue samples misallocated when this rule is applied to the training data t. It can therefore be expressed as

(4.1)  A{s_d(t)} = (1/n) \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} Q[i, r{y_j; t, s_d(t)}],

where Q[u, v] is one if u \neq v and zero if u = v, and z_{ij} = (z_j)_i (i = 1, ..., g; j = 1, ..., n). Unless the sample size n is large relative to the number of genes d being used, it will provide too optimistic an assessment of the error rate. In the sequel, we focus on the use of cross-validation to correct the apparent error rate for its downward bias.

The leave-one-out or n-fold cross-validated rate is given by

(4.2)  A_{nCV}{s_d(t)} = (1/n) \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} Q[i, r{y_j; t_{(j)}, s_d(t_{(j)})}],

where t_{(j)} denotes the training data with the jth expression signature vector y_j and its class label z_j deleted; that is, with (y_j^T, z_j^T)^T deleted. As the notation implies, we have to select a new subset of genes, s_d(t_{(j)}), for the training subset t_{(j)} used on the jth validation trial (j = 1, ..., n). We shall refer to the cross-validated error rate (4.2) as the external cross-validated rate, as on each validation trial gene selection is undertaken externally on each training subset t_{(j)}.
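The apparent error rate (4.1) amounts to a simple misallocation count over the training data, as the following sketch shows; the one-gene threshold rule and the data are hypothetical.

```python
# A sketch of the apparent error rate (4.1): the proportion of training samples
# that the rule misallocates when reapplied to the data it was trained on.
# `rule` is any function mapping a signature vector y to a class label.

def apparent_error_rate(rule, signatures, labels):
    """A{s_d(t)}: average of Q[true class, predicted class] over the n samples."""
    n = len(signatures)
    errors = sum(1 for y, z in zip(signatures, labels) if rule(y) != z)
    return errors / n

# Hypothetical one-gene threshold rule and toy training data.
rule = lambda y: 1 if y[0] > 0 else 2
ys = [[0.5], [1.2], [-0.3], [0.8]]
zs = [1, 1, 2, 2]                          # the last sample is misallocated
rate = apparent_error_rate(rule, ys, zs)   # 1 error out of 4
```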
If we were to use the same subset s_d(t) of d genes as obtained originally from the full training data set t on each validation trial, then there would be a selection bias, as illustrated, for example, in Ambroise and McLachlan [1]. We can express this ordinary or internal cross-validated error rate as

(4.3)  A_{nCVI}{s_d(t)} = (1/n) \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} Q[i, r{y_j; t_{(j)}, s_d(t)}].

We use the term "ordinary" or "internal" here to mean that cross-validation is being used as it would be in the situation where the rule was based on d genes with no selection process employed to choose the d genes.

As the leave-one-out or n-fold CV rate is highly variable, it is common to use five- or ten-fold cross-validation. For example, with the latter, we can divide the n training observations (y_j^T, z_j^T)^T in t into 10 blocks (subsamples) B_1, ..., B_{10} of approximately equal size. We let n_k denote the size of B_k (k = 1, ..., 10). In this case we can express the ten-fold cross-validated rate as

(4.4)  A_{10CV}{s_d(t)} = (1/n) \sum_{i=1}^{g} \sum_{k=1}^{10} \sum_{j \in B_k} z_{ij} Q[i, r{y_j; t_{(B_k)}, s_d(t_{(B_k)})}],

where s_d(t_{(B_k)}) denotes the selected subset of size d based on the training data with the kth block B_k deleted. In equation (4.4), the third summation is over all labels j of the tissue samples that belong to B_k. Often this cross-validation is carried out in stratified form, in that the classes are represented in each fold in approximately the same proportions as in the original training set.

Unlike n-fold cross-validation, the value of the estimate (4.4) with ten-fold cross-validation is not unique, as the training data do not have a unique split into 10 validation blocks.
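The distinction between the internal rate (4.3) and the external rate (4.4) is simply where the selection step sits. A sketch of the external version, in which selection is rerun on each reduced training set, with hypothetical `select` and `fit` stand-ins for RFE and the SVM fit:

```python
# A sketch of the external cross-validated rate (4.4): on each fold, gene
# selection is rerun on the reduced training data before the rule is fitted.
# The internal rate (4.3) would instead call `select` once, outside the loop.

def external_cv_error(data, folds, select, fit):
    """data: list of (y, z) pairs; folds: disjoint index blocks B_1, ..., B_K."""
    n = len(data)
    errors = 0
    for block in folds:
        train = [data[j] for j in range(n) if j not in block]
        subset = select(train)       # selection sees only this fold's training part
        rule = fit(train, subset)
        errors += sum(1 for j in block if rule(data[j][0]) != data[j][1])
    return errors / n

# Toy stand-ins: one selected gene, nearest-class-mean rule.
def fit(train, subset):
    g = subset[0]
    c1 = [y[g] for y, z in train if z == 1]
    c2 = [y[g] for y, z in train if z == 2]
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    return lambda y: 1 if abs(y[g] - m1) <= abs(y[g] - m2) else 2

data = [([0.0], 1), ([0.1], 1), ([1.0], 2), ([0.9], 2)]
rate = external_cv_error(data, [[0, 2], [1, 3]], select=lambda tr: [0], fit=fit)
```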
We can attempt to reduce the variance of the estimated error rate by calculating (4.4) for a number of splits of the training data t into ten blocks and taking the average. This repeated cross-validation was not used in the results reported here.

5. Illustration of selection bias (subset of fixed size)

We demonstrate the selection bias that can occur when we estimate the error rate of a prediction rule based on a subset of the available genes via cross-validation without taking into account that the subset has been selected in some optimal way. As noted by Ambroise and McLachlan [1], this selection bias is often overlooked in the bioinformatics literature.

We consider the breast cancer data set of van 't Veer et al. [11]. The data as analyzed here consist of the expression levels of 5,422 genes in tumours from 78 lymph-node negative patients with sporadic primary breast cancer, categorized into two classes of patients C_1 and C_2, with n_1 = 44 patients in C_1 representing a good-prognosis class (that is, those who remained metastasis free after a period of more than five years), and n_2 = 34 patients in a poor-prognosis class (those who developed distant metastases within five years). The patients were young (less than 55 years in age) and had lymph-node negative tumours. Further, they had not received adjuvant therapy, which is likely to modify outcome, and were diagnosed with breast cancer between 1983 and 1994, making a follow-up of 10 years or more possible.

We applied a support vector machine with RFE to these data, and the (ten-fold) cross-validated error rates at each stage of the selection procedure are displayed in Table 1. The first column gives the (internal) cross-validated error rate A_{10CVI}{s_d(t)}, for which an external cross-validation has not been implemented.
The second column gives the increased estimate A_{10CV}{s_d(t)} using an external validation; that is, it uses the ten-fold version (4.4).

More specifically, consider the entries of 0.15 and 0.29 for A_{10CVI}{s_8(t)} and A_{10CV}{s_8(t)}, respectively, for the SVM formed from the remaining 8 genes during the RFE process. With the former, the subset of 8 genes as obtained by RFE applied to the full data set is used on each of the ten validation trials. But with the latter, the selection procedure RFE is run on each of the ten validation trials, starting with all the genes, to obtain a possibly new reduced subset of 8 genes. The fact that we do not always obtain the same 8 genes on each of the ten validation trials can be used to identify potential marker genes.

6. Selection bias: Optimal subset of unrestricted size

In practice, having selected the "best" subset of a particular size, say d variables (genes), attention turns to seeking the best subset over all sizes d. For example, on comparing the values of the estimated error rate A_{10CV}{s_d(t)} in Table 1 for the values of d considered, we can see that the minimum value of this estimate occurs for d = 8 and d = 256 genes. Thus in practice, consideration might be given to using the subset of 8 genes. But we know that its estimated error rate A_{10CV}{s_8(t)} will not be an unbiased estimate of the error rate of the SVM with these 8 genes in its application to future tissue samples. We can correct for this additional selection bias, due to the optimization of A_{10CV}{s_d(t)} over various sizes d, by performing a second layer of cross-validation.
Table 1
Cross-validated error rates of SVM with RFE applied to 5,422 genes on n_1 = 44 good-prognosis patients (C_1) and n_2 = 34 poor-prognosis patients (C_2)

Number of genes | Internal CV error rate | External CV error rate
1    | 0.28 | 0.40
2    | 0.19 | 0.40
4    | 0.21 | 0.42
8    | 0.15 | 0.29
16   | 0.18 | 0.38
32   | 0.12 | 0.38
64   | 0.10 | 0.33
128  | 0.12 | 0.32
256  | 0.17 | 0.29
512  | 0.15 | 0.31
1024 | 0.19 | 0.32
2048 | 0.22 | 0.35
4096 | 0.31 | 0.37
5422 | 0.37 | 0.37

More specifically, suppose that r{y; t, s_*(t)} denotes the rule formed using the subset of genes s_*(t) that minimizes A_{10CV}{s_d(t)} over all values of d that have been considered. That is, s_*(t) = s_h(t), where

(6.1)  h = \arg\min_d A_{10CV}{s_d(t)}.

In the case where equation (6.1) is satisfied by more than one value of d, we can work with the smallest value of d. It is clear that A_{10CV}{s_d(t)} will underestimate the true error rate of r{y; t, s_*(t)}. We can correct for this additional selection bias in forming our final rule by the optimization of a sequence of rules by performing an additional layer of cross-validation. More specifically, to train the rule based on t_{(B_k)} on the kth validation trial (k = 1, ..., 10), we need to perform an additional cross-validation to form the optimal rule r{y; t_{(B_k)}, s_*(t_{(B_k)})}, where s_*(t_{(B_k)}) denotes the optimal subset over all subsets for a rule based on the training data with the kth block deleted. We can do this using ten-fold cross-validation; or, since t_{(B_k)} consists of 9 blocks B_h (h = 1, ..., 10; h \neq k), it is convenient to use nine-fold cross-validation.
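The two layers of validation just described can be sketched as a nested cross-validation: on each outer fold, an inner cross-validation over the remaining blocks picks the subset size d, and only then is the resulting rule scored on the held-out block. Here `select_d` and `fit` are hypothetical stand-ins for RFE at size d and the SVM fit.

```python
# A sketch of the double cross-validation behind the second-layer correction:
# the inner loop chooses d on the remaining blocks; the outer block scores
# the rule built with that choice, so the size selection is itself validated.

def nested_cv_error(data, folds, sizes, select_d, fit):
    n = len(data)
    errors = 0
    for k, outer in enumerate(folds):
        inner_folds = [b for i, b in enumerate(folds) if i != k]
        train_idx = [j for j in range(n) if j not in outer]

        def inner_error(d):
            e = m = 0
            for block in inner_folds:
                tr = [data[j] for j in train_idx if j not in block]
                rule = fit(tr, select_d(tr, d))
                e += sum(1 for j in block if rule(data[j][0]) != data[j][1])
                m += len(block)
            return e / m

        best_d = min(sizes, key=inner_error)          # inner layer: choose d
        train = [data[j] for j in train_idx]
        rule = fit(train, select_d(train, best_d))
        errors += sum(1 for j in outer if rule(data[j][0]) != data[j][1])
    return errors / n

# Toy stand-ins: one gene, nearest-class-mean rule.
def fit(train, subset):
    g = subset[0]
    m1 = sum(y[g] for y, z in train if z == 1) / sum(1 for _, z in train if z == 1)
    m2 = sum(y[g] for y, z in train if z == 2) / sum(1 for _, z in train if z == 2)
    return lambda y: 1 if abs(y[g] - m1) <= abs(y[g] - m2) else 2

data = [([0.0], 1), ([0.1], 1), ([0.2], 1), ([1.0], 2), ([0.9], 2), ([0.8], 2)]
rate = nested_cv_error(data, [[0, 3], [1, 4], [2, 5]], sizes=[1],
                       select_d=lambda tr, d: [0], fit=fit)
```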
Accordingly, we can provide an approximately unbiased estimate of the error rate, given by

(6.2)  A_{10CV2}{s_*(t)} = (1/n) \sum_{i=1}^{g} \sum_{k=1}^{10} \sum_{j \in B_k} z_{ij} Q[i, r{y_j; t_{(B_k)}, s_*(t_{(B_k)})}],

where s_*(t_{(B_k)}) = s_{h_k}(t_{(B_k)}) and

h_k = \arg\min_d (1/(n - n_k)) \sum_{i=1}^{g} \sum_{k' \neq k} \sum_{j \in B_{k'}} z_{ij} Q[i, r{y_j; t_{(B_k, B_{k'})}, s_d(t_{(B_k, B_{k'})})}],

and t_{(B_k, B_{k'})} denotes the full training data set t with the kth and the k'th blocks, B_k and B_{k'}, deleted, and n_k is the number of samples in the kth block.

Applying this estimate (6.2) to the data in Table 1, we find that the value of A_{10CV2}{s_*(t)} is 0.37. This compares with the value of 0.29 for the estimate A_{10CV}{s_d(t)} for d = 8 genes. Thus the effect of performing the second layer of cross-validation is to increase the estimate of the error rate from 0.29 to 0.37.

7. Additional bias from preliminary screening

In the previous sections, we have considered the SVM prediction rule for some microarray data from the breast cancer study of van 't Veer et al. [11]. It sought to distinguish between patients who had the same stage of disease but a different response to treatment and a different overall outcome. A signature vector of 70 genes was identified on the basis of the correlation between a gene and the class label, which is equivalent to using the (pooled) two-sample t-statistic to rank the genes. They called these 70 genes the prognostic marker genes. The superiority of this discriminant analysis approach based on this 70-gene signature as compared with traditional clinical staging is under dispute. Also, Ein-Dor et al. [2] in a study of this data set concluded that there are several sets of 70 genes with equal predictive powers, and this is supported by subsequent studies including Michiels et al.
[6] and our own results, which are to be reported elsewhere.

Note that there is a selection bias if a prediction rule is formed from these 70 genes without taking into account that they are the "top" 70 genes in a much larger list of genes. To illustrate this, we formed an SVM from the data set of 78 breast cancers using just these 70 genes. In Table 2, we report the value of the (internal) cross-validated error rate A_{10CVI}{s_d(t^{(70)})} for the values of d considered, where t^{(70)} denotes the training data with expression levels available only on the "top" 70 genes. That is,

(7.1)  A_{10CVI}{s_d(t^{(70)})} = (1/n) \sum_{i=1}^{g} \sum_{k=1}^{10} \sum_{j \in B_k} z_{ij} Q[i, r{y_j; t^{(70)}_{(B_k)}, s_d(t^{(70)}_{(B_k)})}].

On comparing these results with those in Table 1 for the SVM using RFE starting from 5,422 genes, we can see that there is a selection bias in starting with these "top" 70 genes. In practice, we can correct for this bias provided the expression levels are available for the larger set of genes. This raises the question of how large a set of genes we should start with in order to avoid this type of bias. Zhu et al. [13] concluded that the initial set of genes has to be relatively small relative to the total number of genes available for this bias to be of practical significance.

In the second column of Table 2, we have listed the values of the (external) cross-validated error rate given by

(7.2)  A_{10CV}{s_d(t^{(70)})} = (1/n) \sum_{i=1}^{g} \sum_{k=1}^{10} \sum_{j \in B_k} z_{ij} Q[i, r{y_j; t^{(70_k)}_{(B_k)}, s_d(t^{(70_k)}_{(B_k)})}],

where t^{(70_k)} denotes the training data for the "top" 70 genes found when using the kth training subset t_{(B_k)}, with expression levels available on the full set of 5,422 genes on all tissues apart from those in the kth block B_k, and t^{(70_k)}_{(B_k)} denotes t^{(70_k)} with the kth block B_k deleted.
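The univariate prescreening discussed here, ranking genes by the pooled two-sample t-statistic, can be sketched as follows; to avoid the screening bias, this ranking must be recomputed on each training subset t_{(B_k)} rather than once on the full data. The expression values below are toy numbers.

```python
import math

# A sketch of univariate prescreening: rank genes by the absolute pooled
# two-sample t-statistic and keep the top k. To correct for the screening
# bias, this function must be rerun inside each cross-validation fold.

def pooled_t(x1, x2):
    """Pooled two-sample t-statistic for one gene's levels in the two classes."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))

def top_genes(expr1, expr2, k):
    """expr1[g], expr2[g]: expression levels of gene g in classes 1 and 2."""
    stats = {g: abs(pooled_t(expr1[g], expr2[g])) for g in expr1}
    return sorted(stats, key=stats.get, reverse=True)[:k]
```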
The point to be stressed here is that if only the subset of selected genes is available, then there is no way to correct for the selection bias in working with these "top" genes. Also, it is noted that the above approach of selecting the top genes individually via univariate methods is not the optimal multivariate method of selection for a prediction rule.

Table 2
Cross-validated error rates of SVM with RFE applied to the top 70 genes on n_1 = 44 good-prognosis patients (C_1) and n_2 = 34 poor-prognosis patients (C_2), without and with bias correction for starting with a top subset of the genes

Number of genes | Internal CV error rate | External CV error rate
1  | 0.40 | 0.40
2  | 0.31 | 0.37
4  | 0.24 | 0.40
8  | 0.23 | 0.32
16 | 0.22 | 0.33
32 | 0.19 | 0.35
64 | 0.23 | 0.32
70 | 0.23 | 0.32

8. Another selection bias

The selection biases discussed in the previous work arise due to the fact that some or all of the data used to form the rule are involved in some way with the testing of the rule. One way of avoiding such biases is to use a holdout method. The available data are split into disjoint training and test sets. The prediction rule is formed from the training subset and then assessed on the test set. Clearly, this method is inefficient in its use of the data, particularly in the context of microarray data, where the number of tissue samples is usually limited.

A commonly occurring mistake in analyses of microarray data is to adopt a holdout method, but to use the test set in the selection of the genes. With this approach, the test subset plays a role in leading to the choice of the final form of the prediction rule. However, this is frequently overlooked in the bioinformatics literature (Somorjai et al. [9]). It often leads to claims that a prediction rule can be formed from only a few genes that has an almost zero error rate.
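A sketch of the holdout method used correctly: the split is made first, and gene selection sees only the training portion. Here `select` and `fit` are hypothetical stand-ins for RFE and the SVM fit; letting `select` see the test set is exactly the mistake described above.

```python
import random

# A sketch of a correctly applied holdout method: gene selection and fitting
# use only the training split; the test split is touched once, for assessment.

def holdout_error(data, select, fit, test_fraction=0.3, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_fraction)
    test = [data[j] for j in idx[:n_test]]
    train = [data[j] for j in idx[n_test:]]
    rule = fit(train, select(train))       # selection never sees the test set
    return sum(1 for y, z in test if rule(y) != z) / len(test)

# Toy stand-ins: one gene, nearest-class-mean rule, separable data.
def fit(train, subset):
    g = subset[0]
    m1 = sum(y[g] for y, z in train if z == 1) / sum(1 for _, z in train if z == 1)
    m2 = sum(y[g] for y, z in train if z == 2) / sum(1 for _, z in train if z == 2)
    return lambda y: 1 if abs(y[g] - m1) <= abs(y[g] - m2) else 2

data = ([([v], 1) for v in (0.0, 0.1, 0.2, 0.3, 0.4)]
        + [([v], 2) for v in (1.0, 0.9, 0.8, 0.7, 0.6)])
rate = holdout_error(data, select=lambda tr: [0], fit=fit)
```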
The Netherlands breast cancer data

A nationwide clinical trial, MINDACT (Microarray In Node-negative Disease may Avoid ChemoTherapy), is now under way in the Netherlands, in which the expression profiles for the 70-gene signature proposed by van 't Veer et al. [11] are being collected from all newly identified consenting patients with breast cancer and used as an adjunct to classical clinical staging. It can be seen from Table 1 that the data in van 't Veer et al. [11], consisting of the 78 sporadic breast cancers, are of limited discriminatory value.

Subsequently, van de Vijver et al. [10] considered a larger set of 295 breast cancer tissue samples, which included 61 of the 78 patients in van 't Veer et al. [11]. Each of the 234 new patients, some of whom were lymph-node positive, was followed for at least 5 years or to their censored time; some actually died during the follow-up period. There were 55 patients with censored outcomes (the occurrence or nonoccurrence of metastases within 5 years was not known). For the 179 new patients with known outcomes combined with the 61 from the original study of van 't Veer et al. [11], we fitted a SVM with RFE, starting with all 24,481 genes. The results are displayed in Table 3. The smallest estimated error rate is 0.26 at d = 128 genes. If we use (6.2) to correct for the bias arising from the fact that this is the smallest over a number of subsets, we obtain an estimate of 0.28.

Table 3
Cross-validated error rates starting with all available genes, using predicted class labels for the 61 breast cancer tumours from the van 't Veer et al. [11] study and using the true class labels from actual outcomes

Number of genes   External CV error rate       External CV error rate
                  (predicted class labels)     (true class labels)
       1                  0.42                        0.31
       2                  0.38                        0.31
       4                  0.38                        0.34
       8                  0.31                        0.35
      16                  0.23                        0.33
      32                  0.22                        0.30
      64                  0.19                        0.30
     128                  0.16                        0.26
     256                  0.15                        0.29
     512                  0.14                        0.29
    1024                  0.15                        0.28
    2048                  0.14                        0.27
    4096                  0.14                        0.27
    8192                  0.16                        0.27
   16384                  0.17                        0.29
   24481                  0.18                        0.27

In their study, van de Vijver et al. [10] created two classes corresponding to good and poor prognosis by ignoring the actual outcomes where available and assigning the patients to the two classes on the basis of a rule formed from the gene-expression signatures of these 61 patients. More precisely, each of the 234 new tumours was assigned to class C1 or C2 according as the correlation between the elements of its gene-signature vector and the corresponding elements of y1 was greater or less than 0.4, where y1 is the mean of the gene signatures of those patients from the original 61 in the good-prognosis class C1. Each of the 61 original tumours was reassigned to C1 or C2 according to whether the (cross-validated) correlation between its gene-signature vector and y1 was greater or less than 0.55 (a threshold that resulted in a 10% rate of false negatives in the study of van 't Veer et al. [11]). Thus if we were to use their predicted classification for these data, we would obtain results that are more optimistic than if we used the actual outcomes for the occurrence or nonoccurrence of metastases. To demonstrate this, we have listed in Table 3 the results of fitting a SVM with RFE to the 295 tissue samples with the predicted classification of van de Vijver et al. [10].
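The correlation-based assignment rule described above can be sketched as follows. This is a schematic reimplementation on synthetic vectors, not the authors' code: the function name `assign_class`, the stand-in mean signature `good_mean`, and the synthetic profiles are ours; the default threshold of 0.4 corresponds to the rule applied to the 234 new tumours.

```python
import numpy as np

def assign_class(signature, good_mean, threshold=0.4):
    """Assign a tumour to the good-prognosis class C1 when the correlation
    between its gene-signature vector and the mean good-prognosis signature
    y1 exceeds the threshold, and to the poor-prognosis class C2 otherwise."""
    r = np.corrcoef(signature, good_mean)[0, 1]
    return "C1" if r > threshold else "C2"

rng = np.random.default_rng(0)
good_mean = rng.normal(size=70)                      # stand-in for y1 over the 70 genes
tumour_like = good_mean + 0.1 * rng.normal(size=70)  # profile highly correlated with y1
tumour_unlike = -good_mean                           # perfectly anticorrelated profile

print(assign_class(tumour_like, good_mean))    # correlation near 1 -> C1
print(assign_class(tumour_unlike, good_mean))  # correlation of -1 -> C2
```

For the 61 original tumours, the same function would be applied with `threshold=0.55` to cross-validated correlations, as in the reassignment described above.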
Some idea of the bias involved can be obtained by comparing these results with the corresponding results in this table using the known outcomes for 240 tissue samples. Admittedly, the training set is smaller for the latter, but this would not account for the differences obtained here. Further, van de Vijver et al. [10] used only the "top" 70 genes as identified from the 78 tissue samples in the van 't Veer et al. [11] study. As 61 of these 78 tissue samples are being used in the enlarged set in van de Vijver et al. [10], this is another source of bias.

10. Discussion

In this study, we have described some of the ways in which a selection bias can arise in the formation of a prediction rule using a subset of the available genes, selected in some optimal manner from a much larger set of genes measured on only a limited number of training samples. To illustrate these biases, we considered applications of a nonparametric rule, the support vector machine, formed on subsets of the available genes by using the feature selection procedure known as RFE (recursive feature elimination). We employed the nonparametric method of bias correction, cross-validation, to correct for these selection biases. In some situations, more than one layer of validation must be performed in order to ensure that the error rate is properly validated. Our results underscore the importance of cross-validating at all stages of the procedure used to train the prediction rule.

The methodology is demonstrated on the Netherlands breast cancer data arising from the studies of van 't Veer et al. [11] and van de Vijver et al. [10]. Identifying a gene signature for breast cancer prognosis has been a central goal in these and other microarray studies.
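The central point above, that gene selection must be repeated inside every layer of cross-validation, can be illustrated with a small simulation. The sketch below is ours, not the authors' code: a simple t-like statistic filter and a nearest-centroid classifier stand in for RFE and the SVM, and the data are pure noise, so the true error rate of any rule is 0.5. Internal cross-validation (selection done once on the full data) is optimistically biased; external cross-validation (selection redone within each training fold) is not.

```python
import numpy as np

def select_genes(X, y, d):
    """Rank genes by an absolute two-sample t-like statistic; keep the top d."""
    m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    s = X.std(axis=0) + 1e-8
    return np.argsort(-np.abs(m1 - m0) / s)[:d]

def nearest_centroid_errors(Xtr, ytr, Xte, yte):
    """Count test errors of a nearest-centroid rule trained on (Xtr, ytr)."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return int(np.sum(pred != yte))

def cv_error(X, y, d, external, n_folds=5):
    """Cross-validated error rate. With external=True the gene subset is
    re-selected on each training fold (properly validated); with
    external=False it is selected once on all the data (biased, because the
    test samples have already influenced the selection)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(1).permutation(n), n_folds)
    if not external:
        genes = select_genes(X, y, d)  # selection sees the test samples
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        if external:
            genes = select_genes(X[train_idx], y[train_idx], d)  # fold-wise selection
        errors += nearest_centroid_errors(X[train_idx][:, genes], y[train_idx],
                                          X[test_idx][:, genes], y[test_idx])
    return errors / n

# Pure-noise data: no gene carries any class information.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 2000)), np.repeat([0, 1], 20)
print("internal CV:", cv_error(X, y, d=10, external=False))  # typically well below 0.5
print("external CV:", cv_error(X, y, d=10, external=True))   # close to 0.5
```

The gap between the two estimates on data with no signal at all is the selection bias; only the external estimate is honest, which mirrors the contrast between the internal and external columns of Table 2.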
These data have attracted considerable attention in the bioinformatics and medical literature, as their analyses have raised a number of statistical issues, including aspects of the topic of selection bias as discussed here. In van 't Veer et al. [11], a 70-gene signature (known as the Amsterdam signature) was derived from a cohort of 78 breast cancer patients. Our results here show that the SVM has an accuracy of only 63%. van de Vijver et al. [10] subsequently attempted to provide further validation of the Amsterdam signature by considering a larger set of 295 breast cancer patients. The results reported for the SVM trained on the breast cancers of known outcomes in this expanded study of van de Vijver et al. [10] show that it has an accuracy of approximately 72% in predicting prognosis status. This is better than the 63% obtained on the original set of 78 breast cancer tumours, but is still not high. Clearly, the molecular classification of breast cancer is still a work in progress.

References

[1] Ambroise, C. and McLachlan, G. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 99 6562–6566.
[2] Ein-Dor, L., Kela, I., Getz, G., Givol, D. and Domany, E. (2005). Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics 21 171–178.
[3] Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46 389–422.
[4] Lipshutz, R., Fodor, S., Gingeras, T. and Lockhart, D. (1999). High density synthetic oligonucleotide arrays. Nature Genetics 21 20–24.
[5] McLachlan, G. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York. MR1190469
[6] Michiels, S., Koscielny, S. and Hill, C. (2005).
Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365 488–492.
[7] Quackenbush, J. (2006). Microarray analysis and tumor classification. New England J. Medicine 354 2463–2472.
[8] Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. and Davis, R. (1996). Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 93 10614–10619.
[9] Somorjai, R., Dolenko, B. and Baumgartner, R. (2003). Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: Curses, caveats, cautions. Bioinformatics 19 1484–1491.
[10] van de Vijver, M., He, Y., van 't Veer, L., Dai, H., Hart, A., Voskuil, D., Schreiber, G., Peterse, J., Roberts, C., Marton, M., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E., Friend, S. and Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. New England J. Medicine 347 1999–2009.
[11] van 't Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R. and Friend, S. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 530–536.
[12] Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York. MR1641250
[13] Zhu, X., Ambroise, C. and McLachlan, G. (2006). Selection bias in working with the top genes in supervised classification of tissue samples. Statistical Methodology 3 29–41. MR2224278