Transcription factor binding site prediction with multivariate gene expression data

The Annals of Applie d Statistics 2008, V ol. 2, No. 1, 332–365 DOI: 10.1214 /10.121 4/07-A OAS142 c  Institute of Mathematical Statistics , 2 008 TRANSCRIPTION F A C TOR BINDING SITE PRED ICTION WITH MUL TIV ARIA TE GENE E XPRESSION DA T A By Nancy R. Zhang, Mar y C. Wildermuth 1 and Terence P. Speed 1 Stanfor d University , University of California , Berkeley and University of California , Be rke ley Multi-sample microarra y ex p eriments ha ve b ecome a stand ard exp erimental method for study ing biologica l systems. A frequent goal in such studies is to un ravel the regulatory relationships b etw een genes. Du ring the last few years, regression mo dels ha ve b een p ro- p osed for the de novo discov ery of cis -acting regulatory sequences using ge ne expression data. How ever, when applied to multi-sample exp eriments, existing regression based meth od s mo del each ind ivid- ual sample separately . T o bett er capture the dynamic relationships in multi-sa mple microarray ex p eriments, w e prop ose a ﬂex ib le metho d for the joint mo deling of p romoter sequen ce and multiv ariate expres- sion data. In higher order euk aryotic genomes expression regulation usually inv olves com binatorial interaction b etw een several transcription fac- tors. Experiments ha ve sho wn that spacing betw een transcription fac- tor bin ding sites can signiﬁcan t ly aﬀect their strength in activ ating gene exp ression. W e prop ose an adapt ive mo del building p rocedu re to capture suc h spacing dep endent cis -acting regulatory mo dules. W e apply our meth ods to the analysis of m icroarray time-course exp eriments in yeast and in Arabidopsis. These exp eriments exhibit very diﬀeren t dy namic temp oral relationships. F or b oth data sets, we hav e found all of th e wel l-known cis -acting regulatory elements in the related con text, as w ell as b eing able to predict nov el elements. 1. Problem f orm u lation and review of metho ds. Tw o imp ortan t sources of h igh-throughput data h a ve b ecome av ailable to mo dern b iology: microar- ra ys, whic h allo w measurement of genome-wide gene expression patterns o ver m u ltiple biological conditions, and sequenced genomes, whic h allo w compu- tational searc h of an y sequence pattern of inte rest in any genomic region of Received Ap ril 2007; revised Septem b er 2007. 1 Supp orted b y NSF Arabidopsis 2010 Gran t MCB-04-0267. Key wor ds and phr ases. Multiva riate analysis, linear mo dels, transcription regulation, DNA motifs, gene expression. This is an electronic reprint of the or iginal ar ticle published by the Institute o f Mathema tical Statistics in The Annals of Applie d Statistics , 2008, V ol. 2 , No. 1, 332– 365 . This r eprint diﬀers from the original in pa g ination and typog raphic detail. 1 2 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED in terest. Th is wealt h of resources has triggered attempts to compu tation- ally learn the r egulatory grammar within the p romoters of genes through com bined an alysis of expr ession and p r omoter sequence data . The logic that underlies such eﬀo rts is that gene expression is initiated by the binding of transcription factors to u pstream sequences (called cis -acting regulatory el- emen ts), and thus, gene expression p atterns should b e corr elated with the presence of certain cis -acting regulatory elemen ts in th e promoter sequence. Not long after the adven t of microarra y te c hn ology , gene expr ession data w as used to locate kno wn or putativ e transcription facto r b inding sites (TFBS). Initially , most metho ds w er e based on the presp e ciﬁcation of a set of h yp othesized co- expressed genes, whic h can b e obtai ned through cluster analysis of gene expression proﬁles. Statistical metho ds ha ve b een d ev el- op ed to ﬁnd enr ic hed motifs in the p r omoters of su ch pre-sp e ciﬁed gene sets [Bailey and Elk an ( 1994 ) and Liu, Brutlag and Liu ( 2002 )]. Such methods can mo del degenerate motifs thr ou gh p ositio n sp eciﬁc weig h t matrices or graph-based motif represen tations [F ratkin et al. ( 2006 )]. Mo dels that in- corp orate int eractions b et ween motifs h a ve also b een dev elop ed [Zhou and W ong ( 2004 )]. Ho wev er, all of these metho ds are very sensitiv e to the p re- sp eciﬁed gene list. I t is usu ally unclear ho w to obtain suc h a gene list from microarra y data, b ecause clustering method s are often u nreliable and most genes are n ot part of a tight cluster. Also, these metho ds make use of the expression data only to the ext en t of obtaining the set of h yp othesized coreg- ulated genes, and d o not explicitly mo d el the relationship b etw een promoter sequence con tent and gene expr ession. Regression based appr oac hes ha ve b een prop o sed to more dir ectly mo del the relationship b e t wee n the expression pattern of genes and the rep ertoire of motifs in their pr omoters. Bussemake r, Li and Siggia ( 2001 ) prop osed the follo wing simple linea r mod el b et w een X g ,m , the coun t of a motif m in the promoter of gene g , and Y g , the exp ression of gene g : Y g = X m X g ,m β m + ǫ g , where the su m mation is o ver the set of all motifs that is b eliev ed to con- tribute to the expression of genes in th e sample. Th is basic mo del has since b een exp anded to utilize p osition sp e ciﬁc weig h t m atrices (PS WM) in stead of coun ts [Conlon et al. ( 2003 ) and Das, Nahl ´ e and Zh ang ( 20 06 )], as we ll as to mo del interac tions b et w een motifs that app ea r together in the same pro- moter sequen ce [Das, Nahl´ e and Zhang ( 2004 ) and Keles, V an der Laan and V u lp e ( 2004 )]. Notably , Keles, V an der Laan and V u lp e ( 2004 ) used logic regression to mo del com binatorial motif interact ions and Das, Nahl ´ e and Zhang ( 2004 ) used linear sp lines to ca pture the h yp othesized log-sigmoi dal relationship b etw een transcription resp on s e and motif strength. All of the metho ds s o far mentio n ed ha ve attained some degree of success with data TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 3 from ye ast, b ut with th e exception of Das, Nahl´ e and Z hang ( 2006 ), the applicabilit y of these metho d s in h igher ord er organisms is still u nprov en. A t present, all of the existing regression based metho ds mod el only uni- v ariate expr ession data. F or example, Bussemak er, Li and Siggia ( 2001 ), Das, Nahl ´ e and Zhang ( 2004 ), Keles, V an der Laan and V u lp e ( 2004 ) and Conlon et al. ( 2003 ) all analyzed the yeast cel l cycle time series [Sp ellman et al. ( 1998 )] b y d oing a separate r egression at eac h time p oint. Ho wev er, it is also clear from these studies that, f or the cell cycle exp erimen ts, the kno wn biological motifs ha ve a meaningful time-v arying pattern, while the false p ositiv e motifs ha ve a p attern across time that is n ot recogniza b le or that is typica l of experimental artifacts. Multiv ariate gene expr ession data, measured across diﬀeren t biologic al conditions, times and treatmen ts, con- v ey m u c h m ore information than single c h ip data. Ho wev er, at pr esent there is no clear metho d to com bine information across samples. It is often un- clear, when m ultiple samples are a v aila ble, ho w to quan tify whether a gene’s expression proﬁ le is “i n teresting.” Y et, to capture d ynamic regulatory rela- tionships and to resp ond to the gro wing a v ailabilit y of m ulti-c h ip d ata, it is necessary to com b ine information across samples in mo deling ci s -acting regulatory elemen ts. In this pap er we p resen t a lin ear mo del f or this purp ose. W e also p rop ose in this pap er a new framewo rk for mo d eling interact ions b et w een cis -acting regulatory ele m en ts that tak es in to consideratio n the dis- tance b et wee n the elemen ts in the promoter sequence. T ranscription factors, when b ound to the promoter sequ en ce, interact w ith o ther transcription fac- tors and nuclear p r oteins to initiate or inhibit transcription. T he distance b et w een the bin d ing sites aﬀects the strength of suc h in teractions. F or exam- ple, Rushton et al. ( 2002 ) sho wed using s y nthetic plan t promoters that the distance b etw een the b in ding sites of the WRKY transcrip tion factors causes a pronou n ced diﬀerence in the strength of the mo du le. Another example is giv en in Segal and Berk ( 1991 ), wh ere they sh o wed that in viv o transcription stim ulation by Sp 1 transcrip tion factor bindin g sites in the adeno vir u s type 2 early region 1B p romoter is strongly dep enden t on its distance from the T A T A b ox. In an analysis of 4 y east sp ec ies, Ch iang et al. ( 2003 ) found that pairs of join tly conserved motifs exhibit nonr andom relativ e sp acing. In this pap er we give a compu tational metho d to incorp orate this eﬀect in to motif detection. A critical s tep in all regression based metho d s is the selection of the b est subset of motifs to b e includ ed in the mo d el. The initial set of m otifs under consideration can b e q u ite large, esp ec ially for de novo motif ﬁn ding. Most previous methods [Bussemak er, Li and Siggia ( 20 01 ), Conlon et al. ( 2003 ), Keles, V an der Laan and V ulp e ( 2004 )] resorted to simple stepwise v ariable addition to searc h the space. Das, Banerjee and Zhang ( 200 4 ), Das, Nahl ´ e and Zhang ( 2006 ), whic h u ses th e multiv ariate adaptive r egression sp lines 4 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED (MARS) algorithm [F riedman ( 1991 )], used a forw ard step wise searc h f ol- lo we d by mo d el pru ning. I n this pap er w e examine t w o diﬀeren t mo del se- lection metho d s with the aim of obtaining a simp le, in terp r etable mo del. W e use a p erm utation study to ev aluate the eﬀectiv eness of our mo d el se- lection pro cedure and to estimate the exp e cted num b e r of false p ositiv es. Most previous studies on this problem h a ve n ot done su ch sp eciﬁcit y ev alu- ations, w hic h we b eliev e to b e critical to th e un d erstanding and assessmen t of mo dels. The pap er is organized as f ollo ws: T o m otiv ate our metho d s , we b egin b y describing tw o diﬀerent multi-sample gene expression exp erimen ts in Section 2 . W e present our mo del and metho d s in Section 3 . Metho ds of statistical v alidation are imp ortant for interpretation of the results, and thus, in Section 4 we describ e three v alidation metho ds, includ in g a new appr oac h based on ﬂanking sequence inform ation conten t that h as nev er b een applied in pr evi- ous pu blished studies. In S ection 5 w e pr esen t the results for the exp erimen ts describ ed in Section 2 . Finally , w e conclude with a few remarks in Section 6 . 2. Description of exp erimen ts. W e b egin by describing t w o m ulti-sample gene expression exp erimen ts w h ic h will b e used to illustrate ou r metho ds . Both are time-course exp erim ents. The ﬁ rst is a p erio dic time course rep- resen ting the self-regenerating pro cess of ce llular gro wth and division. The second is a non-p erio d ic time course r ep resen ting an organism’s r esp onse to a biotic stim u lus (a pathogen). 2.1. Y e ast c el l cycle. The cell cycle is a tigh tly co ordin ated set of pro- cesses by w hic h cells gro w, replicate their DNA, segregate their c hromo- somes, and divide into daught er cells. Chec kp o in ts du ring the cycle en- sure that at sp ec iﬁ c time p oints, sp eciﬁc pro ce sses must hav e b een com- pleted. Such co ordination requires a complex netw ork of regulatory r ela- tionships. W e app ly our method s to learn these relationships from a set of gene expression m easur emen ts captured at time-p oin ts during the cell cy- cle. The data set we u se comes from the α -factor syn c hr onized cu ltu r es from Sp ellman et al. ( 1998 ), and can b e do wnloaded from the w ebsite http://c ellcycle - w ww.stanford .edu . The samples of the α -arrest exp erimen t are tak en at 7 minute in terv als after sync hronization of the cell culture. There are a total of 18 samples, spanning t wo cell cycles. Thus, many genes related to the cell cycle ha ve a p eriodic expression with tw o p eriods captured in this data set. Muc h is kno wn about the reg ulatory mec h an ism s inv olv ed, and a list of the transcrip- tion factors that is kn own to pla y a crucial role can b e foun d in Sp ellman et al. ( 1998 ). F or th e y east an alyses, we included 1600 genes: 800 deﬁn ed b y Sp ellman as cel l-cycle related genes [S p ellman et al. ( 1998 )] and 800 genes TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 5 selected at random from genes that did n ot exhib it a cell-cycle related pat- tern. In th is wa y we are able to determine whether cis -acting regulatory elemen ts asso ciated w ith sp ec iﬁc gene expression patterns emerge f r om a mixed dataset. Details ab out data pre-pro cessing are giv en in Section A.1 . 2.2. Systemic ac quir e d r esp onse in plants. Plant are commonly exp osed to bacterial, fu n gal and viral pathogens. In resp o nse to certain pathogens, plan ts activ ate defensive resp onses r esulting in enhanced resistance to sub- sequen t pathogen exp osure. This acquired r esistance resp onse can b e b oth lo cal and s y s temic (o ccur ring in unin fected parts of the plan t). Th e small molecule 2-h yd ro xyb enzoic, commonly kn o wn as salicylic acid (SA), is r e- quired for this lo cal and systemic resistance resp onse. Wilderm u th et al. ( 2001 ) sh o wed that this SA is synthesized via iso chorismate syn thase (A t- ICS1) in Arabidopsis thaliana. Null ics 1 mutan t plan ts do not syn thesize SA in resp onse to pathoge n, are more su sceptible to pathogens and compro- mised in lo c al and systemic acquired resistance (SAR) resp ons es. T o inv estigate the sp eciﬁc comp onen ts and p r o cesses of p lant defense mediated by SA, Wildermuth et al. ( 2 007 ) infected w ild typ e and ics 1 m u- tan t Arabidopsis plant s with the p owdery mildew fungal pathoge n Golo vi- nom yces oron tii. In this replicated time-course exp erimen t, samp les w ere harv ested at 0 (just prior to infection), 6, 24, 48, 72, 120 and 168 hr s p ost infection (hpi). This timing fo cuses on the pr ogressiv e gro wth and repr o- duction of the fu n gus. Analysis of this A TH1 Aﬀymetrix dataset in d icates that th e ma jorit y of genes with a s igniﬁ can t diﬀerence in expression in th e ics 1 m utant compared with w ild t yp e exhibit an increase in expression in re- sp onse to the p o w dery mildew in wild t yp e plants with ab olished or r educed expression in the m u tan t. A num b er of these are kn o wn to act do w nstream of SA in SAR. In addition, there are genes that exhibit enh anced expression in the ics 1 m utan t vs. wild t yp e in resp on s e to p o wder y mildew. In our anal- ysis of this data, w e w ould lik e to capture this dynamic time-relationship in ﬁ n ding regulatory TFBS related to SA-mediated SAR. Th e details of the regulatory net w orks mediating SAR are an area of activ e in vesti gation. Binding sites for a few k ey transcription factors inv olv ed in plan t defense ha ve b een determined; ho w ever, m u c h is still u n kno w n. F or the Arabid opsis analysis, w e included the top 1500 genes that exhibited diﬀeren tial expres- sion b et wee n the wild t y p e and ics 1 mutan t in resp onse to pathogen and the b ottom 1500 genes exhibiting no diﬀerence in expression b et ween w ild type and mutan t. T he inclusion of an equal subset of genes with unaltered expr es- sion serv es as a control , allo wing us to determine if we are able to identify genes sp eciﬁcally asso ciated w ith altered expression p atterns. Details of the data pre-pro cessing are giv en in Sectio n A.1 . 6 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED Fig. 1. Flow diagr am of the analysis pr o c ess. We start with the multivariate expr ession data for a ﬁlter e d set of genes and the c ounts of al l nonde gener ate motifs of pr e-sp e ciﬁe d lengths in the upstr e am pr omoter se quenc e of these genes. Line ar c ontr asts u j of the ex- pr ession data ar e use d to build the dictionary (step A ), and subse quently buil d the mo del (step B ). The mo del is then prune d (step C ) for p arsimony. T he steps lab ele d A , B and C ar e detaile d i n Se ctions 3.2 , 3. 3 and 3.4 , r esp e ctively. 3. Mo del an d metho d. Figure 1 sho w s a ﬂow-diagram of our prop osed metho d. T he main steps of the analysis follo w: (A) Dictionary construction, (B) Adaptive d istance-based m o del build ing, and (C) Mod el pr u ning. W e start in S ection 3.1 with a description of our mo del and loss fun ction, whic h will b e in tegral to steps (B–C). 3.1. M o del and loss f u nction sp e ciﬁc ation . The observ ed data are Y def = { Y g ,t : 1 ≤ g ≤ G, 1 ≤ t ≤ T } , where Y g ,t is the log 2 expression inte nsit y for gene g in sample t , and S def = { S g ,i : 1 ≤ g ≤ G, 1 ≤ i ≤ R } , wh ere S g ,i is the DNA base at p osition − i from the start of gene g in the 5 ′ to 3 ′ template DNA str an d , w h ic h we refer to as gene g ’s pr omoter. More sp eci ﬁc d eﬁ nitions of the promoter sequence of a ge ne are giv en in App endix A.1 . W e denote the expression and promoter sequence of a gene g r esp ectiv ely by Y g = { Y g , 1 , . . . , Y g ,T } and S g = ( S g , 1 , . . . , S g ,R ). TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 7 Our mo del assumes th at Y g is a sum of tw o orthogonal comp onents: Y g = Z g + ε g , (1) where Z g is the signal that the exp erimen t is designed to measur e, and ε g is the error comp onent due to tec h n ical or b iologica l noise. Note that ε g ma y b e systematic error with a biological ly meaningful trend in t , but that is not of interest in the con text of the give n exp erimen t. By orth ogonalit y of ε g and Z g , we mean th at the co v ariance matrix C ov[ Z g , ε g ] = 0 . W e mo d el the dep en d ence of Y g on S g through the signal comp on ent Z g . Sp eciﬁcally , let Z g reside in a d dimensional linear subspace of R T , with linear decomp osition Z g = d X j =1 u j,g v j (2) on basis v ectors v 1 , . . . , v d . W e assu me the f ollo wing lin ear mo d el: u j,g = β 0 ,j + X e ∈E β j ( e ) X g ( e ) + ǫ j,g , j = 1 , . . . , d, (3) where E is a set of promoter elemen ts. Section 3.3 deﬁnes the concept of promoter elemen t in detail, while h er e w e simply describ e it in tuitiv ely as a collect ion of DNA letters that satisfy a certain arrangemen t. The v ariables X g ( e ), deﬁned in Section 3.3 , quan tify the pr esence of promoter elemen t e in S g . W e assum e th at the errors ǫ j,g are indep endent across g and j with mean 0 and equal v ariance. Note that in the mo del deﬁned in ( 3 ), the set of p r omoter elements E is common to all b asis functions { v j : j = 1 , . . . , d } . Alternativ ely , one could ﬁt a separate model with a diﬀeren t set E for eac h basis function. T h is w ould b e meaningful if the basis functions were con trasts that individ u ally represent a quantit y of interest. Ho we v er, with mo del ( 3 ), we are looking for a set of pr omoter elemen ts that play a signiﬁcan t r ole in the systemic con text of the exp eriment (represen ted by the signal comp onent Z g ), without b ei ng sp eciﬁc to the c hoice of basis. Th e goal is then to c ho ose E and p arameters β E = { β j ( e ) : e ∈ E , 1 ≤ j ≤ d } that are the m ost eﬀectiv e in explaining the v ariance in Z g . F or eac h gene g , assume that we ha v e the ﬁtted v alues { ˆ u j,g : j = 1 , . . . , d } , wh ic h gives us ˆ Z g = P d j =1 ˆ u j,g v j , the ﬁtted v alue for Z g . The loss function is d eﬁned as L ( E , β E ) = G X g =1 k ˆ Z g − Z g k 2 . (4) If E , the set of promoter elemen ts in the m o del, were ﬁxed, th e goal would b e to ﬁnd β E that maximizes ( 4 ). How ev er, for v arying E , the loss function 8 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED w ould of course need to b e p en alized f or the degrees of freedom of the mod el sp eciﬁed by E . S ection 3.4 d iscusses p ossible p enalt y functions. In pr actice, one w ould need to sp ecify the basis fu nctions { v j } of the signal comp onent Z g . This can sometimes b e done a priori by using con textual kno wledge of the exp er im ent. F or example, for case-con trol exp erimen ts, v j can b e set to the con trasts of interest. Ho wev er, for the t wo exp eriments that w e study , there are no ob vious con trasts. F or the y east cell cyc le exp erimen t, an a priori meaningfu l set of basis could b e orthogonal p eriodic functions with peaks at diﬀeren t phases of the cycle. F or the p owdery mildew in f ection exp eriment, a goo d c h oice of basis is ev en hard er to ﬁnd without examination of the d ata. A go o d method for c ho osing { v j } is to use the principal comp onents of Y , whic h pro v ed eﬀect iv e on b oth the y east and the Arabid op s is exp erimen ts. With some o verlap with previous notation, consid er the singu lar v alue de- comp osition Y = UΛV ′ , where we let U b e the matrix of score vecto r s [ u 1 , . . . , u T ], Λ b e the diagonal matrix with diagonal elemen ts [ λ 1 , λ 2 , . . . , λ T ], and V = [ v 1 , . . . , v T ] b e the T × T square matrix that con tains the loadings o f the T pr incipal comp onent v ectors. O ften, a visu al examination of th e load- ings can yield meaningfu l linear pro jectio n s of the d ata, and goo d c h oices for { v j } . Using con textual kno w ledge ab out the exp erimen t, one ma y b e able to iden tify a su bset A ⊆ { 1 , 2 , . . . , T } of p r incipal comp o nen ts v ectors to serve as the b asis. Other w ise, one can c h o ose the top few p rincipal comp onent s using a scree plot. With the b asis v ectors c hosen using p rincipal comp onents, the loss function ( 4 ) w ould then b e equ iv alen t to the we igh ted s um of losses o ve r the p rincipal comp onen t scores: L A ( E , β E ) = X j ∈A λ 2 j k u j − β j X E k 2 , (5) where β j = ( β j ( e ) : e ∈ E ). Note that the weig ht of eac h comp onent in the ab o ve loss function is exac tly the v ariance of the data alo n g th at comp onent. If A we re chosen to b e the en tire set of all T principal comp onents, then the minim u m of L A ( E , β E ) o ver β E w ould b e equiv alen t to the naive u n - w eighte d sum of th e squ ared error losses from a s eparate regression f or eac h sample: min β E L A ( E , β E ) = T X j =1 min β j ( E ) k Y · ,j − β j ( E ) X . , E k 2 . In mo del ( 1 ), this w ould b e equiv alen t to assumin g ε g = 0 for all genes g . By selecting A to b e a p rop er subset of { 1 , . . . , T } , we are p e rforming we igh ted m u ltiv ariate resp onse regression on a reduced d im en sional subspace of Y . F or time-course exp eriments, this metho d ﬁnds motifs that most signiﬁ- can tly aﬀe ct the shap e of the time course. In h igh d imensions man y sh ap es TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 9 are p ossible. W e f o cus on th e shap es that are most meaningful to th e ex- p eriment, or that are the most eﬀectiv e in explaining the data, or b oth. The mo d el gains p o w er o ve r p revious metho d s by combining information across time-p oin ts to capture the dynamic relationships that can only b e observ ed across samples. In particular, this multi v ariate cross-sample ap- proac h is pr eferable to the single-sample app roac h for ﬁn ding motifs related to path wa ys that hav e distinct time-related patterns of activit y . The prin cipal comp onents approac h give s an automatic, data-based m etho d of weig hting the residual sum-of-squares wh en m ultiple basis v ectors (com- p onents) are c hosen. In b ot h the y east cell cycle and the Arabidopsis ex- p eriments, w e f ou n d more th an on e meaningful orth ogonal pro jection of th e data. On e could, in principle, apply the method s describ ed in the next few sections to eac h pro jection separately , obtaining separate sets of motifs E . Ho we v er, with the weig h ted loss f unction ( 4 ), we hop e to capture motifs that may not b e strongly correlated with any sin gle pro jection, bu t that are w eakly correlated with man y pro jections. Another b eneﬁt of com bined analysis of m ultiple orthogonal pro jections is in th e detectio n of in teractio ns b et w een TFBS. T ranscrip tion factors that are activ e in diﬀeren t path w ays ma y interact to assume a new role. Such in teractions may b e lost if one limits the analysis to one-dimensional pro jections. When there are multiple b asis v ectors, what determines wh ic h motifs get added to the mod el? Some insigh ts can b e ga in ed from examining the simple case wh ere E is a sin gleton { e } . Th en, the wei gh ted loss ( 5 ) h as a simple represent ation L A ( e, β e ) = X j ∈A λ 2 j (1 − ρ 2 j,e ) , where ρ j,e = corr[ u j , X ( e )]. Thus, the promoter element e that has the max- im u m weig hted sum of squared correlation with the comp onents in A would b e added to the mod el ﬁr s t, with the w eigh ts b eing the prop ortion of v ari- ance explained b y th at comp onen t. In the follo wing sections we will assu me that the basic v ectors v = { v 1 , . . . , v d } in ( 2 ) ha v e b een c hosen, either through prin cipal comp on ents or other meth- o ds. W e will u se th e notation λ j = k Y v j k 2 , u j = Y v j / q λ j . 3.2. Step A . Dictionar y c onstruction. W e c h ose to represen t motifs using nondegenerate w ords . In the organisms that we stu d y , a w ord and its rev erse complemen t repr esen t the same biological motif. This is b eca use DNA is double stranded, and the app ea rance of the rev erse complemen t of the w ord in the 5 ′ to 3 ′ template strand is equiv alent to the app ea rance of the word in 5 ′ to 3 ′ orien tation in the co d in g strand. F or example, in the diagram b elo w, 10 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED the top strand is the template strand, the b otto m is th e co ding stran d , and TTGAC and GTCAA r epresen t the same b iologic al motif presen ted 5 ′ to 3 ′ : 5’...TTG AC...3’ 3’...AAC TG...5’. W e start our analysis with the set of all u nique deterministic biological motifs of a p re-c hosen length L . T he size of this set is [4 L − 4 L/ 2 ] / 2 + 4 L/ 2 if L is ev en, or 4 L / 2 , if L is o d d. The ab o ve en um eration coun ts eac h w ord and its rev erse complement only once. In the dictionary construction step, this initial set of words is reduced to a m u c h smaller set f or subs equ en t mo d el building. Although d ictionary construction has b een viewe d in su c h studies as a cru de p re-ﬁltering step to reduce th e size of the mo del search space, it is v er y imp ortan t to the ensuing analysis. F or a m otif to b e selected in the ﬁ nal mo del, it must ﬁ rst b e in clud ed in the dictionary . Therefore, the dictionary m u st p ro vid e a r ic h enough starting s et of motifs, while at the same time reducing the set of initial exhaustiv e list of w ord s to a more computationally manageable set. Here we c hose to represent motifs as nondegenerate w ord s. W e also tried represent ing motifs as “consensus” sequ ences b y includ in g nond egenerate core letters with degenerate outermost letters. Using a 4 letter core, we allo w ed the outermost 2 (or 4) letters to v ary . This “consensus” sequen ce approac h red uced p erformance as it drastically increased the initial dictio- nary aggra v ating the multiple testing problem and increased d ep endency b et w een the motifs. On e could also u se our approac h to identi fy known TFBS PS WMs by setting the dictionary to a kno w n collection of PSWMs, as in Conlon et al. ( 2003 ) and Da s, Nahl ´ e and Zhang ( 2006 ). Ho wev er, the a v ailabilit y of PSWMs is limited for man y organisms, includin g Arabidopsis. This is why we sp ec iﬁcally dev elop ed a metho d that allo ws one to identify kno wn and n o vel motifs w ithout p rior kno wledge. Let D (0) b e the set of all words of length within a pre-c hosen range. F or eac h w ∈ D (0) , let X g ( w ) b e th e coun t of the n u m b er of o ccurr en ces of w in S g . Let X ( w ) = [ X 1 ( w ) , . . . , X G ( w )] ′ . F or any set Γ of motifs, we will denote b y X (Γ) = [ 1 [ X ( w )] w ∈ Γ ], with the v ector of ones alw a ys included in the mo del matrix as the in tercept term. W e s tart b y constructing a smaller dictionary D ( j ) for eac h c h osen basis v ector v j : 1. Let m = M , w here M is a pr e-c hosen v alue. Let D ( j ) = ∅ . 2. Rep ea t un til m = 0: (a) C ompute ξ ( w ) ← X ( w ) − X ( D ( j ) )[ X ( D ( j ) ) ′ X ( D ( j ) )] − 1 X ( D ( j ) ) ′ X ( w ) , w ∈ D (0) L , r ← u j − X ( D ( j ) )[ X ( D ( j ) ) ′ X ( D ( j ) )] − 1 X ( D ( j ) ) ′ u j . TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 11 (b) F or eac h w ∈ D (0) L , let p w b e the t-test p-v alue for the univ ariate regression of r on ξ ( w ) . Add the m w ords with smallest p w to D ( j ) . (c) L et m = ⌊ m/ 2 ⌋ . This greedy step wise ﬁltering approac h adds to th e dictionary not only those w ord s th at are highly correlated with u j , but also those w ords that are highly correlated w ith u j after accoun ting for the aﬀects of the previously added w ord s. T ranscription factor bindin g s ites are usually degenerate , and thus, the w ord s with the smallest p w are often v ariations of the same TFBS. There- fore, at eac h step in the ab o ve algorithm, the set of words that are add ed to D ( j ) is usually swamp ed by o ve rlapping w ords represen ting a single T FBS, th u s red u cing its richness. Since there is a high correlation among these o ve rlapping words, the s tep wise ﬁ ltering app r oac h mitigates this p roblem. F or the ﬁn al dictionary , we set D = S d j =1 D ( j ) . 3.3. Step B . A daptive distanc e-b ase d motif mo deling. Let D b e the dic- tionary of motifs constructed using the method describ ed in Section 3.2 . Let ∆ = { δ i } r i =1 , where δ i ∈ Z + , b e a set of p ossible ranges of interactio n s. W e deﬁne a promoter elemen t to b e either a word fr om D , or an inte raction of the form ( e 1 , e 2 , δ ), where e 1 , e 2 are themselv es promoter elemen ts, and δ ∈ ∆ . W e call elemen ts that are words fr om D simple , an d elemen ts that con tain interact ions c omp osite . Let e b e a promoter elemen t. If e is s imple and of length l , then for an y gene g , we deﬁne the lo cations of e in S g as A g ( e ) = { i : S g ,i : i + l − 1 = e } . If e is comp osite, then its lo cations are d eﬁ ned recur siv ely by A g ( e ) =  i + j 2 : i ∈ A g ( e 1 ) , j ∈ A g ( e 2 ) , | j − i | ≤ δ  , where e 1 and e 2 can b e either simple or comp osite. W e d enote b y I ( · ) the indicator function for the eve n t in its argum en t. Then , w e deﬁn e the v ariables X g ( e ), whic h are used as co v ariates in the mo del ( 3 ), as f ollo ws: X g ( e ) =  | A g ( e ) | , e simple; I ( | A g ( e ) | ≥ 1) , e comp osite. (6) In w ord s, if e were a simple elemen t, th en X g ( e ) would simply b e its count in the promoter of g , as done in Bussemaker, Li and S iggia ( 2001 ). If e w ere comp osite, then X g ( e ) would b e an in d icator of whether it exists in the promoter of g . W e deﬁne the order of a pr omoter element to b e the num b er of in teractions it con tains: order( e ) =  0 , e simp le; order( e 1 ) + order( e 2 ) + 1 , e comp osite. W e denote X ( e ) = [ X 1 ( e ) , . . . , X G ( e )] ′ . 12 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED T o learn the mo d el deﬁ ned in S ection 3.1 , w e need to build the set E and estimate the parameters β . T he metho d we prop ose is a ste p w ise pro cedure, where at eac h step w e searc h o ve r a v ariable p o ol V for the m inimizer of the loss function ( 4 ), and add it to the mo del. V is initialized to con tain all elemen ts in the d ictionary D . T he mo del is initialize d to con tain only the inte rcept term { β 0 ,j : j = 1 , . . . , d } . With eac h addition of an element form the v ariable p o ol to the mo d el, its inte ractions with all elements in the dictionary , at all d istances in ∆, are add ed to the v ariable p o ol. Th us, the v ariable p ool adaptiv ely expands with the model. The algorithm is d escrib ed in detail b elo w : 1. Initialize V = D , ˜ E = ∅ , 2. Rep ea t un til | ˜ E | = M : (a) C ompute r j ← r j − X ( ˜ E )[ X ( ˜ E ) ′ X ( ˜ E )] − 1 X ( ˜ E ) ′ r j , j = 1 , . . . , d ; ξ ( e ) ← X ( e ) − X ( ˜ E )[ X ( ˜ E ) ′ X ( ˜ E )] − 1 X ( ˜ E ) ′ X ( e ) , e ∈ V , where X ( ˜ E ) is a matrix con taining co lumns { X ( e ) : e ∈ ˜ E } . (b) S elect e ∗ ∈ V by the criterion e ∗ = arg min e ∈ V d X j =1 λ 2 j min β k r j − β ξ ( e ) k 2 . (c) ˜ E ← ˜ E ∪ { e ∗ } . (d) if order( e ∗ ) < o max , V ← V ∪ { ( e ∗ , w, δ ) : w ∈ D , δ ∈ ∆ } . The ab ov e algorithm requires t wo tuning parameters: M , the maxim u m size of the mo del, and o max , the maxim um order of in teractions. Limiting the maxim um order of int eractions u sing o max is an eﬀectiv e wa y of restrict- ing the gro w th of V . Since with eac h addition of an elemen t to the mod el, C = |D | × | ∆ | elements are add ed to the v ariable space, with o max = ∞ , the size of the v ariable space w ould b e | V | = C ˜ E + |D | at eac h up dating step. Ev en though the v ariable sp ace increases linearly w ith the mo del size, com- putational cost is still considerable for large C . F urthermore, multiple testing problems can b ecome quite severe eve n with linear gro wth of | V | . W e foun d that the algorithm works well w h en ∆ is set to a small set of in tegers u sing prior knowledge ab o ut th e r anges of diﬀeren t t yp es of b iologic al interac - tions, and o max is set to 2 or 3. The setting of ∆ is organism-dep enden t. W e c hose the v alues ∆ = { 30 , 100 , 400 , 1000 } for y east and ∆ = { 50 , 200 , 100 0 } for Arabidopsis. TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 13 3.4. Step C . Mo del pruning. The step wise pro cedure describ ed in Step B results in a list of selecte d v ariables ˜ E of size M . Th e goal of the pruning step is to eliminate some of the “false p ositiv es” in ˜ E through backw ard deletion of v ariables. Selecting the b est o verall mod el d ep ends up on the c hoice of a lac k-of-ﬁt fu nction lof ( · ), whic h we describ e in more detail b elo w. First, we giv e th e algorithm for b ac kwa rd deletion: 1. Let E M = ˜ E . 2. F or m = M − 1 , . . . , 1, do (a) F or e ∈ E m +1 , obtain m o del E m ( e ) by r emoving e from E m +1 . Compute lof [ E m ( e )]. (b) Let e ∗ = arg min j lof [ E m ( e )]. (c) L et E m = E m ( e ∗ ), lof m = lof [ E m ]. 3. Pic k m ∗ = arg min m lof m , and let E = E m ∗ . W e explored tw o d iﬀerent metho ds for assessing a mo d el’s lac k of ﬁt. Th e ﬁrst is a wei gh ted generalized cross v alidation error ( wGCV ). F or a giv en mo del E , let d reg ( E ) b e the n u m b er of “regular” parameters in E , whic h is equiv alen t to |E | + 1. Let d knot ( E ) be the num b e r of “knot” parameters, whic h is equal to the total num b er of distinct in teractions in the mo del. T h en, wGCV is deﬁn ed as the w eigh ted sum of the GCV o ver eac h comp onent: wGCV ( E ) = d X j =1 λ 2 j RSS j / [1 − d ( E ) /G ] 2 , where d ( E ) = d reg ( E ) + γ d knot ( E ) and RSS j ( E ) = k u j − ˆ u j ( E ) k 2 is the residual sum of squares of the least-squares ﬁt of the mo d el E . I n using wGCV , one needs to c ho o se the sm o othing parameter γ . F riedman ( 1991 ) discussed approac hes for c ho osin g this parameter for MARS, which is an adaptive mo del that also con tains irregular knot parameters, and sug- gested using a ﬁ xed v alue of γ = 2 or a data adaptiv e v alue c h osen through cross-v alidatio n. In our mo del, γ represents th e degrees of freedom of in- teractions of the form e 1 , 2 = ( e 1 , e 2 , δ ). Th e c hoice of γ needs to accoun t for the maximization of the parameter δ o ver the set ∆ . F or e ∈ E , let τ e = P G g =1 I ( { X g ( e ) > 0 } ) b e the n umb er of genes that conta in at least one instance of e in its promoter region. W e use an adaptiv e rule of selecting a diﬀeren t γ = γ ( e 1 , 2 ) for eac h inte raction term e 1 , 2 through the f ollo wing form u la: γ ( e 1 , 2 ) = 2[log R + log τ e + log( G − τ e ) − log G ] / log G. (7) 14 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED The in tuition for th e formula comes from the deriv ation of a mo d iﬁed Ba y es information criterion for the mo del, describ ed in App endix A.2 . The s econd lac k-of-ﬁt criterion that we examine is a w eigh ted version of the mo diﬁed Ba y es information criterion (wmBIC) giv en in Zhang and Sieg- m u nd ( 2007 ), whic h has the form wmBIC ( E ) = P d j =1 λ 2 j mBIC j ( E ) , where mBIC j for eac h comp onent j is d eﬁned as mBIC j ( E ) def = 1 2 ( G − d reg ( E ) + 1) log RSS j ( E 0 ) RSS j ( E ) + log Γ[( G − d reg ( E ) + 1) / 2] Γ[( G + 1) / 2] (8) + 1 2 d reg log RSS ( E 0 ) − X e ∈E log τ e + log G − d knot log R . In the ab o ve formula, E 0 is the mo del with only the in tercept term. F or deriv ation of this formula, see App endix A.2 an d Z hang ( 2005 ). The aim of wGCV is to r educe prediction error, with mo del p arsimon y as a s econdary concern. In con trast, wmBIC is d eriv ed u nder the Ba y esian framew ork of m aximizing p osterior mo d el probabilit y instead of prediction accuracy . Hence, wmBIC , with a log n p enalt y for eac h degree of freedom, fa vors smaller mo dels. In S ection 4.1 we see that wmBIC ind eed selects a m u c h smalle r set of motifs than wGCV . 4. Metho ds of v alidation. As with all studies of this t yp e, there is no sin - gle ob jectiv e measure of p erformance. Exp e rimen tal v alidation is th e gold standard that is also h ard to come b y . In the absence of exp erimen tal v ali- dation, pr evious studies ha ve r elied on anecdotal evidence from existing lit- erature, and s ome ha v e used pr ediction error as a measure of p erformance. Ho we v er, prediction er r or do e s not add to one’s understand ing of the mo del or interpretation of the results. F or this reason, w e emplo y three additional v alidation appr oac hes, the second, b ased on ﬂ anking sequence analysis, is a new metho d. A third metho d th at we used to v alidate our results is gene list enric h men t. If the motifs w ere true, th en th e set of genes that h a ve that motif sh ould b e enric hed with genes that are known to b e r elated to the exp erimen t. In Section 4.3 w e describe the metho d w e used for gene list enric hment analysis. 4.1. Pe rmutation analysis. P erm u tation analysis allo ws us to compare results obtained u sing the r eal d ata with that p erformed on the randomly decoupled real d ata in whic h the genes’ promoters are decoupled from their expression patterns and then re-asso ciated at random. The p erm utation pr o cedure that we use is as follo ws . Let π = ( π 1 , . . . , π n ) b e a random p e rm utation of (1 , . . . , G ). Pa ir the expression vecto r Y g of TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 15 gene g with the promoter sequence S π g of gene π g . W e call su ch a d ata set a randomly decoupled data set. The en tire p ro cedure detailed in Figure 1 (i.e., S teps A–C) is p erf orm ed on this randomly decoupled data set. Th e lof curv es f rom N rand omly decoupled data sets is compared to the lof curv e of the r eal d ata set. In Section 5.2 we show the results of applying this pro cedu r e to the Ara- bidopsis p o wdery mildew exp eriment. W e ob tained b etter resu lts in y east, whic h is a simpler organism with a higher signal to n oise ratio. 4.2. Flanking se qu enc es. An in dep end en t source of v alidation comes from the ﬂanking sequences. If a promoter elemen t found b y our metho d w ere noise instead of signal, then the ﬂanking s equences should n ot b e an y dif- feren t from the bac kground sequence. Ho we v er, if the motif w ere in fact a real TFBS, then the ﬂanking sequences ma y ha ve a distribution with lo wer en tropy that is diﬀerent from the bac kgroun d sequ en ce. This is b ecause the promoter elemen ts are comp osed of short words with wh ic h we hop e to cap- ture only the core consen s us sequence, and bind ing sites often exte nd b ey ond the core sequence. This is esp e cially true for transcription factors that b ind o ve r-represent ed sequ en ces in the promoters of a particular genome, h a ve binding sites with highly degenerate core sequences, or are members of large transcription f actor families that bind a common core sequence (as is the case for man y Arabidopsis transcription factors). Therefore, if the ﬂanking sequences of a motif hav e lo w er en tropy than their bac kground sequences, then this is indep end en t evidence that the motif is b iologica lly signiﬁcant. Let the wo rd w b e a comp onent of th e promoter elemen t e . W e ﬁnd all lo cations of w in S that app ear as a comp onent of some instance of e . Let L e,w b e the set of length 2 L ﬂanking sequences ( L bases on eac h side) of these app earances of w . Then, align the sequences in L e,w to form the matrix M e,w , where M e,w ( i, a ) = P l ∈L e,w I ( { l i = a } ) . Also compute the b ackg round base frequencies { π e ( a ) : a ∈ A} for promoters of genes in th e set { g : X g ( e ) > 0) } . The sequence information con tent I seq of M e,w is d eﬁned as I seq = 2 L X j =1 A X i =1 M e,w ( i, j ) log M e,w ( i, j ) π e ( i ) . The s tatistical signiﬁcance of a giv en v alue of I seq dep end s on the length of the ﬂ anking s equence L and N ( p, w ), whic h is the num b er of instances of w that app ear in the conte xt of e . W e will u se the large deviatio n s metho d d ev elop ed by Hertz and Stormo ( 1999 ) for computing the p -v alue of N ( p, w ) I seq . 16 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED 4.3. Gene list enrichment. Gene list enrichmen t is our ﬁ nal metho d of v alidation. If the motifs iden tiﬁed we re biologically real, then the set o f genes that hav e that motif sh ould b e enr ic hed in the half of the dataset exhibiting a diﬀeren tial pattern of expression compared with the control half of the dataset that did not exhibit a p r o cess-asso ciated pattern. Gene O n tology (GO) analysis is a p op u lar metho d of s tatistica l v alidation. If the set of genes that cont ain a m otif is enric hed for genes that b el ong to a GO-category that is related to the exp erimen t, then this is evidence that the motifs are biologically meaningful. Ho we v er, experiments usually p erturb man y pathw a ys that relate to eac h other in a complex wa y . The genes in these path wa ys often b elo ng to d iﬀeren t GO categories, b ut ma y share common regulatory mec hanisms . In addition, f or some organisms, pro c ess-sp eciﬁc GO annotation is limited. T o get around b oth of these issu es, w e use gene list enr ic hment as a v alidation metho d. If one has a list of genes w h ose expression is signiﬁcant ly imp acted in the exp eriment (as we do), one can v alidate a giv en identi ﬁed motif based on the probabilit y of it b eing en r ic hed in the pr omoters of genes wh ose expr ession w as signiﬁcan tly impacted in the exp eriment compared with the entire d ata set. Let N b e the total n umb er of genes used in the study , of whic h M b elong to this list that con tains all genes wh ose exp ression change s signiﬁcant ly in the exp eriment. Let τ e b e the n u m b er of genes that con tain promoter elemen t e , out of which m e app ears in the list. The p-v alue for an observed v alue of m e , based on the Fisher’s exact test [Fisher ( 1922 )], can b e computed. A small p -v alue for this test is evidence that the elemen t e is a tru e r egulatory element. 5. Results. 5.1. Y e ast. W e used the list of 1600 selec ted ge nes d escrib ed in Section ( 2.1 ) for our a nalyses. The principal components of the α -arrest exp eriments p erformed on these 1600 genes are giv en in Supp lemen tary Figure 1 (the ﬁr st 3 p rincipal comp onen ts are shown in Figure 2 ), with the scree p lot given in Supp lemen tary Figur e 2 [Zhang, Wildermuth and Sp eed ( 2008 )]. By the scree plot, there seems to b e a drop in p ercent of v ariance explained from the third to fourth principal comp onent . The ﬁrst t w o principal comp onents capture the p eriodic nature of the data, p eaking r esp ectiv ely du r ing the G2/M and M/G1 ph ases of the cell cycle. W e c ho o se as our basis s et th e ﬁrst, second and third pr in cipal comp onen ts, wh ic h together explain 63% of the total v ariance. T able 1 giv es a partial list of the promoter elemen ts foun d for this data set. The complete list can b e found in Supp lemen tary T able 1(a) [Zh an g, Wilderm u th and Sp ee d ( 2008 )]. Of the 39 p romoter elemen ts in the m o del, 35 remain after bac kward deletion with BIC as the lac k-of-ﬁt cr iterion. Before deletion, th e set of 39 motifs contai n 7 singletons, 20 p airs and 12 triples. TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 17 Fig. 2. First thr e e princip al c omp onents of the Sp el lman et al. ( 199 8 ) ye ast c el l cycle exp eriment. The table also shows, for eac h promoter elemen t rep orted for the α -arr est exp eriment, the num b er of genes that ha v e that elemen t that also b elong to Sp ellman’s 800 list. No te that S p ellman’s 800 ge nes comprise exactl y 50% of the gene s et on whic h we conducted our analysis. T he gene list enric hment test r esu lts indicate that our metho d can extract p r omoter elements that are enric h ed in ge nes asso ciated with the cell cycle. Supp lemen tary T able 1(b) [Zhang, Wilderm uth and Sp ee d ( 2008 )] sho w s the ﬂanking sequence analysis results for th e y east α -arrest exp erimen t. Note that most of the p-v alues are quite small (Fi gure 3 sho ws a histogram). This is strong evidence that many of the rep orted pr omoter elements are true p ositiv es. T able 2 sh ows, for a small subset of the rep orted elemen ts, the plots of th e correlation co eﬃcien ts b et w een X ( e ) and Y t, · at eac h time p oint t , whic h w e call “eﬀect curves.” W e s ee that, b ecause we considered pro jecti ons on to principal comp onen ts 1–3 simulta n eously , we hav e found promoter element s that are inﬂ uen tial for eac h of the diﬀerent ph ases of the cell cycle. Of the 7 s in gleton motifs rep orted in T ab le 1 , 5 are known m otifs related to the 18 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED T able 1 Examples of pr omoter elements identiﬁe d for the ye ast α -arr est exp eriment. Al l of the elements m entione d in the c ase studies of Se ction 5.1 ar e shown her e. The c omplete list c an b e found in Supplementary T able 1( a ) [ Zhang, Wil dermuth and Sp e e d ( 2008 )]. R ank is r everse or der of pruning fr om mo del, with the hi gher r anks b eing prune d ﬁrst. “ Putative site ” is the assignment of known bindi ng site names to the elements. If the element do es not match exactly to any known motif, i t is lab ele d new x , wher e x is the or der of app e ar anc e in the li st. Many of the “ new ” motifs ar e sim ilar to and may b e variations of known motifs. “ Phase ” i s the phase of the c el l cycle at whi ch the eﬀe ct of the pr omoter element is str ongest. The c olumns n and m ar e the numb er of genes i n the tr aining set and in Sp el lman ’ s 800 l ist, r esp e ctively, that c ontain the element. The c olumn “ p-value ” c ontains the p-values for ( n, m ) c ompute d using Fisher ’ s exact test Rank Motif Putativ e site Phase n m p-v alue 1 CGCGT d-MCB G1 582 380 8 × 10 − 21 2 ACG CGT MCB G1 180 141 1 × 10 − 16 4 TTTCGC G SCB mixed 160 123 2 × 10 − 13 8 GCTGG SWI5 mixed 809 400 7 × 10 − 1 9 TTGTTT SFF S/G2 1131 610 4 × 10 − 7 12 GGCTCCG new8 G2/M/G1 38 25 3 × 10 − 2 14 GCCCGTT MCM1 M 59 27 8 × 10 − 1 17 (TGCTGGC,C GCGT,3 0) SWI5, d-MCB M/G1 7 7 8 × 10 − 3 18 (CGCGT,CGC GT,30) d-MCB,d-MCB G1 82 77 1 × 10 − 18 21 (TCGCGGG,T TGTTT, 30) new13, SFF S, S/G2 5 5 3 × 10 − 2 29 (TTCGTGT, TTTCGC G,100) SCB, SCB G1 12 12 2 × 10 − 4 31 (TGGTCTG,T TTCGC G,400) new19, SCB S 9 6 3 × 10 − 1 35 (TTTCCT A,TTGTTT ,400) MCM, SFF M 178 105 7 × 10 − 3 37 (TCCGA GC,CGC GT,100) CSRE or GAL4, S 9 7 9 × 10 − 2 d-MCB 38 (TGTTCTC, CGCGT, 30) new2, d-MCB S 7 7 8 × 10 − 3 cell cycle (d-MCB, MCB, S CB, SWI5, SFF). Th e tw o “new” motifs are GGCTCCG and GCCCGTT. Th e latter, GCCCGTT, is a pu tativ e MCM1 site, b ecause it aligns with the M phase an d con tains CCC GT T , whic h has b een exp erimentall y v eriﬁed to b e a MCM1 site in CLN3, SWI4 and CLB2. The pair-wise in teractions are also v ery interesti ng. Belo w we list some notew orthy cases. MCM1–SFF pair: Con s ider the mot if (TTTCCT A,TTGTTT,400), rank 3 5. This motif com bin ation a pp ears in a large set of genes (178 to tal), most of whic h are categorized as M-phase genes by Sp ellman et al. ( 1998 ). Ou t of these 178 genes, 105 app ea r in S p ellman’s 800 list, which has a p -v alue of 0.007. These 178 genes include w ell-kno wn pla y ers in the ce ll cycl e, such as CDC10, SWI5, MCM3, SWI4, STE2, MCM6, STE3, STE6, C LB4, CDC5 and BUB2. TTTC CT A is a sub-word of the MCM1 bind in g site, while TTGTTT is the core of the S FF motif. This is strong evidence that this promoter elemen t is a co op e rativ e bindin g site for MCM1 and SFF. Since TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 19 Fig. 3. Histo gr am for p-values of inf ormation c onte nt N I seq in ﬂanking r e gion of motifs disc over e d f or ye ast c el l cycle. The ﬂanking r e gion of 20 b ases (10 e ach on the left and right of the motif ) is sele cte d. the MCM1 b inding site is highly degenerate, we sough further evidence that T TTCCT A is part of an MCM1 site b y analyzing its ﬂanking re- gion. Th e ﬂanking region has information conte n t I seq = 0 . 44 for N = 193 instances, whic h h as a h ighly signiﬁcant p -v alue of 5 × 10 − 12 . dMCB–dMCB pair: (CGCGT,C GCGT,30), rank 18, is a short range int er- action of t wo d egenerate MCB motifs. Out of the 82 genes that h a ve this elemen t in their pr omoters, a highly signiﬁcan t 77 genes (p-v alue = 1 × 10 − 18 ) are in S p ellman’s 800 list. This motif is strongly aligned with the G1 phase, w hic h is consisten t with existing kno wledge ab out MCB activit y . Comparin g this with the elemen t contai ning only MCB (row 1 of table), we see that by including the short-range in teraction of CGCGT with itself, we can signiﬁcan tly r ed uce the num b er of false p ositiv e ap- p earances. A GO analysis of th e list of 82 genes that con tains (CGCGT, CGCGT,30) retur ns s igniﬁ can t hits to many GO catego ries, includ ing DNA directed DNA p olymerase activit y . The list of 582 genes that con- tain CGCGT, h o wev er, is so diluted w ith man y diﬀerent functions that it is n ot signiﬁcant for any one GO categ ory . This sho ws that the d istance- based in teraction mod el captures additional r elev an t information. Finally , w e lo ok at the ﬂanking sequences. I seq for CGCGT alone is alrea dy highly signiﬁcan t, with a v alue of 0 . 18 and a p-v alue of 6 × 10 − 47 . How ev er, the I seq for (CGCGT,C GC GT ,30) is far greater at 0 . 76, with a p-v alue that is essential ly 0. 20 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED T able 2 Plots of c orr elation c o eﬃci ent b etwe en X ( e ) and Y t,g for a sele cte d set of pr omoter elements. The ﬁrst c olumn is the phase of the c el l cycle during which the curve r e aches its p e ak. The name and se quenc e of e ach binding site is in c olumn 2. The vertic al lines in the plot denote r ough tr ansition times b etwe en phases Phase Element Eﬀect curv e M/G1 (TGCTGGC ,CGCGT ,30) SWI5, d-MCB M/G1 GGCTCC G new8 G1 (CGCGT ,CGCGT ,30) d-MCB, d-MCB G1 (TTCGT GT,TTTCGCG,400) SCB, SCB S (TCC GAG C, CGCGT,30) UAS1, d-SCB S (TGTT CTC, CGCGT,30) UAS1, d-MCB S/G2 TTGTTT SFF M (GCCCGTT ) MCM1 M (TTTCCT A,TTGTTT,400) MCM1, SFF TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 21 SCB–SCB pair: Th e element (TT C GTGT,TTTCGCG,100), rank 29, is an in teraction of SCB motif TTTCGCG with the word TTCGTG T, whic h o ve rlaps with T TTCGTG , also an instance of SCB. T o verify that T T CGTGT is in d eed part of an S CB motif, we analyzed its ﬂanking s equence. The p osition that immediately precedes TTCGTG is indeed highly en r ic hed for th ymin e. T h u s, w e hyp othesize that (T T CGTGT,TTTCGCG,100) is a pu tativ e S CB–SCB pair. There are only 12 genes that contai n th is ele- men t, al l of w hic h b elong to Sp ellman’s 800 list, and all but 1 of them p ea k in G1. Th is list of 12 genes con tain wel l-kno wn cell cycle pla yers CLN2, PCL1 and P CL2, and is enriched for the GO cat egory cyclin dep end en t protein kinase regulator activit y . SWI5–dMCB pair: T he ﬁrst word in the elemen t (TGCTGGC,CGCGT,30), rank 17, con tains the SWI5 motif GCTGG, wh ile the second w ord is dMCB. As seen from the plot of the eﬀect cur v e, this motif is strongly aligned with M/G1 transitio n , w hic h is consistent with the fac t that S WI5 regulation occurs during this p oint of the cyc le. A total of 7 genes con tain this elemen t, in cluding the cell-cycle related trans cr ip tion factors ASH1 and WTM1. This gene list is enric hed for the GO categories h yd rolase activit y and b et a-glucosidase activit y . The histone clusters: A striking result of app lying our m etho d to the yeast cell cycle exp erimen t is that, without sup ervision, it w as able to detect promoter elemen ts associated w ith tigh tly regulated small sets of histone genes. These promoter elemen ts (ranks 21, 31, 37, 38), along with the genes that h a ve them, are listed in T ab le 3 . Among the w ords con tained in these promoter elemen ts are the degenerate MCB motif CGCGT, the SFF motif TTGTT T, th e S CB motif TTT CGCG, and some new mo- tifs that are not commonly asso ciated with the cell cycle: TGTTCTC , TCGCGGG, TGGTCTG and TCCGAG C. Among these “new” motifs, TTGTTCTC and TCCGA GC are parts of mapp ed UAS1/UAS2 elemen ts [Osley ( 1991 )]. 5.1.1. Comp arison with pr evious metho ds. All existing r egression-based metho ds ﬁnd motifs at eac h time p oint sep arately , and lo ok across samp les mainly for inte rpretation of already id entiﬁed motifs. T herefore, in compar- ison to pr evious results, we emp hasize th at we do not expect to ﬁnd exact concordance. Th e most signiﬁ can t motifs that we identiﬁed by cross sam- ple analysis, su c h as SWI5, S FF, MCM1 and MCB, ha ve also b een iden- tiﬁed b y c h o osing the correct time p oin t and us ing one of the pr evious single-sample metho d s [Bussemak er, L i and Siggia ( 2001 ), Das, Banerjee and Zhang ( 2004 ), Conlon et al. ( 2003 ) and K eles, V an der Laan and V ulp e ( 2004 )]. Ho wev er, as exp ected, there is no sin gle time p oint that allo ws the iden tiﬁ cation of all of the motifs in our ﬁnal set. 22 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED T able 3 Pr omoter elements f ound f or the ye ast c el l cycle exp eriment that ar e enriche d with histone genes. F or e ach pr omoter element, the genes that c ontain i t ar e liste d, along with their pr o c ess and f unction annotations obtaine d fr om the Sac char omyc es Genome Datab ase ( http: // www. yeastgenom e. org/ ) Motif ORF YPD Process F u nction Pea k (TGTTCT C,CGCG T,30) YBL002W HTB2 chroma tin structure histone H2B S YBL003C HT A2 chroma tin structure histone H2A S YDR224C HTB1 chromatin structure histone H2B S YDR225W HT A1 chromati n structure histone H2A S YGR014W MSB2 bud emergence unknown G1 YOR317W F AA 1 f atty aci d metab olism long chain fatt y acyl: M/G1 CoA synthetase YPL127C HHO1 chroma tin structure histone H1 S (TCGCGGG, TTGTT T,30) YBR009C HHF1 chromatin structure histone H4 S YBR010W HHT1 c hromatin structure histone H3 S YDR261C EXG2 cell wall biogenesis exo-b eta-1,3-glucanase S YKL096W CWP1 cell w all protein b eta-1,6-glucan acceptor S/G2 YKL096W CWP1 cell w all protein b eta-1,6-glucan acceptor S/G2 (TGGTCTG ,TTTC GCG,400) YHR061C GIC1 bu d emergence binds Cdc42 p S YLR056W ERG3 sterol metab olism C-5 sterol desaturase S/G2 YNL030W HHF2 chromatin structure histone H4 S YNL031C HHT2 chroma tin structure histone H3 S YOR247W YOR247W unk now n unknown; similar to Sv s1p G1 YPL111W CAR 1 arginine metab olism arg inase G2/M YLR162W YDR015C YNL323W (TCCGA GC,CGCGT,10 0) YBR009C HHF1 chromatin structure histone H4 S YBR010W HHT1 chromatin structure h istone H3 S YDL037C YDL037C unkno wn similar to glucan G2/M 1,4-alpha-glucosidase YER001W MNN1 protein glycosylation al pha-1, G1 3-mannosyltransferase YNL030W HHF2 chromatin structure histone H4 S YNL031C HHT2 chroma tin structure histone H3 S YOR084W YOR084W unk now n unknown G1 YER060W YER104W TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 23 An in teresting observ ation is that many of the previous metho ds found strong signals for motifs related to stress resp onse [CCCCT and A GGGG in Bussemak er, Li and Siggia ( 2001 )] and pheromone induction [STE12 motif in Conlon et al. ( 2003 ) and Das, Ba nerjee and Zhang ( 2004 )]. These motifs are activ e in the ﬁrst few time-p o in ts, and w ere h yp othesized in Conlon et al. ( 2003 ) to b e an exp erimen tal artifact due to cen trifugation. W e did not iden tify these motifs, because our approac h uses linear pro jectio n s to ﬁ lter out pro cesses that are not of interest. Das, Banerjee and Zhang ( 2004 ) used MARS [F r iedman ( 1991 )] to ﬁ n d motif pairs in the yea st cell cycle data. Their mod el do es not consider dis- tance eﬀects, but instead u ses linear splin es resembling a ho c key stic k to mo del what is hypothesized to b e a switc h -like b eha vior in gene transcrip- tion co n tr ol. They used only the top 800 cell cycle rela ted genes identiﬁed by Sp ellman et al. ( 1998 ), while we also included 800 con trol genes. They u sed simple degenerate words, as w ell as man ually curated wei gh t matrices. Their metho d also treats eac h time p oin t separately . Thus, in ﬁnding su btle sec- ond order eﬀects, one w ould exp ec t signiﬁcan tly diﬀerent mo d els applied to diﬀeren t d ata to pro duce v arying results illuminating diﬀeren t asp ects of a complicated pro cess. Ho wev er, of the list of in teracting motif pairs rep orte d in Das, Banerjee and Zhang ( 2004 ), our results agreed b y exact matc h in the motif pairs MCM1-SFF and SWI5-SFF, w hic h are well- kno wn pairwise in teractions. 5.2. Powdery mildew infe ction in Ar abidopsis thaliana. T he Arabidop- sis resp onse to the pathogen in volv es m u ltiple transcription factor family mem b ers. Though SA-dep enden t resp onses d ominate systemic acquired re- sistance resp onses, eth ylene (ET)- and jasmonic acid (JA)-dep enden t re- sp onses also play critical r oles in the Arabidopsis resp onse to the p athogen with outcomes d ep endent up on the complex in terp la y b et w een these path- w ays. T able 4 sho ws the ma jor transcription factors with kno w n bin ding domain consensus sequences that are inv olv ed in the Ar ab id opsis defense resp onse [e.g., r eview by Gur r and Rushto n ( 2005 )]. The three most well- studied of these transcription factors are the WRKY family- which mediate b oth SA- dep end en t and E T /JA-dep endent r esp onses [Ulk er and S omss ic h ( 2004 )], the ERF family , k ey regulators of ET- and JA-dep end en t defense- asso ciated pathw a ys [Gutterson and Reub er ( 2004 )], and the TGA tran- scription factors w h ic h are able to inte r act with the SAR master regulator NPR1 [e.g., Johnson, Bo den and Arias ( 2003 )]. Figure 4 shows the ﬁrst and s econd pr incipal comp onen ts of the p o w- dery mildew exp erimen t p erformed on the 300 0 selected genes describ ed in Section 2.2 . The ﬁrst principal compon ent sho ws an increase in expression after infection in wild t yp e plant s with a red uced/dela ye d resp onse in the ics 1 m u tan t. It also included genes whose in duced expression in wild type 24 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED T able 4 Major tr anscription f actors, and their c orr esp onding bindi ng sites, involve d in Ar abidopsis def ense against the p atho gen. Note that most of these tr anscription factors r epr esent a multi-gene fami ly F actor name Site name Site consensus Reference WRKY W-Bo x (T)TGAC (T/C) Eulgem ( 2005 ) ERF GCC-Bo x (A )GCCGCC Gurr and Rushton ( 2005 ) TGA/OBF as1/ocs T GACG Gurr and Rushton ( 2005 ) MYC bHLH G-Bo x CAC N TG Gurr and Rushton ( 2005 ) MYB MYB (T/C)AAC (T/G)G Eulgem ( 2005 ) G(G/T)T(A/T)G(G/T )T Eulgem ( 2005 ) SR genes CGCG-Bo x (A/C/G)CGCG(G/T/C) Gurr and Rushton ( 2005 ) is abrogated in the ics 1 m u tan t. These expression patterns reﬂect p o wder y mildew-induced genes th at are partially or fully S A-dep endent [Wildermuth et al. ( 2007 )]. The second prin cipal comp onen t is ve ry in teresting, as it shows an increase in expression ov er time in the ics1 mutan t as compared to the wild type th at exhibits little resp onse to the pathogen. Genes that ha v e a high score for this comp onen t ma y b e in volv ed in pathw a ys that resp ond to the ab s ence of I CS1 (and S A). K no wn genes asso ciated w ith PC2 in- clude genes asso ciated with ET/JA-dep end en t resp ons es [Wilderm uth et al. ( 2007 )]. W e c h o ose as our b asis set the ﬁ rst and second principal comp o- nen ts, whic h together exp lain 94 .1% of the total v ariance. T able 5 giv es a partial list of th e motifs that were fou n d by our metho d on this data set. The complete list can b e foun d in Supp lemen tary T able 2(a) [Zhang, Wilderm u th and Sp ee d ( 2008 )]. T he four th column of the table sho w s wh ether the m otif h as a strong eﬀect in the direction of pr incipal comp onent 1 or 2. A “strong eﬀect” is declared if the list of genes that ha ve that motif ha ve a high r anking in the considered linear com bination, with Wilco xon rank-su m test p-v alue < 0 . 00 1. The table also lists the p- v alues of Fisher’s exact test for enrichmen t in the top 1500 genes ranke d b y ˜ T 2 out of the total of 3000 genes in the ﬁ ltered list (see App end ix A.1 ). Of the kno w n TFBS, we foun d that most hav e strong eﬀects along the ﬁrst pr in cipal comp onent, w ith the exception that the ERF binding site GCCGCC is aligned with the second prin cipal comp onent . Since the ﬁrst principal comp onen t exp lains m uc h more of the v ariance in the data set than the second principal comp onent (80.7% compared to 13.4%), it is giv en a muc h larger w eight in the mo del. T h us, most of the r ep orted promoter elemen ts aligned with the ﬁrst prin cipal comp onen t. The GCC-b o x, whic h has a very str ong correlatio n with the seco nd principal co mp onent, squeeze d in to th e list at num b er 42. Our ﬁ ndings were v alidated in four w a ys. First, as shown in T able 5 , we iden tiﬁ ed all six defense-associated motifs listed in T able 4 as w ell as the TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 25 NFkB-lik e motif identi ﬁed b y Leb el et al. ( 1998 ). This motif is asso ciated with inn ate immune resp onses in mammals, is present in the p romoter of A t- ICS1 [Wildermuth et al. ( 2001 )], and has b ee n implicated in plan t resp onse to pathogens. Second, w e p erform ed p ermutation analysis, the results of whic h are shown in Figure 5 . C omp aring the top and b ottom plot, w e s ee that wmBIC is ind eed a more conserv ativ e mo del selection criterion than wGCV . The p ermutatio n results also sho w that our m etho d decides up on a muc h larger mo del for the r eal data set th an for the rand omly decoupled data sets. This fact giv es conﬁdence that some of the discov ered motifs for the r eal data set are b iologica lly meaningfu l. Third , w e anal yzed the ﬂanking regions of th e ident iﬁed motifs and presented the p-v alues asso ciated with this an alysis in Supp lemen tary T able 2(b) [Zhan g, Wildermuth and S p eed ( 2008 )]. Ov erall, th ese p-v alues we re less signiﬁcan t than th ose for the ye ast motifs, likely d ue to the presence of large transcription fact or familie s that bind a similar core motif in Arabidopsis. Ho w ever, a n u m b er of the identiﬁed Fig. 4. First two princip al c omp onents of the A r abidopsis p ower dery m ildew infe ction exp eriment. 26 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED motifs did exhibit signiﬁcan t conserv ation of ﬂanking sequences. Fi nally , the results of our gene list enrichmen t analysis is sh own in T able 5 . W e include a detailed discussion for a f ew interesting case studies extracted from the results b elo w . W-b o x clusters and in teractions: WRKY transcription f actors are critical regulators of plan t r esp onse to abiotic and biotic stress [Ulk er and Somssic h ( 2004 )]. WRKY transcription factors bin d the W-box core TGA C, with rep orted ﬂankin g sequences in Arabidopsis inv olv ed in plan t d efense re- sp onse biased tow ard TTGA C (T) [e. g., Y u , Chen and Chen ( 2001 )] and TTGA C C [Laloi et al. ( 2004 )]. W-b o x cont aining elemen ts are signiﬁcan tly aligned with b oth the ﬁr s t and second p rincipal comp onents. This mak es sense biologically , as WRKY factors can mo du late b oth SA-dep endent and ET /JA-dep enden t p ath- T able 5 Examples of pr omoter elements identiﬁe d for the Ar abidopsis p owdery mil dew infe ction exp eriment, l i ste d i n r everse or der of pruning fr om the mo del. Al l of the elements mentione d i n the c ase studies of Se ction 5.2 ar e shown her e. The c omplete list c an b e found i n Supplementary T able 2( a ) [Zhang, W ildermuth and Sp e e d ( 2008 )]. “ Putative site ” is the assignment of known binding site names to the elements. Putative sites liste d in T able 4 ar e sp e ciﬁe d by name as i s the NFkB-like motif i dentiﬁe d by L eb el et al. ( 1998 ) and asso ciate d with innate immunity. Al l other identiﬁe d m otifs ar e liste d as new, though some of these exhibit signiﬁc ant overlap with known m otif s [Hi go et al. ( 1999 )]. “ Comp onent ” i s the princip al c omp onent with whi ch the element has a str ong eﬀe ct, and “ none ” i f no princip al c om p onent signiﬁc antly dominates the other. The c olumns n and m ar e the numb er of genes in the tr aining set and i n the top 1500-list, r esp e ctively, that c ontain the element. The c olumn “ p-value ” c ontains the p-values for ( n, m ) c ompute d using Fi sher ’ s exact test Rank Mo tif Putative site Comp onent n m p-v alue 1 GAC TTT NF κ B-like 1,2 1672 909 5 × 10 − 8 2 TTGA CT W-b ox 1,2 1629 914 2 × 10 − 13 4 (TGAC T A,TTGA CC,1000) W-b ox, W-b o x 1 445 268 2 × 10 − 6 5 AGT CTT NF κ B-like 1,2 1460 802 9 × 10 − 8 8 TGA CGT TG A none 711 398 2 × 10 − 4 11 (AGA CTT,TTGA CT,200 ) NF κ B-like, W-b ox 1,2 478 313 8 × 10 − 14 12 (GTCGTC,T TGA CT,200 ) new6, W- b o x 1 197 130 2 × 10 − 6 19 (CA TGTG, GAA T A T,100 0) Myc, new11 1 665 320 9 × 10 − 1 21 (TTCGTC,T TGA CT,200) new15, W-b ox 1 293 192 1 × 10 − 8 25 (CGCGTT,T TTCCA,20 0) CGCG-b ox, new8 1 72 42 9 × 10 − 2 26 (TCAAAC,TT GAC C,200) new19, W-b ox 1 366 210 2 × 10 − 3 28 (GAGCTT ,TTGACC, 1000) new20, W-b ox none 376 200 1 × 10 − 1 29 TCAACG Myb 1 951 556 2 × 10 − 10 33 (TGTCGA,TTGA CC,200) new23, W-box none 106 59 1 × 10 − 1 42 (GGCGGC,AA TTTT,200) GCC-b ox, new3 2 145 76 3 × 10 − 1 TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 27 w ays, the latter of which are lik ely to b e asso ciated with principal com- p onent 2. In our analyses, the TTGACT motif app eared alone (Rank 2), and in close pro x im ity (within 200 nt) to other motifs: NF κ B-lik e motif A GACTTT (rank 11), T CAA CT (rank 12) and T TCGTC (rank 21). In addition, it app ears within 1000 nt of TGA CT A, w hic h conta in s the W-b ox core (Rank 4). En ric h men t of W-b o xes in promoters of genes with altered expression in resp on s e to b iotic stress is consisten tly observ ed [e.g., Malec k et al. ( 2000 )], in agreemen t with our ﬁn d ing of a W-b o x, W- b o x pair. The TTGA CC app eared in com bin ation with other motifs in close p ro x- imit y [TC AAA C (rank 26) and TGTCGA (r an k 33)] or within 1000 nt [TGA CT A (rank 4) and GA GCT T (rank 28)]. In all of these instances, the ﬂ anking sequ ences of these W-b o xes ha ve p -v alues < 0 . 001. Resolv- ing ﬂanking sequen ce sp ec iﬁcit y and genes targeted by sp eci ﬁc WRKY factors has b een extremely chall enging as the Arabid opsis WRKY family Fig. 5. Permutation analysis for the Ar abidopsis p owdery mi ldew infe ction data. The top and b ottom plots show, r esp e ctively, the wGCV and wmBI C curves for 10 r andomly de c ouple d data sets (gr ay lines) versus the r e al data set (black line). V ertic al lines show the m o del size that minim i zes these curves. 28 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED has ov er 70 members and promoters of regulated genes tend to con tain m u ltiple W-b o xes. F u rthermore, stimulus-dep endent c hanges in WRK Y binding aﬃnities result in WRKY shuﬄing on pr omoter element s [T urck, Zhou and Somssich ( 2004 )]. Our appr oac h allo ws us to identi fy putativ e WRKY f actor in teractio n pairs in silic o and to pr edict those cases where ﬂanking sequences m ay b e more readily resolv ed, greatly facilitati ng ex- p erimental eﬀorts. (CGCGTT, TTTCCA, 200) T h is mot if com bination (ranked 25) app ears in a set of 72 genes. This set of 72 aligns signiﬁcan tly with principal com- p onent 1. When w e look at the ﬂank in g sequences, I seq for CGCGTT is highly signiﬁcan t with a v alue of 0.76 and a p -v alue of 5 × 10 − 6 . Int er- estingly , th e T TTCCA motif al so app ears in another motif com bination (TCAA CT , TTTC CA, 200) ranke d 13. In th is case, analysis of th e ﬂank- ing sequ ence for T TTCCA is 0.14, with a p -v alue of 2 × 10 − 6 . The CGCGTT motif comprises part of a kn o wn motif, the “CGCG b ox,” with consensus sequence (A/C/G)CGCG(C/G/T); this C GC G b o x is recognized by all 6 memb ers of th e A. thaliana signal r esp onsive genes A tSR1-6 [Y ang and Poov aiah ( 2003 )]. The Arabidopsis SR pr oteins are Ca2+/calmodu lin-binding/DNA-binding p roteins that are ind uced in re- sp onse to a v ariet y of plant phytohormones and stresses [Y an g and P o o v a- iah ( 2003 )]. Ca2+ pla ys an imp ortan t role in mediating SA and H2O2 signal transduction [Y ang and Poo v aiah ( 2003 )]; ho wev er, our kno w ledge ab out the s p eciﬁc mec h anisms inv olv ed is limited. A tSR3-6 ha v e b ee n found to b e rapidly indu ced in resp onse to treatmen t b y salicylic acid or H2O2 [Y ang and Poo v aiah ( 2003 )]. W e do not observ e statistical ly signiﬁ- can t changes in expression for an y of the A tSR genes o v er the time course of p o w dery mildew infection. Ho w ever, as this time co urse fo cused on later stages of infection (1–7 da ys), it is v ery p ossible that early transcriptional resp onses (suc h as a p ossible c hange in At SR transcription) we re not cap- tured. Instead, w e resolv e the later progressiv e transcriptional r esp onse asso ciated with extensive gro wth and r epro du ction of the p o wdery mildew including do wnstream genes (e.g., cont aining the C GCG b ox) that ma y b e regulated b y r apidly-induced transcription factors suc h as the A tSRs. Though the 72 genes with the (CGCGTT, TTTCCAA, 200) motif com bi- nation were n ot signiﬁcan tly enric h ed in an y MIPS GO functional ann o- tation category (p erformed u sing Virtual Plant 0.9, BioMaps function), this 72 g ene set includes a n umber of transcription factors, defense-related genes and a calcium trans p orting A TP ase (see Supp lemen tary T able 3 ) [Zhang, Wildermuth and Sp eed ( 2008 )]. T o further assess whether ad- ditional memb ers of the 72 gene set had b een previously found to b e directly mo d ulated by Ca2+/calmod ulin, w e compared the 72 gene set with a union of genes (of 709) compiled from the follo wing Arabidop- sis datasets: (1) the A tSR genes (6 genes) and genes iden tiﬁed as con- taining the CGCG b o x (19 genes) [Y ang and P o o v aiah ( 2002 )]; (2) the TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 29 Ca2+/calmodu lin-binding, BTB and T AX d omain-con taining At BT p ro- tein family (5 genes) and inte r actors (2 genes) [Du and P o o v aiah ( 2004 )]; (3) rap id calcium-resp onsive up and down regulated genes (229 genes) [Kaplan et al. ( 2006 )]; (4) calmo dulin -binding p roteins ident iﬁed using high densit y protein arrays (173 genes) [Pop escu et al. ( 2007 )]; and (5) genes whose annotati on included the k eyword calc ium or calmo dulin (303 genes), ob tained using VirtualPlan t 0.9. Of the 72 genes in our combined motif set, only 3 were p resen t in th e compiled Ca2+/calmodulin gene set (see T able 1 ). This suggests that we may ha ve elucidated a previously unc haracterized sp eciﬁc subset of resp ons es requ iring a CGCGTT cis - acting element and TTTC CAA element in close pro xim ity that can th en b e exp erimental ly v alidated. AA TTTT, GGCGGC T o our kno wledge, the AA TTTT motif has not pre- viously b een described in its e n tirety as a cis -acting regulatory elemen t in Arabidopsis. The AA T T TT motif is a comp onent of the p lant cis -acting regulatory el emen t CAAAA T TTTGT A [PLACE database mot if S0004 66, Higo et al. ( 1999 )] and is sp eciﬁcall y activ ated during the early phases of an incompatible plant /bacterial pathogen inte raction in tobacco [Pon tier et al. ( 2001 )]. It app ears alone (rank 7) and in combinatio n with other motifs. F or ranks 18, 36 and 42 , the AA TTT T motif is within 20 0 nt of its partner; whereas for ranks 22 and 41, it is within 1000 nt of the other motif. The p-v alues for the I seq v alues AA TTTT in these in teractions are all quite lo w. T he (GGCGCC, AA TTTT, 200) p air is esp ecia lly interesting as it is the on ly elemen t that is mainly asso ciated with the second comp onent. Genes associated with comp onent 2 exhibit a trend of enhanced expres- sion in the SA biosyn thetic mutan t compared with w ild t yp e o ve r the time course of expression and m a y b e negativ ely regulated by S A. Th e GGCGCC m otif is commonly k n o wn as the GCC b o x r ecognized by eth ylene-resp onsiv e f actors (ERFs). ERF transcrip tion factors regulate dev elopmental and defense pro cesses and are asso ciated with ET and JA signal tran s duction pathw a ys [Gutterson and Reub e r ( 2004 )]. T he set of genes with th e GCC motif is 245; th is set is highly ranked wh en com- p onent 2 alone is exa mined, but falls b elo w our threshold when the lo ss function com bin es b oth comp onen ts 1 and 2. The GCCGCC set con- tains the defensin PDF1.2 (At 5g444 20), a marker of ET- and JA acid- dep end en t defens e resp onses, regulated in part by ERFs. The (GGCGCC, AA TT TT, 200) p air d o es not include PDF1.2, b ut do es in clude the ERF A t1g06160 and At WRKY75 (At5 g1308 0) trans cription factors, as we ll as defen s e-related genes asso ciated with ET- and JA-dep end en t defens e resp onses suc h as chitinases, a germin-lik e p rotein, and defensin-fusion protein (A t2g2602 0). Both ERF and WRKY factors can mediate cross 30 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED talk b et w een SA- and ET /JA-dep endent signaling path wa ys. This is p ar- ticularly in teresting as this sub s et includes comp onent 1- (SA-dep end en t resp onses) and comp onen t 2-associated genes. 6. Discuss ion. W e ha v e shown using tw o exp erimenta l data s ets that the mo del and metho ds we prop ose in Section 3 can b e quite useful in ﬁnding transcrip tion factor bin ding sites u sing m u ltiv ariate gene exp r ession data. The mo del stated in ( 1 ) and ( 3 ) can b e qu ite general to accommod ate an y linear con trast(s) of interest. F or cases where no obvious contrasts are a v ailable from the exp erimenta l design, we suggest selecting th e basis vect ors { v j } using principal comp onen ts. F or b ot h th e y east α -arrest exp eriment and the Arabidopsis p o wdery mildew infection exp erimen t, the ﬁ rst few principal comp onents are v ery eﬀectiv e in captur ing meaningful structure in the data. T o mo del co op erativ e regulation b et ween T FBS, we develo p ed a r ecursiv e mo del for in teractiv e eﬀects that is limited to a c hosen range along the promoter sequen ce. Th e range p arameter is also c hosen during the mo del ﬁtting pr o cess, and a m o del selection criteria is p rop osed to adjust for this additional degree of freedom. Ho w ever, th ese mo del selection criterion are only mean t as a guid eline for in terp r eting th e mo d els, and should not b e tak en as strict ru les for inclusion and exclusion of v ariables. In addition to v alidation us in g p ublished exp erimen tal literature, w e em- plo yed three simp le metho ds for interpretation of the mod el and statistical v alidation of the rep orted motifs. W e f ound the p erm utation pr o ce dure to b e quite useful (and necessary) for assessing ho w m uch n oise the a pproac h is exp ected to appr oac h. The ﬂanking sequen ce and gene list enric h men t anal- yses are particularly imp ortan t to exp erimental biologists as it allo ws them to prioritize motifs of in terest and to understand th e d iﬀeren tial v ariabilit y in ﬂanking sequences of particular motifs. APPENDIX A.1. Data pre-pr o ce ssing. A.1.1. Y e ast α - arr e st exp eriment. As in Sp ellman et al. ( 1998 ) and Buss e- mak er, Li and Siggia ( 2001 ), w e limit our searc h of TFBS to the 700 base promoter sequence up stream of the gene immediately p r eceding the tran- scription start site. Missing data hav e b een imputed using KNNimpu te [T ro yansk ay a et al. ( 2007 )]. Prior to the analyses, the gene expression v alues from eac h sample w ere cen tered and scaled to ha v e mean 0 and v ariance 1. W e c hose the genes for our analysis as follo w s : Th e list of 80 0 “cell -cycle related” genes identiﬁed b y Sp ell man et al. ( 1998 ) are automatica lly includ ed (w e refer to these as S p ellman’s 800 in our analysis). T o c ho ose the negativ e TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 31 con trols, we ﬁrst clus tered the data u sing K -means, and then identifying those clusters that by visual examination d id not exhibit a cell-cycle related pattern (these genes hav e v ery little v ariation across th e 18 time-p oin ts). 800 genes are sampled r andomly from these clusters to b e in cluded in the reduced set. Th us, we u s ed 1600 genes in our ﬁnal analysis, exa ctly 50% of whic h are in Sp ellman’s 800 list. A.1.2. Powdery mildew infe ction exp eriment. Ar abidopsis tha liana Colum- bia strain w ildt yp e and ICS 1-n ull mutan t plan ts w ere ev enly p ositioned and intermixed in ﬂats consisting of 6 b o xes and p laced in gro w th c h am- b ers [Wildermuth et al. ( 2007 )]. Eac h b o x con tained 12 plant s. When the plan ts were four w eeks old, they we re inf ected with a hea vy inno cu lu m of p o wdery mildew ( Golovinomyc es or ontii ). Uninfected plant s serv ed as con- trols and were grown in growth c h am b ers with iden tical conditions. Mature lea v es w ere harv ested (for RNA) at 6 time p oints: (0 hr, jus t prior to in- fection), 6, 24, 72, 120 and 168 hpi. Plant s could n ot b e re-sampled, so at eac h time p oin t, paired samples w ere h arv ested from tw o randomly selected plan ts. mRNA extraction, target lab eling and hybridization to Aﬀymetrix Arabidopsis A TH1 GeneChips w as p erformed for 4 co mplete biolo gical repli- cates, y ieldin g information on 22810 prob ese ts for 56 arra ys. F or our analysis, w e a verage d the expr ession lev el in the 4 biological replicates for eac h gene, time p oint and plant typ e (mutan t or wild type). W e discarded data p oints at hour 6, du e to the fact that they we re not collect ed at the s ame time during th e da y as the other samp les and thus, circadian eﬀects, rather than eﬀects due to infection, could confound our analysis. O ut of th e 2281 0 prob esets, w e selected s maller, ﬁ ltered gene s ets for further analysis using the ˜ T 2 -statistic from T ai and S p eed (2006) . ˜ T 2 is an empirical Ba y es statistic that ran k s genes from replicated time-course exp eriments b y diﬀerentia l expression o ver time in a single biological condi- tion or across multiple biolog ical conditions. Our ﬁltered gene set con tains the top 1500 and b otto m 1500 genes r ank ed by the ˜ T 2 statistic for diﬀer- en tial expression o v er time b etw een th e wildt yp e and i cs 1-m u tan t str ains. The b ottom 1500 genes includ ed in eac h gene list are necessary as negativ e con trols. F or S g , we extracted the 1000 b ase p romoter sequence upstream of gene g immediately preceding the translation start site. Thus, S g includes the 5 ′ UTR sequence. A.2. Mo del selection criterion. F or ease of notation, w e ﬁr s t assume that there is only one b asis ve ctor, and thus, th e r esp onses Y g are univ ariate for eac h gene g . Also, for simplicit y , we assume that th e v ariance of the err or term ǫ g in ( 3 ) is kno w n and equal to 1 [the un kno w n v ariance case yields similar d egrees of freedom calculations, see Zhang ( 2005 )]. Let e 1 and e 2 b e 32 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED t wo promoter ele men ts. By the notation of Section 3.3 , A g ( e 1 ) and A g ( e 2 ) are the lo cations of e 1 and e 2 , r esp ectiv ely , in S g . Deﬁne D g ( e 1 , e 2 ) = min {| k − l | : k ∈ A g ( e 1 ) , l ∈ A g ( e 2 ) } to b e the minimum d istance b et wee n any p air of ( e 1 , e 2 ) in S g . F or A g ( e 1 ), A g ( e 2 ) emp t y , D g ( e 1 , e 2 ) is deﬁned to b e ∞ . Let e 1 , 2 = ( e 1 , e 2 , δ ) b e the promoter elemen t represen ting the δ -r ange inte raction o f e 1 and e 2 . Then, by the d eﬁnition of X ( e ) in ( 6 ), inclusion of e 1 , 2 adds the term αI ( D g ( e 1 , e 2 ) < δ ) to th e existing mo d el. Th at is, the mo d el th at includes X ( e ) can b e re-written as Y g = µ + X e ∈E \ e 1 , 2 β ( e ) X g ( e ) + αI ( D g ( e 1 , e 2 ) < δ ) + ǫ g , (9) where δ is a c h ange-p oin t parameter. In cluding e 1 , 2 as a pr edictor adds the parameters γ , δ to the mo del. W e giv e here a crud e analysis of the eﬀectiv e degrees of freedom con tribu ted b y th is interac tion term. W e assume that the pr omoter length R and the prior for δ are ﬁx ed and do not increase w ith th e sample size G , so that the prob ab ility π δ ≡ P ( D g ( e 1 , e 2 ) < δ ) can b e considered as a ﬁxed function of δ . Without loss of generali t y assume that the gene indices are ordered so that D g ( e 1 , e 2 ) is monotone nondecreas- ing. Deﬁne τ δ = m ax i { D i ( e 1 , e 2 ) ≤ δ } to b e the n um b er of genes that ha v e the elemen t e 1 , 2 . Th en, assuming that the promoter sequences S g are i.i.d., τ δ is b inomial with log-lik elihoo d log P δ ( τ δ = m ) = log  G m  π m δ (1 − π δ ) G − m = 1 2 log G 2 π m ( G − m ) − GI δ , where I δ is th e large deviations co nstan t I δ = m G log Gπ δ m + G − m G log Gπ δ G − m . F or ease of computation, w e assume that e 1 , 2 is the only term in the mo del, and thus, ( 9 ) is redu ced to Y g = µ + αI ( D g ( e 1 , e 2 ) < δ ) + ǫ g . Let M 0 b e the mo d el w here α = 0, an d M 1 b e the alternativ e mo d el where α is TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 33 arbitrary . Then, under the Ba y esian m o del select ion framew ork, we c h o ose the mo del with the largest Bay es f actor, wh ic h has the form P ( M 1 | Y ) P ( M 0 | Y ) = Z R 0 Z ∞ −∞ Z ∞ −∞ exp − 1 2 " τ δ X i =1 ( Y i − α − µ ) 2 + G X i = τ δ +1 ( Y i − µ ) 2 # (10) + log P δ ( τ δ ) ! dα dµ dδ /R ! ×  Z ∞ −∞ e − (1 / 2) P i ( Y i − µ ) 2 dµ  − 1 . In ( 10 ) w e h a ve assumed un iform pr iors f or δ , µ and α . Since π δ do es not change b etw een M 0 and M 1 , the term GI δ in the n u merator of ( 10 ) con ve rges to a c hi-squ are distribu tion under b oth mo dels, and th us is sto chasti cally b o unded a w ay from 0 an d ∞ as G → ∞ . Also, due to this assumption, with prob ab ility one, τ δ = O ( G ), and thus, the metho d s in Zhang and Siegm und ( 2007 ) can b e dir ectly applied to the ev aluation of ( 10 ) to y ield the follo w in g appro ximation when G is large: log P ( M 1 | Y ) P ( M 0 | Y ) = l ( ˆ α, ˆ µ, ˆ δ ) − [log ( τ δ ) + log ( G − τ δ ) − log G ] (11) − log R + O p (1) , where l ( ˆ α, ˆ µ, ˆ δ ) is the maximized lik eliho o d. Comp ared with the classic BIC whic h has a p enalt y of 1 2 p log G , where p is the degrees of freedom of the mo del, this new result suggests that eac h in teraction term ( δ , γ ) con tributes 2[log R + log τ δ + log( G − τ δ ) − log G ] / log G (12) degrees of freedom to th e mo del. When the resp onse is m u ltiv ariate w ith w eights { d j } as in ( 5 ), we simply tak e a weigh ted su m of the BICs for eac h comp onen t. The v ariance un kno wn case is more tec hn ically messy , bu t th e same logic applies, yielding appro ximation ( 8 ). The pro of will not b e sho wn here; the in terested reader can refer to Zhang ( 2005 ). SUPPLEMENT AR Y MA TERIAL Additional tables an d ﬁgures (doi: 10.121 4/07-A OAS142SUPP ; .zip). Sup- plemen tary Figures 1 and 2 sho w r esp ectiv ely the pr in cipal comp onen ts and 34 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED screeplot for Sp ellman et al. ( 1998 ) y east cell cycle data set. S upplementary T able 1 (a-b) shows gene list en ric hm en t, annotation, and ﬂankin g sequence analysis for promoter elements ident iﬁed in Sp ell man et al. ( 1998 ) ye ast cell cycle exp eriment. Sup plemen tary T able 2 (a-b) sho ws the same information for the Wilderm uth et al. ( 2007 ) arabidopsis p o werdery m ildew infection ex- p eriment. Supp lemen tary table 3 lists the genes con taining the (CGCGTT, TTTCCA, 200) element. REFERENCES [1] Bailey , T. L. and Elkan, C. (1994). Fitting a mixture model b y expectation max- imization to discov er motifs in b iopolymers. Pr o c e e dings of the Se c ond International Confer enc e on Intel ligent Systems f or Mole cular Bi olo gy 28–36. AAAI Press, Stan- ford, CA. [2] Bussema ker, H. J., Li, H. and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Natur e Genetics 27 167–171. [3] Chian g, D. Y., Moses, A. M., Kellis, M., Lander, E. S. and Eisen, M. B. (2003). Phylogenetically and spatially conserved w ord pairs associated with gene- expression changes in yeasts. Ge nome Biolo gy 4 R43. DOI: 10.118 6/gb-2003-4-7-r43 . [4] Conlon, E. M., Liu, X. S., Lieb, J. D. and Liu, J. S. ( 2003). Integrating regulatory motif discov ery and genome-wide expression analysis. Pr o c. Natl. A c ad. Sci. 100 3339– 3344. [5] Das, D., Bane rjee, N. and Zhang, M. Q. (2004). I nteracting mo dels of coop erative gene regulation. Pr o c. Natl. A c ad. Sci. 101 16234– 16239. [6] Das, D., Nahl ´ e, Z. and Zhang, M. Q. (2006). Adaptively inferring human t ran- scriptional subnetw orks. Mole cular Systems Biolo gy 2 DOI: 10.38/ msb4100067 . [7] Du, L. and Poo v aiah, B . W. (2004). A nov el family of Ca2+/calmodulin-binding proteins inv olved in transcriptional regulation: Interaction with fsh/Ring3 class t ran- scription activ ators. Plant Mol e cular Bi olo gy 54 54 9–569. [8] Eulgem, T. (200 5). Regulation of the Arabidopsis defense transcriptome. T r ends i n Plant Scienc e 10 71–77 . [9] Fisher, R. A. (1922). On the in terpretation of χ 2 from con tingency tables, and the calculation of P . J. R oy. Statist. So c. 85 87–94. [10] Fra tkin, E., Naughton, B ., B rutl ag, D. L. and Ba tzoglou, S. (2006). Motif cut: An algorithm for ﬁnding regulatory motifs. Bioinformatics 22 150–157. [11] Frie dman, J. H. (1991). Multiva riate adaptive regressio n splines (with discussion). Ann . Statist. 19 1–141. MR1091842 [12] Gu rr, S. and R ushton, P. (2005). En gineering plan ts with increased disease resis- tance: Ho w are w e going to express it? T r ends in Biote chnolo gy 23 283–2 90. [13] Gu tterson, N. and Reuber, T. L. (2004). Regulation of disease resistance path- w ays by AP2/ERF transcription factors. C urr ent Opinions in Plant Biol o gy 7 465–71. [14] He r tz, G. Z. and Stormo, G. D. (1999). Identifying DNA and protein p atterns with statistically sig niﬁcant alignmen ts of multiple sequen ces. Bioi nf ormatics 15 563 –577. [15] Hi go, K., Uga w a, Y., Iw amoto, M. and Ko renaga, T . (1999). Plant cis -acting regulatory DNA elemen t s (PLAC E) database. Nucleic A cids R ese ar ch 27 297–300 . [16] Johnson, C., Bode n, E. and Arias, J. (20 03). S alicylic acid and N PR1 in d uce the recruitment of trans-activ ating TGA factors to a d efense gene promoter in Arabidop- sis. Plant Cel l 15 1846–1858. TRANSCRIPTION F ACT OR BINDIN G SITE PR ED ICTION 35 [17] Kap lan , B. et al. (2006). Rapid t ranscriptome c hanges ind uced by cytosolic Ca2+ transients reveal ABRE-related sequences as Ca2+-resp onsive cis elements in A ra- bidopsis. Plant Cel l 18 2733–274 8. [18] Keles, S., V an der Laan, M. J. and V u lpe, C. (2004). R egulatory motif ﬁn ding by logic regression. Bi oinformatics 20 2799–2811 . [19] Laloi, C. et al . (2004). The Arabidopsis cytosolic thioredo xin h5 gene induction b y o xidative stress and its W- b o x-mediated resp onse to path ogen elicitor. Plant Physi- olo gy 134 100 6–1016 . [20] Lebel, E., Heifetz , P., Thorne, L., Uknes, S. , R y als, J. and W ard, E. (1998). F u nctional analysis of regulatory sequences controlling PR-1 gene exp ression in Ara- bidopsis. The Plant J. 16 223–233. [21] Liu , X . S., Brutla g, D. L. and Liu, J. S. (2002). An algorithm for ﬁndin g protein- DNA bindin g sites with applications to chromatin immunoprecipitation microarray exp eriments. Natur e Bi ote chnolo gy 20 83 5–839. [22] Maleck, K. et al. (2000). The t ranscriptome of Arabidopsis thaliana du rin g systemic acquired resistance. Natur e Genetics 26 403–410. [23] Osley, M. A. (1991). The regulation of histone synthesis in the cell cycle. Annual R eview of Bi o chemistry 60 827–861. [24] Popescu, S . C. et al. (2007). Diﬀerential binding of calmo dulin-related proteins to their targets revealed th rough high-densit y A rabidopsis p rotein micro arra y s. Pr o c. Natl. A c ad. Sci . USA 104 4730–4735. [25] Pontier, D., Balague, C., Bezombes-Marion, I., Tronchet, M., Deslandes, L. and R oby, D. (2001). Identiﬁcatio n of a n o vel pathogen-resp onsive element in the promoter of the tobacco gene HSR203J, a molecular marker of th e hypersensitive response. Plant J. 26 495–507. [26] Rushton, P. J., Reinst ¨ adler, A., Lipka, V., Lippok, B. and Somssich, I. E. (2002). Sy nthetic plant promoters containing deﬁned regulatory elemen ts p ro vide nov el insights into pathogen- and w ound - induced signaling. Plant Cel l 14 749–762. [27] Se gal, R. and B erk, A. J. (199 1). Promoter activity and distance constraints of one versus tw o spl binding sites. J. Biolo gic al Chemistry 266 2040 6–2041 1. [28] Sp ellman, P. T., Sherlock, G., Zh a ng, M. Q., Iyer, V. R., A n ders, K., Eisen, M. B., Bro wn, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identiﬁcatio n of cell cy cle-regulated genes of the yeast sacc haromyces cerevisiae by microarra y hybridization. Mole cular Biolo gy of the Cel l 9 3273–3297. [29] Tro y anska y a, O. , Cantor, M., S herlock, G., Br o wn, P., Hastie, T., Tibshi- rani, R., Botstein , D. and R uss, B. (2007). Missing va lue estimation metho d s for DNA microarra ys. Bioinformatics 17 520–525. [30] Turck, F., Zhou, A . and Somssich, I. E. ( 2004). Stimulus-dep en dent, promoter- sp eciﬁc binding of transcription factor WRK Y1 to its n ative promoter and the defense-related gene PcPR1-1 in P arsley . Plant Cel l 16 2573–2585. [31] Ulker, B. and Somssich, I. E. (2004). WRK Y transcription factors: F rom DNA binding tow ards biological function. Curr ent Opinions in Pl ant Biolo gy 7 491–498. [32] UR L for VirtualPlant: http://vir tualplant - prod.bio.nyu.edu/cgi- bin/virtualplant.cgi . [33] Wi ldermuth, M. C., De wdney, J., W u, G. and Ausubel, F. M. (2001). Isocho- rismate sy nthase is required to synthesize salicylic acid for p lant d efence. Natur e 41 4 562–565 . [34] Wi ldermuth, M. C., T ai, Y. C., Dewdney, J., Denoux, C., S peed, T. P. and Ausubel, F. M. (2007). App lication of the MB-statistic to temp oral global Ara- bidopsis ex p ression data ov er course of p owdery mildew infection reveals integrated biological pro cesses. In preparation. 36 N. R. ZHANG, M. C. WI LDERMUTH AND T. P . SPEED [35] Y ang, T. and Poo v aiah, B. W. (2002). A calmodu lin- binding/CGCG b ox D N A- binding protein family involv ed in multiple signaling pathw a y s in plants. J. Biolo gic al Chemistry 277 45049–45058. [36] Y ang, T. and Poov a iah, B. W. (2003). Calcium/calmodu lin-mediated signal net- w ork in plants. T r ends in Pl ant Scienc e 8 50 5–512. [37] Yu, D., Ch en, C. and Che n , Z. (2001). Ev idence for an imp ortant role of WR KY DNA binding p roteins in the regulation of NPR1 gene ex p ression. Plant Cel l 13 1527–15 40. [38] Zhan g, N. R. , Wilderm uth, M. C. and Speed, T. R. (2008). Supplement to “T ranscription factor binding site prediction with multiv ariate gene ex pression data.” DOI: 10.121 4/07-A OAS142SUPP . [39] Zhan g, N. and Siegm und, D. (2007). A mod iﬁed Bay es information criterion with applications to comparativ e genomic hybridization data. Biometrics 63 22–32. [40] Zhan g, N. (2005). Change-p oint mod els and sequence aliguments: Statistical prob- lems of genomics. Ph.D. dissertation, Dept. Statistics, Stanford Univ. [41] Zhou, Q. and Wong, W. (2004). Cis Mod u le: De novo disco very of cis -regulatory mod u les b y hierarc hical mixture mo deling. Pr o c. Natl. Ac ad. Sci. 101 12 114–11 2119. N. R. Zhang Dep ar tment of St a tistics St anford Un iversity St anford, California 94305-406 5 USA E-mail: nzhang@stanford.edu M. C. Wildermuth Dep ar tment of Plant an d Micr ob ial Biol ogy University of Calif ornia Berkeley, California 94720-3 102 USA E-mail: wildermuth@nat ur e.ber keley .edu T. P. S peed Dep ar tment of St a tistics University of Calif ornia Berkeley, California 94720-3 860 USA E-mail: terry@stat.berkeley .edu

Transcription factor binding site prediction with multivariate gene expression data

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment