Identifying differentially expressed transcripts from RNA-seq data with biological variation

Iden tifying diﬀeren tially expressed transcripts from RNA-seq data wit h biological v ariation P eter Glaus 1 , ∗ , An tti Honk ela 2 , ∗ † and Magn us Rattray 3 , ∗ , † 1 School of Computer Science, Univ er sit y of Manc hester, Oxford Road, Manchester M13 9PL, UK 2 Helsinki Institute for Information T echnology HI IT, Depar t men t of Computer Science, Univ ersity of Helsinki, Helsinki, Finland 3 Department of Computer Science and Sheﬃeld Institute of T rans lational Neuroscience, The Universit y of Sheﬃeld, Sheﬃeld S10 2HQ, UK No ve m b er 12, 2018 Abstract Motiv ation: High-throughput sequencing enables expres sion analysis at the level of individual transcripts. The a nalysis of transcriptome expressio n levels and diﬀerential expression estimation r e- quires a proba bilistic approach to pro perly account for ambiguity caused by s h ared exons and ﬁnite read sa mpling as well as the intrinsic biologica l v ar iance o f transcript expression. Results: W e present BitSeq (Bayesian Inference of T ra nscripts from Sequencing data), a Bayesian approach for estimation of trans cript ex pression level from RNA-seq experiments. Inferred relative expression is represented by Marko v c hain Mon te Carlo (MCMC) samples from the poster ior probabilit y distribution of a generative mo del of the read data. W e prop ose a nov el metho d for diﬀerential expr ession analysis across replicates which pr opagates uncertaint y fro m the sample-level mo del while mo delling biological v ariance using an expressio n-lev el-dep enden t prior. W e de m onstrate the adv antages of our metho d us ing sim ulated data as w ell a s an RNA-seq datas et with technical and biological replication for bo t h studied conditions. Av ailabilit y: The implement ation of t he transcriptome expression estimation and diﬀere n tial ex- pression analysis, B it Seq, has bee n written in C++ . Con tact: glau s@cs.man.a c. uk , an tti.honkel a@hiit.fi , M.Rat tray@sheff i eld.ac.uk 1 In tro duction High-throughput sequencing is an eﬀectiv e approac h for transcriptome analysis. This m e tho dolog y , also called RNA-seq, has b ee n us e d to analyse unkn o wn transcrip t sequences, estimate gene expression le v- els and study single n ucleotide p olymorphisms (W ang et al. , 2009). As sho wn by ot her researc h e rs (Mortaza vi et al. , 2008), RNA-seq p ro vides man y adv an tages o v er microarra y tec hnology , although ef- fectiv e analysis of RNA-seq data r e mains a c h a llenge. A fu ndamen tal task in the analysis of RNA-seq d a ta is the ident iﬁcation of a set of diﬀeren tially ex- pressed genes or trans c ripts. Results fr o m a d i ﬀeren tial expression (DE) analysis of in dividual transcripts are essen tial in a div erse range of problems suc h a s iden tifying diﬀerences b et ween tissues (M ortaza vi et al. , 2008), und e rstanding d e v elopmenta l c h a nges (Gra vele y et al. , 2011) and regulator such as microRNA tar- get p redicti on (Xu et al. , 2010). T o carry out an eﬀectiv e DE analysis it is imp ortan t to obtain accurate ∗ to whom corresp o ndence should b e addressed † These authors contributed eq uall y to this work. 1                                                                                !  "        #   $   # "        "                   %&      "           "                '       #(               ) *  + * , + ,  ' - . / 0               ,               ,                Figure 1: D iagram showing the BitSeq analysis p i p eline divided in to t wo separate stages. In Sta ge 1, transcript expression leve ls are estimated using reads from ind ivi dual sequencing exp erimen ts. In step 1, reads are alig ned to the transcriptome. In step 2, the probabilit y of a read originating from a giv en transcript P ( r n | I n ) is co m puted fo r eac h alignmen t based on Eq. (1 ) . These probabilities are used i n step 3 of the analysis, MCMC sampling fr o m th e p osterior d istribution in Eq. (3 ) . In Stage 2 of the analysis, the p osterior distributions o f transcript expression level s from m ultiple conditions and replicas are used to in fer the pr o b abilit y that transcrip t s are diﬀerent ially expressed. In step 4, a s uita ble normalisation for e ac h exp erimen t is estimated. The normalised expression samples are fur ther u sed to infer expression-dep enden t v ariance h yp erparameters in step 5. Using these results, r e plicates are summarized by estimating th e p er- condition mean expression for eac h transcript, E q. (4 ) , in step 6. Finally , in step 7, samples repr ese n ting the distribution of within-condition expr e ssion are used to estimate the p robabilit y of positiv e log ratio (PPLR) b et wee n conditions, w hic h is used to rank transcrip ts based on DE b elief. estimates of exp ression f o r eac h sample but it is equally im portant to pr o p erly account for all sour c es of v ari- ation, tec hn ic al and biological, to a void spur io us DE c alls (Robinson and S m yth, 2007; Anders and Hub er, 2010; Oshlac k et al. , 201 0 ). In this contribution we address b o th of these problems b y dev eloping i n te- grated probabilistic m odels of the read generation pro cess and the biological replication pro cess in an RNA-seq exp erimen t. During the RNA-seq exp erimen tal pro cedure, a studied sp ecimen of transcriptome is synthesised into cDNA, ampliﬁed, fr a gmen ted and th e n sequenced by a high-throughp ut sequ e n cing device. This pr ocess results in a d a taset consisting of up to hundreds of millions of short sequen c es, or r e ads, enco ding observ ed n u cl eotide sequences. T he length of the reads dep ends on the sequencing platform and currently t ypically ranges fr om 25 to 300 b a s e pairs. Reads ha ve to b e either assem b le d into tr a nscript sequen c es or aligned to a reference genome by an aligning to o l, to determine the sequence they originate from. With pr o p er sample pr e paration, the num b er of reads aligning to a certain gene is app ro ximately prop ortional to the abundance of fragmen ts of transcripts for that gene within the sample (Mortaza vi et al. , 2008) allo wing researc hers to study gene expression (Clo onan et al. , 2008; Marioni et al. , 2008). Ho w ev er, during the pro cess of tr a n scription, most euk ary otic genes can b e spliced into d i ﬀeren t transcripts which share parts of their s e quence. As it is the transcripts of genes that are b eing sequenced d uring RNA-seq, it is p ossible to distinguish b et ween individual transcripts of a gene. Several metho ds ha ve b een prop osed to estimate transcrip t expression lev els (Li et al. , 2010; Nicolae et al. , 2010; Katz et al. , 2010; T u rro et al. , 2011). F u rthermore, W ang et al. (2010) sho w ed that estimati ng gene expression as a sum of transcript expression leve ls yields more pr e cise results than inferring the gene expression by summing reads ov er all exons. Since the transcrip t of origin is undecidable for r e ads a ligning to shared subsequ e nce, estimatio n of transcript expression lev els has to b e completed in a probabilistic manner. Initial studies of transcript expression used the Exp ectatio n-Maximizatio n (EM) app roa c h (Li et al. , 2010; Nicolae et al. , 2010). Th is 2 is a maximum lik eliho od pro cedure whic h on l y provides a p oin t estimate of tr a n script abundance and do es not measure the uncertaint y in these estimates. T o o v ercome this limitatio n , Katz et al. (2010) used a Ba y esian app roa c h to captur e the p osterior distribu tio n of the transcrip t expression level s using a Marko v c hain Monte Carlo (MCMC) algorithm. T urro et al. (2011) ha ve also prop osed MCMC estimation for a mo del of read counts ov er r e gions that can corresp ond to exons or other suitable su bparts of tr a n scripts. In this con tribution w e presen t Bit Seq (Ba ye s i an Inference of T ranscripts from Sequencing data), a new metho d for inferring transcript exp ressio n and analysing expr e ssion c hanges b et wee n conditions. W e use a probabilistic m odel of th e read generation pro ce ss similar to th e mo del of Li et al. (2 010) and w e dev elop an MCMC algorithm for Bay esian in fe rence o ver the mo del. Katz et al. (201 0 ) dev elop ed an MCMC algorithm for a similar generativ e mo del bu t our mo del diﬀers from theirs b ecause we allo w for m u lt i-aligned r ea d s mappin g to diﬀeren t ge nes. F urthermore, w e inf e r the o v erall relativ e expression of transcripts across the transcriptome whereas Katz et al. (2010) f ocus on relativ e expression of transcripts from the same gene. W e ha ve implemente d MCMC u sing a collapsed Gibbs sampler to sample fr o m th e p osterior d istribution of mo del parameters. In man y gene expression stud ie s expression leve ls are u sed to select genes w it h diﬀerences in expression in t w o conditions, a pro ce s s r e ferred t o as a DE analysis. W e prop ose a no ve l metho d for DE analysis that includes a mo del of biological v ariance while also allo wing f o r the technical uncertaint y of transcrip t expression which is represented b y samples from the p osterior p robabilit y distribu t ion obtained from the probabilistic mo del of read generatio n. By retaining the f ull p osterior distrib utio n, r a ther than a p oin t estimate s ummary , we can propagate uncertaint y from the in i tial read summ a rization stage of analysis in to the DE analysis. Similar strategies ha ve b een sho wn to b e eﬀectiv e in the DE analysis of microarray data (Liu et al. , 2006; Rattra y et al. , 200 6) but giv en the in heren t un ce rtain ty of reads mapping to m ultiple transcrip t s w e exp ect t he approac h to brin g ev en more adv an tages for transcript- lev el DE analyses. F urth e rmore, th is m e tho d accounts for d e creased tec h nica l repro ducibilit y of RNA-seq for lo w-expressed tr a n scripts recent ly rep orted by Laba j et al. (2011) and can decrease the n u m b er of transcripts falsely id e n tiﬁed as diﬀerent ially expressed. 2 Metho ds The BitSeq analysis p ipeline consists of tw o m a in stages: transcript expr e ssion estimation and d iﬀe ren tial expression assessment, see Figure 1. F or the transcrip t expression estimation the inp ut data are single-end or pair-end rea d s fr o m a single sequencing run. The m e tho d pro duces samples from the inferred probabilit y distribution o v er transcripts’ expr e ssion lev els. This distribu ti on can b e summarized by the sample mean in case one is only intereste d in expression. The DE analysis uses p osterior samples of expression lev els from t w o or more conditions and all a v ailable replicates. The cond it ions a re summarized by inferr i ng the p osterio r distribution of condition mean expression. Samples from the p osterior distr i butions are compared to score the transcripts based on the b elief in change of exp ressio n lev el b et ween conditions. 2.1 Stage 1: T ranscript expression es timation The initial in terest when dealing with RNA-seq d at a is estimation of expression lev els within a sample. In this w ork, we focus on the t ranscript expression le v els, mainly represen ted b y θ = ( θ 1 , . . . , θ M ), the relativ e abundance of transcrip ts’ fr a gmen ts within the studied sample, where M is the total n umber of transcripts. This can b e furth e r transform e d into relativ e expression of transcripts θ ( ∗ ) m = θ m / ( l m ( P M i =1 θ i /l i )), where l m is th e length of the m -th transcript. Alternativ ely , expression can b e repr ese n ted by r e ads p er kilob ase p er mil lion mapp e d r e ads , RP K M m = θ m × 10 9 /l m , int ro duced by Mortaza vi et al. (2008). W e use a generativ e mo del of the d a ta, depicted in Figure 2, whic h mod el s the RNA-seq data as indep enden t observ ations of in dividual r e ads r n ∈ R = { r 1 , . . . , r N } , dep ending on the relativ e abundan c e 3 θ I n r n N Z n act θ act Figure 2: Graphical represent ation of the RNA-seq data probabilistic mo del. W e can consider the observ ation of reads R = ( r 1 , . . . , r N ) as N conditionally ind e p enden t ev en ts, with eac h observ ation of a read r n dep ending on the transcript (or isoform) it originated from I n . The probabilit y o f sequen c ing a giv en transcript I n dep ends on the relativ e exp ression of fragmen ts θ and the n o ise indicator Z act n . The noise indicator v ariable Z act n dep ends on noise parameter θ act , and indicates that the transcript b eing sequenced is regarded as n oi se, whic h enables observ ation of lo w qualit y and u n-mappable reads. of transcripts’ fragmen ts θ and a noise parameter θ act . The parameter θ act determines the n u m b er of rea ds regarded as noise and enables the mo del to accoun t f o r unm a pp ed r e ads as well as for lo w-qualit y reads within a sample. Based on the parameter θ act , indicator v ariable Z act n ∼ Bern( θ act ) determines whether read r n is considered as noise or a v alid sequence. F or a v alid sequence, the pro cess of s e quencing is b eing mo delled. Under the assumption of reads b eing uniformly s e quenced f rom the m o lecule fragments, eac h read is assigned to a transcript of origin by the indicator v ariable I n , which is giv en b y categorica l distribution I n ∼ Cat ( θ ). F or a trans c ript m we can express the probabilit y of an observed alignmen t as th e prob a b ili t y of c ho osing a sp eciﬁc p osition p an d sequencing a sequence of giv en length w i th all its m i smatc hes, P ( r n | I n = m ) = P ( p | m ) P ( r n | seq mp ). F or paired-end reads we compute the join t probabilit y of the alignmen t of a whole pair, in wh i c h case we also hav e to consider f rag men t length distribu ti on P ( l ), P ( r (1) n , r (2) n | I n = m ) = P ( p | l, m ) P ( l | m ) P ( r (1) n | seq mlp 1 ) P ( r (2) n | seq mlp 2 ) . (1) Details of alignment probabilit y computation in c luding optional p osition and sequence-sp eciﬁc bias cor- rection m e tho ds are pr e sen ted in Supp le men tary Material. F or ev ery aligned read, we also calculate the probabilit y that the r e ad is from neither of the aligned trans c ripts, but is r e garded as sequencing error or noise P ( r n | noise). Th i s v alue is calculated by taking the probabilit y of the least pr o bable v alid alignmen t corrupted with tw o extra b a s e mismatc hes. The join t probabilit y distribu ti on of the mo del can n o w b e written as P ( R, I , Z act , θ , θ act ) = P ( θ ) P ( θ act ) × Q N n =1  P ( r n | I n ) P ( I n | θ , Z act n ) P ( Z act n | θ act )  , (2) where w e u se w eak conjugate Diric hlet and Beta pr io r d istributions for θ and θ act , resp ec tiv ely . The p osterior d i stribution of the mo del’s p a rameters giv en the data R can b e simp liﬁed by inte grating o v er all p ossible v alues of Z act : P ( I , θ , θ act | R ) ∝ P ( θ ) P ( θ act ) Q n ; I n 6 =0  P ( r n | I n )Cat( I n | θ ) θ act  × Q n ; I n =0  P ( r n | noise)(1 − θ act )  . (3) According to th e mo del an y read can b e a result of sequen c ing either strand of an arbitrary transcript at a rand o m p osition. Ho wev er, the probabilit y of a read originating from a lo cation where it do es not align is n eg ligible. T h us the term P ( r n | I n )Cat( I n | θ ) θ act has to b e ev aluated only for transcripts and p ositions to wh i c h the read do es align. T o accomplish this we ﬁ rst align the reads to th e trans c ript sequences us i ng the Bo wtie alig nment to ol (Langmead et al. , 200 9), preservin g p ossible m ultiple ali gnmen ts to diﬀeren t transcripts. W e then pre-compute P ( r n | I n ) only for the v alid alignments. (See steps 1-2 in Figure 1.) 4 The closed form of the p osterior d ist ribution is n o t analytical ly tractable and an approxi mation has to b e u sed. W e can analytically m a rginalise θ and apply a collapsed Gibbs sampler to pro duce samp le s from the p osterior probabilit y distribution o ver I n (Geman and Geman, 1993; Griﬃths and Steyve rs, 2004). These are used to compute a p osterior for θ , wh ic h is the main v ariable of int erest. F ull up date equ a tions for the sampler are give n in S upplemen tary Material. In the MCMC appr o ac h, m ultiple chains are sampled at the same time and con verge n ce is monitored using th e b R statistic as describ ed by Gelman et al. (2 003). The b R sta tistic is an estimate of a p ossible scale reduction of the marginal p osterior v ariance and provi des a measure of usefulness of pro ducing more samples. P osterior samples of θ provide an a s se ssmen t of the abundance of individual transcripts. As w ell a s provi ding an accurate p oin t estimate of the expr essio n lev els through the mean of the p osterior, the p robabilit y d istribution pro v i des a measure of conﬁd ence for the results, which can b e used in f urther analyses. 2.2 Stage 2: Com bining data from m ultiple replicates and estimating diﬀeren tial expression T o ident ify transcripts th a t are truly diﬀerentia lly expr e ssed it is necessary to account for biological v ariation b y using replication for eac h experimental condition. Our metho d summarizes th e se replicates b y estimating the biolo gical v ariance and inferring p er-condition mean expression lev els for e ac h transcript. During the diﬀerentia l expression analysis we consider the logarithm of transcrip t expression lev els y m = log θ m . The mo del for d a ta originating fr o m multiple replicates is illustrated in Figure 3. W e u se a hierarc h i cal log-normal mo del of within-condition expr e ssion. The p rior o v er the b io logica l v ariance is dep enden t on the mean expr e ssion leve l across conditions and the prior parameters (h yp er-parameters) are learned from all of th e data by ﬁtting a n o n-parametric r e gression mo del. W e ﬁt a mo del for eac h gene using the expression estimates from Stage 1. A nov el asp ect of our S ta ge 2 app roa c h is that we ﬁt mo dels to p osterior samp l es obtained f rom the MCMC sim ulation from Stage 1, whic h c an be considered “pseudo-data” represen ting expression corrupted b y tec h nica l noise. A pseudo-data v ector in constructed using a single MCMC sample f o r eac h r e plicate across all conditions. Th e p osterior distribution o ve r p er-condition means is inferred for eac h pseud o- data v ector using the mo del in Figure 3 (describ ed b elo w). W e th e n use Ba yesia n mo del-a v eraging to com b i ne the evidence from eac h pseudo-data vecto r and determine th e p robabilit y of d iﬀe ren tial expression. This approac h a llows us to account for t he in trinsic tec hnical v ariance in the d a ta; it is also compu t ationally tractable b ecause the mod el for a s i ngle pseudo-data v ector is conjugate and therefore inferen c e can b e carried out exactly . This eﬀectiv ely regularize s our v ariance estimate in the case that the num b er of replicates is lo w. As sho wn in Section 3.5 this pro vides impro ved con trol of error rate s for w eakly expressed transcripts where the technical v ariance is large. F or a condition c we assume R c replicate datasets. The log- expression from replicate r , y ( cr ) m is assumed to b e distribu t ed according to a normal distribu tio n with cond i tion mean exp ressio n µ ( c ) m , norm a lised by replication sp eciﬁc constant n ( cr ) , and precision λ ( c ) m , y ( cr ) m ∼ Norm( µ ( c ) m + n ( cr ) , 1 /λ ( c ) m ). As our parameters represent the relativ e expression lev els in the sample, BitSeq implicitly incorp orates normalisation b y th e total num b er of r e ads or th e RPKM measur e , as w as done when generating the resu l ts in th i s p ublicati on. F u rther n o rmalisation can b e implemen ted u sing the normalisation constant n ( cr ) , w hic h is c onstan t for all transcripts of a given replicate and can b e estimated prior to probabilistic mo deling usin g , for examp l e, a quant ile based metho d (Robins o n and Oshlac k, 2010) or any other suitable tec hn ique. The condition mean expression is normally distribu te d µ ( c ) m ∼ Norm( µ (0) m , 1 / ( λ ( c ) m λ 0 )) with mean µ (0) m , whic h is empirically calculated from multi ple samples, and scaled pr ec ision λ ( c ) m λ 0 . T he p rior distribu t ion o ve r p er-transcript, co ndition speciﬁc precision λ ( c ) m is a Gamma d ist ribution with h yp erparameters α G , β G , whic h are ﬁxed for a group of tr a n scripts with similar expression lev el, G . 5 n (cr) λ 0 R c C M α G β G R c C λ m (c) μ m (c) μ m (0) y m (cr) Figure 3: Graph ic al model o f the biological v ariance in tr a nscript exp ressio n experim e n t. F or r e plicate r , condition c and tr a n script m , the observ ed log-expression lev el y ( cr ) m is normally distributed around the normalised condition mean expression µ ( c ) m + n ( cr ) with biologic al v ariance 1 /λ ( c ) m . The condition mean expression µ ( c ) m for eac h condition is normally distribu te d with o verall m e an exp ression µ (0) m and scaled v ariance 1 / ( λ ( c ) m λ 0 ). The inv erse v ariance, or p reci sion λ ( c ) m , for a giv en transcr i pt m follo ws a Gamma distribution with expression-dep enden t hyp erparameters α G , β G , whic h are constant f o r a group of transcripts G w i th similar expression. The hyp erparameters α G , β G determine the distribution o v er p er-transcript precision paramet er λ m whic h v aries with the expr essio n level of a transcript (see Sup pleme n tary Figure 3 of the supplementa ry material). F or this reason, we inferred these hyp erparameters from the d at aset for v arious lev els of expres- sion, prior to the estimation of precision λ m and mean exp ression µ m . W e u se d the s a me mo del as Figure 3 applied join tly to m u lt iple transcripts with similar empirical mean expression leve ls µ (0) m . W e s e t a u ni- form pr io r for the hyp erparameters, marginalized out condition means and precision, and u sed a MCMC algorithm to sample α G , β G . The samples o f α G , β G w ere smo othed b y Lo wess regression (Clev eland, 1981) against empirical m e an expression to p rodu c e a single pai r o f h yp erparameters for eac h group of transcripts with similar expression leve l. This mo del is conjugate and th u s leads to a clo sed fo rm p osterior distribu t ion. T his allo ws us to directly sample λ m and µ m giv en eac h pseudo-data ve ctor y m constructed from the Stage 1 MCMC samples: P ( µ m , λ m | y m ) = Q C c =1 Gamma( λ ( c ) m | a c , 1 /b c ) Norm  µ ( c ) m     µ (0) m λ 0 + P R c r =1 ( y ( cr ) m − n ( cr ) ) λ 0 + R c , 1 λ ( c ) m ( λ 0 + R c )  , (4) a c = α G + R c 2 , b c = β G + 1 2  ( µ (0) m ) 2 λ 0 + + P R c r =1  y ( cr ) m − n ( cr )  2 −  µ (0) m λ 0 + P R c r =1  y ( cr ) m − n ( cr )  2 λ 0 + R c ! . Samples of µ ( c 1 ) m and µ ( c 2 ) m are used to compu te the pr o b abilit y of expression lev el of transcript m in condition c 1 b eing greater than the expr e ssion lev el in condition c 2 . This is done b y counting the fraction of samples in whic h the mean expression from the ﬁrst condition is greater, that is P ( µ ( c 1 ) m > µ ( c 2 ) m | R ) = 1 N P N n =1 δ ( µ ( c 1 ) m,n > µ ( c 2 ) m,n ) which w e refer to as the Probabilit y of Po sitiv e Log-Rati o (PPLR). Here, n = 1 . . . N represents one s a m ple from the ab o ve p osterior distribution for eac h of N indep endent pseudo-data v ectors. Su bsequen tly , ord e ring transcripts based on P PLR pro duces a r a nking of most probable up-regulated and do wn-regulated transcripts. This kind of one-sided Ba y esian test has previously b een used f o r the analysis of microarray data (Liu et al. , 2006). 6 (a) A n ti-correlation of transcripts. (b) No observ able correla- tion. (c) Posterior distribution of expression level s for eac h transcript. Figure 4: In p lo ts (a) and (b) w e sho w the p osterior transcript expr e ssion densit y for pairs of transcripts from the s a me ge ne. This is a densit y map constructed using the MCMC expression samples for these three transcripts. In (c) we sho w th e marginal p osterior distribution of expression lev els of the same trans c ripts as illustrated by histograms of MCMC samp l es. The s e quencing d at a is from miRNA-155 stud y published b y Xu et al. (2010). 3 Results and Discussion 3.1 Datasets W e carried out exp erimen ts ev aluating b oth gene expression estimatio n accuracy as we ll as diﬀeren tial expression analysis precision. F or the ev aluation of bias correction eﬀects as w ell as comparison with other metho ds (T able 1) we used paired-end RNA-seq data from the Microarra y Qualit y Con trol (MAQC) pro ject (Shi et al. , 2006) (Short Read Archiv e accession num b er SRA012427 ), b ecause it cont ains 907 transcripts whic h w ere also analysed b y T aqMan qR T-PC R. The r esults fr o m qR T-PCR prob es are generally re- garded as groun d truth expression estimates for comparison of RNA-seq analysis metho ds (Rob erts et al. , 2011). W e used RefSeq refGene tr a nscriptome annotation, assembly NCBI36/hg18 in order to ke ep results consisten t with qR T-PCR data as w ell as previously pub lished comparisons by Rob erts et al. (2011). The second dataset used in our ev aluation was originally pub l ished by Xu et al. (201 0) in a stud y fo cused on id e n tiﬁcation of microRNA ta rgets and pro vides tec hnical as well as b io logical replicates for b oth studied conditions. W e u se this d a ta to illustrate th e imp o rtance of biologica l replicates for DE analysis (Figure 5, Supp le men tary Figure 3 f o r biological v ariance) and the adv anta ges of u sing a Ba yesia n approac h for b oth exp ressio n inference and DE analysis (Figure 4). F or the purp ose of ev aluating and comparing BitSeq t o existing diﬀeren tial expression analysis metho ds, w e crea ted artiﬁcia l RNA-seq datasets with kn o wn expression leve ls and diﬀerentiall y expressed tr a n scripts. W e selected all trans c ripts of chromosome 1 from human genome assem bly NCBI37/hg19 and simulate d t wo biologica l replicates for eac h of th e t w o conditions. W e initially sample th e expression for all replicates using the same m e an r e lativ e expression and v ariation b et ween replicates as were observ ed in the Xu et al. data estimates. Afterwa rds we randomly c ho o se one third of the transcrip ts a n d shift one of the conditions up or down b y a kno wn fold c hange. Give n th e adjusted expression lev els, we generated 300k single-end reads uniformly distributed along the transcripts. The reads w ere rep orted in F astq format with Phred scores rand omly generate d according to empirical distribution learned f rom the SRA012427 dataset. With the error pr o babilit y given by a Ph red score, we generated base mismatc hes along th e reads. 3.2 Expression lev el inference Figure 4 demonstrates the am b i guit y that may b e presen t in the pro cess of expr e ssion estimation. In Figures 4 (a ) and 4(b) we sh o w the densit y of samples from the p osterior distr ibution of expression lev els for t wo pairs of t ranscripts. The expression lev els of transcrip t s u c 010oho.1 and u c0 10ohp.1 (F ig. 4 (a)) 7 read mo del BitSeq Cuﬄinks RSEM MMSEQ uniform 0.7677 0.7503 0.7632 0.7 614 non-uniform 0.8011 0.8056 0.7633 — T able 1: C o mparison of expr e ssion estimation accuracy against T aqMan qR T-PCR d a ta and the ef- fect of non-uniform read distribution mo dels using co r rela tion coeﬃcien t R 2 of a v erage expression from three tec hnical replicates with the 893 matc hin g transcr i pts analysed by qR T-PCR. The sequencing data (SRA01242 7) is part of the MAQC p ro ject and was originally p ublished by Shi et al. (2006). are negativ ely correlat ed. On the other hand transcripts uc010oho.1 and uc001b wm.3 exhibit n o visible correlation (Fig. 4 (b)) in th e ir expression leve l estimates. Ev en though this kind of correlation does not ha ve to imply biological signiﬁcance, it d oes p oin t to tec h nical diﬃ c ulties in the estimation pro cess. These transcripts share a signiﬁcant amount of sequence and the consequent read mapping am biguity leads to greater un ce rtain ty in expression estimates (See S upplemen tary Figure 1(d) for transcript pr o ﬁ le ). Ba y esian inference can b e us e d to assess the uncertain t y due to suc h confoun ding factors, unlike th e maxim um lik eliho od p oin t estimates p ro vided by an EM algorithm. T he marginal p osterior p robabilit y of tran sc ript expr e ssion for eac h transcript is sho w n in Figure 4(c). In our analysis pip eline, the marginal p osterior distribu t ions are p ropaga ted into the diﬀeren tial expression estimation stage, th us the uncertain t y from expression estimation is tak en in to accoun t when assessing wh et her there is strong evidence that transcripts are diﬀerential ly expr e ssed. 3.3 Expression estimation accuracy and read distribution bias correction Initially , it was assumed that high-thr o u ghput sequencing pr oduces r e ads un iformly d istributed along transcripts. How ev er, more recent studies sho w biases in the read distrib utio n dep ending on the p osition and surroun ding sequence (Do h m et al. , 20 08; W u et al. , 2011; Ro b erts et al. , 2 011). Our generativ e mo del for transcript expression inference (Figure 2) includes a m odel of the und erlying read distribution w hic h in the P ( r n | I n = m ) term that is calculated as a p re-processing step. The current BitSeq imp le men tation con tains the option of us i ng a un iform read d e nsit y mo del or using the mo del p roposed by R o b erts et al. (2011) whic h c an account for p ositional and sequence bias. The eﬀect of co r rect ing for read d i stribution was analysed u sing the S RA0 12427 dataset and r e sults are presented in T able 1. W e also compare BitSeq w it h three other transcript expression estimatio n metho ds: Cuﬄinks v0.9.3 (Rob erts et al. , 2011), MMSEQ v0.9.18 (T urro et al. , 2011) and RSEM v1.1.14 (Li and Dewe y, 2011). The d at aset cont ains thr e e tec hn ic al replicates. These w ere analysed separately and the resulting estimates for eac h metho d we re a v eraged together. Sub sequen tly , we calculated the squ a red Pearson correlation co e ﬃcien t ( R 2 ) of the a verag e expression estimate and th e resu lt s of qR T-PCR analysis. All four metho ds used with th e default uniform read d istribution mo del pr o vide s i milar lev el of accuracy with BitSeq p erforming sligh tly b etter than the other three metho ds. Both BitSeq and Cuﬄinks use the same method for read distr ibution b ia s correction and pro vide impro v ement ov er the uniform mo del similar to improv emen ts previously rep orted b y Rob erts et al. (2011). W e used v ersion 0.9.3 of Cu ﬄinks (as us e d b y Rob erts et al. ) since we foun d that the most recen t stable v ersion of Cuﬄin ks (v ers i on 1.3.0) leads to muc h worse p erformance for b oth uniform and bias-corrected mo dels (see Sup plemen tary results Section 2.2). The RSEM pac k age u ses its o wn metho d for bias correction based on the relativ e p osition of fragments, which in th i s case did not improv e the expression estimation accuracy for the selected transcrip t s. W e we r e not able to compare the bias corrected results of MMSEQ (T urro et al. , 2011) due to an error in an external R p ac k age mseq used for the bias correction. Ho w ever, the bias correctio n of mseq pac k age itself was already compared against Cuﬄin ks on the same dataset sho w i ng sligh tly w orse accuracy and less imp ro v ement (Rob erts et al. , 2011). 8 (a) (b) (c) (d) Figure 5: Comparison of BitSeq to naive approac h for combining replicates with i n a condition for tr a n - script u c0 01a vk.2 of the Xu et al. dataset. (a) Initial p osterior distributions of trans c ript expression lev els for t wo conditions (lab e led C0, C1), with t wo biolog ical r e plicates eac h (lab eled R0, R1). (b) Mean expr e s- sion lev el for eac h condition u sing the naiv e app roac h for com bin ing replicate s. The p osterior distributions from replicates are joined into one d a taset for eac h condition. (c) Inferred p osterior distr i bution of mean expression lev el for eac h condition u sing the probabilistic mod e l in Fig ure 3. (d) Distribu ti on of diﬀerences b et w een conditions from b oth appr o ac hes show that the naiv e ap proac h leads to ov erconﬁden t conclusion. In case o f BitSeq, the ma jor impro v ement of accuracy originate s from using the eﬀec tiv e length normal- ization. T o compare the resu lt s w it h qR T-PC R, the relativ e expression of fragmen ts θ h a s to b e con verte d in to either r e lativ e exp ressio n of transcripts ( θ ∗ ) or RPK M units. Using th e bias corrected eﬀectiv e length for this con version leads to the h ig her correlation with qR T-PCR (Supplementary T able 1). T his means that using an expression measure adju ste d b y the eﬀectiv e length, suc h as RPKM, is more suitable than normalized read count s for DE analysis. F or more results comparing the transcrip t expression estimation accuracy and within gene relativ e expression accuracy , please refer to supp le men tary material Section 2.3. 3.4 Diﬀeren tial expression analysi s W e use the Xu e t al. dataset to demonstrate the DE analysis pro c ess of BitSeq. This dataset c on tains tec hnical and biologica l replication for b oth studied conditions. W e obs e rv ed signiﬁcan t diﬀerence b et w een biologica l and tec hnical v ariance of exp ression estimates (S upplemen tary Figure 3). F urthermore, the prominence of biological v ariance increases with transcript expression lev el. W e illustrate ho w BitSeq handles biological replicates to accoun t for this v ariance in Figure 5, by sh o wing the mo delling pro cess for one example transcript give n only tw o b io logica l replicates for eac h of tw o conditions. Figure 5(a) sh o ws histograms of expression le vel samples pr oduced in the ﬁ rst stage of our pip eli ne. BitSeq p robabilistic ally infers condition mean expression leve ls using all rep l icates. F or comparison, we used a n ai v e w ay of combining tw o replicates by combining the p osterior distributions of expression in to a single distribution. The resu lt ing p osterior d i stributions for b oth approac h e s are d e p ict ed in Figures 5(b) and 5(c). The probabilit y of diﬀeren tial expression for eac h transcript is assessed by compu t ing the diﬀerence in p osterior expr essio n distrib utio ns of the tw o conditions. Resulting d istributions of diﬀerences for b oth approac hes are p ortra ye d in Figure 5(d) with ob vious diﬀerence in the lev el of conﬁd e n ce . The naiv e approac h r e p orts high conﬁdence of u p-reg ulation in the second condition, with the p robabilit y of p ositiv e log ratio (PPLR) b eing 0 . 995. When biological v ariance is b eing c onsidered b y inferring the condition mean expression, the s i gniﬁcance of diﬀerentia l expression is d ec reased to PPLR 0 . 836. 9 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (a) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (b) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (c) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (d) Figure 6: R OC ev aluation of transcript lev el DE analysis using artiﬁcial dataset, comparing BitSeq with alternativ e app roa c hes. The curve s are av eraged o ver 5 runs with d iﬀe ren t set of transcripts b eing d iﬀe r - en tially expr e ssed b y fold change uniformly distrib uted in th e interv al (1 . 5 , 3 . 5). W e discarded transcrip t s without any r e ads in it ially generated as these provi de n o signal. P anel (a) s ho ws global a verage b eha viour while in (b), (c) and (d) transcripts w ere divided into 3 e qually sized groups based on the logged mean generativ e read count : [0 , 1 . 061), [1 . 061 , 2 . 940) , [2 . 940 , ∞ ), resp ectiv ely . 3.5 Assessing DE p erformance with simula ted data Using artiﬁcially simula ted d a ta with a pr e deﬁned set of diﬀerenti ally expressed transcripts, w e ev aluated our approac h and compared it w it h four other metho ds commonly used for diﬀerent ial expression analysis. DESeq (Anders and Hu ber, 2010), edgeR (Robinson et al. , 2010), ba ySeq (Hardcastle and Kelly, 2010) w ere d esig ned to op erate on th e gene lev el and Cuﬀd i ﬀ (T rapnell et al. , 2010) on the transcript lev el. Despite not b eing designed for this p urp ose , w e consider the ﬁr st th ree in this comparison as the use case is very similar and there are no other w ell kn o wn alternativ es b esides C uﬀdiﬀ that w ould us e replica tes for transcript lev el DE analysis. All other metho ds b eside Cu ﬀdiﬀ u se BitSeq Stage 1 trans c ript exp ression estimates conv erted to coun ts. Details regarding us e of these m e tho ds are provided in the Supp l emen tary material, Section 2.5. Figure 6 sho ws the ov erall results as w ell as split in to three parts based on the expression of the transcripts. The ROC c urve s w ere generated b y a v eraging o ver 5 runs with diﬀerent transcripts b eing diﬀeren tially expr e ssed and the ﬁgur e s are fo cused on the most signiﬁcant DE calls with false p ositiv e r a te b elo w 0 . 2. Ov erall (Figure 6(a)), BitSeq is the most accurate metho d, follo w ed ﬁ rst by baySeq, then edgeR and DESeq with Cuﬀdiﬀ fur t her b ehind. This tren d is esp ecially clear for low er expression lev els (Figure 6(b), 6(c)). The o veral l p erformance here is fairly lo w b ecause of high leve l of b i ologic al v ariance. F or highest expressed transcripts (Figure 6(d)), DESeq and ed g eR sho w slightl y higher true p ositiv e rate than BitSeq and baySeq, esp ecially at larger false p ositiv e rates. F urther d e tails and m o r e results from the DE analysis comparison can b e foun d in the sup plemen tary m a terial Section 2.5. 4 Conclusion W e h a v e present ed metho ds for transcript expression lev el analysis and diﬀerentia l exp ression analysis that aim to mo del th e un ce rtain ty present in RNA-seq datasets. W e used a Bay esian approac h to pr o vide a probabilistic mod e l of transcriptome sequencing and to samp le from the p osterior distribu ti on o f the transcript expr e ssion lev els. The m odel incorp orates read and alignmen t qu a lit y , adjusts for non-unif o r m read distributions and accoun ts for exp erimen t-sp eciﬁc f rag m e n t length distribution in case of paired-end 10 reads. The accuracy of inferred expression is comparable and in s o me cases o utp erforms other comp e t- ing metho ds. Nev ertheless, the ma jor b eneﬁt of using BitSeq for transcript expression inference is the a v ailabilit y of full p osterior d ist ributions useful for fur ther analysis. The inferr e d distributions of transcr i pt expression lev els can b e f urther analysed by the second stage of BitSeq for DE analysis. Giv en biological r e p lic ates, BitSeq accoun ts f o r the intrinsic noise and v ariation and pro duces more reliable estimates of expression lev els within eac h condition, thus pro vid i ng few er f a lse diﬀeren tial expression calls. W e wan t to highlight th a t in order to m a k e m o st accurate d i ﬀeren tial expres- sion assessment , exp erimen tal d e sign m ust include biologica l replication. BitSeq is capable of combining information from multiple biological a nd te chnical replicas and co mparing m ultiple conditions. F urther studies including m u lti ple replicates are n e cessary to inv estigate the eﬀects of library preparation an d biologica l v ariance. Ac kno wledgemen t F unding : This work was supp orted un der the Eu ropean ERASysBio+ initiativ e pro j e ct SYNER GY b y the Biotec hnology and Biologic al Sciences Research Council [BB/I00476 9/2 to M.R.] and the Academ y of Finland [135311 to A.H.]; b y the Academ y of Finland [121179 to A.H.]; and the IST Pr o gramme of the Europ ean Communit y , under th e P ASCAL2 Net w ork of Excellence [IST-2007-2168 86]. T his publication only reﬂects the authors’ views. References Anders, S. and Hub er, W. (201 0). Diﬀeren tial expressio n analysis for se quence count data. Genome Biol , 11 (10 ) , R106. Cleveland, W. S. (19 81). LOWESS: A Pr ogram for Smo othing Scatterplots by Robust Lo cally Weigh ted Reg ression. Am S tat , 35 (1). Clo onan, N. et al. (200 8). Stem cell tr anscriptome proﬁling via massive-scale mRNA sequencing. Nat Metho ds , 5 (7), 613–9 . Dohm, J. C., Lo t taz, C., Bor odina, T., and Himmelbauer, H. (2008). Substantial biases in ultra -short read data sets from high-throughput DNA sequencing. Nucleic A cids R es , 36 (16), e105. Gelman, A., Carlin, J. B ., Stern, H. S., and Rubin, D. B. (20 03). Bayesian Data Analysis . Chapman and Hall/CRC, 2 e dit ion. Geman, S. and Geman, D. (19 93). Sto ch astic relaxa t ion, Gibbs distributions and the Bayesian restoratio n of image. J Appl St a t , 20 , 25–62 . Grav eley , B. R. et al. (2011). The developmen tal tra nscriptome of Dro sophila melanogas t er. N a tur e , 471 (733 9), 473–4 79. Griﬃths, T. L. and Steyvers, M. (2004). Finding scientiﬁc topics. Pr o c Natl A c ad Sci USA , 101 Suppl , 522 8–35. Hardcastle, T. J. and K elly , K. A. (20 10). baySeq: Empiric al Bay esian Metho ds F or Identifying Diﬀerential Expres- sion In Sequence Count Data. BMC Bioinformatics , 11 (1), 422 . Katz, Y., W ang, E. T., Airoldi, E. M., and Burge, C. B. (201 0). Analysis and desig n of RNA sequencing exp erimen ts for iden tifying isoform regulatio n. Nat Metho ds , 7 , 1009– 1015. Laba j, P . P ., Leparc , G. G., Linggi, B. E., Ma rkillie, L. M., Wiley , H. S., and Kr eil, D. P . (2 011). Char acterization and improv ement of RNA-Seq pr ecision in q u antitativ e tr anscript expressio n proﬁling . Bioinforma tics , 27 (13), i383–i3 91. 11 Langmead, B ., T rapnell, C., Pop, M., and Salzb erg, S. L. (20 09). Ultr afast a nd memo ry-eﬃcien t alig nm ent of sho rt DNA sequences to the h uma n genome. Genome Biol , 10 (3), R25. Li, B . and Dew ey , C. N. (2011). RSEM: accurate transcr ipt quantiﬁcation from RNA-Seq data with or without a reference genome. BM C Bioinformatics , 12 , 32 3. Li, B., Ruotti, V., Stewart, R. M., Thomson, J . A., and Dew ey , C. N. (2010). RNA -Seq gene expressio n estimation with read mapping uncertaint y . Bio informatics , 26 (4 ) , 49 3–500. Liu, X., Milo, M., Lawrence, N. D., and Rattray , M. (2006 ). P robe-level measur emen t error improv e s a ccuracy in detecting diﬀerential gene expressio n. Bioinformatics , 22 (17 ), 2 107–13. Marioni, J. C., Mason, C. E ., Mane, S. M., Stephens, M., and Gilad, Y. (2008). RNA-seq: an assessment of technical repro ducibilit y and compa rison with gene ex pression arrays. Genome Res , 18 (9), 150 9–17. Mortazavi, A., Williams, B. A., McCue, K., Sc hae ﬀ e r, L., and W old, B. (2008). Mapping and quantifying ma m malian transcriptomes by RNA-Seq. Nat Metho ds , 5 (7), 621–8 . Nicolae, M., Mangul, S., M, I., and Zeliko vsky , A. (2010). Estimation of alterna tiv e splicing isofor m frequencies from RNA-Seq data. In Pr o c. 10th Int l Conf. on Algori thms in Bioinformatics , pag es 202–2 14. Springer- V erla g. Oshlack, A., Robinson, M. D., and Y oung, M. D. (2010). F r om RNA-seq reads to diﬀerential express ion results. Genome Biol , 11 (12), 220 . Rattray , M., Liu, X., Sanguinetti, G., Milo, M., and Lawrence, N. D. (2 006). Propa gating uncertain ty in microar ra y data analysis. Bri ef Bioinform , 7 (1), 37–4 7. Rob erts, A., T rapnell, C., Donaghey , J ., Rinn, J . L., and Pach ter, L. (2 011). Improving RNA-Seq expression estimates by corr ecting for frag men t bias. Genome Biol , 12 (3), R22. Robinson, M. D. and Oshlack, A. (2010). A sca ling nor malization metho d for diﬀerential expression ana lysis of RNA-seq data. Ge nome Biol , 11 (3), R25. Robinson, M. D. a nd Smyth, G. K. (2007 ). Modera ted statistical tests for a ssessing diﬀerences in tag a bun dance. Bioinfo rmatics , 23 (21), 2 881–2887 . Robinson, M. D., McCarth y , D. J ., and Sm yth, G. K . (2010). edgeR: a Bio conductor pack age for diﬀer en tial expression analysis of digital gene expressio n data. Bioi n f ormatics , 26 (1), 1 39–40. Shi, L. et al. (2006). The MicroArray Q ualit y Control (MAQC) pro ject shows inter- and in tr aplatform repro ducibility of gene expression measurements. Nat Biote chnol , 24 (9), 1 151–61. T rapnell, C. et al. (201 0). T ranscr ipt assembly and quantiﬁcation by RNA-Seq reveals unanno tated tra n scripts and isoform switching during cell diﬀerentiation. Nat Biote chnol , 28 (5), 516–5 20. T urro , E. et al. (2011). Haplo t yp e and isofor m sp eciﬁc expressio n e stimation using multi-mapping RNA-seq rea ds. Genome Biol , 12 (2), R13 . W ang, X., W u, Z ., a nd Zhang , X. (2010). Isofor m abundance inference provides a more accur ate estimation of gene expression levels in RNA-seq. J Bioinform Comput Biol , 8 , 177 –192. W ang, Z., Gerstein, M., and Snyder, M. (2009 ). RNA-Seq: a revolutionary to ol for tra nscriptomics. Nat R ev Genet , 10 (1), 57– 63. W u, Z., W ang, X., and Z h ang, X. (2011). Using no n-uniform read distribution mo dels to improv e isofor m expression inference in RNA-Seq. Bio informatics , 27 (4 ) , 50 2–508. Xu, G. et al. (2010). T ranscriptome a n d ta rgetome analysis in MIR1 55 expressing ce lls using RNA-seq. RNA , pages 1610– 1622. 12 Supplemen tary Information A Metho ds A.1 Alignmen t probabilities W e presen t the alignmen t probability computation for the case of paired end r ea d s. F or s i ngle reads, the terms related to fr a gmen t or in se rt length d i stribution and the other paired read disapp ear. F or a giv en transcript I n = m ; m ∈ { 1 , . . . , M } , the p robabilit y of observing a p a ir of reads ( r (1) n , r (2) n ) is determined by the probabilit y of the read b eing sequenced from a sp eciﬁc s t rand s at a s peciﬁc p osition p with a sp eciﬁc insert length l and the p robabilit y of rep orting th e r e ads after sequen c in g th e sequences ( seq (1) mlps , s e q (2) mlps ), P ( r (1) n , r (2) n | I n = m ) = P ( l | m ) P ( p | l , m ) P ( s | m ) P ( r (1) n | seq mlps ) P ( r (2) n | seq mlps ) . (5) Unless a strand sp eciﬁc sequencing proto col is used, the probabilit y of observing a read from either str a nd is th e same, P ( s | m ) = 1 / 2, and can b e ignored. Th e fragmen t length distribu tio n P ( l | m ) is assumed to b e log-normal with its parameters giv en by t he user or estimated from read p a irs with only a sin g le transcript alignmen t. The probabilit y of sequen c ing a giv en p osition is in general P ( p | I n = m, l ) = b m ( p ) P l m − l r +1 p =1 b m ( p ) . (6) where b m ( p ) d e notes bias for a particular p ositio n p on trans c ript m . F or a constan t b m ( p ) corresp onding to a uniform read d istribution, this r e duces to P ( p | m ) = 1 / ( l m − l r + 1) whic h only d epend s o n the lengths of the transcript l m and the read l r . W e calculate the p robabilit y of obser vi ng a s e quence based on the read’s qualit y base scores an d mismatc hes. The Phr e d sco r e can b e con verted in to probabilit y of base-calling error p err ,i . The ﬁn a l sequence probabilit y is now obtained as P ( r ( j ) n | seq mps ) = Y i ∈ matc hes (1 − p err ,i ) Y i ∈ mismatch es p err ,i , (7) where the pr o babilit y of err o r for a giv en base i is b a sed on the Phred score p err ,i = 10 − Phred i / 10 . A.1.1 Bias estimation Our mo del can easily incorp orate a correction for p osition and sequence sp eciﬁc biases. One example of such a mo del is presen ted by Rob erts et al. (20 11) f o r correcting the fragmen tation bias. Under th is mo del, we h a v e b m ( p ) = b s, 5 m ( e 5 ) b s, 3 m ( e 3 ) b p, 5 m ( e 5 ) b p, 3 m ( e 3 ) , (8) where b s, 5 m ( e 5 ) and b s, 3 m ( e 3 ) are the sequence sp eciﬁc biases f o r 5’ and 3’ end s of th e fr a gmen t, r e sp ectiv ely , and b p, 5 m ( e 5 ) and b p, 3 m ( e 3 ) are the corresp onding p ositional biases. W e us e separate v ariable length mark ov m odels to capture the bias for eac h end. The str uct ure of this mo del is the same as that of Rob erts et al. (2011), p resen ted in Figure 2 of the supplementa ry metho ds. F or the sequence bias th e se are b s, 5 m ( e 5 ) = 21 Y n =1 ψ 5 ,R n,π n ψ 5 ,U n,π n , (9) 13 whic h are based on 21 probabilities ψ 5 n,π n from 8 b a ses b efore and 12 bases after the read s t arting p osition. Here ψ 5 ,R refers to the biased and φ 5 ,U to a u niform mo del, n is a no de or a p osition, π n are the parents of no de n and ψ 5 n,π n is the pr o babilit y of base X at no de (or p osition) n giv en the b a s e s observ ed on paren t nod e s π n . T he mo del has 744 parameters in a ll, with eac h no de ha ving 0, 1 or 2 parents a s in the mo del of Rob erts et al. (2011). The parameters are estimated from empirical f requencies using reads with a single alignment. F or a read r aligning to trans c ript m we in c rease appropriate probabilities ψ 5 ,R b y 1 /θ m , w here θ m is an initial coarse expression estimate obtained by runn ing BitSeq with uniform read distribution mo del b eforehand. In the cont rasting uniform m odel for all K = l m − l r + 1 p ossible p ositions of r e ad of length l r , the appropriate p robabiliti es ψ 5 ,U are increased by 1 θ m K . The mo del b s, 3 m ( e 3 ) is similar. In addition to the sequence-sp eciﬁc bias, there is a mo del for p o sitional bias within the transcript. This is b p, 5 m ( e 5 ) = ω R l m ,e 5 /l m ω U l m ,e 5 /l m , (10) where ω l,p is th e probabilit y for starting p osition within transcrip t of length l on p osition p . The pr o b a- bilities are m odelled w i thin 5 transcrip t length bins and 20 bins of r el ativ e p osition. Th e probabilities are again estimated from empirical f requencies of reads with single alignmen ts taking into accoun t exp ression θ . A.2 Eﬀectiv e length compu tation F or the pu rp o se of rep orting normalized measure such as RP KM, θ , the relativ e expression of fragmen ts, has to b e normalized by the amount of r e ads or fr a gmen ts that can b e pro duced by a u nit of trans c ript. When assum ing uniform read distribution of single-end reads, this w ould b e l m − l r as the num b er of starting p ositions for a read of length l r . F or pair-end reads, the eﬀectiv e length of a transcript has to accoun t for fragmen t length distrib utio n as we ll, l ( ef f ) m = l m X l f =1 p ( l f | m ) ∗ ( l m − l f ) . (11) With the use of read distribu ti on with bias correction, w e learn m o re ab out the distribu ti on of fragments and th u s can u se this information wh en computing the eﬀectiv e length. In this case, the eﬀectiv e length tak es in to account bias weig ht for eve ry p osition of the transcript, l ( ef f + bias ) m = l m X l f =1 p ( l f | m ) l m − l f X p =1 b m ( p ) (12) As we show later in Section B.2 of this S upplemen tary material, u sing the bias corrected eﬀectiv e length can substantial ly improv e the accuracy of our metho d. A.3 Gibbs sam pling in exp ression estimation (Stage 1) W e apply a collapsed Gibbs samp l er for Stage 1 estimation by marginalising out the expression lev el and noise lev el parameters θ and θ act and iterativ ely resamp l ing the isoform assignmen ts I n of eac h r e ad giv en 14 the assignment s of other reads I ( − n ) . T he full up date ru l es f o r the sampler are P ( I n | I ( − n ) , R ) = Cat( I n | φ ∗ n ) , (13) φ ∗ n 0 = P ( r n | noise)( β act + C ( − n ) 0 ) / Z ( φ ∗ ) n , m 6 = 0; φ ∗ nm = P ( r n | I n )( α act + C ( − n ) + ) ( α dir + C ( − n ) m ) ( M α dir + C ( − n ) + ) / Z ( φ ∗ ) n , C ( − n ) m = P i 6 = n δ ( I i = m ) , C ( − n ) + = P i 6 = n δ ( I i > 0) , with Z ( φ ∗ ) n b eing a constant n o rmalising φ ∗ n to sum up to 1, and α dir = 1 , α act = 2 , β act = 2. As an alternativ e, it is also p ossible to u se a r eg u la r Gibb s sampler alternating b et we en samp l ing I n and θ . Th e corresp onding up date ru l es are P ( I n | θ , θ act , R ) = Cat( I n | φ n ) , (14) φ n 0 = P ( r n | noise)(1 − θ act ) / Z ( φ ) n , m 6 = 0; φ nm = P ( r n | I n ) θ m θ act / Z ( φ ) n , P ( θ | I , θ act , R ) = Dir( θ | ( α dir + C 1 , . . . , α dir + C M )) , (15) P ( θ act | I , θ , R ) = Beta( θ act | α act + N − C 0 , β act + C 0 ) , (16) C m = P N n =1 δ ( I n = m ) . This approac h is usu a lly less eﬃcien t in practice, though. A.4 Diﬀeren tial Expression model (stage 2) The Diﬀeren tial Exp ression (DE) mo del is sho w n in Figure 3 of the main pap er. W e consider data from conditions c = 1 . . . C with num b er of r e plicates for eac h condition denoted R 1 , . . . , R C . W e ﬁ t the mo del to ea c h transcr i pt m indep enden tly u sing “pseu do- data” y ( cr ) m = log θ ( cr ) m whic h is creat ed from MCMC samples from Stage 1. One sample of θ ( cr ) m is dra wn for eac h ( r , c ) combination to create a pseud o -data v ector y m of length P C c =1 R c . Inf e rence is carried out indep endently f o r eac h p se udo-data v ector and the results are then combined as describ ed in the main text. This allo ws the technical err o r from Stage 1 to b e p ropaga ted through the mo del. Sin c e the mo del is conjugate then the in fe rence for eac h ps e udo-data v ector is exactly tractable an d no further MCMC is r equired to sample the cond i tion means. A.4.1 P arameter estimation for ea c h tra ns cript The condition means a re denoted µ m = ( µ (1) m , . . . , µ ( C ) m ) and we are intereste d in inf e rring the p osterior distribution o v er the means given one pseudo-data v ector y m . T he mo del is d eﬁned as, y ( cr ) m ∼ Norm( µ ( c ) m , 1 /λ ( c ) m ) µ ( c ) m ∼ Norm( µ (0) m , 1 / ( λ 0 λ ( c ) m )) λ ( c ) m ∼ Gamma( α G , β G ) with hyper-p a rameters λ 0 , α G , β G whic h are estimated from group s of transcripts with similar mean ex- pression across conditions. The hyp er-parameter µ (0) m is ﬁ xed at th e empirical mean transcript expression across conditions. 15 p ( µ m , λ m | y m ) ∝ p ( y m | µ m , λ m ) p ( µ m ) p ( y m ) ∝ C Y c =1 p ( µ ( c ) m ) p ( λ ( c ) m ) R c Y r =1 p ( y ( cr ) m | µ ( c ) m , λ ( c ) m ) ∝ C Y c =1 Gamma( λ ( c ) m | a c , b c )Norm µ c      λ 0 µ (0) m + S y c λ 0 + R c , 1 λ ( c ) m ( λ 0 + R c ) ! a c = α G + R c 2 b c = β G + 1 2 λ 0 µ (0) m 2 + S 2 y c − ( λ 0 µ (0) m + S y c ) 2 λ 0 + R c ! where S y c d e notes P R c r =1 y ( cr ) m and S 2 y c denotes P R c r =1 y ( cr ) m 2 . A.4.2 Hyp er-parameter estimat ion across transcript groups F or hyp er-paramet er estimation w e consider a set of transcripts m = 1 . . . M ′ in a group g of tran sc ripts with similar expression. The h y perp a r a meter µ 0 is no w set to the mean expression p er g roup of transcripts and λ 0 is set to 2.0. W e hav e pseud o -data s a m ples y ( cr ) m for eac h tr a n script and we are in terested in h y perp a r ame ters α and β , wh e re β is the rate of Gamma d i stribution. The mo del is deﬁn ed as, y ( cr ) m ∼ Norm( µ ( c ) m , 1 /λ ( c ) m ) µ ( c ) m ∼ Norm( µ (0) m , 1 / ( λ ( c ) m λ 0 )) λ ( c ) m ∼ Gamma( α, β ) P ( α, β ) ∼ Uniform(0 , ∞ ) The hyper-parameter p osterior distribu ti on is given by , P ( α, β | y ) ∝ P ( α, β ) P ( y | α, β ) ∝ M ′ Y m =1 C Y c =1 P ( y c m | α, β ) ∝ M ′ Y m =1 C Y c =1 Z dλ ( c ) m p ( λ ( c ) m | α, β ) Z dµ ( c ) m P ( µ ( c ) m | λ ( c ) m ) R c Y r =1 P ( y ( cr ) m | λ ( c ) m , µ ( c ) m ) ∝ M ′ Y m =1 C Y c =1 β α Γ( α ) Γ( α + R c )  β + 1 2  λ 0 µ (0) m 2 + S 2 y c − ( λ 0 µ (0) m + S y c ) 2 λ 0 + R c  α + R c . This distribution is not in a stand a rd form and w e use Metrop olis-Ha s t ings Rand o m w alk MCMC to sample α and β . W e then use lo wess smo othing across groups to estimate the mean hyp e r-parameter for eac h transcript according to its emp iric al mean expression leve l across cond it ions. 16 (a) Anti-correla tion of transcripts. (b) No observ able correlation. (c) Anti-correlatio n of transcripts.. (d) T ranscript sequence proﬁle. Figure 7: In plots (a), (b) and (c) w e sho w the p osterior transcript exp ressio n dens i t y fo r pairs of transcr i pts from the same gene. This is a d ensit y m a p co nstructed using the MCMC expr e ssion sa mples for these three transcrip t s. In ( d) w e sho w t ranscript sequence proﬁle obtained fr o m the UCSC genome br o wser. The sequencing data is fr o m miRNA-155 study pu blished b y Xu et al. (2010). B Results B.1 T ranscript expression in ference In the main text (Figure 4) w e illustrate the correlation presen t in the expr essio n p osterior distribu tio n for tran sc ripts that share a large pr o p ortion of transcrib ed sequence. Here w e pr o vide all thr e e pairwise plots for the transcripts u c 010oho.1 , uc010ohp.1 and u c0 01bwm.3, whic h are the only transcripts of gene Q6ZMZ0 in the UCSC known Gene annotation. T he expression samples of uc001b wm . 3 and uc010ohp.1 (Figure 7(c)) are also negativ ely correlated. This means that the mo del is not able to d ec ide from w hic h transcript some of the reads originated and the p osterior d i stribution captur e s all viable assignmen ts. The transcript sequence pr o ﬁle in Figure 7(d) clearly d e monstrates the similarit y of the transcripts that causes higher uncertain t y when inferrin g the transcript expression leve ls. B.2 Read di stribution bias correction W e compared four diﬀeren t metho ds for expression estimation w hic h include bias co rrection options fo r non-uniform read distr i bution. The e xtended results are presen ted in T able 2, where w e rep o rt the R 2 correlation of 893 tr a n script expression estimates w it h the T aqMan qR T-PCR results. W e used ev ery metho d to analyse eac h of the three tec hn ic al replicates separately and th e n used the a v erage expression lev el for the comparison. As was already stated in the main pap er, the newest stable v ers i on of Cuﬄinks do es provi de the lo west correlation. W e resorted to using the version 0.9.3 w hic h wa s used in the pap er present ing the bias correction metho d adopted by BitSeq (Rob erts et al. , 2011). F or BitSeq, the ma jor ben e ﬁt of the bias correction alg orithm comes f rom the eﬀec tiv e transcrip t length normalisation. Relati v e expression of fragments used by BitSeq can b e con verted into r el ativ e exp ressio n of trans c ripts or into RP KM measure b y adjusting the expression b y eﬀectiv e length (see Su pplemen tary Section 1). In T able 2 w e compare three diﬀeren t approac hes for length normalisatio n. In the ﬁrst appr o ac h ( ∗ ), the expression is adju ste d by the length of a transcrip t. The second app roa ch ( † ) uses eﬀectiv e length taking in to account th e paired-end read fragmen t length distrib utio n and the num b er of all p ositions fr o m 17 Metho d ver. Read distribu ti on Av erage Rep. 1 Rep. 2 Rep. 3 BitSeq 0.4 uniform ∗ 0.7585 0.7575 0.7580 0 .7594 BitSeq 0.4 uniform † 0.7677 0.7672 0.7669 0 .7675 BitSeq 0.4 bias corrected ∗ 0.7565 0.7554 0.7561 0 .7573 BitSeq 0.4 bias corrected † 0.7652 0.76495 0.7647 0 .7652 BitSeq 0.4 bias corrected ‡ 0.8011 0.8018 0 .7959 0.8041 Cuﬄinks 0.9.3 uniform 0.7503 0.7470 0.7513 0 .7519 Cuﬄinks 0.9.3 bias corrected 0.8056 0.8018 0 .8050 0.8083 Cuﬄinks 1.3.0 uniform 0.5331 0.5130 0.5336 0 .5477 Cuﬄinks 1.3.0 bias corrected 0.684 2 0.6858 0 .6917 0.6446 RSEM 1.1.14 uniform 0.7632 0.7623 0.7628 0 .7640 RSEM 1.1.14 bias corrected 0.7633 0.7623 0 .7628 0.76409 MMSEQ 0.9.18 uniform 0.7614 0.76099 0.7606 0.7 620 T able 2: Ev aluation of tran sc ript expression inference algorithms u sing the SRA012427 RNA-seq d a ta and T aqMan qR T -PCR expression measures for 893 matc hin g transcripts. Reported v alues a re P earson R 2 correlation co eﬃci en t of the 893 tr a n scripts’ exp ressio n estimates and qR T-PC R results, b est correlation of a metho d using av eraged expression is highlighte d. F or eac h metho d we presen t v alues for a verage expression tak en from thr ee rep l icates as w ell as for eac h tec h nica l replicate separately . BitSeq was used with three d iﬀe ren t ve r sions of expression length n o rmalisation: ∗ – using actual transcript length, † – using eﬀectiv e length accoun ting for fragment length distribu t ion, ‡ – using eﬀectiv e length accounti ng for fragmen t length and read distr i bution bias. whic h a f rag m e n t could originate. The last approac h ( ‡ ), which p ro vides b est results on this dataset, uses eﬀectiv e length computed usin g the fr ag m e n t length distribu tio n as well as read distr i bution bias w eigh ts (see Equatio n 12). More careful in v estigatio n of t his p rocess is required, ho we v er it is limited by small n u m b er of RNA-seq d a tasets with known un derlying exp ression, esp eciall y w hen using paired-end reads. B.3 Assessing transcript expression inference using simula ted data W e used s i m u l ated d at aset of 10M paired-end r ea d s to examine th e expression estimation accuracy of Bit- Seq and compared it against other three p opular metho ds. The reads were generated based on expression estimates fr o m the Xu et al. d a taset with a fr a gment size distribu t ion l f ∼ LogNorm(5 . 32 , 0 . 12), in ferred from the S RA0 12427 d a taset. W e used the UCSC NCBI37/ h g1 9 kno wn Ge ne a nnotation transcripts to generate the read fragmen ts. First w e compared th e o verall expression accuracy against th e generativ e read coun t v alues (Figure 8). W e used read count in order to facilitate the second p a r t of our comparison, the assessment of the within gene relativ e expression estimation (T ab l e 3), for wh ic h the us e RPKM would not b e feasible. F or comparison of ov erall e xpression accuracy , w e rep o r t the Pearson R 2 correlation co eﬃcien t with the groun d tru t h. Th e co eﬃcien t wa s calculated for transcripts with at least one read generated (46841 transcripts). In t his c omparison RSEM ( R 2 = 0 . 998) has the highest correlation with M MSeq ( R 2 = 0 . 997) and BitSeq ( R 2 = 0 . 995) b eing closely b ehind. Unfortunately we again hav e to rep ort p o or resu lt s for the latest version of Cu ﬄinks ( R 2 = 0 . 307) with the version 0.9.3 s t ill p erforming wo rse than the other thr ee metho ds ( R 2 = 0 . 784) . In the withing gene exp ression comparison (T able 3), we u sed t wo cu t oﬀs for relev an t transcrip t s. The ﬁrst taking into account transcripts for whic h their gene has at least 10 r e ads in the ground truth (45662 transcripts) and the second considering on l y tr a n scripts f o r which th e gene has at least 100 reads (33757 transcripts). BitSeq p erforms the b e st for th e n a rro w range of trans c ripts with RSEM and MMSEQ ha ving comparable r esults. F or th e less stringent criteria, BitSeq still r e tains ve ry go od correlation with 18 (a) (b) (c) (d) Figure 8: Comparison of expression estimates using 10M simulated pair-end reads with kn o wn exp ression. The expr essio n estimates w ere con verted into estimated read count s for eac h transcript an d compared against ground truth using Log-Log plot. W e calculated Pe arson R 2 correlation co eﬃcie n t for transcrip t s with at least one generated p a ired-end read.The ﬁ gures show (a) BitSeq, (b) Cuﬄ inks v0.9.3, (c) RS EM, and (d) MMSEQ. the groun d truth w hile th e p erformance of th e other tw o m e tho ds deteriorates. As we are using the same dataset, b oth versions of C uﬄinks provide p o o r correlation when compared to other th ree metho ds. BitSeq Cuﬄinks Cuﬄin ks 0.9.3 RSEM MMSEQ ab o v e 10 reads 0.951 0.205 0.739 0.876 0.888 ab o v e 100 reads 0.964 0.17 6 0.787 0.945 0.948 T able 3: The R 2 correlation coeﬃcient of estimated within-gene relativ e expression and ground truth . Th e correlation was calculated for t wo group s, ﬁrst one con taining transcripts of genes with at least 10 r ea d s and the second one contai ning transcripts of genes w it h at least 100 r e ads according to the ground tru t h. 19 6 4 2 0 2 4 6 8 10 Log RPKM expression 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Averaged standard deviation within experiment technical replicate biological replicate Figure 9: Comparison of standard deviation of p osterior samples within single d a taset and com bined datasets of tec hnical replicates and biologic al replicates, with log RPKM expression on the x-axis and standard deviation of log RPKM expression on y-axis. The standard deviation is a sliding a verage o ver groups of transcrip t s with similar expression in ord er to highlight its dep endents on th e expression. B.4 Biological v ariance of RN A-se q d ata W e used RNA-seq data from the microRNA target id e n tiﬁ c ation study (Xu e t al. , 2010 ) to test and com- pare the d iﬀe ren tial expression analysis metho d used in BitSeq S t age 2. This dataset conta ins tec h nical as wel l as biolog ical r e plicates for eac h studied cond i tion allo wing assessmen t of the eﬀec ts of biolog ical v ariation. Similarly to previous resu lt s (And ers and Hub er, 2010; Osh la c k e t al. , 201 0 ), we observe signiﬁ- can t biological v ariation w it hin conditions. Figure 9 sh o ws the standard deviation of transcript expression lev el p osterior MCMC samples as a fu nctio n of the mean expression lev el of the tr a nscript. W e compare the standard d e viation f o r samples from with i n one exp erimen t, b et w een t wo tec hn i cal replicates and b e- t wee n tw o biological r eplic ates. In order to calculate the stand a rd deviation b et wee n replicate s w e took the squ a red ro ot of v ariance whic h wa s estimated b y computin g mean s quare distance b et w een samples. Plotted v alues are a v eraged for a sliding window of similarly expr e ssed tr a nscripts. The MCMC sample v ariation captures the intrinsic estimation v ariance in th e “within-exp eriment ” case. The tec hnical v ari- ance includes a contribution due to re-sequencing the same biological sample while the b i ologic al v ariance includes a con tribution du e to rep eating the exp erimen t. W e see that with h i gher expression the v ariation of the expr e ssion lev el estimation decreases as exp ected. A t high expr e ssion lev els the v ariance asso ciate d w it h tec h nica l replicates approac h es the lev el o f t he within- exp erimen t v ariance. On the ot her hand, the biologica l v ariance b ecome s rela tiv ely more signiﬁcant in this regime. Without consideration of biolo gical diﬀerences, high c onﬁdence of expression estimation of these tr a n scripts will lead to f a lse d i ﬀeren tial expression calls. It can also b e observ ed that the within- exp erimen t v ariance is a signiﬁcant con tribution to replicate v ariance (tec h nica l and biological) at lo wer expression lev els. Therefore th e intrinsic v ariance d ue to mappin g am b ig uit y and limited read depth, as 20 estimated b y our MCMC expression estimatio n pro cedure, will pro vide useful information for assessing replicate v ariance in this low expr e ssion regime. B.5 Assessing DE p erformance with sim ulated data 0.0 0.2 0.4 0.6 0.8 1.0 FPR 0.0 0.2 0.4 0.6 0.8 1.0 TPR BitSeq DESeq edgeR BaySeq Cuffdiff Figure 10: R O C curves av eraged o ver 5 ru ns w it h standard deviation depicted by error bars . T he curv e w as calculated for trans c ripts with a verage of at least one r e ad in the ground truth. The fold change wa s uniformly distrib uted in the int erv al (1 . 5 , 3 . 5). W e carried out extensive assessment of DE analysis accuracy of BitSeq with comparison to other metho ds. The Cuﬀd i ﬀ metho d from Cuﬄinks pac k age (T rapn e ll et al. , 2010) is the only other method designed for t ranscript lev el DE analysis that uses replicates and accounts for biological v ariation. W e also included th ree p opular m e tho ds whic h are primarily designed for gene level DE analysis (DESeq (Anders and Hub er, 2010), e dgeR (Robinson e t al. , 2010), ba yS e q (Hardcastle and Kelly, 2 010)), but give n the lac k of other options and their inpu t b eing on l y th e read coun t v ectors, they could b e considered for the tr a nscript level analysis use case as w ell. Using expression estimates obtained by BitSeq Stage 1, w e conv erted the relativ e expression of fragmen ts i n to read counts by simply m u lt iplying it by the total n u m b er of aligned reads and used th is as an in put for the gene-lev el m e tho ds. F or eac h of these metho ds w e used default parameter settings according to the p a ck ages’ vignettes. The Figure 10 sho w s th e same R OCs as Fig ure 6( a) in the main p a p er w i thout the 0. 2 cutoﬀ. The ev aluatio n is only for transcripts with at least one generated read on a ve rage with fold c h a nge b eing uniformly generated from the in terv al (1 . 5 , 3 . 5). In this ﬁgure, the error bars depict the standard deviation 21 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.1 0.2 0.3 0.4 0.5 0.6 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (a) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.1 0.2 0.3 0.4 0.5 0.6 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (b) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.1 0.2 0.3 0.4 0.5 0.6 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (c) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.1 0.2 0.3 0.4 0.5 0.6 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (d) 0.00 0.05 0.10 0.15 0.20 FPR 0.0 0.1 0.2 0.3 0.4 0.5 0.6 TPR BitSeq DESeq edgeR BaySeq Cuffdiff (e) Figure 11: Diﬀeren tial expression analysis of simulated data with v arious levels of fold change. The ﬁgures fo cus on the most relev an t regio n with false p ositiv e rate ab o v e 0. 2, and sho wing the y-axis up to true p ositiv e rate 0.65. Th e su b ﬁ g ures show data with fold change: 1.5 (a), 2.0 (b), 2.5 (c), 3.0 (d) and 5.0 (e). for the a verag ed curves s ho wing consistent results trough the exp eriment s. W e can see that BitSeq p erforms sligh tly b etter than the other metho ds with b a ySeq having higher tru e p ositiv e range in area with ab o v e 0.4 false p ositiv e r a n ge , ho wev er this area is not inte resting fr o m the application p ersp ectiv e. In the v ery last ﬁgu re (11), we compare the accuracy of these metho ds with resp ect to the fold c hange of diﬀeren tially expressed transcripts. W e again restrict the ﬁ g ures to the area with false positiv e rate b elo w 0.2 which in our op i nion is the most imp ortan t in terms of app li cabilit y . Instead of usin g randomly selected fold c hange, all diﬀerent ially expr e ssed transcripts are either u p-regulat ed or do wn -reg ulated by constan t fold c hange. The increa se of fold c hange clearly impro v es the p e rformance of the metho ds as w e exp ected. BitSeq and ba ySeq hav e consistently b etter results than the other m e tho ds except for the 22 lo we st fold c hange 1 . 5 , in whic h ba ySeq has the lo we st true p ositiv e rate and edgeR with DESeq outp erform BitSeq in h a lf of the sp ectrum. In all o f our DE e xp erimen ts, C uﬀdiﬀ, despite being designed for trans c ript lev el analysis p erforms w orse out of the 5 compared algorithm. This could b e largely attributed to the expression estimation problem, ho w ever for DE analysis r et urn to the older version (0.9.3) did not imp ro v e th e results, p ossibly b ecause of diﬀerent DE mo del. Our data also s ho ws that for most parts, the DESeq and edgeR metho ds pro duce v ery similar resu l ts in terms of accuracy . W e hav e to n o te, that ev en though w e tried to s i m u la te the data in wa y to resem ble real RNA-seq exp erimen ts, the data pr o v ed to b e r a ther hard for all m e tho ds b eing compared. References Anders, S . and Hub er, W. (2010). Diﬀeren tial expr e ssion analysis f o r sequence coun t data. Genome Biol , 11 (10), R106. Hardcastle, T. J. a n d Kelly , K. A. (2010). baySeq: Empir ic al Bay esian Metho ds F or Iden tifying Diﬀe ren tial Expression In Sequence C o u n t Data. BM C Bioinformatics , 11 (1), 422. Oshlac k, A., Robinson, M. D., and Y oung, M. D. (2010). F rom RNA-seq reads to diﬀeren tial expression results. Genome Bi o l , 11 (12), 220. Rob erts, A., T rapn ell , C., Donaghey , J., Rinn, J. L., and Pac h ter, L. (2011). Impro ving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol , 12 (3), R22. Robinson, M. D., McCarth y , D. J . , and Sm yth, G. K. (20 10). edgeR: a Biocond ucto r pac k age for diﬀerent ial expression analysis of d ig ital gene expression data. Bioinformatics , 26 (1), 139–4 0. T rapnell, C. et al. (2010). T r a nscript assembly and quantiﬁca tion b y R N A-Seq r e v eals un annotat ed tran- scripts and isoform s wit c hing during cell diﬀerent iation. Nat Biote chnol , 28 (5), 516–520 . Xu, G. et al . (2010). T ranscrip t ome and targetome analysis in MIR155 expr e ssing cells usin g RNA-seq. RNA , pages 1610–16 22. 23

Identifying differentially expressed transcripts from RNA-seq data with biological variation

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment