TaSer (TabAnno and SeqMiner): a toolset for annotating and querying next-generation sequence data

Summary: We develop TaSer (TabAnno and SeqMiner), a toolkit for annotating and querying next generation sequence (NGS) dataset in tab-delimited files. TabAnno is a powerful and efficient command-line tool designed to pre-process sequence data, annota…

Authors: Xiaowei Zhan, Dajiang J. Liu

PREPRINT

Xiaowei Zhan 1,+,* and Dajiang J. Liu 1,+,*

1 Department of Biostatistics, Center of Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109
* To whom correspondence should be addressed.
+ These authors contributed equally.

ABSTRACT

Summary: We develop TaSer (TabAnno and SeqMiner), a toolkit for annotating and querying next-generation sequence (NGS) datasets in tab-delimited files. TabAnno is a powerful and efficient command-line tool designed to pre-process sequence data, annotate variations and generate an indexed, feature-enriched project file that can integrate multiple sources of information. Using the project file generated by TabAnno, complex queries to the sequence dataset can be performed using SeqMiner, an R package designed to efficiently access large datasets. Extracted information can be conveniently viewed and analyzed by tools in R. TaSer is optimized and computationally more efficient than software using database systems. It enables annotating and querying NGS datasets using moderate computing resources.

Availability and implementation: TabAnno can be downloaded from GitHub (zhanxw.github.io/anno/). SeqMiner is distributed on CRAN (cran.r-project.org/web/packages/seqminer).

Contact: X.Z. (zhanxw@umich.edu), D.J.L. (dajiang@umich.edu)

1 INTRODUCTION

Large amounts of next-generation sequence (NGS) data have been generated using next-generation sequencing, in order to get a more detailed understanding of human genomic variations. The analysis of NGS data poses formidable computational challenges. Datasets of these sequence variations [e.g. in variant call format (VCF) (Danecek, et al., 2011)] can be very large in size, even after compression. It is usually impossible to load an entire file into computer memory. Querying information for a genomic region of interest can be challenging.

One standard approach to manipulating genetic data files is to use a database management system (DBMS) (San Lucas, et al., 2012), where a database project needs to be built for an NGS dataset to facilitate complex queries. Although a DBMS is very powerful for annotating and querying variants, building and updating projects for large NGS datasets require considerable computational resources. Even with a high-end computer server, it may still be necessary to divide large files and build multiple smaller projects in order to complete the analysis. For simple interval queries, an alternative tool, tabix (Li, 2011), can be used. Tabix is computationally efficient, but cannot be directly applied to make more complex queries, e.g. getting genotype information for functionally damaging variants in a gene/pathway.

To overcome limitations of existing tools, we developed TaSer (TabAnno and SeqMiner), a new toolset for annotating and querying sequence datasets, which combines the flexibility of DBMS and the efficiency of tabix. Our workflow starts by pre-processing the sequence dataset using TabAnno, a command-line annotator, and generating a feature-enriched, indexed project file that integrates annotation information of genetic variants and, optionally, external bioinformatics databases, e.g. pre-calculated PolyPhen scores (Adzhubei, et al., 2010). Using the indexed project file generated by TabAnno, complex queries can be performed via SeqMiner, an R package that integrates the tabix library and supports efficient random access to large datasets. Extracted information is available as standard R objects. Subsequent analysis of the dataset can be conveniently performed in R. Preprocessing and annotating NGS datasets using TaSer is much faster and more memory efficient than building database projects.
Additionally, our tools are integrated with R, an interactive/programmable environment, so they can be more powerful and flexible than DBMS for performing downstream statistical analysis, visualization, etc.

We compared the performance of TaSer with varianttools, a DBMS-based tool for annotating and extracting variants. We showed that TaSer is more time and memory efficient and can be applied to handle large NGS datasets with moderate computing resources. We also compared TaSer with VariantAnnotation, an R package with similar functions. We demonstrate that TaSer is much faster, simpler to use and offers unique functionalities, such as calculating summary metrics (e.g. transition/transversion ratio) for the entire NGS dataset. Additionally, TaSer supports more input file formats than both competing tools, which saves users the effort of preparing intermediate files.

2 DESCRIPTION

2.1 Annotate Sequence Datasets Using TabAnno

TabAnno is a tool for annotating sequence variants and integrating multiple sources of data. Given that standard annotation only requires chromosomal positions, reference and alternative alleles, TabAnno is designed to support generic tab-delimited files as input [e.g. VCF files of genotype calls or METAL (Willer, et al., 2010) files of summary statistics], which saves users the effort of preparing intermediate files.

TabAnno supports standard gene-based annotation via commonly used gene/transcript definitions, including refFlat or UCSC KnownGenes. Specifically, to annotate each mutation, a reading frame is first determined by the transcripts where the mutation lies. We then obtain the codons before and after the mutation from the reference genome. Synonymous/non-synonymous variants are annotated by whether or not the mutation induces changes in the amino acid, according to the universal genetic code.

It can also annotate genomic regions of interest, e.g. regions overlapping transcription factor binding sites. Specifically, external information defining the region of interest can be stored as BED files (Kent, et al., 2002), and used in TabAnno for annotations. A variety of bioinformatics databases are supported and can be incorporated in the annotation, e.g. pre-computed PolyPhen or GERP scores (Cooper, et al., 2005). More detailed features and usage of TabAnno can be found on the authors' website.

Several optimizations were implemented in TabAnno to enable efficient access to large datasets: (1) It can read/write bgzip-compressed files, which minimizes disk I/O, a major bottleneck for processing high-throughput data. (2) We pre-processed external databases via compression and indexing. For example, we store genomic regions (e.g. transcription factor binding sites) as an ordered array. Consequently, each record can be retrieved with a time complexity of O(log(n)), where n is the number of regions.

2.2 Perform Complex Queries with SeqMiner

Using the project file created by TabAnno, complex queries to NGS datasets can be performed via SeqMiner, an R package that integrates tabix, naturally inherits all its benefits and allows random access to feature-enriched tab-delimited files. Retrieved information is stored in standard R data objects (e.g. matrix or list). Subsequent data quality control (QC), visualization and analysis can be conveniently performed in R.

One major function of SeqMiner is to query compressed and indexed VCF files. With built-in functions, users can conveniently extract genotype information for variants that reside in a given gene, or belong to a certain mutation type (e.g. non-synonymous). Users can also specify and extract additional fields, such as genotype likelihoods or read depths.

In addition to VCF files, SeqMiner can support queries to generic tab-delimited files.
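The kind of query described in Section 2.2 (genotypes for variants inside one gene, restricted to one annotation type) can be sketched in a few lines. The snippet below is a toy Python illustration over an in-memory, position-sorted table, not SeqMiner's R interface; the record layout, the gene coordinates and the function name `genotypes_by_gene` are all assumptions made for illustration. Binary search over the sorted positions stands in for the indexed random access that tabix provides on disk.

```python
from bisect import bisect_left, bisect_right

# Hypothetical annotated records, sorted by position on one chromosome:
# (position, annotation type, genotype per sample as alt-allele counts).
VARIANTS = [
    (100, "Synonymous",    [0, 1, 0]),
    (150, "Nonsynonymous", [1, 1, 0]),
    (220, "Intron",        [0, 0, 2]),
    (310, "Nonsynonymous", [0, 2, 1]),
    (480, "Nonsynonymous", [1, 0, 0]),
]

# Hypothetical gene definitions: name -> (start, end), both inclusive.
GENES = {"GENE_A": (120, 350), "GENE_B": (400, 500)}

POSITIONS = [pos for pos, _, _ in VARIANTS]


def genotypes_by_gene(gene, anno_type):
    """Return (positions, matrix) for variants in `gene` with `anno_type`.

    The matrix has one row per selected variant and one column per sample.
    """
    start, end = GENES[gene]
    # Binary search finds the slice of records inside [start, end]
    # without scanning the whole table.
    lo = bisect_left(POSITIONS, start)
    hi = bisect_right(POSITIONS, end)
    selected = [v for v in VARIANTS[lo:hi] if v[1] == anno_type]
    return [v[0] for v in selected], [v[2] for v in selected]


positions, matrix = genotypes_by_gene("GENE_A", "Nonsynonymous")
print(positions)  # [150, 310]
print(matrix)     # [[1, 1, 0], [0, 2, 1]]
```

The nested lists here only mirror the shape of the result; per the text, SeqMiner returns such results as standard R objects (e.g. matrix or list).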
Our tools can, for example, query METAL files and extract association test statistics for variants in a given gene. Tutorials of the software can be found on the authors' website.

2.3 Benchmark and Exemplar Application

We evaluated TaSer and compared it with varianttools and VariantAnnotation on a desktop computer with 1 core of Xeon CPU X5660 2.80GHz. A dataset of variants on chromosome 1 from the 1000 Genomes Project (1000 Genomes Project, et al., 2012) was used. The dataset consists of a total of ~3 million variants from 1,092 individuals (11 Gb after compression).

TaSer and varianttools follow a similar workflow, in that the whole dataset is first processed in order to facilitate subsequent complex queries. Applying TabAnno, we annotated the VCF file using refFlat, a commonly used gene/transcript definition for the UCSC genome browser, incorporating pre-calculated PolyPhen scores, and output a feature-enriched project file. The whole dataset was processed in 1.66 hours, and the peak memory usage was 43 Mb. In order to use varianttools, a database project needs to be created. The process of building the database took 28.7 hours, and the peak memory usage was 648 Mb. It is clear that the DBMS-based method is computationally intensive in data preparation, which makes it challenging to apply to NGS datasets with many thousands of samples. We also evaluated both tools for making queries. Specifically, we extracted information on non-synonymous variants in 100 randomly selected genes in the feature-enriched VCF files, which took ~10 seconds for both tools.

We compared our tools with VariantAnnotation, an R package that supports complex queries to VCF files, leveraging dynamically loaded bioinformatics databases. It does not require annotating the whole dataset to perform complex queries. Therefore it may be more flexible than our tools when it needs to switch to alternative annotations.
However, when the same set of annotations is repeatedly used (e.g. in genetic association studies), SeqMiner can be more efficient for performing queries from feature-enriched files. For example, after loading necessary databases, VariantAnnotation took ~3.5 minutes to extract 100 genes on chromosome 1, which is >20 times slower than SeqMiner (10 seconds). Moreover, SeqMiner is more self-contained, requires fewer steps to perform the query, and stores results in standard R objects (e.g. matrix or list) for easier manipulation. An additional advantage of TaSer over VariantAnnotation is that, besides making queries, it can calculate a variety of summary statistics for the entire NGS dataset, such as the transition/transversion ratio, which is an important metric for NGS data QC.

TaSer supports generic tab-delimited files, which represent a broader class of input formats than what varianttools and VariantAnnotation support. We also evaluated the performance of TaSer for annotating and querying generic tab-delimited files. Specifically, we simulated phenotype information, and generated METAL files of single-site association test statistics for each variant in the 1000G dataset. The input files of summary statistics (251 Mb after bgzip compression) were annotated in 76 seconds and it took SeqMiner 7 seconds to extract summary statistics from 100 randomly chosen genes. This shows that our tools are capable of handling large datasets on a single desktop computer. VariantAnnotation and varianttools do not directly support generic tab-delimited files, and therefore were not compared on this dataset.

3 DISCUSSION AND CONCLUSION

We developed TaSer, a toolkit for annotating and querying NGS datasets. It complements existing software tools, and provides a valuable platform for statisticians to apply existing tools and develop novel methods for analyzing NGS data.
Although it is mainly developed for analyzing human genetic data, it can also be used for other organisms, e.g. prokaryotes. If gene definitions are provided and properly formatted, our tools can be applied to annotate and query sequence datasets from these organisms as well. TaSer is currently being deployed to process NGS data from many thousands of individuals in our projects. We envision that it will continue to make valuable contributions to the analysis of NGS data.

ACKNOWLEDGEMENT

We would like to thank Drs. Gonçalo Abecasis, Hyun Min Kang and Yanming Li for helpful discussions.

REFERENCES

1000 Genomes Project, C., et al. (2012) An integrated map of genetic variation from 1,092 human genomes, Nature, 491, 56-65.
Adzhubei, I.A., et al. (2010) A method and server for predicting damaging missense mutations, Nature methods, 7, 248-249.
Cooper, G.M., et al. (2005) Distribution and intensity of constraint in mammalian genomic sequence, Genome research, 15, 901-913.
Danecek, P., et al. (2011) The variant call format and VCFtools, Bioinformatics, 27, 2156-2158.
Kent, W.J., et al. (2002) The human genome browser at UCSC, Genome research, 12, 996-1006.
Li, H. (2011) Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, 27, 718-719.
San Lucas, F.A., et al. (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools, Bioinformatics, 28, 421-422.
Willer, C.J., Li, Y. and Abecasis, G.R. (2010) METAL: fast and efficient meta-analysis of genomewide association scans, Bioinformatics, 26, 2190-2191.
