An Efficient Sufficient Dimension Reduction Method for Identifying Genetic Variants of Clinical Significance


Authors: Momiao Xiong, Long Ma

Momiao Xiong (momiao.xiong@uth.tmc.edu) and Long Ma (marlone.zj@gmail.com), Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030

Abstract

Fast and cheaper next-generation sequencing (NGS) technologies will generate unprecedentedly massive (thousands or even tens of thousands of individuals) and high-dimensional (up to hundreds of millions of variants) genomic and epigenomic variation data. In the near future, sequenced genomes will be a routine part of the medical record. A fundamental question is how to efficiently extract the genomic and epigenomic variants of clinical utility that will inform optimal wellness and intervention strategies. The traditional paradigm for identifying variants of clinical validity is to test the variants for association. However, significantly associated genetic variants may or may not be useful for the diagnosis and prognosis of disease. An alternative to association studies for finding genetic variants of predictive utility is to systematically search for variants that contain sufficient information for phenotype prediction. To achieve this, we introduce the concepts of sufficient dimension reduction (SDR) and the coordinate hypothesis, which project the original high-dimensional data onto a very low-dimensional space while preserving all information on the response phenotypes. We then formulate the problem of discovering clinically significant genetic variants as a sparse SDR problem and develop algorithms that can select significant genetic variants from up to millions or even tens of millions of predictors, aided by dividing the SDR problem for the whole genome into a number of sub-SDR problems defined over genomic regions.
The sparse SDR is in turn formulated as a sparse optimal scoring problem, but with a penalty that can remove row vectors from the basis matrix. To speed up computation, we develop a modified alternating direction method of multipliers to solve the sparse optimal scoring problem, which can easily be implemented in parallel. To illustrate its application, the proposed method is applied to simulated data, the NHLBI Exome Sequencing Project (ESP) dataset, and the TCGA dataset.

Introduction

The purpose of this paper is to formulate the problem of discovering clinically significant genetic variants as a sparse SDR problem and to develop algorithms that can select significant genetic variants from up to millions of predictors. To achieve this, we first show that SDR for the whole genome can be partitioned into a number of sub-SDR problems defined over divided genomic regions. Then, similar to Wang and Zhu's approach, we formulate the sparse SDR as a sparse optimal scoring problem, but with a penalty that can remove row vectors from the basis matrix. Since large-scale discovery may involve millions of genetic variants, solving the resulting large sparse optimal scoring problem requires heavy computation. To speed this up, we apply the alternating direction method of multipliers, which can easily be implemented in parallel. To illustrate its application, the proposed method is applied to simulated data, the NHLBI Exome Sequencing Project (ESP) dataset, and the TCGA dataset.

Methods

Sufficient Dimension Reduction

Throughout the paper we consider a continuous phenotype (response variable) and regression; in other words, we focus on quantitative trait analysis. However, all of the concepts discussed can be extended to a binary response variable and classification. Let Y be a univariate response variable (phenotype) and X a p-dimensional vector of predictors (genotypes at genetic variants).
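As a concrete toy instance of this setup (illustrative only; the data, dimensions, and link function below are our assumptions, not the paper's), the following sketch simulates a phenotype that depends on a p-dimensional predictor vector only through a single linear combination, so that a one-dimensional subspace carries all of the predictive information:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20                       # sample size and number of predictors

# True reduction direction: Y depends on X only through the index b^T X.
b = np.zeros(p)
b[:3] = [1.0, -1.0, 0.5]             # only the first three predictors matter

X = rng.standard_normal((n, p))      # stand-in for centered genotype data
index = X @ b                        # the sufficient 1-dimensional projection
Y = np.sin(index) + 0.1 * rng.standard_normal(n)

# A regression of Y on the scalar index captures essentially all the signal
# that the full p-dimensional X carries about Y.
print(round(np.corrcoef(np.sin(index), Y)[0, 1], 3))
```

Here the remaining p - 1 directions are pure noise, which is exactly the situation dimension reduction is meant to exploit.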
Since the dimension of genomic variation data is extremely high, dimension reduction is a powerful tool for reducing the impact of noise and irrelevant predictors in quantitative trait analysis and regression. Dimension reduction identifies the linear subspace that best preserves the information relevant to a regression (Nilsson et al. 2007). It can be unsupervised or supervised. Principal component analysis (PCA) is a typical unsupervised method, projecting the predictor data onto a linear space without using the response information. Supervised dimension reduction instead seeks the subspace that maximally reduces the dimension of the input while preserving the information necessary to predict the response. The currently popular supervised method is SDR, which aims to find a linear subspace $S$ such that the response $Y$ is conditionally independent of the covariate vector $X$, given the projection of $X$ on $S$:

$$Y \perp X \mid P_S X, \qquad (1)$$

where $\perp$ denotes independence and $P_S$ represents the projection onto $S$. In other words, all of the information in $X$ about $Y$ is contained in the space $S$. The subspace $S$ is referred to as a dimension reduction subspace. Such a subspace may not be unique; to describe it uniquely, we introduce the central subspace (CS), defined as the intersection of all dimension reduction subspaces $S$ satisfying the conditional independence assertion (1). The CS is denoted by $S_{Y|X}$. Many methods have been developed for identifying the CS. A popular one, sliced inverse regression, identifies the basis vectors of the CS by solving the eigenequation

$$\mathrm{cov}\big(E(X \mid Y)\big)\,\beta = \lambda\,\Sigma_x\,\beta, \qquad (2)$$

where $\lambda$ is an eigenvalue, $\beta$ the corresponding eigenvector, and $\Sigma_x = \mathrm{cov}(X)$. The solutions of eigenequation (2) yield the basis matrix $B = [\beta_1, \ldots, \beta_k]$ for $S_{Y|X}$.
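A minimal sketch of sliced inverse regression solving eigenequation (2) is shown below. This is our own illustrative implementation under simplifying assumptions (equal-size slices, well-conditioned covariance); the function and variable names are not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, Y, n_slices=10, n_dirs=2):
    """Sliced inverse regression: estimate CS basis vectors from the
    generalized eigenequation cov(E[X|Y]) beta = lambda * cov(X) beta."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Slice the sorted response and average the predictors within each slice.
    slices = np.array_split(np.argsort(Y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)   # covariance of slice means E[X|Y]
    Sigma = Xc.T @ Xc / n                      # cov(X)
    # Generalized symmetric eigenproblem M beta = lambda Sigma beta;
    # eigh returns eigenvalues in ascending order, so take the last columns.
    _, vecs = eigh(M, Sigma)
    return vecs[:, ::-1][:, :n_dirs]

# Toy check: single-index model whose true direction is e_1.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 10))
Y = np.exp(X[:, 0]) + 0.1 * rng.standard_normal(2000)
beta = sir_directions(X, Y, n_dirs=1)[:, 0]
print(abs(beta[0]) / np.linalg.norm(beta))     # close to 1
```

Slicing replaces the conditional expectation $E(X \mid Y)$ with within-slice averages, which is what makes the estimator cheap even for large n.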
Sparse SDR by the Alternating Direction Method of Multipliers

The eigenvalue problem can also be formulated as a constrained optimization problem (Chen and Li 1998; Wang and Zhu 2013):

$$\min_{\theta_i \in \mathbb{R}^k,\ \beta_i \in \mathbb{R}^p} \|Y\theta_i - X\beta_i\|_2^2 \quad \text{s.t.}\quad \frac{1}{n}\,\theta_i^T Y^T Y \theta_i = 1,\ \ \theta_i^T Y^T Y \theta_j = 0,\ j = 1, \ldots, i-1, \qquad (3)$$

for $i = 1, \ldots, d$, where $Y = [\eta(y_1), \ldots, \eta(y_n)]^T$ is the $n \times k$ matrix of transformed responses. To develop a sparse SDR that simultaneously reduces the dimension and the number of predictors, we first introduce a coordinate-independent penalty function. Let

$$B = [\beta_1, \ldots, \beta_d] = [\beta_{1*}, \ldots, \beta_{p*}]^T$$

be the $p \times d$ matrix that forms the basis matrix of the CS, where $\beta_{l*}$ denotes its $l$-th row. We introduce the following penalty function, which shrinks each variable in all reduction directions toward zero simultaneously (Chen et al. 2010):

$$P(B) = \sum_{l=1}^{p} \lambda_l \|\beta_{l*}\|_2 = \sum_{l=1}^{p} \lambda_l \big(\beta_{l*}^T \beta_{l*}\big)^{1/2}.$$

For simplicity of computation, we define $\lambda_l = \lambda\,\|\beta_{l*}\|_2^{-r}$, where $r > 0$ is a pre-specified parameter; we can choose $r = 0.5$ as Chen et al. (2010) suggested. After introducing the penalty function, the sparse version of the optimal scoring problem (3) can be defined as

$$\min_{\theta_i \in \mathbb{R}^k,\ \beta_i \in \mathbb{R}^p} \|Y\theta_i - X\beta_i\|_2^2 + \sum_{l=1}^{p} \lambda_l \|\beta_{l*}\|_2 \quad \text{s.t.}\quad \frac{1}{n}\,\theta_i^T Y^T Y \theta_i = 1,\ \ \theta_i^T Y^T Y \theta_j = 0,\ j = 1, \ldots, i-1,$$

for $i = 1, \ldots, d$. This is a bi-convex problem: it is convex in $\theta$ for each fixed $\beta$ and convex in $\beta$ for each fixed $\theta$, so it can be solved by a simple iterative algorithm that alternates between two steps: for fixed $\theta_i$ we optimize with respect to $\beta_i$, and for fixed $\beta_i$ we optimize with respect to $\theta_i$. The algorithm is given below.

Step 1 (Initialization). Let $D = Y^T Y / n$ and $Q_1 = [1, 0, \ldots, 0]^T$. We initialize $\theta_i^{(0)}$, $i = 1, \ldots, d$, by

$$\tilde{\theta}_i^{(0)} = \big(I - Q_i Q_i^T D\big)\,\theta^*, \qquad \theta_i^{(0)} = \tilde{\theta}_i^{(0)} \Big/ \big(\tilde{\theta}_i^{(0)T} D\, \tilde{\theta}_i^{(0)}\big)^{1/2}, \qquad Q_{i+1} = [Q_i : \theta_i^{(0)}],$$

where $\theta^*$ is a random $k$-vector.
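Step 1 can be sketched as follows. This is our own reading of the initialization, assuming the first transformation function is the constant 1 so that $Q_1 = [1, 0, \ldots, 0]^T$ is already $D$-normalized; the function and variable names are illustrative.

```python
import numpy as np

def init_scores(Y, d, rng):
    """Step 1: draw random k-vectors and project/normalize them so that the
    initial scores theta_1, ..., theta_d are D-orthonormal, D = Y^T Y / n."""
    n, k = Y.shape
    D = Y.T @ Y / n
    Q = np.zeros((k, 1))
    Q[0, 0] = 1.0                                    # Q_1 = [1, 0, ..., 0]^T
    thetas = []
    for _ in range(d):
        theta_star = rng.standard_normal(k)          # random k-vector theta*
        t = (np.eye(k) - Q @ Q.T @ D) @ theta_star   # strip components along Q_i
        t /= np.sqrt(t @ D @ t)                      # D-normalize
        thetas.append(t)
        Q = np.column_stack([Q, t])                  # Q_{i+1} = [Q_i : theta_i]
    return np.column_stack(thetas), D

# Toy check: transformed responses with a leading constant column.
rng = np.random.default_rng(0)
Y = np.column_stack([np.ones(200), rng.standard_normal((200, 5))])
T, D = init_scores(Y, d=3, rng=rng)
print(np.allclose(T.T @ D @ T, np.eye(3)))           # D-orthonormal scores
```

The projection $(I - Q_i Q_i^T D)$ plays the role of a Gram–Schmidt step in the $D$-inner product, which is what enforces the orthogonality constraints of problem (3) at initialization.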
Step 2 (Iteration). Iterate between $\beta^{(s)}$ and $\theta^{(s)}$ until convergence or until a specified maximum number of iterations is reached ($s = 1, 2, \ldots$):

Step A. For fixed $\theta_i^{(s-1)}$, $i = 1, \ldots, d$, solve the minimization problem

$$\min_{B^{(s)}} \sum_{i=1}^{d} \|Y\theta_i^{(s-1)} - X\beta_i^{(s)}\|_2^2 + \sum_{l=1}^{p} \lambda_l \|\beta_{l*}^{(s)}\|_2, \qquad (4a)$$

where $B^{(s)} = [\beta_1^{(s)}, \ldots, \beta_d^{(s)}] = [\beta_{1*}^{(s)}, \ldots, \beta_{p*}^{(s)}]^T$.

Step B. For fixed $\beta_i^{(s)}$, $i = 1, \ldots, d$, seek $\theta_i^{(s)}$, $i = 1, \ldots, d$, that solve the constrained optimization problem

$$\min_{\theta_i^{(s)} \in \mathbb{R}^k} \|Y\theta_i^{(s)} - X\beta_i^{(s)}\|_2^2 \quad \text{s.t.}\quad \frac{1}{n}\,\theta_i^{(s)T} Y^T Y \theta_i^{(s)} = 1,\ \ \theta_i^{(s)T} Y^T Y \theta_j^{(s)} = 0,\ j = 1, \ldots, i-1.$$

The solution of the above optimization problem leads to the nonlinear equation

$$\theta_i^{(s)} = \big(I - Q_i Q_i^T D\big)\, D^{-1}\, \frac{Y^T X \beta_i^{(s)}}{n\,\theta_i^{(s)T} Y^T X \beta_i^{(s)}}. \qquad (4b)$$

By Newton's method we obtain a solution $\tilde{\theta}_i^{(s)}$ of equation (4b), and set $\theta_i^{(s)} = \tilde{\theta}_i^{(s)} / \big(\tilde{\theta}_i^{(s)T} D\, \tilde{\theta}_i^{(s)}\big)^{1/2}$.

Results

To illustrate its application to the selection of clinically useful genetic variants, the proposed method was first applied to several simulated datasets. Fourier series were used as the transformation functions, with 30 Fourier functions in the simulations. We used the true positive rate (TPR), defined as the proportion of correctly identified predictors, to measure how well the method selects predictors. We considered two scenarios: 50 SNPs and 100 SNPs sampled from chromosome 1 in the NHLBI Exome Sequencing Project (ESP) dataset, which comprises 5,406 individuals and 1,779,016 SNPs. The simulation model was

$$y_i = \sum_{j=1}^{p} b_j x_{ij} + 0.1\,\varepsilon_i,$$

where $\varepsilon_i$ follows a standard normal distribution $N(0, 1)$. The results are summarized in Table 1.

Table 1. True positive rates for the two simulated datasets.
Dataset   Sample size   Number of SNPs   Number of causal SNPs   TPR
1         5,406         50               10                      100%
2         5,406         100              10                      100%

To further evaluate its performance, the proposed method was also applied to the real NHLBI Exome Sequencing Project (ESP) dataset with the HDL phenotype. We discovered 863 SNPs that contribute to the variation in HDL. The top 10 selected SNPs are listed in Table 2. It has been reported that the gene cholesteryl ester transfer protein, plasma (CETP) is associated with HDL (Braun et al. 2012), and that CD36 is associated with atherosclerotic cardiovascular diseases (Yuasa-Kawase et al. 2012).

Table 2. Top 10 selected SNPs.

CHR     SNP            Gene     P-value
chr16   rs34065661     CETP     9.624E-17
chr7    rs3211938      CD36     4.629E-13
chr6    rs17622        DDO      4.333E-08
chr8    rs111855567             5.861E-08
chr1    rs12088246     PTGFR    1.512E-07
chr19   rs1052983      LILRA6   1.57313E-07
chr7    rs10085732              1.66681E-07
chr19   rs1868953               3.45063E-07
chr19   rs117156027             3.45063E-07
chr1    rs79907831     SPOCD1   1.15888E-06

References

Chen, C.H., Li, K.C. (1998). Can SIR be as popular as multiple linear regression? Statistica Sinica 8, 289-316.
Nilsson, J., Sha, F., Jordan, M.I. (2007). Regression on manifolds using kernel dimension reduction. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR.
Wang, T., Zhu, L. (2013). Sparse sufficient dimension reduction using optimal scoring. Computational Statistics and Data Analysis 57, 223-232.
