Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Statistics Surveys Vol. 2 (2008) 43–60
ISSN: 1935-7516
DOI: 10.1214/07-SS026

Yulan Liang
Department of Biostatistics, University at Buffalo, The State University of New York, Buffalo, NY 14214, USA
e-mail: yliang@buffalo.edu

Arpad Kelemen
Department of Neurology, Buffalo Neuroimaging Analysis Center, The Jacobs Neurological Institute, University at Buffalo, The State University of New York, 100 High Street, Buffalo, NY 14203, USA
e-mail: akelemen@buffalo.edu

Abstract: Recent advances of information technology in biomedical sciences and other applied areas have created numerous large diverse data sets with a high dimensional feature space, which provide us a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, great challenges are also created, driven by the continuous arrival of new data that requires researchers to convert these raw data into scientific knowledge in order to benefit from it. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods with an emphasis on how to identify interacting loci.

Keywords and phrases: Complex disease, High dimensional data, Single Nucleotide Polymorphism, Statistical methods.
Received June 2007.

∗ This paper was accepted by Michael Kosorok, Associate Editor for the IMS.

Y. Liang and A. Kelemen/Statistical advances for SNP data

Contents
1 Introduction
2 Feature selection methods for high dimensional problems
3 SNP selections in genome-wide association studies
3.1 Statistical measures and testing for SNP-disease association
3.2 Supervised statistical models and statistical learning algorithms
3.3 Unsupervised haplotype mapping approaches
3.4 Computational intelligence approaches
4 Other challenges in genetic association studies of complex diseases
5 Discussion
References

1. Introduction

Correlating genetic variations in DNA sequences with phenotypic differences has been one of the grand challenges in biomedical research. Substantial efforts have been made to obtain all common genetic variations in humans, including single nucleotide polymorphisms (SNPs), deletions and insertions [13]. SNPs are single base pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater [13]. In practice, the term “SNP” is used more loosely. Restricting attention to common SNPs with a minor allele frequency above a certain cutoff, e.g. 1%, helps to filter out some “recent” mutations. SNPs are believed to alter the risk for developing particular diseases. It is, however, very unlikely that individual SNPs play an important role in the development of complex diseases.
Instead, high-order interactions of SNPs are supposed to explain the differences between low and high risk population groups.

The HapMap Project has collected genotypes of millions of SNPs from populations with ancestry from Africa, Asia and Europe and makes this information freely available in the public domain [93–95]. Finding evidence of association in this huge data set is now a grand challenge. Therefore, there is a great need, conceptually as well as computationally, to develop advanced robust algorithms and analytical methods for characterizing genetic variations that are non-redundant and for identifying the target SNPs that are most likely to affect the phenotypes and ultimately contribute to disease development.

Exploiting information redundancy due to associations between SNP markers potentially reduces the effort, in terms of time and cost, for genetic association studies [75]. However, the search for an optimal set of SNPs has not been as successful as expected in theory. One primary cause is the high dimensionality with highly correlated features/SNPs, which can hinder the power to identify small to moderate genetic effects in complex diseases. The need to incorporate covariates of other environmental risk factors as effect modifiers or confounders further worsens the “curse of dimensionality” problem in mapping genes for complex diseases [16]. Therefore, feature selection for massive genomic data in high dimensions has become one of the main tasks to be tackled with statistical and computational efforts in the past decade.

2. Feature selection methods for high dimensional problems

The computational and statistical methods that address the “curse of dimensionality” problem in genomic research can be grouped into three categories: filtering, wrapper, and embedded methods. Filtering methods select feature subsets independently from the learning classifiers and do not incorporate learning. Therefore, filtering methods are fast [10; 60; 69; 109]. A weakness of filtering methods is that they only consider the individual features in isolation and ignore the possible interactions among them. Yet, the combination of these features may have a combined effect that does not necessarily follow from the individual performances of the features in the group [73]. One of the consequences of filtering methods is that we may end up with many highly correlated features/SNPs with highly redundant information, which worsens the classification and prediction performance. If there is a limit on the number of features to be chosen, then we may not be able to include all the informative ones.

To address this problem in filtering methods, wrapper methods wrap around a particular learning algorithm that can assess the selected feature subsets in terms of the estimated classification errors and then build the final classifier [44]. Wrapper methods use a learning machine to measure the quality of subsets of features. One of the well-known wrapper methods for feature selection is Support Vector Machine Recursive Feature Elimination (SVMRFE), which refines the optimum feature set by using a Support Vector Machine [33]. The idea of SVMRFE is that the orientation of the separating hyperplane found by the SVM can be used to select informative features: if the plane is orthogonal to a particular feature dimension, then that feature is informative, and vice versa.
SVMRFE uses the weights of an SVM classifier to produce a feature ranking, and then recursively eliminates the feature with the smallest weight magnitude.

Wrapper methods can be used with arbitrary classifiers and can notably reduce the number of features and significantly improve the classification accuracy [63; 79]. However, wrapper methods have the drawback that they do not incorporate knowledge about the specific structure of the classification or regression function [52]. Moreover, they are more computationally expensive since they need to evaluate a cross-validation scheme at each iteration.

With much better computational efficiency and similar performance to wrapper methods, a relatively new class of approaches for feature selection called “embedded methods” has become available in the literature. Lal et al. [52] provide the detailed mathematical formulations of embedded methods. Embedded methods perform feature selection simultaneously with the learning of the classifier, so the feature selection cannot be separated from the learning. For example, Weston et al. [107] measure the importance of a feature using a bound that is valid for Support Vector Machines only; thus it is not possible to use this method with, for example, decision trees. Therefore the structure of the class of functions under consideration plays a crucial role. For an embedded method, every subset of features is modeled by a vector $\sigma \in \{0, 1\}^n$ of indicator variables, $\sigma_i := 1$ indicating that a feature is present in a subset and $\sigma_i := 0$ indicating that a feature is absent ($i = 1, \ldots, n$).

A parameterized family of classification or regression functions is given as follows:

$$f : \Lambda \times \mathbb{R}^n \to \mathbb{R}, \quad (\alpha, x) \mapsto f(\alpha, x).$$
The goal of an embedded method is to find a vector of indicator variables $\sigma^* \in \{0, 1\}^n$ and parameters $\alpha^* \in \Lambda$ that minimize the expected risk

$$R(\alpha, \sigma) = \int L[f(\alpha, \sigma * x), y] \, dP(x, y),$$

where $*$ denotes the pointwise product, $L$ is a loss function and $P$ is a measure on the domain of the training data $(X; Y)$. One may impose additional constraints, in the form of penalties or regularization, to achieve sparseness: $s(\sigma) \le \sigma_0$, where $s : [0, 1]^n \to \mathbb{R}^+$ measures the sparsity of a given indicator vector $\sigma$. For example, $s$ could be defined as $s(\sigma) := l_0(\sigma)$, so that the constraint $l_0(\sigma) \le \sigma_0$ bounds the zero “norm” $l_0(\sigma)$, which counts the number of nonzero entries in $\sigma$. The L1-norm, L2-norm, and L∞-norm penalties, or the elastic-net penalty, a mixture of the L2-norm and the L1-norm penalties [105], have also been proposed to achieve automatic feature selection by shrinking the fitted coefficients toward zero. These automatic feature selection methods also benefit from the reduction in the variance of the fitted coefficients.

One of the merits of an embedded method is that it aims to find the feature subset of a certain size that leads to the best possible generalization, or equivalently to minimal risk, which can be seen from the above formulation. Therefore, the function that measures the quality of a scaling factor can be evaluated faster than a cross-validation error estimation procedure. Moreover, embedded methods turn the multiple testing problem for feature selection into an optimization problem in the nonparametric setting. Some recent studies [90; 105] have shown that they are more computationally efficient and asymptotically optimal for high dimensional data.

Embedded methods tend to have higher capacity than filtering methods and are therefore more likely to overfit. We thus expect filtering methods to perform better if only a small amount of training data is available.
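To make the effect of a sparsity-inducing penalty concrete, the following minimal sketch fits an L1-penalized least-squares model by cyclic coordinate descent. It is an illustration under stated assumptions rather than a method taken from the reviewed papers: the function names `soft_threshold` and `lasso_coordinate_descent`, the toy data and the penalty value `lam` are all hypothetical, and the columns of `X` and the response `y` are assumed to be centered so that no intercept is needed.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n) * sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|.

    X is a list of n rows with p features each; no intercept is fitted,
    so y and the columns of X are assumed to be centered (an assumption
    of this sketch).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # Per-feature curvature: (1/n) * sum_i x_ij^2.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    residual = list(y)  # residual = y - X beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual, i.e. the
            # residual with feature j's own contribution added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Hypothetical toy data: only the first feature drives the response.
X = [[-1, 1, 1], [-1, -1, 0], [0, 1, -1], [0, -1, -1], [1, 1, 0], [1, -1, 1]]
y = [2.0 * row[0] for row in X]
beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)
```

In this toy example the coefficients of the two uninformative features are set exactly to zero, while the informative coefficient is shrunk slightly below its unpenalized value of 2. This is the sense in which a penalized fit selects features as part of the learning step.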
Embedded methods eventually outperform filtering methods as the number of training samples increases. LASSO proposed by Tibshirani [97; 98], logic regression with the regularized Laplacian prior [51] and Bayesian regularized neural networks with automatic relevance determination [56] are examples of embedded methods.

Note that the three feature reduction methods discussed in this section (filter, wrapper and embedded methods) may perform differently when applied to categorical SNP data, in which there are only three genotypes (two homozygous genotypes and one heterozygous genotype), instead of continuous gene expression data. Next we will focus on the review of the recently developed categorical SNP data reduction methods in genome-wide association studies.

3. SNP selections in genome-wide association studies

A major aim of association studies is the identification of polymorphisms, usually single nucleotide polymorphisms (SNPs), associated with a trait or disease status. There are several major computational and statistical tracks for SNP selection, which we will review next [3; 18; 25; 34; 35; 40; 46; 48; 59; 115].

3.1. Statistical measures and testing for SNP-disease association

Specifically, in genome-wide disease association studies, various statistical measures and testing based approaches have been proposed for selecting a subset of SNPs [17; 30; 36; 57; 85; 89]. These include Linkage Disequilibrium (LD) based SNP selection and supervised SNP selection. Linkage Disequilibrium based methods for selecting a maximally informative set of SNPs for association analyses were developed first [24; 92; 101; 102; 108].
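LD-based selection rests on a pairwise measure of association between SNPs. One commonly used measure is the squared correlation $r^2$ between the alleles at two loci. The text does not spell out a particular measure, so the sketch below is only an illustration: the function name `ld_r_squared`, the 0/1 allele coding and the assumption that phased haplotypes are available are all hypothetical choices.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    where a and b are 0/1 allele codes at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denom


# Two SNPs whose alleles always travel together, and two independent SNPs.
linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_r_squared(linked), ld_r_squared(unlinked))
```

A pair with $r^2$ close to 1 carries nearly the same information, so one SNP of the pair can stand in for the other; a pair with $r^2$ close to 0 cannot be reduced in this way.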
Zhang and Jin [114], for instance, identified tagSNPs from haplotype data in two steps: first, they identified haplotype blocks, and then they identified the tagSNPs that best distinguish the haplotypes within a haplotype block. This method is applicable to all types of association studies. Anderson and Novembre [1] and Mannila et al. [61] proposed finding haplotype block boundaries using minimum description length. Entropy-based measures for SNP selection were proposed by Hampe, Schreiber, and Krawczak [36] and Zhao, Boerwinkle, and Xiong [117]. Beckmann et al. [8] presented Mantel statistics for SNP selection and disease mapping purposes by using haplotype sharing to correlate temporal and spatial distributions of cancer in a generalized regression model.

A sliding window approach developed by Neale and Sham [68] combines p-values from multiple independent tests using

$$\chi^2 = -2 \sum_{i=1}^{m} \log(p_i) \sim \chi^2_{2m}.$$

Here, $p_i$ is the p-value of association between SNP $i$ and presence of disease, and $m$ is the number of SNPs in the sliding window. Under the null hypothesis, and provided the $m$ tests are independent, the test statistic $\chi^2$ has a chi-square distribution with $2m$ degrees of freedom. The sliding window incorporates the ordering of SNPs on the chromosome and merges results across adjacent windows to detect chromosome regions with significant associations [27; 84; 113]. However, it does not consider the distance between SNPs, and the implicit assumption is that the SNPs are equally spaced.

The scan statistic [26; 39; 54; 91; 104] does account for the spacing and ordering of SNPs on the chromosome, but it does not consider gene-gene interactions. For instance, Sun et al. [91] developed a chromosomal scan statistic approach, which includes two parts: (i) identifying SNP clusters; (ii) identifying SNP clusters with significant disease association.
This scan method assumes that the position of each SNP is randomly determined by a Poisson process. The lengths between two adjacent SNPs then have an exponential distribution, and the sum of lengths between SNPs has a Gamma distribution. Under these assumptions, the clusters of SNPs are first identified by testing whether the observed length between a set of SNPs (the combined interval between these SNPs) is equal to (null hypothesis) or less than (alternative hypothesis) the expected length. Rejection of the null hypothesis identifies this group of SNPs as a cluster. To further identify SNP clusters with significant disease association, disease outcomes are incorporated and Pearson chi-square p-values are computed to assess the significance of the associations.

Other test statistic approaches, such as the score statistic [81; 82] and the weighted-average statistic [87] for disease mapping in case-control studies, were also proposed for SNP selection in genetic association studies. Cheng et al. [20] propose using the expectation maximization (EM) algorithm to estimate haplotype frequencies of multiple linked SNPs, and follow this by constructing a contingency table statistic S for LD analysis based on the estimated haplotype frequencies. An empirical p-value is obtained based on the null distribution of the maximum of S (S*) from a large number (e.g., 1,000 or more) of randomized permutations. This method was developed for mapping functional sites or regions from case-control data using haplotypes of multiple linked SNPs.

All these conventional test based filter approaches estimate the association between each SNP (or multiple SNPs) and a phenotype, and then use the corresponding p-values to prioritize the results.
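As a concrete illustration of prioritizing regions by p-values, the sliding-window combination $\chi^2 = -2 \sum_i \log(p_i)$ described earlier in this subsection can be sketched as follows. This is a minimal sketch, not the implementation of the cited authors: the function names `chi2_sf_even_df` and `fisher_sliding_windows` and the example p-values are hypothetical, the per-SNP p-values are assumed to lie in (0, 1], and the tests within a window are treated as independent.

```python
import math


def chi2_sf_even_df(x, m):
    """Upper tail P(X > x) for a chi-square variable with 2*m degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return math.exp(-half) * total


def fisher_sliding_windows(p_values, m):
    """Combine p-values in each window of m consecutive SNPs (chromosome order).

    Returns a list of (start_index, statistic, combined_p) tuples.
    """
    results = []
    for start in range(len(p_values) - m + 1):
        window = p_values[start:start + m]
        stat = -2.0 * sum(math.log(p) for p in window)
        results.append((start, stat, chi2_sf_even_df(stat, m)))
    return results


# Hypothetical per-SNP p-values, listed in chromosome order.
p_values = [0.5, 0.05, 0.05, 0.8]
for start, stat, combined in fisher_sliding_windows(p_values, m=2):
    print(start, round(stat, 3), round(combined, 4))
```

With a window of size one, the combined p-value reduces to the single-SNP p-value, which gives a quick sanity check of the tail computation. Because SNPs in physical proximity are typically correlated, the independence assumption is optimistic and the combined p-values should be read with that caveat.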
One drawback of these test based approaches is that one may end up with many highly correlated SNPs or genes carrying highly redundant information, which can be a hurdle for further classification and prediction. Also, the test based approaches cannot incorporate many environmental factors to account for gene-environment interactions. Furthermore, the non-independence of SNPs in physical proximity (Linkage Disequilibrium) may cause problems for multiple testing scenarios with correlated tests [6; 7; 23; 70; 80; 112]. Simple corrections may either lead to conservative p-values, if the Bonferroni correction is used, or become computationally expensive, if permutation is used [84]. Nyholt [70] proposed a method for efficiently accounting for multiple testing of many SNPs in an association study that involves estimating an “effective number” of independent tests, and then adjusting the smallest observed p-value using Šidák's formula based on this number of tests. Salyakina et al. [80] further evaluated this method.

Note that the “multiple testing problem” discussed here differs from the “curse of dimensionality problem”, so it poses different challenges. The “multiple testing problem” is caused by the high dimensionality of the predictors (including features plus possible interactions of features) and the complex correlation structures of the predictors, while the “curse of dimensionality problem” arises when considering the interaction of many features, i.e., there are not enough observations in each combination of those features.

Last, but not least, these existing testing based approaches ignore some information about the SNPs, such as sub-structures of the underlying population (admixture problem). This may lead to spurious results as well as low power. This may explain why reproducibility has become a major issue in genomic association studies for complex diseases.
The same data set can show a highly significant association with one method, whereas a different method shows no or only a marginal association. Also, given the low prior probability of causality for each SNP in the genome, rigorous standards of statistical significance are needed for genome-wide association studies in order to avoid a flood of false-positive results. Multiple replications in large samples may provide the most straightforward path in identifying robust and broadly relevant associations.

3.2. Supervised statistical models and statistical learning algorithms

In order to incorporate environmental factors and other covariates/confounders into the genomic association studies, various model based approaches have been developed. He and Zelikovsky [37] proposed selecting tagSNPs for unphased genotypes based on multiple linear regression. Durrant et al. [28] adopted a logistic-regression model applicable to whole-genome screens using sliding windows; it controls for other (continuous) confounders and gene-environment interactions. Yet, such approaches have to make assumptions on the disease model, which is usually unknown in practice. Moreover, the effects of violations of these assumptions are unpredictable in general. Baker [5] applied a simple loglinear model for haplotype effects in a case-control study involving two unphased genotypes.

The haplotype trend regression, developed by Zaykin et al. [111], fits a model of additive effects of haplotypes: it takes a series of marker genotypes, computes haplotype probabilities for each observation using the composite haplotype method (CHM), and forms a linear regression of the response on the haplotype probabilities, which serve as the regression matrix.
A nonparametric method called Haplotype Pattern Mining (HPM) was proposed to identify disease associated haplotype patterns from case-control data. HPM has two steps. In step I, given the data (markers, haplotypes, and phenotypes), the goal is to output all haplotype patterns that are strongly associated with the disease status for a given value of the association threshold. In step II, the goal is to find the “gene location” by counting how frequently each marker appears in the haplotype patterns identified in the first step. Since the HPM method utilizes the disease status (case/control), it is a supervised mining approach. Toivonen et al. [99] showed that HPM does not require any assumptions on the inheritance patterns and has good localization power, even when the number of phenocopies is large. Knorr-Held and Rue [49] studied block updating in Markov random field models for disease mapping. Other model-based approaches that can take into account the spatial correlation between markers were also proposed [14; 31; 32; 42; 96; 100; 103; 106].

Recently, Schwender and Ickstadt [83] demonstrated logic regression based identification of SNP interactions for the disease status in case-control studies and proposed two measures for quantifying the importance of feature interactions for classification. In comparison with some well-known classification methods such as CART [12], Random Forests [11], and other regression procedures [108], logic regression has shown a good classification performance when applied to SNP data.

When fitting categorical features/variables in the model based approaches, i.e. the genotype measurements with two homozygous genotypes and one heterozygous genotype, we often define a set of dummy variables that represent a single categorical feature/variable.
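The dummy-variable coding of genotypes can be sketched as follows. The sketch assumes that each genotype is coded as 0, 1 or 2 copies of the minor allele and that genotype 0 is the reference level; the function name `dummy_code_genotypes` and the `groups` bookkeeping are hypothetical conveniences, not part of any cited method.

```python
def dummy_code_genotypes(genotypes, reference=0):
    """Encode genotypes (0, 1 or 2 copies of the minor allele) as dummy variables.

    Each SNP yields one group of two 0/1 indicators, one for each
    non-reference genotype; the reference genotype maps to (0, 0).
    Returns the design matrix and, for every column, the index of the SNP
    it belongs to, so that the columns of one SNP can be handled jointly.
    """
    levels = [g for g in (0, 1, 2) if g != reference]
    design, groups = [], []
    for row in genotypes:
        coded = []
        for g in row:
            if g not in (0, 1, 2):
                raise ValueError(f"unexpected genotype code: {g!r}")
            coded.extend(1 if g == level else 0 for level in levels)
        design.append(coded)
    if genotypes:
        for snp_index in range(len(genotypes[0])):
            groups.extend([snp_index] * len(levels))
    return design, groups


# Three individuals genotyped at two SNPs (hypothetical data).
genotypes = [[0, 2], [1, 1], [2, 0]]
design, groups = dummy_code_genotypes(genotypes)
print(design)
print(groups)
```

Keeping the group index next to the design matrix records which columns belong to the same SNP, so that a selection procedure can keep or drop both indicators of a SNP together rather than one at a time.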
In order to select the set/group of dummy variables that represent a single categorical feature/variable/SNP simultaneously, Yuan and Lin [110] proposed the group-Lars and the group-Lasso methods. Park and Hastie [72] proposed several regularization path algorithms with grouped feature/variable selection for modeling gene-gene interactions. Multifactor dimensionality reduction has been proposed and implemented for SNP data reduction by Coffey et al. [22], Ritchie et al. [77] and Moore et al. [64].

3.3. Unsupervised haplotype mapping approaches

Haplotype density based clustering algorithms and clustering techniques based on the degree of haplotype sharing in affected individuals were developed recently for haplotype mapping. These approaches have the advantage of robustness since they are nonparametric and require fewer assumptions in modeling. Fu et al. [29] and Zhang et al. [116] proposed Bayesian models for the analysis of genetic structure when populations are correlated. Liu et al. [58] employed a Bayesian approach to model the positions of the historical recombinations and mutation events that produced the observed haplotypes from an initial set of founders by accounting for all sources of uncertainty. They employed a Markov Chain Monte Carlo (MCMC) method for parameter estimation and assigned haplotypes to clusters representing allele heterogeneity. Molitor et al. [62] modeled haplotype risks using clusters obtained from a probability model, but their method does not take phenocopies into consideration. Both methods were developed mainly for haplotype fine mapping and do not scale up well to whole-genome screens. Other algorithms for SNPs are hierarchical clustering and graph methods [2; 55].
Principal Component Analysis with multiple genotype frequencies was also applied to select a subset of correlated SNPs that capture multiple genotype variability in the region [9; 57]. However, whether Principal Component Analysis is a suitable tool for categorical SNP information is arguable, since it is more appropriate for continuous scale data. The related correspondence analysis may be more suitable, but the interpretation of its results poses many challenges.

3.4. Computational intelligence approaches

Computational intelligence systems [47; 74] hold great promise for tackling the tasks and challenges posed by large, diverse genomic data for complex diseases. Some of these challenges are the identification of gene-gene and gene-environment interactions [4; 43; 50; 66; 78], dealing with the notorious “curse of dimensionality”, the uncertainty, and the unclear, fuzzy boundaries of phenotypes for complex diseases [76; 88]. Techniques include neural networks [71], genetic algorithms [21], genetic programming [65], evolutionary trees [53], evolutionary algorithms [41] and various hybrid approaches. For instance, Moore [64] developed a hybrid of genetic programming (GP) with a multifactor dimensionality reduction method to pick SNPs for epistasis.

Motsinger et al. [67] applied a genetic programming neural network (GPNN) approach for detecting epistasis in case-control studies for SNP data. They evaluated the power of GPNN for identifying high-order gene-gene interactions and applied GPNN to real data on Parkinson's disease. They also developed a Grammatical Evolution Neural Network (GENN), a machine-learning approach to detect gene-gene and gene-environment interactions in high dimensional genetic epidemiological data.
Furthermore, they proposed an Ensemble Learning Approach for Set association (ELAS) to detect a set of interacting loci that predicts the complex trait. An important advantage of the hybrid approach is that any form of expert knowledge could be used to guide the stochastic search algorithm to identify epistatic SNPs in the absence of marginal effects.

4. Other challenges in genetic association studies of complex diseases

An important challenge that faces molecular association studies in the post genomic era is to understand the interconnections from a network of genes and their products that are modified by a variety of environmental factors [15; 45]. The variety of phenotype definitions leads to a multiplicity of tests that involve a large number of comparisons, which often results in less power. The need for adequate algorithms and models for reducing biological and statistical redundancy from thousands of SNPs and finding an optimal set of SNPs associated with diseases is pressing for common complex diseases. Dealing with many dependent association tests is one of the emerging issues on the statistical/computational side.

For SNP-disease data, in addition to being large, redundant, diverse and distributed, three important characteristics pose challenges for data analysis and modeling: (1) heterogeneity, (2) a constantly evolving biological nature and (3) complexity. Firstly, there is the heterogeneity of SNP data, in the sense that i) the population data involves the population substructure or admixture problem and there is locus heterogeneity where a large fraction of the prevalence is due to phenocopies; and ii) there is a wide array of data types, including categorical, continuous, sequence data, as well as temporal, incomplete and missing data.
Such data sets are large with a lot of redundancy in SNP and haplotype databases. Secondly, they are very dynamic and continuously evolving, which means that special knowledge is required when designing the modeling techniques. Lastly, but most importantly, these SNP and haplotype data are complex with intrinsic features and subtle patterns, in the sense that they are very rich in associated complex phenotype traits.

The difficulty in a SNP association study is increased by the nature of complex diseases [38]. Typically, the contribution of single genes as well as of single environmental risk factors is small to moderate. Furthermore, most complex diseases result from gene-gene and gene-environment interactions [19]. By disregarding interactions, relative risks of individual genetic variants are expected to be small. Disregarding gene-environment interaction also weakens exposure-disease and gene-disease associations. In complex diseases, it is likely that a combination of genes predisposes for the disease and environmental factors aggravate the impact of these genes and therefore are jointly responsible for disease development in populations (known as epistatic effect). In addition, environmental factors, which seem to have only a moderate impact at the population level, might have larger relative risks in subpopulations with certain genetic predispositions. There are major methodological challenges in the study of gene-gene and gene-environment interactions.

Other open questions and challenges for new computational approaches in analyzing the associations between genetic markers, such as SNPs, in complex diseases involve several hierarchical levels.

First level of complexity: How to analyze multiple SNPs in a single gene? How to analyze interactions among multiple SNPs in a single gene?

Second level of complexity: How to analyze multiple SNPs in multiple genes? How to analyze interactions among multiple SNPs in multiple genes?

Third level of complexity: How to analyze interactions among multiple SNPs in multiple genes and environmental factors?

Fourth level of complexity: How to analyze associations between SNPs in single or multiple genes and quantitative traits? How to identify and quantify the percentage of the association between genes and diseases explained by the association between the same gene and quantitative traits, taking into consideration single genes, multiple genes and environmental factors?

Lastly, the ultimate goal in genetic/genomic analysis is to build direct or indirect causal association between genetic variants and phenotypes/disease status, but the difficulty here is that we do not know if there is association between the SNPs and the disease. However, with the development of computational/statistical approaches, we may be able to identify these causal associations and construct the pathways related to complex diseases.

5. Discussion

New advances in human genome research have drawn tremendous attention from researchers in multiple fields, including both theoretical scientists and applied researchers, especially in the statistical field. Huge amounts of continuously growing large-scale genomic, proteomic and clinical data for complex diseases and phenotype traits have posed ever greater challenges for the computational field. Multiple whole genome-wide association studies have already been completed and have resulted in novel and promising genetic variants for various diseases.
In this paper we presented a survey of recent advances and some promises of designing, developing and implementing statistical/computational methods for identifying SNP markers responsible for common, complex, chronic diseases, such as diabetes, cancer, multiple sclerosis, and cardiovascular disease, and for tackling challenges such as gene-gene and gene-environment interactions along with the notorious “curse of dimensionality” problem. Success in identifying SNPs and haplotypes conferring susceptibility or resistance to common diseases will provide a deeper understanding of the architecture of the disease and its risks, and will offer more powerful diagnostic tools and predictive treatment.

References

[1] Anderson, E.C. and Novembre, J. (2003). Finding haplotype block boundaries by using the minimum-description-length principle. American Journal of Human Genetics 73 336–354.
[2] Ao, S., Yip, K., Ng, M., Cheung, D., Fong, P.Y., Melhado, I. and Sham, P.C. (2005). CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics 21(8) 1735–1736.
[3] Avi-Itzhak, H.I., Su, X. and De La Vega, F.M. (2003). Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity. Pac Symp Biocomput. 466–477.
[4] Azevedo, L., Suriano, G., van Asch, B., Harding, R. M. and Amorim, A. (2006). Epistatic interactions: how strong in disease and evolution? Trends Genet. 11 585–598.
[5] Baker, S. G. (2005). A simple log linear model for haplotype effects in a case-control study involving two unphased genotypes. Statistical Applications in Genetics and Molecular Biology 4(1) 14. MR2138219
[6] Becker, T., Cichon, S., Jonson, E. and Knapp, M. (2005). Multiple testing in the context of haplotype analysis revisited: application to case-control data.
Annals of Human Genetics 69 747–756.
[7] Becker, T. and Knapp, M. (2004). A powerful strategy to account for multiple testing in the context of haplotype analysis. Am J Hum Genet. 75(4) 561–570.
[8] Beckmann, L., Thomas, D.C., Fischer, C. and Chang-Claude, J. (2005). Haplotype sharing analysis using Mantel statistics. Human Heredity 59 67–78.
[9] Benjamin, D. H. and Nicola, J. C. (2004). Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genetic Epidemiology 26(1) 11–21.
[10] Bo, T. and Jonassen, I. (2002). New feature subset selection procedures for classification of expression profiles. Genome Biology 3(4) research0017.
[11] Breiman, L. (2001). Random Forests. Machine Learning 45 5–32.
[12] Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.
[13] Brookes, A.J. (1999). Review: The essence of SNPs. Gene 234 177–186.
[14] Burkett, K., McNeney, B. and Graham, J. (2004). A note on inference of trait associations with SNP haplotypes and other attributes in generalized linear models. Human Heredity 57 200–206.
[15] Burton, P. R., Tobin, M. D. and Hopper, J.L. (2005). Key concepts in genetic epidemiology. Lancet 366 941–951.
[16] Cardon, L. R. and Bell, J. I. (2001). Association study designs for complex diseases. Nat Rev Genet 2 91–99.
[17] Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. and Nickerson D.A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 74 106–120.
[18] Chapman, J. M., Cooper, J. D., Todd, J. A. and Clayton, D. G. (2003). Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum. Hered. 56 18–31.
[19] Chatterjee, N., Kalaylioglu, Z., Moslehi, R., Peters, U. and Wacholder, S. (2006). Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. American Journal of Human Genetics 79(6) 1002–1016.
[20] Cheng, R., Ma, J., Elston, R.C. and Li, M.D. (2005). Fine mapping functional sites or regions from case-control data using haplotypes of multiple linked SNPs. Annals of Human Genetics 69(1) 102–112.
[21] Clark, T. G., De Iorio, M., Griffiths, R. C. and Farrall, M. (2005). Finding associations in dense genetic maps: a genetic algorithm approach. Human Heredity 60 97–108.
[22] Coffey, C.S., Hebert, P.R., Ritchie, M.D., Krumholz, H.M., Morgan, T.M., Gaziano, J. M., Ridker, P.M. and Moore, J.H. (2004). An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: The importance of model validation. BMC Bioinformatics 5 49.
[23] Conneely, K. N. and Boehnke, M. (2005). Combining correlated p-values in trait-SNP association studies. The American Society of Human Genetics 55th Annual Meeting, Salt Lake City, Utah 184–189.
[24] Cortes, C. and Vapnik, V. N. (1995). Support Vector Networks. Machine Learning 20 273–297.
[25] Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. and Lander, E. S. (2001). High-resolution haplotype structure in the human genome. Nat. Genet. 29 229–232.
[26] Dembo, A. and Karlin, S. (1992). Poisson approximations for r-scan processes. The Annals of Applied Probability 2 329–357. MR1161058
[27] Dudbridge, F. and Koeleman, B. P. C. (2004). Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies.
American Journal of Human Genetics 75(3) 424–435.
[28] Durrant, C., Zondervan, K. T., Cardon, L. R., Hunt, S., Deloukas, P. and Morris, A. P. (2004). Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes. Am. J. Hum. Genet. 75 35–43.
[29] Fu, R., Dey, D. K. and Holsinger, K. E. (2005). Bayesian models for the analysis of genetic structure when populations are correlated. Bioinformatics 21(8) 1516–1529.
[30] Gopalakrishnan, S. and Qin, Z. S. (2006). TagSNP Selection Based on Pairwise LD Criterion and Power Analysis in Association Studies. Pacific Sym. Biocomputing 11 511–522.
[31] Greenspan, G. and Geiger, D. (2004). Model-based inference of haplotype block variation. J. Comp. Biol. 11 493–504.
[32] Greenspan, G. and Geiger, D. (2006). Modeling Haplotype Block Variation Using Markov Chains. Genetics 172(4) 2583–2599.
[33] Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. N. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46(1–3) 389–422.
[34] Halldorsson, B. V., Bafna, V., Lippert, R., Schwartz, R., De La Vega, F. M., Clark, A. G. and Istrail, S. (2004). Optimal haplotype block-free selection of tagging SNPs for genomewide association studies. Genome Res 14 1633–1640.
[35] Halperin, E., Kimmel, G. and Shamir, R. (2005). Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy. Bioinformatics 21(suppl 1) 195–203.
[36] Hampe, J., Schreiber, S. and Krawczak, M. (2003). Entropy-based SNP selection for genetic association studies. Hum Genet. 114 36–43.
[37] He, J. and Zelikovsky, A. (2006). MLR-tagging informative SNP selection for unphased genotypes based on multiple linear regression. Bioinformatics 22(20) 2558–2561.
[38] Hirschhorn, J. N. and Daly, M.
J. (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6 95–108.
[39] Hoh, J. and Ott, J. (2000). Scan statistics to scan markers for susceptibility genes. Proc Nat Acad Sci 97 9615–9617.
[40] Howie, B. N., Carlson, C. S., Rieder, M. J. and Nickerson, D. A. (2006). Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Human Genetics 120(1) 58–68.
[41] Hubley, R. M., Zitzler, E. and Roach, J. C. (2003). Evolutionary algorithms for the selection of single nucleotide polymorphisms. BMC Bioinformatics 4 30–39.
[42] Hung, R. J., Brennan, P., Malaveille, C., Porru, S., Donato, F., Boffetta, P. and Witte, J. S. (2004). Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiology Biomarkers and Prevention 13(6) 1013–1021.
[43] Hunter, D. J. (2005). Gene-environment interactions in human diseases. Nature Reviews Genetics 6 287–298.
[44] Inza, I., Sierra, B., Blanco, R. and Larranaga, P. (2002). Gene selection by sequential search wrapper approaches in microarray cancer class prediction. Journal of Intelligent and Fuzzy Systems 12(1) 25–34.
[45] Ioannidis, J. P., Gwinn, M., Little, J., Higgins, J. P., Bernstein, J. L., Boffetta, P., Bondy, M., Bray, M. S., Brenchley, P.E., Buffler, P. A. et al. (2006). Human Genome Epidemiology Network and the Network of Investigator Networks, A road map for efficient and reliable human genome epidemiology. Nature Genetics 38(1) 3–5.
[46] Judson, R., Salisbury, B., Schneider, J., Windemuth, A. and Stephens, J. C. (2002). How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 3 379–391.
[47] Kasabov, N. (2002).
Evolving Connectionist Systems: Methods and Applications in Bioinformatics, Brain Study and Intelligent Machines. London-New York, Springer-Verlag. MR2375896
[48] Ke, X. and Cardon, L. R. (2003). Efficient selective screening of haplotype tag SNPs. Bioinformatics 19 287–288.
[49] Knorr-Held, L. and Rue, H. (2002). On block updating in Markov random field models for disease mapping. Scandinavian Journal of Statistics 29(4) 597–614. MR1988414
[50] Krina, T., Zondervan, L. and Cardon, T. (2004). The complex interplay among factors that influence allelic association. Nature Reviews Genetics 5(2) 89–100.
[51] Krishnapuram, B. and Carin, L. (2005). Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6).
[52] Lal, T. N., Chapelle, O., Weston, J. and Elisseeff, A. (2006). Embedded methods. Feature Extraction: Foundations and Applications. In Guyon, I., Gunn, S., Nikravesh, M. and Zadeh, L. A. (Eds.) Springer, Berlin, Germany.
[53] Lam, J. C., Roeder, K. and Devlin, B. (2000). Haplotype fine mapping by evolutionary trees. Am. J. Hum. Genet. 66(2) 659–673.
[54] Levin, A. M., Ghosh, D., Cho, K. R. and Kardia, S. L. R. (2005). A model-based scan statistics for identifying extreme chromosomal regions of gene expression in human tumors. Bioinformatics 21 2867–2874.
[55] Li, J. and Jiang, T. (2005). Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics 21 4384–4393.
[56] Liang, Y. and Kelemen, A. (2005). Temporal Gene Expression Classification with Regularised Neural Network. International Journal of Bioinformatics Research and Applications 1(4) 399–413.
[57] Lin, Z. and Altman, R. B. (2004). Finding haplotype tagging SNPs by use of principal components analysis. Am. J. Hum. Genet.
75 850–861.
[58] Liu, J. S., Sabatti, C., Teng, J., Keats, B. J. and Risch, N. (2001). Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Research 11(10) 1716–1724.
[59] Liu, Z. and Lin, S. (2005). Multilocus LD measure and tagging SNP selection with generalized mutual information. Genet Epidemiol. 29 353–364.
[60] Long, A., Mangalam, H., Chan, B., Tolleri, L., Hatfield, G. and Baldi, P. (2001). Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. J. Biol. Chem. 276 19937–19944.
[61] Mannila, H., Koivisto, M., Perola, M., Varilo, T., Hennah, W., Ekelund, J., Lukk, M., Peltonen, L. and Ukkonen, E. (2003). Minimum description length block finder, a method to identify haplotype blocks and to compare the strength of block boundaries. Am. J. Hum. Genet. 73 86–94.
[62] Molitor, J., Marjoram, P. and Thomas, D. (2003). Fine-Scale Mapping of Disease Genes with Multiple Mutations via Spatial Clustering Techniques. Am. J. Hum. Genet. 73 1368–1384.
[63] Monari, G. and Dreyfus, G. (2000). Withdrawing an example from the training set: an analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing Letters 35 195–201.
[64] Moore, J. H. (2007). Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Zhu, Davidson (eds.) Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data, IGI, (in press).
[65] Moore, J. H. and White, B. C. (2006). Exploiting expert knowledge for genome-wide genetic analysis using genetic programming. In: Runarsson et al. (eds.) Parallel Problem Solving from Nature - PPSN IX, Lecture Notes in Computer Science 4193, 969–977.
[66] Moore, J. H.
and Williams, S. M. (2002). New strategies for identifying gene-gene interactions in hypertension. Ann Med.
[67] Motsinger, A. A., Lee, S. L., Mellick, G. and Ritchie, M. D. (2006). PNN: Power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease. BMC Bioinformatics 7(1) 39–50.
[68] Neale, B. and Sham, P. (2004). The future of association studies: Gene-based analysis and replication. American Journal of Human Genetics 75 353–362.
[69] Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R. and Tsui, K. W. (2001). On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8(1) 37–52.
[70] Nyholt, D. R. (2004). A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. American Journal of Human Genetics 74(4) 765–769.
[71] Ott, J. (2001). Neural networks and disease association studies. American Journal of Medical Genetics 105(1) 60–61.
[72] Park, M. and Hastie, T. (2006). Regularization Path Algorithms for Detecting Gene Interactions, preprint.
[73] Pavlidis, P. and Noble, W. S. (2001). Analysis of strain and regional variation in gene expression in mouse brain. Genome Biology 2(10) research0042.1-0042.15.
[74] Pedrycz, W. (1997). Computational Intelligence: An Introduction. Boca Raton, FL, CRC.
[75] Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature 405 847–856.
[76] Risch, N. and Merikangas, K. (1996). The future of genetics studies of complex human diseases. Science 273 1516–1517.
[77] Ritchie, M. D., Hahn, L. W. and Moore, J. H. (2003a).
Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 24 150–157.
[78] Ritchie, M. D., White, B. C., Parker, J. S., Hahn, L. W. and Moore, J. H. (2003b). Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics 4 28–38.
[79] Rivals, I. and Personnaz, L. (2003). MLPs (Mono-Layer Polynomials and Multi-Layer Perceptrons) for Nonlinear Modeling. Journal of Machine Learning Research 3 1383–1398.
[80] Salyakina, D., Seaman, S. R., Browning, B. L., Dudbridge, F. and Muller-Myhsok, B. (2005). Evaluation of Nyholt’s procedure for multiple testing correction. Human Heredity 60(1) 19–25.
[81] Schaid, D. J. (1996). General score tests for associations of genetic markers with disease using cases and their parents. Genetic Epidemiology 13 423–449.
[82] Schaid, D. J., Rowland, C. M., Tines, D. E., Jacobson, R. M. and Poland, G. A. (2002). Score test for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70 425–439.
[83] Schwender, H. and Ickstadt, K. (2006). Identification of SNP Interactions Using Logic Regression, http://www.sfb475.uni-dortmund.de/berichte/tr31-06.pdf, accessed on Oct.-31-2006.
[84] Seaman, S.R. and Muller-Myhsok, B. (2005). Rapid simulation of P values for product methods and multiple-testing adjustment in association studies. American Journal of Human Genetics 76 399–408.
[85] Sebastiani, P., Lazarus, R., Weiss, S. T., Lunkel, L. M., Kohane, I. S. and Romani, M. F. (2003). Minimal haplotype tagging. Proc. Natl. Acad. Sci. USA 100 9900–9905.
[86] Shriver, M., Mei, R., Parra, E. J.
et al. (2005). Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Human Genomics 2(2) 81–89.
[87] Song, K. and Elston, R. C. (2006). A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Statistics in Medicine 25(1) 105–126. MR2222077
[88] Stephens, M. and Donnelly, P. (2000). Inference in molecular population genetics. J R Stat Soc B 62 605–655. MR1796282
[89] Stram, D. O., Haiman, C. A., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E. and Pike, M. C. (2003). Choosing haplotype-tagging SNPs based on unphased genotype data using preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum. Hered. 55 27–36.
[90] Sun, W. and Cai, T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. American Statistical Association 102 901–912.
[91] Sun, Y., Levin, A., Boerwinkle, E., Robertson, H. and Kardia, S. (2006). A scan statistic for identifying chromosomal patterns of SNP association. Genetic Epidemiology 30 627–635.
[92] Tan, P., Steinbach, M. and Kumar, V. (2005). Introduction to Data Mining, Addison-Wesley, pp. 76–79.
[93] The International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437 1299–1320.
[94] The International HapMap Consortium (2004). Integrating ethics and science in the International HapMap Project. Nat Rev Genet 5 467–475.
[95] The International HapMap Consortium (2003). The International HapMap Project. Nature 426 789–796.
[96] Thomas, D. C., Stram, D. O., Conti, D., Molitor, J. and Marjoram, P. (2003). Bayesian spatial modeling of haplotype associations. Human Heredity 56 32–40.
[97] Tibshirani, R. (1996).
Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B. 58(1) 267–288. MR1379242
[98] Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16 385–395.
[99] Toivonen, H. T., Onkamo, P., Vasko, K., Ollikainen, V., Sevon, P., Mannila, H., Herr, M. and Kere, J. (2000). Data mining applied to linkage disequilibrium mapping. Am. J. Hum. Genet. 67(1) 133–145.
[100] Tzeng, J. N., Wang, C. H., Kao, J. T. and Hsiao, C. K. (2006). Regression-based association analysis with clustered haplotypes through use of genotypes. American Journal of Human Genetics 78(2) 231–242.
[101] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York. MR1367965
[102] Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. MR1641250
[103] Verzilli, C. J., Stallard, N. and Whittaker, J. C. (2006). Bayesian graphical models for genomewide association studies. American Journal of Human Genetics 79(1) 100–112.
[104] Wallenstein, S. and Neff, N. (1987). An approximation for the distribution of the scan statistic. Stat Med 6 197–207.
[105] Wang, L., Zhu, J. and Zou, H. (2006). Doubly regularized support vector machine. Statistica Sinica 16 589–615. MR2267251
[106] Wessel, J. and Schork, N. J. (2006). Generalized Genomic Distance Based Regression Methodology for Multilocus Association Analysis. American Journal of Human Genetics 79(5) 792–806.
[107] Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. and Vapnik, V. (2000). Feature Selection for SVMs. In S. A. Solla, T. K. Leen, and K. R. Muller, (eds), Advances in Neural Information Processing Systems, volume 12, 526–532, Cambridge, MA, USA. MIT Press.
[108] Witte, J. S. and Fijal, B. A. (2001). Introduction: Analysis of Sequence Data and Population Structure. Genetic Epidemiology 21 600–601.
[109] Yu, J.
and Chen, X. W. (2005). Bayesian Neural Network Approaches to Ovarian Cancer Identification from High-resolution Mass Spectrometry Data. Bioinformatics 21(suppl-1) i487–i494.
[110] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1) 49–67. MR2212574
[111] Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J. and Ehm, M. G. (2002b). Testing Association of Statistically Inferred Haplotypes with Discrete and Continuous Traits in Samples of Unrelated Individuals. Hum Hered 53 79–91.
[112] Zaykin, D. V. and Zhivotovsky, L. A. (2005). Ranks of genuine associations in whole-genome scans. Genet 171 813–823.
[113] Zaykin, D. V., Zhivotovsky, L. A., et al. (2002a). Truncated product method for combining P-values. Genet Epidemiol 22 170–185.
[114] Zhang, K. and Jin, L. (2003). HaploBlockFinder: Haplotype block analysis. Bioinformatics 19 1300–1301.
[115] Zhang, K., Qin, Z., Liu, J., Chen, T., Waterman, M. S. and Sun, F. (2004). Haplotype Block Partitioning and Tag SNP Selection Using Genotype Data and Their Applications to Association Studies. Genome Res. 14 908–916.
[116] Zhang, Y., Niu, T. and Liu, J. (2006). A coalescence-guided hierarchical Bayesian method for haplotype inference. American Journal of Human Genetics 79(2) 313–322.
[117] Zhao, J., Boerwinkle, E. and Xiong, M. (2005). An entropy-based statistic for genomewide association studies. American Journal of Human Genetics 77 27–40.
