Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach

The Annals of Applie d Statistics 2007, V ol. 1, No. 2, 459–479 DOI: 10.1214 /07-A OAS121 c  Institute of Mathematical Statistics , 2 007 IMPR O VING POPUL A TION-SPECIFIC ALLEL E FREQUENCY ESTIMA TES BY AD APTING SUPPLEMENT AL D A T A: AN EMPIRICAL BA YES APPR OA CH By Marc Coram and Hua T ang 1 Stanfor d Uni v ersity Estimation of the allele frequency at genetic markers is a k ey ingredient in biologi cal and biomedical research, such as studies of human genetic v ariation or of th e genetic etiology of heritable traits. As g enetic data b ecomes increasingly av ailable, inv estigators face a dilemma: w hen should data from other studies and p opulation sub- groups b e p o oled with the primary data? Pooling additional samples will generally reduce the vari ance of t he frequency estimates; how - ever, used inapp ropriately , p o oled estimates can b e severel y biased due to p opu lation stratiﬁcation. Because of this p otential bias, most inv estigators av oid p o oling, even for samples with the same ethnic backgro und and residing on the same continent. H ere, we prop ose an empirical Ba yes approach for estimating allele frequencies of single nucleoti de p olymorphisms. This pro cedure adaptivel y in corporates genot yp es from related samples, so that more similar samples hav e a greater inﬂuence on the estimates. In every example we h ave con- sidered, our estimator ac h ieves a mean squared error ( MSE) that is smaller than either p o oling or not, and sometimes substantially im- prov es o ver b oth extremes. The bias introduced is small, as is shown by a sim u lation study that is carefully matched to a real data exam- ple. Our method is p articularly useful when small groups of individ- uals are genotyped at a large number of markers, a situation w e are lik ely to encounter in a genome-wide association study . 1. In tro d uction. Allele frequ en cy at a genetic mark er is one of the most imp ortant elemen ts in studies of genetic dive rsit y , as we ll as in p opulation- based disease association stu dies. It pla ys a piv otal role in link age stud- ies, whic h mo d el the allelic id entical b y descent prob ab ility , and in asso- ciation studies, which directly compare the allele frequency b et wee n th e aﬀected cases and unaﬀected con trols. Moreo v er, once a disease v arian t has Received F ebruary 2007; revised May 2007. 1 Supp orted by NIH GM073059. Key wor ds and phr ases. Empirical Bay es, allele frequency. This is an electronic r e print of the orig inal article published by the Institute of Mathematical Statistics in The Annals of Applie d Statistics , 2007, V o l. 1, No. 2 , 459–47 9 . This reprint diﬀers fro m the origina l in pagina tion and t ypo graphic detail. 1 2 M. COR AM AND H . T ANG b een ident iﬁed, accurate assessments of th e allele fr equency of the v ari- an t en ab le us to ev aluate the prop ortion of the disease burden in a sp e- ciﬁc p opu lation that is attributable to the v ariant. F u eled by the recent dev elopments in high-throu gh p ut genot yping tec hn ologies, v arious eﬀorts are u n derwa y to c haracterize allele f requencies at a genome-wide scale in div ers e p opulations. Ho w ever, b ecause of the still signiﬁcant costs asso- ciated with these high-throughp ut platforms, current large-sca le genomic pro jects often assa y a large n u m b er of mark ers in a s m all num b er of individ- uals. F or example, the Inte rnational HapMap Pro ject has genot yp ed more than f our million single nucleo tide p olymorph isms (SNP) in 90 Afr icans from Nigeria ( 60 of wh ic h are u nrelated individu als), 90 U.S. resid en ts with northern and w estern Eur op ean ancestry (60 of which are unrelated individuals), 45 Han Chinese from Beijing and 45 Japanese from T okyo [ In ternational HapMap C on s ortium ( 2005 )]. In another eﬀort, P erlegen S ci- ences genot yp ed 71 Americans of Europ ean, African or Han C hinese ances- try [ Hinds et al. ( 2005 )]. Th e maximum lik eliho o d estimate (MLE) of allele frequency , in this case just the observ ed prop ortion of one allele, has a bi- nomial sampling err or, whic h can b e substantial for small samples. Small sample sizes remain a concern, eve n as more individu als are b eing geno- t yp ed, b ecause there is a sim u ltaneously gro w ing concern ab out p opu lation stratiﬁcation [ Lander and S c hork ( 1994 )]. When genot yp es are av ailable f r om ind ividuals represen ting th e same p op- ulations, the allele frequency estimates can b e impr o ve d b y com bining geno- t yp e d ata. O n the other h and, inju dicious com bining of s amp les r ep resen ting distinct p opulations can lead to biased estima tes, as p opulation stratiﬁca- tion and genetic dr ift lead to diverge nce in allele f r equencies among p opu- lations [ Fisher ( 1922 ) and W right ( 1931 )]. Unf ortu nately , d eciding w hether t wo samp les repr esen t a homogenous p opu lation, and hence are combin- able, is a d elicate and sub jectiv e decision. Do the Han Chinese fr om Beijing (HapMap samp le) and those f r om Los Angeles (P erlegen sample) r epresen t the same p opulation? Can w e use the HapMap African genot yp es to imp r o ve frequency estimates of Pe rlegen Afr ican Americans? One p ossible appr oac h to addr ess such ambiguit y is a tw o-stage app roac h: one ﬁr s t tests whether the tw o samples are com binab le, using a rand om set of m ark ers and a pro ce- dure su c h as Devlin and Ro ed er ( 1999 ) or Pritc h ard and Rosen b erg ( 1999 ), and, in a second stage, com b ine or not com bine d ep ending on th e outcome of th e ﬁr st-stage test. Th is t wo-stag e procedu re, how ev er, su ﬀers fr om tw o p oten tial pr oblems. First, when only a small num b er of individ u als ha ve b een genot yp ed, the ﬁrst-stage test may not hav e su ﬃcien t p o wer to detect the diﬀerence; on the other hand , with a suﬃcien tly large sample size, an y trivial noncongru ency leads to rejection of the test, and therefore v oids the p ossibilit y to com bine samples. Second, the ﬁr st-stage test can in tr o duce a bias since only similar allele fr equencies are allo wed to b e com bined . EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 3 Ba y esian and empirical Ba ye s ap p roac hes oﬀer ﬂexible v en ues for com- bining m u ltiple s ources of inf orm ation. Lange pioneered an empirical Ba yes approac h for estimating allele frequencies of a single mark er using d ata at the same marker from m u ltiple p opulations [ Lange ( 1995 )]. Lo c kwoo d , Ro eder and Devlin ( 2001 ) extended this approac h to in corp orate multi-loci genot yp e inf orma- tion via a Ba y esian h ierarc hical mo d el. Both metho ds emp lo y a Diric hlet ( α ) distribution to describ e the disp er s ion of f r equencies b etw een the diﬀer- en t popu lations. The t wo app roac hes diﬀer in ho w α is estimated: Lange’s metho d estimates α b y maxim u m like liho o d at eac h lo cus separately; wh ile Lo c kwoo d , Ro eder and Devlin ( 2001 ) b orrow strength across lo ci. These t wo metho ds are describ ed in greater detail in Section 2.6.5 . Add itionally , ther e is a ric h literature in mod eling p opulation structure and d iv ergence u s ing genetic p olymorphism data, although the pr imary in terests are in ferences ab out p opu lation h istory and estimating parameters suc h as genetic distance and p opulation size [ Kitada, Ha ya shi and Kish in o ( 2000 ), Nic holson et al. ( 2002 ) and Wilson, W eale and Balding ( 2003 )]. In this pap er we pr op ose a new empirical Ba y es approac h, whic h oﬀers an adaptiv e pro cedur e to com bine multiple samples. This metho d a v oids the p r oblems asso ciated with the t wo- stage pr o cedure by int ro ducing an aﬃnit y measure, ν, whic h is based on the global similarit y b et ween sam- ples. There is no need to mak e a “hard” decision to com b ine or n ot com- bine, as ν parametrizes a conti n u ous sp ectrum of compromise b et ween the t wo extremes. As a resu lt, our approac h allo ws us to b orrow strength from additional samples, even if th ey are indub itably distinct from the target p opulation. An imp ortant adv an tage of our app roac h is that it requires no kno wledge nor a ssumptions of the genealogy th at relates v arious samples. As we explain in Section 2.6 , our appr oac h diﬀers from related existing approac hes in implemen tation as w ell as in terpretation. W e illustrate the statistica l v alidit y of our metho d b y a series of analyses using r eal genot yp e data f rom HapMap and Pe rlegen pro jects; these analyses also p ro vid e some in teresting biologica l ins ights regarding the p opulations stud ied by the t wo pro jects. 2. Method . 2.1. Data , mo del and the b asic ide a. Our goa l is to estimate allele fre- quencies at a large num b er of bi-alleleic markers in a target p opulation, Y . The a v ailable data consists of n Y alleles f rom Y (based on genot yp es of 1 2 n Y individuals), as wel l as a b o oster sample of n X alleles from a (p ossibly re- lated) p opulation, X . A t eac h mark er, we assign one allele as the “ A ” allele, and denote the obs erv ed n u m b ers of A allele at marker i in the tw o samples as X i and Y i , resp ectiv ely . Let q i b e the frequen cy of A allele at marker i in p opulation Y , and let p i b e the corresp ondin g frequency in X . Within eac h 4 M. COR AM AND H . T ANG p opulation, w e assu me th e Hardy–W einberg equilibrium at eac h m ark er; furthermore, w e assume that all mark ers are indep endent (i. e., in link age equilibrium). Add itionally , we assume ev olutionary neutralit y at th e ma jor- it y of marke rs. Some of th ese assu m ptions are not strictly necessary , as we explain in S ection 2.6.3 . Consequ ently , we mo del X i ∼ Binom( n X , p i ) , Y i ∼ Binom( n Y , q i ) (all indep endent) . Using genot yp e data f rom Y alone, the maximum lik eliho o d estimate (MLE) of the frequency of A allele, q i , coincides with the observed frequ ency , ˆ q i = y i n Y . Lik ewise, denote th e MLE of the corresp onding allele frequ en cy in X as ˆ p i . The empirical Ba yes a pproac h we prop ose is motiv ated b y the ob s erv a- tion that, if p opulations X and Y are ev olutionarily related, then the allele frequencies at the corresp ondin g mark ers, p i and q i , are often p ositiv ely as- so ciated [ Jiang and C o c kerham ( 1987 )]. F or example, Figure 1 (a) d isp la ys a t wo- dimensional h istogram of the MLE allele frequencies for ∼ 60,000 SNPs on Chromosome 22 using the HapMap Ch inese ( x -axis) and Japanese ( y - axis) samples. W e use 15 , instead of 45 , Japanese individuals for this plot in order to f acilitat e comparison with the sim u lation stud y in Section 3.3 . The higher intensit y band al ong the diagonal indicates the high degree of asso ciation b etw een the frequ encies in JPT and CHB p opulations. The main statistica l contribution of this p ap er is to deve lop a wa y to b orr o w s tr ength from such asso ciation. T o ﬁx ideas, consid er a marke r whose J P T frequency w e wish to estimate and whose CHB sample fr equency happ ens to b e 0 . 33 . So far, what we kno w ab out this mark er is that it lies somewh ere in the third v ertical strip in Figure 2 (a). I t is natural to consider the p opulation of suc h mark ers, that is, the s ubset of markers whose C HB samp le fr equency is essentia lly 0 . 33 . The thir d histogram sho w n on the righ t displays the corresp ondin g all ele counts in JPT. This histogram has a mean ab out 30 (out of 90 Japanese al leles), and the distribution is wel l appro ximated by the sup erimp osed b eta-binomial density , whose parameters were ﬁtted b y maxim um lik eliho o d. F or this particular histogram, the ﬁtted parameters are (11 . 30 , 22 . 41) . Therefore, this data is well ap p ro ximated by a mo del in whic h we consider the tru e fr equency to ha ve a Beta d istribution and the ob- serv ed coun ts to b e Binomially distributed. Accordingly , w e prop ose to tak e this Beta as an empirical pr ior for all of the m ark ers in the histogram; for eac h marker in tu rn, we condition this p rior on the observed JPT coun ts to deriv e a Beta p osterior, and use the p osterior m ean as an up dated f requency estimate. EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 5 2.2. Windowe d estimates. W e will formalize th is basic idea by d escribing a wind o wed version of our empirical Ba yes approac h. T he arbitrary wind o w width δ will b e remo v ed later in a parametric v ersion. F or a single mark er, a commonly u sed Bay esian approac h assu mes a Beta p rior, wh ic h is a conju- gate prior for the binomial distrib ution [ Bernard o and Smith ( 1994 )]. Ho w - Fig. 1. Two-dimensional Histo gr am of MLE of al lele fr e quencies in (a) 45 HapMap Chinese individuals ( x -axis) v.s. 15 HapMap Jap anese individuals ( y -axis), and ( b ) 45 simulate d b o oster samples (x-axis) v.s. 15 simulate d tar get samples ( y -axis) use d in Se c- tion 3.3 . 6 M. COR AM AND H . T ANG Fig. 2. I l lustr ation of the b asic ide a (a) sc atte r plot of the empiric al al l ele fr e quencies i n 45 HapMap C hinese indi viduals ( x -axis) v.s. 45 HapMap Jap anese individuals ( y -axis); (b) histo gr am and ﬁtte d Beta-Binomial density of the al l el e c ounts i n Jap anese in narr ow windows. ev er, a problem arises here that p lagues m an y Ba yesia n app roac hes: ho w to c ho ose the parameters, a and b , for the prior distribu tion? F ortu nately , with an emp ir ical Ba ye s appr oac h, w e can take adv an tage of the large n u m b er of mark ers a v ailable to estimate these parameters ob j ectiv ely . Consider a certain marker i f or w hic h we wish to estimate q i . The obser ved frequency at mark er i in X is ˆ p i . No w consider the (presumably large) set J of all markers j suc h that ˆ p j falls within a n arro w windo w of ˆ p i , sa y , ( ˆ p i − δ , ˆ p i + δ ). W e can u nderstand J as the “p opulation” of markers, ab out wh ic h w e h a ve the same information coming f rom p opulation X as w e ha v e for the mark er of interest, m arker i. By lo oking at these markers as an aggregate, w e can emp irically determine the degree to whic h this X information do es or do es not inform us ab out marke r frequency in the Y p opulation. F or j ∈ J , we appro x im ate the d istr ibution q j ’s by a Beta distribu tion, Beta( a, b ). Under this mod el, the Y j ’s are indep endent and ident ically distributed as BetaBinom( n Y , a, b ) . Let ˆ a i and ˆ b i maximize the like liho o d of this d ata: Y j ∈J d BetaB inom ( n Y , a, b )( y j ) , where d BetaBinom( n, a, b )( x ) denotes the Beta Binomial d ensit y at x : B( a + x, b + n − x ) B( a, b )  n x  , (2.1) EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 7 where B is the Beta function. W e therefore form the empirical Ba yes pr ior, q i ∼ Bet a( ˆ a i , ˆ b i ). Condition- ing on Y i = y i , we obtain the p osterior: q i | y i ∼ Beta( y i + ˆ a i , n − y i + ˆ b i ). T aking the mean and v ariance of this p osterior, we obtain th e estimator ˆ q i EBW = E( q i | Y i ) = y i + ˆ a i n + ˆ a i + ˆ b i with exp ected s quared error (ignoring the randomness in ˆ a i and ˆ b i ) of d V ar ( ˆ q i EBW | Y i ) = ˆ q i EBW (1 − ˆ q i EBW ) n + ˆ a i + ˆ b i + 1 . Deﬁne an aﬃn it y m easure, ν i = ˆ a i + ˆ b i . In our examples, it generally happ ens that the mean of the emp irical Ba y es prior f or m arker i , µ i = ˆ a i ν i is nearly the same as ˆ p i , and has little sampling v ariabilit y b ecause of the large num b er of m ark ers falling in to J . The v ariance of the pr ior can b e written as σ 2 i = µ i (1 − µ i )( ν i + 1) − 1 , wh ic h decreases as ν i increases. In other w ord s, when ν i is large, the b o oster sample exerts greater inﬂuence on the p osterior estimate of ˆ q i EBW . As w e illustrate in the Examples, ν tends to b e larger wh en the b o oster sample is biological ly m ore closely related to the target p op u lation; hence, the p ro cedure adaptiv ely incorp orates the b o oster samples. 2.3. Par ameterize d estimates. In the windo wed ve rsion w e selected a and b to maximize the likelihoo d of the data in eac h win do w. A more elega n t approac h seeks a parametric form that relates a and b to the conditioning information. By treating the data globally , th is approac h av oids the arbi- trary c hoice of windo w-wid th. Moreov er, it can readily handle the case in whic h we w ish to sim u ltaneously in corp orate multiple b o oster samp les (see Section 2.5 ). F or n o w, con tin uin g with the case of a s in gle b o oster sample, a simple ye t r easonably eﬀectiv e choi ce is EB1: a ( p ) = β 0 + β 1 p, (2.2) b ( p ) = β 0 + β 1 (1 − p ) . The emp ir ical Ba yes p rior of q with parameters, a and b, derived from th is mo del has a simp le in terpretation: “a p r iori” q follo ws a Beta distribution whic h repr esen ts a total of 2 β 0 + β 1 pseudo-count s; a baseline of β 0 coun ts are assigned to eac h allele; the ad d itional β 1 coun ts are allocated prop ortionally to the observed fr equencies in X . W e estimate the co eﬃcients in ( 2.2 ), β ’s, by maximizing the likelihoo d: Y j d BetaB inom ( n Y , a ( ˆ p j ) , b ( ˆ p j ))( y j ) , 8 M. COR AM AND H . T ANG where the d ep endency of a and b on th e β ’s is su ppressed . The mo del in ( 2.2 ) sp eciﬁes a linear m o del f or a and b, and in duces a symmetry condition: a ( p ) = b (1 − p ) . W e can add higher-order terms or terms that b reak th e symmetry , b ut in our analyses of r eal data, these terms do not cont ribute to signiﬁcan t impr o veme n ts in the lik eliho o d . Ho we v er, the lik eliho o d does improv e if the endp oints ˆ p = 0 and ˆ p = 1, whic h occur fairly often in our data, are tr eated as sp ecial cases. E B2 is a w ay to introdu ce add itional terms that treat these cases sym metrically: a ( p ) = β 0 + β 1 p + β 2 1 p =0 + β 3 1 p =1 , (2.3) b ( p ) = β 0 + β 1 (1 − p ) + β 2 1 p =1 + β 3 1 p =0 , where 1 is the in d icator fun ction. 2.4. Spline estimates. One migh t question whether the parametric forms used in the p revious section imp ose u nnecessary constrain ts. T o allo w a more ﬂexible mo d el of ˆ a and ˆ b, we use a B-spline expansion: a ( p ) = N X j =1 N j ( p ) θ j , (2.4) b ( p ) = N X j =1 N j (1 − p ) γ j , where the N j ( x ) are an N-dimensional set of basis fun ctions for cubic B- splines on [0, 1]. W e can imp ose the symmetry condition, a ( p ) = b (1 − p ) , by taking θ j = γ j for all j . 2.5. Multiple b o osting samples. Several genomics pr o jects ha ve sur v eyed div ers e p opulations. O u r empirical Ba ye s appr oac h generalizes naturally to suc h situations, incorp orating empirical fr equencies f rom sev eral b o osting samples (multiple p ’s) in estimation of the q ’s. F or example, the p arametric mo del EB1 for m ultiple b o osting samples is a ( p (1) , . . . , p ( K ) ) = β 0 + K X k =1 β k p ( k ) , (2.5) b ( p (1) , . . . , p ( K ) ) = β 0 + K X k =1 β k (1 − p ( k ) ) , where p ( k ) denotes the allele frequen cy in b o osting sample k . Similarly , a spline-based mo del is a ( p (1) , . . . , p ( K ) ) = K X k =1 N X j =1 N j ( p ( k ) ) γ k j , EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 9 (2.6) b ( p (1) , . . . , p ( K ) ) = K X k =1 N X j =1 N j (1 − p ( k ) ) γ k j , 2.6. R emarks. 2.6.1. An aﬃnity me asur e. F or the window ed estima te, w e deﬁne ν to b e th e median of ˆ a + ˆ b ov er all w in do ws ; for EB1, we deﬁne ν as 2 ˆ β 0 + ˆ β 1 . F or the spline v ersion, EB3, we d eﬁ ne ν by R p ˆ a ( p ) + ˆ b ( p ) dp . In all situations, ν h as a simple in terpr etation as the eﬀectiv e sample size of the b o osting sample. As illustrated in the next section, ν reﬂects the genetic asso ciation b et w een the tw o p opulations. 2.6.2. It can b e reasoned th at, if p op u lations X and Y are essen tially iden tical, our approac h is asymptotically equ iv alen t to p o oling the genot yp e data d irectly . T o v erify this is h o w our approac h b ehav es, we reconsider th e windo w ed estimate. Let p j ∼ Bet a( a 0 , b 0 ) for all j . F or mark er i, with ˆ p i = x i /n X , consider all mark ers j, s.t. ˆ p j ∈ ˆ p i ± δ . As δ → 0 , this is the set of mark ers with X j = x i . Conditioning on X j = x i , the p osterior d istribution of p j is Beta ( a 0 + x i , b 0 + n X − x i ) . S in ce X and Y are identic al p opulations, p j = q j for all markers. It f ollo ws that Y j | X j = x i ∼ BetaBinom( n Y , a 0 + x i , b 0 + n X − x i ) . Because the MLEs, ˆ a and ˆ b, are consistent as the n u m b er of alleles b ecomes inﬁnite, w e ha ve ˆ a → a 0 + x i and ˆ b → b 0 + n X − x i , so that ˆ q EB i = Y i + ˆ a n Y + ˆ a + ˆ b . = Y i + a 0 + x i n Y + a 0 + b 0 + n X . In other w ords, the empirical Bay es estimator is equiv alen t to directly p o oling data, w ith an additional shrin k age tow ard the prior pseud o prop or- tion, a 0 / ( a 0 + b 0 ). 2.6.3. In describing our approac h, w e ha ve made thr ee commonly adopted assumptions: that eac h marker satisﬁes the Hardy–W einberg equilibr ium, that pairs of m ark ers are in link age disequ ilibr ium, and that th e genome is under neutral ev olution. The Hardy–W ein b erg equilibriu m (HWE) in a p op- ulation requir es only one generation of rand om mating; empir ically , there is v ery w eak evidence for systemat ic d eviation from HWE, ev en in stratiﬁed (historically nonrandom mating) popu lations such as the Mexicans or the Puerto Ricans [ Choudh r y et al. ( 2006 )]. In fact, HWE is often used as a 10 M. COR AM AND H . T ANG diagnosis for genot yping error [ Y onan, P almer and Gilliam ( 2006 )]. The as- sumption of link age disequ ilibrium is required here s o that the Beta-Binomial lik eliho o d fun ction can b e m u ltiplied across mark er s . Includin g tigh tly linked mark ers (i.e., correlated ge not yp es) in the estimation of the Beta parame- ters mak es the eﬀectiv e sample size somewhat smaller than the n ominal one, but should n ot int ro duce systematic b ias, as long as mark ers are relativ ely ev enly distrib u ted across th e genome. Finally , ev olutionary neu tralit y w ar- ran ts that the diﬀerence in allele frequencies b et ween p opu lations are du e to genetic drift and not directional selection. Stron g d ir ectional selection m a y create a situation in whic h allele f r equencies are similar in t wo p opu lations at most lo ci, except at a few lo ci wh ere selection results in large allele fre- quency discrepancy . While there is eviden ce that v arious parts of the h uman genome ha v e b een su b jected to recent p ositiv e selectio n [ Sab eti et al. ( 2006 ) and V oight et al. ( 200 6 )], the selection co eﬃcien ts are lik ely lo w [ Kim u ra ( 1968 )]. F ur ther, ev olutionary n eutralit y h olds in a large pr op ortion of the genome, in particular, nonco d ing SNPs and syn on ym ous SNPs . Besides, in the unlikely scenario that strong selection leads to divergence b et ween t wo p opulations at a genome-wide scale, our approac h will not pro duce seve rely biased frequ ency estimates, b ecause the aﬃnity measure w ill b e lo w and, therefore, the b o oster data are not allo w ed to inﬂu ence the estimates su b- stan tially . 2.6.4. One ma y question the appropriateness of the Beta p rior w e as- sume, which is conv enien tly th e conju gate prior for a Binomial d istribution [ Sk ellam ( 19 48 )]. Under selectiv e neu tralit y and f or p opulations that hav e reac hed an m utation-drift equ ilibrium, W right ( 1951 ) sho wed that the al- lele frequencies at bi-allelic loci follo w a b eta distribu tion. Visu al insp ec- tion ind icates that the empirical allele frequencies of the HapMap Eur o- p ean samp les follo w a Beta distribu tion reasonably well, although there are sligh t excess rare alleles, that is, frequ encies near 0 or 1. While it ma y b e p ossible for a diﬀeren t prior to somewhat improv e the ﬁt, it is diﬃcult to ﬁnd a p erfect prior. This is b ecause the SNPs genot yp ed in a pro j ect sel- dom r epresen t a rand om sample of all p olymorphic sites, and the ascertain- men t b ias distorts the u nderlying frequency sp ectrum. When the ascertain- men t p ro cedure is kno w n, it is p ossible to correct th e bias [ Nic holson et al. ( 2002 ) and Nielsen, Hubisz and Clark ( 2004 )]. Unfortun ately , the ascertain- men t schemes are often so complex that it is diﬃcult to correct for the bias [ Clark et al. ( 2005 )]. 2.6.5. W e are now in a go o d p osition to exp lain the diﬀerence b et ween our m etho ds and the empirical Ba y es approac h of Lange ( 1995 ), whic h aims to impr o ve the fr equ ency estimate at a single marker using genot yp es from m u ltiple p opulations. This metho d m o dels th e allel e frequ encies at the EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 11 mark er in v arious p opulation as indep endent draws from a single prior d is- tribution, whic h is c h osen to maximize the like liho o d of the observed allele coun ts in all p op u lations. The p osterior mean represents a shrin k age to- w ard the p o oled p opulation a ve rage. I n contrast , our metho d is closer to a regression mo del; the frequencies in the target p opulation are mo deled con- ditionally on the b o osting p opulation frequencies. T h us, it b orrows str ength from the information in the frequencies in the b o osting p opu lation to b et- ter estimate their corresp ondin g frequencies in the target. Y et only those p opulations wh ose frequencies are th ough t to b e informativ e ab out th e fre- quencies in the target are used . It co nsiders the set of all markers with a giv en f r equency in the b o osting p opulation to b e the collection of mark ers that captures the p ertinent conditioning information. The hierarchica l Ba yesian approac h tak en b y Lo c kwoo d , Ro eder and Devlin ( 2001 ) can b e exp ected to b ehav e like our estima tor in m an y resp ects, al- though coming to this esti mate from a diﬀerent dir ection. Sp eciﬁcally , the hierarc h ical structure mo dels the p opulation-sp eciﬁc allele frequency by a lo cus-sp eciﬁc Diric hlet distribu tion, so that the s u m of the Diric h let pa- rameters control s the dive rgence betw een p opulations, and is th eoretically motiv ate d by W right ’s F st [ W right ( 1951 )]. The Diric hlet parameters are al- lo we d to v ary across lo ci, but the hierarchica l stru cture of the mo d el b orr o ws strength fr om all lo ci to d etermine, roughly sp eaking, the ov erall d egree of div ergence. A t ea c h lo cus, the allele frequency in eac h p opu lation is mo d - eled as a “symmetric” departu r e fr om that of an imp licit ancestor. T he sp eciﬁc form is reasonable so long as the p opu lations are relate d thr ough a star-shap ed ph ylogen y . Ho w ever, it would appear inappropriate to treat the p opulations in this symmetric manner if some of th e p opulations are more closely related than others. In con trast, our approac h do es not imp ose an underlying p op u lation history mo del. Instead, it endeav ors to b e ﬂexi- ble and d ata-adaptiv e for the pu rp ose of estimating allele frequencies in th e target p op u lation and estimating the eﬀectiv e genetic association b et ween the target and eac h of the b o oster samples. 2.6.6. In the examples b elo w , w e “orien t” our data so that the “ma j or” allele wh ose frequency w e are estimating corresp ond s to the alphab etically lesser nucleotide (A < C < G < T) as it w ould o ccur on the p ositiv e strand of the chromosome. As this orien tation treats all markers equitably , it is not to o surprising that the frequencies, th us deﬁned, are qu ite symmetrically distributed ab out 1 2 . If the data is orien ted based on other in formation, for example, on the basis of th e allele present in the referen ce sequence of the h uman genome, the symmetry is brok en . It is also more “neutral.” F or example, if an oracle were to orient the data so that the allele that is m ore lik ely in, sa y , Europ eans w as alw ays the ma jor allele, this actually pro v id es extra information (e.g., y ou would know not to estimate a frequ en cy 12 M. COR AM AND H . T ANG b elo w 0.5 for E urop eans on an y allele), b ut this d o es not treat diﬀeren t p opulation group s equitably (e.g., African p opulation frequ en cies w ouldn’t ob ey this ru le). Th e fair wa y to introd uce this extra information, we argue, is to act ually pro vide the genot yp e data that your orientati on is based on as a (p oten tial) b o oster sample. 3. Results. In this section we examine the p erformance of the empir- ical Ba yes estimates usin g genot y p e d ata collect ed from the In ternational HapMap Pro j ect [ In ternational HapMap C on s ortium ( 2005 )] and by the Per- legen Sciences [ Hinds et al. ( 2005 )]. F or the HapMap data, we used th e geno- t yp e data (Pu blic Release #21) fr om un related ind ividuals rep r esen ting f our ethnic p opulations; these include the follo win g: 60 Y orub a in Ibadan, Nigeria (YRI), 60 U.S. r esid en ts with ancestry from north er n and western Europ e (CEU), 45 Han Chinese in Beijing, C hina (CHB), and 45 Japanese in T okyo, Japan (JPT). Data f rom the P erlegen Pro ject includes 24 Europ ean Ameri- cans, 23 African Americans and 24 Han Chin ese from the Los Angeles area. Because our goal is to develo p this statistical approac h, w e on ly u sed SNPs on c hromosome 22, lea ving ∼ 55 , 000 S NPs in the HapMap data and ∼ 20 , 000 SNPs in the P erlegen data, of wh ic h ∼ 14 , 800 o v erlap. 3.1. Bo oster sample and tar get sample r epr esent identic al p opulations. Our ﬁrst exp eriment examines the p erf ormance of th e empirical Ba yes es- timates, wh en the p opulations represente d by the target sample and the b o oster sample coincide. T o do so, w e r an d omly split eac h p opulation sam- ple in HapMap into t wo sets of approximate ly equal size, and treat one set as the target sample and the other as a b o oster. Int uitiv ely , if we knew that the t wo samples came from the same p opulation, the most eﬃcien t frequen tist estimator is to simply compu te the MLE usin g the p o oled genot yp e data. The equiv alen t emp ir ical Ba y es estimator w ould use ˆ a i = x i and ˆ b i = n X − x i . In other wo rds, the prior w ould use the observ ed allele counts in the b o oster sample as the pseud o-counts. T able 1 summarizes the results for eac h p op- ulation, demonstrating that th e emp irical Bay es approac h es, b oth EB1 and EB2, appr o ximately reco ver the p o oling estimator on this d ata. F or exam- ple, in terms of EB1, th e “p o oling-equiv alen t” empir ical Ba y es pr ior uses β 0 = 0 and β 1 = n X so that ˆ a ( ˆ p i ) = 0 + n X ˆ p i and ˆ b ( ˆ p i ) = 0 + n X (1 − ˆ p i ) . As sho w n in T able 1 , the estimat ed coeﬃcients ( ˆ β 1 ) are v ery close to n X so that EB1 and EB2 do indeed b eha v e m uch lik e p o oling on this example, as one wo uld hop e. Instead of using 0 baseline p seudo-coun ts, though, the lik eliho o d criterion selects small β 0 v alues near 0 . 10. This is sensible sin ce a Beta (0 . 1 , 0 . 1) distribu tion appro xim ates the un conditional distribution of allele fr equ encies in these p opulations. EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 13 T able 1 Par ametric estimators for HapMap p opulations, when the tar get and b o oster p opulations c oincide. EB1 r efers to the mo del in ( 2.2 ); EB2 r ef ers to the mo del in ( 2.3 ) EB1 EB2 p op n Y n X β 0 β 1 β 0 β 1 β 2 β 3 YRI 60 60 0.13 58.19 0.6 3 58.62 − 0.55 − 26.2 6 CEU 60 60 0.11 58.90 0 .59 58.87 − 0.51 − 19 .64 CHB 44 46 0.10 46.36 0.5 3 46.27 − 0.46 − 13.6 2 JPT 44 46 0.08 44.60 0.65 44.69 − 0.60 − 17 .97 Fig. 3. Estimate d p ar ameters for the empiric al Bayes priors as a f unction of the ob- serve d fr e quency in the b o oster sample. T ar get sample c onsists of 15 HapMap Jap anese individuals; b o oster sample c onsists of 45 HapMap Chinese individuals. Blue lines ar e p a- r ameters a in the Beta prior; r e d lines ar e b ; gr ay l ines ar e a + b, which i s a lo c al m e asur e of aﬃnity b etwe en the tar get and b o oster p opulations. He avy solid lines ar e p ar ameters for spline mo del, EB3; thin lines ar e p ar ameters for windowe d implementation; dotte d li nes ar e p ar ameters for line ar mo del EB1. 3.2. HapMap Chinese b o osts HapMap Jap anese. In the second exp eri- men t w e use the HapMap Chinese s ample (CHB) to b o ost frequency est i- 14 M. COR AM AND H . T ANG mates for the Japanese (JPT). As a “v alidation” sample, w e set aside 30 JPT ind ividuals, and app ly empirical Bay es esti mators on 15 JPT and 45 CHB. Figure 3 plots the estimated parameters ˆ a and ˆ b using w indo wed, parametric and splin e v ariatio ns of the emp ir ical Ba yes estimators. Visual insp ection suggests that these d iﬀeren t mo dels pr o duce similar emp irical Ba y es p riors. F or the linear estimator, EB1, the estimated aﬃnit y m easure is ν = 36 . 9 6 ( β 0 = 0 . 0 38 , β 1 = 36 . 88) substan tially lo w er than the nominal 90 alleles. S ince the previous example suggests that the empirical Ba y es esti- mators will ap p ro ximate dir ect p o oling wh en the b o osting samp le repr esen ts the same p opulation as the target, the discrepancy b et ween empirical Ba y es estimators and direct p o oling indicates a non trivial genetic heterogeneit y b et w een th e Ch inese and Japanese. P o oling th ese samples wo uld not b e jus- tiﬁed, but the high aﬃnity means th at our metho d can still b orr o w a large amoun t of stren gth. Since our goal is to improv e the Japanese allele frequency estimates, we compute the m ean squ ared err or (MSE) for the v arious estimators, treating the MLE of allele frequen cies in the “v alidation” sample as an imp erfe c t gold standard. As we explain in the App endix, calculating the MSE b y treating the v alidation sample as a p erfect gold standard pro d u ces an u p w ard bias, whic h can b e corrected. In what follo w s w e rep ort the corrected MSE. Using the MLE of allele fr equencies on the 15 JP T alone, the MSE is 2 . 83 × 10 − 3 . In con trast, EB3 pro d uces an MSE of 1 . 273 × 10 − 3 , ac hieving a 55% error reduction without an y a priori assumption of p opulation h omogeneit y . The EB1 mo del ﬁts essentially j ust as well (MSE of 1 . 279 × 10 − 3 ), itself only slightly b etter than the wind o wed estimate (MSE of 1 . 282 × 10 − 3 ). Accordingly , w e fa vo r th e simple and relativ ely in terp retable EB1 mo d el f or general application. In this case, if one simply treats the Chinese and Japanese in dividuals as sampled from a homogeneous p opulation and computes the MLE on the p o oled sample of 60 individuals, the corrected MSE is redu ced to 1 . 51 × 10 − 3 , not so m u c h higher than EB3. Ho wev er, it d o es not alwa ys w ork out this well for the p o oled estimate: if the p opulations are less closely asso ciated, the bias introdu ced from p o oling can o verwhelm the v ariance reduction (e.g., if Eur op eans are p o oled in w ith African Americans to “improv e” their fre- quency estimates; see S ection 3.3.1 ). Moreo v er, in p ractice, few inv estigat ors w ould feel comfortable p o oling heterogeneous p opulations su c h as C h inese and Japanese. Ou r metho d has the adv an tage that it automatica lly obtains the r igh t degree of p o oling, and thus allo ws us to b orr o w in f ormation from a bo oster sample even when we know the samples r epr esent heter o gene ous p opula tions. 3.3. Simulate d data. In order to accurately assess the MSE, b ias and v ariance in a situation where we ha v e a true gold standard, w e p erform a EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 15 sim u lation exp eriment . The data consists of 55 , 000 mark ers, whose frequen- cies in X , p, are dra wn indep endent ly from Beta(0 . 198 , 0 . 198); conditional on p, the corresp ondin g frequencies in Y , q , are dr awn from Beta( a ( p i ) , b ( p i )) , where a and b follo ws the mo del of ( 2.2 ). T he co eﬃcients, β ’s, used in ( 2.2 ) are c hosen so that the join t distrib ution of ˆ p and ˆ q approximate s that of the HapMap Chinese and Japanese. Giv en p i and q i , w e next generate genot yp e data b y sampling Y i ∼ Binom (30 , q i ) , and X i ∼ Binom(90 , p i ) . The scatt er plot of the sim u lated data is s ho wn in Figure 1 (b), and resem b les the Chinese v.s. Japanese plot [Figure 1 (a)] in b oth marginal and joint distr ib utions. Figure 4 disp la ys the estimated a and b usin g v arious method s. W e see that they are qualitativ ely s im ilar. T o compute MS E , bias and v ariance, we use the und erlying f requencies in p opu lation Y as the gold standard . Th e MSE usin g the 30 observed alleles from the target s ample is 2 . 3 × 10 − 3 ; b y directly p o oling all 120 observ ed allele s, the MSE is 1 . 2 × 10 − 3 , w h ile the MS E usin g EB3 is 1 . 0 × 10 − 3 . Figure 5 sh o ws that th e MSE using the b o osting samp le (red curve) can b e subs tan tially low er than the MLE usin g samples from Y alo ne (blac k cur ve), esp ecially f or frequencies n ear 0 . 5. The bias-v ariance decomp osition, also sho w n, in dicates that the bias introd uced b y the empirical Bay es pro cedu r e is qu ite small. 3.3.1. A dmixe d p opulatio ns. Estimating allele frequencies in an admixed p opulation oﬀers an int eresting ap p lication of our metho ds. An admixed p opulation arises when repro ductiv ely isolated ancestral p opulations mate, pro du cing oﬀspr ings whose genome repr esen ts a mixtu r e of alleles f rom mul- tiple ancestral p opu lations. Tw o of the largest m in orit y p opulations in the U.S. are b oth recen tly admixed: African Americans are largely an African group with recen t Eur op ean admixture, and the Hispanics represen t v ari- ous degrees of mixing among Nativ e Americans, Europ eans and Africans. Increasing num b ers of genetics studies are fo cusin g on one of these adm ixed p opulations. Intuitiv ely , th e frequ encies in the admixed p opulation (e.g., African Americans) resem ble a w eighte d a verag e of th e corresp onding fr e- quencies in the ancestral p opulations (e.g., Europ eans and Africans). T h ere- fore, existing genot yp e data on the ancestral p opulations sh ould pr ovide information on the allele frequencies in an admixed p opulations. Here w e illustrate this idea using the HapMap YRI and CEU samples to improv e African American allele frequ en cies estimated from the Perle gen data. W e u s e 12 of the Perlege n African American individuals to estimate ˆ q , w hile reserving the other 12 individuals as an imp erfe ct gold standard. The standard and common p ractice, whic h uses the 12 African Americans alone, p ro du ces an MSE of 13 . 2 × 10 − 3 . T h ere are several wa ys to incor- p orate HapMap samples: using CEU alone, EB1 estimates ( β 0 , β 1 ) to b e (0 . 66 , 4 . 45); u sing YRI alone, EB1 estimate s ( β 0 , β 1 ) to b e (1 . 48 , 34 . 29) . These resu lts indicate that the YRI p opulation has higher aﬃnit y to the 16 M. COR AM AND H . T ANG Fig. 4. Estimate d p ar ameters for the empiric al Bayes priors as a function of the observe d fr e quency i n the b o oster sampl e. The tar get and b o oster samples c onsist of 30 and 90 simulate d al leles, r esp e ctively. Blue l ines ar e p ar am eters a in the Beta prior; r e d lines ar e b ; gr ay l ines ar e a + b, which is a lo c al m e asur e of aﬃni ty b etwe en the tar get and b o oster p opulations. He avy soli d l ines ar e p ar ameters f or spli ne mo del, EB3; thin li nes ar e p ar ameters for windowe d im plementation; dotte d lines ar e p ar ameters f or l i ne ar mo del EB1. African Americans. It is imp ortan t to note that n aiv ely p o oling the CEU and the African American samples w ill incr e ase the MSE. Wh en we hav e b o oster samples from b oth YRI and CEU, our metho d of c hoice is mo del ( 2.5 ), w hic h allo ws us to in corp orate YRI and CEU sim u ltaneously , the p a- rameter estimates are ( β 0 = 1 . 2 2 , β YRI = 64 . 37 , β CEU = 18 . 90). I n terestingly , the fr action, β CEU β YRI + β CEU = . 23 , resem bles previously estimated Europ ean an- cestry in the Afr ican Americans in the literature [ P arr a et al. ( 1998 )]. The parameter estimates and th e MSE ev aluated using the remaining 12 Afr ican Americans are sh o wn in T able 2 . This example highligh ts the ad v antag e of the empirical Bay es metho d o ve r MLE based on p o oling: ﬁr st, if we had only a Eur op ean b o oster s am- ple, we wo uld b e b etter oﬀ not using it at all than naiv ely po oling it. The empirical Ba y es metho d estimates a v ery lo w aﬃnit y of 5 . 65; b etw een the t wo extremes, this is closest to ignoring the b o oster sample. Y et it do es not EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 17 Fig. 5. MSE (thick soli d), bias 2 (dashe d) and varianc e (dash-dot ) of the empiric al Bayes estimates on the simulate d data pr esente d in Se ction 3.3 . The thin line r epr esents the exp e cte d MSE of the MLE using the tar get sample alone. ignore the C EU samp le completely , th er eby still achieving an 18% reduction in MSE. Second, n o matter what th e b o oster sample is (CEU alone, YRI alone, or CEU and YRI), EB achiev es a smaller MSE than either naiv ely p o oling or ignoring the same b o oster sample(s). Finally , we note that the inﬂuence of b oth CE U and YRI are greater in the presence of t wo b o oster samples th an when ther e is only one b o oster s amp le, and the minim um MSE is ac hiev ed usin g tw o b o oster samples s im u ltaneously . L o osely sp eaking, w e T able 2 Co eﬃcient estimates and MSE c omp arison for estimating fr e quencies in Afric an Amer ic ans. T ar get sample (c olumn Y) c onsists of 12 Perle gen Afric an Americ an individuals; b o oster samples (c olumn X) c onsist of 60 CEU and 60 YRI indi viduals fr om HapMap. MSEs ar e c ompute d using 12 i ndep endent Afric an Americ an individuals as a gold standar d, with the bias c orr e ction describ e d in the App endix. T he c olumn lab ele d MSE MLE is b ase d on naively p o oling genotyp e data fr om samples X and Y, r e gar d l ess of p opulation origin. T he c olumn lab ele d MSE EB is b ase d on e quation ( 2.5 ) X Y β 0 β CEU β YRI MSE MLE MSE EB Af. Am — — — 6 . 24 × 10 − 3 CEU Af. Am 0.65 4.34 — 22 . 48 × 10 − 3 5 . 13 × 10 − 3 YRI Af. Am 1.47 — 34.51 3 . 35 × 10 − 3 2 . 47 × 10 − 3 YRI and CEU Af. Am 1.22 18.90 64.37 4 . 15 × 10 − 3 1 . 15 × 10 − 3 18 M. COR AM AND H . T ANG strik e the r igh t balance in w eighing the b o oster samples, and at the same time extract more in formation from eac h relev an t b o oster sample. Most im - p ortant ly , we ac hiev e such b alance and optimalit y without assuming any kno wn genetic relationship among the target and the b o oster samples. 4. Discussion. Estimating allele f requencies is a basic y et imp ortan t step in many genetic studies. F or example, p opu lation-sp eciﬁc allel e f requency is the essent ial sou r ce of information used in a series of analyses, whic h inferr ed genetic structure, as well as co rrelation b et w een spatial patte rn of genetic v ariation and geograph y , in the Human Genome Dive r sit y Pro ject—Cent re d’Etude du Po lym orphisme Humain (HGDP-CEPH) Human Genome Div er- sit y Panel [ Ramac handr an et al. ( 2005 ) and Rosen b erg et al. ( 2002 , 2005 )]. The HGDP-CEPH consists of 1000 ind ividuals r ep resen ting 52 world-wide p opulations, man y of whic h are represen ted by 10–25 individu als. In suc h small samp les, p opu lation-sp eciﬁc allele frequen cy estimates are sub ject to substanti al sampling errors; as a resu lt, the inf erred genetic clustering pat- tern wa s u n stable esp ecially for p opu lations with small samples [ Rosenb erg et al. ( 2005 )]. In link age analysis of extended p edigrees, wher e not all relev an t mem b ers of a p edigree ha ve b een genot yp ed, p opulation allele fr equency is r equ ired to infer identit y b y descent (IBD) information [ Risc h ( 1900 ), W eeks and Lange ( 1998 ) and Lo c kw o o d, Ro eder and Devlin ( 2001 )]. Lik ewise, accurate allele frequen cy estimates p la y an imp ortant role in a whole-genome case-co n trol asso ciation study (W GA), which compares the allele f r equency b et wee n a group of aﬀected cases and a grou p of unre- lated health y control s. Numerous W GA are underwa y , with a samp le size on the order of thousands of su b jects, ea c h genot yp ed at 10 0,000 –500, 000 SNPs. Because of th e need to correct for multiple comparisons, large samp les are r equ ired for detecting risk alleles that are rare or hav e mo derate eﬀect [ Hirsc hh orn and Daly ( 2005 ) and W ang et al. ( 2005 )]. While in m ost ongo- ing case-con trol studies, the o veral l sample size is c hosen to ac hiev e goo d statistica l p ow er, stratiﬁed analyses are often p erformed on muc h smaller subsets of participan ts repr esen ting minorit y groups. Th u s , estimating allele frequency from a small sample size r emains a concern in practice. In this pap er we pr op osed an empirical Ba y es approac h, whic h enabled us to impro ve p opulation-sp eciﬁc allele frequency estimates by adaptivel y incorp orating genot yp e data from r elated p opulations. The ﬂexibilit y and computational eﬃciency of our approac h allo ws it to b e incorp orated in ex- isting genetic data analyses. In the con text of case-con trol asso ciation stud- ies, although the approac h we pr op ose not d irectly addr ess the hyp othesis testing pr oblem, it can b e furth er develo p ed to do so. Po w ered by the n ew generation of high-throughput p latforms, w e exp ect a blo om in genot yp e data fr om diverse p opu lations. When genotypes fr om ad d itional unaﬀected EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 19 individuals are a v ailable fr om an external sour ce, we can apply our metho d and reduce the uncertain t y in the fr equencies of the con trol samp les. Our approac h diﬀers from existing approac hes in sev eral imp ortan t as- p ects. First, since our goal is to improv e the allele frequ en cy estimates in a sp eciﬁc target p opulation, we do n ot treat all p opulations symmetrically . Of course, if w e w ant to improv e the estimates in all p opulations, w e can simp ly apply our metho d to target eac h p opulation in turn. Second, b y concentrat - ing on allele f requency estimation, our metho d do es n ot r equire assuming an underlying genealogy or a common ancestral p opulation. Th is is attractiv e b ecause genealogy (or coalesce n t)-based approac hes either are r estricted to analysis of data from a genetic region with n egligible recom bination (suc h as Y-c hromosome or m tDNA), or r equire hea v y computation on el ab orate ancestral recom bination graphs [ Nordb org ( 2001 )]. Th u s, coal escen t-based approac hes are n ot easily app licable to data of genomic scale. Lik ewise, mo deling a common ancestral p opu lation either requires a fu ll genealogica l approac h as describ ed ab ov e, or m aking simplifying assumptions such as a star-shap ed genea logy . In con trast, by a voiding explicit mod eling of full p opulation history , our app roac h is not only computationally eﬃcient, b ut also is more robust to unkn o wn and complex demographic h istory . Throughout this pap er w e ha v e considered estimation with resp ect to a squared- error loss. In pr actice, one migh t b e more concerned, sa y , ab ou t prop ortional errors in frequency; tak en to the extreme, this requires that sp ecial attent ion b e paid to rare alleles. In the examples w e consider, we ﬁnd that a sin gle b eta-binomial mo del ﬁ ts the sampling distrib ution of the target frequ encies reasonably well. So long as this remains the case, one can con tinue to compute the p osterior in the mann er w e describ e b ut ma y wish to consider, for example, the p osterior median as an alternativ e to the p osterior mean. By this choic e, one m inimizes the p osterior L 1 loss on b oth the frequency as well as the log-odd s scales. On the other hand, if it tur ns out that the sampling distribu tion of rare allele’s frequencies are not adequ ately approxima ted by a b eta-binomial, it wo uld b e sensib le to consider a generalization of this w ork in which, instead of ﬁtting family of parameterized b eta d istributions, one attempted to ﬁt a mixtur e of b eta’s, so that the extra comp onent can make a more r eﬁned m o del for the app earance of rare alleles. Our examples using real genetic data suggest that incorp orating addi- tional b o osting samples can often s u bstant ially im p ro ve the fr equency esti- mates b y int ro ducing only a small degree of bias in exc h an ge for the v ariance reduction. Ho wev er, the impro v ement in the estimation describ es th e b ehav- ior of the estimate a v eraged o v er all SNPs. There ma y b e a sm all n u m b er of SNPs whose allele frequen cies diﬀer substant ially b et ween p opu lations. Therefore, as a w ord of caution, we p oin t out that the app roac h we prop ose here may not b e appropr iate for some app lications. F or example, if on e’s 20 M. COR AM AND H . T ANG goal is to detect SNPs whose frequencies diﬀer signiﬁcan tly b etw een t wo p opulations, then using eac h p opulation to b o ost the other tends to shrink the o v er all allele frequency diﬀerence. I n future r esearc h, we plan to dev elop extensions to detect suc h SNPs. Empirical Ba yes app roac hes, pioneered by Robbin s ( 1964 ), oﬀer a natur al and uniﬁed framew ork f or incorp orating auxiliary information. W e hop e that the promising results w e rep ort here will inspire fur th er dev elopmen t for analyzing the imp ending large genot y p ing studies. APPENDIX In our real-data examples the MSE is computed using a gold s tand ard that is imp erfect; w e now sho w that this results in an upw ard bias compared to the true MSE. W e can estimate the bias and then su btract it to yield unbiased MSE estimates. Let ˆ q b e the estimator w hose MSE w e are estimating, let ˆ q v al denote the allele f r equency in an indep endent v alidation sample (imp erfect gold s tandard), an d let ] MSE = 1 N P i ( ˆ q i − ˆ q v al i ) 2 denote the asso ciated MSE estimate usin g the imp erfect gold stand ard . Under the assumed statistical mo del, the ˆ q v al i ’s are indep endent of the data u s ed for estimation, and are unbiase d estimates of the true frequencies q i . T o compute the bias of ] MSE , add and su btract q i , expand the square, and tak e exp ectations: E ] MSE = E 1 N X i ( ˆ q i − q i + q i − ˆ q v al i ) 2 , = E 1 N X i [( ˆ q i − q i ) 2 + ( ˆ q v al i − q i ) 2 + 2( ˆ q i − q i )( q i − ˆ q v al i )] , (A.1) = E MSE + 1 N X i V ar ( ˆ q v al i ) , where the cross term in ( A.1 ) h as exp ectation zero, b eing a pr o duct of in- dep end en t factors, of wh ic h the second has exp ectatio n 0 b ecause ˆ q v al i is unbiase d. The v ariance of ˆ q v al i is q i (1 − q i ) /n v al , d ue to binomial v ariation. A ttempting a plug-in estimate for this v ariance, we observe that E ˆ q v al i (1 − ˆ q v al i ) = q i − E ( ˆ q v al i ) 2 = q i − q 2 i − [ E ( ˆ q v al i ) 2 − q 2 i ] = q i (1 − q i ) − V ar ( ˆ q v al i ) = q i (1 − q i )(1 − 1 n v al ). Th erefore, an unbiased estimate of V ar ( ˆ q v al i ) is ˆ q v al i (1 − ˆ q v al i ) / ( n v al − 1). Consequentl y , an unbiased estimate of the MSE is ] MSE − 1 N X i ˆ q v al i (1 − ˆ q v al i ) n v al − 1 . EMPIRICAL BA YES ESTIMA TE OF ALLELE FREQUENCY 21 REFERENCES Bernardo, J.-M. and Smith, A. F. M. (1994). Bayesian The ory . Wiley , Chichester. MR1274699 Choudhr y, S. , Coyle, N. E. , T an g , H . , Salari, K. , Lind, D. , C lark, S. L. , Tsai , H.-J. , Na qvi, M. , Phong, A. , Ung, N . , Ma t allana, H. , A v ila, P. C. , Ca sal, J. , Torres, A. , Na zario, S. , Castro, R . , Ba ttle, N. C. , Perez -S t able, E. J. , Kwok, P.-Y. , Shepp ard, D. , S hriver, M. D. , Ro driguez-Ci ntro n, W . , Ri sch, N. , Ziv, E. and B urchard, E. G. ( 2006). Population stratiﬁcation confounds genetic association studies among Latinos. Hum. Genet. 118 652–664. Clark, A. G . , Hubisz , M. J. , Bust amante, C. D. , Williamson, S. H. and Nielsen, R. (2005). Ascertainment bias in studies of human genome-wide p olymorphism. Genome R es. 15 1496–15 02. Devlin, B. and Roeder, K. (1999). Genomic control for association stu d ies. Biometrics 55 997–1004. Fisher, R. (1922). On t he d ominance ratio. Pr o c. R oy. So c. Edinbur gh 42 321–341. Hinds, D. , Stuve, L. , Nilsen, G. , Halperin, E. , Eskin, E. , Ballinger, D. , Frazer, K. and Cox, D. (2005). Whole-genome p atterns of common DN A v ariation in three human p op u lations. Scienc e 307 1072–1 079. Hirschhorn, J. N. and Dal y, M. J. (2005). Genome-wide association studies for common diseases and complex t raits. Nat. R ev. Genet. 6 95–108. Interna tional Hap Map Consor tium . (2005). A haplotype map of the human genome. Natur e 437 1299–1320. Jiang, C. J. and Cockerham, C. C. (1987). Use of the multinomial Dirichlet mo del for analysis of sub divided genetic p opulations. Genetics 115 363–366. Kimura, M. (1968). Evolutionary rate at the molecular level . Natur e 217 624–62 6. Kit ada, S. , H a y ashi, T. and Kishino, H . (2000). Empirical Bay es proced ure for esti- mating genetic distance b etw een p opu lations an d eﬀectiv e p opulation size. Genetics 156 2063–2079. Lander, E. S. and Schork, N. J. (1994). Genetic dissection of complex traits. Scienc e 265 2037–2048. Lange, K. (1995). Applications of the diric h let distribution to forensic matc h probabili- ties. Genetic a 96 107–11 7. Lockw ood, J. , Roeder, K. and Devlin, B. (2001). A Bay esian h ierarc hical mod el for allele frequ encies. Genet. Epidemi ol. 20 17–33. Nicholson, G. , Smith, A. V. , J ´ onsson, F. , G ´ ust afsson, O. , Stef ´ ansson, K. and Donnell y, P. (2002). Assessing p opulation diﬀerentiation and isolation from single- nucleoti de p olymorphism data. J. R. Stat. So c. Ser. B Stat. Metho dol. 64 695–715. MR1979384 Nielsen, R. , Hubisz, M. J. and Clark, A. G. (2004). Reconstituting the frequency sp ectru m of ascertained single-n u cleotide p olymorphism data. Genetics 168 2373–2382 . Nordbor g, M. (2001). Coalescen t theory . In Handb o ok of Statistic al Genetics (D . Bald- ing, M. Bishop and C. Cannings, eds.) Chap ter 7 179–212. Wiley , Chichester. P arra, E. J. , Marcini, A. , A key , J. , Ma r tinson, J. , Ba tzer, M. A. , Cooper, R. , F orrester, T. , Allison, D. B. , De ka, R . , Ferre ll, R. E. and Shrive r, M. D. (1998). Estimating African American admix t ure proportions by use of p opulation- sp eciﬁc alleles. Am. J. Hum. Genet. 63 1839–185 1. Pritchard, J. and Rosenberg, N. (1999). Use of unlinked genetic markers to detect p opulation stratiﬁcation in association studies. Am. J. Hum. Genet. 65 220–22 8. Ramachandran, S . , Deshp ande, O. , Ro seman, C. , Rosenber g, N. , Feldman, M . and Ca v a lli-Sforza, L. (2005). Supp ort from the relationship of genetic and geographic 22 M. COR AM AND H . T ANG distance in human p opu lations for a serial found er eﬀect originating in Africa. Pr o c. Natl. A c ad. Sci. USA 102 15942–15947. Risch, N. (1900). Link age strategies for genetically complex traits. I I I. The eﬀect of marker p olymorphism on analysis of aﬀected relative pairs. Am. J. Hum. Genet. 46 242–253 . Ro bbins, H. (1964). The empirical Ba yes approach to statistical decision problems. Ann. Math. Statist. 35 1–20. MR0163407 Ro senberg, N. , Mahajan, S. , Ramachandran, S. , Zhao, C. , Pritchard, J. and Feldman, M. W. (2005 ). Clines, clusters, and the eﬀect of study design on the inference of human p opulation structure. PL oS Genet. 1 e70. Ro senberg, N. , Pritchard, J. , Weber, J. , Cann, H. , Kidd, K. , Zh i vot o v sky, L. and Feldman, M. W. ( 2002). Genetic structure of human p opulations. Scienc e 298 2381–23 85. Sabeti, P. C. , Schaffner, S. F. , Fr y, B. , Lohmueller, J. , V ari ll y , P. , Shamovsky, O. , P alma, A. , Mikkelsen, T. S. , Al tshuler, D. and Lander, E. S . (2006). Positi ve natural selection in t h e human lineage. Scienc e 312 1614– 1620. Skellam, J. G. (1948). A probability distribution derived from the binomial distribution by regarding t h e probability of success as va riable b etw een the sets of trials. J. R oy. Statist. So c. Ser. B 10 257–261. MR0028539 Vo ight, B. F. , Kudara v alli, S. , W en, X. and Pritchard, J. K. (2006). A map of recent p ositiv e selection in the human genome. PL oS Biol. 4 e72. W ang, W. Y. S. , B arra tt, B. J. , C la yton, D. G. and T odd , J. A. (2005). Genome- wide association studies: Theoretical and practical concerns. N at. R ev. Genet. 6 109– 118. Weeks, D. and Lange, K. (1998). The aﬀected - p edigree-member metho d of link age anal- ysis. Am. J. Hum. Genet. 42 315–3 26. Wilson, I. J. , Weale, M. E. and Baldi ng, D. J. (2003). Inferences from DNA data: P opulation histories, evolutionary pro cesses and forensic match probabilities. J. R oy. Statist. So c. Ser. A 166 155–201. MR19845 68 Wright, S. (1931). Evolution in medelian p opulations. Genetics 16 97–159. Wright, S. (1951). The genetical structure of p op u lations. Ann. Eugenics 15 323–354. MR0041413 Yonan, A. L. , P almer, A. A. and Gilliam, T. C. (2006). Hardy –W einberg disequi- librium identiﬁed genotyping error of th e serotonin transp orter (SLC6A4) promoter p olymorphism. Psychiatr. Genet . 16 31–34. Dep ar tment of Hea l th Research and Policy St anford Un iversity St anford, California 943 0 5 USA E-mail: mcoram@stanford.edu Dep ar tment of G enetics St anford University School of Medicine St anford, California 943 0 5 USA E-mail: h uatang@stan ford.edu

Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment