On probabilities for separating sets of order statistics

Consider a set of order statistics that arise from sorting samples from two different populations, each with their own, possibly different distribution function. The probability that these order statistics fall in disjoint, ordered intervals, and tha…

Authors: D. H. Glueck, A. Karimpour-Fard, J. Mandel, K. E. Muller

On probabilities for separating sets of order statistics ∗

D. H. Glueck †, A. Karimpour-Fard †, J. Mandel †, K. E. Muller ‡

June 24, 2007

Abstract

Consider a set of order statistics that arise from sorting samples from two different populations, each with its own, possibly different, distribution function. The probability that these order statistics fall in disjoint, ordered intervals, and that of the smallest statistics, a certain number come from the first population, is given in terms of the two distribution functions. The result is applied to computing the joint probability of the number of rejections and the number of false rejections for the Benjamini-Hochberg false discovery rate procedure.

Keywords: Benjamini and Hochberg procedure, block matrix, permanent, multiple comparison.

∗ Deborah H. Glueck is Assistant Professor, Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Campus Box B119, 4200 East Ninth Avenue, Denver, Colorado 80262 (e-mail: Deborah.Glueck@uchsc.edu). Anis Karimpour-Fard is a graduate student in Bioinformatics, Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Campus Box B119, 4200 East Ninth Avenue, Denver, Colorado 80262 (e-mail: Anis Karimpour-Fard@uchsc.edu). Jan Mandel is Professor, Department of Mathematics, Adjunct Professor, Department of Computer Science, and Director of the Center for Computational Mathematics, University of Colorado at Denver and Health Sciences Center, Campus Box 170, Denver, Colorado 80217-3364 (e-mail: Jan.Mandel@cudenver.edu). Keith E. Muller is Professor and Director of the Division of Biostatistics, Department of Epidemiology and Health Policy Research, University of Florida, 1329 SW 16th Street Room 5125, PO Box 100177, Gainesville, FL 32610-0177 (e-mail: Keith.Muller@biostat.ufl.edu). Glueck was supported by NCI K07CA88811. Mandel was supported by NSF-CMS 0325314. Muller was supported by NCI P01 CA47982-04, NCI R01 CA095749-01A1 and NIAID 9 P30 AI 50410. The authors thank Professor Gary Grunwald for his helpful comments.
† University of Colorado at Denver and Health Sciences Center
‡ University of Florida

1 Introduction

Glueck et al. (2006b) gave explicit expressions for the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals on the set of real numbers. In this paper, we extend this work and consider two sets of real valued, independent but not necessarily identically distributed random variables. We give expressions in terms of cumulative distribution functions for the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals, and that of the smallest statistics, a certain number come from one set. We have been unable to find any previous papers on this topic. This problem is of interest in calculating probabilities for the Benjamini and Hochberg (1995) multiple comparisons procedure.

2 A simple example

Consider the following simple example. Let $X_1, X_2 \in [0, 1]$ be independent random variables. Denote by $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ the marginal cumulative distribution functions and by $F_{X_1,X_2}(x_1, x_2)$ the joint cumulative distribution function of $X_1$ and $X_2$. Assume that the cumulative distribution functions are continuous. Let $Y_1 = \min\{X_1, X_2\}$ and let $Y_2 = \max\{X_1, X_2\}$ be the order statistics. For $i = 1, 2$, write the marginal cumulative distribution function of $Y_i$ as $F_{Y_i}(y_i)$, and the joint cumulative distribution function as $F_{Y_1,Y_2}(y_1, y_2)$, for $y_1 \le y_2$.
This joint cumulative distribution function is also continuous (David, 1981, p. 10). Choose numbers $b_1 < b_2$, $b_1, b_2 \in (0, 1)$. We wish to find the probabilities

\[ A = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2)\}, \tag{1} \]

\[ \beta = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2) \wedge (x_1 < b_1)\} \tag{2} \]

and

\[ \gamma = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2) \wedge \neg(x_1 < b_1)\}, \tag{3} \]

and express them in terms of the distribution functions $F_{X_1}$ and $F_{X_2}$.

First, we will find the probabilities directly. If $x_1 < b_1$, then, because $b_1 < b_2$, the event $y_2 > b_2$ can only occur through $x_2 > b_2$; if instead $x_1 \ge b_1$, then $y_1 < b_1$ forces $x_2 < b_1$ and hence $x_1 = y_2 > b_2$. So,

\[ \beta = \Pr\{(x_1 < b_1) \wedge (x_2 > b_2)\} = F_{X_1}(b_1)\,[1 - F_{X_2}(b_2)] \tag{4} \]

and

\[ \gamma = \Pr\{(x_1 > b_2) \wedge (x_2 < b_1)\} = [1 - F_{X_1}(b_2)]\,F_{X_2}(b_1). \tag{5} \]

Equations (4) and (5) follow directly from the independence of the random variables and the definition of the cumulative distribution functions. Since

\[ \{(y_1 < b_1) \wedge (y_2 > b_2)\} = \{(y_1 < b_1) \wedge (y_2 > b_2) \wedge (x_1 < b_1)\} \cup \{(y_1 < b_1) \wedge (y_2 > b_2) \wedge \neg(x_1 < b_1)\} \tag{6} \]

and the union is disjoint, it follows that

\[ A = \beta + \gamma. \tag{7} \]

For a problem with more than two order statistics, the number of cases one needs to consider and the number of possible combinations of statistics, subsets, and bounds make a direct approach impractical. An algorithmic approach to obtaining $\gamma$ and $\beta$ will allow the generalization to an arbitrary number of order statistics.

Using the assumption that the distribution functions are continuous, simple set operations, and the definition of distribution function, we obtain that the probability of the union (6) is

\[ A = \Pr\{(y_1 < b_1) \wedge \neg(y_2 < b_2)\} \tag{8} \]

\[ \phantom{A} = \Pr\{y_1 < b_1\} - \Pr\{(y_1 < b_1) \wedge (y_2 < b_2)\} \tag{9} \]

\[ \phantom{A} = F_{Y_1}(b_1) - F_{Y_1,Y_2}(b_1, b_2). \tag{10} \]

The cumulative distributions of the order statistics can be written (Bapat and Beg, 1989),

\[ F_{Y_1}(b_1) = F_{X_1}(b_1)\,[1 - F_{X_2}(b_1)] + [1 - F_{X_1}(b_1)]\,F_{X_2}(b_1) \tag{11} \]

\[ F_{Y_1,Y_2}(b_1, b_2) = F_{X_1}(b_1)\,[F_{X_2}(b_2) - F_{X_2}(b_1)] + [F_{X_1}(b_2) - F_{X_1}(b_1)]\,F_{X_2}(b_1). \tag{12} \]

Then, substituting Equations (11) and (12) into Equation (10), we can write $A$ in terms of the distribution functions of $X_1$ and $X_2$,

\[ A = F_{X_1}(b_1)\,[1 - F_{X_2}(b_1)] + [1 - F_{X_1}(b_1)]\,F_{X_2}(b_1) - F_{X_1}(b_1)\,[F_{X_2}(b_2) - F_{X_2}(b_1)] - [F_{X_1}(b_2) - F_{X_1}(b_1)]\,F_{X_2}(b_1) \tag{13} \]

\[ \phantom{A} = F_{X_1}(b_1)\,[1 - F_{X_2}(b_2)] + [1 - F_{X_1}(b_2)]\,F_{X_2}(b_1). \tag{14} \]

We now interpret the terms in the sum in Equation (14). The term that includes $F_{X_1}(b_1)$ as a factor is the probability of an event in which $x_1 < b_1$ occurs, and the term that includes $1 - F_{X_1}(b_2)$ as a factor is the probability of an event in which $x_1 > b_2$. Since $b_1 < b_2$, the two events are disjoint, and, consequently, (7) follows again.

To summarize, we have expressed the probability in terms of the joint distribution of the order statistics, which was in turn written in terms of the distribution functions of the random variables. Finally, by recognizing terms that corresponded to a partition, we decomposed $A$ into a sum of $\beta$ and $\gamma$, the two probabilities of interest.

3 General case

The logic used in this simple, two random variables example can be generalized to an arbitrary number of random variables. Consider a set of order statistics that arise from sorting samples from two different populations, each with its own, possibly different, distribution function. We wish to find the probability that these order statistics fall in a given union of intervals, and that of the smallest statistics, a certain number come from one population.
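The two-variable result can be checked numerically. The following is a minimal sketch, not the authors' software: it evaluates Equations (4), (5) and (7) for one hypothetical pair of distributions and compares the resulting $A$ with a seeded Monte Carlo estimate of Equation (1). The distributions $F_{X_1}(x) = x$ and $F_{X_2}(x) = x^2$, the bounds $b_1 = 0.3$, $b_2 = 0.6$, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def f_x1(x):
    """Assumed CDF of X1 (illustration only): uniform on [0, 1]."""
    return x


def f_x2(x):
    """Assumed CDF of X2 (illustration only): F(x) = x**2 on [0, 1]."""
    return x * x


def beta_gamma_a(b1, b2):
    """Evaluate Equations (4), (5) and (7) for bounds b1 < b2."""
    beta = f_x1(b1) * (1.0 - f_x2(b2))    # Equation (4)
    gamma = (1.0 - f_x1(b2)) * f_x2(b1)   # Equation (5)
    return beta, gamma, beta + gamma      # Equation (7)


def simulate_a(b1, b2, draws=200_000, seed=0):
    """Monte Carlo estimate of Pr{(y1 < b1) and (y2 > b2)}, Equation (1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        x1 = rng.random()
        x2 = math.sqrt(rng.random())  # inverse-CDF sampling for F(x) = x**2
        if min(x1, x2) < b1 and max(x1, x2) > b2:
            hits += 1
    return hits / draws


beta, gamma, a = beta_gamma_a(0.3, 0.6)
a_sim = simulate_a(0.3, 0.6)
print(beta, gamma, a, a_sim)
```

With these assumed inputs the closed-form value of $A$ is $0.192 + 0.036 = 0.228$, and the simulated value should agree with it up to Monte Carlo error.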
For this general case, we need to introduce some notation and definitions. Let $X_i$, $i = 1, \dots, m$, be independent but not necessarily identically distributed real valued random variables with values in the interval $[0, 1]$ and continuous cumulative distribution functions $F_{X_i}(x_i)$. Partition the set $\{X_1, X_2, \dots, X_m\}$ into two subsets,

\[ S_1 = \{X_1, X_2, \dots, X_n\}, \qquad S_2 = \{X_{n+1}, X_{n+2}, \dots, X_m\}. \tag{15} \]

For example, one can consider measurements for males or females, or for two different populations of breast cancer, slow or fast growing.

The order statistics $Y_1, Y_2, \dots, Y_m$ are random variables defined by sorting the values of $X_i$. Thus $Y_1 \le Y_2 \le \dots \le Y_m$. Denote the realizations of the order statistics by $y_1 \le y_2 \le \dots \le y_m$.

The arguments of the joint cumulative distribution function of order statistics are customarily written omitting redundant arguments; thus for $1 \le e \le m$ let $1 \le n_1 < n_2 < \dots < n_e \le m$ denote the indices of the order statistics of interest. The joint cumulative distribution function of the set $\{Y_{n_1}, Y_{n_2}, \dots, Y_{n_e}\}$, which is a subset of the complete set of order statistics, is defined as

\[ F_{Y_{n_1},\dots,Y_{n_e}}(y_1, \dots, y_e) = \Pr\left(\{Y_{n_1} \le y_1\} \cap \{Y_{n_2} \le y_2\} \cap \dots \cap \{Y_{n_e} \le y_e\}\right). \tag{16} \]

Suppose we are given $s \le m$ disjoint intervals

\[ (c_q, d_q), \qquad 0 = c_1 < d_1 < c_2 < \dots < c_s < d_s = 1, \tag{17} \]

and integers

\[ k_q \ge 0, \qquad \sum_{q=1}^{s} k_q = m, \tag{18} \]

where $k_0 = 0$ and $k_q$ is the number of order statistics that fall in the $q$th interval. Define $w_{q,1} = 1 + \sum_{i=1}^{q-1} k_i$ and $w_{q,k_q} = \sum_{i=1}^{q} k_i$ to be the subscripts of the smallest and largest order statistics, respectively, that fall in the $q$th interval. In the case when $k_q = 1$, we have $w_{q,1} = w_{q,k_q}$.
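The index bookkeeping for $w_{q,1}$ and $w_{q,k_q}$ can be sketched in a few lines. The function name `interval_indices` and the example counts are hypothetical and only illustrate the two sums in the definition above.

```python
def interval_indices(k):
    """Return [(w_{q,1}, w_{q,k_q}), ...] for q = 1..s, given k = [k_1, ..., k_s].

    Indices are 1-based, matching the subscripts of the order statistics.
    If k_q == 0 the pair satisfies first == last + 1, meaning that no
    order statistic is assigned to the q-th interval.
    """
    bounds = []
    running = 0
    for k_q in k:
        first = running + 1   # w_{q,1} = 1 + sum of k_i for i < q
        running += k_q
        last = running        # w_{q,k_q} = sum of k_i for i <= q
        bounds.append((first, last))
    return bounds


example = interval_indices([2, 1, 3])
print(example)  # [(1, 2), (3, 3), (4, 6)]
```

In this example, with $m = 6$ and $s = 3$, the first interval holds $Y_1, Y_2$, the second holds $Y_3$ alone (so $w_{2,1} = w_{2,k_2} = 3$), and the third holds $Y_4, Y_5, Y_6$.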
Using this notation, the event that exactly $k_q$ of the order statistics fall in the $q$th interval is

\[ \left\{ c_1 < Y_{w_{1,1}} < \dots < Y_{w_{1,k_1}} < d_1 \;\wedge\; \dots \;\wedge\; c_s < Y_{w_{s,1}} < \dots < Y_{w_{s,k_s}} < d_s \right\}, \tag{19} \]

or, in a more compact notation, (21) below.

Now let $B$ be another random event. The following theorem gives the probability of this event intersected with the event (19), in terms of the cumulative distribution functions of the order statistics relative to the event $B$. This distribution function is defined by

\[ F_{Y_{n_1},\dots,Y_{n_e};\,B}(y_1, \dots, y_e) = \Pr\left(\{Y_{n_1} \le y_1\} \cap \{Y_{n_2} \le y_2\} \cap \dots \cap \{Y_{n_e} \le y_e\} \cap B\right). \tag{20} \]

Contrary to the usual convention, we do not require that the indices of the order statistics in the cumulative distribution function (20) are sorted, because that would result in a complication of the notation in the next theorem (additional renumbering of the arguments).

Theorem 1 Denote the event

\[ E = \bigcap_{q=1}^{s} \left( \left\{ c_q < Y_{w_{q,1}} \right\} \cap \left\{ Y_{w_{q,k_q}} < d_q \right\} \right). \tag{21} \]

Then

\[ \Pr(E \cap B) = F_{Y_{w_{1,k_1}}, Y_{w_{2,k_2}}, \dots, Y_{w_{s,k_s}};\,B}(d_1, d_2, \dots, d_q) - \sum_{i=1}^{s} F_{Y_{w_{1,k_1}}, Y_{w_{2,k_2}}, \dots, Y_{w_{s,k_s}}, Y_{w_{i,k_i}};\,B}(d_1, d_2, \dots, d_q, c_q) \tag{22} \]

+ Σ_{r,t=1, r [...] 0, (38)

where $\Phi$ is the cumulative distribution function of the standard normal (mean = 0 and variance = 1). Let $\phi$ be the probability density function of the standard normal.

Suppose that in truth, we have $\epsilon_1 \sim N(\mu_0, \sigma^2)$, so that the null holds for $H_1$, and $\epsilon_2 \sim N(\mu_A, \sigma^2)$, so that the alternative holds for $H_2$. Define $S_1 = \{X_1\}$ and $S_2 = \{X_2\}$. Then the number of p-values for which the null holds is $n = 1$. For $H_1$, the hypothesis for which the null holds, the p-value has a uniform distribution on the interval $[0, 1]$, so for $x_1 \in [0, 1]$,

\[ F_{X_1}(x_1) = x_1. \tag{39} \]

For $H_2$, the alternative holds. When we conduct the hypothesis test, we are unaware of the truth. We always calculate the p-value under the null. However, since the alternative actually holds,

\[ \Pr[Z_2 \le z_2] = \Pr\left[ \frac{\bar{\epsilon}_i - \mu_0}{\sigma/\sqrt{N}} \le z_2 \right] = \Pr\left[ \frac{\bar{\epsilon}_i - \mu_A}{\sigma/\sqrt{N}} \le z_2 + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}} \right] = \Phi\left[ z_2 + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}} \right]. \tag{40} \]

Finally,

\[
\begin{aligned}
F_{X_2}(x_2) &= \Pr(X_2 < x_2) \\
&= \Pr(\{X_2 < x_2\} \cap \{Z_2 \le 0\}) + \Pr(\{X_2 < x_2\} \cap \{Z_2 > 0\}) \\
&= \Pr(\{2\Phi(Z_2) < x_2\}) + \Pr(\{2[1 - \Phi(Z_2)] < x_2\}) \\
&= \Pr\left[ Z_2 \le \Phi^{-1}(x_2/2) \right] + 1 - \Pr\left[ Z_2 \le \Phi^{-1}(1 - x_2/2) \right] \\
&= \Phi\left[ \Phi^{-1}(x_2/2) + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}} \right] + 1 - \Phi\left[ \Phi^{-1}(1 - x_2/2) + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}} \right],
\end{aligned}
\tag{41}
\]

where the last step follows by substitution from Equation (40).

Now, as a specific example, we fix $\mu_0 = 0$, $\mu_A = 1$, $\sigma^2 = 1$, $\alpha = .05$. We wish to calculate the probability that $k_1 = 1$, and that $j = 0$ or $j = 1$, with $c_1 = 0$, $d_1 = \alpha/2$, $c_2 = \alpha$, $d_2 = 1$. With $k_1 = 1$, we reject exactly one of the two hypotheses. When $j = 0$, the rejection we make is of the hypothesis for which the alternative holds, and when $j = 1$, the rejection we make is of $H_1$, the hypothesis for which the null holds, a false rejection. We calculated the probability using our methodology, and by a simulation using a sample of 100,000 variables. Recall that $k_1$ is the number of order statistics that are less than $b_1$, and $j$ is the number of those that are in Set 1. The results are shown in Table 2.

Table 2: Comparison of Simulation and Theory. Recall that k_1 is the number of hypotheses that were rejected, and j is the number of null hypotheses that were rejected. We had two hypotheses, and one null hypothesis.

k_1   j   Theory       Simulation   Difference
1     0   .472982      .47388       .000898
1     1   .00978051    .0095        .00028051

Notice that the simulation differs from the theory by less than 0.001 in both cases.
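The theory column of Table 2 can be reproduced from Equations (39) and (41) with a short script. This is a sketch under stated assumptions, not the authors' Mathematica software. The sample size $N$ is not stated in the surviving text; `n_obs = 5` is an assumption, chosen here because it reproduces the tabulated theory values. The product form for each probability relies on independence, exactly as in Equations (4) and (5), with $d_1$ and $c_2$ playing the roles of $b_1$ and $b_2$. All function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def f_x1(x):
    """Equation (39): the p-value under the null is uniform on [0, 1]."""
    return x


def f_x2(x, mu0=0.0, mu_a=1.0, sigma=1.0, n_obs=5):
    """Equation (41): CDF of the two-sided p-value under the alternative.

    n_obs = 5 is an ASSUMED sample size; it is not given in the text.
    """
    shift = (mu0 - mu_a) / (sigma / sqrt(n_obs))
    lower = STD_NORMAL.cdf(STD_NORMAL.inv_cdf(x / 2.0) + shift)
    upper = 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - x / 2.0) + shift)
    return lower + upper


alpha = 0.05
d1, c2 = alpha / 2.0, alpha  # intervals (0, alpha/2) and (alpha, 1)

# j = 0: X2 (alternative) falls in (0, d1) and X1 (null) falls in (c2, 1).
p_j0 = f_x2(d1) * (1.0 - f_x1(c2))
# j = 1: X1 (null) falls in (0, d1) and X2 (alternative) falls in (c2, 1).
p_j1 = f_x1(d1) * (1.0 - f_x2(c2))
print(p_j0, p_j1)
```

Under the assumed `n_obs = 5`, the two printed values should agree with the Theory column of Table 2 (.472982 for $j = 0$ and .00978051 for $j = 1$) to the digits shown there.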
The theory is exact. Software that implements this method in Mathematica is available from the authors upon request.

References

Bapat, R. B. and Beg, M. I. (1989). "Order Statistics for non-identically distributed variables and permanents," Sankhya, Ser. A, 51, 79-93.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 289-300.

David, H. A. (1981). Order Statistics (2nd ed.). New York: Wiley.

Glueck, Deborah H., Muller, Keith E., Karimpour-Fard, Anis, Hunter, Lawrence. (2006a) (in review), Expected Power for the False Discover Rate with Independence.

Glueck, D. H., Karimpour-Fard, A., Mandel, J. and Muller, K. E. (2006b) (in review), On the probability that order statistics fall in intervals.

Glueck, D. H., Karimpour-Fard, A., Mandel, J., Hunter, L. and Muller, K. E. (2007) (in review), Fast computation by block permanents of cumulative distribution functions of order statistics from several populations.

Rosner, B. (2006). Fundamentals of Biostatistics (6th edition). New York: Brooks-Cole.

Ross, S. (1984). A First Course in Probability: Second Edition. New York: Macmillan Publishing Company.
