On probabilities for separating sets of order statistics

D. H. Glueck†, A. Karimpour-Fard†, J. Mandel†, K. E. Muller‡

† University of Colorado at Denver and Health Sciences Center
‡ University of Florida

June 24, 2007

Deborah H. Glueck is Assistant Professor, Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Campus Box B119, 4200 East Ninth Avenue, Denver, Colorado 80262 (e-mail: Deborah.Glueck@uchsc.edu). Anis Karimpour-Fard is a graduate student in Bioinformatics, Department of Preventive Medicine and Biometrics, University of Colorado at Denver and Health Sciences Center, Campus Box B119, 4200 East Ninth Avenue, Denver, Colorado 80262 (e-mail: Anis Karimpour-Fard@uchsc.edu). Jan Mandel is Professor, Department of Mathematics, Adjunct Professor, Department of Computer Science, and Director of the Center for Computational Mathematics, University of Colorado at Denver and Health Sciences Center, Campus Box 170, Denver, Colorado 80217-3364 (e-mail: Jan.Mandel@cudenver.edu). Keith E. Muller is Professor and Director of the Division of Biostatistics, Department of Epidemiology and Health Policy Research, University of Florida, 1329 SW 16th Street Room 5125, PO Box 100177, Gainesville, FL 32610-0177 (e-mail: Keith.Muller@biostat.ufl.edu). Glueck was supported by NCI K07CA88811. Mandel was supported by NSF-CMS 0325314. Muller was supported by NCI P01 CA47 982-04, NCI R01 CA095749-01A1 and NIAID 9 P30 AI 50410. The authors thank Professor Gary Grunwald for his helpful comments.

Abstract

Consider a set of order statistics that arise from sorting samples from two different populations, each with its own, possibly different, distribution function. The probability that these order statistics fall in disjoint, ordered intervals, and that of the smallest statistics, a certain number come from the first population, is given in terms of the two distribution functions. The result is applied to computing the joint probability of the number of rejections and the number of false rejections for the Benjamini-Hochberg false discovery rate procedure.

Keywords: Benjamini and Hochberg procedure, block matrix, permanent, multiple comparison.

1 Introduction

Glueck et al. (2006b) gave explicit expressions for the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals on the set of real numbers. In this paper, we extend this work and consider two sets of real valued, independent but not necessarily identically distributed random variables. We give expressions in terms of cumulative distribution functions for the probability that arbitrary subsets of order statistics fall in disjoint, ordered intervals, and that of the smallest statistics, a certain number come from one set. We have been unable to find any previous papers on this topic. This problem is of interest in calculating probabilities for the Benjamini and Hochberg (1995) multiple comparisons procedure.

2 A simple example

Consider the following simple example. Let $X_1, X_2 \in [0, 1]$ be independent random variables. Denote by $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ the marginal cumulative distribution functions and by $F_{X_1,X_2}(x_1, x_2)$ the joint cumulative distribution function of $X_1$ and $X_2$. Assume that the cumulative distribution functions are continuous. Let $Y_1 = \min\{X_1, X_2\}$ and let $Y_2 = \max\{X_1, X_2\}$ be the order statistics. For $i = 1, 2$, write the marginal cumulative distribution function of $Y_i$ as $F_{Y_i}(y_i)$, and the joint cumulative distribution function as $F_{Y_1,Y_2}(y_1, y_2)$, for $y_1 \le y_2$. This joint cumulative distribution function is also continuous (David, 1981, p.
10). Choose numbers $b_1 < b_2$ and define

\[ A = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2)\}, \tag{1} \]

\[ \beta = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2) \wedge (x_1 < b_1)\}, \tag{2} \]

\[ \gamma = \Pr\{(y_1 < b_1) \wedge (y_2 > b_2) \wedge \neg(x_1 < b_1)\}. \tag{3} \]

Because $b_1 < b_2$, the condition $x_1 < b_1$ forces $x_1 = y_1$ and $x_2 = y_2$, while its negation forces $x_2 = y_1$ and $x_1 = y_2$. Hence

\[ \beta = \Pr\{(x_1 < b_1) \wedge (x_2 > b_2)\} = F_{X_1}(b_1)\,[1 - F_{X_2}(b_2)] \tag{4} \]

and

\[ \gamma = \Pr\{(x_1 > b_2) \wedge (x_2 < b_1)\} = [1 - F_{X_1}(b_2)]\,F_{X_2}(b_1). \tag{5} \]

Equations (4) and (5) follow directly from the independence of the random variables and the definition of the cumulative distribution functions. Since

\[ \{(y_1 < b_1) \wedge (y_2 > b_2)\} = \{(y_1 < b_1) \wedge (y_2 > b_2) \wedge (x_1 < b_1)\} \cup \{(y_1 < b_1) \wedge (y_2 > b_2) \wedge \neg(x_1 < b_1)\} \tag{6} \]

and the union is disjoint, it follows that

\[ A = \beta + \gamma. \tag{7} \]

For a problem with more than two order statistics, the number of cases one needs to consider and the number of possible combinations of statistics, subsets, and bounds make a direct approach impractical. An algorithmic approach to obtaining $\gamma$ and $\beta$ will allow the generalization to an arbitrary number of order statistics.

Using the assumption that the distribution functions are continuous, simple set operations, and the definition of a distribution function, we obtain that the probability of the union (6) is

\[ A = \Pr\{(y_1 < b_1) \wedge \neg(y_2 < b_2)\} \tag{8} \]

\[ \phantom{A} = \Pr\{y_1 < b_1\} - \Pr\{(y_1 < b_1) \wedge (y_2 < b_2)\} \tag{9} \]

\[ \phantom{A} = F_{Y_1}(b_1) - F_{Y_1,Y_2}(b_1, b_2). \tag{10} \]

The cumulative distributions of the order statistics can be written (Bapat and Beg, 1989) as

\[ F_{Y_1}(b_1) = F_{X_1}(b_1)\,[1 - F_{X_2}(b_1)] + [1 - F_{X_1}(b_1)]\,F_{X_2}(b_1) + F_{X_1}(b_1)\,F_{X_2}(b_1), \tag{11} \]

\[ F_{Y_1,Y_2}(b_1, b_2) = F_{X_1}(b_1)\,[F_{X_2}(b_2) - F_{X_2}(b_1)] + [F_{X_1}(b_2) - F_{X_1}(b_1)]\,F_{X_2}(b_1) + F_{X_1}(b_1)\,F_{X_2}(b_1). \tag{12} \]

The term $F_{X_1}(b_1)\,F_{X_2}(b_1)$, the probability that both variables fall below $b_1$, appears in both expressions and cancels in their difference. Substituting Equations (11) and (12) into Equation (10), we can therefore write $A$ in terms of the distribution functions of $X_1$ and $X_2$,

\[ A = F_{X_1}(b_1)\,[1 - F_{X_2}(b_1)] + [1 - F_{X_1}(b_1)]\,F_{X_2}(b_1) - F_{X_1}(b_1)\,[F_{X_2}(b_2) - F_{X_2}(b_1)] - [F_{X_1}(b_2) - F_{X_1}(b_1)]\,F_{X_2}(b_1) \tag{13} \]

\[ \phantom{A} = F_{X_1}(b_1)\,[1 - F_{X_2}(b_2)] + [1 - F_{X_1}(b_2)]\,F_{X_2}(b_1). \tag{14} \]

We now interpret the terms in the sum in Equation (14).
The term that includes $F_{X_1}(b_1)$ as a factor is the probability of the event in which $x_1 < b_1$ and $x_2 > b_2$, and the other term is the probability of the event in which $x_2 < b_1$ and $x_1 > b_2$. Since $b_1 < b_2$, the two events are disjoint, and, consequently, (7) follows again.

To summarize, we have expressed the probability $A$ in terms of the joint distribution of the order statistics, which was in turn written in terms of the distribution functions of the random variables. Finally, by recognizing terms that correspond to a partition, we decomposed $A$ into a sum of $\beta$ and $\gamma$, the two probabilities of interest.

3 General case

The logic used in this simple example with two random variables can be generalized to an arbitrary number of random variables. Consider a set of order statistics that arise from sorting samples from two different populations, each with its own, possibly different, distribution function. We wish to find the probability that these order statistics fall in a given union of intervals, and that of the smallest statistics, a certain number come from one population.

For this general case, we need to introduce some notation and definitions. Let $X_i$, $i = 1, \ldots, m$, be independent but not necessarily identically distributed real valued random variables with values in the interval $[0, 1]$ and continuous cumulative distribution functions $F_{X_i}(x_i)$. Partition the set $\{X_1, X_2, \ldots, X_m\}$ into two subsets,

\[ S_1 = \{X_1, X_2, \ldots, X_n\}, \qquad S_2 = \{X_{n+1}, X_{n+2}, \ldots, X_m\}. \tag{15} \]

For example, one can consider measurements for males or females, or for two different populations of breast cancer, slow or fast growing. The order statistics $Y_1, Y_2, \ldots, Y_m$ are random variables defined by sorting the values of $X_i$. Thus $Y_1 \le Y_2 \le \ldots \le Y_m$. Denote the realizations of the order statistics by $y_1 \le y_2 \le \ldots \le y_m$.
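For the smallest instance of this setup, $m = 2$ and $n = 1$, the closed forms (4), (5) and (14) of Section 2 can be checked by simulation. The sketch below is illustrative only: the choice of $X_1$ uniform on $[0, 1]$ and $X_2 = U^2$ (so that $F_{X_2}(x) = \sqrt{x}$), the bounds $b_1 = 0.3$ and $b_2 = 0.6$, and all function names are assumptions made here and do not come from the paper.

```python
import math
import random


def beta_gamma(F1, F2, b1, b2):
    """Closed forms (4) and (5) for two independent variables, with b1 < b2."""
    beta = F1(b1) * (1.0 - F2(b2))   # x1 < b1 and x2 > b2
    gamma = (1.0 - F1(b2)) * F2(b1)  # x1 > b2 and x2 < b1
    return beta, gamma


def simulate(b1, b2, reps=200_000, seed=0):
    """Monte Carlo estimates of (beta, gamma) for the assumed distributions."""
    rng = random.Random(seed)
    n_beta = n_gamma = 0
    for _ in range(reps):
        x1 = rng.random()        # X1 ~ Uniform(0, 1), so F1(x) = x
        x2 = rng.random() ** 2   # X2 = U^2, so F2(x) = sqrt(x)
        y1, y2 = min(x1, x2), max(x1, x2)
        if y1 < b1 and y2 > b2:  # the event whose probability is A
            if x1 < b1:
                n_beta += 1      # the smallest statistic came from S1 = {X1}
            else:
                n_gamma += 1     # the smallest statistic came from S2 = {X2}
    return n_beta / reps, n_gamma / reps


F1 = lambda x: x
F2 = math.sqrt
b1, b2 = 0.3, 0.6
beta, gamma = beta_gamma(F1, F2, b1, b2)
beta_hat, gamma_hat = simulate(b1, b2)
print(f"beta  theory {beta:.4f}  simulation {beta_hat:.4f}")
print(f"gamma theory {gamma:.4f}  simulation {gamma_hat:.4f}")
print(f"A = beta + gamma = {beta + gamma:.4f}")
```

Splitting the simulated event by whether $x_1 < b_1$ mirrors the partition in Equation (6), so the two counters estimate $\beta$ and $\gamma$ separately and their sum estimates $A$.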
The arguments of the joint cumulative distribution function of order statistics are customarily written omitting redundant arguments; thus for $1 \le e \le m$ let $1 \le n_1 < n_2 < \cdots < n_e \le m$ denote the indices of the order statistics of interest. The joint cumulative distribution function of the set $\{Y_{n_1}, Y_{n_2}, \ldots, Y_{n_e}\}$, which is a subset of the complete set of order statistics, is defined as

\[ F_{Y_{n_1},\ldots,Y_{n_e}}(y_1, \ldots, y_e) = \Pr\left(\{Y_{n_1} \le y_1\} \cap \{Y_{n_2} \le y_2\} \cap \cdots \cap \{Y_{n_e} \le y_e\}\right). \tag{16} \]

Suppose we are given $s \le m$ disjoint intervals

\[ (c_q, d_q), \qquad 0 = c_1 < d_1 < c_2 < \cdots < c_s < d_s = 1, \tag{17} \]

and integers

\[ k_q \ge 0, \qquad \sum_{q=1}^{s} k_q = m, \tag{18} \]

where $k_0 = 0$ and $k_q$ is the number of order statistics that fall in the $q$th interval. Define $w_{q,1} = 1 + \sum_{i=1}^{q-1} k_i$ and $w_{q,k_q} = \sum_{i=1}^{q} k_i$ to be the subscripts of the smallest and largest order statistics, respectively, that fall in the $q$th interval. In the case when $k_q = 1$, we have $w_{q,1} = w_{q,k_q}$. Using this notation, the event that exactly $k_q$ of the order statistics fall in the $q$th interval is

\[ \left\{ c_1 < Y_{w_{1,1}} < \cdots < Y_{w_{1,k_1}} < d_1 \;\wedge\; \cdots \;\wedge\; c_s < Y_{w_{s,1}} < \cdots < Y_{w_{s,k_s}} < d_s \right\}, \tag{19} \]

or, in a more compact notation, (21) below.

Now let $B$ be another random event. The following theorem gives the probability of this event intersected with the event (19), in terms of the cumulative distribution functions of the order statistics relative to the event $B$. This distribution function is defined by

\[ F_{Y_{n_1},\ldots,Y_{n_e};B}(y_1, \ldots, y_e) = \Pr\left(\{Y_{n_1} \le y_1\} \cap \{Y_{n_2} \le y_2\} \cap \cdots \cap \{Y_{n_e} \le y_e\} \cap B\right). \tag{20} \]

Contrary to the usual convention, we do not require that the indices of the order statistics in the cumulative distribution function (20) be sorted, because that would complicate the notation in the next theorem (additional renumbering of the arguments).
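The index bookkeeping in (17) to (19) is easy to get wrong, so a small sketch may help. It computes $w_{q,1}$ and $w_{q,k_q}$ from the counts $k_q$ and tests whether a sorted sample lies in the event (19). The function names and the sample values are hypothetical. An interval with $k_q = 0$ is treated as requiring no order statistic, which is an assumption about a case the text does not spell out.

```python
from itertools import accumulate


def interval_indices(k):
    """Return [(w_{q,1}, w_{q,k_q})], 1-based, for counts k = [k_1, ..., k_s]."""
    ends = list(accumulate(k))                            # w_{q,k_q} = k_1 + ... + k_q
    starts = [end - kq + 1 for end, kq in zip(ends, k)]   # w_{q,1} = 1 + k_1 + ... + k_{q-1}
    return list(zip(starts, ends))


def in_event(y, c, d, k):
    """True if the sorted values y satisfy event (19) for the intervals (c_q, d_q)."""
    if len(y) != sum(k) or list(y) != sorted(y):
        raise ValueError("y must be sorted and have sum(k) entries")
    for (first, last), cq, dq in zip(interval_indices(k), c, d):
        # Order statistics Y_first, ..., Y_last must lie strictly inside (c_q, d_q).
        if any(not (cq < y[i - 1] < dq) for i in range(first, last + 1)):
            return False
    return True


k = [2, 1, 3]            # hypothetical counts, so m = 6
c = [0.0, 0.3, 0.6]      # left endpoints, c_1 = 0
d = [0.2, 0.5, 1.0]      # right endpoints, d_s = 1
print(interval_indices(k))                                   # [(1, 2), (3, 3), (4, 6)]
print(in_event([0.05, 0.1, 0.4, 0.7, 0.8, 0.9], c, d, k))    # True
print(in_event([0.05, 0.25, 0.4, 0.7, 0.8, 0.9], c, d, k))   # False
```

Because the sample is sorted, checking each order statistic against its own interval is equivalent to the chained inequalities in (19).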
Theorem 1 Denote the event

\[ E = \bigcap_{q=1}^{s} \left( \{c_q < Y_{w_{q,1}}\} \cap \{Y_{w_{q,k_q}} < d_q\} \right). \tag{21} \]

Then

\[
\begin{aligned}
\Pr(E \cap B) ={}& F_{Y_{w_{1,k_1}},Y_{w_{2,k_2}},\ldots,Y_{w_{s,k_s}};B}(d_1, d_2, \ldots, d_s) \\
& - \sum_{i=1}^{s} F_{Y_{w_{1,k_1}},Y_{w_{2,k_2}},\ldots,Y_{w_{s,k_s}},Y_{w_{i,1}};B}(d_1, d_2, \ldots, d_s, c_i) \\
& + \sum_{r,t=1,\; r<\cdots}^{s} \cdots
\end{aligned}
\tag{22}
\]

[...]

\[ \cdots > 0, \tag{38} \]

where $\Phi$ is the cumulative distribution function of the standard normal (mean = 0 and variance = 1). Let $\phi$ be the probability density function of the standard normal. Suppose that in truth, we have $\epsilon_1 \sim N(\mu_0, \sigma^2)$, so that the null holds for $H_1$, and $\epsilon_2 \sim N(\mu_A, \sigma^2)$, so that the alternative holds for $H_2$. Define $S_1 = \{X_1\}$ and $S_2 = \{X_2\}$. Then the number of p-values for which the null holds is $n = 1$.

For $H_1$, the hypothesis for which the null holds, the p-value has a uniform distribution on the interval $[0, 1]$, so for $x_1 \in [0, 1]$,

\[ F_{X_1}(x_1) = x_1. \tag{39} \]

For $H_2$, the alternative holds. When we conduct the hypothesis test, we are unaware of the truth. We always calculate the p-value under the null. However, since the alternative actually holds,

\[
\Pr[Z_2 \le z_2]
= \Pr\left[\frac{\bar{\epsilon}_2 - \mu_0}{\sigma/\sqrt{N}} \le z_2\right]
= \Pr\left[\frac{\bar{\epsilon}_2 - \mu_A}{\sigma/\sqrt{N}} \le z_2 + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}}\right]
= \Phi\left[z_2 + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}}\right].
\tag{40}
\]

Finally,

\[
\begin{aligned}
F_{X_2}(x_2) &= \Pr(X_2 < x_2) \\
&= \Pr(\{X_2 < x_2\} \cap \{Z_2 \le 0\}) + \Pr(\{X_2 < x_2\} \cap \{Z_2 > 0\}) \\
&= \Pr(\{2\Phi(Z_2) < x_2\}) + \Pr(\{2[1 - \Phi(Z_2)] < x_2\}) \\
&= \Pr\left[Z_2 \le \Phi^{-1}(x_2/2)\right] + 1 - \Pr\left[Z_2 \le \Phi^{-1}(1 - x_2/2)\right] \\
&= \Phi\left[\Phi^{-1}(x_2/2) + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}}\right] + 1 - \Phi\left[\Phi^{-1}(1 - x_2/2) + \frac{\mu_0 - \mu_A}{\sigma/\sqrt{N}}\right],
\end{aligned}
\tag{41}
\]

where the last step follows by substitution from Equation (40).

Now, as a specific example, we fix $\mu_0 = 0$, $\mu_A = 1$, $\sigma^2 = 1$, and $\alpha = .05$. We wish to calculate the probability that $k_1 = 1$, and that $j = 0$ or $j = 1$, with $c_1 = 0$, $d_1 = \alpha/2$, $c_2 = \alpha$, $d_2 = 1$. In both cases we reject exactly one of the two hypotheses. When $j = 0$, the rejection we make is of the hypothesis for which the alternative holds, and when $j = 1$, the rejection we make is of $H_1$, the hypothesis for which the null holds, a false rejection. We calculated the probability using our methodology, and by a simulation using a sample of 100,000 variables. Recall that $k_1$ is the number of order statistics that are less than $b_1$, and $j$ is the number of those that are in Set 1 and less than $b_1$. The results are shown in Table 2.

Table 2: Comparison of Simulation and Theory. Recall that $k_1$ is the number of hypotheses that were rejected, and $j$ is the number of null hypotheses that were rejected. We had two hypotheses, and one null hypothesis.

k_1   j   Theory      Simulation   Difference
1     0   .472982     .47388       .000898
1     1   .00978051   .0095        .00028051

Notice that the simulation differs from the theory only in the fourth decimal place. The theory is exact. Software that implements this method in Mathematica is available from the authors upon request.

References

Bapat, R. B. and Beg, M. I. (1989). "Order statistics for non-identically distributed variables and permanents," Sankhya, Ser. A, 51, 79-93.
Benjamini, Y. and Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing," J. R. Stat. Soc. Ser. B Stat. Methodol., 57, 289-300.
David, H. A. (1981). Order Statistics (2nd ed.). New York: Wiley.
Glueck, D. H., Muller, K. E., Karimpour-Fard, A., and Hunter, L.
(2006a) (in review). Expected power for the false discovery rate with independence.
Glueck, D. H., Karimpour-Fard, A., Mandel, J., and Muller, K. E. (2006b) (in review). On the probability that order statistics fall in intervals.
Glueck, D. H., Karimpour-Fard, A., Mandel, J., Hunter, L., and Muller, K. E. (2007) (in review). Fast computation by block permanents of cumulative distribution functions of order statistics from several populations.
Rosner, B. (2006). Fundamentals of Biostatistics (6th ed.). New York: Brooks-Cole.
Ross, S. (1984). A First Course in Probability (2nd ed.). New York: Macmillan Publishing Company.
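
As a numerical companion to the Benjamini-Hochberg example, the sketch below evaluates Equations (39) and (41) and the two probabilities reported in Table 2 for two hypotheses. It is a hedged illustration and not the authors' Mathematica software: the function names are invented here, and the sample size N = 5 is an assumption, since no value of N appears in the text of the example. The probabilities use the two-variable form of Equations (4) and (5) with $b_1 = \alpha/2$ and $b_2 = \alpha$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: Phi is .cdf, its inverse is .inv_cdf


def F_null(x):
    """Equation (39): a p-value under the null is uniform on [0, 1]."""
    return x


def F_alt(x, mu0=0.0, muA=1.0, sigma=1.0, N=5):
    """Equation (41): cdf of the two-sided p-value when the alternative holds.

    N = 5 is an assumed sample size, not a value taken from the text.
    """
    shift = (mu0 - muA) / (sigma / sqrt(N))
    lower = STD_NORMAL.cdf(STD_NORMAL.inv_cdf(x / 2) + shift)
    upper = 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - x / 2) + shift)
    return lower + upper


def one_rejection_probs(alpha=0.05):
    """Pr(k1 = 1, j) for j = 0 and j = 1, with d1 = alpha/2 and c2 = alpha."""
    b1, b2 = alpha / 2, alpha
    p_j1 = F_null(b1) * (1.0 - F_alt(b2))   # null p-value below b1, the other above b2
    p_j0 = F_alt(b1) * (1.0 - F_null(b2))   # alternative p-value below b1, null above b2
    return p_j0, p_j1


p_j0, p_j1 = one_rejection_probs()
print(f"k1 = 1, j = 0: {p_j0:.6f}")
print(f"k1 = 1, j = 1: {p_j1:.8f}")
```

With these assumed inputs, the two printed values can be compared directly with the Theory column of Table 2.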
