Statistical analysis of the Hirsch Index

By Luca Pratelli∗, Alberto Baccini†, Lucio Barabesi† and Marzia Marcheselli†

Accademia Navale di Livorno∗ and Università di Siena†

The Hirsch index (commonly referred to as the $h$-index) is a bibliometric indicator which is widely recognized as effective for measuring the scientific production of a scholar, since it summarizes the size and impact of the research output. In a formal setting, the $h$-index is actually an empirical functional of the distribution of the citation counts received by the scholar. Under this approach, the asymptotic theory for the empirical $h$-index has recently been developed for citation counts following a continuous distribution and, in particular, variance estimation has been considered for the Pareto-type and the Weibull-type distribution families. However, in bibliometric applications, citation counts display a distribution supported by the integers. Thus, we provide general properties for the empirical $h$-index under the small- and large-sample settings. In addition, we also introduce consistent nonparametric variance estimation, which allows for the implementation of large-sample set estimation for the theoretical $h$-index.

1. Introduction. The $h$-index was introduced by Hirsch (2005) as a research performance indicator for individual scholars. The $h$-index is designed as a single score, balancing the two most important dimensions of research activity, i.e. the productivity of a scholar and the corresponding impact on the scientific community. Indeed, according to the original definition of the empirical $h$-index provided by Hirsch (2005), “a scientist has index $h$, if $h$ of his or her $N_p$ papers have at least $h$ citations each, whereas the other $(N_p - h)$ papers have no more than $h$ citations each”.
Although the $h$-index has been proposed only recently, it is increasingly being adopted for evaluation and comparison purposes to provide information for funding and tenure decisions, since it is considered an appropriate tool for identifying “good” scientists (Ball, 2007). As a matter of fact, several reasons explain its popularity and diffusion (Costas and Bordons, 2007). As is apparent from its definition, the $h$-index displays a simple structure allowing easy computation, using data from Web of Science, Scopus or Google Scholar, while it is robust to publications with a large or small number of citations. In addition, the $h$-index may be adopted for assessing the research performance of more complex structures, such as journals (setting up as a competitor of the Impact Factor, see e.g. Braun et al., 2006), groups of scholars, departments and institutions (Molinari and Molinari, 2008) and even countries (Nejati and Hosseini Jenab, 2010).

AMS 2000 subject classifications: Primary 62G05; secondary 62G20, 62G32.
Keywords and phrases: Hirsch index, heavy-tailed distribution, variance estimation.

Quite obviously, the $h$-index has received considerable attention by researchers in the fields of scientometrics and information science (Van Noorden, 2010). Even if the Hirsch index was originally introduced in a descriptive framework, scientometricians often aim to assume a statistical model for the citation distribution and the interest is focused on the empirical $h$-index (see e.g. Glänzel, 2006). In a proper statistical perspective, Beirlant and Einmahl (2010) have managed the empirical $h$-index as the estimator for a suitable statistical functional of the citation-count distribution.
Accordingly, these authors have proven the consistency of the empirical $h$-index and they have given the conditions for its large-sample normality. In addition, Beirlant and Einmahl (2010) have provided an expression for the large-sample variance of the empirical $h$-index and a simplified formula for the same quantity when the underlying citation-count distribution displays Pareto-type or Weibull-type tails. These two special families have central importance in bibliometrics, since heavy-tailed citation-count distributions are commonly assumed (see e.g. Glänzel, 2006, and Barcza and Telcs, 2009).

Beirlant and Einmahl (2010) have developed the asymptotic theory for the empirical $h$-index by assuming a continuous citation-count distribution, even if the citation number is obviously an integer. Hence, scientometricians may demand results on the empirical Hirsch index under a more general approach. Thus, on the basis of a suitable reformulation of the empirical $h$-index, we provide distributional properties, as well as exact expressions for the mean and variance, of the empirical $h$-index when citation counts follow a distribution supported by the integers. Moreover, the general large-sample properties of the empirical $h$-index are obtained and the link between the “integer” and the “continuous” cases is fully analyzed. In addition, a simple and consistent nonparametric estimator for the variance of the empirical $h$-index is also introduced under very mild conditions. Accordingly, the achieved theoretical results are assessed in a small-sample study by assuming some specific heavy-tailed distributions for the citation counts. Finally, an application to the “top-ten” researchers for the Web of Science archive in the field of Statistics and Probability during the period 1985-2010 is carried out.

2.
Definitions and preliminary results. Let $X$ be a positive random variable (r.v.) and let $S$ be the corresponding survival function (s.f.), i.e.

$$S(x) = P(X > x).$$

Even if $X$ is a discrete r.v. in the common bibliometric applications (since it represents the citation number for a paper of a given scholar), we actually provide a more general approach. Similarly to Beirlant and Einmahl (2010), it is assumed that $S(x) > 0$ for each $x$, since an unbounded support for the r.v. $X$ is usually required in scientometrics (Egghe, 2005). If the left-hand limit of $S$ is denoted by

$$S^-(x) = P(X \ge x),$$

on the basis of the Hirsch definition of the empirical $h$-index reported in the Introduction, for each integer $n \ge 1$, a “natural” expression for the theoretical Hirsch index $h_n$, corresponding to the law of $X$, is given by

$$h_n = \sup\{x \ge 0 : nS^-(x) \ge x\}. \tag{2.1}$$

Obviously, it turns out that $h_n > 0$ since $S$ is a strictly positive function. It is at once apparent that (2.1) encompasses the definition of the theoretical $h$-index given by Beirlant and Einmahl (2010) when the r.v. $X$ is assumed to be continuous. Moreover, when the r.v. $X$ is integer-valued (the most interesting situation in bibliometrics), the theoretical Hirsch index (2.1) reduces to the integer number defined by

$$h_n = \max\{j \in \mathbb{N} : nS(j-1) \ge j\} = \sum_{j=1}^{n} I_{[j/n,1]}(S(j-1)), \tag{2.2}$$

where $I_E$ represents the usual indicator function of a set $E$. It should be remarked that $h_n \nearrow \infty$ and $h_n/n \to 0$ as $n \to \infty$, as immediately follows from the definition (2.1). Since it holds that

$$S^-(j) = P(\lfloor X \rfloor \ge j), \quad j \in \mathbb{N},$$

where $\lfloor\cdot\rfloor$ denotes the function giving the greatest integer less than or equal to the function argument, it is worth noticing that $\lfloor h_n \rfloor$ turns out to be the $h$-index corresponding to the law of $\lfloor X \rfloor$.
Indeed, from the definition (2.1) we have

$$S^-(\lfloor h_n \rfloor) \ge \frac{\lfloor h_n \rfloor}{n} \quad\text{and}\quad S^-(1 + \lfloor h_n \rfloor) < \frac{1 + \lfloor h_n \rfloor}{n}.$$

Rephrasing the previous statement in its dual setting, if $X$ is an integer-valued r.v., the $h$-index corresponding to the law of $X$ turns out to be the integer part of the $h$-index corresponding to the absolutely continuous law of $X + U$, where $U$ is a uniform r.v. on $[0,1]$ independent of $X$.

If $X_1, \ldots, X_n$ are $n$ independent copies of $X$, the estimator of $h_n$, i.e. the empirical $h$-index, may be immediately introduced as an empirical functional on the basis of the definition (2.1). More precisely, the empirical $h$-index is defined to be

$$\hat H_n = \sup\{x \ge 0 : n\hat S_n^-(x) \ge x\}, \tag{2.3}$$

where

$$\hat S_n^-(x) = \frac{1}{n}\sum_{i=1}^{n} I_{[x,\infty[}(X_i).$$

It should be remarked that (2.3) reduces to the empirical $h$-index defined by Beirlant and Einmahl (2010) when the r.v. $X$ is continuous. Moreover, in contrast to (2.3), the expression of the empirical $h$-index commonly given in the bibliometric literature (see e.g. Glänzel, 2006, p.316) is not consistent when the realizations of the $n$ copies are null. In addition, by considering the previous discussion and from expression (2.2), the estimator of $\lfloor h_n \rfloor$ corresponding to the law of $\lfloor X \rfloor$ is given by

$$\tilde H_n = \sum_{j=1}^{n} I_{[j/n,1]}(\hat S_n(j-1)), \tag{2.4}$$

where $\hat S_n$ represents the empirical s.f., i.e.

$$\hat S_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{]x,\infty[}(X_i).$$

It is at once apparent that

$$\tilde H_n = \lfloor \hat H_n \rfloor, \tag{2.5}$$

while it turns out that $\tilde H_n = \hat H_n$ when the r.v. $X$ is integer-valued. Actually, estimator (2.4) constitutes the formal expression of the empirical $h$-index given by Hirsch and reported in the Introduction. It holds that $\hat H_n \nearrow \infty$ a.s. (and hence $\tilde H_n \nearrow \infty$ a.s.) as $n \to \infty$ on the basis of the Glivenko-Cantelli Theorem. In particular, it follows that $E[\hat H_n] \nearrow \infty$ and $E[\tilde H_n] \nearrow \infty$ as $n \to \infty$.
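As an illustration of expression (2.4), the following minimal Python sketch counts the indices $j$ for which at least $j$ observations exceed $j - 1$. The function names `count_above` and `empirical_h_index` are hypothetical and do not appear in the paper; integer counts are used instead of the ratio $\hat S_n(j-1)$ only to avoid floating-point comparisons.

```python
from typing import Sequence


def count_above(counts: Sequence[float], x: float) -> int:
    """Number of observations strictly greater than x, i.e. n * S_hat_n(x)."""
    return sum(1 for c in counts if c > x)


def empirical_h_index(counts: Sequence[float]) -> int:
    """Integer-valued empirical h-index, written after expression (2.4).

    The indicator I_[j/n,1](S_hat_n(j - 1)) equals 1 exactly when
    n * S_hat_n(j - 1) >= j, so the sum over j = 1..n is a simple count.
    """
    n = len(counts)
    return sum(1 for j in range(1, n + 1) if count_above(counts, j - 1) >= j)


if __name__ == "__main__":
    print(empirical_h_index([10, 8, 5, 4, 3]))
    print(empirical_h_index([0, 0, 0]))
```

For integer-valued citation counts this coincides with the usual bibliometric computation, and a scholar whose papers are all uncited receives the value 0.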
In order to achieve some useful small-sample properties for the estimator (2.4), it should be remarked that

$$Y_{j,n} = I_{[j/n,1]}(\hat S_n(j-1)), \quad j = 1, \ldots, n,$$

are dependent Bernoulli random variables. More precisely, each $Y_{j,n}$ turns out to be a Bernoulli r.v. with parameter

$$p_{j,n} = E[Y_{j,n}] = P(n\hat S_n(j-1) \ge j) = \sum_{y=j}^{n} \binom{n}{y} S(j-1)^y (1 - S(j-1))^{n-y}.$$

In the sequel, we set $p_{j,n} = 0$ if $j > n$. Moreover, since it trivially holds that

$$\mathrm{Var}[Y_{j,n}] = p_{j,n}(1 - p_{j,n})$$

and

$$\mathrm{Cov}[Y_{j,n}, Y_{l,n}] = p_{j,n}(1 - p_{l,n})$$

for $j > l$, it also follows that

$$E[\tilde H_n] = \sum_{j=1}^{n} p_{j,n}$$

and

$$\mathrm{Var}[\tilde H_n] = \sum_{j=1}^{n} p_{j,n}(1 - p_{j,n}) + 2\sum_{l=2}^{n} p_{l,n} \sum_{j=1}^{l-1} (1 - p_{j,n}) = \sum_{j=1}^{n} r_{j,n}(1 - p_{j,n}), \tag{2.6}$$

where

$$r_{j,n} = p_{j,n} + 2\sum_{l=j+1}^{n} p_{l,n}.$$

Obviously, it holds that $E[\hat H_n]/E[\tilde H_n] \to 1$ as $n \to \infty$. The behavior of $\mathrm{Var}[\hat H_n]$ and $\mathrm{Var}[\tilde H_n]$ as $n \to \infty$ will be considered at length in Sections 3 and 4.

3. Large-sample properties of the empirical $h$-index. By means of expression (2.5) and considering the discussion following expression (2.2), in order to explore the large-sample behavior of the empirical $h$-index as $n \to \infty$, laws defined on a continuous support may be managed by considering laws concentrated on integers and vice versa. Moreover, if $(a_n)_n$ is an infinitesimal sequence, convergence in distribution of $a_n(\hat H_n - h_n)$ to a law $\mu$ is equivalent to that of $a_n(\tilde H_n - \lfloor h_n \rfloor)$, i.e.

$$a_n(\hat H_n - h_n) \xrightarrow{d} \mu \iff a_n(\tilde H_n - \lfloor h_n \rfloor) \xrightarrow{d} \mu.$$

In addition, by noting that $a_n \sim b_n$ means asymptotic equivalence of the sequences $(a_n)_n$ and $(b_n)_n$, i.e. $\lim_n a_n/b_n = 1$ as $n \to \infty$, if $a_n^{-2} \sim \mathrm{Var}[\tilde H_n]$ and $\lim_n \mathrm{Var}[\tilde H_n] = \infty$, we have

$$\lim_n \frac{\mathrm{Var}[\hat H_n]}{\mathrm{Var}[\tilde H_n]} = 1$$

and

$$a_n(\hat H_n - h_n) \xrightarrow{d} \mu \iff \frac{\tilde H_n - h_n}{\tilde\sigma_n} \xrightarrow{d} \mu \iff \frac{\hat H_n - h_n}{\tilde\sigma_n} \xrightarrow{d} \mu,$$

where $\tilde\sigma_n^2$ is a consistent estimator of $\mathrm{Var}[\tilde H_n]$, i.e.
$$\frac{\tilde\sigma_n^2}{\mathrm{Var}[\tilde H_n]} \xrightarrow{P} 1.$$

Hence, in order to implement confidence sets for $h_n$, the evaluation and the estimation of $\mathrm{Var}[\tilde H_n]$ is of central importance in the most interesting case for the scientometricians, i.e. when it holds that $\mathrm{Var}[\tilde H_n] \to \infty$ as $n \to \infty$. For example, this setting occurs for the Pareto-type family of laws satisfying the condition

$$S(x) = x^{-\alpha} l(x)$$

with $\alpha \in\, ]0,\infty[$ and for the Weibull-type family of laws satisfying the condition

$$S(x) = \exp(-x^{\tau} l(x))$$

with $\tau \in\, ]0,1/2[$, where $l$ is a slowly-varying function, i.e.

$$\frac{l(tx)}{l(t)} \to 1$$

for each $x$ as $t \to \infty$.

Since the variance (2.6) is a function of the probabilities $p_{j,n}$, the preliminary step consists in determining tight inequalities for these quantities, as the following result provides.

Theorem 3.1. If $G$ represents the s.f. of the standard Normal distribution, there exists a constant $A > 0$ such that, for each $n \ge 1$ and $j = 1, \ldots, n$, it holds that

$$|p_{j,n} - G(x_{j,n})| \le A\,\frac{v_{j,n}^3 + 1}{v_{j,n}^4 (1 + |x_{j,n}|)^6}, \tag{3.1}$$

where

$$x_{j,n} = \frac{j - nS(j-1)}{v_{j,n}}$$

and

$$v_{j,n}^2 = nS(j-1)(1 - S(j-1)).$$

Corollary 3.1. There exists a constant $C > 0$, solely depending on $A$, such that

$$\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} r_{j,n} \le \frac{C}{h_n^{3/2}} \tag{3.2}$$

for each $n$ and $h_n \ge 1$.

The further Corollary to Theorem 3.1 gives the consistency of the estimator $\hat H_n$.

Corollary 3.2. The ratio $\hat H_n/h_n$ converges in quadratic mean to 1, i.e. it holds that

$$E\left[\left(\frac{\hat H_n}{h_n} - 1\right)^2\right] \to 0$$

as $n \to \infty$.

Similarly to the framework considered by Beirlant and Einmahl (2010), the previous consistency result is stated in a ratio-setting since $h_n \nearrow \infty$ as $n \to \infty$. Finally, on the basis of Corollary 3.2 it also follows that

$$\frac{E[\hat H_n]}{h_n} \to 1$$

as $n \to \infty$.

4. Consistent estimation of the empirical $h$-index variance.
As emphasized in Section 2, in order to achieve the convergence in distribution of $\hat H_n$, the evaluation of the variance (2.6) is central. To this aim, the following result provides some inequalities and a useful asymptotic equivalence for (2.6) by assuming a mild condition.

Theorem 4.1. For each $n$ it holds that

$$\mathrm{Var}[\tilde H_n] \ge \sum_{j=1}^{\lfloor 2h_n \rfloor} r_{j,n}(1 - p_{j,n}) \ge \sum_{j=1}^{\lfloor 2h_n \rfloor} \tilde r_{j,n}(1 - p_{j,n}), \tag{4.1}$$

where

$$\tilde r_{j,n} = p_{j,n} + 2\sum_{l=j+1}^{\lfloor 2h_n \rfloor} p_{l,n}.$$

Moreover, if

$$\liminf_n \mathrm{Var}[\tilde H_n] > 0, \tag{4.2}$$

it holds that

$$\mathrm{Var}[\tilde H_n] \sim \sum_{j=1}^{\lfloor 2h_n \rfloor} \tilde r_{j,n}(1 - p_{j,n}) \tag{4.3}$$

as $n \to \infty$. In particular, if

$$V_n = \sum_{j=1}^{\lfloor 3\hat H_n \rfloor} p_{j,n}(1 - p_{j,n}) + 2\sum_{l=2}^{\lfloor 3\hat H_n \rfloor} p_{l,n} \sum_{j=1}^{l-1} (1 - p_{j,n}),$$

it holds that

$$R_n = \frac{V_n}{\mathrm{Var}[\tilde H_n]} \xrightarrow{P} 1$$

as $n \to \infty$.

From Theorem 4.1, it is at once apparent that $V_n$ would be a consistent estimator of (2.6) when it is possible to evaluate the $p_{j,n}$ for $j \le \lfloor 3\hat H_n \rfloor$ and in the case that condition (4.2) holds, i.e. when $\mathrm{Var}[\tilde H_n]$ does not approach 0 as $n \to \infty$. Hence, it is convenient to introduce a further condition which solely involves the behavior of $S$ on $\mathbb{N}$ and which implies condition (4.2). More precisely, we consider the condition

$$\lim_n \frac{\sqrt{n}\,\psi(n)}{S(n)} = 0, \tag{4.4}$$

where

$$\psi(n) = S(n-1) - S(n) = P(n-1 < X \le n).$$

Obviously, when the r.v. $X$ is integer-valued, $\psi$ represents the probability function corresponding to $X$. It should be noticed that condition (4.4) may also be reformulated as

$$\frac{S(n-1)}{S(n)} = 1 + \frac{\gamma_n}{\sqrt{n}},$$

where $(\gamma_n)_n$ is a positive infinitesimal sequence, and hence for each $M > 0$, it holds

$$\lim_n \frac{S(n - M\sqrt{n})}{S(n + M\sqrt{n})} = 1, \tag{4.5}$$

since

$$1 \le \frac{S(n - M\sqrt{n})}{S(n + M\sqrt{n})} \sim \prod_{j=\lfloor -M\sqrt{n} \rfloor + 1}^{\lfloor M\sqrt{n} \rfloor + 1} \left(1 + \frac{\gamma_{n+j}}{\sqrt{n+j}}\right) \le \exp\left(\frac{2(M+1)\sqrt{n}\,\delta_{n + \lfloor -M\sqrt{n} \rfloor}}{\sqrt{n - M\sqrt{n} - 1}}\right) \sim 1,$$

where $\delta_n = \sup_{h \ge n} \gamma_h$.
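Condition (4.4) can be explored numerically. The sketch below is only an illustration under a stated assumption: it takes the Weibull-type survival function of Section 3 with $l \equiv 1$, i.e. $S(n) = \exp(-n^{\tau})$, for which $\psi(n)/S(n) = \exp(n^{\tau} - (n-1)^{\tau}) - 1$. The function name `condition_44_ratio` is hypothetical.

```python
from math import expm1, sqrt


def condition_44_ratio(n: int, tau: float) -> float:
    """Value of sqrt(n) * psi(n) / S(n) when S(n) = exp(-n**tau).

    Assumption: Weibull-type tail with slowly-varying factor l identically 1.
    Since psi(n) = S(n - 1) - S(n), the ratio psi(n) / S(n) equals
    exp(n**tau - (n - 1)**tau) - 1, evaluated here with expm1 for accuracy.
    """
    return sqrt(n) * expm1(n**tau - (n - 1) ** tau)


if __name__ == "__main__":
    for tau in (0.4, 0.6):
        values = [condition_44_ratio(10**k, tau) for k in (1, 3, 6)]
        print(tau, values)
```

Under this assumption the ratio behaves like $\tau n^{\tau - 1/2}$ for large $n$, so the printed values shrink for $\tau = 0.4$ and grow for $\tau = 0.6$, in line with the restriction $\tau \in\, ]0,1/2[$ adopted for the Weibull-type family.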
As proven in the following result, condition (4.4) ensures the unboundedness of (2.6) as $n \to \infty$.

Theorem 4.2. If the law of $X$ satisfies condition (4.4), it holds that

$$\lim_n \mathrm{Var}[\tilde H_n] = \infty.$$

In order to achieve consistent estimation of (2.6), it is necessary to introduce a further condition, which is slightly more restrictive than condition (4.4). More precisely, this condition assumes that for each $M \ge 0$ it holds that

$$\lim_n \left(\sup_{j \in D_{M,n}} \left|\frac{\psi(j)}{\psi(n)} - 1\right|\right) = 0, \tag{4.6}$$

where

$$D_{M,n} = [n - M\sqrt{n},\, n + M\sqrt{n}] \cap \mathbb{N}.$$

It may be easily verified that condition (4.6) implies condition (4.4) and hence condition (4.5). Since a natural estimator for $p_{j,n}$ is given by

$$\hat p_{j,n} = \sum_{y=j}^{n} \binom{n}{y} \hat S_n(j-1)^y (1 - \hat S_n(j-1))^{n-y},$$

on the basis of the large-sample behavior of the ratio $R_n$ given in Theorem 4.1, an estimator for the variance (2.6) turns out to be

$$\hat V_n = \sum_{j=1}^{\min(\lfloor 3\hat H_n \rfloor, n)} \hat p_{j,n}(1 - \hat p_{j,n}) + 2\sum_{l=2}^{\min(\lfloor 3\hat H_n \rfloor, n)} \hat p_{l,n} \sum_{j=1}^{l-1} (1 - \hat p_{j,n}). \tag{4.7}$$

It should be remarked that estimator (4.7) is fully nonparametric. Indeed, it does not require the specification of a semi-parametric model for the underlying citation distribution as in the case of the variance estimator proposed by Beirlant and Einmahl (2010). For example, their estimator requires the estimation of the Paretian index when a Pareto-type family is assumed for the law of $X$, which is a non-trivial task (see e.g. Beirlant et al., 2004). The following result provides a compact asymptotic equivalent expression for (2.6) and the consistency of estimator (4.7) if condition (4.6) is assumed.

Theorem 4.3. If the law of $X$ satisfies condition (4.6), it holds that

$$\mathrm{Var}[\hat H_n] \sim \frac{h_n}{(1 + n\psi(\lfloor h_n \rfloor))^2} \tag{4.8}$$

as $n \to \infty$. Moreover, it follows that

$$\frac{\hat V_n}{\mathrm{Var}[\hat H_n]} \xrightarrow{P} 1$$

as $n \to \infty$.
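Estimator (4.7) only requires the observed citation counts. The following Python sketch is one possible way to evaluate it, assuming integer-valued counts so that $\hat H_n$ is an integer and $\lfloor 3\hat H_n \rfloor = 3\hat H_n$. The names `binomial_upper_tail` and `variance_estimate` are hypothetical, and the direct summation of binomial terms is adequate only for moderate $n$ (up to roughly a thousand papers) because the binomial coefficients are converted to floating point.

```python
from math import comb
from typing import Sequence


def binomial_upper_tail(n: int, j: int, s: float) -> float:
    """P(Bin(n, s) >= j): the common form of p_{j,n} and of its plug-in estimate."""
    return sum(comb(n, y) * s**y * (1.0 - s) ** (n - y) for y in range(j, n + 1))


def variance_estimate(counts: Sequence[int]) -> float:
    """Nonparametric variance estimate written after expression (4.7)."""
    n = len(counts)
    # above[j - 1] = n * S_hat_n(j - 1): papers with more than j - 1 citations.
    above = [sum(1 for c in counts if c > j - 1) for j in range(1, n + 1)]
    h_hat = sum(1 for j in range(1, n + 1) if above[j - 1] >= j)
    m = min(3 * h_hat, n)  # upper summation limit in (4.7)
    p_hat = [binomial_upper_tail(n, j, above[j - 1] / n) for j in range(1, m + 1)]

    total = sum(p * (1.0 - p) for p in p_hat)
    cumulative_q = 0.0  # running sum of (1 - p_hat_j) over j < l
    for p in p_hat:
        total += 2.0 * p * cumulative_q
        cumulative_q += 1.0 - p
    return total


if __name__ == "__main__":
    print(variance_estimate([1, 0]))
```

For the toy sample `[1, 0]` the estimate reduces to $\hat p_{1,2}(1 - \hat p_{1,2})$ with $\hat p_{1,2} = 0.75$, since $\hat p_{2,2} = 0$.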
It should be remarked that for the Pareto-type and the Weibull-type families (described in Section 3) condition (4.6) is satisfied. Accordingly, $\hat H_n$ approaches normality and from Theorem 4.3 for the Pareto-type family, it holds that

$$\mathrm{Var}[\hat H_n] \sim \frac{h_n}{(1 + \alpha)^2}$$

and

$$h_n \sim C n^{1/(1+\alpha)}$$

when $l(x) \sim C^{1+\alpha}$, while for the Weibull-type family, it holds that

$$\mathrm{Var}[\hat H_n] \sim \frac{h_n}{(1 + \tau\log(n/h_n))^2}$$

and

$$h_n \sim C (\log n)^{1/\tau}$$

when $l(x) \sim C^{-\tau}$ and where $C > 0$ is a suitable constant. These results are in complete agreement with the findings by Beirlant and Einmahl (2010).

On the basis of Theorem 4.3 and on the remarks contained in Section 2, when condition (4.6) is satisfied by the underlying distribution, a large-sample confidence set for $h_n$ at the $(1 - \gamma)$ confidence level turns out to be

$$C_n = \left\{ [[\hat H_n - z_{1-\gamma/2}\sqrt{\hat V_n}]], \ldots, [[\hat H_n + z_{1-\gamma/2}\sqrt{\hat V_n}]] \right\}, \tag{4.9}$$

where $z_{\gamma}$ represents the $\gamma$-th quantile of the standard Normal distribution, while $[[\cdot]]$ represents the function giving the integer closest to the argument. In addition, in order to assess the homogeneity of the theoretical $h$-indexes for two scholars, a suitable test statistic is given by

$$T_n = \frac{\hat H_{1,n} - \hat H_{2,n}}{\sqrt{\hat V_{1,n} + \hat V_{2,n}}},$$

where $\hat H_{1,n}$ and $\hat H_{2,n}$ represent the empirical $h$-indexes corresponding to the scholars, while $\hat V_{1,n}$ and $\hat V_{2,n}$ are the respective variance estimators as given by (4.7). It is at once apparent that

$$T_n \xrightarrow{d} N(0,1)$$

as $n \to \infty$, when $\hat H_{1,n}$ and $\hat H_{2,n}$ approach normality. The test statistic $T_n$ is defined in a nonparametric setting, in contrast to the test statistic proposed in a semiparametric approach by Beirlant and Einmahl (2010), which requires consistent estimation of the two Paretian indexes of the scholar citation distributions.
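Once $\hat H_n$ and $\hat V_n$ are available, the confidence set (4.9) and the statistic $T_n$ are simple to evaluate. The sketch below assumes both quantities have already been computed; the function names are hypothetical, ties in $[[\cdot]]$ are resolved upward (a convention the paper does not specify), and the two-sided p-value is an addition based on the limiting standard Normal law.

```python
from math import floor, sqrt
from statistics import NormalDist
from typing import Tuple


def nearest_integer(x: float) -> int:
    """Integer closest to x; ties are rounded upward (an assumed convention)."""
    return floor(x + 0.5)


def confidence_set(h_hat: float, v_hat: float, gamma: float = 0.05) -> Tuple[int, int]:
    """End points of the large-sample confidence set (4.9) at level 1 - gamma."""
    z = NormalDist().inv_cdf(1.0 - gamma / 2.0)
    half_width = z * sqrt(v_hat)
    return nearest_integer(h_hat - half_width), nearest_integer(h_hat + half_width)


def homogeneity_test(h1: float, v1: float, h2: float, v2: float) -> Tuple[float, float]:
    """Statistic T_n for two scholars and its two-sided Normal p-value."""
    t = (h1 - h2) / sqrt(v1 + v2)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_value


if __name__ == "__main__":
    # Illustrative inputs only: 46 is an h-index, 4.0 a made-up variance estimate.
    print(confidence_set(46, 4.0))
    print(homogeneity_test(46, 4.0, 39, 12.0))
```

The confidence set is returned through its two end points, since (4.9) contains every integer between them.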
Finally, when the analysis of the theoretical $h$-indexes corresponding to $k$ scholars ($k \ge 2$) is considered, simultaneous set estimation and homogeneity hypothesis testing could be managed by means of techniques similar to those suggested in Marcheselli (2003). These issues will be pursued in future research.

5. Some studies and examples. In order to assess in practice the properties of the empirical $h$-index achieved in the previous sections, a study was carried out for two heavy-tailed distributions. First, a discrete stable distribution for the r.v. $X$ was considered. This distribution may be specified via the probability generating function

$$g(s) = E[s^X] = \exp(-\lambda(1 - s)^{\alpha}), \quad s \in [0,1],$$

where $\alpha \in\, ]0,1]$ and $\lambda \in\, ]0,\infty[$ (Steutel and van Harn, 2004, p.265). The discrete stable distribution is Paretian of order $\alpha$ for $\alpha \in\, ]0,1[$ (Christoph and Schreiber, 1998) and it constitutes a flexible and natural model for heavy-tailed discrete data (see Marcheselli et al., 2008, for a description of the distribution properties and of the corresponding parameter estimation issues). A “discretized” Weibull distribution was subsequently assumed for the r.v. $X$. The distribution displays the probability function

$$f(x) = [\exp(-x^{\tau}) - \exp(-(x+1)^{\tau})]\, I_{\mathbb{N}}(x),$$

where $\tau \in\, ]0,\infty[$. Obviously, it turns out that

$$S(j) = \exp(-(j+1)^{\tau}), \quad j \in \mathbb{N}.$$

By assuming that $n = 30, 50, 100, 150, 200$, the values of $h_n$, $E[\hat H_n]$, $\mathrm{Var}[\hat H_n]$ and of the large-sample variance approximation (4.8) were computed for the discrete stable distribution with parameter vectors $(\alpha, \lambda) = (0.25, 1.0), (0.50, 1.5), (0.75, 2.0)$, as well as for the discretized Weibull distribution with parameters $\tau = 0.01, 0.10, 0.40$.
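For the discretized Weibull law these quantities can be evaluated directly from (2.2), from the binomial expression of $p_{j,n}$ and from (2.6). The following Python sketch is one way to do so and is not taken from the paper; all function names are hypothetical, and the approximation uses the Weibull-type form $h_n/(1 + \tau\log(n/h_n))^2$ reported in Section 4.

```python
from math import comb, exp, log
from typing import List, Tuple


def weibull_sf(j: int, tau: float) -> float:
    """S(j) = exp(-(j + 1)**tau) for the discretized Weibull law, j = 0, 1, ..."""
    return exp(-((j + 1) ** tau))


def theoretical_h(n: int, tau: float) -> int:
    """Theoretical h-index from (2.2): number of j in 1..n with n * S(j - 1) >= j."""
    return sum(1 for j in range(1, n + 1) if n * weibull_sf(j - 1, tau) >= j)


def exceedance_probabilities(n: int, tau: float) -> List[float]:
    """p_{j,n} = P(Bin(n, S(j - 1)) >= j) for j = 1..n."""
    probs = []
    for j in range(1, n + 1):
        s = weibull_sf(j - 1, tau)
        probs.append(sum(comb(n, y) * s**y * (1.0 - s) ** (n - y) for y in range(j, n + 1)))
    return probs


def exact_mean_and_variance(n: int, tau: float) -> Tuple[float, float]:
    """Exact mean and variance (2.6) of the integer-valued empirical h-index."""
    p = exceedance_probabilities(n, tau)
    mean = sum(p)
    variance = sum(pj * (1.0 - pj) for pj in p)
    cumulative_q = 0.0  # running sum of (1 - p_j) over j < l
    for pl in p:
        variance += 2.0 * pl * cumulative_q
        cumulative_q += 1.0 - pl
    return mean, variance


def weibull_variance_approximation(n: int, tau: float) -> float:
    """Large-sample approximation h_n / (1 + tau * log(n / h_n))**2."""
    h = theoretical_h(n, tau)
    return h / (1.0 + tau * log(n / h)) ** 2


if __name__ == "__main__":
    n, tau = 30, 0.40
    print(theoretical_h(n, tau), exact_mean_and_variance(n, tau),
          weibull_variance_approximation(n, tau))
```

With $n = 30$ and $\tau = 0.40$ the output should be close to the corresponding row of Table II (4, 4.47, 1.40 and 1.23).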
These choices were made in order to fit, as closely as possible, the real productivity and the real citation distributions of scholars with different scientific ages, belonging to different research areas and with more or less pronounced impact on research. In the study, $B = 10{,}000$ random variates were generated for each choice of $n$ and for each considered distribution in order to achieve the Monte Carlo estimation of $E[\hat V_n]$ and the Monte Carlo estimation of the actual coverage for the confidence set (4.9) at the 95% nominal confidence level. The corresponding results are reported in Tables I and II.

The analysis of these tables shows that $h_n$ and $E[\hat H_n]$ are similar even for small $n$ values and $\hat V_n$ turns out to be nearly unbiased. In addition, it can be verified that the actual coverage of the large-sample confidence set (4.9) is almost equivalent to the nominal coverage even for small $n$ values. Unfortunately, it can be assessed that the large-sample variance approximation (4.8) may be quite dissimilar from $\mathrm{Var}[\hat H_n]$ even for quite large $n$ values. It should be remarked that an estimation procedure based on (4.8) requires, in any case, the additional estimation of $\alpha$ or $\tau$. Accordingly, $\hat V_n$ seems to be an appealing variance estimator, both from a theoretical and practical perspective. In general, we have verified similar conclusions for a plethora of distributions satisfying condition (4.6), even if we have not reported the corresponding results.

Table I. Discrete stable distribution.

$\alpha$  $\lambda$  $n$   $h_n$  $E[\hat H_n]$  $\mathrm{Var}[\hat H_n]$  $h_n/(1+\alpha)^2$  $E[\hat V_n]$  Coverage
0.25      1.0        30    11     11.31          4.73                      7.04                4.88           0.96
                     50    17     17.17          7.58                      10.88               7.79           0.96
                     100   30     30.28          14.18                     19.20               14.51          0.96
                     150   42     42.19          20.34                     26.88               20.73          0.95
                     200   53     53.38          26.21                     33.92               26.74          0.95
0.50      1.5        30    9      8.96           2.66                      4.00                2.96           0.95
                     50    12     12.46          4.01                      5.33                4.35           0.96
                     100   19     19.59          6.83                      8.44                7.25           0.96
                     150   25     25.57          9.25                      11.11               9.72           0.96
                     200   31     30.91          11.43                     13.78               11.98          0.95
0.75      2.0        30    6      6.65           1.11                      1.96                1.38           0.97
                     50    8      8.38           1.58                      2.61                1.90           0.96
                     100   11     11.67          2.57                      3.59                2.93           0.96
                     150   14     14.29          3.38                      4.57                3.76           0.95
                     200   16     16.54          4.09                      5.22                4.54           0.95

Table II. Discretized Weibull distribution.

$\tau$  $n$   $h_n$  $E[\hat H_n]$  $\mathrm{Var}[\hat H_n]$  $h_n/(1+\tau\log(n/h_n))^2$  $E[\hat V_n]$  Coverage
0.01    30    10     10.77          6.77                      9.78                         6.56           0.94
        50    17     17.86          11.25                     16.64                        11.05          0.96
        100   35     35.47          22.43                     34.28                        22.25          0.96
        150   52     52.98          33.57                     50.92                        33.39          0.96
        200   70     70.44          44.70                     68.55                        44.52          0.95
0.10    30    8      8.63           4.93                      6.24                         4.90           0.95
        50    13     13.60          7.83                      10.10                        7.82           0.95
        100   25     25.09          14.59                     19.28                        14.67          0.95
        150   35     35.83          20.95                     26.67                        21.12          0.95
        200   46     46.07          27.05                     34.97                        27.20          0.95
0.40    30    4      4.47           1.40                      1.23                         1.49           0.97
        50    6      6.01           1.72                      1.76                         1.85           0.96
        100   8      8.74           2.22                      1.98                         2.38           0.96
        150   11     10.75          2.54                      2.63                         2.71           0.96
        200   12     12.38          2.77                      2.66                         2.96           0.96

As a practical application of the achieved results, we also considered the scientific performance of the best ten scholars in the field of Statistics and Probability according to the Web of Science archive. Data were drawn from the Thomson-Reuters databases by selecting the scholars listed in the category Mathematics of the ISIHighlyCited.com database (given at the web site http://hcr3.isiknowledge.com/home.cgi). For each scholar in the database, an author search was performed during the month of December 2010 on the ISI Web of Science for the period 1985-2010.
The search was carried out in such a way that only articles and letters published in journals contained in the Statistics and Probability database were considered. Accordingly, the citation counts were collected for each scholar. The citation counts covered documents contained in the Science Citation Index Expanded, the Social Sciences Citation Index and the Arts & Humanities Citation Index. Finally, the ten scholars with the highest $h$-indexes were considered. More precisely, the names of the ten scholars, the corresponding paper number and $h$-index, as well as the large-sample confidence sets at the 95% nominal confidence level are reported in Table III. Obviously, practitioners may largely benefit from this example in order to understand the importance of quantifying variability for an appropriate comparison analysis of the research performance.

Table III. Performance of the “top-ten” most-cited scholars in the field of Statistics and Probability during the period 1985-2010.

Scholar          $n$   $h_n$  $C_n$
Hall, P.G.       418   46     $\{42, \ldots, 50\}$
Rubin, D.B.      104   39     $\{32, \ldots, 46\}$
Carroll, R.J.    198   38     $\{33, \ldots, 43\}$
Tibshirani, R.   104   37     $\{31, \ldots, 43\}$
Fan, J.          114   36     $\{30, \ldots, 42\}$
Marron, J.S.     107   36     $\{31, \ldots, 41\}$
Hastie, T.J.     77    34     $\{27, \ldots, 41\}$
Lin, D.Y.        93    32     $\{26, \ldots, 38\}$
Raftery, A.E.    88    31     $\{25, \ldots, 37\}$
Wei, L.J.        88    31     $\{26, \ldots, 36\}$

APPENDIX A

A.1. Proof of Theorem 3.1. For fixed $j$ and $n$, let us assume that

$$Z_i = \frac{I_{]j-1,\infty[}(X_i) - S(j-1)}{\sqrt{S(j-1)(1 - S(j-1))}}, \quad i = 1, \ldots, n.$$

Accordingly, we have

$$p_{j,n} = P\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} Z_i \ge x_{j,n}\right).$$

Thus, by applying the Osipov inequality (see e.g.
DasGupta, 2008, p.659) to the $Z_i$'s for $\alpha = 6$, inequality (3.1) is proven, since for each $m \ge 2$ it holds that

$$E[|Z_i|^m] \le \frac{1}{[S(j-1)(1 - S(j-1))]^{(m-1)/2}}.$$

A.2. Proof of Corollary 3.1. Since for each $n$ and $j \ge 2h_n + 1$, it holds that

$$v_{j,n}\,x_{j,n} = j\left(1 - \frac{nS(j-1)}{j}\right) \ge j\left(1 - \frac{nS(2h_n)}{2h_n}\right) \ge \frac{j}{2} \tag{A.1}$$

and since from the definition of $h_n$ it follows that

$$v_{j,n}^2 \le nS(j-1) \le j \tag{A.2}$$

for $j \ge 2h_n + 1$, by means of inequality (3.1) we have

$$|p_{j,n} - G(x_{j,n})| \le A\,\frac{v_{j,n}^5 + v_{j,n}^2}{v_{j,n}^6\,x_{j,n}^6} \le 64A\,\frac{v_{j,n}^5 + v_{j,n}^2}{j^6} \le \frac{128A}{j^{7/2}}. \tag{A.3}$$

Since on the basis of (A.1) and (A.2) it also holds that

$$x_{j,n} \ge \frac{\sqrt{j}}{2},$$

and hence $G(x_{j,n}) \le G(\sqrt{j}/2)$, for each $l > 2h_n$ it follows from (A.3)

$$\sum_{l=j+1}^{n} p_{l,n} \le \frac{256A}{5j^{5/2}} + \sum_{l=j+1}^{n} G(\sqrt{l}/2) \le \frac{B}{j^{5/2}},$$

where $B$ is a constant, solely depending on $A$. Thus, it also turns out that

$$\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} r_{j,n} \le \frac{B}{h_n^{5/2}} + \frac{4B}{3h_n^{3/2}}$$

and hence inequality (3.2) follows.

A.3. Proof of Corollary 3.2. If $S$ is a continuous function, from Corollary 1 given by Beirlant and Einmahl (2010) it holds that

$$\frac{\hat H_n}{h_n} \xrightarrow{P} 1$$

as $n \to \infty$. When $S$ is not a continuous function, let $h'_n$ be the theoretical $h$-index corresponding to the law of $\lfloor X \rfloor + U$, where $U$ is a uniform r.v. on $[0,1]$ independent of $X$, while let $\hat H'_n$ be the empirical $h$-index based on $n$ independent copies of $\lfloor X \rfloor + U$. Since $|h_n - h'_n| \le 1$ and $|\hat H_n - \hat H'_n| \le 1$, the convergence in probability to 1 of $\hat H_n/h_n$ is in turn obtained from the continuous-setting result.
Moreover, the uniform integrability of $\hat H_n^2/h_n^2$ follows by considering inequality (3.2) since

$$E\left[\left(\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} Y_{j,n}\right)^2\right] = \mathrm{Var}\left[\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} Y_{j,n}\right] + E\left[\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} Y_{j,n}\right]^2 \le \sum_{j=\lfloor 2h_n \rfloor + 1}^{n} r_{j,n} + \left(\sum_{j=\lfloor 2h_n \rfloor + 1}^{n} r_{j,n}\right)^2$$

and

$$\sum_{j=1}^{\lfloor 2h_n \rfloor} Y_{j,n} \le 2h_n.$$

A.4. Proof of Theorem 4.1. The inequalities in (4.1) easily follow from expression (2.6), while (4.3) follows from (3.2) and

$$\sum_{j=1}^{\lfloor 2h_n \rfloor} |r_{j,n} - \tilde r_{j,n}| \le 2h_n \sum_{l=\lfloor 2h_n \rfloor + 1}^{n} r_{l,n}.$$

Moreover, on the basis of Corollary 3.2 it turns out that $\hat H_n/n$ converges in mean to 0. Hence, it is convenient to consider

$$R'_n = I_{[0,n]}(3\hat H_n)\,R_n,$$

for which it holds that $R'_n \le 1$ by means of (2.6). Hence, the second part of the Theorem follows from (4.1), (4.3) and

$$\lim_n P(\lfloor 3\hat H_n \rfloor \le \lfloor 2h_n \rfloor) = 0.$$

A.5. Proof of Theorem 4.2. By using the notations introduced in Theorem 4.1, we have

$$x_{j+1,n} - x_{j,n} = \frac{1 + n\psi(j)}{v_{j+1,n}} + x_{j,n}\,\frac{v_{j,n} - v_{j+1,n}}{v_{j+1,n}}.$$

Thus, for a given $c > 0$ and for each $n$ such that $h_n \ge 1$, on the basis of (4.5) it holds that

$$\frac{1 + n\psi(j)}{v_{j+1,n}} < \frac{1}{2c}$$

for $j \in [h_n - \sqrt{h_n},\, h_n + \sqrt{h_n}] \cap \mathbb{N}$. Equivalently, there exist at least $c$ values $x_{j,n}$ in the interval $[-1,1]$. Thus, if $D_n$ represents the set of indexes $j \in [h_n - \sqrt{h_n},\, h_n + \sqrt{h_n}] \cap \mathbb{N}$ for which $|x_{j,n}| \le 1$, on the basis of (3.1) and (4.1) it follows that

$$\mathrm{Var}[\hat H_n] \ge \sum_{j=1}^{\lfloor 2h_n \rfloor} p_{j,n}(1 - p_{j,n}) \ge \sum_{j \in D_n} p_{j,n}(1 - p_{j,n}) \ge cG(1)(1 - G(1)) - A\sum_{j \in D_n} \frac{v_{j,n}^3 + 1}{v_{j,n}^4 (1 + |x_{j,n}|)^6}.$$

From (4.5) we have

$$\inf_{j \in D_n} v_{j,n} \sim \sqrt{nS(h_n - \sqrt{h_n})} \sim \sqrt{nS(h_n)} \sim \sqrt{h_n},$$

and, since $c$ is arbitrary, it holds that

$$\liminf_n \mathrm{Var}[\tilde H_n] \ge cG(1)(1 - G(1)) - 4A\,\limsup_n \frac{\sqrt{h_n}}{\inf_{j \in D_n} v_{j,n}} = cG(1)(1 - G(1)) - 4A,$$

which completes the proof.

A.6. Proof of Theorem 4.3.
For a fixed $M > 0$, from condition (4.6) and from (4.5) it follows that

$$\lim_n \left(\sup_{j \in D_{M,h_n}} \left|\frac{(1 + n\psi(j))\,v_{h_n,n}}{(1 + n\psi(\lfloor h_n \rfloor))\,v_{j+1,n}} - 1\right|\right) = 0. \tag{A.4}$$

Thus, by means of expression (A.4), from Theorem 4.2 it follows that

$$\mathrm{Var}[\hat H_n] \sim \mathrm{Var}[\tilde H_n] \sim \sum_{j=1}^{\lfloor 2h_n \rfloor} p_{j,n}(1 - p_{j,n}) + 2\sum_{l=2}^{\lfloor 2h_n \rfloor} p_{l,n}\sum_{j=1}^{l-1}(1 - p_{j,n}) \sim 2\sum_{l=2}^{\lfloor 2h_n \rfloor} p_{l,n}\sum_{j=1}^{l-1}(1 - p_{j,n})$$

$$\sim \frac{2h_n}{(1 + n\psi(\lfloor h_n \rfloor))^2}\int_{-\infty}^{2h_n} G(x)\,dx\int_{-\infty}^{x}(1 - G(u))\,du \sim \frac{h_n}{(1 + n\psi(\lfloor h_n \rfloor))^2},$$

since

$$\int_{-\infty}^{\infty} G(x)\,dx\int_{-\infty}^{x}(1 - G(u))\,du = \frac{1}{2}.$$

Hence, expression (4.8) is proven. As to the consistency of $\hat V_n$, by assuming that

$$\hat\psi_n(j) = \hat S_n(j-1) - \hat S_n(j),$$

it holds that

$$\sup_{j \in D_{M,h_n}} \left|\frac{(1 + n\hat\psi_n(j))\,v_{h_n,n}}{(1 + n\psi(\lfloor h_n \rfloor))\,v_{j+1,n}} - 1\right| \xrightarrow{P} 0$$

as $n \to \infty$, since

$$\inf_{M \ge 0}\,\limsup_n P(|\hat H_n - h_n| > M\sqrt{h_n}) = 0.$$

Thus, convergence in probability of $\hat V_n/\mathrm{Var}[\tilde H_n]$ to 1 follows.

REFERENCES

[1] Ball, P. (2007). Achievement index climbs the ranks. Nature 448 727–737.
[2] Barcza, K. and Telcs, A. (2009). Paretian publication patterns imply Paretian Hirsch index. Scientometrics 81 513–519.
[3] Beirlant, J. and Einmahl, J.H.J. (2010). Asymptotics for the Hirsch index. Scandinavian Journal of Statistics 37 355–364.
[4] Beirlant, J., Goegebeur, Y., Segers, J. and Teugels, J. (2004). Statistics of extremes: theory and applications. Wiley, New York.
[5] Braun, T., Glänzel, W. and Schubert, A. (2006). A Hirsch-type index for journals. Scientometrics 69 169–173.
[6] Costas, R. and Bordons, M. (2007). The h-index: advantages, limitations and its relation with other bibliometric indicators at the micro level. Journal of Informetrics 1 193–203.
[7] Christoph, G. and Schreiber, K. (1998). Discrete stable random variables.
Statistics and Probability Letters 37 243–247.
[8] DasGupta, A. (2008). Asymptotic theory of statistics and probability. Springer, New York.
[9] Egghe, L. (2005). Power laws in the information production process. Wiley, New York.
[10] Glänzel, W. (2006). On the h-index - A mathematical approach to a new measure of publication activity and citation impact. Scientometrics 67 315–321.
[11] Hirsch, J.E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America 102 16569–16572.
[12] Marcheselli, M. (2003). Asymptotic results in jackknifing non-smooth functions of the sample mean vector. Annals of Statistics 31 1885–1904.
[13] Marcheselli, M., Baccini, A. and Barabesi, L. (2008). Parameter estimation for the discrete stable family. Communications in Statistics - Theory and Methods 37 815–830.
[14] Molinari, J.F. and Molinari, A. (2008). A new methodology for ranking scientific institutions. Scientometrics 75 163–174.
[15] Nejati, A. and Hosseini Jenab, S. (2010). A two-dimensional approach to evaluate the scientific production of countries (case study: the basic sciences). Scientometrics 84 357–364.
[16] Steutel, F.W. and van Harn, K. (2004). Infinite divisibility of probability distributions on the real line. Dekker, New York.
[17] Van Noorden, R. (2010). Metrics: a profusion of measures. Nature 465 864–866.

Address of the first author: Accademia Navale, Viale Italia 72, 57100 Livorno, Italy. E-mail: luca_pratelli@marina.difesa.it

Address of the other authors: Dipartimento di Economia Politica, P.zza S.Francesco 17, 53100 Siena, Italy. E-mail: baccini@unisi.it, barabesi@unisi.it, marcheselli@unisi.it
