P-values: misunderstood and misused

Bertie Vidgen¹ and Taha Yasseri¹,*

¹Oxford Internet Institute, University of Oxford, Oxford, UK
Correspondence*: Taha Yasseri, Oxford Internet Institute, University of Oxford, 1 St Giles, Oxford, OX1 3JS, UK, taha.yasseri@oii.ox.ac.uk

ABSTRACT

P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, a substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We conclude by identifying practical steps to help remediate some of the concerns identified. We recommend that (i) far lower significance levels are used, such as 0.01 or 0.001, and (ii) p-values are interpreted contextually, and situated within both the findings of the individual study and the broader field of inquiry (through, for example, meta-analyses).

Keywords: p-value, statistics, significance, p-hacking, prevalence, Bayes, big data

1 INTRODUCTION

P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. Obtaining a p-value that indicates "statistical significance" is often a requirement for publishing in a top journal. The emergence of computational social science, which relies mostly on analyzing large-scale datasets, has increased the popularity of p-values even further. However, critics contend that p-values are routinely misunderstood and misused by many practitioners, and that even when understood correctly they are an ineffective metric: the standard significance level of 0.05 produces an overall False Discovery Rate that is far higher, more like 30%. Others argue that p-values can be easily "hacked" to indicate statistical significance when none exists, and that they encourage the selective reporting of only positive results. Considerable research exists into how p-values are (mis)used [e.g. 1, 2]. In this paper we review the recent critical literature on p-values, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms. In the final section we identify practical steps to help remediate some of the concerns identified.

P-values are used in Null Hypothesis Significance Testing (NHST) to decide whether to accept or reject a null hypothesis (which typically states that there is no underlying relationship between two variables).
If the null hypothesis is rejected, this gives grounds for accepting the alternative hypothesis (that a relationship does exist between two variables). The p-value quantifies the probability of observing results at least as extreme as the ones observed, given that the null hypothesis is true. It is then compared against a pre-determined significance level (α). If the reported p-value is smaller than α the result is considered statistically significant. Typically, in the social sciences α is set at 0.05. Other commonly used significance levels are 0.01 and 0.001.

In his seminal paper "The Earth is Round (p < .05)", Cohen argues that NHST is highly flawed: it is relatively easy to achieve results that can be labelled significant when a "nil" hypothesis (where the effect size of H0 is set at zero) is used rather than a true "null" hypothesis (where the direction of the effect, or even the effect size, is specified) [3]. This problem is particularly acute in the context of "big data" exploratory studies, where researchers only seek statistical associations rather than causal relationships. If a large enough number of variables are examined, effectively meaning that a large number of null/alternative hypotheses are specified, then it is highly likely that at least some "statistically significant" results will be identified, irrespective of whether the underlying relationships are truly meaningful. As big data approaches become more common this issue will become both far more pertinent and problematic, with the robustness of many "statistically significant" findings being highly limited.

Lew argues that the central problem with NHST is reflected in its hybrid name, which is a combination of (i) hypothesis testing and (ii) significance testing [4]. In significance testing, first developed by Ronald Fisher in the 1920s, the p-value provides an index of the evidence against the null hypothesis. Originally, Fisher only intended for the p-value to establish whether further research into a phenomenon could be justified. He saw it as one bit of evidence to either support or challenge accepting the null hypothesis, rather than as conclusive evidence of significance [5]; see also [6, 7]. In contrast, hypothesis tests, developed separately by Neyman and Pearson, replace Fisher's subjectivist interpretation of the p-value with a hard and fast "decision rule": when the p-value is less than α, the null can be rejected and the alternative hypothesis accepted. Though this approach is simpler to apply and understand, a crucial stipulation of it is that a precise alternative hypothesis must be specified [6]. This means indicating what the expected effect size is (thereby setting a nil rather than a null hypothesis), something that most researchers rarely do [3]. Though hypothesis tests and significance tests are distinct statistical procedures, and there is much disagreement about whether they can be reconciled into one coherent framework, NHST is widely used as a pragmatic amalgam for conducting research [8, 9].
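To make the contrast between the two traditions concrete, the following sketch (our own illustration, not from the original paper) runs a two-sample t-test and shows both readings of the same p-value: the Neyman-Pearson decision rule and the Fisherian graded-evidence report. The simulated data, sample sizes, and choice of a t-test are all assumptions made purely for illustration.

```python
# Minimal sketch of the NHST decision rule: compare a p-value against a
# pre-set significance level alpha. Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05  # conventional significance level in the social sciences

# Two hypothetical groups; under H0 they would share the same mean.
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.3, scale=1.0, size=50)  # small true difference

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Neyman-Pearson style: a hard and fast decision rule.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f} -> {decision} at alpha = {alpha}")

# Fisher style: report the p-value itself as a graded index of evidence,
# rather than converting it into a binary significant/nonsignificant verdict.
print(f"Evidence against H0 (smaller = stronger): p = {p_value:.4f}")
```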
Hurlbert and Lombardi argue that one of the biggest issues with NHST is that it encourages the use of terminology such as significant/nonsignificant. This dichotomizes the p-value on an arbitrary basis, and converts a probability into a certainty. This is unhelpful when the purpose of using statistics, as is typically the case in academic studies, is to weigh up evidence incrementally rather than make an immediate decision [9, p. 315]. Hurlbert and Lombardi's analysis suggests that the real problem lies not with p-values, but with α and how this has led to p-values being interpreted dichotomously: too much importance is attached to the arbitrary cutoff α ≤ 0.05.

2 THE FALSE DISCOVERY RATE

A p-value of 0.05 is normally interpreted to mean that there is a 1 in 20 chance that the observed results are nonsignificant, having occurred even though no underlying relationship exists. Most people then think that the overall proportion of results that are false positives is also 0.05. However, this interpretation confuses the p-value (which, in the long run, will approximately correspond to the type I error rate) with the False Discovery Rate (FDR). The FDR is what people usually mean when they refer to the error rate: it is the proportion of reported discoveries that are false positives. Though 0.05 might seem a reasonable level of inaccuracy, a type I error rate of 0.05 will likely produce an FDR that is far higher, easily 30% or more. The formula for the FDR is:

FDR = \frac{\text{False Positives}}{\text{True Positives} + \text{False Positives}}. (1)

Calculating the number of true positives and false positives requires knowing more than just the type I error rate, but also (i) the statistical power, or "sensitivity", of tests and (ii) the prevalence of effects [10]. Statistical power is the probability that each test will correctly reject the null hypothesis when the alternative hypothesis is true. As such, tests with higher power are more likely to correctly record real effects. Prevalence is the number of effects, out of all the effects that are tested for, that actually exist in the real world. In the FDR calculation it determines the weighting given to the power and the type I error rate. Low prevalence contributes to a higher FDR as it increases the likelihood that false positives will be recorded. The calculation for the FDR therefore is:

FDR = \frac{(1 - \text{Prevalence}) \times \text{Type I error rate}}{\text{Prevalence} \times \text{Power} + (1 - \text{Prevalence}) \times \text{Type I error rate}}. (2)

The percentage of reported positives that are actually true is called the Positive Predictive Value (PPV). The PPV and FDR are inversely related, such that a higher PPV necessarily means a lower FDR. To calculate the FDR we subtract the PPV from 1. If there are no false positives then PPV = 1 and FDR = 0. Table 1 shows how low prevalence of effects, low power, and a high type I error rate all contribute to a high FDR.

Table 1. Greater prevalence, greater power, and a lower type I error rate reduce the FDR.

  Prevalence   Power   Type I error rate   FDR
  0.01         0.8     0.05                0.86
  0.1          0.8     0.05                0.36
  0.5          0.8     0.05                0.06
  0.1          0.2     0.05                0.69
  0.1          0.5     0.05                0.47
  0.1          0.8     0.01                0.10
  0.1          0.8     0.001               0.01

Most estimates of the FDR are surprisingly large, e.g. 50% [1, 11, 12] or 36% [10]. Jager and Leek more optimistically suggest that it is just 14% [13]. This lower estimate can be explained somewhat by the fact that they only use p-values reported in abstracts, and have a different algorithm to the other studies.
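Equation (2) is simple to compute directly. As a hedged illustration (ours, not the authors'), the following sketch implements the formula and reproduces the rows of Table 1.

```python
# Sketch: compute the False Discovery Rate from prevalence, power, and the
# type I error rate, following Equation (2). Reproduces Table 1.
def false_discovery_rate(prevalence: float, power: float, alpha: float) -> float:
    """Expected proportion of reported 'discoveries' that are false positives."""
    false_positives = (1 - prevalence) * alpha
    true_positives = prevalence * power
    return false_positives / (true_positives + false_positives)

# Rows of Table 1: (prevalence, power, type I error rate)
rows = [(0.01, 0.8, 0.05), (0.1, 0.8, 0.05), (0.5, 0.8, 0.05),
        (0.1, 0.2, 0.05), (0.1, 0.5, 0.05), (0.1, 0.8, 0.01),
        (0.1, 0.8, 0.001)]

for prevalence, power, alpha in rows:
    fdr = false_discovery_rate(prevalence, power, alpha)
    print(f"prevalence={prevalence:<5} power={power:<4} alpha={alpha:<6} FDR={fdr:.2f}")
```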
Importantly, they highlight that whilst α is normally set to 0.05, many studies, particularly in the life sciences, achieve p-values far lower than this, meaning that the average type I error rate is less than α of 0.05 [13, p. 7]. Counterbalancing this, however, is Colquhoun's argument that because most studies are not "properly designed" (in the sense that treatments are not randomly allocated to groups and in RCTs assessments are not blinded), statistical power will often be far lower than reported, thereby driving the FDR back up again [10].

Thus, though difficult to calculate precisely, the evidence suggests that the FDR of findings overall is far higher than α of 0.05. This suggests that too much trust is placed in current research, much of which is wrong far more often than we think. It is also worth noting that this analysis assumes that researchers do not intentionally misreport or manipulate results to erroneously achieve statistical significance. These phenomena, known as "selective reporting" and "p-hacking", are considered separately in Section 4.

3 PREVALENCE AND BAYES

As noted above, the prevalence of effects significantly impacts the FDR, whereby lower prevalence increases the likelihood that reported effects are false positives. Yet prevalence is not controlled by the researcher and, furthermore, cannot be calculated with any reliable accuracy. There is no way of knowing objectively what the underlying prevalence of real effects is. Indeed, the tools by which we might hope to find out this information (such as NHST) are precisely what have been criticised in the literature surveyed here. Instead, to calculate the FDR, prevalence has to be estimated¹. In this regard, FDR calculations are inherently Bayesian as they require the researcher to quantify their subjective belief about a phenomenon (in this instance, the underlying prevalence of real effects).

Bayesian theory is an alternative paradigm of statistical inference to frequentism, of which NHST is a part. Whereas frequentists quantify the probability of the data given the null hypothesis (P(D|H0)), Bayesians calculate the probability of the hypothesis given the data (P(H1|D)). Though frequentism is far more widely practiced than Bayesianism, Bayesian inference is more intuitive: it assigns a probability to a hypothesis based on how likely we think it to be true.

The FDR calculations outlined above in Section 2 follow a Bayesian logic. First, a probability is assigned to the prior likelihood of a result being false (1 − prevalence). Then, new information (the statistical power and type I error rate) is incorporated to calculate a posterior probability (the FDR). A common criticism against Bayesian methods such as this is that they are insufficiently objective, as the prior probability is only a guess. Whilst this is correct, the large number of "findings" produced each year, as well as the low rates of replicability [14], suggest that the prevalence of effects is, overall, fairly low. Another criticism against Bayesian inference is that it is overly conservative: assigning a low value to the prior probability makes it more likely that the posterior probability will also be low [15]. These criticisms notwithstanding, Bayesian theory offers a useful way of quantifying how likely it is that research findings are true.

¹ In much of the recent literature it is assumed that prevalence is very low, around 0.1 or 0.2 [1, 11, 10, 12].
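This Bayesian logic can be made explicit by writing the FDR calculation as an application of Bayes' theorem, with the assumed prevalence acting as the prior: the posterior probability that the null hypothesis is true given a "significant" result is exactly the FDR. The short sketch below is our own illustration; the prevalence of 0.1 follows the assumption common in the literature cited above.

```python
# Sketch: the FDR as a Bayesian update. Given a 'significant' result (D),
# compute the posterior probability that H0 is nevertheless true.
# The prevalence is an assumed prior, not an observed quantity.
prior_h1 = 0.1           # prior: assumed prevalence of real effects
prior_h0 = 1 - prior_h1  # prior probability that a tested effect is absent
power = 0.8              # P(significant | H1)
alpha = 0.05             # P(significant | H0), the type I error rate

# Bayes' theorem: P(H0 | significant) =
#   P(significant | H0) * P(H0) / P(significant)
p_significant = power * prior_h1 + alpha * prior_h0
posterior_h0 = (alpha * prior_h0) / p_significant  # this is the FDR
posterior_h1 = (power * prior_h1) / p_significant  # this is the PPV

print(f"P(H0 | significant) = FDR = {posterior_h0:.2f}")  # ~0.36
print(f"P(H1 | significant) = PPV = {posterior_h1:.2f}")  # ~0.64
```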
Not all of the authors in the literature reviewed here explicitly state that their arguments are Bayesian. The reason for this is best articulated by Colquhoun, who writes that "the description 'Bayesian' is not wrong but it is not necessary" [10, p. 5]. The lack of attention paid to Bayes in Ioannidis' well-regarded early article on p-values is particularly surprising given his use of Bayesian terminology: "the probability that a research finding is true depends on the prior probability of it being true (before doing the study)" [1, p. 696]. This perhaps reflects the uncertain position that Bayesianism holds in most universities, and the acrimonious nature of its relationship with frequentism [16]. Without commenting on the broader applicability of Bayesian statistical inference, we argue that a Bayesian methodology has great utility in assessing the overall credibility of academic research, and that it has received insufficient attention in previous studies. Here, we have sought to make visible, and to rectify, this oversight.

4 PUBLICATION BIAS: SELECTIVE REPORTING AND P-HACKING

Selective reporting and p-hacking are two types of researcher-driven publication bias. Selective reporting is where non-significant (but methodologically robust) results are not reported, often because top journals consider them to be less interesting or important [17]. This skews the distribution of reported results towards positive findings, and arguably further increases the pressure on researchers to achieve statistical significance.

Another form of publication bias, which also skews results towards positive findings, is called p-hacking. Head et al. define p-hacking as "when researchers collect or select data or statistical analyses until nonsignificant results become significant" [18]. This is direct manipulation of results so that, whilst they may not be technically false, they are unrepresentative of the underlying phenomena. See Figure 1 for a satirical illustration.

[Figure 1. "Significant": an illustration of selective reporting and statistical significance from XKCD. Available at http://xkcd.com/882/. Last accessed on 16 February 2016.]

Head et al. outline specific mechanisms by which p-values are intentionally "hacked". These include: (i) conducting analyses midway through experiments, (ii) recording many response variables and only deciding which to report post-analysis, (iii) excluding, combining, or splitting treatment groups post-analysis, (iv) including or excluding covariates post-analysis, and (v) stopping data exploration if analysis yields a significant p-value. An excellent demonstration of how p-values can be hacked by manipulating the parameters of an experiment is Christie Aschwanden's interactive "Hack Your Way to Scientific Glory" [19]. This simulator, which analyses whether Republicans or Democrats being in office affects the US economy, shows how tests can be manipulated to produce statistically significant results supporting either party.

In separate papers, Head et al. [18] and de Winter and Dodou [20] each examine the distributions of p-values that are reported in scientific publications in different disciplines. Both report considerably more studies with p-values just below the 0.05 significance level than just above it (and considerably more than would be expected given the number of p-values that occur in other ranges), which suggests that p-hacking is taking place. This core finding is supported by Jager and Leek's study on "significant" publications as well [13].
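One of the mechanisms listed above, conducting analyses midway through experiments and stopping once significance is reached, can be simulated directly. The following sketch (our own illustration, not code from any of the cited studies) shows how such "optional stopping" inflates the false positive rate well above the nominal 5% even when no real effect exists.

```python
# Sketch: simulate 'optional stopping', a p-hacking mechanism. Two groups are
# drawn from the SAME distribution (H0 is true), but the analyst peeks at the
# p-value after every batch of observations and stops as soon as p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, batch, max_n, n_sims = 0.05, 10, 100, 2000
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=max_n)
    b = rng.normal(size=max_n)  # identical distribution: any 'effect' is spurious
    for n in range(batch, max_n + 1, batch):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < alpha:            # stop and 'report' at the first significant peek
            false_positives += 1
            break

print(f"Nominal type I error rate: {alpha}")
print(f"False positive rate with optional stopping: "
      f"{false_positives / n_sims:.2f}")  # typically well above 0.05
```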
5 WHAT TO DO

We argued above that a Bayesian approach is useful to estimate the FDR and assess the overall trustworthiness of academic findings. However, this does not mean that we also hold that Bayesian statistics should replace frequentist statistics more generally in empirical research [see: 21]. In this concluding section we recommend some pragmatic changes to current (frequentist) research practices that could lower the FDR and thus improve the credibility of findings.

Unfortunately, researchers cannot control how prevalent effects are. They only have direct influence over their study's α and its statistical power. Thus, one step to reduce the FDR is to make the norms for these more rigorous, such as by increasing the statistical power of studies. We strongly recommend that α of 0.05 is dropped as a convention and replaced with a far lower α as standard, such as 0.01 or 0.001; see Table 1.

Other suggestions for improving the quality of statistical significance reporting include using confidence intervals [7, p. 152]. Some have also called for researchers to focus more on effect sizes than statistical significance [22, 23], arguing that statistically significant studies that have negligible effect sizes should be treated with greater scepticism. This is of particular importance in the context of big data studies, where many "statistically significant" studies report small effect sizes as the association between the dependent and independent variables is very weak; the sketch below illustrates this pattern.
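As a hedged illustration of this point (ours, not the authors', with simulated data and Cohen's d chosen as an example effect size metric): with a million observations per group, a mean difference of one hundredth of a standard deviation is "significant" at any conventional α, yet the effect is negligible.

```python
# Sketch: in large samples, a negligible effect can still be 'statistically
# significant'. The effect size (Cohen's d) exposes how weak the association is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000  # 'big data' sample size per group

# Two groups whose means differ by a trivial amount.
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.01, scale=1.0, size=n)

_, p = stats.ttest_ind(a, b)

# Cohen's d: standardized mean difference (pooled standard deviation).
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value   = {p:.2e}   (highly 'significant')")
print(f"Cohen's d = {cohens_d:.3f} (negligible effect)")
```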
Perhaps more important than any specific technical change in how data is analysed is the growing consensus that research processes need to be implemented (and recorded) more transparently. Nuzzo, for example, argues that "one of the strongest protections for scientists is to admit everything" [7, p. 152]. Head et al. also suggest that labelling research as either exploratory or confirmatory will help readers to interpret the results more faithfully [18, p. 12]. Weissgerber et al. encourage researchers to provide "a more complete presentation of data", beyond summary statistics [24]. Improving transparency is particularly important in "big" data-mining studies, given that the boundary between data exploration (a legitimate exercise) and p-hacking is often hard to identify, creating significant potential for intentional or unintentional manipulation of results.

Several commentators have recommended that researchers pre-register all studies with initiatives such as the Open Science Framework [18, 1, 7, 14, 25]. Pre-registering ensures that a record is kept of the proposed method, effect size measurement, and what sort of results will be considered noteworthy. Any deviation from what is initially registered would then need to be justified, which would give the results greater credibility. Journals could also proactively assist researchers to improve transparency by providing platforms on which data and code can be shared, thus allowing external researchers to reproduce a study's findings and trace the method used [18]. This would provide academics with the practical means to corroborate or challenge previous findings.

Scientific knowledge advances through corroboration and incremental progress. In keeping with Fisher's initial view that p-values should be one part of the evidence used when deciding whether to reject the null hypothesis, our final suggestion is that the findings of any single study should always be contextualised within the broader field of research. Thus, we endorse the view offered in a recent editorial of Psychological Science that we should be extra sceptical about studies where (a) the statistical power is low, (b) the p-value is only slightly below 0.05, and (c) the result is surprising [14]. Normally, findings are only accepted once they have been corroborated through multiple studies, and even in individual studies it is common to "triangulate" a result with multiple methods and/or data sets. This offers one way of remediating the problem that even "statistically significant" results can be false; if multiple studies find an effect then it is more likely that it truly exists. We therefore also support the collation and organisation of research findings in meta-analyses, as these enable researchers to quickly evaluate a large range of relevant evidence.

DISCLOSURE/CONFLICT-OF-INTEREST STATEMENT

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

All authors listed have made substantial, direct and intellectual contributions to the work, and approved it for publication.

ACKNOWLEDGMENTS

For providing useful feedback on the original manuscript we thank Jonathan Bright, Sandra Wachter, Patricia L. Mabry, and Richard Vidgen.

REFERENCES

[1] Ioannidis J. Why most published research findings are false. PLoS Med 2 (2005) e124.
[2] Ziliak ST, McCloskey DN. The cult of statistical significance: How the standard error costs us jobs, justice, and lives (University of Michigan Press, Ann Arbor) (2008).
[3] Cohen J. The earth is round (p < .05). American Psychologist 49 (1994) 997-1003.
[4] Lew MJ. To p or not to p: on the evidential nature of p-values and their place in scientific inference. arXiv preprint arXiv:1311.0081 (2013).
[5] Fisher RA. Statistical methods for research workers (Genesis Publishing Pvt Ltd, Edinburgh) (1925).
[6] Sterne JA, Smith GD. Sifting the evidence: what's wrong with significance tests? Physical Therapy 81 (2001) 1464-1469.
[7] Nuzzo R. Statistical errors. Nature 506 (2014) 150-152.
[8] Berger JO. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science 18 (2003) 1-32.
[9] Hurlbert SH, Lombardi CM. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46 (2009) 311-349.
[10] Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1 (2014) 140216.
[11] Biau DJ, Jolles BM, Porcher R. P value and the theory of hypothesis testing: An explanation for new researchers. Clinical Orthopaedics and Related Research 468 (2010) 885-892.
[12] Freedman LP, Cockburn IM, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol 13 (2015) e1002165.
[13] Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15 (2014) 1-12.
[14] Lindsay DS. Replication in psychological science. Psychological Science 26 (2015). doi:10.1177/0956797615616374.
[15] Gelman A. Objections to Bayesian statistics. Bayesian Analysis 3 (2008) 445-449.
[16] McGrayne SB. The theory that would not die: how Bayes' rule cracked the enigma code, hunted down Russian submarines, & emerged triumphant from two centuries of controversy (Yale University Press, New Haven) (2011).
[17] Franco A, Malhotra N, Simonovits G. Publication bias in the social sciences: Unlocking the file drawer. Science 345 (2014) 1502-1505.
[18] Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol 13 (2015) e1002106.
[19] Aschwanden C. Science isn't broken. FiveThirtyEight, http://fivethirtyeight.com/features/science-isnt-broken/, last accessed on 22 January 2016 (2015).
[20] de Winter JC, Dodou D. A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ 3 (2015) e733.
[21] Simonsohn U. Posterior-hacking: Selective reporting invalidates Bayesian results also. Draft paper (2014).
[22] Coe R. It's the effect size, stupid: What effect size is and why it is important. Paper presented at the British Educational Research Association annual conference: Exeter (2002).
[23] Sullivan GM, Feinn R. Using effect size, or why the p value is not enough. Journal of Graduate Medical Education 4 (2012) 279-282.
[24] Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond bar and line graphs: Time for a new data presentation paradigm. PLoS Biol 13 (2015) e1002128. doi:10.1371/journal.pbio.1002128.
[25] Peplow M. Social sciences suffer from severe publication bias. Nature, August (2014).