P values, confidence intervals, or confidence levels for hypotheses?

Null hypothesis significance tests and p values are widely used despite very strong arguments against their use in many contexts. Confidence intervals are often recommended as an alternative, but these do not achieve the objective of assessing the cr…

Authors: Michael Wood

11 February 2014

Michael Wood
University of Portsmouth Business School
Richmond Building, Portland Street, Portsmouth PO1 3DE, UK
michael.wood@port.ac.uk
mickofemsworth@gmail.com

Abstract

Null hypothesis significance tests and p values are widely used despite very strong arguments against their use in many contexts. Confidence intervals are often recommended as an alternative, but these do not achieve the objective of assessing the credibility of a hypothesis, and the distinction between confidence and probability is an unnecessary confusion. This paper proposes a more straightforward (probabilistic) definition of confidence, and suggests how the idea can be applied to whatever hypotheses are of interest to researchers. The relative merits of the different approaches are discussed using a series of illustrative examples: usually confidence based approaches seem more transparent and useful, but there are some contexts in which p values may be appropriate. I also suggest some methods for converting results from one format to another. (The attractiveness of the idea of confidence is demonstrated by the widespread persistence of the completely incorrect idea that p=5% is equivalent to 95% confidence in the alternative hypothesis. In this paper I show how p values can be used to derive meaningful confidence statements, and the assumptions underlying the derivation.)

Key words: Confidence interval, Confidence level, Hypothesis testing, Null hypothesis significance tests, P value, User friendliness.
Introduction

Null hypothesis significance tests (NHSTs) are widely used to answer the question of whether empirical results based on a sample are due to chance, or whether they are likely to indicate a real effect which applies to the whole population from which the sample is taken. There are, however, serious difficulties with such tests and their resulting p values or significance levels: the literature on these difficulties goes back at least half a century (e.g. Cohen, 1994; Gardner and Altman, 1986; Gill, 1999; Kirk, 1996; Lindsay, 1995; Mingers, 2006; Morrison and Henkel, 1970; Nickerson, 2000). NHSTs are very widely misinterpreted, they do not provide the information that is likely to be wanted, and as many null hypotheses are obviously false the tests are often unnecessary as well as uninformative.

The commonly suggested alternative to the use of NHSTs is the use of confidence intervals (e.g. Cashen and Geiger, 2004; Cortina and Folger, 1998; Gardner and Altman, 1986; Gill, 1999; Mingers, 2006; Wood, 2005). In medicine, for example, guidance to authors of research papers in some journals (BMJ, 2011), and regulatory authorities (ICH, 1998), strongly recommends these in preference to NHSTs. However, in most of the social sciences, NHSTs, and not confidence intervals, are still the standard.

There are, however, also problems with confidence intervals:

1 They refer to an interval whereas in many cases researchers do want to evaluate a hypothesis.
2 “Confidence” is usually defined in a rather awkward way which appears to distinguish the concept from the probabilities that people intuitively want.
3 They are inapplicable if the characteristic of interest cannot be expressed on a suitable numerical scale.
My aim in this paper is to extend the idea of confidence to include confidence levels for hypotheses in general (not just intervals), to propose that confidence levels can reasonably be interpreted as probabilities, to suggest some simple methods for deriving confidence levels from p values, and to assess the relative merits of NHSTs, confidence intervals and confidence levels for hypotheses. This should be of interest to any researcher concerned about the best way to analyze and communicate statistical results.

For example, according to a study which sought to investigate the impact of social status on mortality by analyzing how winning an Academy award (Oscar) may prolong an actor’s life, “life expectancy was 3.9 years longer for Academy Award winners than for other, less recognized performers (79.7 vs. 75.8 years, p = 0.003)” (Redelmeier and Singh, 2001: 955). The p value here does not directly address the question of how likely it is that Oscar winners really do live longer – the equivalent confidence level for this hypothesis is 99.85% (making a few reasonable assumptions to be explained below). Alternatively, we might cite a confidence level for the slightly stronger hypothesis that the life expectancy of Oscar winners is at least a year longer (98.6%). Confidence levels of this type avoid most of the difficulties of p values – they do, for example, seem far easier to understand.

I start with a brief discussion of the concepts used to frame the problem that all the methods are tackling – that of using a sample to make inferences about a wider population. I then review briefly the difficulties of NHSTs, how confidence intervals overcome many of these difficulties and how confidence can be defined. Then I explore the idea of confidence levels for more general hypotheses and how they can be estimated.
I finish with a discussion of a series of examples, chosen to illustrate the advantages and disadvantages of p values, confidence intervals, and confidence levels for hypotheses in a range of different contexts. My concern in this article is with the concepts used to express statistical conclusions, not with the detail of methods of analysis; I have chosen examples using relatively simple methods because this makes it easier to analyze these concepts.

Samples, populations, processes and the wider context

Hand (2009: 291) points out that “much statistical theory is based on the notion that the data have been randomly drawn from a population” but this is often not the case. To make sense of statistical inference procedures we then need to imagine a population from which the sample can reasonably be assumed to have been randomly selected. The sample of Oscar winners included only past Oscar winners, but the aim of the research was to see if anything could be inferred about the life expectancy of Oscar winners in general, including future winners. We then need to make the assumption that the sample can be regarded as a random sample from this population of current and potential future winners – which is obviously a difficult notion to nail down precisely.

Experiments, or randomized trials, also make the idea of the population problematic. Suppose, for example, we are comparing two training programs, A and B, with the help of 42 trainees [1]: as I will use this as an example below it is helpful to give a few details. A randomly chosen group of 21 trainees does Program A, the remainder do Program B, and then the effectiveness of the training for each trainee is rated on a 1-7 scale. To compare the two programs we then work out the mean effectiveness ratings for each group of 21 trainees: these come to 4.48 for Program A and 5.55 for Program B.
This suggests that Program B is better, but does not answer the question of whether the effect may be due to chance and might be reversed in another similar experiment. In one sense the population here is the wider group of trainees from whom the sample is drawn, but even if the 42 trainees comprised the entire population, the problem of sampling error still arises because a different division of trainees into two groups might produce a different result.

In both cases it is obviously important to analyze how reliable the result is taking account of sampling error, although the idea of a population is difficult to visualize. An alternative metaphor is the idea of a “process”: we could refer to the training process or the process of winning an Oscar. This is still rather awkward, and does not acknowledge the future dimension. A term such as “wider context” is vaguer, and so more suitable for informal descriptions, although for formal work the notion of population is convenient and deeply embedded in the language of statistics. For the Oscars, the wider context includes the future, and for the training programs the wider context includes both a wider population of trainees and the fact that there are many possible allocations into two groups.

Null hypothesis significance tests (NHSTs) and p values

The idea of an NHST is to set up a null hypothesis and then use probability theory or simulation to estimate the probability of obtaining the observed results, or more extreme results, if the null hypothesis is true [2]. If this probability, known as the p value, is low, then we conclude the null hypothesis is unlikely and an alternative hypothesis must be true. Many researchers use cut-off levels of 5%, 1%, etc. and describe their results as significant at 5% (p<0.05) or whatever.
The result above about Oscars and life expectancy, for example, is significant at 1%, indicating reasonably strong evidence against the null hypothesis, and so for the hypothesis that winning an Oscar really does tend to prolong life. On the other hand, a subsequent analysis using a different method of analysis and including more recent data gave results that are equivalent to a p value between about 13% and 17% (Sylvestre et al, 2006; I have estimated the p value from the confidence interval given in the article, using the method described below). This suggests that the difference in life expectancies is well within the range that would be expected if only chance factors were at work. In contrast to the earlier result, this suggests there is little evidence for the hypothesis that winning an Oscar prolongs an actor’s life.

Similarly, the advantage of Program B over Program A yields a p value of 2.11%. This is shown graphically in Figure 1 [3]. This graph represents the likely distribution of the mean difference between the programs in similar samples drawn from the same source, assuming the truth of the null hypothesis that the difference between the population means is actually zero, and that any difference observed in the sample is just due to random sampling error (this is often called a sampling distribution). The graph shows that the probability of this difference being as big as, or bigger than, the observed difference of 1.07 (in either direction) is only 2.11%. As this p value is low we can assume the null hypothesis is unlikely and there is a real difference in the effectiveness of the two programs.

Figure 1: Probability distribution for sample estimate of difference between Program B and Program A assuming the null hypothesis of no population difference

As noted above, NHSTs have attracted some very extensive criticism over the years.
I will review some of the main points here, but for a more extensive review the reader is referred to the citations above. There are counter-arguments; taken in the right spirit, sometimes NHSTs may have a useful role to play – I will discuss this in relation to some specific examples below.

1. NHSTs fail to provide the information people are likely to want. The p value from the NHST about Oscar winners does not tell us (a) how many extra years Oscar winners are likely to have (the strength of the effect), and (b) how probable it is that winning an Oscar does increase life expectancy (the p value may give us some indication but it does not give this probability). The p value concerning the two training programs is similarly uninformative. Readers of the article about the two training programs are given the mean scores for each group, but the difference between the means is not made explicit in the article. This is part of a general tendency for “quantitative” research in some social sciences to be strangely non-quantitative in the sense that readers are often not told the size of effects and differences. This is not, however, true of the article about the Oscars (in a medical journal with different conventions) where the 3.9 years is made explicit.

2. The conclusions that can be drawn from NHSTs are often trivial. Strictly, the null hypotheses on which p values are based are exact: the Oscar winners’ life expectancy and that of the controls are exactly the same, and the population mean scores for both training programs are identical. In practice, slight differences are likely between any two groups, so null hypotheses of this kind are very likely to be false, which means that there is little point in a formal test to prove it. The result will depend on the sample size: with a suitably large sample almost any null hypothesis is likely to be disproved.
Even apart from this logical point, null hypotheses are sometimes so unlikely as to make disproving them of marginal interest. For example, Grinyer et al (1994) tested the – very implausible – hypothesis that respondents to a questionnaire are equally likely to agree with a statement, or disagree, or neither agree nor disagree; and Glebbeek and Bax (2004: Table 2) cite a p value less than 1% for the relationship between employee turnover and the performance of organizations – common sense and much of the literature suggest that a null hypothesis of no relationship between these two variables is false, so the p value adds little.

[Figure 1. Horizontal axis: sample estimate of difference between Programme B and A, from -3.0 to 3.0. Annotation: p is the probability in the two tails, which is 2.11%.]

3. NHSTs are very widely misinterpreted. Statistically significant results are widely assumed to be large and important in a practical sense. The fact that a p value is 5% is widely viewed as implying that the probability of the truth of the alternative hypothesis is 95%. A non-significant p value is often seen as some sort of proof for the truth of the null hypothesis (this fallacy is built into the “test of normality” often used as a pre-check for some statistical procedures). None of these interpretations is valid. Nickerson (2000) lists ten distinct ways in which NHSTs can be, and often are, misunderstood. Part of the reason for this is doubtless the natural tendency to assume that a carefully crafted statistic like a p value will deliver the information that is obviously wanted: unfortunately this is not the case.

Coulson et al (2010) tested how well 330 authors of published articles understood p values and confidence intervals. They concluded that “interpretation was generally poor”.
However, there was very clear evidence that many authors interpreted confidence intervals in terms of p values; those who interpreted confidence intervals without reference to null hypothesis tests gave a far better interpretation of the results than those who thought in terms of null hypothesis tests, which suggests that NHSTs are a powerful confusing influence in the interpretation of statistics, even among professional researchers.

The terminology commonly used is not helpful. “Significant” in ordinary English does mean important. Widely used phrases like A “is significantly more than” B suggest that the statistical significance is a property of the difference between A and B, as opposed to being just a measure of the strength of the evidence for this difference. Furthermore, p values mean focusing on a hypothetical null hypothesis instead of the hypothesis of interest. And to cap it all, as a measure of the strength of evidence, p values are a reverse measure – low values indicating strong evidence. All these factors make the widespread misunderstanding of NHSTs seem almost inevitable.

Confidence intervals

The idea of confidence intervals is to use the data to derive an interval within which we have a specified level of confidence that the population parameter will lie. For the Oscars, the first analysis suggests that the extra life expectancy for winners is 3.9 years and the 95% confidence interval for this additional life expectancy is likely to be about 1.3 to 6.5 years (an estimate from the results given by Redelmeier and Singh (2001) using the normal distribution as described below). We cannot be sure of the exact advantage from winning an Oscar on the basis of the sample data, but we can be 95% confident that the true figure will lie in this interval.
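The interval quoted above for the Oscars can be reproduced approximately from the two published figures alone. The following is a minimal sketch, assuming that the sampling distribution is roughly normal, that p = 0.003 is a two-tailed p value, and that the null hypothesis is a difference of zero. The function name `interval_from_p` is hypothetical and is not taken from the paper or its spreadsheet.

```python
from statistics import NormalDist


def interval_from_p(estimate, p_two_tailed, level=0.95):
    """Approximate a confidence interval from an estimate and its two-tailed p value.

    Assumes a normal sampling distribution and a null hypothesis of zero.
    """
    std_normal = NormalDist()
    # z score that corresponds to the reported two-tailed p value
    z_observed = std_normal.inv_cdf(1 - p_two_tailed / 2)
    # standard error implied by the estimate and that z score
    standard_error = abs(estimate) / z_observed
    # critical value for the requested level (about 1.96 for 95%)
    z_critical = std_normal.inv_cdf(1 - (1 - level) / 2)
    half_width = z_critical * standard_error
    return estimate - half_width, estimate + half_width


# Figures reported by Redelmeier and Singh (2001): 3.9 extra years, p = 0.003
low, high = interval_from_p(3.9, 0.003)
print(f"95% interval: {low:.1f} to {high:.1f} years")
```

Under these assumptions the result comes out at roughly 1.3 to 6.5 years, in line with the interval quoted in the text. A t distribution would give a slightly wider interval for small samples.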
I will look at the second example in more detail and carry this through to the discussion of confidence levels. Figure 2 shows a confidence distribution for the population difference between the means of the effectiveness ratings of Programs A and B. This is derived from Figure 1 simply by shifting the curve along so that it is centered on 1.07 (the observed difference of the means) rather than 0.

An informal rationale for this goes as follows. The most likely value for the population parameter, given the sample data, is the sample estimate (1.07), so it makes sense that this should be the centre of the confidence distribution. Furthermore, Figure 1 suggests that the probability of the difference between the sample estimate and the unknown population value (assumed to be 0) being more than 1.07 is 2.11%, so the tails in Figure 2 are correct from this perspective. Figure 1 can be regarded as describing the probabilities of different discrepancies between the sample estimate and the population parameter being estimated, so it is reasonable to regard a displaced version of Figure 1, in Figure 2, as representing our view, based on the sample information, of different possible values of the population parameter. The horizontal axis in Figure 2 refers to the possible values of the unknown population parameter (the difference of the means), whereas in Figure 1 it refers to sample estimates.

Figure 2 (and the spreadsheet behind it) enables us to read off the 2.5 and 97.5 percentiles of this distribution – this is the 95% confidence interval which extends from 0.17 to 1.97. (There is a more detailed discussion of the rationale behind this in the Appendix.)

Figure 2: Confidence distribution and interval for difference between Program B and Program A

This has clear advantages over the p value presentation.
It does answer directly questions about the strength of the effect (how big the difference is), and the width of the interval describes the uncertainty due to sampling error in an obvious way. The information displayed is not trivial or obvious like the NHST conclusions may be, and misinterpretations seem far less likely than for NHSTs. The focus is very much on the difference between the programs and not on a hypothetical null hypothesis, there is no inverse scale, and the word “confidence” suggests that what is being assessed is the strength of the evidence.

[Figure 2. Horizontal axis: possible values of population difference, Programme B - Programme A, from -3.0 to 3.0. The dotted lines represent a 95% confidence interval. The area in the two tails beyond the outer solid lines is 2.11%, as in Figure 1.]

It is also worth noting that, because zero is not in this interval, we can be more than 95% confident that Program B is better than Program A, which is equivalent to the statement that p<0.05. In this way all the information in the significance level can be deduced from confidence intervals, but the confidence intervals provide extra information about the size of the difference and the extent of the uncertainty.

Confidence intervals corresponding to many other null hypothesis models have been derived and built into software packages. They are widely used as a means of analyzing and presenting results in some fields such as medicine, but not in most social sciences.

Confidence levels for hypotheses

The notion of confidence can easily be extended from intervals to more general hypotheses. This idea can easily be applied to the difference between Programs A and B (Figures 1 and 2).
The confidence curve in Figure 2 is symmetrical so the confidence of the value being in the lower tail (below zero) will be half of 2.11% or 1.1% (rounded to one decimal place), and the confidence that the difference in the means will be greater than zero, or that Program B is actually, in population terms, better than Program A is 100% – 1.1% = 98.9%.

This principle can easily be extended. Suppose we were interested in the hypothesis that the advantage of Program B is substantial, say greater than one unit. Then the p value is no longer helpful, but in the case of Figure 2, we can use the t distribution (more details below) directly to show that:

Confidence (Program B more than 1 unit better than A) = 56%.

It should be clear from Figure 2 that this is roughly right. In a very similar way, using Redelmeier and Singh’s (2001) data and methods, the confidence level for the hypothesis that Oscar winners live longer than the controls is 99.85%, and the confidence for their life expectancy being at least a year longer is 98.6%.

The idea of a confidence level for a hypothesis is more general than these two examples might suggest. For example, Glebbeek and Bax (2004) wanted to confirm the hypothesis that there is an “inverted U-shape relationship” between two variables – staff turnover and organizational performance – by setting up regression models with both staff turnover, and staff turnover squared, as independent variables. Because this hypothesis does not depend on a single parameter, it is awkward to use p values or confidence intervals to support this hypothesis. Glebbeek and Bax (2004) actually used p values, but it is easy to use a bootstrap argument to estimate the confidence level for this hypothesis – this comes to 67% (Wood, 2012).
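The confidence levels for the training programs can be checked with a short calculation. What follows is a sketch under stated assumptions, not the paper's spreadsheet: it assumes the p value of 2.11% comes from a two-sample t test with 40 degrees of freedom (21 + 21 trainees minus 2), and it uses a small hand-written t distribution (`t_pdf`, `t_cdf`, `t_quantile`, all hypothetical helper names) so that only the Python standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    if x == 0:
        return 0.5
    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(prob, df):
    """Value x with P(T <= x) = prob, found by bisection."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


df = 40                   # assumption: 21 + 21 trainees, minus 2
estimate = 5.55 - 4.48    # observed difference, Program B minus Program A
p_two_tailed = 0.0211

# Standard error implied by the estimate and the two-tailed p value
t_observed = t_quantile(1 - p_two_tailed / 2, df)
standard_error = estimate / t_observed


def confidence_greater_than(threshold):
    """Confidence that the population difference exceeds threshold."""
    return t_cdf((estimate - threshold) / standard_error, df)


t_critical = t_quantile(0.975, df)
low = estimate - t_critical * standard_error
high = estimate + t_critical * standard_error

print(f"Confidence (B better than A): {confidence_greater_than(0):.1%}")
print(f"Confidence (B more than 1 unit better): {confidence_greater_than(1):.0%}")
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the output should be close to the 98.9%, 56% and 0.17 to 1.97 quoted in the text. The shifted curve is the only modelling step: every figure follows from the estimate, the p value and the assumed degrees of freedom.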
Confidence as probability

The word “confidence” is conventionally used to indicate that the concept is not probability but is to be interpreted in frequentist terms. This means that we need to imagine repeating the procedure that led to a 95% confidence interval, for example: then if the 95% is accurate, 95% of these repetitions should produce an interval which includes the true value of what we are trying to estimate (see, for example, Bayarri and Berger, 2004). On the other hand, interpreting a 95% confidence interval as a probability would simply involve asserting that there is a probability of 95% that the truth about the whole population lies somewhere in this interval. This distinction is described as “subtle” by Nickerson (2000, p. 279), and is one of the issues at stake in the literature on the foundations of statistical inference. This literature is complex both conceptually and mathematically, and has spawned debates without easy answers.

One influential and important perspective is the Bayesian one: the Bayesian equivalent of a confidence interval is a credible interval. These are sometimes identical to frequentist confidence intervals (Bayarri and Berger, 2004: 63), so in these cases it would be reasonable to view confidence levels as probabilities, and to identify the confidence distribution with the posterior probability distribution. The confidence level for a hypothesis is simply the posterior probability of the hypothesis.

Part of the reason for the reluctance to do this stems from the frequentist view that either the population mean is in the confidence interval, or it is not, and that the idea of probability cannot meaningfully be used to express this type of uncertainty.
However, in everyday discourse the idea of using probability for this type of “epistemic” uncertainty is widespread and unproblematic, so it would seem sensible to ignore the frequentists’ philosophical objections and treat confidence levels as probabilities.

Another reason for the reluctance to use Bayesian methods is that these bring in prior probabilities to reflect prior beliefs about the situation. In practice, this may be difficult, and is widely seen as introducing an unhelpful element of subjectivity into the calculations. However, the assumption necessary to produce Bayesian intervals which are identical to standard confidence intervals for the example in Figure 2 is that the prior probabilities should be uniform, indicating that all values on the horizontal axis in Figure 2 are equally likely (see Appendix). In Bayesian terms, this is the assumption on which the derivation of Figure 2 and the above confidence interval depends.

My suggestion is that we adopt the following definition of a confidence level for an interval or hypothesis:

A confidence level is defined as an estimate of the probability of the true value of the parameter being within the interval, or of the probability of the truth of the hypothesis.

There may be different ways of estimating this probability: using Bayesian credible intervals based on a uniform prior distribution, or on some other prior distribution, or by the methods used to derive confidence intervals. Obviously, different methods of computation may give slightly different answers, but this is hardly unusual in statistics where many concepts are slippery and can be made precise in different ways. (Different statistical tests may give different p values, for example.)
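The role of the uniform prior mentioned above can be illustrated in the simplest case. The following sketch assumes that the sample estimate $\hat{\theta}$ is normally distributed around the population parameter $\theta$ with a known standard error $s$. These symbols are introduced here for illustration and are not the paper's notation; the Appendix may treat the matter differently.

```latex
\begin{align*}
\text{Likelihood:}\quad & p(\hat{\theta} \mid \theta) \propto
    \exp\!\left(-\frac{(\hat{\theta}-\theta)^2}{2s^2}\right) \\
\text{Uniform prior:}\quad & p(\theta) \propto 1 \\
\text{Posterior:}\quad & p(\theta \mid \hat{\theta}) \propto
    p(\hat{\theta} \mid \theta)\, p(\theta) \propto
    \exp\!\left(-\frac{(\theta-\hat{\theta})^2}{2s^2}\right)
\end{align*}
```

Under these assumptions the posterior for $\theta$ is a normal distribution centered on $\hat{\theta}$ with standard deviation $s$. This is the null hypothesis sampling distribution shifted along to the sample estimate, so its central 95% region, $\hat{\theta} \pm 1.96s$, coincides with the usual 95% confidence interval.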
The Bayesian methods have the advantage that they are arguably more transparent: Bayarri and Berger (2004) suggest that the Bayesian approach should be “taught to the masses” (p. 59) and that this is often possible “without changing the procedures that are taught”.

Methods of estimating confidence levels

If we think of a confidence distribution as a Bayesian posterior distribution, then we have the whole gamut of Bayesian methods at our disposal. Similarly there is a wide variety of methods for deriving confidence intervals, which could easily be adapted to give confidence levels for more general hypotheses. This includes bootstrapping, an approach of very general applicability, which was used to generate the 67% confidence level mentioned above. However, in practice, we may be using software which just generates p values, and possibly confidence intervals. Alternatively we may just have the results in a published paper. The methods outlined below are for estimating confidence levels from the information we are likely to have in these circumstances.

Estimating confidence levels from p values or other NHST statistics

If we only have p values (from a software package or a published paper) it is sometimes possible to estimate confidence levels as we have seen above. This approach assumes that it is reasonable to shift the null hypothesis distribution and treat it as a confidence distribution as described above in the section on confidence intervals. There is a more detailed analysis of the conditions under which this is reasonable in the Appendix: in rough terms the curve shift method is likely to be reasonable if the null hypothesis is modeled by a symmetrical distribution such as the normal or t distributions.

The method above can easily be generalized. If the difference of the means were negative, the argument above would be reversed, so in general we can write

$$
\text{Confidence}(\text{Pop. parameter} > 0) =
\begin{cases}
1 - p/2 & \text{if Sample estimate} \ge 0 \\
p/2 & \text{if Sample estimate} < 0
\end{cases}
$$

where p is the two tailed p value for the null hypothesis that the population parameter is zero. In Figure 1 the one-tailed p value is 2.11%/2 = 1.1% which is the same as the confidence that Program A is better than Program B. Confidence levels thus give us another way of interpreting one-tailed significance levels.

On the other hand, the argument here is not consistent with the common misconception that p = 5% means that the probability of the null hypothesis being true is 5%, so the probability that the alternative hypothesis is true – that there is a difference between Programs A and B – is 95%. Figure 2 illustrates the problem with this. The probability of the difference between the two programs being exactly zero is very small indeed, certainly not 5%.

In some cases p values are not given exactly but as an inequality. The example above mirrors Eggins et al (2008), where the p is given a single star, indicating that p < 5%. This means that the above reduces to the assertion that the confidence level is greater than 97.5%.

If we are starting from a value of t or z, or if we want a confidence level for another hypothesis, it is possible to deduce the standard error from the given information and then use this to calculate confidence levels from the t or normal distributions. The arithmetic here is easy, and is incorporated in the spreadsheet at http://woodm.myweb.port.ac.uk/CLIP.xls.

Estimating confidence levels from confidence intervals

Sylvestre et al (2006), in their paper on Oscar winners, gave a 95% confidence interval for additional life expectancy enjoyed by Oscar winners as –0.3 years to +1.6 years.
The confidence level for the hypothesis that this additional life expectancy is positive could be estimated by rerunning the analysis with different confidence levels until one is found that has a lower limit of zero. If this is the X% interval, the confidence level for the hypothesis that the additional life expectancy is positive, and that the experience of winning an Oscar is linked to longer life expectancy, is (100 + X)/2 %, because half of the remaining (100 – X)% lies in each tail. In this way any software package generating different confidence intervals could be used to build up a confidence distribution.

In practice, it may be reasonable to assume that the confidence distribution is approximated by a t or normal distribution, in which case the mean and one of the given limits can be used to estimate the standard error and hence use the t or normal distribution to estimate any confidence level we want – the confidence level for the hypothesis that the additional life expectancy is positive is somewhere between 91% and 94% (using the spreadsheet at http://woodm.myweb.port.ac.uk/CLIP.xls, the answer depending on which limit we take as they are not quite symmetrical). This is obviously just an approximation, but there seems little point in being too pedantic when the interpretation of confidence is likely to refer to arbitrary levels such as 95%, and estimates of statistics such as p values are themselves subject to considerable variation between samples (Boos and Stefanski, 2011).

Some examples

I will start by drawing together the discussion of the examples above, and then consider a few more examples – chosen to illustrate a number of issues. All statistics not explained below, or given by the authors of the original research, are estimated roughly using the methods described in the previous section and the spreadsheet at http://woodm.myweb.port.ac.uk/CLIP.xls.
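The two conversions from the previous section can be sketched in a few lines. This is an illustration under stated assumptions rather than a copy of the spreadsheet: the function names are hypothetical, the confidence distribution is assumed to be normal, and, because the point estimate is not always reported, the second function uses the midpoint of the interval as a stand-in for it.

```python
from statistics import NormalDist


def confidence_positive_from_p(p_two_tailed, sample_estimate):
    """Confidence that the population parameter is above zero (curve shift rule)."""
    if sample_estimate >= 0:
        return 1 - p_two_tailed / 2
    return p_two_tailed / 2


def confidence_from_interval(low, high, threshold=0.0, level=0.95):
    """Confidence that the parameter exceeds threshold, from a reported interval.

    Assumption: the confidence distribution is normal and centered on the
    midpoint of the interval, which stands in for the point estimate.
    """
    z_critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = (low + high) / 2
    standard_error = (high - low) / (2 * z_critical)
    return 1 - NormalDist(centre, standard_error).cdf(threshold)


# Training programs: two-tailed p = 2.11%, observed difference 1.07
print(f"{confidence_positive_from_p(0.0211, 1.07):.1%}")

# Interval of -0.3 to +1.6 years quoted from Sylvestre et al (2006)
print(f"{confidence_from_interval(-0.3, 1.6):.0%}")
```

With the midpoint assumption the second figure comes out at about 91%, at the lower end of the 91% to 94% range given in the text. Using the reported mean with one limit at a time, as the text describes, would move the answer within that range.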
Oscars and life expectancy

Redelmeier and Singh (2001) found that Oscar winners' life expectancy was 3.9 years longer than the controls. They cited a p value for this result:

p = 0.003

Alternatively they could have stated that:

95% confidence interval for additional life expectancy is 1.3 to 6.5 years
Confidence level for positive additional life expectancy = 99.85%
Confidence level for additional life expectancy of one year or more = 98.6%

The equivalent results for Sylvestre et al's (2006) updated analysis are (see Note 4):

p = 0.15
95% confidence interval is -0.3 to 1.7 years
Confidence level for positive additional life expectancy = 93%
Confidence level for additional life expectancy of one year or more = 27%

The confidence intervals and levels seem more useful and easier to interpret than the p values.

Training programs A and B

This example was introduced because it is an experiment, or randomized controlled trial. However, the issues regarding the analysis are similar to the Oscars example and to the next example, so I will not analyze this further here.

Men and women in work-home culture

This example is included to show how the results in a typical article in a social science journal, the British Journal of Management, could be analyzed in terms of confidence. As part of a study of "work-home culture and employee well-being" Beauregard (2011) showed the differences between men and women on 9 variables in her sample of 224 local government employees. I will use two of these variables as illustrations:

Table 1. Some of the results in Table 1 in Beauregard (2011)

Measure | Mean for men (n=84) | Mean for women (n=140) | t(222)
Work-home culture: managerial support | 4.34 | 4.56 | -1.33
Hours worked weekly | 41.27 | 36.69 | 3.68***

She also gives the SD for each variable, and a note under the table explains that *** means p<0.001: the difference between men and women is not significant (p>0.05) for the first variable, and highly significant for the second. As is usual in management research, confidence intervals are not given.

Tables 2 and 3 show the same results in terms of confidence intervals and levels.

Table 2. Some of the results in Table 1 in Beauregard (2011) expressed as confidence intervals

Measure | Mean for men (n=84) | Mean for women (n=140) | Difference of means (Men - Women) | 95% confidence interval for difference of means
Work-home culture: managerial support | 4.34 | 4.56 | -0.22 | -0.55 to +0.11
Hours worked weekly | 41.27 | 36.69 | 4.58 | 2.1 to 7.0

Table 3. Some of the results in Table 1 in Beauregard (2011) expressed as confidence levels for hypotheses

Measure | Mean for men (n=84) | Mean for women (n=140) | Difference of means (Men - Women) | Confidence level for the hypothesis: mean for men > mean for women
Work-home culture: managerial support | 4.34 | 4.56 | -0.22 | 9%
Hours worked weekly | 41.27 | 36.69 | 4.58 | 99.99%

Telepathy

In a series of experiments in the 1920s and 30s, the psychologist J B Rhine found a number of people who appeared to be telepathic (Rhine, 1997). In one series of experiments, Hubert Pearce Jr. did a card guessing experiment 8,075 times, and got the card right on 3,049 occasions. There were five cards in the pack, so guesswork would have produced about 1,615 hits. Rhine argues that Pearce's performance is so much better than guesswork that telepathy must be involved; others have taken the hypothesis that Pearce was cheating more seriously (Hansel, 1966).
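The guessing baseline for Pearce's results, and the size of his excess over it, can be checked with a little arithmetic. The sketch below assumes a simple binomial model of pure guessing (each guess independent, with a one in five chance of being right) and uses a normal approximation to it; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

trials, hits, cards = 8075, 3049, 5      # figures quoted in the text
p_guess = 1 / cards                      # chance of a hit by pure guessing

expected_hits = trials / cards                        # mean of the binomial null
sd_hits = sqrt(trials * p_guess * (1 - p_guess))      # its standard deviation
z = (hits - expected_hits) / sd_hits                  # normal approximation

# Two-tailed p value; for a z this large it underflows to zero in floating point.
p_two_tailed = 2 * (1 - NormalDist().cdf(z))

print(expected_hits)   # 1615.0
print(round(z, 1))     # 39.9
print(p_two_tailed)    # 0.0
```

A z value of this size says only that guessing is not a credible explanation; it cannot distinguish between telepathy and cheating.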
We can model the number of correct cards under the null hypothesis that Pearce was guessing using the binomial distribution, and then use the normal approximation (z = 39.9) to deduce that the two-tailed p value is, for all practical purposes, zero, which means that the results could not have arisen from chance alone. This is an example where the p value does seem entirely appropriate and the idea of confidence would be rather awkward: the reasons for this are discussed below.

Heart transplants

In October 2007, heart transplants were stopped at Papworth Hospital in the UK because 7 out of 20 patients had died within 30 days of their operation (Garfield, 2008): this was significantly more than the national average rate of about 10% (p = 0.0024, one tail, using the binomial distribution with a mean of 2 deaths in a 20 patient group to model the null hypothesis as shown in Figure 3; see Note 5).

Figure 3. Distribution of number of deaths in 20 patient groups assuming the null hypothesis that the mean death rate = 10%

The curve shift method obviously cannot work here. If we shift the curve five units to the right so that it is centered on the sample value of 7, this clearly cannot represent a confidence distribution because it would indicate that there is zero confidence that the population mean is 0, 1, 2, 3 or 4 deaths, whereas 4 deaths in particular does seem reasonably consistent with the sample value of 7. The binomial distribution is a different shape for different population means, so we cannot simply slide it along.
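The one-tailed p value for the heart transplant data can be reproduced directly from the binomial distribution. The sketch below mirrors the Excel formula given in Note 5; the helper name `binom_pmf` is hypothetical, and the final loop is added only to illustrate why the null curve cannot be slid along, since its spread changes with the assumed death rate.

```python
from math import comb, sqrt


def binom_pmf(k, n, rate):
    """Probability of exactly k deaths among n patients at the given death rate."""
    return comb(n, k) * rate**k * (1 - rate) ** (n - k)


n_patients, deaths, national_rate = 20, 7, 0.10

# One-tailed p value: probability of 7 or more deaths under the null hypothesis,
# i.e. one minus the probability of 0 to 6 deaths.
p_one_tail = 1 - sum(binom_pmf(k, n_patients, national_rate) for k in range(deaths))
print(round(p_one_tail, 4))  # 0.0024

# The standard deviation depends on the rate, so the curve centered on 2 deaths
# (rate 10%) is narrower than one centered on 7 deaths (rate 35%).
for rate in (0.10, 0.35):
    print(rate, sqrt(n_patients * rate * (1 - rate)))
```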
Confidence intervals and levels here clearly need to be estimated using other methods: the standard normal approximation method gives a 95% confidence interval (based on 7 out of 20, or 35%, of patients dying) extending from 14% to 56%, a numerical Bayesian approach (using the spreadsheet at http://woodm.myweb.port.ac.uk/ConfIntsPoissonBinom.xls ) gives an interval extending from 18% to 57%, and there are a number of other methods giving similar answers. These same methods can be used to estimate a confidence level for the hypothesis that the long term death rate at Papworth Hospital is more than 10%: this comes to 99.8% by the first method above, and 99.94% by the second.

Staff turnover and organizational performance

We have mentioned above the estimated confidence level of 67% for the hypothesis that this relationship has an inverted U-shape, with very high and very low levels of staff turnover both leading to suboptimal performance (Glebbeek and Bax, 2004; Wood, 2012). There is no satisfactory and easy way of using p values or confidence intervals to express this result.

Glebbeek and Bax (2004) also tried a linear (straight line) model of the relationship between staff turnover and performance (with three control variables). The regression coefficient in one model was -1778, which means that predicted performance fell by 1778 currency units per staff member for each additional 1% in staff turnover. The p value found was 0.007, which can be converted to a 95% confidence interval extending from -3060 to -495 currency units per additional 1% of staff turnover, or to a confidence level of 99.65% that the regression coefficient is negative.

Discussion: a comparison of p values, confidence intervals and confidence levels for hypotheses

Most of the above examples can be analyzed by all three approaches.
There are two examples where this is not so. For the inverted U-shape hypothesis relating staff turnover and organizational performance there is no easy and obvious way to use p values and confidence intervals, so the confidence level approach is the obvious one to use.

With the telepathy example, to use the data to derive a confidence interval or level we need to define a suitable measure to assess the extent to which telepathy is occurring. The obvious statistic is the proportion of correct guesses: under the guessing hypothesis we would expect the population value of this measure to be exactly 20%, and under the telepathy hypothesis it would be more than 20%. If we were going to use confidence intervals, 20% would be just one value on the continuum of possibilities, which would mean that the confidence level for the hypothesis that the proportion of correct guesses is exactly 20% is zero, which is not a helpful answer. This effectively rules out the idea of confidence as discussed in this paper. Furthermore, this is one instance in which p values are not trivial and do make good logical sense, because the null hypothesis is an exact one, and any departures from 20% are surprising from this point of view. This suggests the p value as the appropriate statistic here.

This example illustrates neatly the main advantage of NHSTs over their rivals: the underlying rationale is straightforward, involving the estimation of a probability under the assumptions of the null hypothesis. There are none of the extraneous, and possibly questionable, additional assumptions which are necessary to use the idea of confidence; foremost among these is the assumption, explicit in the Bayesian formulation and implicit in frequentist formulations, that all possibilities are equally likely before analyzing the evidence.
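The flat-prior assumption can be made concrete with the heart transplant figures (7 deaths in 20 patients). The sketch below is an assumption about what a numerical Bayesian calculation of this kind might look like, not a description of the author's spreadsheet: it places equal prior probability on a fine grid of candidate death rates (the 10,000-step resolution is an arbitrary choice) and reads an interval and a confidence level off the resulting posterior.

```python
deaths, patients = 7, 20
steps = 10_000                                      # grid resolution (arbitrary)
grid = [(i + 0.5) / steps for i in range(steps)]    # candidate death rates in (0, 1)

# Flat prior: every rate is equally likely beforehand, so the posterior is
# proportional to the binomial likelihood of the observed data.
likelihood = [r**deaths * (1 - r) ** (patients - deaths) for r in grid]
total = sum(likelihood)
posterior = [w / total for w in likelihood]


def quantile(q):
    """Smallest grid rate at which the cumulative posterior probability reaches q."""
    running = 0.0
    for rate, prob in zip(grid, posterior):
        running += prob
        if running >= q:
            return rate
    return grid[-1]


interval = (quantile(0.025), quantile(0.975))
conf_above_10pct = sum(p for r, p in zip(grid, posterior) if r > 0.10)
print(interval, conf_above_10pct)
```

Under these assumptions the interval should come out close to the 18% to 57% quoted for the numerical Bayesian approach, and the confidence level for a death rate above 10% close to 99.94%. Changing the prior would change both figures, which is exactly the dependence on additional assumptions that the p value avoids.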
NHSTs and p values may not be user-friendly and may just tell a small part of the story, but for thoughtful users, the rationale is simpler and involves fewer assumptions.

Let's now consider Men and women in work-home culture (Table 1). The confidence interval presentation (Table 2) has the advantage of telling readers how strong the effect is, and the likely level of uncertainty due to sampling error. The p values given in the original article do not directly tell readers how big the difference is, nor the likely impact of sampling errors on the result, and they are difficult to interpret for the reasons discussed above. The comparison with confidence levels (Table 3) is less clear cut because simply telling readers that there is a 99.99% chance that men (in this population) work longer hours than women says nothing about the size of the effect: how much longer they work. The confidence interval presentation here is arguably the most informative, with the confidence level presentation providing a simple summary in terms of the hypotheses of interest to the researchers.

Very similar arguments apply to the difference between the two training programs, the heart transplants, the linear model relating staff turnover and organizational performance, and the Oscars and life expectancy. In the last of these examples, I have shown how confidence levels for hypotheses can be made more useful by considering different hypotheses. The confidence levels for Oscar winners living at least a year longer tell a slightly different story from the confidence levels for them simply living longer. Another possibility, of course, would be to show a graph of the confidence distribution (like Figure 2).

One important way in which the examples vary is in terms of the status of the null hypothesis.
For many people telepathy is so unlikely that the alternative, null or chance hypothesis, is very much the front runner. The situation with the heart transplants is rather different in that there are fairly obvious reasons for differences between hospitals, but it still makes excellent sense to take the national average as a baseline for comparison. In both cases there are good reasons for taking the null hypothesis seriously, and so, from this perspective, NHSTs are a reasonable approach (although in the latter case the arguments against them may be stronger).

This is not true of the other examples. There is very little reason to think that staff turnover would have no impact on performance, or that there would be no difference between men and women in work-home culture, or that two training programs would be (exactly) equally effective. These null hypotheses have little interest or credibility, which means, firstly, that testing them is of marginal interest, and secondly, that the focus on the null hypothesis is likely to seem odd to readers of the research (this is possibly acknowledged by the common practice of not mentioning null hypotheses in research publications). In these three examples, it definitely makes sense to focus on confidence, because then the focus is on the hypothesis or interval of interest; there is no strange, hypothetical and distinctly uninteresting null hypothesis involved.

Conclusions

Using confidence intervals, or giving confidence levels for hypotheses of interest, has the potential to avoid many of the widely acknowledged problems of NHSTs and p values. Confidence seems a more intuitive and direct concept which avoids the need to formulate a null hypothesis in order to demonstrate how implausible it is. For example, research studying the hypothesis that Oscar winners live longer is better served by giving a confidence level for this hypothesis (99.85% for one study, 93% for the second) than a p value, because the former gives a direct estimate of the probability of the hypothesis being true whereas the latter does not. To convey information about the size of the effect, we could either use a confidence interval, or give the confidence level for Oscar winners having at least one year additional life expectancy (98.6% for the first study and 27% for the second). In a more typical social science context, instead of the conventional t and p values in tables such as Table 1, we could use confidence intervals as in Table 2, or confidence levels as in Table 3. Using confidence intervals or levels involves converting the characteristic of interest to a single quantity, typically a difference between two means, or a regression coefficient (slope).

The idea of confidence is conventionally viewed as distinct from the idea of probability, but I have argued above that this is unnecessary. Confidence distributions could be defined as probability distributions: the probabilities in question could then be estimated in a variety of different ways, including as Bayesian posteriors based on a flat prior distribution. This may offend purist statisticians, but it is worth remembering that the typical underlying assumption of a random sample from a large population may bear only a very rough relationship with reality, and that empirical estimates of both confidence levels and p values are themselves uncertain and unreliable estimates.

In practice, because of the current dominance of null hypothesis testing, the information we often have comprises p values and statistics such as t and z.
Under many circumstances (see the Appendix) it is reasonable to estimate confidence distributions, and so confidence intervals and levels for other hypotheses, by shifting the null hypothesis distribution along so that it represents a confidence distribution, e.g. for the difference of two means or proportions, for regression coefficients, or any other statistic for which the t or z distribution is the basis of the null hypothesis test. In these cases there is a very simple formula for deriving confidence levels for the hypothesis that the population value of the statistic is above or below zero (or other null hypothesis value): either p/2 or 1 - p/2. For other hypotheses and confidence intervals, it is possible to use a given p value to "reverse engineer" the confidence distributions and so derive the required statistics; a spreadsheet is available ( http://woodm.myweb.port.ac.uk/CLIP.xls ) for performing the simple calculations involved.

Despite these arguments, NHSTs do have certain advantages. P values are probabilities of certain events happening on the assumption that the null hypothesis is true; in terms of detailed rationale this is conceptually simpler than confidence based methods because these depend on an argument involving various assumptions (like the flat priors assumption) to derive confidence from probability. If the null hypothesis is a credible or interesting hypothesis, then p values do make some sense. However, for a non-technical audience (i.e. almost everybody) it would be sensible to avoid jargon like "p" or "significant" and use phrases like "the probability of getting a sample value this far from 0 is 0.3% if only chance factors are at work", or "the data is consistent with the hypothesis that there is no difference and only chance factors are at work." And, of course, readers also need to know how big the difference, or other measure of effect, is.
This may necessitate a lengthier description of conclusions, but hopefully one that is more informative and less likely to lead to misunderstanding. If we want a probability for the truth of our hypothesis, then we need a confidence level, not a p value. In practice, research results could easily be given in several formats, which may be the best way of comparing the practical value of the different approaches.

References

Bayarri, M. J., & Berger, J. O. (2004). The interplay of Bayesian and frequentist analysis. Statistical Science, 19(1), 58-80.
Beauregard, T. A. (2011). Direct and indirect links between organizational work-home culture and employee well-being. British Journal of Management, 22, 218-237.
BMJ. (2011). Research. http://resources.bmj.com/bmj/authors/types-of-article/research accessed on 6 October 2011.
Bolstad, W. M. (2007). Introduction to Bayesian statistics (second edition). Hoboken, New Jersey: Wiley.
Boos, D. B., & Stefanski, L. A. (2011). P-value precision and reproducibility. The American Statistician, 65(4), 213-221.
Cashen, L. H., & Geiger, S. W. (2004). Statistical power and the testing of null hypotheses: a review of contemporary management research and recommendations for future studies. Organizational Research Methods, 7(2), 151-167.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, December, 1994, 997-1003.
Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Psychology, Article 26.
Cortina, J. M., & Folger, R. G. (1998). When is it acceptable to accept a null hypothesis: no way Jose? Organizational Research Methods, 1(3), 334-350.
Eggins, R. A., O'Brien, A. T., Reynolds, K. J., Haslam, S. A., & Crocker, A. S. (2008). Refocusing the Focus Group: AIRing as a Basis for Effective Workplace Planning. British Journal of Management, 19(3), 277-293.
Gardner, M., & Altman, D. G. (1986, March 15). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal, 292, 746-750.
Garfield, S. (2008, April 6). Heart of the matter. Retrieved January 5, 2012, from The Guardian: http://www.guardian.co.uk/lifeandstyle/2008/apr/06/healthandwellbeing.nhs
Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52, 647-674.
Glebbeek, A. C., & Bax, E. H. (2004). Is high employee turnover really harmful? An empirical test using company records. Academy of Management Journal, 47(2), 277-286.
Grinyer, J. R., Collison, D. J., & Russell, A. (1994). The impact of financial reporting on revenue investment: theory and evidence. British Accounting Review, 26, 123-136.
Hand, D. J. (2009). Modern statistics: the myth and the magic. Journal of the Royal Statistical Society, Series A, 172, 287-306.
Hansel, C. E. M. (1966). ESP: a scientific evaluation. London: MacGibbon & Kee Ltd.
ICH. (1998). ICH Harmonized Tripartite Guideline: Statistical principles for clinical trials (E9). http://www.ich.org accessed on 6 October 2011.
Kirk, R. E. (1996). Practical significance: a concept whose time has come. Educational and Psychological Measurement, 56(5), 746-759.
Lindsay, R. M. (1995). Reconsidering the status of tests of significance: an alternative criterion of adequacy. Accounting, Organizations and Society, 20(1), 35-53.
McGoldrick, P. M., & Greenland, S. J. (1992). Competition between banks and building societies. British Journal of Management, 3, 169-172.
Mingers, J. (2006). A critique of statistical modelling in management science from a critical realist perspective: its role within multimethodology. Journal of the Operational Research Society, 57(2), 202-219.
Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy. London: Butterworths.
Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2), 241-301.
Redelmeier, D. A., & Singh, S. M. (2001). Survival in Academy Award-winning actors and actresses. Annals of Internal Medicine, 134, 955-962.
Rhine, J. B. (1997). Extra-sensory perception. Boston: Branden Publishing Company.
Sylvestre, M.-P., Huszti, E., & Hanley, J. A. (2006). Do Oscar winners live longer than less successful peers? A reanalysis of the evidence. Annals of Internal Medicine, 145, 361-363.
Wood, M. (2005). Bootstrapped confidence intervals as an approach to statistical inference. Organizational Research Methods, 8(4), 454-470.
Wood, M. (2012). Bootstrapping confidence levels for hypotheses about regression models. http://arxiv.org/abs/0912.3880 .

Appendix: A Bayesian analysis of the validity of the curve shift method

Let's suppose that we are interested in a numerical population parameter, $\theta$, and we have some sample information which gives an estimate of its value, say $\hat{\theta}$. The typical null hypothesis would be that the population value of $\theta$ is zero, but $\hat{\theta}$ is typically slightly different from zero.

Regardless of whether the range of possible values for $\theta$ is discrete or continuous, we can regard it as discrete if we remember that there is a limit to the accuracy of our measurements. Furthermore, in practice, there is likely to be a minimum possible value and a maximum.
This means that we have a finite number, n, of possible hypotheses about the value of $\theta$, which we can call $H_{min}, \ldots, H_{max}$. For example, in Figure 1, if we decide to measure to the nearest tenth, and assume that -10.0 is the minimum possible value of $\theta$, and +10.0 is the maximum, then, for example, the 101st hypothesis, $H_{0.0}$, is the null hypothesis that $\theta$ is zero, and the 102nd hypothesis is $H_{0.1}$, the hypothesis that $\theta$ is actually 0.1. Similarly, if we imagine taking further samples, there are n possible sample estimates of $\theta$; the estimate from the actual sample, $\hat{\theta}$, being 1.1.

We now make the following assumptions:

1. There is a distribution curve for the sample estimates, under the null hypothesis, which extends far enough for the curve to be shifted. If, for example, the parameter being estimated were a correlation coefficient, the shift might move parts of the curve above +1 or below -1, which, of course, cannot be correct because correlation coefficients cannot take values outside these limits. Also, there has to be the possibility of departures from the null hypothesis in both directions (otherwise the probability of departures in the possible direction will simply be 100%); this rules out the $\chi^2$ test for goodness of fit, for example.
2. This distribution is symmetrical.
3. The probability distributions for the sample estimates under each of the hypotheses, $H_i$, are the same shape and width as the distribution under the null hypothesis.
4. The prior probabilities for all hypotheses, $H_i$, are equal.

Bayes theorem now tells us that the posterior probability of hypothesis $H_i$ is

$$P(H_i \mid X = \hat{\theta}) = \frac{P(X = \hat{\theta} \mid H_i)\,P(H_i)}{\sum_j P(X = \hat{\theta} \mid H_j)\,P(H_j)}$$ (Equation 1)

where X is a random variable varying over all n possible sample estimates of $\theta$, the sum is taken over all the hypotheses, $H_j$, and $P(H_j)$ is the prior probability of the corresponding hypothesis.
Assumption 4 means that we can cancel out the (equal) prior probabilities, so this reduces to

$$P(H_i \mid X = \hat{\theta}) = \frac{P(X = \hat{\theta} \mid H_i)}{\sum_j P(X = \hat{\theta} \mid H_j)}$$ (Equation 2)

Using Assumption 3, the distribution under $H_i$ is simply the distribution under $H_0$ moved i units to the right, so we can write the numerator of the right hand side as

$$P(X = \hat{\theta} \mid H_i) = P(X = \hat{\theta} - i \mid H_0)$$ (Equation 3)

(For example, if i = 0.5 in Figure 1, the distribution will be moved 0.5 units to the right so the left hand side of Equation 3 will be the probability corresponding to 1.1 - 0.5 = 0.6 in the null hypothesis distribution.) Also, because the distribution is symmetrical (Assumption 2):

$$P(X = \hat{\theta} - i \mid H_0) = P(X = i - \hat{\theta} \mid H_0)$$ (Equation 4)

This means that Equation 2 becomes

$$P(H_i \mid X = \hat{\theta}) = \frac{P(X = i - \hat{\theta} \mid H_0)}{\sum_j P(X = j - \hat{\theta} \mid H_0)}$$ (Equation 5)

If we sum this equation over all values of i, the left hand side obviously sums to one. The numerator on the right hand side also sums to one, because it runs over the whole of the null hypothesis distribution (Assumption 1), and the denominator is this same sum, so it equals one and the equation reduces to

$$P(H_i \mid X = \hat{\theta}) = P(X = i - \hat{\theta} \mid H_0)$$ (Equation 6)

This equation tells us that the posterior probability of each hypothesis is given by a discrete version of Figure 1 shifted $\hat{\theta}$ units to the right, which is a discrete version of Figure 2. As the above argument does not depend on whether we measure to the nearest tenth, or hundredth or thousandth, we can get as close as we like to a continuous distribution, so we can view the continuous version, Figure 2, as the posterior distribution, or, using a Bayesian interpretation of confidence, as a confidence distribution.

In practice, Assumptions 1-3 are satisfied when the normal or t distributions are used to model the null hypothesis. There may, of course, be other sets of assumptions which lead to the same conclusion (e.g. Bolstad, 2007: 227), but Assumptions 1-3 have the advantage that they have a simple graphical interpretation.

Notes

1 This example is based on Table 3 of Eggins et al (2008), although I have changed the details of the experiment to give an illustration which can be appreciated by readers who have not read this article.
2 This conception of NHSTs is usually traced back to Fisher, and seems to be the standard in most social sciences, although it is sometimes combined with the "Neyman-Pearson" position and errors of Types I and II in a manner which is not entirely consistent: see Cortina and Folger (1998: 340-341).
3 The value of t in the paper from which the example is drawn (Eggins et al, 2008) is 2.40, and the difference of the means is 1.07, which means that the standard error of the difference is 0.446: this allows us to use the t distribution to calculate these statistics, and others cited below.
4 The first, third and fourth of these are estimated using the mean of results from the two CI limits given in the article.
5 The Excel formula for the p value is =1-BINOMDIST(6,20,0.1,TRUE). Figure 3 is based on the binomial distribution.
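The curve shift result in the Appendix can also be checked numerically. The sketch below evaluates Bayes theorem with equal priors on the grid described there (tenths from -10.0 to +10.0, sample estimate 1.1) and compares the posterior with the null curve shifted to the right by the sample estimate. The normal shape for the null curve, and the use of the standard error 0.446 from Note 3, are assumptions made for illustration.

```python
from statistics import NormalDist

# ASSUMPTIONS: normal-shaped null curve, standard error 0.446 (Note 3),
# values measured in tenths, sample estimate 1.1 as in Figure 1.
SE, ESTIMATE_TENTHS = 0.446, 11
null_curve = NormalDist(mu=0.0, sigma=SE)
hypotheses = range(-100, 101)      # H_i for i = -10.0 ... +10.0, in tenths


def null_weight(x_tenths):
    """Unnormalised null probability of a sample estimate x (given in tenths)."""
    return null_curve.pdf(x_tenths / 10)


# Bayes theorem with equal priors (Assumption 4); the likelihood under H_i is
# the null curve moved i units to the right (Assumption 3).
likelihood = [null_weight(ESTIMATE_TENTHS - i) for i in hypotheses]
total = sum(likelihood)
posterior = [w / total for w in likelihood]

# The null distribution shifted `estimate` units to the right.
norm = sum(null_weight(x) for x in hypotheses)
shifted = [null_weight(i - ESTIMATE_TENTHS) / norm for i in hypotheses]

max_gap = max(abs(a - b) for a, b in zip(posterior, shifted))
conf_positive = sum(p for i, p in zip(hypotheses, posterior) if i > 0)
print(max_gap, conf_positive)
```

The two distributions should agree to within floating-point rounding, because the curve extends far enough (Assumption 1) for the truncated tails to be negligible. Replacing the flat prior or using an asymmetric null curve in this sketch would break the agreement, which is a quick way to see where each assumption enters.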
