Significance Tests in Climate Science

A large fraction of papers in the climate literature includes erroneous uses of significance tests. A Bayesian analysis is presented to highlight the meaning of significance tests and why typical misuse occurs. It is concluded that a significance tes…

Authors: Maarten H. P. Ambaum

Department of Meteorology, University of Reading, UK (15 March 2010)

A large fraction of papers in the climate literature includes erroneous uses of significance tests. A Bayesian analysis is presented to highlight the meaning of significance tests and why typical misuse occurs. It is concluded that a significance test very rarely provides useful quantitative information. The significance statistic is not a quantitative measure of how confident we can be of the ‘reality’ of a given result.

1. Introduction

In the climate literature one can regularly read statements such as ‘this correlation is 95% significant’ or ‘areas of significant anomalies at the 90% significance level are shaded’ or ‘the significant values are printed in bold.’ Unfortunately this is an incorrect and misleading way of using significance tests. In this note we highlight why this is wrong. We will also indicate what a significance test does mean.

Although this note does not add new theory to significance tests, it does employ a Bayesian framework to exemplify the issues. Practitioners in climate science are generally familiar with the technical aspects of Bayesian statistics, but will perhaps be less familiar with its use in the analysis of significance tests.

We tested a recent, randomly selected issue of The Journal of Climate for at least one such misuse of significance tests in each article. The Journal of Climate was not selected because it is prone to include such errors but because it can safely be considered to be one of the top journals in climate science. In that particular issue we observed a misuse of significance tests in 14 out of 19 articles. A randomly selected issue of ten years before showed such misuse of significance tests in 7 out of 13 articles.
These two samples perhaps would not pass a traditional significance test, but they do indicate that such errors occur in the best journals with the most careful writing and editing. Indeed, in one of this author’s papers such erroneous use occurred. Comparing the papers in the two examined issues, it appears that papers with a more dynamical focus generally do not stray as much into significance testing as papers with a more geographical, diagnostic focus. The distinction between these two categories is necessarily vague. The author also wonders whether the increased ease with which such tests can be performed with data processing and plotting software has led to a near-default inclusion of such tests in papers. From experience, the author is also aware that reviewers often insist on the inclusion of significance tests.

This reported misuse of significance tests does not necessarily invalidate the results from those parts of the papers. The significance test is sometimes only a small part of the evidence presented; often it is only a subsidiary, if misleading, piece of information. Furthermore, many papers contain somewhat neutral statements such as ‘this correlation is highly significant ($p < 0.01$).’ Such a statement could be read at face value, namely that the correlation was subjected to a significance test and a p-value of less than 0.01 was found. In such a neutral reading, the statement is also somewhat meaningless, as will be shown below. Such a statement is more likely intended to mean that the correlation is in some sense ‘real’ and the p-value is a quantification of that. We will show here that this quantification of confidence is wrong. Data highlighted as significant may easily be less significant than data that were suppressed as not passing the significance threshold.
Simply put, the significance statistic is not a quantitative measure of how confident we can be of the ‘reality’ of a given result.

A typical scenario in which people use significance tests in climate science is the following: some experiment produces two time-series and they are correlated (for example, global mean temperature and the ENSO index). Is the observed correlation real or is it a fluke? We will use this correlation scenario throughout to be able to exemplify specific aspects of significance testing; however, the discussion is valid for any significance test which is based on assessing the probability of an alternative hypothesis, the null-hypothesis, which assumes that the data exhibit no relation.

So let us concentrate on the typical question of whether an observed correlation is real or a statistical fluke. The correct answer to this question is in fact very difficult to obtain. Indeed, it is usually impossible to quantify our degree of belief either way by statistics alone. Unfortunately, it is widely held that a standard significance test (for example, a t-test) provides an answer. Standard significance tests hardly ever give a useful answer to the question we are trying to answer.

It can be argued that the significance test more accurately should be named the insignificance test, as it may be a reasonable test for insignificance. Clearly if Fisher had called his test the insignificance test, then it would probably not be used very much. Marketing plays an important role in science.

There is quite a bit of literature on the misuse of statistical significance tests. It has been argued that the power of R. A. Fisher, the great proponent of significance testing, is the real reason why significance tests are so ubiquitous; see Ziliak and McCloskey (2008).
In the psychological literature the false use of significance tests has been regularly pointed out, although, perhaps, not with much success. See for example Cohen (1994), Hunter (1997), or Armstrong (2007). In the geophysical literature there has been much less attention to the misuse of significance tests. A nice review of significance testing in atmospheric science, including stern critique of the misuse of significance testing, can be found in Nicholls (2001). A thorough and detailed discussion in the context of scientific hypothesis testing can be found in Jaynes, 2003.

In the next section we will highlight the general structure of a significance test and exemplify, using frequency tables, the relationship between what the significance test provides and what we really would like to know. Section 3 provides a Bayesian analysis of significance tests. This quantifies the relationship between significance tests and hypothesis tests. It also quantifies what we do get out of a significance test. Some concluding remarks regarding the practical use of significance tests are in Section 4.

2. General structure of significance tests

First, let us examine the structure of a typical significance test in the scenario described before. A brief introduction can also be found in Jolliffe, 2004. The hypothesis we are trying to test is: ‘the two time-series are related; the correlation $r_0$ we observe is a measure of this relation.’ Note the distinction between relation and correlation here. A correlation is a statistical property of two time-series, while a relation indicates that the two time-series are dependent in some physical way.

We then define the so-called null-hypothesis which in some sense states the opposite. In our case: ‘The two time-series are not related; the observed correlation $r_0$ is a fluke.’ We then continue to test the validity of the null-hypothesis.
Here is where the first confusion comes in. We want to devise a way to assign a probability to the validity of the null-hypothesis, given the observed correlation. But what we end up doing is calculating the probability of the observed correlation when the null-hypothesis is assumed to be true. These two probabilities are different, although they are related by Bayes’ theorem. This common error is called the error of the transposed conditional. The discussion of Bayesian statistics, below, formalizes this.

Let us continue with the usual significance test. There are standard procedures for assigning a probability to the observed correlation, assuming the null-hypothesis is true: t-tests for Gaussian data, parametric or non-parametric tests (for example, the Kolmogorov–Smirnov test) for non-Gaussian data. In general, we study synthetic time-series with similar properties, perhaps similar temporal autocorrelation, etc., to the original time-series but which are unrelated by construction. We can then see what the probability is to find a correlation between such unrelated series at least as large as the observed correlation $r_0$. This probability is called the p-value. There is a distinction between the use of the absolute value of the correlation or the actual value, which then correspond to a two-sided or a one-sided test, respectively. The presented arguments work the same for either test and also wider classes of tests: significance tests always find the probability, the p-value, of an observation, assuming the truth of the null-hypothesis.

If the p-value is large then two unrelated time-series can easily produce a correlation as large as $r_0$. We must then conclude that the observed correlation provides little evidence for an actual relation between the two original time-series.
If the p-value is low (typically, values of 5% or even 1% are chosen to define what is ‘low’) then the observed correlation is unlikely to occur in unrelated time-series. What can we conclude from those two possible outcomes?

It is reasonable to conclude that, if we only have these statistics available, a high p-value is a good indicator that the observed correlation $r_0$ is not particularly special. Any pair of unrelated time-series could easily (high $p$) have a correlation as large as $r_0$. Note that this does not mean that the null-hypothesis is highly probable; it means that the correlation value is highly probable, given that the null-hypothesis is true. Beware of the error of the transposed conditional.

Further confusion occurs when the p-value is low. All it means is that it is not likely that the observed correlation would occur in two unrelated time-series. However, we cannot conclude from such an outcome that the two original time-series are likely related, that is, significantly correlated.

It can be argued that ‘significantly correlated’ is defined to correspond to a low p-value. Although this would be technically correct, it would render the statement of significant correlation quite insignificant in any practical sense. The low p-value is a property of unrelated time-series; it says nothing about related time-series. In philosophy such a situation is called a category error. Statements such as ‘the two time-series are significantly correlated at the 95% level’ (that is, $p$ is lower than 5%) commit a category error.

It is instructive to work this out using a 2 × 2 frequency table. Suppose we can repeat our experiment that produced the two original time-series as often as we like and we know beforehand that the series are related.
For example, we run an ensemble of climate models and extract the global mean temperature and the ENSO-signal for each ensemble member. We then find some correlation $r$ between the two time-series. We can then compare that correlation with the threshold correlation, say $r_p$, which corresponds to a given p-value. For example we can choose a p-value for significance of 5%. This will correspond to a particular threshold correlation $r_p$. The correlation between the related time-series of any experiment will be either larger or smaller than $r_p$.

We have not dwelled on what is meant when we know something to be true beforehand. In science, we need to use an operational definition stating that there is a wide body of historical evidence which supports the hypothesis. For example, Newton’s laws are known to be ‘true.’ This example is so well-known that we immediately can understand the subtleties of scientific truths. We know for example that Newton’s laws have a limited validity. Scientific truth always has to be qualified; it cannot be compared with logical truth. A wide-ranging discussion can be found in Jaynes, 2003.

In our example we run a hundred experiments and divide them into two categories with either high (higher than $r_p$) or low (lower than $r_p$) correlation. Because the time-series are related by construction we expect a fairly large fraction to produce a high correlation. Let us, for the sake of argument, say that 60% of our experiments show a high correlation.

We now do the same thing for a hundred synthetic time-series which are unrelated by construction. If our significance test is defined properly, then out of a hundred unrelated synthetic time-series, on average, 5 will have a high correlation and 95 will have a low correlation. The results are summarized in the table, below.
              low r    high r
related         40       60
unrelated       95        5

From the table we see that the p-value of 5% is a statement about the unrelated time-series. It says nothing about the related time-series. To get a statement about the related time-series we need to be able to repeat our experiment a sufficient number of times to produce a trustworthy probability density of the correlation values for related time-series. This is often impossible. Regularly we only have a single series, say from a climate record. We can then not infer the probability density without extra information or some physically based estimates about the sizes and properties of the signal and the noise.

Based on this example table, we can now partly answer the question that most people are interested in: is the observed correlation $r_0$ an indication of a real relation or is it a fluke? If we assume that the observed correlation is larger than the threshold correlation $r_p$ then we see from the above table that the chance of it being representative of a real relation is $60/(60+5) \approx 92\%$, where we have employed equal prior odds on the time-series being related or unrelated; this probability is different from the 95% that the significance test would have us believe.

Note that the 92% value above depends on the prior odds. If we do not know whether the time-series are related or unrelated, it does not mean these two options have equal odds; it just means that the odds are undefined, see Cox (1961). The assumption of equal odds is a strong additional assumption, although it can be thought of as the maximum entropy prior, that is, it is the assumption that is maximally non-committal given lack of any further information regarding the relation between the time-series, see Jaynes (1963, 2003). Of course, in reality such equal prior odds are unlikely, and it is usually impossible to quantify the actual prior odds.
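The dependence on the prior odds can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name and the particular prior odds tried are assumptions, and the counts are those of the example table above.

```python
def posterior_related_given_high(related_high, related_total,
                                 unrelated_high, unrelated_total,
                                 prior_odds=1.0):
    """Probability that a 'high' correlation comes from related series.

    prior_odds is p(related) / p(unrelated) before looking at the data
    (a hypothetical input; the text stresses it is rarely known).
    """
    p_high_given_related = related_high / related_total
    p_high_given_unrelated = unrelated_high / unrelated_total
    weighted_related = prior_odds * p_high_given_related
    return weighted_related / (weighted_related + p_high_given_unrelated)


# Example table: 60 of 100 related and 5 of 100 unrelated pairs are 'high'.
for odds in (0.1, 1.0, 10.0):
    p = posterior_related_given_high(60, 100, 5, 100, prior_odds=odds)
    print(f"prior odds {odds:>4}: p(related | high r) = {p:.2f}")
```

With equal prior odds this reproduces the value 60/65 ≈ 0.92 quoted above; with prior odds of 0.1 the same ‘significant’ observation leaves the probability of a real relation at about 0.55.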
The actual probability also depends on the division between the high and low probabilities for the related time-series. If the signal-to-noise ratio is low in our experiments we expect a weak distinction between related and unrelated time-series. In the limit of very low signal-to-noise ratio, the related series would also show 95% low correlations and 5% high correlations, see table below. The probability that our observed $r_0$ with $r_0 > r_p$ is indicative of an actual relation is then $5/(5+5) = 50\%$, again assuming equal prior odds for the time-series to be related or unrelated: the observed correlation does not provide evidence either way, even though it is thought to be ‘significant’ according to a significance test.

low signal/noise    low r    high r
related               95        5
unrelated             95        5

Of course, this should not come as a surprise: if the signal-to-noise ratio is very low, then any observed correlation essentially provides information about the noise and it is therefore impossible to use this observation to infer anything about the signal.

Although this last case represents an extreme example, it does demonstrate that the p-value can be very far from the actual probability of the truth of a null-hypothesis.

3. Bayesian analysis

We can formalize the situation by using Bayesian statistics. Let us define the hypothesis $H$ as ‘the time-series are related.’ We observe that the time-series have a correlation of $r_0$. We are now interested in the conditional probability, $p(H \mid r_0)$, that the hypothesis is true, given that the time-series have correlation of at least $r_0$. The significance test gives us the p-value, that is, the conditional probability $p(r_0 \mid \overline{H})$ that we observe a correlation $r_0$ given that the hypothesis is false ($\overline{H}$). So:

$$ \text{p-value} \equiv p(r_0 \mid \overline{H}). \tag{1} $$

It is important to keep this Bayesian expression for the p-value in mind.
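In practice, $p(r_0 \mid \overline{H})$ is usually estimated rather than derived, for instance from synthetic unrelated series as described in Section 2. The following Python sketch illustrates one way this might be done. It is not the author's procedure: the AR(1) surrogate model, the function names, and the parameters `phi`, `trials` and `seed` are all assumptions made for illustration.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ar1_series(n, phi, rng):
    """Synthetic AR(1) series x[t] = phi * x[t-1] + white noise, |phi| < 1.

    The first value is drawn from the stationary distribution.
    """
    x = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi * phi))]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def monte_carlo_p_value(r0, n, phi, trials=2000, two_sided=True, seed=0):
    """Estimate p(r0 | not-H): the fraction of unrelated surrogate pairs
    whose correlation is at least as large as the observed r0."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        # The two series are generated independently: unrelated by construction.
        r = correlation(ar1_series(n, phi, rng), ar1_series(n, phi, rng))
        if two_sided:
            if abs(r) >= abs(r0):
                exceed += 1
        elif r >= r0:
            exceed += 1
    return exceed / trials
```

For example, `monte_carlo_p_value(0.3, n=50, phi=0.8)` would typically return a noticeably larger value than the same call with `phi=0.0`, since autocorrelated series produce large chance correlations more readily. Either way the result is a statement about unrelated series only.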
A common mistake is to assume that $p(H \mid r_0) = 1 - p(r_0 \mid \overline{H})$. This is the mistake of the transposed conditional: it is wrongly assumed that $p(r_0 \mid \overline{H}) = p(\overline{H} \mid r_0)$. It is straightforward to do the correct algebra:

$$
\begin{aligned}
p(H \mid r_0) &= 1 - p(\overline{H} \mid r_0) && \text{(complementarity)} \\
&= 1 - \frac{p(r_0 \mid \overline{H})\, p(\overline{H})}{p(r_0)} && \text{(Bayes' theorem)} \\
&= 1 - \frac{p(r_0 \mid \overline{H})\, p(\overline{H})}{p(r_0 \mid H)\, p(H) + p(r_0 \mid \overline{H})\, p(\overline{H})} && \text{(exclusive propositions)} \\
&= 1 - p(r_0 \mid \overline{H})\, \frac{1}{p(r_0 \mid H)\, O(H) + p(r_0 \mid \overline{H})},
\end{aligned}
\tag{2}
$$

where we have introduced the (prior) odds ratio for the hypothesis $H$,

$$ O(H) = p(H) / p(\overline{H}). \tag{3} $$

Equation 2 is essentially Bayes’ theorem written out to indicate the relationship between the posterior probability $p(H \mid r_0)$ and the p-value. With this equation it is obvious that we cannot use the p-value $p(r_0 \mid \overline{H})$ to estimate the probability of the truth of the hypothesis. We also need the prior odds ratio as well as the conditional probability $p(r_0 \mid H)$. Note that, if we assume an odds ratio of $O(H) = 1$, then we recover the results we presented in the previous section.

Perhaps in hindsight, it should come as no surprise that the probability of the truth of $H$ needs to depend on the prior odds for $H$. If $H$ is overwhelmingly likely ($O(H) \to \infty$), then the observation of correlation $r_0$ does very little to change this: $p(H \mid r_0) \to 1$. If $H$ is very unlikely ($O(H) \to 0$), then the observation of correlation $r_0$ does again very little to change this: $p(H \mid r_0) \to 0$.

It is also interesting to consider again the case of low signal-to-noise ratio. In this limit the conditional probabilities $p(r_0 \mid H)$ and $p(r_0 \mid \overline{H})$ become indistinguishable. From Eq. 2 we then find

$$ p(H \mid r_0) \approx \frac{O(H)}{1 + O(H)} = p(H). \tag{4} $$

As expected, in this case the observation of $r_0$ does nothing to change the probability of $H$; the observed correlation is mainly a measure of the noise and says little about the signal. For a prior odds ratio of 1, the probability for the hypothesis to be true remains 50% after the observation.

Written out like this, it seems surprising that so many of us regularly get confused by significance tests at all. Let us analyse the following apparently innocuous statements which in some form or another seem to be the mainstay of many investigations, for example a physical measurement:

1. My measurement stands out from the noise.
2. So my measurement is not likely to be caused by noise.
3. It is therefore unlikely that what I am seeing is noise.
4. The measurement is therefore positive evidence that there is really something happening.
5. This provides evidence for my theory.

The first two statements are essentially expressions of the fact that we have a situation with a low p-value: the chance that the observation is produced by noise is low. The main error occurs in the third statement. It is the error of the transposed conditional. The probability of the data to be noise, given our measurements, is not the same as the probability of our measurements, given that the data is noise. The fourth statement would follow from the third statement if it were true. The truth of the fifth statement depends on what alternatives there are to the noise hypothesis; this is where physics comes in as well as Occam’s razor: is my theory the next most likely explanation of the observation? The presence of alternative theories also influences prior odds for hypotheses. For example, if there are many plausible alternative hypotheses, the present hypothesis will have low prior odds. A beautiful quantification of such ideas can be found in Jaynes, 2003.

A more compact form of Eq. 2 can be found by writing Bayes’ theorem in terms of prior odds ratio $O(H)$ and posterior odds ratio $O(H \mid r_0)$ with

$$ O(H \mid r_0) = p(H \mid r_0) / p(\overline{H} \mid r_0). \tag{5} $$

We find

$$ O(H \mid r_0) = O(H)\, \frac{p(r_0 \mid H)}{p(r_0 \mid \overline{H})}. \tag{6} $$

The factor which updates the prior odds to the posterior odds is called the likelihood ratio. For example, in the case of a low signal-to-noise ratio, the likelihood ratio equals 1; in this case the posterior and prior odds are the same. Note again, that to find the posterior odds, the p-value, $p(r_0 \mid \overline{H})$, is insufficient; we need the prior odds as well as the likelihood ratio.

So what do we do? Equation 6 gives some quantitative clues. It is true that from Eq. 6 it follows that a low p-value seems to indicate that the odds for $H$ typically have increased by our ‘statistically significant’ observation. By how much depends on the value of the likelihood ratio $p(r_0 \mid H) / p(r_0 \mid \overline{H})$. If the p-value is low compared to $p(r_0 \mid H)$ then the posterior odds for $H$ are larger than the prior odds. Although its value is usually hard to determine, we normally assume that $p(r_0 \mid H)$ is not small (it depends, for example, on the signal-to-noise ratio). In this sense a low p-value can provide positive evidence for the hypothesis. What it does not provide is any quantitative measure of what the posterior odds are or by what amount the odds might have improved. The 5% (or 1%) significance bound is utterly irrelevant: the improvement or deterioration of the odds for $H$ depends on how large the p-value is compared to $p(r_0 \mid H)$, a quantity that in practice is hard to determine.

4. Conclusions

So are significance tests at all useful? As indicated before, a high p-value is a useful indication that our observed correlation is not particularly noteworthy.
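Equations 2 and 6 are simple enough to evaluate directly, which may help to see why the p-value alone does not fix the posterior. The sketch below is a direct transcription into Python; the function names are hypothetical and the numerical inputs are illustrative choices, not estimates from any data set.

```python
def posterior_probability(p_value, p_r0_given_h, prior_odds):
    """Eq. 2: p(H | r0) from the p-value p(r0 | not-H), p(r0 | H) and O(H)."""
    return 1.0 - p_value / (p_r0_given_h * prior_odds + p_value)


def posterior_odds(p_value, p_r0_given_h, prior_odds):
    """Eq. 6: prior odds multiplied by the likelihood ratio."""
    return prior_odds * p_r0_given_h / p_value


def odds_to_probability(odds):
    """Convert odds O to the corresponding probability O / (1 + O)."""
    return odds / (1.0 + odds)


# (p-value, assumed p(r0 | H), assumed prior odds): illustrative numbers only.
cases = [
    (0.05, 0.60, 1.0),   # low p-value, clear signal
    (0.05, 0.05, 1.0),   # same p-value, very low signal-to-noise ratio
    (0.50, 0.90, 10.0),  # high p-value, but strong prior odds for H
]
for p_value, p_r0_given_h, odds in cases:
    prob = posterior_probability(p_value, p_r0_given_h, odds)
    print(f"p-value {p_value:.2f}, p(r0|H) {p_r0_given_h:.2f}, "
          f"O(H) {odds:4.1f} -> p(H|r0) = {prob:.2f}")
```

The first two cases share the same p-value of 0.05 yet give posterior probabilities of about 0.92 and 0.50; in the third, a p-value as high as 0.5 coexists with a posterior of about 0.95 because the prior odds and $p(r_0 \mid H)$ are large.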
Note that a high p-value (that is, high $p(r_0 \mid \overline{H})$) does not mean that $p(H \mid r_0)$ is low, see Eq. 2. It just means that the observed correlation is easily consistent with the null-hypothesis $\overline{H}$ so that $\overline{H}$ cannot be rejected. Occam’s razor then tells us that we should not hypothesize a relationship for which there is no evidence.

Conversely, a low p-value is not indicative of much at all except that the observed correlation is not very probable if the null-hypothesis were true. There is a tentative, but unquantified and possibly incorrect, indication that the posterior odds for our hypothesis may have increased, as quantified by Eq. 6. But especially in this case, which is regularly used as positive evidence for the hypothesis, any information assuming the null-hypothesis is quite irrelevant.

A so-called ‘significant correlation’ is meaningless in any practical sense; such a statement is a category error. Significance tests of a single experiment alone cannot be used to provide quantitative evidence for a physical relation.

Acknowledgments: The author thanks R. G. Harrison for insightful discussions during the preparation of this manuscript.

References

J. S. Armstrong, 2007: Significance tests harm progress in forecasting. Int. J. of Forecasting, 23, 321–327.
J. Cohen, 1994: The Earth is Round (p < 0.05). American Psychologist, 49, 997–1003.
R. T. Cox, 1961: The Algebra of Probable Inference. The Johns Hopkins Press, Baltimore, 114pp.
J. E. Hunter, 1997: Needed: A Ban on the Significance Test. Psychological Science, 8, 3–7.
E. T. Jaynes, 1968: Prior Probabilities. IEEE Trans. on Systems Science and Cybernetics, 4, 227–241.
E. T. Jaynes, 2003: Probability Theory. The Logic of Science. Cambridge University Press, Cambridge, 727pp.
I. N. Jolliffe, 2004: P stands for . . . , Weather, 59, 77–79.
N. Nicholls, 2001: The Insignificance of Significance Testing. Bull. Amer. Met. Soc., 82, 981–986.
S. T. Ziliak & D. N. McCloskey, 2008: The Cult of Statistical Significance. University of Michigan Press, 352pp.
