Hot Hands, Streaks and Coin-flips: Numerical Nonsense in the New York Times

Dan Gusfield
Computer Science Department, University of California, Davis
August 31, 2018

The existence of “Hot Hands” and “Streaks” in sports and gambling is hotly debated, but there is no uncertainty about the recent batting average of the New York Times: it is now two-for-two in mangling and misunderstanding elementary concepts in probability and statistics, and in mixing up the key points of a recent paper that re-examines earlier work on the statistics of streaks. In so doing, its high-visibility articles have added to the general public’s confusion about probability, making it seem mysterious and paradoxical when it needn’t be. However, those articles make excellent case studies on how to get it wrong, and for discussions in high-school and college classes focusing on quantitative reasoning, data analysis, probability and statistics. What I have written here is intended for that audience.

1 The Background

The starting point for this discussion is an article by George Johnson in the New York Times Sunday Review on October 18, 2015, entitled “Gamblers, Scientists and the Mysterious Hot Hand”. That article discusses the claims in a recent working paper (not yet peer reviewed) by two economists, Joshua Miller and Adam Sanjurjo, entitled “Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers” [2]. According to the Johnson article, the Miller and Sanjurjo paper claims that the authors of a classic 1985 paper [1] debunking the concept of hot hands in basketball (Thomas Gilovich, Robert Vallone and Amos Tversky) made an error in how they thought about probability.
Quoting from the Johnson article:

    A working paper published this summer has caused a stir by proposing that a classic body of research disproving the existence of the hot hand in basketball is flawed by *a subtle misperception about randomness*. (italics added)

Then, on October 27, 2015, in a follow-on NYT article (in TheUpshot) entitled “Streaks Like Daniel Murphy’s Aren’t Necessarily Random”, Binyamin Appelbaum wrote:

    Last year two economists launched a more fundamental assault: They argued that disproofs of the “hot hand” theory had made *a basic statistical error*. (italics added)

It’s a challenge to keep the players straight in this story, so to recap: the issue of hot hands and the probability of streaks was first discussed in two academic papers and then in two subsequent NYT articles. The first paper, by Gilovich et al. in 1985 (which we will refer to as “GVT”), claims that the belief in hot hands (in basketball) is not statistically supported. The second, by Miller and Sanjurjo in the summer of 2015 (which we will refer to as “MS”), re-examines that work, suggests that statistical errors were made, and comes to a different conclusion. The third, by George Johnson in October 2015 in the NYT, discusses the claims in MS. The fourth, by Appelbaum ten days later in the NYT, repeats, and even strengthens, some of the statements from the Johnson article.

I am not interested in questions of hot hands, streaks and gambling per se. Instead, my interest, and focus here, is how the New York Times articles discuss probability and statistics, and the confused and incorrect statements made in those articles. However, in order to explain the NYT errors, we will have to discuss streaks, hot hands, and the two academic papers to some extent.
2 The Central Technical Issues

We want to identify the claimed “subtle misperception about randomness” and “basic statistical error” in GVT that the two articles in the NYT are talking about. To do that, we have to say a bit about the statistical approach to the study of streaks and hot hands.

When trying to determine if streaks (successive baskets made, heads on coin flips, wins in gambling, for example) have non-random causes, such as skill or “being in the zone”, the statistical approach is to compare numerical features in observed data to features in data generated at random. For example, suppose that a player makes a basket (a hit, coded as ‘H’) on 50% of the shots he takes, and that we have the entire record of the player’s hits and misses. We could look at that data and ask what percentage of the Hs are followed by another H. The treatment of the last position has little effect in long sequences, but in a short sequence, we will compute the percentage by counting the number of Hs in all but the last position, and the number of those Hs that are followed by another H (possibly in the last position). For clarity, we give that percentage the name HH-percentage, although that term was not used in the NYT articles, or in the academic papers. We could also determine the HH-percentage from data on Hs followed by a miss, coded as a ‘T’. See Table 1.

The HH-percentage might not be the ideal way to study questions of streaks and hot hands, although a player with a few long streaks (who probably would be considered to have a hot hand) has a larger HH-percentage than a player with more, but shorter, streaks. Still, the HH-percentage (in different terminology) is one of the first statistics examined in the GVT paper, where they computed the HH-percentage for several individual NBA players in an individual season.
And the HH-percentage is the only statistic that is discussed in the NYT articles, so it is the focus of this note.

But how specifically would we use the HH-percentage to determine if the player’s Hs are unusually “streaky”, i.e., more concentrated into streaks than what we would expect by chance alone? GVT says “The player’s performance, then, can be compared to a sequence of hits and misses generated by tossing a coin.” Specifically, we could generate a long random sequence, where each character in the sequence is independently chosen to be an H or a T with equal probability, and then compute the HH-percentage from that long sequence. We call that HH-percentage a “reference number”, and remember that it is obtained from a sequence that does not have any non-random influence. Then, we would compare the HH-percentage obtained from the record of a chosen player to the reference number. Intuitively, when a player’s actual HH-percentage is computed from a long sequence (i.e., a large amount of data) it seems appropriate to compare it to this reference number. [footnote 1]

If the reference number is very close to, or larger than, the HH-percentage in the player’s record, then the player’s HH-percentage does not support the conclusion that the player’s streaks are due to some non-random influence. That means, from the perspective of the HH-percentage, the player’s baskets do not appear to be more streaky than do the Hs in a random sequence. Conversely, if the player’s HH-percentage is “significantly” larger than the reference value, we do feel justified in thinking that some non-random influence is at work. How much larger a player’s HH-percentage must be in order to be “significant”, to support the assertion of non-randomness, is exactly the kind of issue that is studied in statistics and probability theory, and is not our main concern here.
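The HH-percentage and the “reference number” described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `hh_percentage` and `reference_number`, the sequence length, and the seed are hypothetical choices made here, not anything taken from GVT, MS, or the NYT articles.

```python
import random

def hh_percentage(seq):
    """Percentage of Hs in all but the last position that are followed by an H.

    Returns None when no H occurs before the last position, since the
    percentage is then undefined.
    """
    # For every H that has a successor, record what follows it.
    followers = [nxt for cur, nxt in zip(seq, seq[1:]) if cur == "H"]
    if not followers:
        return None
    return 100.0 * followers.count("H") / len(followers)

def reference_number(length=1_000_000, seed=0):
    """HH-percentage of one long sequence of independent fair coin flips.

    The length and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    flips = [rng.choice("HT") for _ in range(length)]
    return hh_percentage(flips)

if __name__ == "__main__":
    print(hh_percentage("HHTT"))   # 50.0, as in the text's later example
    print(reference_number())      # close to 50 for a long random sequence
```

A player’s record, written as a string of Hs and Ts, could be passed to `hh_percentage` and the result compared with `reference_number()`; deciding how large a gap counts as “significant” is the separate statistical question mentioned above.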
Footnote 1: However, if a player’s HH-percentage is computed only from a “relatively short” sequence (say a single game or even a single season), then the reference number defined above might not be the most informative one to use. Foreshadowing what will come later in this paper, this will be a key issue.

Comparing a player’s record to a randomly generated sequence is the basic statistical approach, but do we actually need to generate a random sequence in order to determine the reference value? No. We might need to generate random sequences to determine more complex statistics in random sequences, but in the case of the HH-percentage, we don’t need to generate any sequences because we know that the probability of an H following an H is exactly the probability of an H on any individual flip, i.e., one-half. So, the observed HH-percentage in a long randomly-generated sequence will be about 50%, about equal to the frequency that an H is followed by a T, or a T is followed by an H. That point should not be controversial or confusing.

But the NYT article did confuse it

Contrary to the point above, Johnson in the October 18 NYT article states:

    For a 50 percent shooter, for example, the odds of making a basket are supposed to be no better after a hit – still 50-50. But *in a purely random situation, according to the new analysis, a hit would be expected to be followed by another hit less than half the time.* (italics added)

To be clear, the NYT article is talking about a “purely random situation” of (memoryless) shots by a 50% shooter, or equivalently, a sequence of fair coin flips. It is not talking about some basketball-related phenomenon (for example, a player being more tired or more closely guarded after making several shots). And, for even greater clarity, I interpret the statement “... in a purely random situation ...
a hit would be expected to be followed by another hit less than half the time” as the same as “... in a purely random situation ... the odds of making a basket after a hit are less than 50-50. Equivalently, in a purely random situation ... the probability that a hit will be followed by another hit is less than one-half.” [footnote 2]

Footnote 2: If you think this interpretation is wrong, then you will probably find the rest of this paper wrong, and can stop reading now.

3 Really!?!

Can that statement about hits (and coin flips) in Johnson’s article be correct, that “in a purely random situation ... a hit is expected to be followed by another hit less than half the time”? Surely, there is something wrong here, because in a purely random situation every flip will be an H with the same probability that it is a T: exactly one-half. So, a hit is expected to be followed by another hit (H) one-half of the time, which is as often as it is expected to be followed by a miss (T). Several of the on-line comments submitted to the NYT by readers after the publication of Johnson’s article correctly pointed this out, and even identified the source of Johnson’s confusion, which we will discuss in detail below.

But, despite the readers’ comments, ten days later, Appelbaum in the NYT article (in TheUpshot) doubled down on Johnson’s statements, making even more explicit statements:

    Flip a coin, and there’s an equal chance it will land heads or tails. Researchers had treated that 50 percent chance as the definition of a random outcome. But Joshua Miller of Bocconi University and Adam Sanjurjo of the Universidad de Alicante pointed out something surprising: *In the average series of four coin flips, the sequence heads-heads is significantly less common than heads-tails.* (italics added)

Really?
In the table of coin flips (similar to Table 1 below) that Appelbaum directs the reader to examine, heads-heads occurs exactly the same number of times that heads-tails occurs. So is Appelbaum’s statement pure nonsense, or is it based on some truth, but one that is very poorly stated? He continues:

    On average, just 40.5 percent of the heads are followed by another heads. Yes, this sounds crazy. But it happens to be true. [footnote 3]

And this assertion has consequences for the study of streaks. Referring to MS, Appelbaum writes:

    The implication, they argued, is that past studies had set the bar too high. Streaks that had looked like random luck were actually statistically unlikely. The “hot hands fallacy”, they wrote, was remarkably persistent because it was true.

Footnote 3: One might argue that Appelbaum has a tiny, tiny bit of wiggle room, because he does not define what “the average series of four coin flips” means, or what “on average” means in the second quote. But he directs the reader to the Johnson article with the table showing that exactly 50% of the heads, in the first three positions, are followed by another head. So, his statement is particularly confused and incorrect.

4 So What is Going on?

Both NYT articles imply that the “basic statistical error” made in GVT is to assume that the probability that an H will follow an H is one-half, in sequences of heads and tails created by flips of a fair coin. In the case of the sixteen length-four sequences, the implied “error” in GVT is the assumption that 50% of the heads that are followed by another flip are followed by another heads. But they are. So what is going on here?

Spoiler alert: In trying to interpret MS, the Johnson article made incorrect and imprecise statements about probability and statistics. The Appelbaum article repeated, more strongly, the main one. Both articles miss the key points made in MS.
In truth, in purely fair coin flips, each H (other than the last one in the sequence) will be followed by another H with probability one-half. Period. Miller and Sanjurjo also make that clear. So how did Johnson and Appelbaum get it so wrong?

4.1 The Johnson Table

Following a similar example and table in the MS paper (but not a similar conclusion), here is what Johnson did in his article. He looked at the sixteen length-four sequences shown in Table 1. For each sequence that contains an H in one of the first three positions (there are fourteen of these) he calculated the percentage of those Hs that are followed by another H. For example, in the sequence HHTT, the percentage is 50%, and in HHHH it is 100%, and in HTTT it is 0%. Hence, Johnson calculated the individual HH-percentage for each of the relevant fourteen sequences. Then he added those fourteen HH-percentages, divided by fourteen, and got about 40.5%. That is, he averaged the HH-percentages calculated from the fourteen relevant sequences. As he writes:

    ... calculate for each sequence the odds that a head is followed by a head and average the results. The answer is not 50-50, as most people would expect, but 40.5 percent – in favor of tails.

All true. The arithmetic is right, and the 40.5% average may indeed seem surprising to some people. But so what? What does that average have to do with the probability that an H is followed by another H? Nothing! It is nonsense to conclude from that averaging that “a hit is expected to be followed by a hit less than 50% of the time”, or that “On average, just 40.5 percent of the heads are followed by another heads.”

4.2 Counting Bathrooms

In order to explain what Johnson and Appelbaum got wrong, we look here at a more extreme scenario. Suppose we want to calculate the average number of bathrooms in the houses in the U.S.
The right way to calculate this is to find the number of bathrooms in each of the (millions of) U.S. houses, sum up those numbers and divide by the number of houses in the U.S.

But here is another suggestion: After finding the number of bathrooms in each of the houses, divide the houses into two groups: those that have more than 30 bathrooms, and those that have 30 or fewer bathrooms. (San Simeon, the former country house of the Hearst family, has 61 bathrooms, and the White House has 35.) Next, compute the average number of bathrooms in the first group of houses (perhaps that average is 32.5 bathrooms), and compute the average for the second group of houses (around 2.7 in a recent survey). Finally, average those two averages, to get about 17.6 bathrooms (just slightly more than I have in my house). And even though I made up the average of 32.5 for the first group, the correct average in the first group will be at least 30 (why?), and the average in the second group is actually close to 2.7, so the true average of those averages will be larger than 16.

Probably (but what does that really mean?), the average of 16 bathrooms per U.S. house does not mesh with your sense of reality. So what went wrong? By averaging the two averages, we give equal weight to each of the averages, ignoring the fact that the first average comes from a very small number of houses, while the second average comes from a huge number of houses. That kind of average is called an unweighted average. But, to get the correct average number of bathrooms, you must give equal weight to each house, not to each group of houses.
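The gap between giving equal weight to each group and giving equal weight to each house can be seen with a short calculation. The sketch below is only illustrative: the two group averages (32.5 and 2.7) come from the text, but the house counts are made-up placeholders, since no counts are given here.

```python
# Group averages are the ones used in the text; the house counts are
# hypothetical, chosen only to show one tiny group and one huge group.
groups = [
    {"label": "more than 30 bathrooms", "houses": 100, "avg_bathrooms": 32.5},
    {"label": "30 or fewer bathrooms", "houses": 100_000_000, "avg_bathrooms": 2.7},
]

# Equal weight to each GROUP: just average the two group averages.
unweighted = sum(g["avg_bathrooms"] for g in groups) / len(groups)

# Equal weight to each HOUSE: recover total bathrooms, divide by total houses.
total_bathrooms = sum(g["houses"] * g["avg_bathrooms"] for g in groups)
total_houses = sum(g["houses"] for g in groups)
per_house = total_bathrooms / total_houses

print(f"average of the two group averages: {unweighted:.1f}")
print(f"average per house:                 {per_house:.4f}")
```

With any realistic counts, where the first group is a vanishing fraction of all houses, the per-house figure stays very close to the second group’s 2.7, while the average of the two averages stays above 16.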
Now if for some reason you don’t have data on the number of bathrooms in each individual house, but are given the two averages in the two groups, and are also given the number of houses in the two groups, you could multiply the first average by the number of houses in that group, multiply the second average by the number of houses in that group, and add the two products to get the total number of bathrooms in the U.S. Then, to get the correct average number of bathrooms, you would divide that total by the sum of the number of houses in the two groups, i.e., the total number of houses. This is called a weighted average of the averages, and would give a result of about 2.7. Note that computing the weighted average is just a backwards way of doing what we would do to compute the average number of bathrooms in a U.S. house, if we had the raw data on each house: find the total number of bathrooms and divide by the number of houses.

Back to HH-percentages

How does the bathroom story relate to HH-percentages? There are 24 Hs that occur in the first three positions of the 16 sequences of length four. These 24 Hs are analogous to the houses in the bathroom story. If you want to compute the percentage of those 24 Hs that are followed by an H, or equivalently, how often “a hit is expected to be followed by a hit”, you should not divide those 24 Hs into groups (in this case, 14 groups, each called a “sequence”), find the HH-percentage in each group (sequence), and then average those percentages. To do so gives equal weight to each group (sequence), ignoring the fact that some groups (sequences) have more Hs than others do. That is, you should not compute an unweighted average of the HH-percentages.
Instead, to calculate the probability that an H follows an H, you need to give equal weight to each H that occurs in the first three positions of some sequence, or if you start from the HH-percentages of the fourteen sequences, you need to compute a weighted average of those HH-percentages: each HH-percentage is weighted by (multiplied by) the number of Hs in the first three positions of the sequence that the HH-percentage comes from, and the sum of those products is divided by 24, the total number of such Hs.

Another numerical reflection of the difference between unweighted and weighted average HH-percentages is the fact that in the sixteen length-four sequences, there are only eight that have any occurrence of HH, but there are eleven that have an occurrence of HT. That is, the distribution of HH and HT is not uniform in the fourteen sequences. [footnote 4] Similarly, there are three sequences that have HHH, but four that have HHT. So, in random sequences, if your unit of analysis is the whole sequence, you will observe a T following an H more often (in more sequences) than an H following an H. You will also observe an H following a T in more sequences than an H following an H. So, by equally weighting the sequences, we under-represent the HHs and over-represent the HTs.

The Take-Home Lesson: The unweighted average of the averages calculated from non-overlapping subsets of a set is not always equal to the average in the entire set. That is just a numerical fact, and is elementary textbook material in any basic statistics book or course. The numerical example in Johnson’s article does nothing more than illustrate that fact in the case of all possible length-four sequences of fair coin flips. It does not establish that “In a purely random situation ...
a hit would be expected to be followed by another hit less than half the time.”

4.3 The MS Table and Unweighted Averages

While Johnson and Appelbaum completely miss the issue of weighted versus unweighted averaging, Miller and Sanjurjo understand it perfectly well, as did many of the NYT readers who commented on the Johnson and Appelbaum articles. MS contains a table that is similar to the one in Johnson’s article (and to Table 1 below), and it obtains the same average, but MS does not state the conclusion that Johnson and Appelbaum do. In fact, in a blog discussion this summer, Miller states:

    We do not assert that: “a way to determine the probability of a heads following a heads in a fixed sequence, you may calculate the proportion of times a head is followed by a head for each possible sequence and then compute the average proportion, giving each sequence an equal weighting” ... it is a mistaken intuition to treat this computation as an unbiased estimator of the true probability.

Footnote 4: At first, this may seem paradoxical since the two counts might be expected to be equal by “symmetry”. But, the two occurrences are not symmetric, which I leave you to ponder.

MS begins by stating that if one million fair coins are each flipped four times, and an HH-percentage [footnote 5] is obtained for each coin, those million HH-percentages would average to “approximately 0.4”. In explaining this, they state:

    The key ... is that it is not the flip that is treated as the unit of analysis, but rather the *sequence* of flips from each coin ... (italics added) Therefore, in treating the sequence as the unit of analysis, the average empirical probability across coins amounts to an unweighted average [footnote 6] ...

The unweighted average of averages (about 0.405) is not equal to the probability (exactly 0.5) of an H following an H in four fair coin flips.
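Both numbers can be checked exactly by enumerating every sequence. The sketch below is one way to do it, using exact fractions; the function names (`hh_counts`, `averages`, `sequences_containing`) are hypothetical names chosen for this illustration and do not come from MS or the NYT articles.

```python
from fractions import Fraction
from itertools import product

def hh_counts(seq):
    """Return (Hs before the last position, how many of those are followed by H)."""
    followers = [b for a, b in zip(seq, seq[1:]) if a == "H"]
    return len(followers), followers.count("H")

def averages(n):
    """Exact unweighted and weighted HH-proportions over all length-n sequences.

    Sequences with no H before the last position are skipped, as in the text.
    Returns (number of relevant sequences, unweighted average, weighted average).
    """
    per_sequence = []
    total_h = total_hh = 0
    for seq in product("HT", repeat=n):
        h, hh = hh_counts(seq)
        if h == 0:
            continue
        per_sequence.append(Fraction(hh, h))  # this sequence's HH-proportion
        total_h += h                          # pool the Hs across sequences
        total_hh += hh
    unweighted = sum(per_sequence) / len(per_sequence)
    weighted = Fraction(total_hh, total_h)
    return len(per_sequence), unweighted, weighted

def sequences_containing(pattern, n):
    """Number of length-n sequences in which the pattern occurs at least once."""
    return sum(1 for seq in product("HT", repeat=n) if pattern in "".join(seq))

if __name__ == "__main__":
    count, unweighted, weighted = averages(4)
    print(count, unweighted, float(unweighted), weighted)
    print(sequences_containing("HH", 4), sequences_containing("HT", 4))
```

For length four this gives 14 relevant sequences, an unweighted average of 17/42 (about 0.405), and a weighted average of exactly 1/2. It also reproduces the counts of eight sequences containing HH and eleven containing HT. Calling `averages(6)` gives 62 relevant sequences and an unweighted average of 129/310 (about 0.416), still with a weighted average of 1/2.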
The NYT articles printed nonsense, because what they wrote suggests that the unweighted average and the probability are the same. [footnote 7]

But why?

The table in the Johnson article, which the NYT articles misunderstand, originates in the MS paper. But why? One of the reasons MS examines unweighted averages is explained next.

Footnote 5: The actual terms they use are “relative frequency” and “empirical probability”.

Footnote 6: For clarity, note that in this quote, it is implied that “sequence of flips” is “sequence of four flips”.

Footnote 7: The Johnson article is actually more confused and confusing, because, as explained above, it suggests that the probability that an H follows an H is less than half, and yet it also points out that in the sixteen sequences of length four, the number of Hs that are followed by another H is exactly the same as the number of Hs that are followed by a T. It tries to explain this apparent contradiction by introducing the concept of a “selection bias”. This is actually more nonsense; we will return to this later.

5 Modeling the Gambler’s Fallacy

The Miller and Sanjurjo paper is concerned with several streak phenomena in addition to hot hands in sports. The main one is called the “Gambler’s Fallacy”, which is the belief that a streak (winning or losing) in a game of pure (or mostly pure) chance will soon be reversed, in order to achieve the long-run expected win/loss frequency. [footnote 8] This fallacy is most clearly defined in terms of a sequence of fair coin flips, where the gambler’s fallacy is: If one observes a growing streak of Hs, the probability that the next flip will be T increases after each successive H. That is, the longer the streak of Hs, the higher is the probability that the next flip will be a T. Restricted to just two consecutive flips, the gambler’s fallacy is that the probability that an H will be followed by another H is lower than the probability that it will be followed by a T.
Thus, the gambler’s fallacy is similar to the belief in a “hot hand”, except that with a hot hand an H is believed to be more likely, rather than less likely, after an H. Both GVT and MS assert that the gambler’s fallacy is a commonly held belief. Of course, this belief is a fallacy, since the probability that the next flip will be an H is precisely one-half (with a fair coin), no matter what the past history is.

MS uses unweighted averages of HH-percentages because Miller and Sanjurjo assert that people’s beliefs about streaks in gambling are based on gamblers’ observations of many short, but whole, sequences of events, or complete games. These are the “units of analysis” that best model how people incorrectly come to believe in the gambler’s fallacy. In the analogy of coin flips, a finite sequence of flips (say, of length four) is the unit of analysis, and multiple sequences are observed. Miller and Sanjurjo assert that people use “natural” statistics, which equally weight what they observe in each sequence or game. Hence, their beliefs are essentially based on an unweighted averaging of the sequences and games they observe. And since unweighted averages of HH-percentages underestimate the true probability that an H will follow an H, [footnote 9] this intuitive (but incorrect) thinking leads to a belief in the gambler’s fallacy. Miller and Sanjurjo write:

Footnote 8: In the “long run”, the frequency of wins should be about equal to the frequency of losses. That is a consequence of the “law of large numbers”. The belief that we should also see this balance in small sequences has been facetiously called the “law of small numbers”.

Footnote 9: The unweighted average is 40.5% for sequences of length four. For longer sequences, the unweighted average HH-percentage remains less than 50%, although it approaches 50% as the sequence length increases.
For example, in length-six sequences, the unweighted average HH-percentage is 41.6%, averaged over the 62 sequences that have an H in one of the first five positions.

    The implications for learning are stark: to the extent that decision makers update their beliefs regarding sequential dependence [footnote 10] with the (unweighted) empirical probabilities that they observe in finite length sequences, they can never unlearn a belief in the gambler’s fallacy. ... no amount of exposure to these sequences can make a belief in the gambler’s fallacy go away.

And:

    ... in treating the sequence as the unit of analysis, the average empirical probability across coins amounts to an unweighted average ... and thus leads the data to appear consistent with the gambler’s fallacy. [footnote 11]

6 But What About Basketball?

We have seen why MS is concerned with unweighted averages of HH-percentages in their treatment of the Gambler’s Fallacy. But what about streaks in basketball? MS is concerned with unweighted averages there also, but the explanation for this is more subtle than for the Gambler’s Fallacy. To get to that explanation, we first have to discuss another way that Miller and Sanjurjo explain their main statistical observation.

6.1 Alice and Bob

In trying to explain the main technical issue in their paper, Miller (in an online post) describes a competition between two players I will call Alice and Bob (I am modifying the description of the game, but not altering its mathematical features). The scenario is as follows: A computer has been programmed [footnote 12] to simulate a fair coin flip. It first generates a random sequence of four fair coin flips (printing out the sequence for later verification, and Bob can’t see the output now); then, the computer randomly picks a position of one of the Hs in the sequence, provided that it is in one of the first three positions.
If there is no such position, the computer starts again. If there is such a position, Bob is invited to bet whether the following position in the sequence is an H or a T. Note that the value (H or T) has already been generated and written down. If Bob’s bet is correct, Alice pays him $1, and if it is incorrect, Bob pays Alice $1. Notice that Alice has no active role except to pay out or collect the winnings.

Footnote 10: The term “sequential dependence” refers to the way that one event relates to a prior one. In the case of two flips, it refers to whether an H or a T follows an H.

Footnote 11: The phrase ‘across coins’ should be interpreted as ‘across sequences’ in the treatment here, because in MS it is assumed that each sequence is generated by a fair, but different, coin.

Footnote 12: Both Alice and Bob have previously verified that the program is correct.

What should Bob pick, H or T? One is tempted to answer “pick either one, because on any flip the probability of an H is the same as the probability of a T.” But that answer ignores the full context of the competition. The answer is that Bob should pick T, not H.

Randomly generating a sequence of length four is equivalent to randomly picking one of the sixteen sequences shown in Table 1, because the probability of generating any specific sequence is the same as the probability of generating any other sequence (i.e., (1/2)^4). So, instead of imagining the computer generating a random sequence of length four, imagine that the computer randomly (with equal probability) picks one of the fourteen relevant sequences, and then randomly picks an H in one of the first three positions of that sequence, at which point Bob bets either that the following (already determined) flip is an H or is a T.
If we repeated this scenario many times, the frequency that the following character is an H would be a good estimate of the unweighted average of the HH-percentages, over the fourteen relevant sequences. Since the unweighted average of the HH-percentages is 40.5%, the probability that Bob will win if he picks H is only 0.405. That is why Bob should pick T.

The key point is that this scenario has two stages: the computer first picks a sequence with equal probability; and second it randomly picks an H in the first three positions of that sequence (if there is one). But that is very different from a one-stage scenario where the computer randomly picks, with equal probability, one of the 24 Hs that occur in the first three positions of the fourteen relevant sequences. In this second scenario, the sequences would not be picked with equal probability, because the distribution of Hs is not uniform. In the second scenario, the probability that Bob wins if he picks H is exactly 0.5, not 0.405. The first scenario corresponds to an unweighted averaging of the HH-percentage observed in each sequence, and the second corresponds to a weighted average of the HH-percentages. Further, the second scenario roughly reflects how GVT obtained the reference number it used to compare a player’s HH-percentage, while the first one roughly reflects how MS does. [footnote 13]

Footnote 13: In this paper, we have only discussed HH-percentages because it is the only statistic discussed in the NYT articles, and it is sufficient to illustrate the key difference in the approaches of GVT and MS. But actually, the main statistic discussed in both GVT and MS is a bit more involved. Define the “TH-percentage” in a sequence as the percentage of Ts that are followed by an H. Then define statistic D for a sequence as its HH-percentage minus its TH-percentage.
D relates the relative frequency that an H follows an H to the rela tive frequency that it follo ws a T in a sequence, and it ma y b e a more mea ningful statistic to us e to answer questions ab out “hot ha nds”. Now, considering again all the sequences of length four, we see that ther e are exactly the same n umber of HH pa irs as TH pairs. How ever, the un weigh ted average of the s ixteen D v alues is not zero, but something less than zero . This is analo gous to the fact that the unw eig hted a verage HH-p ercentage is less than 50%, the p ercentage of all Hs in the 12 6.2 Is this r elev an t for bask etball? The answ er dep ends on what sp eciﬁc question y ou are asking. F or example, w e could ask: Did a sp eciﬁc pla yer exhibit “streak shoo ting” in a sp eciﬁc g ame? or ask: Is a sp eciﬁc pla y er a “streak sho oter” g enerally , considered ov er a season or their en tire career? 6.2.1 Analysis for a single game F or the ﬁrst question, let’s supp ose tha t a play er, who has a long-term 5 0 % hit rate, sho ots four times in a game. W e w a nt to kno w if the pla yer exhibited a hot hand, and so w e compute his HH-p ercen tage for that game. W e then compare tha t n um b er to a reference n um b er deriv ed fr om random sequenc es g enerated without an explicit hot hand. W e could compare his nu m b er to the HH-p ercen tag e generated from a long random sequence , in whic h case the r eference n um b er should b e 50%. This essen tially (but not exactly) reﬂects the approach in GVT. But another approac h, whic h reﬂects a diﬀeren t w ay to mo del a play er without a hot hand, is to consider all the rando m sequence s of length four. If w e use that mo del, then the n umber to compare with is 40.5%. The reasoning, detailed nex t, is s imilar to the reason that Bob should b et on T rather than H. 
W e mo del the play er w i thout a hot hand simply as a fair coin, i.e., eac h shot is a hit (H) with probabilit y of one-half , indep enden t of an y other shot; so in a game, the pla ye r (with no ho t hand) g enerates a random sequence o f Hs and Ts. As discussed earlier, w e can also think of the generation of a ra ndom sequence o f four ﬂips as a random sele ction (with equal probabilit y) of one of the sixteen four-ﬂip sequences . Th us, instead of thinking of the pla yer (who do es not ha ve a hot hand) gener ating a random sequenc e of length four, w e mo del the play er’s record as a selection of one of the sixteen sequences, c hosen at random. 14 So, w e compare the play er’s actual HH-p ercen tage in the game to the HH-p ercen ta ges from the fo urteen r elev ant r a ndom sequence s. But whic h sp eciﬁc n umber obtained from those HH-p ercen t a ges should w e use? The statistical approach is to consider what we w ould see o v er time, if we randomly selected man y sequences of length four from the relev ant fourteen. Eac h random sequence is selected with the same probability , 1/14, and so if w e select man y random sequences of length four and tak e the av erage of the HH-p ercen tages w e observ e, what we will get is the s um of the ﬁrst three p ositions that ar e followed by another H. So, the issues that arise in us ing D v alues are well illustrated by cons idering only the HH-p er c ent ages. 14 But, since we are only interested in the hot-hand question, the only relev ant sequenc e s considere d in MS are the fo urteen s e quences that have an H in one of the ﬁrst three p ositio ns . I would hav e chosen to include the other t wo sequences as well, on the grounds that a pla yer who ma kes none of his shots, or only his las t shot, should cer tainly not b e s aid to hav e a ”hot hand”. 13 fourteen HH-p ercen tages, divided by 14, i.e., the unweighte d aver age of the f o urteen HH- p ercen tag es. 
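The two reference numbers can be reproduced by brute-force enumeration. The sketch below is an illustration under the fair-coin model used in the text; the helper names (`hh_counts`, `relevant_sequences`, `unweighted_average`, `weighted_average`) are made up here and do not come from GVT or MS.

```python
from fractions import Fraction
from itertools import product


def hh_counts(seq):
    """Return (Hs before the last position, how many of those are followed by an H)."""
    heads = sum(1 for i in range(len(seq) - 1) if seq[i] == "H")
    followed = sum(
        1 for i in range(len(seq) - 1) if seq[i] == "H" and seq[i + 1] == "H"
    )
    return heads, followed


def relevant_sequences(length=4):
    """All H/T sequences with at least one H before the last position."""
    all_seqs = ("".join(p) for p in product("HT", repeat=length))
    return [s for s in all_seqs if hh_counts(s)[0] > 0]


def unweighted_average(length=4):
    """Average of the per-sequence HH-percentages (unit of analysis: a sequence)."""
    seqs = relevant_sequences(length)
    ratios = [Fraction(f, h) for h, f in map(hh_counts, seqs)]
    return sum(ratios) / len(seqs)


def weighted_average(length=4):
    """Pooled fraction of Hs followed by an H (unit of analysis: an individual H)."""
    counts = [hh_counts(s) for s in relevant_sequences(length)]
    return Fraction(sum(f for _, f in counts), sum(h for h, _ in counts))


if __name__ == "__main__":
    print(len(relevant_sequences()))    # number of relevant sequences
    print(float(unweighted_average()))  # per-sequence (unweighted) average
    print(float(weighted_average()))    # pooled (weighted) average
```

For length four this gives 17/42 (about 40.5%) for the unweighted average and exactly 1/2 for the weighted one, matching the 24 Hs and 12 HH pairs totalled in Table 1.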
So, in this model of a player without a hot hand, we should compare the real player’s HH-percentage to 40.5%. This means that when the unit of analysis is an individual sequence, rather than an individual flip, to determine if a player with a 50% hit rate exhibits streak shooting in a single sequence (a game, say), we should not compare the observed HH-percentage to 50%, but rather to a number less than 50%.

Four is just for illustration

Now, in most games, a player shoots more than four times, and in fact, shoots a different number of times in each game. So the four-shot example is just an illustration, a simple idealized scenario used to explain the point made in MS: when the unit of analysis is a player’s record in a single game, or perhaps even a single season, the value we compare to should be lower than the long-term hit rate of that player. So, for example, if we observe that a player with a well-established hit rate of 50% has an HH-percentage of 50% in a game or season, that can be taken as evidence that the player has exhibited a hot hand in that sequence, rather than evidence against it. How strong that evidence is in favor of a hot hand requires additional probability theory, and is affected by the length of the sequence. Length-four sequences demonstrate the effect dramatically, but over a season or career, the sequence might be long enough that the effect is small. For a fair coin, the average HH-percentage, averaged over all relevant sequences of length k, approaches one-half as k increases, although for every k of three or more it remains below one-half.

6.2.2 Season or career-long analysis

For the second question, if a career is long enough and an individual player takes many shots, it seems appropriate to compute an HH-percentage over all of the Hs, equally weighting each basket, meaning that the unit of analysis is an individual H.
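How long a record must be before the length-four effect fades can be explored numerically. The following sketch, an illustration rather than anything taken from MS, computes the unweighted average HH-percentage exactly for short fair-coin sequences; the function name and the range of lengths are arbitrary choices, and exhaustive enumeration is only practical for small k.

```python
from fractions import Fraction
from itertools import product


def unweighted_hh_average(k):
    """Exact unweighted average HH-percentage over relevant sequences of length k."""
    total = Fraction(0)
    relevant = 0
    for seq in product("HT", repeat=k):
        # Indices of Hs that have a following flip.
        heads = [i for i in range(k - 1) if seq[i] == "H"]
        if not heads:
            continue  # no H before the last position: sequence is not relevant
        followed = sum(1 for i in heads if seq[i + 1] == "H")
        total += Fraction(followed, len(heads))
        relevant += 1
    return total / relevant


if __name__ == "__main__":
    for k in range(3, 13):
        print(k, round(float(unweighted_hh_average(k)), 4))
```

For k = 4 this reproduces 17/42, about 40.5%. Season-length sequences are far beyond exhaustive enumeration and would need simulation instead.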
Equally weighting each basket might even be sensible for a single season, depending on the number of shots taken. That is essentially the approach taken in GVT, where a player’s HH-percentage is compared to his season-long hit rate. If they are close to each other, then GVT takes that as evidence against a hot hand.

But according to MS, in a random sequence of coin flips with length equal to the number of shots, call it K, that a player takes in a typical season, the unweighted HH-percentage averaged over all the possible K-length sequences is still significantly less than 50%. Hence, MS assert that the proper unit of analysis is a player’s record for an entire season, considered as one sequence. In that case, when determining if a player (with a 50% hit rate) was generally a streak shooter, we should compare his HH-percentage for the season to the unweighted average HH-percentage in all the possible K-length sequences. Then, as in our discussion of a single game, an HH-percentage of 50% for the season should be taken as evidence that the player is a streak shooter.

6.3 It’s the model

When there is a dispute between academics, particularly in science or mathematics, it is easier for a journalist to explain the dispute by saying that one of the parties made an “error” or had a “misperception”. And, that explanation may be more attractive to the public. But the reality is often that the parties have a legitimate difference of opinion on some methodological or data issue. When using mathematics to study a natural or human phenomenon, we must create a detailed model of the phenomenon to allow the application of mathematics. Different models can lead to different ways that mathematics is used.
In the disagreement between the GVT and MS papers, the fundamental issue is not that one of the parties made a mathematical error or had a misperception of randomness; the underlying issue concerns the “unit of analysis” that the mathematics applies to, and that is determined by the way one models a player without a hot hand. The unit of analysis then dictates whether an unweighted or weighted average of HH-percentages (over all random sequences of a fixed length) is used to determine the reference number that a player’s HH-percentage will be compared to.

The take-home lesson here is that modeling is a critical and difficult part of the application of mathematics. It is not enough to “get the math right”. To make the math meaningful, you have to create a meaningful model, and people often disagree on which models are the most meaningful.

7 One Last Piece of Nonsense

In the caption of the table shown in Johnson’s article, after showing that the computed average is 40.5 percent, Johnson adds:

    This is not, however, a violation of the laws of randomness. A head is followed by a head 12 times and by a tail 12 times. But by concentrating only on the flips that follow heads and ignoring the other data, we are fooled by a selection bias. (italics added to the last sentence)

What? The disagreement between the 40.5 percent average, and the fact that in the table a head follows a head exactly the same number of times that a tail follows a head, has nothing to do with “concentrating only on the flips that follow heads.” A “selection bias” is discussed in MS, but it is the consequence of choosing one of the fourteen relevant sequences of length four, with equal probability, independent of how many Hs are in the first three positions.
As we have discussed above, since there are more sequences that have at least one HT than have at least one HH, the selection bias leads to seeing a T follow an H more often than an H following an H. It is nonsense to say that “by concentrating only on the flips that follow heads and ignoring the other data, we are fooled by a selection bias.”

8 Aristotle and Appelbaum

Appelbaum, after asserting that “On average, just 40.5 percent of the heads are followed by another heads” continues with

    Go ahead, see for yourself (link to the October 17 NYT article by George Johnson).

That link leads to the Johnson NYT article, which contains the table showing that precisely 50 percent of the heads are followed by another heads. So, although Appelbaum encourages the reader to “see for yourself”, it seems that he did not make the effort. Apparently, he was so convinced of the claim that he didn’t think it needed empirical testing.

This reminds me of the story about Aristotle and the role of theoretical versus empirical thinking. Aristotle asserted that men have more teeth than women. As Bertrand Russell wrote: “Aristotle could have avoided the mistake of thinking that women have fewer teeth than men, by the simple device of asking Mrs. Aristotle to keep her mouth open while he counted.” So Appelbaum didn’t do the counting.

Ironically, both the NYT articles discuss the psychology of perceived randomness, and how easy it is to be fooled, even in the face of clear evidence. Johnson writes:

    For all their care to be objective, scientists are as prone as anyone to valuing data that supports their hypothesis over those that contradict it.

9 What about the New York Times, the Newspaper of Record

How could the Johnson article, and even more the Appelbaum article, have been published in the New York Times?
They wrote nonsensical things about probability and seriously misunderstood MS. The Wall Street Journal also wrote about the hot hands dispute in “The ‘hot hand’ debate gets flipped on its head”, by Ben Cohen, September 29, 2015, and initially made exactly the same mistake as the NYT articles. They wrote:

    Toss a coin four times. Write down what happened. Repeat that process one million times. What percentage of flips after heads also come up heads? The obvious answer is 50%. That answer is also wrong. The real answer is 40%...

But then on September 30, in an online version of the article, the error is noted and corrected to:

    Toss a coin four times. Write down the percentage of heads on the flips coming immediately after heads. Repeat that process one million times. On average, what is that percentage? ...

    Corrections & Amplifications: A previous version of this article incorrectly describes the question regarding coin flips. The question is about the average percentage of flips, not the overall percentage of flips. (Sept. 30)

But the NYT did not make any correction of the mistakes in the Johnson article. More embarrassingly, since several of the readers of the Johnson article correctly pointed out the nonsense, how did the Appelbaum article make it past the editors? As one of the readers (Larry from St. Louis) commented online after the Appelbaum article:

    It is shocking that such a basic error would get through both the Sunday Review and the Upshot. Is no one on the paper paying attention to what people write? Further, if the author of this Upshot column read the comments on the Sunday Review article, then he would have figured out the error for himself.

Apparently neither Appelbaum nor the editors read the comments of the readers, or if they did, they didn’t understand them, or think to ask an expert.
And now, almost two months after the publication of the Johnson and Appelbaum articles, and in contrast to the WSJ, there is no retraction, or further clarification in the Times, or even the printing of a letter to the editor. As an educator in a field involving mathematical reasoning, and one concerned with the public’s understanding of quantitative issues, and a long-time NYT subscriber (footnote 15), this is all very disturbing.

Footnote 15. and not a WSJ reader

The 16        Number of Hs in the      Number     HH-Percentage
sequences     first three positions    of HHs
HHHH          3                        3          100
HHHT          3                        2          66.66...
HHTH          2                        1          50
HHTT          2                        1          50
HTHH          2                        1          50
HTHT          2                        0          0
HTTH          1                        0          0
HTTT          1                        0          0
THHH          2                        2          100
THHT          2                        1          50
THTH          1                        0          0
THTT          1                        0          0
TTHH          1                        1          100
TTHT          1                        0          0
TTTH          0                        0          0
TTTT          0                        0          0
Total         24                       12
Average from the first 14 sequences               40.5

Table 1: The sixteen HT sequences of length four. The first fourteen contain an H in the first three positions. In each of those fourteen sequences, the number of Hs in the first three positions is shown; next the number of those Hs that are followed by another H is shown; then the percentage of Hs in the first three positions that are followed by another H is shown. This is the HH-percentage. The total number of Hs in the first three positions is 24, and the number of Hs in the first three positions that are followed by another H is 12, exactly 50%. However, the unweighted average of the percentages is not 50%, it is about 40.5%. True, but so what?

References

[1] T. Gilovich, R. Vallone, and A. Tversky. The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology, 17:295–314, 1985.

[2] J. B. Miller and A. Sanjurjo. Surprised by the gambler’s and hot hand fallacies? A truth in the law of small numbers. IGIER Working Paper no. 552, September 15, 2015.
