Either a Confidence Interval Covers, or It Doesn't (Or Does It?): A Model-Based View of Ex-Post Coverage Probability


Scott Lee¹

¹National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention

Abstract

In Neyman's original formulation, a (1 − α) confidence interval procedure is justified by its long-run coverage properties, and a single realized interval is to be described only by the slogan that it either covers the parameter or it does not. On this view, post-data probability statements about the coverage of an individual interval are taken to be conceptually out of bounds. In this paper, I present two kinds of arguments against treating that "either-or" reading as the only legitimate interpretation of confidence. The first is informal, via a set of thought experiments in which the same joint probability model is used to compute both forward-looking and backward-looking probabilities for occurred-but-unobserved events. The second is more formal, recasting the standard confidence-interval construction in terms of infinite sequences of trials and their associated 0/1 coverage indicators. In that representation, the design-level coverage probability 1 − α and the degenerate conditional probabilities given the full data appear simply as different conditioning levels of the same model. I argue that a strict behavioristic reading that privileges only the latter is in tension with the very mathematical machinery used to define long-run error rates. I then sketch an alternative view of confidence as a predictive probability (or forecast) about the coverage indicator, together with a simple normative rule for when intermediate probabilities for single coverage events should be allowed.

Keywords: confidence intervals; coverage probability; frequentist inference; single-case probability; predictive probability; Neyman.
Disclaimer: The findings and conclusions in this report are those of the author and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

1 Introduction

1.1 Background

When Jerzy Neyman introduced his theory of confidence intervals (CIs) in 1937 [21], he gave practicing statisticians a strong suggestion for how to interpret them: because θ is assumed to be a fixed, unknown constant, once a particular interval is generated, the coverage expression P(L(X) ≤ θ ≤ U(X)) is mathematically fixed, and so we can only say that the interval either did or did not succeed in covering it. The straightforward mathematical justification for this is that all the randomness in the confidence procedure (CP) lives in the data X, and so once we have a particular realization X = x_i, the expression above becomes degenerate in {0, 1}. Intuitively, this also makes sense, since if we imagine sampling a particular set of interval bounds an infinite number of times, the probability of success (under that design) will be either 0 or 1, depending on whether the original interval covered θ. As a consequence, practical guidelines for probabilistically interpreting CIs typically revolve around their long-run coverage properties, rather than the properties of any single constructed interval [10, 17, 27, 16], and attempts to say otherwise are often, though not always [19], branded as errors of interpretation [11] or fallacies in reasoning [20], despite the natural inclination to attach some kind of probability to realized intervals ex post (i.e., "post-data").
The tension between the accepted interpretation of CIs and the supposedly fallacious one can be recast more generally as a statement about events we know have occurred, but whose outcomes we have not observed [25]: for frequentists, randomness, and thus probability, lives in the sampling process and not in our knowledge of the outcomes, and so once a sample has been drawn, the ex-ante (i.e., "pre-data") probability has collapsed to some value in {0, 1}. Although this is mathematically true, it should also ring alarm bells for practicing statisticians (but, in the case of CIs, never seems to), because we happily use frequentist methods for statistical inference in exactly this kind of real-world scenario. Take, for example, the case of medical diagnosis: given that a patient tests positive on, say, a rapid diagnostic test for the influenza virus, what is the probability that she actually has it? If we stay consistent with our interpretation of CIs, we should also say that no probability statement may now be made: given her true underlying health state, the patient either does or does not have the flu, and there is no probability left to assign, because all of the randomness in the sampling process has now been exhausted. Clearly, though, following the "either-or" logic here would ruin the clinical value of the diagnostic test in guiding care, and it would obviate the effort epidemiologists and statisticians put into estimating the test's positive predictive value (PPV) in the first place, neither of which seems particularly desirable.
The interpretive tension also runs along philosophical lines, with frequentists and propensity theorists generally taking an ontic view of things (i.e., what matters is how randomness plays out in the world, whether we know about it or not) [12], and Bayesians generally taking an epistemic view (e.g., subjectivists identifying probability with personal degree of belief, or credence [5, 24, 12]). The latter have no trouble accommodating occurred-but-unobserved events, but the former run into more difficulty, since their interpretations tend not to deal explicitly with the role of the observer in making probability assignments (see, e.g., [30]). Statistical inference seems to require some kind of epistemic component, though (estimation with full knowledge is simply calculation), and even Neyman, arguably one of the foremost operationalists, admits the role of the observer in defining his theory of CIs (if θ is fixed-but-unknown, the natural follow-up question is, "Unknown by whom?"). More pointedly, if we do not care whether we know a CI covered θ, why are we trying to estimate θ at all? The answer to this question is perhaps more philosophical than statistical, and it is one that has been addressed thoroughly in the literature on the philosophies of both probability and science, so I will not attempt to summarize it here. However, in the sections below, I hope to show that, strictly speaking, the question itself is not one we need to entertain to provide a formal accounting of occurred-but-unobserved events within frequentism proper, and that we can in fact talk quite sensibly about coverage probability ex post, as long as we state clearly what we mean by the term "probability".
1.2 Paper overview and contributions

In what follows, I deliberately adopt a rather strict reading of Neyman's slogan (in a nutshell, that ex post probabilities of coverage are entirely out of bounds) and treat it as a normative rule (to be fair, though, this is how many, if not most, instructive pieces handle the interpretation; see, e.g., [2, 22, 29, 10, 1, 18] for examples). The arguments will form a kind of reductio: if we really insist on this kind of rule, we run into uncomfortable, if not untenable, constraints on other frequentist uses of probability. The alternative I suggest, and perhaps this paper's main contribution, is that we keep Neyman's (very useful) conception of long-run error control, but that we loosen what we are allowed to say about individual coverage events, especially ex post.

The rest of the paper is structured as follows. In Section 2, I begin the reductio informally by presenting three thought experiments that show some of the difficulties for inference that result, escalating in severity from one that makes the underlying probability model not very useful to one that loses its design-level probabilities entirely. In Section 3, I present a formal argument based on Kolmogorov-style probability theory to show why there is no real mathematical difference between probability statements about events ex ante and ex post, leaning on the machinery of CIs for exemplification. In Section 4, I discuss the implications of the prior sections, presenting a soft normative rule about whether to make intermediate probability assignments ex post and, if so, when. I also suggest a notion of "confidence" as predictive probability, or as a model-based probabilistic forecast, and I suggest a few directions for future research.

2 Thought Experiments

2.1 Dr. I-Don't-No

A patient comes into a primary care clinic with a cough, a runny nose, and a fever, all of which she developed within the past day.
Because the patient's fever is mild, her physician believes that she has the common cold, but just to be sure, she gives her a rapid antigen test for the flu. Prior clinical testing has shown the rapid test to have an estimated sensitivity of 0.75 and a specificity of 0.98 relative to polymerase chain reaction (PCR) testing; in this case, PCR has negligible error rates relative to true disease status, so we may treat it as a direct proxy for the latter. The patient tests positive, and now the physician must decide whether to issue the flu diagnosis and write the patient a prescription for an antiviral, or to tell her she likely has the common cold and recommend a more conservative treatment with over-the-counter medications to reduce congestion, cough, fever, and body aches.

Question: What course of action should the doctor recommend?

Based on the test's known sensitivity and specificity relative to PCR, along with a current estimate of flu prevalence in the patient's area, the doctor might base her choice on the model-derived probability that the patient has the flu given her positive test result. This figure is known as positive predictive value, or PPV, and can be written generally as P(D = 1 | T = 1). Assuming a prevalence of 10%, the calculation comes out to

    PPV = (sens · prev) / [(sens · prev) + (1 − spec) · (1 − prev)]
        = (0.75 · 0.10) / [(0.75 · 0.10) + 0.02 · 0.90] ≈ 0.81,    (1)

where prev is the background prevalence, sens is the test's sensitivity, and spec is the test's specificity. Under this model, the patient has a predicted probability of having the flu of 81%, so the physician strongly considers prescribing her the antiviral and recommending she stay home to recuperate.
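For readers who prefer code to algebra, equation (1) is a one-line application of Bayes' rule. The snippet below is a minimal sketch; the function name `ppv` and its argument names are illustrative, with the numerical values taken from the scenario above:

```python
# Positive predictive value, as in equation (1).
# sens, spec, and prev are the values assumed in the text.

def ppv(sens: float, spec: float, prev: float) -> float:
    """P(D = 1 | T = 1) via Bayes' rule."""
    true_pos = sens * prev              # P(T = 1, D = 1)
    false_pos = (1 - spec) * (1 - prev)  # P(T = 1, D = 0)
    return true_pos / (true_pos + false_pos)

print(round(ppv(sens=0.75, spec=0.98, prev=0.10), 2))  # ≈ 0.81
```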
Remembering her training in biostatistics, though, the physician also notes that, with this particular patient in her clinic, both a fixed test result T_i = t_i and a fixed underlying disease state D_i = d_i have been sampled, and so there is no randomness left to assign the patient a probability of disease: she either has the disease, or she does not. From this point of view, PPV turns into the degenerate conditional

    P(D_i = 1 | T_i = t_i, D_i = d_i) = 1{d_i = 1} ∈ {0, 1},

and now the physician must base her recommendation on an underlying truth value she has no possible way of knowing. Realizing how strange this would be, she decides to use the model-derived number of 81% and recommends a course of action that assumes the patient truly has the flu.

2.2 The Cat Tasting Treats

A cat named Sophie loves treats. To keep her happy, her owner buys her a big box of mixed-flavor treats from the pet supplies store. The label on the box guarantees that 75% of the treats are seafood-flavored and that the remaining 25% are chicken-flavored. (Apparently, to make this guarantee, the company producing the treats stamps each one with a unique random ID number and then uses a perfectly reliable computer vision system to track which one ends up in which box, which lets it automatically verify the flavor composition of the boxes.) From past observation, the owner knows that Sophie prefers the seafood-flavored treats, purring 80% of the time after eating them but only 60% of the time after eating the chicken-flavored treats. The purring has a carryover effect on what she does after eating the treat, too: if she purrs, there is a 90% chance that she will take a nap, and only a 10% chance that she will roam the house looking for something else to eat. If she does not purr, though, she is much more inclined to keep foraging, and the two chances align at 50%.
The owner draws a treat from the box, takes note of its number (in this case, #123), and puts it on the ground.

Question A: Assuming the owner has no idea what flavor the treat is, what is the probability that Sophie ends up taking a nap?

As before, we can base our answer on two model-based probabilities: the unconditional, which marginalizes the purring and napping probabilities over the distribution of flavors in the box; or the conditional degenerate, which considers the treat's flavor fixed but unknown. Under the former, the cat's probability of taking a nap is

    P(nap) = P(F = sea) P(nap | F = sea) + P(F = chk) P(nap | F = chk)
           = 0.75 P(nap | F = sea) + 0.25 P(nap | F = chk)
           = 0.75 [P(purr | F = sea) P(nap | purr) + P(no purr | F = sea) P(nap | no purr)]
             + 0.25 [P(purr | F = chk) P(nap | purr) + P(no purr | F = chk) P(nap | no purr)]
           = 0.75 (0.8 × 0.9 + 0.2 × 0.5) + 0.25 (0.6 × 0.9 + 0.4 × 0.5)
           = 0.75 × 0.82 + 0.25 × 0.74
           = 0.615 + 0.185 = 0.80.    (2)

A fairly high probability at 80%, but one that makes sense, given that most treats in the box are the cat's favorite, that she purrs 80% of the time after eating them, and that she naps 90% of the time after purring. On the other hand, when basing the calculation off of the conditional degenerate law

    P(F = sea | F = F_123) = { 1, F_123 = sea; 0, F_123 = chk } ∈ {0, 1},    (3)

the 80% probability forks into two, with

    P(nap | F_123 = sea) = P(purr | F = sea) P(nap | purr) + P(no purr | F = sea) P(nap | no purr)
                         = 0.8 × 0.9 + 0.2 × 0.5 = 0.82,
    P(nap | F_123 = chk) = P(purr | F = chk) P(nap | purr) + P(no purr | F = chk) P(nap | no purr)
                         = 0.6 × 0.9 + 0.4 × 0.5 = 0.74.    (4)

The treat's flavor is indeed a fact of the world, and the forked probability is mathematically correct.
However, since the purring and napping have yet to happen, and since our model gives us a clear way of quantifying the uncertainty in those outcomes with a single number, the unconditional is probably the way to go (I imagine it is also the number that most statisticians would use to answer this question).

Question B: The owner leaves the room so Sophie can eat in peace, then comes back a few minutes later to find her napping on the couch. What is the probability the treat was seafood-flavored?

This one is easy, since all we need to do is use Bayes' rule to calculate the required probability:

    P(F = sea | nap) = P(nap | F = sea) P(F = sea) / P(nap)
                     = (0.82 × 0.75) / 0.80 = 0.615 / 0.80 = 123/160 ≈ 0.77.    (5)

The motive for considering this conditional is that it is exactly the quantity our model assigns to the treat's hidden flavor given the final outcome of the trial. For a particular treat, this probability might be described as epistemic, as it represents the owner's uncertainty about the flavor, not any physical randomness in the world. Note, though, that in the actual sequence of events described by the model, the flavor is fixed before the cat eats, purring (or not) follows, and napping (or roaming) comes last. The same joint probability model governs all of these variables together, and so if we are willing to use that model before sampling to compute forward-looking probabilities like P(nap), by the same mathematical rules we are equally entitled to use it after the fact to compute backward-looking probabilities like P(F = sea | nap), regardless of whether we view them as epistemic. Rejecting the latter while accepting the former would amount to using only part of the model's probabilistic structure, even though both have well-defined frequentist properties over repeated sampling.
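Because equations (2) and (5) come from the same joint model, a quick Monte Carlo can check both at once. The sketch below (plain Python, with illustrative variable names) simulates the flavor-purr-nap chain and estimates the forward-looking P(nap) alongside the backward-looking P(F = sea | nap):

```python
import random

# Monte Carlo sketch of the cat-treat model from Section 2.2, checking
# equations (2) and (5). All probabilities come from the text.
random.seed(0)
n = 200_000
naps = 0
sea_and_nap = 0
for _ in range(n):
    sea = random.random() < 0.75                    # flavor drawn from the box
    purr = random.random() < (0.8 if sea else 0.6)  # purring depends on flavor
    nap = random.random() < (0.9 if purr else 0.5)  # napping depends on purring
    naps += nap
    sea_and_nap += sea and nap

print(naps / n)            # forward-looking P(nap), ≈ 0.80
print(sea_and_nap / naps)  # backward-looking P(F = sea | nap), ≈ 0.77
```

Both estimates are read off the same simulated sequence of trials, which is the point of the passage above: accepting the first while rejecting the second would use only part of the model.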
2.3 We're in Deep Truffle Now

An artisanal chocolatier is scaling up operations and buys a new set of machines to help her make more truffles. Her setup contains:

1. A fabricator that produces chocolate truffles, which has two components: one that creates the hollow chocolate shell, and one that fills the shell with chocolate ganache. The fabricator always produces a shell, but the filler doesn't always activate, failing to inject any ganache at all 10% of the time. The other 90% of the time it works, though, and when it does, an auto-stop mechanism detects when the shell is full to prevent over-filling.

2. A second machine that weighs the truffles and detects the ones that are hollow, which are lighter. This machine is more reliable than the filler mechanism in the first, but it's still not perfect, incorrectly reading filled truffles as hollow 5% of the time (false positives) and hollow truffles as filled 1% of the time (false negatives).

3. A conveyor belt that sends truffles from the first machine to the second.

4. A basic pressure sensor that detects whether any truffle is on the conveyor belt and emits a small single beep if so (the sensor is perfectly accurate).

5. A second conveyor belt that can send truffles from the weigher back to the filler as needed.

If the second machine detects a hollow truffle, it sends an electrical signal to the fabricator instructing it to pause the shell-maker, and it returns the suspect truffle along the second conveyor belt to the fabricator to be filled. Once the suspect truffle arrives, the fabricator attempts to fill it as if it were a fresh shell, with the filler activating with exactly the same timing and probability (90%). If the filler activates and the truffle is truly hollow, it always fills it; if the truffle is already full, however, the filler's sensor detects the fullness and does not inject more ganache.
The fabricator then emits the truffle and sends it back down the belt to be weighed, all with exactly the same timing as if the truffle had been created from scratch.

The chocolatier activates the machines, and the pressure sensor beeps shortly thereafter, indicating the fabricator has emitted a truffle.

Question: With this specific truffle on the belt, but the weigher having not yet made its measurement, what is the probability that the next truffle the fabricator emits is correctly filled?

Sticking only to design-level information, we can run the calculation like this:

    P(next filled) = P(W = filled) · 0.9 + P(W = hollow) · P(next filled | W = hollow)
                   = (0.9 × 0.95 + 0.1 × 0.01) · 0.9 + (0.9 × 0.05 + 0.1 × 0.99) · P(next filled | W = hollow).    (6)

    P(S = filled | W = hollow) = (0.9 × 0.05) / (0.9 × 0.05 + 0.1 × 0.99) = 5/16,
    P(S = hollow | W = hollow) = 11/16.    (7)

    P(next filled | W = hollow) = 1 · P(S = filled | W = hollow) + 0.9 · P(S = hollow | W = hollow)
                                = 1 · 5/16 + 0.9 · 11/16 = 149/160 = 0.93125.    (8)

    P(next filled) = 0.856 · 0.9 + 0.144 · 149/160 = 1809/2000 = 0.9045.    (9)

If, however, we decide to condition the probability on the realized fact that the current truffle either is or is not filled, we end up with the same kind of forked probabilities we saw in the cat-treat experiment:

    P(next filled | S = filled) = P(W = filled | S = filled) · 0.9 + P(W = hollow | S = filled) · 1
                                = 0.95 × 0.9 + 0.05 × 1 = 0.905,
    P(next filled | S = hollow) = P(W = filled | S = hollow) · 0.9 + P(W = hollow | S = hollow) · 0.9
                                = 0.01 × 0.9 + 0.99 × 0.9 = 0.9.    (10)

Unlike in the cat-treat experiment, though, the forking here leads to an apparent puzzle where we now have two probabilities attaching to exactly the same underlying event: the current truffle's fill status.
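As a check on the arithmetic, a small Monte Carlo of the emission-weighing-refill cycle recovers both the design-level value in (9) and the forked conditionals in (10). This is a sketch under the stated machine probabilities; the helper `next_filled` is an illustrative name, not anything from the text:

```python
import random

# Monte Carlo sketch of the truffle process from Section 2.3.
random.seed(1)

def next_filled(current_filled: bool) -> bool:
    """Given the current truffle's true fill status S, simulate whether
    the next truffle the fabricator emits is filled."""
    # Weigher reading W: a filled truffle reads hollow 5% of the time,
    # a hollow truffle reads filled 1% of the time.
    if current_filled:
        reads_hollow = random.random() < 0.05
    else:
        reads_hollow = random.random() >= 0.01
    if not reads_hollow:
        # Weigher says filled: the next emission is a fresh truffle,
        # filled with probability 0.9.
        return random.random() < 0.9
    # Weigher says hollow: the same truffle is returned and re-emitted.
    # A full truffle stays full; a hollow one is filled with probability 0.9.
    return current_filled or random.random() < 0.9

n = 200_000
design = sum(next_filled(random.random() < 0.9) for _ in range(n)) / n
forked_filled = sum(next_filled(True) for _ in range(n)) / n
forked_hollow = sum(next_filled(False) for _ in range(n)) / n

print(design)         # ≈ 0.9045, equation (9)
print(forked_filled)  # ≈ 0.905, equation (10)
print(forked_hollow)  # ≈ 0.900, equation (10)
```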
By design, every emitted truffle's fill status, including that of the one presently upcoming, is governed ex ante by P(next filled) = 0.9045. At the same time, conditioning on the realized fact that the current truffle either is or is not filled yields the two conditional values 0.905 and 0.9. The statement "either the current truffle is now filled, or it is not" is necessarily true, but it cannot be privileged as the only admissible post-trial probability content for this event, because doing so would prohibit the very quantity P(next filled) that the model instantiates by design.

2.4 A quick disclaimer

The foregoing examples are not meant to describe how working statisticians actually reason about occurred-but-unobserved events in practice; clearly, and especially in the case of PPV, they do not. What they are meant to show, however, is that taking Neyman's "either-or" slogan seriously leads us to a very real tension between what we are told to say, what we would like to say, and what our models should allow us to say. In the following section, I take a look at the mathematics underlying this tension and build toward what I hope is a reasonable way of resolving it.

3 Infinite sequences, microstates, and coverage

Below, I briefly recall the mathematical definition of a confidence procedure and then recast it in terms of microstates (fully fixed infinite sequences of trials) in order to make explicit how the design-level coverage probability 1 − α relates to those sequences. The primary goal here is to set the stage for making a principled claim about whether, if ever, the "either-or" statement Neyman makes is compatible with the frequentist notion of probability, and, if so, when.

3.1 Confidence intervals and coverage indicators

To begin, we fix a parametric model (Ω, F, {P_θ : θ ∈ Θ}), with parameter θ ∈ Θ treated as fixed but unknown.
Now, let X : (Ω, F, P_θ) → 𝒳 denote the data, with distribution P^X_θ under P_θ. Under this setup, a (1 − α) confidence interval for θ is a measurable map

    I : 𝒳 → ℐ,   x ↦ I(x) = [L(x), U(x)],

such that for every fixed θ ∈ Θ,

    P_θ(θ ∈ I(X)) = P_θ(L(X) ≤ θ ≤ U(X)) = 1 − α.    (11)

Since coverage is a yes-no event, we can also define the coverage indicator

    Z_θ(X) := 1{θ ∈ I(X)} = 1{L(X) ≤ θ ≤ U(X)},    (12)

which is a {0, 1}-valued random variable on (Ω, F, P_θ). From here, we can see that (11) is equivalent to

    E_θ[Z_θ(X)] = 1 − α,   or   Z_θ(X) ∼ Bernoulli(1 − α)    (13)

under P_θ.

3.2 Infinite sequences of experiments

To return to the concept of microstates, we can imagine embedding the interval-generating procedure in an infinite sequence of experiments. For a fixed θ ∈ Θ, let X_1, X_2, … ∼ P^X_θ be i.i.d. copies of the data, and then define

    I_i := I(X_i) = [L(X_i), U(X_i)],   Z_i := 1{θ ∈ I_i} = 1{L(X_i) ≤ θ ≤ U(X_i)},   i = 1, 2, ….

By (13), each Z_i has the same Bernoulli(1 − α) law under P_θ, and (Z_i)_{i ≥ 1} are independent and identically distributed (i.i.d.). In particular,

    P_θ(Z_i = 1) = 1 − α,   i = 1, 2, ….    (14)

Each outcome ω ∈ Ω determines an infinite microstate path

    (X_1(ω), X_2(ω), …),   (I_1(ω), I_2(ω), …),   (Z_1(ω), Z_2(ω), …),

with Z_i(ω) ∈ {0, 1} for every i. From this viewpoint, a single realized world ω contains a fully fixed infinite sequence of intervals and coverage outcomes; the only randomness is in which index i of this sequence is consulted when the procedure is run. The strong law of large numbers applied to (Z_i)_{i ≥ 1} yields

    (1/n) Σ_{i=1}^{n} Z_i → 1 − α   almost surely,    (15)

so that with P_θ-probability 1, the realized microstate path has an asymptotic coverage fraction equal to the design-level 1 − α.
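The Bernoulli structure of the Z_i is easy to exhibit numerically. The sketch below simulates the usual 95% t interval for a normal mean and tracks the realized coverage fraction, which settles near 1 − α as (15) predicts; the t critical value 2.262 (the 0.975 quantile with 9 degrees of freedom) is hard-coded to keep the example dependency-free:

```python
import random
import statistics

# Simulate the coverage indicators Z_i of Section 3.1: repeatedly draw
# samples of size 10 from N(theta, 1), form the 95% t interval for the
# mean, and record whether it covers theta.
random.seed(2)
theta, n_obs, t_crit, n_trials = 5.0, 10, 2.262, 20_000

z = []  # coverage indicators Z_1, Z_2, ...
for _ in range(n_trials):
    x = [random.gauss(theta, 1) for _ in range(n_obs)]
    m = statistics.mean(x)
    half = t_crit * statistics.stdev(x) / n_obs ** 0.5
    z.append(int(m - half <= theta <= m + half))  # Z_i = 1{theta in I_i}

print(sum(z) / n_trials)  # realized coverage fraction, ≈ 0.95
```

Each realized interval either covers θ or it does not (each z entry is 0 or 1), yet the running average of those 0/1 outcomes is exactly what instantiates the design-level 1 − α.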
3.3 A small aside with Borel–Cantelli

As a small aside, it may be instructive to recast the infinite sequence (I_1, I_2, …) above as sampling with replacement from the "population" of intervals that our chosen procedure can produce from our data. When we are working in a discrete or finite-population setting (for example, if our data take values on a finite grid because of limited measurement resolution, the case for most, if not all, physical experiments, or if we are literally sampling from a finite list of outcomes), there will be only countably many distinct interval values J that can appear, and some of them will occur with strictly positive probability π_θ(J) = P_θ{I(X) = J} > 0. Now imagine picking one such interval value J (with fixed numerical endpoints). It is true that every time we run our procedure, we will either see J or not. Importantly, though, because the events {I_i = J} are independent across all runs of the procedure, and because they all have probability π_θ(J), the second Borel–Cantelli lemma tells us that, with P_θ-probability 1, the event {I_i = J} happens infinitely often as i → ∞. So any interval that has nonzero mass under the design will keep coming back, over and over again, along the infinite sequence of trials.

As in the infinite sequences above, each time our particular interval J reappears, it comes with the same two layers of probability given to it by the model: before seeing I_i, the event {θ ∈ I_i} has unconditional design-level probability 1 − α, while after seeing I_i = J, the corresponding indicator Z_i takes its fixed value in {0, 1}. The key point is that these are probabilities at different conditioning levels of the same model; this does not claim that P_θ(θ ∈ J) = 1 − α for a fixed realized J. Rather, 1 − α is the pre-conditioning probability attached to each trial before revealing which interval value occurred.
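For a concrete discrete example, consider estimating a binomial proportion from 10 trials: the interval is a deterministic function of the success count alone, so only eleven distinct interval values can occur, and any value with positive probability recurs along the sequence of runs. The sketch below uses a Wald interval purely for illustration (an assumed choice; any interval procedure on discrete data behaves the same way):

```python
import random
from collections import Counter

# With discrete data, specific interval values recur across runs of the
# procedure, as in the Borel-Cantelli aside. Here the data are 10 Bernoulli
# trials with true p = 0.3, and the interval is a function of the count k.
random.seed(3)

def wald_interval(k: int, n: int, z: float = 1.96) -> tuple:
    p_hat = k / n
    half = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return (round(p_hat - half, 4), round(p_hat + half, 4))

counts = Counter()
for _ in range(10_000):
    k = sum(random.random() < 0.3 for _ in range(10))
    counts[wald_interval(k, 10)] += 1

# At most 11 distinct interval values, each recurring many times:
for interval, c in counts.most_common(3):
    print(interval, c)
```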
Of course, in the usual continuous CI story (for example, t intervals for a normal mean), the distribution of I(X) is itself continuous, and so the probability of seeing exactly the same endpoints twice is zero (under the model): every exact numerical interval J has π_θ(J) = 0. In those models the Borel–Cantelli picture above is best read as a discrete intuition rather than a literal claim about recurrence of specific endpoints. The substantive long-run content of the design is still captured by the Bernoulli coverage indicators Z_i and the law of large numbers, however; it is just not captured by the reappearance of any particular interval values themselves.

3.4 Stating the Obvious

The formalization above helps us see two important things. First, the statement that a given interval either covers the parameter or not is always true, regardless of whether it is the one we are about to construct, the one we just constructed, or the one we will construct a million samples from now. By definition, for the i-th use of the confidence procedure, the model simultaneously implies both

    P_θ(Z_θ(X_i) = 1) = 1 − α   and   P_θ(Z_θ(X_i) = 1 | X_i) = Z_θ(X_i) ∈ {0, 1} almost surely.

Equivalently, using a regular conditional version, for P^{X_i}_θ-almost every realized sample x_i,

    P_θ(Z_θ(X_i) = 1 | X_i = x_i) = 1{θ ∈ I(x_i)} ∈ {0, 1}.

These identities hold both ex ante and ex post, and the only difference is whether we treat X_i as random or condition on its realized value. In probabilistic terms, they are just two layers of the same law of iterated expectations:

    E_θ[Z_θ(X_i)] = E_θ{ E_θ[Z_θ(X_i) | X_i] } = 1 − α.

Reporting P_θ(Z_θ(X_i) = 1) = 1 − α amounts to working with a coarser information field (the design alone), whereas reporting P_θ(Z_θ(X_i) = 1 | X_i) uses the finer σ-algebra generated by X_i.
Regarding our interpretation of a single CI, choosing to report the degenerate conditional probability rather than the unconditional P_θ(Z_θ(X_i) = 1) = 1 − α is therefore a choice of conditioning level (i.e., of σ-algebra), not something dictated by the model itself.

The second thing the formalization helps us see is that, in the infinite-sequence representation, each use of the procedure corresponds to an index i and an interval I_i. For every i, the model assigns the same design-level coverage

    P_θ(Z_θ(X_i) = 1) = 1 − α,

even though in each realized world ω the interval I_i(ω) has fixed endpoints. From the Borel–Cantelli aside above, we also know that in the finite-population setting, this probability will attach to the same numerical interval an infinite number of times (which means that if we make any ex ante, design-level claims about coverage, we will eventually make the claim about an actual interval we have already constructed). In both cases we can see that, under the model, coverage probability really is attaching to the intervals themselves (just not to the intervals conditioned on themselves), and that the "either-or" reading ex post does not arise from some deep mathematical asymmetry between pre- and post-data uses of the model, but rather from the choice of one conditioning level, the finest one, over the others the model also happily supports.

4 Discussion

4.1 Can we have it both ways?

Above, I revisited the two kinds of coverage probabilities given to us by the model underlying a confidence procedure: the design-level unconditional, and the degenerate conditional based on a realized interval's endpoints. If both kinds of probabilities are available under the model, do they fare equally well as interpretations of coverage ex post? And can the degenerate conditional be the only one we allow, to the exclusion of all others? I would like to suggest the answer to both questions is "no".
To see why, it will be helpful to return to the Deep Truffle thought experiment from Section 2.3.

4.1.1 The degenerate as an interpretation

In order to answer the question, "What is the probability the next emitted truffle is filled?", there are two options for us to consider. On the one hand, there are two forked conditional probabilities for the next truffle: one conditional on the current truffle in fact being filled, and one conditional on it in fact being hollow. On the other hand, there is a single design-level probability, obtained by averaging those two conditionals with respect to the model's probability that the current truffle is filled. Mathematically, both constructions are legitimate. Interpretationally, however, I would argue that they are not on equal footing. Under a frequentist view, probability is defined with respect to a single, typically infinite sequence of experiments conducted under repeatable conditions (i.e., a reference class, or, in von Mises' terms, a "collective", a term I use here because it more clearly captures the notion of an infinite sequence [30, 8]). In the chocolatier's situation, once we condition on the truth value of a realized event ("this truffle is full" versus "this truffle is hollow"), a single infinite collective is no longer available to underwrite all of the probabilities we might like to assign to future events. To interpret both forked probabilities as long-run frequencies, we must imagine two distinct collectives, each with its own version of the Markov chain above describing the truffle-emission-and-weighing process. These collectives correspond to two incompatible futures, one in which the present truffle is (and always was) filled and one in which it is hollow; but because the current truffle cannot actually be both filled and hollow, only one of these can coincide with the actual state of affairs.
As Hájek points out, we need to choose a single reference class for obtaining the probabilities [13], and unless we entertain a kind of frequentist inference based on simultaneous collectives, we are at something of an impasse.

An analogous issue arises for confidence intervals. As an example, let us imagine that we would like to calculate the probability that both our current interval and the next interval we construct cover θ. Let C_i denote the indicator that the i-th interval produced by a fixed procedure covers θ, so that C_i ~ Bernoulli(1 − α) and (C_i)_{i≥1} are i.i.d. under the model. At the design level we have

Pr(C_1 = 1, C_2 = 1) = (1 − α)^2,

which can be read, in frequentist terms, as the long-run proportion of pairs of consecutive intervals in which both cover θ. Once a particular sample has been observed and a particular interval constructed, however, we can again be tempted to introduce forked, ex post readings of this joint probability. If the present interval in fact covers θ, then in the model

Pr(C_1 = 1, C_2 = 1 | C_1 = 1) = Pr(C_2 = 1 | C_1 = 1) = 1 − α,

where the last equality follows from the independence of (C_i). If instead the present interval in fact fails to cover, then

Pr(C_1 = 1, C_2 = 1 | C_1 = 0) = 0.

Mathematically, these probabilities are all perfectly well defined, but again, under a strict frequentist account, probabilities are typically interpreted with respect to a single infinite sequence of repetitions. For CIs, the unconditional design-level probability (1 − α)^2 corresponds to that single collective. By contrast, the two forked readings of the joint probability implicitly appeal to two distinct collectives: one in which the realized first interval covers θ (yielding a joint probability of 1 − α), and one in which it does not (yielding a joint probability of 0).
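The three quantities above can be sketched with a toy simulation (illustrative only; the indicator model is just i.i.d. Bernoulli draws, as in the text):

```python
import random
import statistics

random.seed(2)
alpha = 0.05
N = 200_000

# i.i.d. Bernoulli(1 - alpha) coverage indicators for pairs of consecutive intervals
pairs = [(random.random() < 1 - alpha, random.random() < 1 - alpha) for _ in range(N)]

# Design-level joint probability: long-run proportion of pairs where both cover
joint = statistics.fmean(c1 and c2 for c1, c2 in pairs)

# Forked ex post readings, conditioning on the first indicator's realized value
joint_given_cover = statistics.fmean(c1 and c2 for c1, c2 in pairs if c1)
joint_given_miss = statistics.fmean(c1 and c2 for c1, c2 in pairs if not c1)

print(round(joint, 3))              # near (1 - alpha)^2 = 0.9025
print(round(joint_given_cover, 3))  # near 1 - alpha = 0.95, by independence
print(joint_given_miss)             # exactly 0.0
```

All three values live in the same probability model; the simulation simply conditions on different information about C_1.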
As with the truffles, only one of these can coincide with the actual state of affairs in the world, since the realized interval is not simultaneously covering and non-covering, and again, unless we are willing to treat the ensuing forked conditionals as living in separate collectives, the unconditional joint probability (1 − α)^2 is probably the safer bet, understood again at the design level rather than as a statement about the specific realized pair of intervals. (If this seems like a toy example, we might also imagine wondering how many of, say, 1000 CIs we have just constructed will cover the parameter. Without admitting a design-level p to plug into the equation for the binomial mean, that question becomes a very difficult one to answer with a precise number.)

4.1.2 The degenerate as the only interpretation

The conditional degenerate is certainly one interpretation we can adopt ex post, but can it be the only one? I suggest that the answer here must be "no", and precisely for this reason: if it were the only interpretation we could adopt, we would lose the ability to make design-level probabilistic statements about future events conditioned on the current state of affairs. The thought experiments in Section 2 make this consequence clear, especially in the case of Deep Truffle, where insisting that the degenerate conditional is the only possibility for the current truffle's fill status prevents us entirely from recovering the design-level probability that the next truffle will be filled. Allowing both interpretations side by side would be perfectly fine: we could still say that the current truffle is either filled or hollow, if we wished, but we would also have the option of backing off to the coarser-grained unconditional probability to estimate the fill status of the next truffle, as our model requires, and as common practice might suggest.
The question here is really not whether the Neyman-style interpretation is legitimate (it is) or should be allowed alongside the others our model defines (it should), but rather whether it should be the only one that is allowed, to the exclusion of all others. Again, in light of the evidence above, it seems reasonable to say "no".

4.2 What is confidence, then?

The discussion above suggests that part of the trouble comes from conflating distinct layers of probability that the model makes available. We clearly have at least two model-based layers to work with when interpreting confidence intervals: at the design level, there is the unconditional coverage probability 1 − α attached to the procedure as a whole; and at the ex post level, conditioned on the truth value of the coverage event for a particular interval, there are the degenerate probabilities 0 and 1 corresponding to non-coverage and coverage, respectively. Both are properties of the same probability measure P_θ, viewed at different conditioning levels. The design-level quantity arguably holds up better under scrutiny, as it avoids the multi-collective complications noted above. For some distributions and estimators, though, a bare report of 1 − α can feel somewhat strange ex post. The "trivial" interval, which returns the real line with probability 1 − α and the empty set with probability α, is perhaps the most familiar example, but Welch's uniform confidence interval [31] and Basu's construction based on a θ-free X [3] are also instructive. In such cases, I would suggest that, while the two model-based layers are certainly still in play (considered only as a random draw, any interval generated by the procedure does, by definition, inherit the design-level coverage probability), it may be more natural to admit a third layer of probability: that of a single-case predictive probability, which can be scored against the underlying 0/1 coverage outcome.
From this perspective, our 1 − α report ex ante is a forecast about the coverage indicator for the interval to be constructed, distinct from the interval's realized coverage status under the model, and chosen so that it minimizes expected loss under a proper scoring rule over repeated sampling [9]. In a prequential reading [4], repeated use of the procedure produces a sequence of forecasts and outcomes (p_1, Z_1), (p_2, Z_2), ..., where Z_i is the coverage indicator for the i-th interval and p_i is the predictive probability assigned to Z_i = 1 given whatever information we choose to condition on. The forecast sequence (p_i) can then be evaluated by its long-run scores and calibration properties, independently of whether we identify p_i with any particular P_θ-probability. While our natural ex ante forecast would simply be p_i = 1 − α, reflecting only the design, we might legitimately allow our ex post forecast to change if, as in the case of Welch's uniform CI, features of both the design and the numerical value of the realized interval provide additional clues about coverage. In that richer information state, the predictive probability p_i(X_i) might not equal 1 − α, but neither would it necessarily collapse to {0, 1}: it would be a genuine intermediate forecast about the coverage event Z_i, subject to empirical assessment across repeated uses of the procedure. Along these lines, I suggest that this kind of forecast is, in essence, what Neyman's usage of the term "confidence" was gesturing toward: a non-oracle observer's best guess as to how many intervals like the one we have constructed, or the ones we are about to construct, will succeed in covering the parameter.
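The prequential reading can be sketched in code (my own illustration; the simulation setup and numbers are assumptions). The constant ex ante forecast p_i = 1 − α is scored against simulated coverage indicators with the Brier score, one example of a strictly proper scoring rule:

```python
import random
import statistics

random.seed(4)
alpha = 0.05
m = 50_000

# Simulated coverage indicators Z_i and the constant ex ante forecast p_i = 1 - alpha
outcomes = [int(random.random() < 1 - alpha) for _ in range(m)]
forecasts = [1 - alpha] * m

# Brier score: mean squared difference between forecast and 0/1 outcome
brier = statistics.fmean((p - z) ** 2 for p, z in zip(forecasts, outcomes))

# A calibrated constant forecast p has expected Brier score p * (1 - p)
print(round(brier, 4))                       # near 0.95 * 0.05 = 0.0475
print(round(statistics.fmean(outcomes), 3))  # calibration check: near 1 - alpha
```

An information-richer forecast p_i(X_i) would be scored the same way, which is what makes the forecast layer empirically assessable without identifying it with any single P_θ-probability.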
Separating our probabilistic views of confidence procedures in this way, into (i) design-level probabilities under P_θ, (ii) degenerate conditionals under P_θ given the full data, and (iii) information-indexed predictive probabilities for the coverage indicator, would ease some of the tension in interpreting constructed intervals ex post, since we could then specify which view we are referring to when we talk about, for example, the "probability" that a single interval has covered the parameter. It would also complement, rather than conflict with, views raised by authors who have extended the notion of confidence in other ways, for example as an extended likelihood [26] or as the basis for CI-based distributional inference [32], as well as those who have rebuilt Fisher's original notion of fiducial inference [6, 7] with modern methods and philosophical sensibilities [14, 15, 28]. The distinction between this perspective and those is that it treats confidence not as something that might help us learn about the value of θ or its likely distribution, but rather as a predictive probability of the underlying coverage event itself, given that the data, model, and estimator are fixed. In other words, it is simply a model-based guess as to whether the current interval belongs to the subset of intervals the procedure can generate that cover the parameter, not a guess about where, numerically, the parameter actually lies. To show how such a framing might take shape mathematically, I explore this notion further in a separate paper.

4.3 A suggestion for ex post probability statements

Leaving the notion of confidence aside, I would like to end by posing a soft rule for deciding how to answer questions about the probability of occurred-but-unobserved events: only condition on post-trial information when it actually reduces uncertainty about the outcome.
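This rule can be illustrated with a toy simulation (names and numbers are my own, not the paper's): conditioning on the probability-one event that sampling occurred leaves the design-level success probability unchanged, while conditioning on the hidden outcome itself drives it to 0 or 1.

```python
import random
import statistics

random.seed(5)
p = 0.7  # hypothetical design-level success probability (treat flavor, test result, ...)
draws = [int(random.random() < p) for _ in range(100_000)]

# Conditioning on "sampling was performed": a probability-one event, so every
# draw qualifies and the conditional probability stays at the design level.
p_given_design = statistics.fmean(draws)

# Conditioning on the hidden outcome event itself: the probability degenerates.
p_given_success = statistics.fmean(y for y in draws if y == 1)
p_given_failure = statistics.fmean(y for y in draws if y == 0)

print(round(p_given_design, 2))          # near 0.7
print(p_given_success, p_given_failure)  # 1.0 0.0
```

Only the second kind of conditioning reduces uncertainty about the outcome, and it does so by collapsing the probability entirely.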
In formal terms, we may view this as choosing an information σ-algebra G with

σ(design) ⊆ G ⊆ σ(full microstate),

and taking our ex post probabilities to be P_θ(· | G), where G is the information we happen to have at hand, rather than conditioning on the maximal σ-algebra that renders the outcome event measurable (and hence degenerate). Choosing such a G keeps our ex post probability statements within the hierarchy of σ-algebras defined by the model, avoids degeneracy, and preserves the collective originally implied by the design. For a cat treat with unknown flavor, a patient with an unknown disease status, or a constructed CI that gives no clues as to coverage, the only information we have post-trial is simply that sampling was performed according to the experimental procedure (a probability-one event under the model), which therefore leaves the pre-trial probability of success unchanged. As demonstrated above, conditioning instead on the hidden outcome event itself leads to (sometimes extreme) difficulties with ensuing probability calculations, for no particular reason other than that we have decided to switch our choice of reference class from the infinite one defined by the model, where the relevant probability is intermediate and well defined, to the singleton one defined by the realized outcome, where the probability of the event was always only ever going to be either 0 or 1, regardless of when we assigned it. More straightforwardly, and perhaps a bit provocatively, I would like to suggest that, as a matter of mathematical bookkeeping, we might reevaluate the standard frequentist notion of probability as living only in the process of random sampling.
The "pre-data" and "post-data" distinction already has an epistemic flavor, being naturally defined by someone's having observed sampling to have taken place (otherwise, how could we distinguish the two?), and it suggests that probability under the given model has somehow vanished once sampling has occurred. That implication is not required by frequentism per se, at least as realized in the Kolmogorov axioms: a Bernoulli-distributed outcome Y_i determined by data X_i carries an unconditional probability of success p under the model, regardless of whether it has yet been realized. Instead, it seems more faithful to the frequentist program to ground our probability statements purely in the σ-algebras given to us by the model, choosing among them as needed so as to make our resulting inferences as accurate and as coherent as possible, and to resist the temptation to identify probability entirely with whatever "chanciness" may be inherent in physical, real-world processes, including a statistician drawing random samples from an actual population. I would argue that the latter picture is more properly viewed as a kind of frequentism-propensity-theory hybrid [12, 23], and that we might not gain much in return by blurring those lines (if anything, keeping them separate might help us see more clearly the pragmatic and philosophical benefits of both). On the other hand, hewing to a model-based view puts us directly in touch with the mathematical support we need to make precisely the sort of objective, long-run probability statements that frequentism was originally conceived to secure.

5 Conclusion

In this paper, I have presented two kinds of arguments against the common claim that Neyman's interpretation of confidence intervals is the only one that may correctly be made: one informal, based on thought experiments, and one more formal, based on several views of the probability model governing coverage.
Succinctly, the gist of both arguments is that the "either-or" logic Neyman appeals to, when taken seriously as a normative rule about when intermediate probability assignments are and are not allowed, causes substantial problems for probability calculations in frequentist settings. Moreover, that interpretation sits uneasily with the very mathematical machinery Neyman uses to define the quantities he most cares about, namely long-run error rates for the procedures he recommends. Those error rates are expectations of single-trial coverage indicators under the design, and thus presuppose non-degenerate, design-level probabilities for the corresponding single coverage events, precisely the sort of probabilities the strict "either-or" reading forbids us to use when interpreting a particular realized interval. As an alternative, or perhaps as a complement, I suggest that we may continue to report the design-level coverage probability ex post, choosing to view a constructed interval as simply one among many that could have been generated by the confidence procedure and not one with any particular numerical values attached to it. I also suggest that the notion of "confidence" very likely refers to the same mathematical concept as predictive probability, or a probabilistic forecast, and that keeping that information-bound probabilistic layer separate from the two layers established by the underlying model can resolve some of the longstanding tensions around the use and interpretation of CIs for inference.

References

[1] Talal S Alshihayb et al. "Some misinterpretations of inferential statistics in dental public health literature". In: BMC Oral Health 25.1 (2025), p. 1760.
[2] Bradley P Carlin and Thomas A Louis. Bayesian Methods for Data Analysis. CRC Press, 2008.
[3] Anirban DasGupta. "Ancillary Statistics, Pivotal Quantities and Confidence Statements". In: Selected Works of Debabrata Basu. Springer, 2010, pp. 327–342.
[4] A Philip Dawid and Vladimir G Vovk. "Prequential Probability: Principles and Properties". In: Bernoulli (1999), pp. 125–162.
[5] Bruno De Finetti. "La prévision: ses lois logiques, ses sources subjectives" [Foresight: its logical laws, its subjective sources]. In: Annales de l'institut Henri Poincaré. Vol. 7. 1. 1937, pp. 1–68.
[6] Ronald A Fisher. "Inverse probability". In: Mathematical Proceedings of the Cambridge Philosophical Society. Vol. 26. 4. Cambridge University Press. 1930, pp. 528–535.
[7] Ronald A Fisher. "The fiducial argument in statistical inference". In: Annals of Eugenics 6.4 (1935), pp. 391–398.
[8] Donald Gillies. Philosophical Theories of Probability. Routledge, 2012.
[9] Tilmann Gneiting and Adrian E Raftery. "Strictly proper scoring rules, prediction, and estimation". In: Journal of the American Statistical Association 102.477 (2007), pp. 359–378.
[10] Steven N Goodman and Jesse A Berlin. "The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results". In: Annals of Internal Medicine 121.3 (1994), pp. 200–206.
[11] Sander Greenland et al. "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations". In: European Journal of Epidemiology 31.4 (2016), pp. 337–350.
[12] Alan Hájek. "Interpretations of probability". In: (2002).
[13] Alan Hájek. "The reference class problem is your problem too". In: Synthese 156.3 (2007), pp. 563–585.
[14] Jan Hannig, Hari Iyer, and Paul Patterson. "Fiducial generalized confidence intervals". In: Journal of the American Statistical Association 101.473 (2006), pp. 254–269.
[15] Jan Hannig et al. "Generalized fiducial inference: A review and new results". In: Journal of the American Statistical Association 111.515 (2016), pp. 1346–1361.
[16] Alexander T Hawkins and Lauren R Samuels. "Use of confidence intervals in interpreting nonstatistically significant results". In: JAMA 326.20 (2021), pp. 2068–2069.
[17] Rink Hoekstra et al. "Robust misinterpretation of confidence intervals". In: Psychonomic Bulletin & Review 21.5 (2014), pp. 1157–1164.
[18] R Laven and D A Yang. "Common misinterpretations of statistical significance and P-values in dairy research". In: JDS Communications (2025).
[19] Michael EJ Masson and Geoffrey R Loftus. "Using confidence intervals for graphically based data interpretation". In: Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 57.3 (2003), p. 203.
[20] Richard D Morey et al. "The fallacy of placing confidence in confidence intervals". In: Psychonomic Bulletin & Review 23.1 (2016), pp. 103–123.
[21] Jerzy Neyman. "Outline of a theory of statistical estimation based on the classical theory of probability". In: Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236.767 (1937), pp. 333–380.
[22] Walter W Piegorsch. Statistical Data Analytics: Foundations for Data Mining, Informatics, and Knowledge Discovery. John Wiley & Sons, 2015.
[23] Karl R Popper. "The propensity interpretation of probability". In: The British Journal for the Philosophy of Science 10.37 (1959), pp. 25–42.
[24] Frank P Ramsey. "Truth and probability". In: Readings in Formal Epistemology: Sourcebook. Springer, 1926, pp. 21–45.
[25] Milo Schield. "Interpreting statistical confidence". In: ASA Proceedings of the Section on Statistical Education. 1997.
[26] Tore Schweder and Nils Lid Hjort. "Confidence and likelihood". In: Scandinavian Journal of Statistics 29.2 (2002), pp. 309–332.
[27] Philip Sedgwick. "Understanding confidence intervals". In: BMJ 349 (2014).
[28] Teddy Seidenfeld. "R.A. Fisher's fiducial argument and Bayes' theorem". In: Statistical Science 7.3 (1992), pp. 358–368.
[29] Daren S Starnes, Dan Yates, and David S Moore. The Practice of Statistics. Macmillan, 2010.
[30] Richard Von Mises. Probability, Statistics, and Truth. Courier Corporation, 1981.
[31] B L Welch. "On confidence limits and sufficiency, with particular reference to parameters of location". In: The Annals of Mathematical Statistics 10.1 (1939), pp. 58–69.
[32] Min-ge Xie and Kesar Singh. "Confidence distribution, the frequentist distribution estimator of a parameter: A review". In: International Statistical Review 81.1 (2013), pp. 3–39.
