E-values as statistical evidence: A comparison to Bayes factors, likelihoods, and p-values

Ben Chugg¹, Aaditya Ramdas¹, and Peter Grünwald²
¹Carnegie Mellon University
²Leiden University and Centrum Wiskunde & Informatica

March 26, 2026

Abstract

A recurring debate in the philosophy of statistics concerns what, exactly, should count as a measure of evidence for or against a given hypothesis. P-values, likelihood ratios, and Bayes factors all have their defenders. In this paper we add two additional candidates to this list: the e-value and its sequential analogue, the e-process. E-values enjoy several desirable properties as measures of evidence: they combine naturally across studies, handle composite hypotheses, provide long-run error rates, and admit a useful interpretation as the wealth accrued by a bettor in a game against the null distribution. E-processes additionally handle optional stopping and optional continuation. This work examines the extent to which e-values and e-processes satisfy the evidential desiderata of different statistical traditions, concluding that they combine attractive features of p-values, likelihood ratios, and Bayes factors, and merit serious consideration as interpretable and intuitive measures of statistical evidence.

Contents
1 Introduction
2 E-Statistics and the Betting Game
3 An Overview of Evidential Desiderata
4 E-statistics, P-values, Likelihood Ratios, Bayes Factors
5 Comparing Evidence Measures
6 Limitations and Counterarguments
7 Summary and Future Work

1 Introduction

It is sometimes asserted that the p-value, first introduced by Karl Pearson (Pearson, 1900) but popularized and developed by Ronald Fisher (Fisher, 1925, 1935), is a measure of evidence against the null hypothesis. There is—to put it mildly—significant disagreement over this claim (Berger and Sellke, 1987; Wright, 1992; Wagenmakers, 2007; Hubbard and Lindsay, 2008; Greenland, 2019; Muff et al., 2022; Lakens, 2022). P-values lack several properties that many would expect of a measure of evidence, including consistency, sample size invariance, and the ability to handle optional stopping.

These drawbacks have caused many statisticians to renounce p-values as appropriate measures of statistical evidence (though, of course, they remain useful for other purposes). For example, Principle 6 in the American Statistical Association's statement on p-values reads "by itself, the p-value does not provide a good measure of evidence regarding a model or hypothesis" (Wasserstein and Lazar, 2016). Critics of p-values turn to other proposed notions of evidence, often the likelihood ratio and the Bayes factor. No single notion has satisfied everyone, however, and debates about the best object for quantifying statistical evidence continue to rage (e.g., Royall 1997; Forster 2006; Lele 2004; Taper and Ponciano 2016; Mayo and Spanos 2011).

In this article we do not attempt to resolve these disputes. Instead, we hope to simply add two new candidates to the discussion about statistical evidence: e-values/e-variables and e-processes. These objects (together, e-statistics) are simple yet rich objects in mathematical statistics that have garnered significant attention over the past five years.
They have proven indispensable to recent progress across several areas, such as multiple testing, mean estimation, and changepoint detection. We refer to Grünwald et al. (2024a) and Ramdas and Wang (2025) for more on such applications.

Contributions and outline. We investigate the extent to which e-statistics satisfy various evidential criteria commonly used in the philosophy of statistics. To do so, Section 3 collects 21 desiderata that a measure of statistical evidence should satisfy according to several schools of thought, including Bayesian confirmation theory and various flavors of frequentism. Section 5 then evaluates e-statistics against these criteria and compares them with p-values, likelihood ratios, and Bayes factors. While e-statistics do not satisfy all criteria for evidence that we identify throughout the literature, we find that they perform admirably well across a variety of measures. This stems partly from the fact that they can be viewed as generalized likelihood ratios, thereby inheriting many of the likelihood ratio's evidential properties. But they also satisfy criteria that the likelihood ratio fails, such as providing meaningful evidence against a single hypothesis without necessarily requiring an explicit alternative hypothesis.¹ That said, the flexibility of e-values and e-processes also has drawbacks. We discuss these in Section 6, where we also discuss their relationship to Birnbaum's theorem and the likelihood principle (Birnbaum, 1962).

Beyond relating e-statistics to the literature on statistical evidence, a secondary contribution of the paper is to clarify and sharpen several of the desiderata themselves. In particular, we distinguish between static and dynamic evidential criteria.
Static criteria are evaluated on a fixed batch of data, whereas dynamic criteria concern sequential and counterfactual settings in which additional data may be collected or further studies run. We further distinguish between different kinds of counterfactual scenarios and different forms of optional continuation, providing examples of each. We hope this taxonomy clarifies a set of issues that are often conflated in the literature.

We begin with an overview of e-statistics.

¹ Though whether this is valuable, or even sensible, depends partly on one's view of statistical evidence.

2 E-Statistics and the Betting Game

An e-variable E for a statistical hypothesis (i.e., collection of distributions) H is a nonnegative random variable whose expectation is at most 1 under all P ∈ H:

    E_H[E] := sup_{P∈H} E_P[E] ≤ 1.

We sometimes differentiate between the random variable E, which we call the e-variable, and its realization, which we call the e-value. In practice, there is a random quantity X which represents our data (for example, X may be a vector of 100 independent observations in an experiment, or a summary of these data such as a z-score), and E can be written as a (measurable) function of X, i.e., E is a statistic.

An e-process is the sequential analogue of the e-value. Here, we model our data as a sequence X_1, X_2, ... (which we abbreviate henceforth to (X_t)_{t≥1}), where each X_i may be either a simple data point or a vector. Now consider a sequence of random variables (E_t)_{t≥1} = E_1, E_2, ... where each E_i can be written as a function of the first i outcomes, X_1, ..., X_i. We call such a sequence an e-process for H if all E_i are nonnegative and E_H[E_τ] ≤ 1 for every stopping time τ. Informally, a stopping time is a random (that is, data-dependent) time determined by a stopping rule that for sequences X_1, ..., X_t of arbitrary length either says 'stop' or 'continue'. Examples are 'stop at t = 5', 'stop as soon as you have seen two outcomes larger than 1', or 'stop at the smallest t such that the p-value based on t outcomes is smaller than 0.05'.²

E-statistics have long appeared implicitly in the statistics and probability literature, especially in relation to nonnegative supermartingales (which are a special case of e-processes). It is only recently, however (roughly 2020; see Ramdas and Wang 2025, Section 1.8), that they have begun to be studied as objects in their own right. E-statistics have often been referred to as measures of evidence against H (Shafer, 2021; Ramdas et al., 2023): the larger the value, the more evidence. This view is buoyed by their interpretation as betting scores in a game between the statistician and nature (Shafer and Vovk, 2005, 2019). This game proceeds as follows.

2.1 The betting game

Fix some distribution P ∈ H. At time t = 1, 2, ..., the statistician designs a nonnegative random variable B_t, which is required to be a function of data X_1, ..., X_t arriving at or before time t, satisfying³

    E_P[B_t | X_1, X_2, ..., X_{t-1}] ≤ 1.    (1)

We imagine the statistician buying a share of B_t for $1; his payoff is $B_t once its value is revealed. The inequality in (1) expresses that, under P, the statistician does not expect to make any money, regardless of how cleverly B_t is designed. We assume that the statistician can buy an arbitrary (nonnegative) number of shares. If the statistician starts with $1 in round 1, then after the first outcome X_1 is revealed he will have $B_1. If he reinvests all this money in round 2, then after X_2 is revealed he will have B_1·B_2. Thus, if the statistician starts with $1 and reinvests his money at each point in time, his total capital at time t is given by⁴

    K_t := ∏_{i≤t} B_i.

The intuition behind the betting game is that large values of K_t may be considered evidence against P. To see this, note that if data were truly drawn according to P, then the statistician does not expect to gain any money in this game, regardless of whatever stopping time τ is employed. More formally, for any stopping time τ:

    E_P[K_τ] ≤ 1,    (2)

which follows from (1) and the optional stopping theorem. Next, note that it is unlikely that K_t can become large. How unlikely, exactly? By Markov's inequality, (2) immediately gives that P(K_τ ≥ 1/α) ≤ α. Further, by Ville's inequality (Ville, 1939), a time-uniform extension of Markov's inequality,

    P(there exists t ≥ 1 : K_t ≥ 1/α) ≤ α.    (3)

That is, the probability of the process (K_t)_{t≥1} ever exceeding 1/α, no matter how long we continue the game, is at most α.

We can relate e-variables and e-processes to the betting game as follows. When H = {P} is a singleton (which we henceforth call a point or simple hypothesis), an e-variable may be viewed as the payoff from a single-round game, E = K_1. The data X_1 in this game need not be a single observation—it may itself be a batch of data presented all at once.

² Formally, what we covered here is just a special case of the more general, measure-theoretic definition of e-processes and stopping times. There we have a set of filtered probability spaces {(Ω, (F_t)_{t∈N}, P)}_{P∈H} on which (E_t) is an adapted process (i.e., E_t is F_t-measurable for each t). Recall that a stopping time τ is defined relative to such a filtration as a random variable such that {τ = t} is F_t-measurable for each t.
³ More formally and generally, we can write E_P[B_t | F_{t-1}] ≤ 1, where F_{t-1} is all the information available until and including time t−1, which is typically (though not always) the σ-algebra generated by data X_1, ..., X_{t-1}.
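To make the time-uniform guarantee (3) tangible, the following sketch (ours, not from the paper) plays the betting game against P = N(0,1) using the bet B_t = exp(λX_t − λ²/2), which satisfies (1) since E_P[exp(λX)] = exp(λ²/2). The choices λ = 0.5, α = 0.05, the finite horizon, and the trial count are all illustrative assumptions:

```python
import math
import random

random.seed(0)

ALPHA = 0.05     # tolerated crossing probability in Ville's inequality (3)
LAM = 0.5        # betting aggressiveness (hypothetical choice)
HORIZON = 200    # finite truncation of the (in principle infinite) game
TRIALS = 2000

def crossing_fraction(alpha: float) -> float:
    """Fraction of simulated games whose capital K_t ever reaches 1/alpha
    when the data really are drawn from the null P = N(0, 1)."""
    crossings = 0
    for _ in range(TRIALS):
        capital = 1.0
        for _ in range(HORIZON):
            x = random.gauss(0.0, 1.0)
            # B_t = exp(lam*x - lam^2/2) has E_P[B_t] = 1, so (1) holds.
            capital *= math.exp(LAM * x - LAM * LAM / 2.0)
            if capital >= 1.0 / alpha:
                crossings += 1
                break
    return crossings / TRIALS

crossing_rate = crossing_fraction(ALPHA)
print(f"estimated P(sup_t K_t >= {1/ALPHA:.0f}) = {crossing_rate:.3f} "
      f"(Ville bound: {ALPHA})")
```

The estimated crossing frequency should be consistent with, and typically somewhat below, the bound α, since (3) holds uniformly over all stopping rules while the simulation truncates at a finite horizon.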
An e-process, meanwhile, corresponds to an evolving capital process (K_t)_{t≥1}. For composite H, for each P ∈ H let (K_t^P)_{t≥1} be the capital process of a game played against the singleton distribution P. Now every e-variable corresponds to the minimum payoff in a single-round game, E ≤ inf_{P∈H} K_1^P. Analogously, it can be shown that every e-process satisfies E_t ≤ inf_{P∈H} K_t^P (and admissible e-processes satisfy this with equality; see Ramdas et al., 2022). In other words, an e-process for H reports the minimum wealth across many simultaneous betting games, one for each P ∈ H, all played using the same data X_1, X_2, ....

To relate e-statistics to scientific practice, consider the following scenario.

Example 1. Suppose a research group analyzes an initial batch of data X_1—for example, from a clinical trial—and reports an e-variable S_1 = S_1(X_1). Next, perhaps because the initial findings appear promising, the same or another group analyzes a new, independent batch of data X_2, reporting an e-variable S_2. The process may then continue, with further groups analyzing independent batches X_3, X_4, ... and reporting e-variables S_3, S_4, .... To combine the accumulating evidence—for instance, in a meta-analysis (Ter Schure and Grünwald, 2022)—the researchers multiply the successive e-variables. We may think of the e-variable S_t as the payoff B_t at time t. Since the design of the e-variable S_t = B_t at time t may depend on the earlier data X_1, ..., X_{t-1}, viewing the batches as arriving sequentially turns the defining e-variable condition E_P[S_t] ≤ 1 into the conditional requirement E_P[S_t | X_1, ..., X_{t-1}] ≤ 1. It follows that, for any stopping time τ, E_τ := ∏_{i≤τ} S_i is an e-variable, and that the running product E_t := ∏_{i≤t} S_i defines an e-process.

⁴ In the e-process literature, (K_t) is called a test supermartingale.
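Example 1's combination-by-multiplication is easy to emulate. In the sketch below (our illustration), each study reports an e-value of the form exp(λ∑x_i − nλ²/2), a standard choice for the null that the data are iid N(0,1); the batch sizes, λ values, and the true effect of 0.5 are invented for the demonstration:

```python
import math
import random

random.seed(1)

def gaussian_e_value(xs, lam):
    """e-value for H = {data iid N(0,1)} from one batch:
    exp(lam * sum(x) - n * lam^2 / 2).
    Under H its expectation is exactly 1, so it is a valid e-variable."""
    return math.exp(lam * sum(xs) - len(xs) * lam * lam / 2.0)

# Three independent batches; the true mean is 0.5, so the null is false.
batches = [[random.gauss(0.5, 1.0) for _ in range(40)] for _ in range(3)]
# Each group may pick its own lam, possibly informed by earlier batches.
lams = [0.3, 0.5, 0.5]

e_values = [gaussian_e_value(b, l) for b, l in zip(batches, lams)]
combined = math.prod(e_values)  # the product is again an e-value
print("per-study e-values:", [round(e, 2) for e in e_values])
print("combined e-value  :", round(combined, 2))
```

The key point is that no per-study significance thresholds or multiplicity corrections are needed: the running product is itself an e-variable at any stopping time.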
Example 1 constructs an e-process by multiplying e-variables. Conversely, when H is simple, every e-process (E_t)_{t≥0} for H can be decomposed as a product of 'past-conditional' e-variables, i.e., of the form B_t satisfying (1). Explicitly distinguishing between these two construction-directions (creating e-processes out of e-variables and vice versa) is not always important, yet it will become highly useful in Section 5.2.

Next, let us provide an example of a fundamental e-statistic.

Example 2 (Likelihood ratio). Consider a simple hypothesis H = {P} where P has density p. Then the likelihood ratio q/p for any distribution Q with density q is an e-variable for H (see below for assumptions implicit in this notation). To see this, note that

    E_P[q(X)/p(X)] = ∫ (q(x)/p(x)) p(x) dx = ∫ q(x) dx = 1.    (4)

By similar reasoning, for a sequence of random variables (X_t)_{t≥1}, the process (L_t)_{t≥1} where L_t = q(X_1, ..., X_t)/p(X_1, ..., X_t) defines an e-process for the hypothesis that the data are distributed according to P. If further they are independent and identically distributed (iid) under P and Q, we can write L_t = ∏_{i≤t} q(X_i)/p(X_i).

By recovering the likelihood ratio for point hypotheses, Example 2 highlights that one useful way to understand e-statistics is as extensions of likelihood ratios to composite settings. In particular, as opposed to some other popular extensions to the composite setting, such as the generalized likelihood ratio (Section 3), they preserve the property E_H[E] ≤ 1, which we can view as "fairness" in the betting game: evidence is difficult to exaggerate when playing against the null. We will return to the connection between simple-versus-simple likelihood ratios and general e-statistics later on.
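The identity (4) can be sanity-checked by Monte Carlo. In the sketch below (our choice of P = N(0,1) and Q = N(1,1)), the sample average of q(X)/p(X) under data drawn from P should hover near 1:

```python
import math
import random

random.seed(2)

def density_ratio(x, mu_q=1.0):
    """q(x)/p(x) for q the density of N(mu_q, 1) and p that of N(0, 1).
    The Gaussian normalizing constants cancel, leaving
    exp(mu_q * x - mu_q^2 / 2)."""
    return math.exp(mu_q * x - mu_q * mu_q / 2.0)

n = 200_000
draws = [random.gauss(0.0, 1.0) for _ in range(n)]  # data from the null P
mean_under_p = sum(density_ratio(x) for x in draws) / n
print(f"Monte Carlo estimate of E_P[q(X)/p(X)]: {mean_under_p:.3f}")
```

The same ratio evaluated on data from Q would average well above 1, which is exactly the asymmetry that makes the likelihood ratio useful as evidence against P.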
We provide additional examples in Section 4, but to orient unfamiliar readers, let us provide a second e-statistic here. This example can be viewed as an instance of the former.

Example 3 (Gaussian e-variable). For any λ ∈ R, the object E_λ(X) = exp(λX − λ²σ²/2) is an e-variable for H = {N(0, σ²)}, i.e., a Gaussian with mean 0 and variance σ². Note that E_λ is the likelihood ratio between N(σ²λ, σ²) and N(0, σ²). Further, the process (E_t)_{t≥1} where E_t = exp(λ∑_{i≤t} X_i − tλ²σ²/2) = ∏_{i≤t} exp(λX_i − λ²σ²/2) is an e-process for the hypothesis that the data (X_t)_{t≥1} are iid from N(0, σ²).

Remark on notation and definitions: Although the theory of e-statistics can be developed in much greater generality (Larsson et al., 2025), throughout this paper we assume that all distributions admit densities or probability mass functions. We use capital letters such as P and Q to denote distributions, and corresponding lowercase letters such as p and q to denote their densities/mass functions. Whenever we refer to a likelihood ratio q/p with P ∈ H, we assume it is almost surely well-defined under both H and Q.

2.2 Evidence for or against?

The meaning of small E. The betting game provides a notion of evidence against a hypothesis H, as opposed to a notion of evidence for the hypothesis. In particular, if E is an e-variable for H, then in general small values of E need not provide evidence in favor of H. This is because one can play the betting game conservatively. As an extreme example, the bet B_t ≡ 1 is an e-variable but will never make any money, regardless of the discrepancy between H and the true data-generating process. That said, e-statistics are often used to compare two hypotheses. In hypothesis testing, for instance, we compare a null H_0 to an alternative H_1 and search for e-statistics under H_0 which grow quickly under H_1.
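Example 3's e-process illustrates this "valid under H_0, growing under H_1" behavior concretely. In the sketch below (our illustration; λ = 1, σ = 1, t = 500, and the alternative N(1,1) are assumed choices), the log e-process drifts downward when the null is true and grows roughly linearly when it is false:

```python
import math
import random

random.seed(3)

LAM, SIGMA = 1.0, 1.0  # lam chosen with the alternative N(1, 1) in mind

def e_process_log(xs):
    """log of Example 3's e-process for H0 = {data iid N(0, sigma^2)}:
    log E_t = lam * sum(x) - t * lam^2 * sigma^2 / 2."""
    t = len(xs)
    return LAM * sum(xs) - t * LAM * LAM * SIGMA * SIGMA / 2.0

t = 500
null_data = [random.gauss(0.0, SIGMA) for _ in range(t)]  # H0 true
alt_data = [random.gauss(1.0, SIGMA) for _ in range(t)]   # H0 false

print("log E_t under H0:", round(e_process_log(null_data), 1))
print("log E_t under H1:", round(e_process_log(alt_data), 1))
```

Under the null the expected log-capital is −tλ²σ²/2 (the bettor slowly loses), while under N(1,1) it is t(λ − λ²σ²/2) > 0, so evidence against H_0 accumulates at a linear rate.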
For example, for singletons H_0 = {P_0} and H_1 = {P_1}, we can consider the likelihood ratio (Example 2). This suggests that there are indeed cases when a small e-statistic E for H_0 should be viewed as evidence in favor of H_0. A minimal requirement for such a scenario is that the reciprocal E^{-1} of E becomes an e-statistic when we reverse the roles of H_0 and H_1, as is the case for the likelihood ratio. In this case, evidence against H_0 can indeed be viewed as evidence for H_1, and vice versa.

The meaning of large E. Often, however, H_1 is implicit or chosen pragmatically based on the problem at hand. This is the case in recent work on bounded mean testing (Waudby-Smith and Ramdas, 2024) (Example 7 below) or in various conformal e-statistics (Vovk, 2025), for instance. In such cases, the perspective of a large e-value providing evidence against H_0 remains clear, whereas the interpretation as evidence for a specific H_1 is less so. The following example shows that this may happen even when H_0 = {P} is simple and the e-statistic E is a likelihood ratio, so that formally E^{-1} is an e-statistic as well. In general, while an alternative can be inferred from the bets or, more indirectly, from the e-statistic's definition, such an alternative might be best viewed as instrumentally useful, not as a bona fide hypothesis representing a true state of the world.

Example 4. Ryabko and Monarev (2005) show that bit strings produced by standard random number generators can be substantially compressed by lossless data compression algorithms such as zip, which is a clear indication that the bits are not so random after all. Here, the null hypothesis states that data are 'random' (independent fair coin flips), and one measures 'the amount of evidence against H_0 provided by data x^τ = x_1, ..., x_τ' as

    τ − cl_zip(x^τ),

where cl_zip(x^τ) is the number of bits needed to code x^τ using (say) zip. Kraft's inequality (Cover and Thomas, 1991) says that for arbitrary lossless codes encoding binary strings of length τ into other binary strings of non-fixed length, the code length cl(x^τ), as measured in bits, satisfies ∑_{x^τ ∈ X(τ)} 2^{−cl(x^τ)} ≤ 1, where X(τ) is the set of binary strings determined by the stopping time τ (if τ = n for fixed n, this is simply {0,1}^n). Thus, if we set q(x^τ) := 2^{−cl_zip(x^τ)}, we get ∑_{x^τ ∈ X(τ)} q(x^τ) ≤ 1: q represents a sub-probability distribution. At the same time, for the null we have H_0 = {P}, where P represents a sequence of fair coin flips, so it has mass function p with p(x^τ) = 2^{−τ} for each x^τ ∈ X(τ). Defining E_τ = q(X_1, ..., X_τ)/p(X_1, ..., X_τ), we thus find

    E_P[E_τ] = ∑_{x^τ ∈ X(τ)} (q(x^τ)/p(x^τ)) p(x^τ) ≤ 1    and    log₂ E_τ = τ − cl_zip(X_1, ..., X_τ).

Thus, the Ryabko-Monarev bit-difference is the logarithm of an e-value. But note that there is no explicitly defined alternative. Being able to significantly compress a string by zip intuitively provides strong evidence that the null hypothesis is false, and this is formalized by e-processes. Nevertheless, even though we measure evidence by the log likelihood ratio between H_0 = {P} and Q, this evidence against {P} should clearly not be construed as evidence in favor of the sub-distribution Q, which is a complicated object that helps to detect some structure in x_1, ..., x_τ but is not best thought of as either generating or predicting that structure.
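The bit-difference τ − cl(x^τ) is easy to play with. The sketch below (our illustration) substitutes zlib's DEFLATE for zip; note this is only a stand-in, since zlib's framing overhead means it is not exactly the prefix-free code that Kraft's inequality presumes. Highly regular bits yield a large positive log₂ e-value, while bits from the OS entropy source do not compress and yield a non-positive one:

```python
import os
import zlib

def log2_e_value(data: bytes) -> int:
    """Illustrative log_2 e-value against 'independent fair coin flips':
    tau - cl(x), with tau the input length in bits and cl the compressed
    length in bits (zlib standing in, loosely, for zip in Example 4)."""
    tau = 8 * len(data)
    cl = 8 * len(zlib.compress(data, 9))
    return tau - cl

structured = bytes(1000)       # 8000 highly regular bits (all zeros)
random_ish = os.urandom(1000)  # 8000 bits from the OS entropy source

print("log2 e-value, structured bits:", log2_e_value(structured))
print("log2 e-value, random bits    :", log2_e_value(random_ish))
```

For the incompressible input the compressor's overhead makes cl exceed τ, so the "e-value" is below 1: no evidence against randomness, as expected.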
It turns out that this example generalizes: any e-statistic has a code-length-difference interpretation, which connects it to notions of evidence that have implicitly been suggested by the information theory community, often in the context of Minimum Description Length model comparison (Grünwald and Roos, 2020; Grünwald, 2007). Since it is less well-known, we do not spell out the desiderata coming from this tradition here but refer instead to Grünwald and Roos (2020) for those interested.

To summarize: e-statistics are best regarded as measures of evidence against a designated hypothesis, with "evidence for an alternative" emerging only in distinctive scenarios.

2.3 Optimality of e-statistics

Example 4 notwithstanding, in some settings there is a clearly defined alternative hypothesis H_1. In such cases, it is natural to call an e-variable for testing H_0 optimal if it tends to become large under H_1. In particular, if the data are in fact generated under H_1, we would like evidence against H_0 to accumulate quickly. This idea of "growing quickly under the alternative" can be formalized in several ways. The most natural, and by far the most studied, is log-optimality; the e-variable achieving it has been called GRO (growth-rate optimal) (Grünwald et al., 2024a; Lardy et al., 2024) or the numeraire (Larsson et al., 2025) (the latter paper also describes alternative notions of optimality, as does Koning 2024). In fact, the likelihood ratio from Example 2 is the log-optimal e-variable for testing P against Q. More broadly, this highlights that a given hypothesis may admit many different e-variables and e-processes. We return to this point in Section 6.

For now, let us turn to a discussion of the various statistical traditions and what properties they believe a measure of evidence should possess.
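Log-optimality can be seen numerically. For H_0 = {N(0,1)} and data from Q = N(μ,1), the e-variable E_λ(X) = exp(λX − λ²/2) of Example 3 has expected log-growth λμ − λ²/2 per observation, maximized at λ = μ, which is exactly the likelihood-ratio (GRO) choice. A sketch, with μ = 1 and the λ grid being our own assumptions:

```python
import random

random.seed(4)

MU = 1.0  # true mean under the alternative Q = N(MU, 1)
N = 100_000
data = [random.gauss(MU, 1.0) for _ in range(N)]

def avg_log_growth(lam):
    """Empirical per-observation growth rate E_Q[log E_lam], where
    E_lam(x) = exp(lam*x - lam^2/2) is an e-variable for N(0, 1)."""
    return sum(lam * x - lam * lam / 2.0 for x in data) / N

rates = {lam: avg_log_growth(lam) for lam in (0.25, 0.5, 1.0, 1.5, 2.0)}
for lam, rate in rates.items():
    theory = lam * MU - lam * lam / 2.0
    print(f"lambda={lam:4.2f}  empirical growth {rate:+.3f}  (theory {theory:+.3f})")

best = max(rates, key=rates.get)
print("best lambda on the grid:", best)  # the likelihood-ratio choice lam = MU
```

Betting too timidly (small λ) or too aggressively (λ well above μ) both slow the exponential accumulation of evidence, which is the sense in which the GRO e-variable is optimal.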
3 An Overview of Evidential Desiderata

Unsurprisingly, given the fierce debates at the foundations of statistics, there isn't a single checklist of evidential criteria adopted by all statisticians and philosophers of statistics alike. Instead, different statistical traditions argue for different criteria. Following Forster (2006), we group these traditions into three broad groups: (i) likelihood-based accounts, which treat evidence as relative support between hypotheses and elevate likelihood ratios as fundamental; (ii) Bayesian accounts, which quantify evidence by changes in model odds, typically via Bayes factors and their approximations; and (iii) error-probability accounts, which connect evidential strength to the ability of procedures to detect discrepancies (e.g., through type-I/II error rates or confidence intervals). The gray box below summarizes the main criteria proposed by these three constellations of statistical philosophies, augmented with some of our own. But let us first provide some more background on these three traditions. We begin with the latter.

The error-probability tradition. The error-probability tradition traces back to Fisher, Neyman, and Pearson. There is no unified view within this tradition of how statistics should be practiced. In fact, Fisher famously had significant disagreements with Neyman and Pearson, who proposed a fully decision-theoretic theory of statistics in which 'evidence' really plays no role—there are only decisions, risks, and error probabilities. Nevertheless, Fisher, Neyman, and Pearson were all frequentists, and informal references to 'evidence' within the Neyman-Pearson tradition abound.⁵ And, despite these disagreements, contemporary frequentist practice largely draws on a hybrid error-probability tradition combining ideas from Fisher, Neyman, and Pearson.
We can thus still identify, within this tradition, core assumptions on the desired properties of a measure of evidence. In particular, a measure of evidence should bound or diagnose the probability of being misled. In other words, the evidential strength of a given object or procedure is inseparable from its (frequentist) control of various error rates (type I/II error, risk control, false discovery control, etc.). Evidence is stronger when the method would rarely give such supportive results if the underlying claim were false. This is precisely the guarantee offered by a p-value, which is a statistic T = T(X) such that P(T ≤ α) ≤ α for all α ∈ (0,1) and all P ∈ H. (Here we would say that T is a p-value for H.) Thus, on the evidential view of a p-value, small values of T are evidence against H.

Extensions of p-values such as s-values (Greenland, 2019), second-generation p-values (Blume et al., 2019), and replication values (Killeen, 2005) are all rooted in the error-probability tradition. Confidence curves/distributions are also sometimes proposed as evidential summaries on this view (Xie and Singh, 2013). The error-statistics framework of Mayo and Spanos (Mayo and Spanos, 2011; Mayo, 1996) (which downplays the differences between Fisher and Neyman and focuses on commonalities instead) and Allan Birnbaum's "confidence concept of evidence" (Birnbaum, 1977) likewise sit squarely in the error-probability family.

⁵ In fact, even Neyman (1976) himself wrote 'my own preferred substitute for "do not reject H" is "no evidence against H is found"'.

Bayesian confirmation theory. The Bayesian statistical tradition is, naturally, rooted in the Bayesian view of probability, which treats model parameters as random variables instead of unknown constants.
An agent places an initial prior distribution π over parameters θ (often referred to as the agent's beliefs) and, upon observing data X, updates π(θ) via Bayes' rule to a posterior π(θ | X). Instead of controlling error, evidence on the Bayesian view is linked to confirmation: data provide evidence for a hypothesis or model to the extent that they increase its posterior support relative to its prior support.

The most common tool for quantifying evidence on the Bayesian view is the Bayes factor (Jeffreys, 1939; Kass and Raftery, 1995), which is typically used to compare two distinct hypotheses. In particular, for H_0 = {P_θ : θ ∈ Θ_0} and H_1 = {P_θ : θ ∈ Θ_1} equipped with priors π_0 and π_1, we can write the change in posterior odds after observing X as the prior odds multiplied by the Bayes factor:

    P(H_1 | X)/P(H_0 | X) = (P(H_1)/P(H_0)) · BF(X),  where  BF(X) = P(X | H_1)/P(X | H_0),    (5)

where P(X | H_j) = ∫_{Θ_j} P_θ(X) π_j(θ) dθ is the marginal likelihood under hypothesis j. Note that the posterior odds are defined by both prior distributions over hypotheses, denoted P(H_j), as well as prior distributions over parameters Θ_j, denoted π_j. For the purposes of the Bayes factor, however, only π_0 and π_1 need to be defined.

Bayes factors can be difficult to compute, and approximations such as BIC are therefore sometimes offered (Wagenmakers, 2007). However, for the purposes of an evidential standard, exact Bayes factors are the most popular among Bayesian confirmation theorists, and will thus be our main focus here. For more detailed discussion of various measures of Bayesian confirmation, see Fitelson (1999); Eells and Fitelson (2000).

The likelihood tradition.
Finally, there is the likelihood camp, which emerged as a rival to both Bayesianism, which it saw as too subjective, and error-probability accounts, which it saw as overly concerned with decision-making instead of evidence (Hacking, 2016; Royall, 1997; Edwards, 1972). That said, likelihoodists and Bayesians share a number of commitments, emphasizing criteria such as combination within and across studies, continuity, and consistency, and sharing the view that evidence is fundamentally comparative between two hypotheses (Taper and Lele, 2010; Taper and Ponciano, 2016; Lele, 2004). But likelihoodists are ultimately frequentists and wanted an objective, prior-free measure of evidence, thus turning to the likelihood ratio.

Unlike the Bayes factor, however, which marginalizes over the prior, the likelihood ratio is not immediately well-defined for composite hypotheses. In that case, one must appeal to some generalization, such as the generalized likelihood ratio (GLR)

    GLR(Θ_0, Θ_1) := sup_{θ_1∈Θ_1} p_{θ_1}(X) / sup_{θ_0∈Θ_0} p_{θ_0}(X),    (6)

or others (cf. Bickel 2011, 2012). Indeed, the Bayes factor is one possible extension to the composite setting, though likelihoodists tend to prefer using objective priors (Jeffreys, 1939) in this case. E-statistics, as we will continue to discuss, are another possible generalization. As a consequence, it is sometimes natural to refer to e-statistics as "generalized likelihood ratios," and we will do so throughout this paper. However, it is important to keep in mind that they are distinct from (6). Aside from these two, the evidential properties of extensions of the likelihood ratio to composite settings are not always clear, and often do not preserve the comfortable logic of the likelihood ratio: that θ_1 is a times as likely as θ_0 if p_{θ_1}(X) = a·p_{θ_0}(X).
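One concrete way to see that (6) and e-statistics diverge (our illustration, not from the paper): for n iid N(θ,1) observations with Θ_0 = {0} and Θ_1 = R, the supremum in the numerator of (6) is attained at the sample mean, giving GLR = exp(n·x̄²/2). Since n·x̄² has a χ²₁ distribution under the null, the null expectation of this GLR is infinite, so it cannot satisfy the e-variable condition E_{H_0}[E] ≤ 1. A Monte Carlo sketch:

```python
import math
import random

random.seed(5)

def glr(xs):
    """Generalized likelihood ratio for H0: theta = 0 vs H1: theta in R,
    with data iid N(theta, 1). The sup over theta is attained at the
    sample mean, giving GLR = exp(n * xbar^2 / 2)."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.exp(n * xbar * xbar / 2.0)

n, trials = 20, 50_000
sims = []
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # data truly from the null
    sims.append(glr(xs))

mean_glr = sum(sims) / trials
print(f"Monte Carlo mean of the GLR under H0: {mean_glr:.2f}")
```

Every simulated GLR is at least 1 (exp of a nonnegative quantity), so the empirical mean always exceeds 1 and keeps drifting upward with more trials; the point is that no bound of the form E_{H_0}[GLR] ≤ 1 can hold, which is precisely the "fairness" property e-statistics preserve.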
In Section 5 w e focus on the GLR as the natural composite extension, but w e trust that readers can extend the logic to a differen t generalization if desired. 6 With this background established, we turn to the eviden tial desiderata endorsed b y these three traditions. As noted in the introduction, w e divide these into t wo categories: static and dynamic. The static criteria are relatively standard in the literature and require little explanation. The dynamic criteria, by contrast, are more n uanced—and in some cases, en tirely nov el. W e further divide the dynamic criteria into t wo sub-categories: in ter-exp erimen t and in tra-exp erimen t. The former concerns the relationship b et ween multiple exp erimen ts, while the latter fo cuses on the in ternal dynamics of a single exp erimen t. Because these dynamic criteria are subtle, Section 5.2 devotes significan t space to clarifying their mean- ing, comparing them with related criteria, and illustrating them through sp ecific examples. Eviden tial Desiderata Static criteria Scalar measure. Evidence for or against a hypothesis should b e quantified b y a single real n umber. This is a criterion shared b y all traditions. Con tinuous measure. Evidence should b e a contin uous function of the data. That is, there is no discrete threshold at whic h something becomes evidence, which comes in degrees. This is again endorsed b y all three traditions. a See, e.g., Ro yall ( 1997 ); Lele ( 2004 ). Irrelev ance of the analyst. The evidence should dep end on the data only and not on who ev er is running the exp erimen t. That is, evidence should b e ob jectiv e. This is endorsed b y the likelihoo dists and error-probabilists, and explicitly rejected by the Bay esians. See T ap er and Ponciano ( 2016 ); Roy all ( 1997 ); Bick el ( 2012 ). Coherence. A measure of evidence should not assign higher evidence to an y implication of a h yp othesis than to the hypothesis itself ( Gabriel , 1969 ). 
Formally: if H ⊂ H′, then you should have at least as much support for H′ as you do for H (Schervish, 1996). Or, equivalently, for H ⊂ H′, the evidence against H′ should not exceed that against H (Bickel, 2024). This desideratum arose from multiple testing, and was only later applied to criticize p-values as valid measures of evidence (Schervish, 1996).[b]

[6] We note that in many important instances (i.e., choices of H), conditional likelihoods (Royall, 1997) and partial likelihoods, such as those underlying the Cox (1972) regression model, do turn out to be e-statistics (Hao and Grünwald, 2024).

Consistency. A measure of evidence is consistent if, roughly speaking, it favors the true hypothesis as the sample size increases. This has been formalized in various ways (cf. Grendár 2012; Berger et al. 2003). Here we will say that, when comparing H_0 and H_1, a measure of evidence M_n on n observations is consistent if M_n tends to ∞ in probability under H_1, and tends to 0 in probability under H_0. Consistency is typically promoted by likelihoodists (Royall, 1997) and Bayesians (Chib and Kuffner, 2016).

Sample size invariance. A given numerical value of the evidence should correspond to the same strength of evidence regardless of sample size or other aspects of the sampling design (i.e., "same score, same evidence"). For instance, an evidence value of x obtained with n = 20 should be comparable to the same value x obtained with n = 200. This is supported by Bayesians and likelihoodists; see Wagenmakers (2007); Royall (1986); Goodman and Royall (1988).

Scale invariance. One-to-one transformations of the sample space (e.g., unit changes) should not change the evidence. For instance, measuring distance in meters versus feet should not change the interpretation. This is supported by likelihoodists and Bayesians (Taper and Lele, 2011).
Reparameterization invariance. One-to-one transformations of the model parameters should not change the evidence. This is supported by likelihoodists (Hartigan, 1967; Berger, 1983; Lele, 2004; Taper and Lele, 2011) and (some) Bayesians (Jeffreys, 1946; Ghosh, 2011).

Single hypothesis validity. One should be able to give evidence for, or at least against, a single hypothesis, without comparing it to another hypothesis. This is explicitly endorsed by the error-probability tradition, and explicitly rejected by the likelihood tradition (Edwards, 1972; Royall, 1997).

Long-run error rates. Thresholds on the evidential scale should correspond to explicit bounds on the long-run frequency with which they are exceeded when the hypothesis is true. This is the main desideratum of the error-probability tradition. See, e.g., Neyman and Pearson (1928); Mayo (1996, 2018); Mayo and Spanos (2011).

Composite hypotheses. A measure of evidence should handle composite hypotheses. This is implicitly endorsed by Bayesians, who place priors over composite hypotheses, by error-probabilists, and by those likelihoodists who seek to generalize the standard likelihood ratio. It has also motivated a significant amount of recent work in the e-statistics literature (Ramdas et al., 2023).

Nonparametric hypotheses. A measure of evidence should handle nonparametric hypotheses, where likelihood ratios may be very hard to define (for technical reasons, such as not having reference measures with which to define densities). As above, this is implicitly endorsed by many proponents of most traditions, but some of them handle it significantly more satisfactorily than others.

The likelihood principle. If two data sets induce proportional likelihood functions for the parameters of interest, then they should carry the same evidence about those parameters.
This principle is foundational for the likelihood tradition (Edwards, 1972; Royall, 1997; Forster and Sober, 2004; Berger and Wolpert, 1988) and is also adopted by Bayesians who interpret the likelihood function as a marginal likelihood.

Composability under dependence. It should be possible to combine evidence from multiple sources without requiring strong assumptions on their dependence structure. Such a desideratum arises more from practice than from philosophy. In multiple testing, for instance, it is common to want to combine p-values, which can rarely be considered independent (Vovk et al., 2022; Efron, 2012). But variants of composability are supported by both the likelihoodist and Bayesian traditions. See Royall (1997); Bickel (2012); Morey et al. (2016).

Inter-experiment dynamic criteria

Accumulation I: Fixed design. When data arise from a fixed number of independent experiments, the total evidence should accumulate from the evidence contributed by the component parts. This is supported by both the likelihoodist and Bayesian traditions. See Royall (1997); Bickel (2012); Morey et al. (2016).

Accumulation II: Flexible design. When data arise from a sequence of independent experiments, and the decision to perform the next experiment may depend on the result of the previous one, the total evidence should accumulate from the evidence contributed by the component parts. This generalizes Accumulation I by allowing the number of experiments to be data-dependent. The criterion is discussed frequently in the literature on e-values, where it serves as part of their motivation (Grünwald et al., 2024a).

Dynamic consistency. When assessing the evidence in an overall sample, we should obtain the same value regardless of whether we analyze the sample as a whole or partition it into sub-samples and combine the results.
Variations of this desideratum have been studied extensively in the economics and decision-theory literature (Epstein and Schneider, 2003; Grünwald and Halpern, 2011).

Inter-experiment counterfactuals. A measure of evidence should not depend on data that were not observed. More concretely, it should not depend on decisions about whether or not to continue collecting a new batch of data in counterfactual scenarios in which the data were different from what they actually were.

Intra-experiment dynamic criteria

Intra-experiment optional stopping. The reason an experiment was stopped should not affect its evidential value. For example, stopping early because the initial results appear promising, or continuing longer than originally planned, should not change the evidence. This is closely related to, but distinct from, Accumulation II: here the issue is whether to stop or continue a single experiment, rather than whether to run an additional one. Handling optional stopping satisfactorily is a major motivation in modern sequential analysis (Ramdas et al., 2023).

Intra-experiment counterfactuals. The measure of evidence should depend only on the realized event that the observer learns has occurred, and not on unrealized contingencies in the experimental protocol. Thus, if two descriptions of the experiment yield the same realized event Y, they should yield the same evidence. This criterion, together with the previous one, has motivated criticism of p-values (Wagenmakers, 2007; Forster and Sober, 2004). The general issue of counterfactuals is often brought up by likelihoodists.

[a] Note this is distinct from decision-making in the error-probability tradition, which thresholds p-values in order to either reject or sustain the null hypothesis. But as a measure of evidence, the p-value is meant to be continuous.
[b] This is distinct from the notion of coherence often discussed in a Bayesian context, which is the requirement that a set of probabilities be immune from a Dutch book (Berger, 1983).

4 E-statistics, P-values, Likelihood Ratios, Bayes Factors

Before investigating to what extent e-statistics satisfy the desiderata listed above, let us provide more detail on how they relate to p-values, likelihood ratios, and Bayes factors. We begin with p-values. E-values can be converted to p-values and vice versa. Indeed, for an e-value E, 1/E is a p-value by Markov's inequality: P(1/E < α) = P(E > 1/α) ≤ α. The inverse of a p-value is not an e-value in general. E-values tend to be more conservative than p-values, i.e., more extreme data is needed to reach a threshold 1/α. P-values can be converted to e-values via calibrators (Vovk and Wang, 2021): non-negative functions f such that E_H[f(P)] ≤ 1. Examples of calibrators include f(p) = κ p^{κ−1} for any κ ∈ (0,1), and f(p) = ∫₀¹ κ p^{κ−1} dκ. Unfortunately, the e-values obtained via calibrators are usually not very powerful, and good (for example, log-optimal) e-values must typically be designed directly.

Next, let us introduce the following general method for constructing e-variables and e-processes when comparing two composite hypotheses.

Example 5 (Universal inference). Consider independent and identically distributed (iid) observations X_1, X_2, ..., X_t, null H_0, and alternative H_1, both of which may be composite. Wasserman et al. (2020) introduce the universal inference estimator

$$U_t := \prod_{i=1}^{t} \frac{\hat{q}_i(X_i)}{\hat{p}_t(X_i)}, \qquad (7)$$

where q̂_i is some density in the alternative H_1 which may be based on the first i − 1 samples, and p̂_t is the maximum likelihood estimate among all distributions in the null, based on all t observations.[a]
Alternatively, one can also equip H_1 with a prior π (which must be independent of X_1, ..., X_t) and consider

$$V_t := \int \prod_{i \le t} \frac{q(X_i)}{\hat{p}_t(X_i)} \, \mathrm{d}\pi(q), \qquad (8)$$

which we call the "method of mixtures." To see that U_t is indeed an e-variable for H_0, note that ∏_{i≤t} p̂_t(X_i) ≥ ∏_{i≤t} p(X_i) for all densities p in the null, by definition, so for all P ∈ H_0,

$$\mathbb{E}_P[U_t] \le \mathbb{E}_P\!\left[\prod_{i \le t} \frac{\hat{q}_i(X_i)}{p(X_i)}\right] = \int \left(\prod_{i \le t} \frac{\hat{q}_i(x_i)}{p(x_i)}\right) p(x_1) \cdots p(x_t) \, \mathrm{d}x_1 \cdots \mathrm{d}x_t = \prod_{i \le t} \int \hat{q}_i(x_i) \, \mathrm{d}x_i = 1,$$

using that each q̂_i is a probability density. Similar arithmetic applies to V_t. Moreover, (U_t)_{t≥1} and (V_t)_{t≥1} are e-processes.

[a] Here, for ease of comparison to other e-processes, we consider the case without a holdout dataset. Universal inference is more commonly known for the case where we do have a holdout dataset; we might compute q̂_i on these data to maximize U_t, in which case q̂_1 = ··· = q̂_t. This is called "split universal inference."

When both hypotheses are simple, the universal inference e-variable collapses to the likelihood ratio. While it serves as an illuminating example, we do not advocate its use in all situations: in the case of a simple alternative and a composite null, the numeraire e-variable (Larsson et al., 2025) is inherently preferable to the universal inference e-variable (even if we have a different notion of optimality in mind than log-optimality): for all outcomes, the numeraire is at least as large as the universal inference e-variable, and for some outcomes the inequality is strict. On the other hand, the numeraire defines, in general, an e-variable only and not an e-process.[7]
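To make Example 5 concrete, here is a small sketch of the universal inference e-variable (7) for an illustrative problem of our own choosing (not from the text): a one-sided Gaussian null H_0 = {N(µ,1): µ ≤ 0}, with the predictable plug-in q̂_i = N(mean of the first i − 1 points, 1) playing the role of the alternative:

```python
import numpy as np

def universal_inference(x):
    """Universal inference e-variable U_t of (7), illustrative Gaussian version.

    Null: H0 = {N(mu,1): mu <= 0} (composite). Alternative densities are the
    predictable plug-ins q_i = N(running mean of x[:i-1], 1). The (2*pi)^(-t/2)
    normalizing constants cancel between numerator and denominator.
    """
    x = np.asarray(x)
    t = len(x)
    # Predictable plug-in means: 0 before any data, then running means of the past.
    mu_hats = np.concatenate([[0.0], np.cumsum(x)[:-1] / np.arange(1, t)])
    log_num = -0.5 * np.sum((x - mu_hats) ** 2)
    # Null MLE using all t points: project the sample mean onto mu <= 0.
    mu0 = min(x.mean(), 0.0)
    log_den = -0.5 * np.sum((x - mu0) ** 2)
    return np.exp(log_num - log_den)

rng = np.random.default_rng(2)
u_null = universal_inference(rng.normal(-0.5, 1.0, 200))  # data from the null
u_alt = universal_inference(rng.normal(+0.5, 1.0, 200))   # data from an alternative
print(u_alt > 1 and u_alt > u_null)
```

Under the null the statistic stays stochastically small (its null expectation is at most 1, per the argument above), while under the alternative it grows exponentially in t.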
The method of mixtures (pioneered by Herbert Robbins in the context of obtaining confidence sequences; Robbins, 1970) and the method of predictable plug-ins, respectively illustrated by V_t and U_t above, are common strategies in the world of e-statistics, and illustrate how e-statistics can serve as a bridge between Bayesians and frequentists. Indeed, while e-statistics are fundamentally frequentist objects, the method of mixtures shows how they can accommodate priors while retaining their frequentist properties. Meanwhile, the method of predictable plug-ins can be seen as learning a distribution over the alternative over time, which also has a very Bayesian spirit. With Bayesianism in mind, let us give an example of a Bayes factor that is also an e-variable.

Example 6 (t-test Bayes factor). Consider iid observations X_1, ..., X_t ~ N(µ, σ²), where σ > 0 is unknown, and suppose we wish to test the null H_0 := {N(0, σ²) : σ > 0} against alternatives with nonzero standardized effect size δ := µ/σ. Let W be a proper prior on δ, and equip the nuisance scale σ with the (improper) right-Haar/Jeffreys prior w_H(σ) ∝ 1/σ. The resulting Bayes factor is (Rouder et al., 2009):

$$B_t := \frac{\int_{\mathbb{R}} \int_0^\infty \prod_{i=1}^t \frac{1}{\sigma}\,\rho\!\left(\frac{X_i - \sigma\delta}{\sigma}\right) \frac{\mathrm{d}\sigma}{\sigma} \, W(\mathrm{d}\delta)}{\int_0^\infty \prod_{i=1}^t \frac{1}{\sigma}\,\rho\!\left(\frac{X_i}{\sigma}\right) \frac{\mathrm{d}\sigma}{\sigma}}, \qquad (9)$$

where ρ denotes the standard normal density. Although the prior on σ is improper, B_t is an e-variable for H_0 (in fact, (B_t)_{t≥1} is an e-process for H_0) (Grünwald et al., 2024a; Wang and Ramdas, 2025). Indeed, under a mild (2 + ϵ)-moment condition, this e-variable is log-optimal (Pérez-Ortiz et al., 2024).

[7] By this we mean that if we construct the numeraire e-variables E_1, E_2, ..., with E_i the numeraire variable for the sample (X_1, ..., X_i), the resulting process (E_t)_{t≥1} is not an e-process (Ramdas et al., 2023).
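As an aside on the method of mixtures above: the mixing integral sometimes has a closed form. For the simple null H = {N(0,1)} and likelihood-ratio payoffs exp(λS_t − λ²t/2), with S_t the running sum, a N(0, τ²) prior on λ yields a Gaussian integral that can be evaluated explicitly. The formula below is a standard computation under these assumptions (our own illustration, not an expression from the paper), verified numerically:

```python
import numpy as np

def mixture_evalue(s, t, tau=1.0):
    """Closed-form method-of-mixtures e-variable for H = {N(0,1)}.

    Integrates exp(lam*s - lam^2*t/2) over a N(0, tau^2) prior on lam:
      exp(tau^2 * s^2 / (2*(1 + t*tau^2))) / sqrt(1 + t*tau^2).
    """
    return np.exp(tau**2 * s**2 / (2 * (1 + t * tau**2))) / np.sqrt(1 + t * tau**2)

# Verify against brute-force numerical integration over the prior.
s, t, tau = 3.0, 10, 1.0
lam = np.linspace(-8, 8, 400_001)
h = lam[1] - lam[0]
prior = np.exp(-lam**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
numeric = np.sum(np.exp(lam * s - lam**2 * t / 2) * prior) * h
print(abs(numeric - mixture_evalue(s, t, tau)) < 1e-6)
```

At s = 0 the mixture e-value equals 1/√(1 + tτ²) < 1: mixing over the prior pays a price in wealth when the data carry no signal, in exchange for being simultaneously powerful against a range of alternatives.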
To say more about the relationship of e-statistics to Bayes factors: if Θ_0 = {θ_0} is simple, then the Bayes factor is an e-variable; the logic is the same as in (4). In general, however, the Bayes factor is not an e-variable: it only has expectation 1 under the prior-averaged null, not uniformly over every θ_0 ∈ Θ_0. That said, Grünwald et al. (2024a) show that, for sufficiently regular parametric null hypotheses with simple alternatives, log-optimal e-variables are Bayes factors under a specific prior on Θ_0. If the alternative hypothesis is composite, then one can obtain e-variables by equipping the alternative with an arbitrary prior determined by the analyst. There is then a unique prior on Θ_0, depending on the prior on Θ_1, such that the resulting Bayes factor is an e-variable. In this way, optimal e-variables can be seen as a subset of Bayes factors, though only the prior on the alternative, not on the null, is independent of the analyst.

As for the relationship between likelihood ratios and Bayes factors, the latter reduces to the former in point-versus-point settings, as the priors are determined (they are atoms). For composite hypotheses, the Bayes factor can be seen as an extension of the likelihood ratio if one allows for distributions over the parameter space.

Next let us turn to a powerful e-variable for bounded random variables.

Example 7 (Bounded mean testing). For iid random variables X_1, ..., X_t lying in [0,1] with mean µ, consider the object M_t := ∏_{i=1}^t [1 + λ_i(X_i − µ)], where λ_i lies in [−1/(1 − µ), 1/µ] and can be based on X_1, ..., X_{i−1} but not X_j for j ≥ i (that is, it is predictable). Then M_t is nonnegative and satisfies E_H[M_t] = 1, where H is the set of all distributions on [0,1] with mean µ (whether continuous, discrete, or mixed).
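The wealth process of Example 7 is easy to simulate. The sketch below uses a deliberately simple predictable bet λ_i (a clipped multiple of the gap between a shrunk running mean and µ; this betting rule is our own illustrative choice, not the plug-in strategies studied by Waudby-Smith and Ramdas):

```python
import numpy as np

def betting_wealth(x, mu):
    """Wealth process M_t = prod_{i<=t} [1 + lam_i (X_i - mu)] of Example 7.

    X_i must lie in [0,1]; mu is the null mean. The bet lam_i depends only on
    the past (it is predictable) and is clipped to [-1/(1-mu), 1/mu], which
    keeps every wealth factor nonnegative.
    """
    wealth, past_mean, path = 1.0, 0.5, []
    for n, xi in enumerate(x, start=1):
        lam = np.clip(2.0 * (past_mean - mu), -1 / (1 - mu) + 1e-6, 1 / mu - 1e-6)
        wealth *= 1.0 + lam * (xi - mu)
        past_mean += (xi - past_mean) / (n + 1)  # shrunk running mean of the past
        path.append(wealth)
    return np.array(path)

rng = np.random.default_rng(3)
x = rng.uniform(0.2, 1.0, size=500)      # true mean 0.6
m_wrong = betting_wealth(x, mu=0.4)[-1]  # null mean 0.4 is misspecified
m_right = betting_wealth(x, mu=0.6)[-1]  # null mean 0.6 is correct
print(m_wrong > 100 and m_right < m_wrong)
```

Under the correct null the wealth stays stochastically small (E_H[M_t] = 1), while under a misspecified null mean the bettor's wealth grows exponentially, quantifying evidence against H.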
In fact, the process (M_t)_{t≥0} is an e-process for testing the null hypothesis that the data are iid from P, for any P ∈ H. Such e-processes were leveraged recently by Waudby-Smith and Ramdas (2024) to obtain state-of-the-art confidence intervals/sequences. The predictable sequence (λ_t)_{t≥1} is typically chosen via the method of predictable plug-ins (see Waudby-Smith and Ramdas 2024, Appendix B for an overview), but the method of mixtures has also been studied in this context (Stark, 2020).

Note that unlike in universal inference, the example above defines an e-statistic for all distributions with mean µ without an explicit alternative in mind, similar to what we have seen in Example 4. It is thus a good example of using a pragmatic, or instrumental, alternative to design a powerful e-variable for the null, which in this case is all distributions on [0,1] with mean µ. It also shows that in practice, e-variables may superficially look very different from likelihood ratios. Nevertheless, by Ramdas et al. (2020, Proposition 4), it turns out that for all P ∈ H, there exists some distribution Q such that M_t = q(X_1, ..., X_t)/p(X_1, ..., X_t), i.e., the likelihood ratio between Q and P. Here the q in the numerator varies along with the p in the denominator. Such a "likelihood-like" interpretation holds for general e-statistics. We discuss this more in Section 6.

It's worth emphasizing that all of the e-statistics discussed here obey the basic logic described in Section 2, which lends them a unified evidential interpretation. This is unlike the situation with likelihood ratios, whose proposed generalizations do not sit comfortably together.

5 Comparing Evidence Measures

We now spell out in more detail how p-values, Bayes factors, likelihood ratios, and e-statistics fare as measures of evidence. Table 1 summarizes the discussion.
Regarding Bayes factors, we will draw a distinction between the objective and subjective traditions. In the objective tradition, priors are chosen by formal rules that are determined by the problem only and are independent of the analyst (Jeffreys, 1939; Kass and Wasserman, 1996; Berger, 2006), whereas in the subjective tradition priors can be any distribution. References to "objective/subjective Bayes factors" should be assumed to mean Bayes factors in the objective/subjective tradition. We will also discuss both the simple-versus-simple likelihood ratio, which we will refer to simply as the likelihood ratio, and the generalized likelihood ratio defined in (6), which we will refer to as the GLR.

For many of the desiderata listed in Section 3, the distinction between e-variables and e-processes is immaterial. In these cases, we will refer simply to e-statistics without differentiating the two. The distinction becomes important when discussing counterfactuals and optional continuation; there we will be sure to explicitly distinguish them.

5.1 Static Criteria

Continuous measure. P-values, likelihood ratios, and Bayes factors are all continuous measures of evidence (Schervish, 1996; Royall, 1997). Smaller p-values are interpreted as more evidence against the null; larger likelihood ratios and Bayes factors are interpreted as evidence in favor of Θ_1 over Θ_0. E-statistics are likewise continuous: larger values are more evidence against the null.

Irrelevance of the analyst. The desideratum says that two analysts, analyzing the same data and the same H_0 (and, if present, H_1), should come to the same conclusions.
This desideratum may be violated if (i) they employ different prior distributions within H_0 (and potentially H_1), whenever these are composite; (ii) they employ different test statistics; or (iii) they employ different sampling plans (i.e., the rule that specifies how data will be collected) but happen to obtain the same statistic. Here we are concerned with whether the candidate notions satisfy the desideratum in the sense of not violating either (i) or (ii); (iii) is discussed further below in Section 5.2.

It is easy to check that likelihood ratios and GLRs satisfy the desideratum. Bayes factors in the subjective tradition do not satisfy it by design, whereas Bayes factors in the objective tradition do, as the prior is given by the structure of the problem. Intriguingly, while the "subjectivity" of Bayes factors is a common criticism of Bayesians made by frequentists, it is questionable whether p-values satisfy this criterion. Indeed, there are often multiple p-values for any given problem. Which one is calculated and reported is, of course, dependent on the analyst. We have thus marked p-values as only partially satisfying this requirement. Similar comments apply to e-statistics. There are multiple ways to play the betting game, and different analysts might thus use different e-statistics and come to different conclusions. Further, while they are perfectly well-defined in the frequentist setting, e-statistics can be defined in the Bayesian setting and, as discussed above, optimal e-variables can often be recovered as Bayes factors with (sometimes peculiar) priors. Like p-values, then, we mark e-statistics as only partially satisfying this criterion.

Coherence. Both p-values and Bayes factors can be incoherent (Lavine and Schervish, 1999; Fossaluza et al., 2017).
Moreover, since the notion of coherence requires evidence for or against composite hypotheses, the ordinary simple-versus-simple likelihood ratio is too narrow to address this desideratum. The GLR, however, is coherent (Bickel, 2012; Fossaluza et al., 2017). Not all e-statistics are coherent. Bickel (2024) observed that the numeraire e-variable (Larsson et al., 2025) is not coherent, whereas the universal inference e-variable (and e-process) is. We therefore mark e-statistics as only partially satisfying this requirement in general. See Ramdas and Wang (2025, Chapter 6) for further discussion of coherent e-variables.

Consistency. Not all e-statistics are consistent, but many are. In fact, it is typically easy to construct e-variables and e-processes which satisfy this property. To give but a short list: those designed for testing means of bounded random variables (Waudby-Smith and Ramdas, 2024), those designed for testing exponential families (Grünwald et al., 2024b; Hao and Grünwald, 2024), those for two-sample testing (Shekhar and Ramdas, 2023), and others (Waudby-Smith et al., 2025). Still, there are obviously e-statistics that are inconsistent, an example being the constant process E_t ≡ 1 that ignores all data. Meanwhile, recent work shows that universal inference can be asymptotically conservative, suggesting that universal inference e-values may not satisfy asymptotic consistency (Takatsu, 2025). The p-value is not consistent, while the likelihood ratio is (Grendár, 2012). The GLR is consistent under regularity and separability assumptions (Bickel, 2012), but not in general. Bayes factors are consistent in most practical settings (so we marked the desideratum as satisfied), though there are some exceptions depending on the prior and the hypothesis class (Casella et al., 2009; Moreno et al., 2010).

Sample size invariance.
Both likelihood ratios and subjective Bayes factors have sample-size-invariant interpretations. The likelihood ratio, for instance, is interpreted as: if p_{θ_1}(X)/p_{θ_0}(X) = c, then the data are c times as likely under p_{θ_1} as they are under p_{θ_0}, a statement which does not mention the sample size. Or, in the words of Goodman and Royall (1988), "the likelihood ratio has the same meaning in trials of different designs and sizes." Similarly for the GLR. Meanwhile, e-statistics can be seen as monetary gains in a betting game, an interpretation which also does not depend on the sample size. P-values, on the other hand, are not sample size invariant (Goodman and Royall, 1988). For objective Bayes, the answer depends on the hypotheses under consideration. For some hypotheses (such as in linear regression), the prior advocated in some objective Bayes methods depends on the covariates (De Heide and Grünwald, 2021) and therefore, implicitly, also on the sample size. Thus, sample size invariance fails.

Scale invariance. Both likelihood ratios and Bayes factors are invariant to one-to-one transformations of the data because they are ratios of probabilities. To elaborate, under any such transformation both densities are multiplied by the same quantity: if Y = g(X), then the new density for Y is p_{θ_j}(y) = p_{θ_j}(x) |det J_{g⁻¹}(y)|, where J_{g⁻¹} is the Jacobian. Thus, the changes to numerator and denominator cancel out. (Note the Jacobian only appears for continuous-valued data; for discrete data the invariance is immediate.)

Desideratum | Trad. | p-values | LR (simp.) | GLR | BF (subj.) | BF (obj.) | e-stat. (all) | e-stat. (some)
Scalar measure | B,E,L | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Continuous measure | B,E,L | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Irrelevance of analyst | E,L | ≈ | ✓ | ✓ | – | ✓ | ≈ | ✓
Coherence | O | – | – | ✓ | – | – | – | ✓
Consistency | B,L | – | ✓ | ≈ | ✓ | ✓ | – | ✓
Sample size invariance | B,L | – | ✓ | ✓ | ✓ | ≈ | ✓ | ✓
Scale invariance | B,L | – | ✓ | ✓ | ✓ | ✓ | ≈ | ✓
Reparam. invariance | B,L | – | ✓ | ✓ | ≈ | ≈ | ≈ | ✓
Single hypothesis validity | E | ✓ | – | – | – | – | ✓ | ✓
Error rates | E | ✓ | ✓ | ≈ | – | – | ✓ | ✓
Composite hypotheses | O | ✓ | – | ✓ | ✓ | ✓ | ✓ | ✓
Nonparametric hypotheses | O | ✓ | – | ✓ | ≈ | ≈ | ✓ | ✓
Likelihood principle (strict) | L | – | ✓ | ✓ | ✓ | – | – | ≈
Likelihood principle (loose) | L | – | ✓ | ✓ | ✓ | ≈ | ✓ | ✓
Composability | O | ✓ | – | – | – | – | ✓ | ✓
Accumulation I (fixed) | B,E,L | ✓+ | ✓+ | – | ✓ | ≈ | ✓+ | ✓+
Accumulation II (flexible) | O | – | ✓+ | – | ✓ | – | ✓+ | ✓+
Dynamic consistency | B,O | – | ✓+ | – | ✓ | – | – | ✓+
Inter-experiment counterfactuals | O | – | ✓+ | – | ✓ | – | ✓+ | ✓+
Intra-experiment counterfactuals | B,L | – | ✓+ | –∗ | ✓ | – | – | ≈
Intra-experiment optional stopping | B,L | – | ✓+ | –∗ | ✓ | – | – | ✓+

Table 1: Comparison of how different evidence functions fare with respect to various evidential desiderata. Here ✓ indicates the desideratum is largely satisfied, "–" that it is either not satisfied or cannot be meaningfully defined, and "≈" that it is partially satisfied. In the final six lines, ✓+ means that the evidence is well-defined, has a clear interpretation, and provides valid error rates; ✓ means that it is well-defined and has a clear interpretation, but in general cannot be used to infer valid error rates; "–∗" means that the evidence remains well-defined, so the desideratum is formally satisfied, yet it lacks any clear justification and does not lead to valid error rates, so it is hard to imagine that one would ever want to use it. The "Trad." column indicates which tradition the desideratum comes from: L = likelihood tradition, B = Bayesian tradition, E = error-probability tradition, O = Other.
Since there are typically multiple e-statistics for a given hypothesis, we break their analysis into two categories: "all," meaning all e-statistics satisfy the criterion, or "some," meaning that there exists at least one e-statistic which does (though typically there are many). See the text for more discussion in each case.

The GLR is scale invariant for the same reason. Scale invariance for e-statistics is more subtle. Some e-statistics automatically satisfy this property, such as universal inference (again using that it is a ratio of probabilities) and those based on self-normalized processes (Wang and Ramdas, 2025) (since self-normalized statistics are scale free). However, not all e-statistics are immediately scale invariant. Consider E(X) = 1{X ≤ 1}/P(X ≤ 1) for any distribution P with P(X ≤ 1) > 0. This is clearly an e-variable. If Y = aX, then E(Y) ≠ E(X), so E is not scale invariant. This is because a threshold has been hard-coded in the example. One should arguably parameterize this threshold and instead consider the family E_s(X) = 1{X ≤ s}/P(X ≤ s). In this case, E_s(Y) = E_{s/a}(X). Allowing the parameter space to change alongside the data is natural. For instance, following Example 3, E_λ(X) = exp(λX − λ²/2) is an e-variable for H = {N(0,1)}. Suppose we let Y = aX, so under the null, Y ~ N(0, a²). Then E is scale invariant in the sense that E_λ(X) = E_{λa}(Y). Notice that E_λ(X) is the likelihood ratio of N(λ,1) and N(0,1). That is, the parameter λ encodes the alternative distribution. Hence, when we refer to the likelihood ratio being scale invariant, we are implicitly allowing the parameter to change.

Reparameterization invariance. Using the same logic as above, likelihood ratios are invariant to transformations of parameters (cf. McCullagh and Cox 1986).
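The scale-invariance identity E_λ(X) = E_{λa}(Y) just discussed is easy to verify numerically. Writing E_λ for the likelihood ratio of N(λ,1) against N(0,1), and E′_µ for the likelihood ratio of N(µ, a²) against N(0, a²), the following sketch (our own check, not from the text) confirms E_λ(X) = E′_{λa}(aX):

```python
import numpy as np

def e_original(x, lam):
    """Likelihood ratio of N(lam, 1) vs N(0, 1): exp(lam*x - lam^2/2)."""
    return np.exp(lam * x - lam**2 / 2)

def e_rescaled(y, mu, a):
    """Likelihood ratio of N(mu, a^2) vs N(0, a^2): exp(mu*y/a^2 - mu^2/(2a^2))."""
    return np.exp(mu * y / a**2 - mu**2 / (2 * a**2))

rng = np.random.default_rng(4)
x, lam, a = rng.normal(size=10), 0.7, 3.0
# Rescaling the data by a, with the parameter rescaled to lam*a, leaves the
# e-variable's value unchanged: E_lam(X) = E'_{lam*a}(aX).
print(np.allclose(e_original(x, lam), e_rescaled(a * x, lam * a, a)))
```

The algebra behind the output is immediate: exp(λa · ax / a² − (λa)²/(2a²)) = exp(λx − λ²/2).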
So too are Bayes factors, if one allows the prior to change along with the reparameterization. That is, if we begin with a prior π over parameters θ, reparameterize θ ↦ ϕ(θ), and consider the pushforward measure f(ϕ) := π(θ) |det dθ/dϕ| as the new prior, then the Bayes factor is invariant. However, taking the pushforward is not always done in practice. For instance, choosing a uniform prior in one parameterization will not necessarily yield a uniform prior in another. Thus, if one is committed to a specific class of priors, the Bayes factor can depend on the particular parameterization. Some Bayesians find this troubling, which led to invariant priors (Jeffreys, 1946). Neither objective Bayes factors nor subjective Bayes factors always use invariant priors, however.

A similar nuance applies to e-statistics as in the case of scale invariance. As an object satisfying sup_{P∈H} E_P[E] ≤ 1, a change in the parameter space simply relabels the null hypothesis, and does not affect the guarantee afforded by the e-statistic. In other words, e-statistics are invariant to transformations of the parameter in the sense that they remain e-statistics. However, if one is committed to a specific "generating procedure" for e-variables (e.g., universal inference with prior π(θ), or constructing E_λ using a specific λ), then the value might change after a transformation of the parameters. Therefore, we mark general e-statistics as only partially satisfying this requirement. Finally, not all p-values are reparameterization invariant, leading to the search for some which are (Evans and Jang, 2010).

Single hypothesis validity. P-values and e-statistics are defined in terms of only a single hypothesis H, whereas both likelihood ratios and Bayes factors require two. See, e.g., Examples 4 and 7.
That is, the latter provide relative evidence between two hypotheses, while the former provide a notion of absolute evidence against a single hypothesis.

Long-run error rates. This is perhaps the defining feature of evidence in the error-probability tradition. It holds that if one uses a measure of evidence to make decisions (which is itself already a controversial endeavor, and marks one of the divides between the Fisherian and Neyman-Pearson schools of thought), then the frequentist error of those decisions should be controlled. More precisely, if we repeat the decision process many times, we should have some guarantee on how often the decision will be wrong. As discussed in Section 2, Markov's inequality (and Ville's inequality for e-processes) gives us control over type-I error rates for e-statistics. The likelihood ratio, being an e-value, also gives error control, but the GLR on its own does not. There, error rates must be obtained by appealing to the sampling plan, underlying distribution, or other machinery. Since, in practice, GLRs are almost always used within a context in which such error rates are analyzed as a function of the GLR for a given sampling distribution, we still marked them as approximately satisfying this criterion. Subjective Bayes factors also do not provide type-I error control (Grünwald et al., 2024a); for objective Bayes factors, there are some cases in which type-I error control is provided, such as the t-test Bayes factor of Example 6, but usually it is not (note that we refer to exact, nonasymptotic error rates here). It is worth mentioning so-called "post-hoc validity," a new development in the theory of e-statistics. When measures of evidence are used to make decisions, e-statistics continue to provide meaningful error guarantees, in the form of control on expected risk, under data-dependent significance levels (Grünwald, 2023, 2024; Chugg et al., 2026). This is not allowed by traditional p-values. Further, the only subset of p-values which do provide such a guarantee turn out to be precisely the inverses of e-variables (Wang and Ramdas, 2022; Koning, 2023).

Composite and nonparametric hypotheses. The only evidential measure that, by definition, cannot handle composite hypotheses is the simple-versus-simple likelihood ratio. All other measures handle composite hypotheses and, in principle, even handle highly complex nonparametric hypotheses. Some evidential measures, however, handle such complex classes more naturally than others. For instance, there exist simple e-statistics for the class of all bounded distributions with mean µ (Example 7), which is a large nonparametric family. By contrast, to define a Bayes factor for this same hypothesis one must place a prior on an infinite-dimensional space of distributions (or otherwise restrict attention to a parametric or semiparametric sub-model). This can certainly be done, but it is considerably less direct and substantially more prior-dependent (and can lead to less evidence against hypotheses that are evidently wrong in light of the data; see Li et al., 2023). For this reason, we regard this criterion as only partially satisfied by both subjective and objective Bayes factors.

The likelihood principle. The likelihood principle is something like the founding charter, or the constitution, of the likelihoodists. It is useful to distinguish between two versions: the strict and the loose likelihood principle. The strict version holds that if two data sets induce proportional likelihood functions for the parameters of interest (i.e., for all parameters in H_0 and H_1), then the evidence should be the same for both data sets. The loose version holds that evidence can always be written as, or at least upper bounded by, a likelihood ratio with some element of H_0 in the denominator.
Let us deal first with the strict likelihood principle. This is, of course, satisfied by the likelihood ratio. For the subjective Bayes factor, it is satisfied if one identifies likelihood with marginal likelihood over the prior. Here we assume, as is usually done, that the priors are chosen independently of the sampling plan, i.e., the analyst's prior assessment of the parameters determining the actual distribution is independent of the sampling plan employed. This is in contrast to objective Bayesian approaches, in which priors are chosen by formal means that often depend on aspects of the sampling plan such as the stopping time (Berger and Wolpert, 1988). Indeed, Berger and Wolpert (1988) and De Heide and Grünwald (2021) give concrete examples where Jeffreys' prior depends on the sampling plan; hence the value of the objective Bayes factor is not merely determined by the likelihoods, and as such it violates the strict likelihood principle. P-values violate it for the same reason, whereas the GLR satisfies it. As for e-processes, the strict likelihood principle is satisfied by the likelihood ratio e-process and the universal inference e-process (Example 5). There is a subtle catch which prevents us from saying that the strict version is always satisfied by e-statistics, however. We describe this issue further below under intra-experiment counterfactuals.

The loose version of the likelihood principle is clearly satisfied by any evidential measure satisfying the strict version. Moreover, it is once again violated by p-values. As for e-statistics, as shown in Ramdas and Wang (2025, Theorem 14.5) (a special case of which we mentioned in Section 4), any e-variable E for H is dominated by likelihood ratios, in the sense that for any P ∈ H there exists some Q such that E(X) ≤ q(X)/p(X); see Section 6 for a simple proof.
For objective Bayes, this is not the case; but since objective Bayesian evidence can always be written as a likelihood ratio with a (potentially improper) prior over Θ_0 in the denominator, we still mark it as partially satisfied.

Composability under dependence. The convex combination of any set of (arbitrarily dependent) e-statistics remains an e-statistic: if E_1, …, E_K are e-variables, then ∑_{k≤K} λ_k E_k is an e-variable whenever ∑_{k≤K} λ_k ≤ 1 and λ_k ≥ 0. Similarly for e-processes. In fact, the only admissible way of merging e-variables is via such a combination with ∑_{k≤K} λ_k = 1 (Wang, 2025). There also exist rules for combining arbitrarily dependent p-values, including twice the average and twice the median of the p-values. See Vovk et al. (2022) for a nice overview. Bayes factors and likelihood ratios are not composable under dependence; they require (conditional) independence. See the first desideratum in the following section.

5.2 Dynamic Criteria

To assess the dynamic criteria introduced in Section 3, it will be convenient to introduce some unified notation that we can use throughout the section. We will assume that we want to measure the evidence of some data X* which decomposes as X* = (X^(1), X^(2), …, X^(τ)). For the inter-experiment dynamic criteria (the first four criteria below), X^(i) is itself the data produced by some study. For the intra-experiment criteria, X^(i) is more fruitfully considered to be a single datum and X* is the data of one study. In both scenarios, the total number of studies run or observations collected, τ, may or may not be known in advance (depending on the desideratum).

Accumulation I: Fixed-Design.
If the X^(j) are independent and τ is known in advance or determined independently of the data, then the likelihood ratio decomposes as LR = ∏_{j≤τ} p_1(X^(j))/p_0(X^(j)), meaning that we can combine evidence by multiplication. As is well known, the same multiplicative accumulation holds for Bayes factors when the priors π_0 and π_1 are fixed in advance, independently of the data-collection process (what we call the subjective Bayesian setting). In that case, independent studies can be combined sequentially by updating each prior to the corresponding posterior after each study. The overall Bayes factor is then the product of the stage-wise Bayes factors. The situation is less clear for objective Bayes methods. There, the prior is often not a genuinely fixed ingredient, but is instead chosen by a rule that may depend on features of the experimental design, such as the sample space, stopping plan, or covariate structure. If later studies are added, the prior deemed appropriate for the combined experiment may differ from the prior used for the first study considered in isolation. As a result, sequential accumulation becomes ambiguous: the product of the stage-wise Bayes factors need not coincide with the Bayes factor that would have been computed had the full experiment been specified from the start (De Heide and Grünwald, 2021). Since this happens only for some, and not nearly all, choices of H, we mark the requirement as only partially satisfied. The GLR, meanwhile, does not accumulate on any scale, as the parameter achieving (or approximating) the supremum may change across samples. Independent p-values can be combined using Fisher's method (which sums −2 log p_i), and the product of two conditionally independent e-statistics remains an e-statistic, a property which follows easily from the definition of expected value. See Ramdas and Wang (2025, Chapter 8) for more discussion.
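Both closure properties just discussed are easy to check numerically. The sketch below is our own illustration, not an example from the text: we assume, purely for concreteness, the Gaussian likelihood-ratio e-variable E_λ(X) = exp(λX − λ²/2), i.e., the likelihood ratio of N(λ, 1) against a N(0, 1) null, and verify by simulation that a convex combination of two dependent e-variables and a product of two e-variables on independent samples each still have expectation at most 1 under the null:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # Monte Carlo repetitions under the null X ~ N(0, 1)

def e_var(x, lam):
    # Likelihood ratio of N(lam, 1) against N(0, 1): an e-variable,
    # since its expectation under the null is exactly 1.
    return np.exp(lam * x - lam ** 2 / 2)

X = rng.standard_normal(n)

# Convex combination of dependent e-variables: E1 and E2 are computed
# from the *same* data, yet 0.5*E1 + 0.5*E2 remains an e-variable.
E_mix = 0.5 * e_var(X, 0.3) + 0.5 * e_var(X, 1.0)

# Product of e-variables computed on *independent* samples
# (Accumulation I): again an e-variable.
X2 = rng.standard_normal(n)
E_prod = e_var(X, 0.5) * e_var(X2, 0.5)

print(E_mix.mean(), E_prod.mean())  # both near 1, as the theory requires
```

Note that no independence is needed for the convex combination, while the product construction relies on the two samples being independent, exactly as in the text.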
Accumulation II: Flexible-Design. Now suppose that the decision to perform an additional study depends on the results of previous studies. If the data X^(j) is independent of X^(1), …, X^(j−1), and X^(i) is associated with e-variable E_i, then the product ∏_{i≤τ} E_i is an e-variable. This was a major motivation in the original development of e-statistics (Grünwald et al., 2024a). This property is also satisfied by e-processes that, in the inter-experimental setting, are constructed on the fly from a sequence of e-variables. GLRs fail to satisfy Accumulation II for the same reason they failed to satisfy Accumulation I. Objective Bayes factors fail this criterion more dramatically than they did Accumulation I: not only may the prior for the initial experiment be chosen in different ways, its definition will in general depend on the number of studies that will eventually be done. Hence the objective Bayes factor may in fact be undefined. Subjective Bayes factors, on the other hand, continue to satisfy this criterion (Rouder, 2014; De Heide and Grünwald, 2021). P-values fail this and all subsequent desiderata because they require that X* be fully specified in advance of running the study. Consider the following example.

Example 8. Suppose that a randomized clinical trial to test a new medical treatment is performed on 50 patients, represented by a 50-dimensional data vector X^(1). The result turns out to be promising but not conclusive: the researchers observed a p-value of p_1 = 0.1 while they had a significance level of 0.05 in mind. But their boss is optimistic at the news and agrees to supply the resources to test another 30 patients, resulting in data X^(2). Is this good news? Not if one insists on measuring evidence in the second trial by another p-value, say p_2.
The decision to gather X^(2) depends on X^(1) and, as a result, combination methods like Fisher's cannot be employed any more. Indeed, they require that X* = (X^(1), X^(2)), whereas in our setting X* would be equal to X^(1) for some values of X^(1) and equal to (X^(1), X^(2)) otherwise. Similarly, joining the two data sets and recalculating the p-value leads to a wrong answer as well. In fact, the only known way to obtain a sequence of anytime-valid p-values in such settings is to convert each p_i, i = 1, 2, into an e-value E_i by calibration (Section 4), and then report not p_i itself but p′_i := 1/E_i. The quantity p′_12 := 1/(E_1 E_2) can then be interpreted as a p-value for the combined data. In practice, however, this approach is typically inefficient (Grünwald et al., 2024a, Section 7), and one can do better by working directly with e-values.

Dynamic Consistency. Dynamic consistency asks that we obtain the same evidence when we treat X* as a batch versus when we treat the samples X^(j) separately, sequentially, and then combine them. For p-values, likelihood ratios, and Bayes factors, satisfying dynamic consistency is equivalent to satisfying accumulation under a flexible design (here we again assume that τ can be data-dependent). For e-statistics we must distinguish between several cases. If we design an e-process for the stream of data X^(1), X^(2), …, then dynamic consistency is satisfied (Ramdas et al., 2023). If, however, we model the individual X^(j) by separate e-variables and combine them via multiplication, then this desideratum is not satisfied in general (Grünwald et al., 2024a, Section 6).
In particular, for some e-variables the product ∏_{j=1}^τ E_j of the e-values E_j for the studies X^(j) does not coincide with the single e-value E* for X* that one would have obtained by designing a log-optimal e-variable for X* directly (though we emphasize that, as stated earlier, the product ∏_{j=1}^τ E_j is a valid e-variable, which of course yields valid error rates as discussed in Section 2).

Inter-experiment counterfactuals. A common criticism of p-values is that, paradoxically, they depend on data that are not observed, or, more precisely, on what decisions an experimenter would have taken in counterfactual situations in which the data were different (Barnard, 1947; Wagenmakers, 2007). There are multiple kinds of counterfactuals to consider, which we delineate as inter- and intra-experiment counterfactuals. (We have not found this distinction anywhere else in the literature, but it is important to make since the criteria behave quite differently under each sub-division!) The former, discussed here, concerns counterfactuals with respect to the number of observations, which might also be thought of as optional continuation in counterfactual situations.

Example 9 (Example 8, continued). In a variation of the previous example, suppose the researchers told their boss merely that the p-value was small enough for the result to be "promising but not conclusive," and not its actual value. The boss, once again, responds optimistically and requests that they continue the trial on a second batch of 30 patients. The researchers, who know their statistics, are now worried about invalidating the results. But then news reaches the boss that the p-value is 0.1. Dismayed, he decides to stop the trial. Should the researchers be relieved?
The answer is an emphatic no: a simple calculation shows that the mere fact that the sample would have been 80 patients (thus different from the originally planned 50) in a counterfactual situation (i.e., if the first 50 data points had been different than they actually were) makes the p-value invalid. That is, counterfactuals can ruin the validity of the p-value even if the sampling plan was not in fact changed for the data which were actually observed. The dependence of the p-value on decisions that would have been made in counterfactual situations is disturbing, since in reality it is often quite unknowable (or even meaningless to speak about) what exactly would have happened had the data been different from what they actually were. That p-values cannot handle counterfactuals is an immediate consequence of the fact that they cannot deal with accumulation. Similarly for the objective Bayes factor and the GLR. For e-variables and e-processes in inter-experimental settings (where the e-processes are constructed by multiplying the e-variables), this desideratum is satisfied because the constructed e-variables are valid at arbitrary stopping times. That is, regardless of the data witnessed thus far, stopping an e-process will always result in a valid e-value. Similarly for subjective Bayes factors, which also remain valid at stopping times. Note that this implies that p-values which can be written as the inverse of e-values also satisfy this property. In general, however, such p-values are conservative and are not the ones used in practice.

Intra-experiment optional stopping. Now we suppose that X* represents data from a single study in which we observe τ data points, X* = (X_1, …, X_τ). As usual, τ is not fixed: it is a stopping time whose definition the analyst may not know or may not want to fix in advance.
Consider the following simple example in the spirit of Lindley and Phillips (1976).

Example 10. Suppose the data are binary and we observe X* = (0, 1, 0, 1, 1), but do not know whether the sample size was fixed to be 5 in advance or whether the sampling plan was "stop as soon as you see two 1s in a row." Or, then again, maybe each data point is expensive and the p-value for a sample size of 5 was already so small that your boss decided data collection should be stopped.

As we stated in Section 3, this criterion is very similar to Accumulation II, but making the distinction between optional continuation 'inside' a single study versus 'between' studies sheds light on several subtleties for e-statistics. As we saw in Section 2, e-processes are defined such that they allow optional stopping. Likelihood ratios, being e-processes, also satisfy this criterion. On the other hand, for e-variables (when directly defined on the batch X*, as they often are in practice), optional stopping is an undefined operation. As for the other notions of evidence, it is well known that p-values do not allow for data-dependent stopping times. In fact, picking the sample size as a function of the data and reporting the resulting p-value has come to be known as "p-hacking," and is often cited among the most common "questionable research practices" (Simmons et al., 2011; John et al., 2012). Bayes factors, meanwhile, allow for optional stopping as long as the priors are chosen a priori, independently of the stopping time (though note that they do not in general provide type-I error rates, in contrast to likelihood ratios). We refer to De Heide and Grünwald (2021) and Rouder (2014) for more discussion on Bayes factors and optional stopping.
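The optional-stopping guarantee for e-processes can be made concrete with a small simulation. This is our own sketch, not an example from the text: we assume, for concreteness, a N(0, 1) null and the likelihood-ratio e-process of N(0.5, 1) against it. Even an analyst who stops the moment the e-process looks large cannot push the type-I error above α, by Ville's inequality:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam, T, reps = 0.05, 0.5, 500, 20_000

# Under the null X_t ~ N(0, 1), the running likelihood ratio
# M_t = prod_{s<=t} exp(lam*X_s - lam^2/2) is a nonnegative martingale
# with M_0 = 1, i.e. an e-process.
rejections = 0
for _ in range(reps):
    log_M = np.cumsum(lam * rng.standard_normal(T) - lam ** 2 / 2)
    # Adversarial stopping: stop (and reject) as soon as M_t >= 1/alpha.
    if log_M.max() >= np.log(1 / alpha):
        rejections += 1

# Ville's inequality: P(sup_t M_t >= 1/alpha) <= alpha under the null,
# so even this data-dependent stopping rule controls the type-I error.
print(rejections / reps)  # stays below alpha = 0.05
```

A fixed-sample test recalculated at every t with threshold 0.05 would, by contrast, reject a true null with probability far exceeding 0.05 under the same stopping rule; this is precisely the "p-hacking" phenomenon described above.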
Like the Bayes factor with a subjective prior, the GLR remains well-defined under optional stopping, but like the Bayes factor, it fails to provide valid error rates, as the supremum can exaggerate evidence for the alternative (Ramdas et al., 2023). Therefore, while the Bayes factor still has a clear interpretation and can be justified in case the priors reflect trustworthy prior knowledge, using the GLR here has no clear justification at all. In practice, the GLR is usually employed within the error-probability tradition, where its main goal is to provide error rates, which it cannot provide here.

Intra-experiment counterfactuals. Researchers within the likelihood tradition have observed that the p-value's dependence on decisions made in counterfactual situations stretches beyond only the size of the dataset, as illustrated by the following celebrated example:

Example 11 (Pratt's voltmeter, cf. Edwards, 1972; Savage et al., 1962). Suppose we observe X_1, …, X_n, where n is fixed and the X_i represent voltages of electron tubes, measured with an accurate voltmeter. A statistician examines the X_i assuming they are normally distributed with fixed variance and some mean µ. He aims to use a p-value to measure the evidence against the null hypothesis that µ = 7.0. Later he visits the engineer's laboratory and notices that the voltmeter reads only as far as 10: the population appears to be censored. Even though none of the X_i were 10, this makes the standard p-value invalid; it necessitates a new calculation based on a p-value that takes into account the (potential, counterfactual) censoring. However, the engineer says she also has a super-high-range meter, equally accurate, which she would have used if any of the measurements had turned out ≥ 10. This is a relief to the statistician, because it means the original p-value is correct after all.
But the next day the engineer telephones and says, "I just discovered my high-range voltmeter was not working the day I did the experiment." The statistician then informs her that a new analysis will be required after all! The engineer is astounded. She says, "But the experiment turned out just the same as if the high-range meter had been working. I learned exactly what I would have learned if the high-range meter had been available. Next you'll be asking about my oscilloscope!"

We may think of both the voltmeter example and Example 10 as instances of the following generic phenomenon. The random variable X*, taking values in some set 𝒳*, defines a partition of the sample space Ω, say {E_x : x ∈ 𝒳*}, where E_x is the set of exactly those ω ∈ Ω with X*(ω) = x. Observing X* = x is equivalent to observing that the event E_x occurred. Now, it is usually tacitly assumed that the definition of X* is known to the observer. In practice, however, an analyst simply observes an event E_x for some x ∈ 𝒳*. The analyst then knows that E_x happened, but often does not know how X* was even defined. This is the case in Example 10, where we observed X* = (X_1, …, X_τ) = (0, 1, 0, 1, 1) but did not know whether τ was defined as the constant 5 or as 'stop when you see two 1s.' It is also the case in the voltmeter example, where we observed (x_1, …, x_n) with all x_i < 10 but do not know whether X* = (X_1, …, X_n) or X* = (X_1 ∧ 10, …, X_n ∧ 10). Now, above we defined likelihood ratios as functions of the random variable X*, but we may equivalently define them as functions of events, i.e., measurable subsets of Ω. To avoid difficult mathematics, let us illustrate this in the case of discrete data.
For any event E, any random variable X, and any value x such that X = x corresponds to the event E_x, we have p_1(x)/p_0(x) = P_1(E)/P_0(E) as long as E = E_x. Thus, we can re-define likelihood ratios as functions of events and will get the same results as before, irrespective of the definition of X*. As a consequence, the likelihood ratio is immune to decisions in counterfactual situations in a very general sense. The same holds for the subjective Bayes factor. For the objective Bayes factor, the GLR, and isolated e-variables, the desideratum fails to be met for the same reasons as for inter-experiment counterfactuals. The situation is trickier for e-processes. Since likelihood ratios define e-processes, we might say this desideratum holds for at least some e-processes. On the other hand, one invariably defines e-processes on filtrations (essentially, sequences of random variables), not on arbitrary events, and at the time of writing it is not clear whether they can be extended to events beyond the simplest case in which H_0 and H_1 are simple (this is also the reason we marked the strict likelihood principle as only approximately satisfied by e-processes: likelihood ratios are defined on arbitrary events, and e-processes on the more restricted notion of filtrations). Further work is required to answer this question.

Overall, we find that while no single e-statistic satisfies all the evidential desiderata in Section 3, each desideratum is (at least partially) satisfied by at least one e-statistic. Moreover, when we restrict attention to criteria satisfied by all e-variables, they satisfy the largest set of desiderata among those measures which can handle composite hypotheses (which we deem to be a very important property!).
Indeed, as we've emphasized throughout and will explore further shortly, we may think of e-statistics as providing a novel extension of likelihood ratios, arguably more sophisticated than the generalized likelihood ratio, that behaves particularly well on most common desiderata. With that said, let us not pretend that e-statistics are the perfect measure of evidence. We end by considering some of their drawbacks.

6 Limitations and Counterarguments

One could of course object to e-statistics as a measure of evidence by appealing to any of the criteria listed above that they do not wholly satisfy. Instead, in this section we consider two broader critiques.

Non-uniqueness. The first is the fact that there can be many e-variables for a given problem. Even if one uses e-statistics, the amount of evidence against a hypothesis is not determined by the structure of the problem alone, but also by how the analyst chooses to bet in the betting game. We are sympathetic to this concern, but let us point out that the situation is not so different from other measures of evidence. For instance, there are typically multiple p-values that one may use for a given problem, and the Bayes factor depends on the choice of prior, which may vary from analyst to analyst. Further, while the likelihood ratio for simple hypotheses is uniquely determined, when moving to composite settings the analyst has his choice of possible generalizations. Nevertheless, the situation is arguably somewhat clearer for p-values than for e-values. In particular, within the error-statistics tradition, there is broad consensus that the quality of a p-value, when used for testing, is determined by the maximal power it can achieve. For sufficiently simple H_0 and H_1 (though only in such cases) one can define a uniformly most powerful p-value.
In the e-value literature, by contrast, the preferred optimality criterion under a simple alternative is often taken to be log-optimality. For composite alternatives, however, there are multiple ways of generalizing log-optimality, and these lead to quite different e-statistics. More broadly, there is less consensus in the e-statistics literature that log-optimality is the right criterion than there is in the error-statistics tradition that power is the right one (Koning, 2024).

Birnbaum's theorem and the likelihood principle. Birnbaum (1962) showed that two conditions entail the likelihood principle: the sufficiency principle and the conditionality principle. The result has come to be known as Birnbaum's theorem. The sufficiency principle states that if there exists a sufficient statistic for the statistical model, then it captures everything about evidence. That is, if two experiments give the same sufficient statistic, then they provide the same evidence for the parameter. The conditionality principle states that evidence should depend only on which experiment was actually run. For instance, if an analyst decides between two experiments by flipping a coin, the evidence should depend only on the one which was performed. Birnbaum's theorem is controversial because the conditionality and sufficiency principles are relatively weak, striking many frequentists as perfectly plausible. The conclusion, however, contradicts the frequentist's favorite inferential methods. The theorem has thus fallen under various forms of scrutiny (Durbin, 1970; Kalbfleisch, 1975; Evans et al., 1986; Evans, 2013; Gandenberger, 2015). To make the controversy more concrete, consider again Example 10. The likelihood ratio is entirely determined by these data, so the likelihoodist is satisfied that they have enough evidence about the situation to make a pronouncement.
But the frequentist who wants to employ a p-value is not. For them, the sampling plan matters: the p-value differs in each case. How do e-statistics fit into this story? As we've emphasized throughout the article, e-statistics are intimately related to likelihood ratios. For one, the simple-versus-simple likelihood ratio is an e-variable, and when testing two point hypotheses, it is the unique log-optimal e-variable (see Shafer (2021) for a simple proof). Second, we can upper bound any e-statistic E for H via a likelihood ratio as follows. Given P ∈ H, define a distribution Q via the density q = p · E / E_P[E] if E_P[E] > 0, and Q = P otherwise. Then, since E_P[E] ≤ 1,

E ≤ q/p.    (10)

We have already seen an instance of this in Example 4, where the data-compression e-variable was re-expressed as a likelihood ratio. When E_P[E] = 1 (as it is for E = E_τ for all τ in the e-process of Example 7, for example), (10) becomes an equality. See Ramdas and Wang (2025, Theorem 14.5) for a slightly more general statement. Thus, while e-statistics commit to a distinct core axiom (E_P[E] ≤ 1), they often recover the likelihood ratio, or can be expressed or bounded in terms of likelihood ratios. E-statistics therefore occupy an intermediate position in the debate around Birnbaum's theorem: they are closely connected to likelihood ratios, yet are not defined by a commitment to the likelihood ratio as the canonical measure of evidence. That stance may leave ardent likelihoodists unsatisfied, but others may regard it as a reasonable price to pay for the broader advantages of e-statistics.

7 Summary and Future Work

We have positioned e-statistics in the broader literature on statistical evidence. They can be viewed as generalizations of the likelihood ratio, and thus inherit many of the likelihood ratio's appealing evidential properties.
E-statistics are fundamentally frequentist objects (though they are allowed to depend on priors via the method of mixtures (Examples 5, 7) and recover Bayes factors in particular settings), thus avoiding some of the criticisms of Bayesian confirmation theory. Their game-theoretic meaning as the wealth accumulated by a statistician betting against nature lends them an intuitive interpretation, one that, like p-values, is well-defined even when considering a single hypothesis. Overall, e-statistics blend several attractive features of p-values, likelihood ratios, and Bayes factors, and we hope to have convinced the reader that they are worthy of consideration as compelling measures of statistical evidence.

With all that said, many evidential aspects of e-statistics remain underexplored. First, Koning (2024), building on work of Ramdas and Manole (2026), introduced a novel notion of evidence against H: the maximum probability with which a certain randomized test, while controlling type-I error, rejects the null. We did not include this notion in our comparison because it differs substantially from existing desiderata, though it may well deserve further study. Second, e-processes as currently defined do not fully satisfy the intra-experiment counterfactual desideratum. Is it possible to generalize their definition so that some do?

References

George A. Barnard. The meaning of a significance level. Biometrika, 34(1/2):179–182, 1947.

James Berger. The case for objective Bayesian analysis. Bayesian Analysis, 1(3):385–402, 2006.

James O. Berger. In defense of the likelihood principle: axiomatics and coherency. Purdue University, Department of Statistics, 1983.

James O. Berger and Thomas Sellke. Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397):112–122, 1987.

James O. Berger and Robert L. Wolpert.
The likelihood principle. In Institute of Mathematical Statistics: Lecture Notes–Monograph Series. IMS, 1988.

James O. Berger, Jayanta K. Ghosh, and Nitai Mukhopadhyay. Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference, 112(1-2):241–258, 2003.

David R. Bickel. A predictive approach to measuring the strength of statistical evidence for single and multiple comparisons. Canadian Journal of Statistics, 39(4):610–631, 2011.

David R. Bickel. The strength of statistical evidence for composite hypotheses: Inference to the best explanation. Statistica Sinica, pages 1147–1198, 2012.

David R. Bickel. David R. Bickel's contribution to the Discussion of 'Safe testing' by Grünwald, De Heide, and Koolen. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(5):1133–1134, 2024.

Allan Birnbaum. On the foundations of statistical inference. Journal of the American Statistical Association, 57(298):269–306, 1962.

Allan Birnbaum. The Neyman–Pearson theory as decision theory, and as inference theory; with a criticism of the Lindley–Savage argument for Bayesian theory. Synthese, 36(1):19–49, 1977.

Jeffrey D. Blume, Robert A. Greevy, Valerie F. Welty, Jeffrey R. Smith, and William D. Dupont. An introduction to second-generation p-values. The American Statistician, 73(1):157–167, 2019.

George Casella, F. Javier Girón, M. Lina Martínez, and Elías Moreno. Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37(3):1207, 2009.

Siddhartha Chib and Todd A. Kuffner. Bayes factor consistency. arXiv preprint arXiv:1607.00292, 2016.

Ben Chugg, Tyron Lardy, Aaditya Ramdas, and Peter Grünwald. On admissibility in post-hoc hypothesis testing. International Journal of Approximate Reasoning, page 109634, 2026.

T. M. Cover and J. A. Thomas. Elements of Information Theory.
Wiley-Interscience, New York, 1991.
David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society Series B, 34(2):187–220, 1972.
Rianne De Heide and Peter D Grünwald. Why optional stopping can be a problem for Bayesians. Psychonomic Bulletin & Review, 28(3):795–812, 2021.
James Durbin. On Birnbaum's theorem on the relation between sufficiency, conditionality and likelihood. Journal of the American Statistical Association, 65(329):395–398, 1970.
AF Edwards. Likelihood. An account of the statistical concept of likelihood and its application to scientific inference. British Journal for the Philosophy of Science, 23(2), 1972.
Ellery Eells and Branden Fitelson. Measuring confirmation and evidence. The Journal of Philosophy, 97(12):663–672, 2000.
Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press, 2012.
Larry G. Epstein and Martin Schneider. Recursive multiple priors. Journal of Economic Theory, 113(1):1–31, 2003.
Michael Evans. What does the proof of Birnbaum's theorem prove? Electronic Journal of Statistics, 7:2645–2655, 2013.
Michael Evans and Gun Ho Jang. Invariant p-values for model checking. The Annals of Statistics, 38(1):512–525, 2010.
Michael J Evans, Donald AS Fraser, and Georges Monette. On principles and arguments to likelihood. Canadian Journal of Statistics, 14(3):181–194, 1986.
Ronald A Fisher. Statistical Methods for Research Workers. Oliver & Boyd (Edinburgh), 1925.
Ronald A Fisher. The Design of Experiments. Oliver & Boyd Ltd, 1935.
Branden Fitelson. The plurality of Bayesian measures of confirmation and the problem of measure sensitivity. Philosophy of Science, 66(S3):S362–S378, 1999.
Malcolm Forster and Elliott Sober. Why likelihood?
The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations, pages 153–190, 2004.
Malcolm R Forster. Counterexamples to a likelihood theory of evidence. Minds and Machines, 16(3):319–338, 2006.
Victor Fossaluza, Rafael Izbicki, Gustavo Miranda da Silva, and Luís Gustavo Esteves. Coherent hypothesis testing. The American Statistician, 71(3):242–248, 2017.
K Ruben Gabriel. Simultaneous test procedures–some theory of multiple comparisons. The Annals of Mathematical Statistics, pages 224–250, 1969.
Greg Gandenberger. A new proof of the likelihood principle. British Journal for the Philosophy of Science, 66(3):475–503, 2015.
Malay Ghosh. Objective priors: An introduction for frequentists. Statistical Science, pages 187–202, 2011.
Steven N Goodman and Richard Royall. Evidence and scientific research. American Journal of Public Health, 78(12):1568–1574, 1988.
Sander Greenland. Valid p-values behave exactly as they should: Some misleading criticisms of p-values and their resolution with s-values. The American Statistician, 73(sup1):106–114, 2019.
Marian Grendár. Is the p-value a good measure of evidence? Asymptotic consistency criteria. Statistics & Probability Letters, 82(6):1116–1119, 2012.
P. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.
P. Grünwald and T. Roos. Minimum Description Length revisited. International Journal of Mathematics for Industry, 11(1), 2020.
Peter Grünwald and Joseph Halpern. Making decisions using sets of probabilities: Updating, time consistency, and calibration. Journal of Artificial Intelligence Research (JAIR), 42:393–426, 2011.
Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(5):1091–1128, 2024a.
Peter Grünwald, Tyron Lardy, Yunda Hao, Shaul K Bar-Lev, and Martijn de Jong. Optimal e-values for exponential families: The simple case. arXiv preprint arXiv:2404.19465, 2024b.
Peter D Grünwald. The e-posterior. Philosophical Transactions of the Royal Society A, 381(2247):20220146, 2023.
Peter D Grünwald. Beyond Neyman–Pearson: E-values enable hypothesis testing with a data-driven alpha. Proceedings of the National Academy of Sciences, 121(39):e2302098121, 2024.
Ian Hacking. Logic of Statistical Inference. Cambridge University Press, 2016.
Yunda Hao and Peter Grünwald. E-values for exponential families: the general case. arXiv preprint arXiv:2409.11134, 2024.
JA Hartigan. The likelihood and invariance principles. Journal of the Royal Statistical Society: Series B (Methodological), 29(3):533–539, 1967.
Raymond Hubbard and R Murray Lindsay. Why p values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1):69–88, 2008.
Harold Jeffreys. Theory of Probability. The Clarendon Press, Oxford, 1939.
Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
Leslie K John, George Loewenstein, and Drazen Prelec. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5):524–532, 2012.
John D Kalbfleisch. Sufficiency and conditionality. Biometrika, 62(2):251–259, 1975.
Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
Robert E Kass and Larry Wasserman. The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370, 1996.
Peter R Killeen.
An alternative to null-hypothesis significance tests. Psychological Science, 16(5):345–353, 2005.
Nick Koning. Continuous testing: unifying tests and e-values. arXiv preprint arXiv:2409.05654, 2024.
Nick W Koning. Post-hoc α hypothesis testing and the post-hoc p-value. arXiv preprint arXiv:2312.08040, 2023.
Daniël Lakens. Why P values are not measures of evidence. Trends in Ecology & Evolution, 37(4):289–290, 2022.
Tyron Lardy, Peter Grünwald, and Peter Harremoës. Reverse information projections and optimal e-statistics. IEEE Transactions on Information Theory, 2024.
Martin Larsson, Aaditya Ramdas, and Johannes Ruf. The numeraire e-variable and reverse information projection. The Annals of Statistics, 53(3):1015–1043, 2025.
Michael Lavine and Mark J Schervish. Bayes factors: What they are and what they are not. The American Statistician, 53(2):119–122, 1999.
Subhash R Lele. Evidence functions and the optimality of the law of likelihood. The Nature of Scientific Evidence, pages 191–216, 2004.
Yijia Li, Yuantong Li, and Xiaowu Dai. Li, Li, and Dai's contribution to the discussion of "Estimating means of bounded random variables by betting" by Waudby-Smith and Ramdas, 2023.
Dennis V Lindley and Lawrence D Phillips. Inference for a Bernoulli process (a Bayesian view). The American Statistician, 30(3):112–119, 1976.
Deborah G Mayo. Error and the Growth of Experimental Knowledge. University of Chicago Press, 1996.
Deborah G Mayo. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press, 2018.
Deborah G Mayo and Aris Spanos. Error statistics. In Philosophy of Statistics, pages 153–198. Elsevier, 2011.
P McCullagh and DR Cox. Invariants and likelihood ratio statistics. The Annals of Statistics, pages 1419–1430, 1986.
Elías Moreno, F Javier Girón, and George Casella.
Consistency of objective Bayes factors as the model dimension grows. The Annals of Statistics, 38(4):1937, 2010.
Richard D Morey, Jan-Willem Romeijn, and Jeffrey N Rouder. The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology, 72:6–18, 2016.
Stefanie Muff, Erlend B Nilsen, Robert B O'Hara, and Chloé R Nater. Rewriting results sections in the language of evidence. Trends in Ecology & Evolution, 37(3):203–210, 2022.
J. Neyman. Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics: Theory and Methods, 5(8):737–751, 1976.
Jerzy Neyman and Egon S Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20(1/2):175–240, 1928.
Karl Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
Muriel Felipe Pérez-Ortiz, Tyron Lardy, Rianne de Heide, and Peter D Grünwald. E-statistics, group invariance and anytime-valid testing. The Annals of Statistics, 52(4):1410–1432, 2024.
Aaditya Ramdas and Tudor Manole. Randomized and exchangeable improvements of Markov's, Chebyshev's and Chernoff's inequalities. Statistical Science, 41(1):121–142, 2026.
Aaditya Ramdas and Ruodu Wang. Hypothesis testing with e-values. Foundations and Trends® in Statistics, 1(1-2):1–390, 2025.
Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. arXiv preprint arXiv:2009.03167, 2020.
Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter M Koolen.
Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning, 141:83–109, 2022.
Aaditya Ramdas, Peter Grünwald, Vladimir Vovk, and Glenn Shafer. Game-theoretic statistics and safe anytime-valid inference. Statistical Science, 38(4):576–601, 2023.
Herbert Robbins. Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics, 41(5):1397–1409, 1970.
Jeffrey N Rouder. Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2):301–308, 2014.
Jeffrey N Rouder, Paul L Speckman, Dongchu Sun, Richard D Morey, and Geoffrey Iverson. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237, 2009.
Richard Royall. Statistical Evidence: A Likelihood Paradigm, volume 71. CRC Press, 1997.
Richard M Royall. The effect of sample size on the meaning of significance tests. The American Statistician, 40(4):313–315, 1986.
B Ya Ryabko and VA Monarev. Using information theory approach to randomness testing. Journal of Statistical Planning and Inference, 133(1):95–110, 2005.
Leonard J Savage, George Barnard, Jerome Cornfield, Irwin Bross, IJ Good, DV Lindley, CW Clunies-Ross, John W Pratt, Howard Levene, Thomas Goldman, et al. On the foundations of statistical inference: Discussion. Journal of the American Statistical Association, 57(298):307–326, 1962.
Mark J Schervish. P values: what they are and what they are not. The American Statistician, 50(3):203–206, 1996.
Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):407–431, 2021.
Glenn Shafer and Vladimir Vovk. Probability and Finance: It's Only a Game!, volume 491. John Wiley & Sons, 2005.
Glenn Shafer and Vladimir Vovk.
Game-Theoretic Foundations for Probability and Finance. John Wiley & Sons, 2019.
Shubhanshu Shekhar and Aaditya Ramdas. Nonparametric two-sample testing by betting. IEEE Transactions on Information Theory, 70(2):1178–1203, 2023.
Joseph P Simmons, Leif D Nelson, and Uri Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.
Philip B Stark. Sets of half-average nulls generate risk-limiting audits: SHANGRLA. In International Conference on Financial Cryptography and Data Security, pages 319–336. Springer, 2020.
Kenta Takatsu. On the precise asymptotics of universal inference. arXiv preprint arXiv:2503.14717, 2025.
Mark L Taper and Subhash R Lele. The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations. University of Chicago Press, 2010.
Mark L Taper and Subhash R Lele. Evidence, evidence functions, and error probabilities. In Philosophy of Statistics, pages 513–532. Elsevier, 2011.
Mark L Taper and José Miguel Ponciano. Evidential statistics as a statistical modern synthesis to support 21st century science. Population Ecology, 58(1):9–29, 2016.
J. Ter Schure and P. Grünwald. ALL-IN meta-analysis: breathing life into living systematic reviews. F1000Research, 11(549), 2022.
Jean Ville. Étude critique de la notion de collectif. Bull. Amer. Math. Soc., 45(11):824, 1939.
Vladimir Vovk. Conformal e-prediction. Pattern Recognition, 166:111674, 2025.
Vladimir Vovk and Ruodu Wang. E-values: Calibration, combination and applications. The Annals of Statistics, 49(3):1736–1754, 2021.
Vladimir Vovk, Bin Wang, and Ruodu Wang. Admissible ways of merging p-values under arbitrary dependence. The Annals of Statistics, 50(1):351–375, 2022.
Eric-Jan Wagenmakers. A practical solution to the pervasive problems of p values.
Psychonomic Bulletin & Review, 14(5):779–804, 2007.
Hongjian Wang and Aaditya Ramdas. Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance. Sequential Analysis, 44(1):56–110, 2025.
Ruodu Wang. The only admissible way of merging arbitrary e-values. Biometrika, 112(2):asaf020, 2025.
Ruodu Wang and Aaditya Ramdas. False discovery rate control with e-values. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):822–852, 2022.
Larry Wasserman, Aaditya Ramdas, and Sivaraman Balakrishnan. Universal inference. Proceedings of the National Academy of Sciences, 117(29):16880–16890, 2020.
Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016.
Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):1–27, 2024.
Ian Waudby-Smith, Ricardo Sandoval, and Michael I Jordan. Universal log-optimality for general classes of e-processes and sequential hypothesis tests. arXiv preprint arXiv:2504.02818, 2025.
S Paul Wright. Adjusted p-values for simultaneous inference. Biometrics, pages 1005–1013, 1992.
Min-ge Xie and Kesar Singh. Confidence distribution, the frequentist distribution estimator of a parameter: A review. International Statistical Review, 81(1):3–39, 2013.
