An attempt at reading Keynes’ Treatise on Probability
The book A Treatise on Probability was published by John Maynard Keynes in 1921. It contains a critical assessment of the foundations of probability and of the current statistical methodology. As a modern reader, we review here the aspects that are m…
Authors: Christian P. Robert
Reading Keynes’ Treatise on Probability

Christian P. Robert
Université Paris-Dauphine, CEREMADE, and CREST, Paris

Abstract. A Treatise on Probability was published by John Maynard Keynes in 1921. The Treatise contains a critical assessment of the philosophical foundations of probability and of the statistical methodology at the time. We review the aspects of the book that are most related to statistics, avoiding uninteresting neophyte’s forays into philosophical issues. In particular, we examine the arguments provided by Keynes against the Bayesian approach, as well as the sketchy alternative of a return to Lexis’ theory of analogies he proposes. Our conclusion is that the Treatise is a scholarly piece of work looking at past advances rather than producing directions for the future.

Keywords: probability theory, frequency, Law of Large Numbers, foundations, Bayesian statistics, history of statistics.

1 Introduction

A Treatise on Probability is John Maynard Keynes’ polishing of his 1907 and 1908 Cambridge Fellowship dissertation (The Principles of Probability, submitted to King’s College) into a book after an interruption due to the war (for censorship reasons, as Keynes was then an advisor to the government). Although the author revised this dissertation in 1914 (as mentioned by Aldrich, 2008a) and then in 1920 towards a general audience, the original potential readers of A Treatise on Probability were mostly local academics, among them his Cambridge colleagues Edgeworth and Yule. Despite Keynes’ lasting interest in statistics, this is also his most significant publication in this field, since his research focus had moved to economics by then (as shown by, e.g., Keynes, 1919).
In contrast with the immense influence Keynes exerted and still exerts in this latter field, and in agreement with the fact that the original version was an internal dissertation, the impact of A Treatise on Probability was very limited, for reasons further discussed hereafter.

In this review, we consider the most relevant parts of the book, solely from a statistical perspective, avoiding the outdated philosophical debates about the nature of probability and of induction that constitute most of the Treatise and do not overlap with our personal interests. One must recall that this philosophical part on the foundations of probability is a common feature in books of the period, as shown by the first fifty pages of Jeffreys (1937) dedicated to “direct probabilities”. Interestingly, Keynes favours a more subjective view of probability as a degree of belief, à la de Finetti (1974), while Jeffreys settles for a mathematical definition that implies there is only one “type” of probability. It is also worth reminding the reader at this stage that Andrej Kolmogorov’s book laying the axioms of modern probability was only published in 1933, since this explains why the concept of probability was still under debate at the time in both mathematical and philosophical circles.

As in the parallel review of Jeffreys (1937) we undertook in Robert et al. (2009), there is no attempt at drawing a history of statistics in this review, which is rather to be taken as a reflection of a modern reader upon a piece of work written one century ago. For earlier historical aspects on the evolution of inverse probability as a central piece of statistical thinking in the 19th Century, we refer the reader to the comprehensive coverage in Dale (1999) who, ironically, stops his range at Karl Pearson, i.e. just before Keynes briefly entered the statistical scene.
For a broader historical perspective on the development of statistics and the state of statistics at the time of the Treatise, Stigler (1986) undoubtedly remains the essential reference.

Before engaging upon this review, we point out that the Treatise has been previously assessed by Stigler (2002), who was similarly critical of the depth of the book, and by Aldrich (2008a), who produced an extensive and scholarly survey on the impact (or lack thereof) of Keynes on the philosophical and statistical communities at the time. The latter includes in particular a detailed study of the reviews written on A Treatise on Probability by philosophers and statisticians of the early 20th Century, including Ronald Fisher (1923) and Harold Jeffreys (1931).

2 Contents of the Treatise

“A definition of probability is not possible, unless it contents us to define degrees of the probability-relation by reference to degrees of rational belief.” A Treatise on Probability, page 8.

As clearly stated in the above quote, the proclaimed and ambitious goal of A Treatise on Probability is to establish a logical basis for probability and to draw a new “constructive” approach for statistical induction. The extremely strong views contained in the book, as well as the highly critical reassessments of past and (then) current authors, like Laplace and Pearson, reflect the youth of the author and his earlier dispute with Karl Pearson on correlation. The extensive coverage of the (statistical if not probabilistic) literature of the time and the comprehensive, if not always insightful, discussion of most theories in competition show the extent of the scholarly expertise of Keynes in statistics.

The Treatise comprises 33 chapters, regrouped in five parts: I. Fundamental Ideas; II. Fundamental Theorems; III. Induction and Analogy; IV. Some Philosophical Applications of Probability; V.
The Foundations of Statistical Inference.

The first part sets the logical and philosophical grounds for establishing a theory of probability, also touching upon the Principle of Indifference treated below in Section 9.1. The second part is about probability axioms seen from a mathematical logic perspective, although the mathematical depth is somewhat limited. This part also contains a chapter on Inverse Probability, as discussed in Section 9. The third part is mostly philosophical and discusses Humean induction with few connections with statistical inference. Part IV is a short metaphysical digression on the meaning of randomness and its impact on conduct, completely unrelated to statistical inference. The statistical entries in the Treatise are mostly found in Part V, which covers convergence theorems (the “Law of Great Numbers” and the “Theorems of Bernoulli, Poisson and Tchebycheff”), Bayesian inference and a call for a return to the “Continental” principles laid by Lexis. As explored below, the amount of methodological innovation found in the book is extremely limited, in line with Keynes’ own acknowledgement that he is “unlikely to get much further”.

3 A restricted perspective

From a statistician’s viewpoint, the innovative aspects of the Treatise are quite limited in that the statistical discourse remains at a highly rhetorical, as opposed to methodological, level, drafting in vague terms the direction for prospective followers that never materialised. While the Treatise presents both a historical (Dale, 1999) and a philosophical interest, from the perspectives both of Keynes’ academic career and of the foundations of statistics, there is no statistical advance to be found in it.
For instance, the Treatise is missing the (then) current developments on a comprehensive theory of statistical tests, started with Karl Pearson’s $\chi^2$ and William Gosset’s $t$ tests, and about to culminate in Fisher (1925). Given the contents of the soon-to-come major advances represented by not only Fisher’s (1925) Statistical Methods, but also Jeffreys’ (1939) and de Finetti’s (1937, reprinted as 1974) homonymous Theory of Probability, the Treatise does not stand the comparison, as it fails to provide even a thorough treatment of the theory of statistics of the time, let alone propose advances in this domain. This lack of innovative material, along with the harsh tone of a critic who had contributed so little to the field, may explain why Keynes’ incursion in probability and statistics did not have a lasting impact, since even those most sympathetic to the book (Jeffreys, 1931; Lindley, 1968) saw neither practical nor methodological aspects to draw from it and praised aspects external to their own field (Aldrich, 2008a). Stigler (2002, pp. 161-162) similarly questions the worth of A Treatise on Probability as a mathematical and statistical work, its almost sole focus being “the binomial world”, and he considers the book unable to “carry the weight of a serious social scientific investigation”.

“Statistical techniques tell us how to ‘count the cases’.” A Treatise on Probability, page 392.

Keynes spends the major portion of the Treatise decrying a large part of the (then) current statistical practice (first and foremost, Bayesian[1] statistics) as well as a majority of the past and (then) current statisticians, and reproducing the arguments of other (and more Continental) researchers, like Boole, Lexis, or von Kries.
(Again, this is in line with our argument that the book is a scholarly and critical memoir rather than an innovative manifesto, even though the author aimed at a broader impact on the statistics community at the time.) Furthermore, most of Part V deals with observation frequencies (hence the quote at the top of this section) and their stabilisation.

Quite curiously, when compared with, say, the much more modern treatise by Jeffreys (1931)[2], this book does not contain analyses of realistic datasets, except when criticising von Bortkiewicz’s theory through the Prussian cavalry horsekick data, which is customarily used for introducing Poisson modelling and is available in R as the prussian dataset (R Development Core Team, 2006). This is somewhat surprising when considering the main research field of Keynes, namely Economics, where examples of considerable interest abound. Instead, a very small number of (academic) examples, like the proportion of boys in births, is recurrently discussed throughout the book.

To be complete about the statistics contents of A Treatise on Probability, we note that Part II on Fundamental Theorems also contains a chapter on the properties of various estimators of the mean in connection with the distribution of the observations, although Keynes dismisses its importance by stating “It is without philosophical interest and should probably be omitted by most readers” (page 186). This chapter actually reproduces Keynes’ only genuine statistical paper, published in the Journal of the Royal Statistical Society in 1911 on the theory of averages (to be discussed in Section 7).

[1] It is worth pointing out that the denomination of ‘Bayesian’ appeared much later (Fienberg, 2006), Keynes resorting to the (then) current denomination of Inverse Probability or, in a more derogatory way, “the Laplacian theory of ‘unknown probabilities’” (page 372).
Referring to the authoritative arguments of Aldrich (2008a), we stress that the papers of Keynes on statistics were definitely Bayesian, with most of his analysis being based on a uniform prior. He later started to worry about the influence of the prior, leading to the Treatise and to its harsh criticism of inverse probability, even though some Bayesian arguments remain in use within the book (see footnote 4). This is quite in agreement with the practice of the day, with statisticians mixing frequency and inverse probability arguments, Pearson and Fisher included, even though earlier books like Bertrand (1889) had pointed out the distinction between objective and subjective probabilities.

[2] Jeffreys (1922) reviewed the Treatise for Nature. His review was quite benevolent, despite most of Keynes’ perspectives on statistics being foreign to his own. This may explain why Part V was mostly bypassed by Jeffreys’ review. His comments in Scientific Inference (1931, pp. 222-224) about Keynes’ refusal to admit that probabilities were numbers, hence are comparable, and in Theory of Probability (1939, p. 25) about Keynes’ unwillingness to generalise axioms, more truthfully reflect Jeffreys’ global opinion about the book.

4 The low role of models

“The knowledge of statistical theory, which is required for this, travels, I find, quite outside my knowledge.” Letter to Pearson, 1915.

Throughout the book, Keynes holds both probability as a mathematical theory and probabilistic models as the basis for statistics in very low regard, considering that unknown probabilities do not exist and that the reproducibility of experiments is almost always questionable (as shown by the quote “Some statistical frequencies are, with narrower or wider limits, stable. But stable frequencies are not very common, and cannot be assumed lightly”, page 336), apart from urn models.
When adopting this type of reasoning, Keynes thus falls into what we call an “ultra-conditioning fallacy”, namely that, the more covariates one conditions upon, the more differently the individuals behave, a point of view that goes against supporting statistical practice because there can be no frequency stabilisation.[3] For instance, Keynes states that “where general statistics are available, the numerical probability which might be derived from them is inapplicable because of the presence of additional knowledge with regards to the particular case” (page 29). (He then goes on to deride Gibbon for his use of mortality tables when he should have called for a doctor!) This shows the gap between the perspectives of Keynes and those of Jeffreys (1931, 1939) and de Finetti (1937, reprinted as 1974), the latter focussing on the exchangeability of events to derive the existence of a common if unknown probability distribution. It is all the more surprising given the earlier works of Pearson and Edgeworth in the 1880’s that developed mathematical statistics towards a general theory of inference (see Stigler, 1986, Chapters 9 and 10). The above quote, given in Stigler (1999) as a request from Keynes to Pearson to help as examiner at the University of London, may however explain the reluctance of Keynes to engage in deeper mathematics.

The use of particular sampling distributions (called laws of errors) in the reproduction of his 1911 paper on averages (see Section 7) is not discussed in a modelling perspective but simply to back up the standard types of averages as maximum likelihood estimators.[4] “The general evidence which justifies our assumption of the particular law of errors which we do assume” (page 195) is never discussed further in the Treatise. Assessing the worth of a probability model against a dataset was not an issue for Keynes, although Pearson had earlier addressed the problem.
[3] Discussing a similar point about Keynes, Stigler (1999, page 48) concludes that, with “this standard, it is difficult to conceive of any policy issue that can be investigated statistically”.

[4] Aldrich (2008a) argues quite convincingly that these are maximum a posteriori (MAP) estimates corresponding to flat priors, making the term most probable more coherent. When stating the problem on page 194, Keynes indeed invokes the “theorem of inverse probability”, namely Bayes’ theorem on the parameter finite set. This theorem is also described in the 1911 paper. Again, this shows that part of Keynes’ reasoning is still grounded within the principles of inverse probability.

5 Criticisms of frequentism

“The frequency theory, therefore, entirely fails to explain or justify the most important source of the most usual arguments in the field of probable inference.” A Treatise on Probability, page 108.

Given Keynes’ reluctance, mentioned above, to accept numerical probabilities, models and reproducibility, it is no surprise that only extensive frequency stability is acceptable for him: “The ‘Law of Large Numbers’ is not at all a good name for the principle that underlies Statistical Induction. The ‘Stability of Statistical Frequencies’ would be a much better name for it.” (page 336). He has a strong a priori against the almost sure stabilisation of an iid random sequence, especially when considering real data. This is historically intriguing given the derivations by Bernoulli, de Moivre, and Laplace of the Law of Large Numbers more than a century earlier.

“Some statistical frequencies are, with narrower or wider limits, stable. But stable frequencies are not very common, and cannot be assumed lightly.” A Treatise on Probability, page 336.
The criticism of the Central Limit Theorem (CLT), called Bernoulli’s Theorem in the book,[5] that is found in Chapter XXIX is rather curious, in that it confuses model probabilities $p$ with probability estimates $p'$ (the identical notation being an indicator of this confusion).[6] For instance, on page 343, Keynes criticises the use of the CLT for the Bernoulli distribution $\mathcal{B}(1/2)$ and a coin tossing experiment as, when “heads fall at every one of the first 999 tosses, it becomes reasonable to estimate the probability of heads at much more than 1/2”. This argument therefore confuses the probability model $\mathcal{B}(p)$ with the estimation problem. Keynes’ inability to recognise the distinction may stem from his reluctance to use unknown probabilities such as $p$. Similarly, on pages 349-350, when considering the proportion of male births, an example dating back to Laplace, Keynes states that the probability of having $n$ male births in a row is not $p^n$, if $p$ is the probability of a single male birth, but

$$\frac{r}{s}\cdot\frac{r+1}{s+1}\cdot\frac{r+2}{s+2}\cdots\frac{r+n-1}{s+n-1}$$

if $s$ is the number of births observed so far, and $r$ the number of male births. The latter is a sequential construction based on individual estimates for each new observation, neither a true (predictive) probability nor a genuine plug-in estimate.

[5] Bernoulli’s Theorem is historically the weak Law of Large Numbers but Keynes presents this result in conjunction with (a) a description of the binomial $\mathcal{B}(n, p)$ distribution and (b) the normal (CLT) approximation to the binomial cdf, a result he calls Stirling’s theorem. While Edgeworth had a clear influence on Keynes in Cambridge, his expansions providing a better approximation than the CLT are not mentioned (Hall, 1992).

[6] Jeffreys (1931, p. 224) stresses that “Keynes’ postulate might fit the assigned probabilities instead of the true probabilities”.
Most interestingly, under a flat prior on $p$, the predictive (marginal) probability of seeing $n$ male births in a row is

$$\int_0^1 p^n\,\frac{(s+1)!}{r!\,(s-r)!}\,p^r(1-p)^{s-r}\,\mathrm{d}p = \frac{(s+1)!}{r!\,(s-r)!}\,\frac{(r+n)!\,(s-r)!}{(n+s+1)!} = \frac{(r+1)\cdots(r+n)}{(s+2)\cdots(s+n+1)}\,.$$

We thus come to the conclusion that Keynes’ solution corresponds to using Haldane’s (1932) prior, $\pi(p) = 1/\{p(1-p)\}$, for which the same computation returns $r(r+1)\cdots(r+n-1)/\{s(s+1)\cdots(s+n-1)\}$, and whose impropriety difficulties (Robert, 2001) were not an issue at the time and even later in Jeffreys (1939).

“It seldom happens that we can apply Bernoulli’s theorem to a long series of natural events.” A Treatise on Probability, page 343.

That Keynes concludes that Bernoulli’s Theorem (a simple version of the CLT in the binomial case) does not hold exactly in this setting is clearly inappropriate. When considering that “knowledge of the result of one trial is capable of influencing the probability at the next”, he is confusing the “true” probability with the estimated one. The same criticism applies to Keynes’ remark that “a knowledge of some members of a population may give us a clue to the general character of the population in question.” (page 346), a remark that bears witness to Keynes’ skepticism about the relevance of probabilistic models. From a Bayesian perspective, it appears that Keynes mixes sampling distributions with marginal distributions, as in the latent variable example of page 346 dealing with observations from $\mathcal{B}(p)$ when $p \in \{p_1, \ldots, p_k\}$: the observations become dependent when integrating out $p$.
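The correspondence between Keynes’ sequential product and the predictive probabilities discussed above can be checked with exact rational arithmetic. The following is only a minimal sketch: the function names and the counts $r = 52$, $s = 100$, $n = 3$ are hypothetical choices made for illustration, and the Beta$(a, b)$ parameterisation of the prior (flat prior for $a = b = 1$, Haldane’s prior for $a = b = 0$) is our own way of covering both cases at once.

```python
from fractions import Fraction
from math import factorial


def run_probability(r, s, n, a, b):
    """Predictive probability that the next n births are all male, given
    r male births among s observed births, under a Beta(a, b) prior on p.
    a = b = 1 is the flat prior; a = b = 0 is Haldane's improper prior
    (its posterior Beta(r, s - r) is proper only when 0 < r < s)."""
    prob = Fraction(1)
    for j in range(n):
        prob *= Fraction(a + r + j, a + b + s + j)
    return prob


def keynes_product(r, s, n):
    """Keynes' sequential product r/s * (r+1)/(s+1) * ... * (r+n-1)/(s+n-1)."""
    prob = Fraction(1)
    for j in range(n):
        prob *= Fraction(r + j, s + j)
    return prob


def flat_prior_closed_form(r, s, n):
    """Closed form (s+1)! (r+n)! / {r! (n+s+1)!} of the flat-prior integral."""
    return Fraction(factorial(s + 1) * factorial(r + n),
                    factorial(r) * factorial(n + s + 1))


# Hypothetical counts, chosen only for illustration.
r, s, n = 52, 100, 3
print("Keynes' product :", keynes_product(r, s, n))
print("Haldane's prior :", run_probability(r, s, n, 0, 0))
print("flat prior      :", run_probability(r, s, n, 1, 1))
print("closed form     :", flat_prior_closed_form(r, s, n))
```

Working with `Fraction` rather than floating-point numbers makes the identities exact equalities instead of approximate ones, which seems the more convincing check here.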
The statement “if we knew the real value of the quantity, the different measurements of it would be independent” (page 195) may be understood in this light, even though it is a risky extrapolation given both the book’s stance on Bayesian statistics and the lack of evidence that Keynes mastered this type of mathematical technique (as shown by the quote at the opening of Section 4).

6 Keynes’ views on statistical inference

In the continuation of the quote from page 392 given above, the Treatise argues most vigorously against mathematical statistics by stating that the purpose of statistics ought to be strictly limited to “preparing the numerical aspects of our material in an intelligible form”. Keynes thus separates inference (“the usual inductive methods”) from statistics and clearly shows his skepticism about extending statistics beyond a descriptive tool.

“The statistician is less concerned to discover the precise conditions in which a description can be legitimately extended by induction.” A Treatise on Probability, page 327.

The focus of statistical inference as described in Part V is reduced to a probability assessment: “In the first type of argument we seek to infer an unknown statistical frequency from an à priori probability. In the second type we are engaged on the inverse operation, and seek to base the calculation of a probability on an observed statistical frequency. In the third type we seek to pass from an observed statistical frequency, not merely to the probability of an individual occurrence, but to the probable value of other unknown statistical frequencies” (page 331). This is actually rather surprising given the overall negative tone of A Treatise about probability theory.
The first item above is a probabilistic issue and is treated as such in Chapter XXIX, which covers both the normal and Poisson limit theorems, as well as Čebyšev’s inequality. Further criticisms of “Bernoulli’s Theorem” found in this chapter are limited to the fact that finding independent and identically distributed (i.i.d.) replications is a condition that is “seldom fulfilled” (page 342). The 1901 proof by Liapounov of the CLT for general independent random variables is not mentioned in Keynes’ book and was presumably unknown to the author. Instead, he refers to Poisson for a series of independent random variables with different distributions, warning that “it is important not to exaggerate the degree to which Poisson’s method has extended the application of Bernoulli’s results” (page 346).

Although Čebyšev’s inequality has had very little impact on statistical practice, except when constructing conservative confidence intervals, Keynes is clearly impressed by the result (of which he provides a very convoluted proof on pages 353-355) and he concludes, rather unfairly since Laplace wrote one century before, that “Laplacian mathematics is really obsolete and should be replaced by the very beautiful work which we owe to these Russians” (page 355). Chapter XXIX terminates with an interesting section on simulation experiments aiming at an empirical verification of the CLT, although Keynes’ conclusion on a very long dice experiment is that, given that the frequencies do not match up to “what theory would predict” (page 363), the dice used in this experiment were quite irregular (or maybe worn out by the 20,000 tosses!).

“I do not believe that there is any direct and simple method by which we can make the transition from an observed numerical frequency to a numerical measure of probability.” A Treatise on Probability, page 367.
As illustrated by the above quote and discussed in the next section, Keynes does not consider Laplace’s (i.e. the Bayesian) approach to be logically valid and he similarly criticises normal approximations à la Bernoulli, seeing both as “mathematical charlatanery” (page 367)! Even the (maximum likelihood) solution of estimating $p$ with the frequency $x/n$ when $x \sim \mathcal{B}(n, p)$ does not satisfy him (as being “incapable of a proof”, page 371). Note that the maximum likelihood estimator is called the most probable value throughout the book, in concordance with the current denomination at the time (Hald, 1999), without Keynes objecting to its Bayesian flavour. Obviously, given that he wrote the main part of the book before the war, he could not have used Fisher’s denomination of maximum likelihood estimation since its introduction dates from 1922.[7]

The method of least squares is also heavily attacked in Chapter XVII as “surrounded by an unnecessary air of mystery” (page 209), while conceding on the next page that it exactly corresponds to assuming the normal distribution on observations (a fact that is not correct either). Once again, Keynes is missing the recent developments of Pearson and Yule on the estimation of regression coefficients, following the publication in 1889 of Natural Inheritance by Francis Galton and his discovery of regression (“one of the most attractive triumphs in the history of statistics”, according to Stigler, 1999, page 186). Galton is only quoted twice in the Treatise and for marginal reasons, while regression does not appear at all.

7 On the principal averages

Chapter XVII reproduces Keynes’ 1911 paper in the Journal of the Royal Statistical Society on the characterisation of the distributions leading to specific standard averages as MAPs under a flat prior, i.e.
modern MLEs, which means obtaining classes of densities for which the MLEs are the arithmetic, the geometric and the harmonic averages, and the median, respectively. The earlier decision-theoretic justifications of the arithmetic mean by Laplace and Gauß are derided as depending on “doubtful and arbitrary assumptions” (page 206), while the lack of reparameterisation invariance of the arithmetic average as MLE is clearly stated (on page 208). This classification of standard averages as MLEs is more of a technical exercise than of true methodological relevance, because the classification of distributions (“laws”) that give the arithmetic, geometric, harmonic mean or the median as MLEs is obviously parameterisation-dependent, a fact later noted by Keynes but omitted at this stage despite his criticism of Laplace’s principle on the same ground.

The derivation of the densities $f(x, \theta)$ of the distributions is based on the condition that the likelihood equation

$$\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(y_i, \theta) = 0$$

is satisfied for one of the four empirical averages, using differential calculus despite the fact that Keynes earlier derived (on page 194) Bayes’ theorem by assuming the parameter space to be discrete.[8] Under regularity assumptions, in the case of the arithmetic mean, this leads to the family of distributions

$$f(x, \theta) = \exp\{\phi'(\theta)(x - \theta) - \phi(\theta) + \psi(x)\}\,,$$

where $\phi$ and $\psi$ are arbitrary functions such that $\phi$ is twice differentiable and $f(x, \theta)$ is a density in $x$, meaning that

$$\phi(\theta) = \log\int \exp\{\phi'(\theta)(x - \theta) + \psi(x)\}\,\mathrm{d}x\,,$$

a constraint missed by Keynes.

[7] Both “Fisher” and “estimation” entries are missing from the index of the Treatise.

[8] Keynes notes that “differentiation assumes that the possible values of y [meaning $\theta$ in our notations] are so numerous and so uniformly distributed that we may regard them as continuous” (page 196).
(The same argument is reproduced in Jeffreys, 1939, page 167.) While we cannot judge the level of novelty in Keynes’ derivation with respect to earlier works, this derivation interestingly produces a generic form of unidimensional exponential family, twenty-five years before their rederivation by Darmois (1935), Pitman (1936) and Koopman (1936) as characterising distributions with sufficient statistics of constant dimensions.

The derivation of distributions for which the geometric and the harmonic means are MLEs then follows by a change of variables, $y = \log x$, $\lambda = \log\theta$ and $y = 1/x$, $\lambda = 1/\theta$, respectively. In those different derivations, the normalisation issue is treated quite off-handedly by Keynes, witness the function

$$f(x, \theta) = A\left(\frac{\theta}{x}\right)^{k\theta} e^{-k\theta}$$

at the bottom of page 198, which is not integrable per se. Similarly, the derivation of the log-normal density on page 199 is missing the Jacobian factor $1/x$ (or $1/y_q$ in Keynes’ notations) and the same problem arises for the inverse-normal density, which should be

$$f(x, \theta) = A\,e^{-k^2(x-\theta)^2/\theta^2x^2}\,\frac{1}{x^2}\,,$$

instead of $A\exp\{k^2(\theta - x)^2/x\}$ (page 200). At last, the derivation of the distributions producing the median as MLE is rather dubious because it does not seem to account for the non-differentiability of the absolute distance at every point of the sample. Furthermore, Keynes’ general solution

$$f(x, \theta) = A\exp\left\{\int \frac{y - \lambda}{|y - \lambda|}\,\phi''(\lambda)\,\mathrm{d}\lambda + \psi(x)\right\}\,,$$

where the integral is interpreted as an anti-derivative, is such that the recovery of Laplace’s distribution, $f(x, \theta) \propto \exp\{-k^2|x - \theta|\}$, involves setting (page 201)

$$\psi(x) = \frac{\theta - x}{|\theta - x|}\,k^2x\,,$$

hence making $\psi$ dependent on $\theta$ as well. In his summary (pages 204-205), Keynes (a) reintroduces a constant $A$ for the normalisation of the density in the case of the arithmetic mean and (b) produces

$$f(x, \theta) = A\exp\left\{\phi'(\theta)\,\frac{\theta - x}{|\theta - x|} + \psi(x)\right\}$$

in the case of the median.
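As a numerical illustration of the correspondence between laws of errors and averages discussed in this section, the sketch below maximises four likelihoods in $\theta$ and compares the maximisers with the four standard averages. It is not Keynes’ derivation: the normal, log-normal, inverse-normal and Laplace laws are taken with a fixed scale $k$, every factor free of $\theta$ (normalising constants, Jacobians) is dropped since it does not move the maximiser, and the sample values, the function names and the ternary search are assumptions made for this example only.

```python
import math
import statistics

# Negative log-likelihood contribution of one observation x, up to terms free
# of theta and up to the fixed scale factor k^2.
NEG_LOG_LIK = {
    "normal": lambda x, t: (x - t) ** 2,
    "log-normal": lambda x, t: (math.log(x) - math.log(t)) ** 2,
    "inverse-normal": lambda x, t: (1 / x - 1 / t) ** 2,
    "Laplace": lambda x, t: abs(x - t),
}


def argmin_unimodal(f, lo, hi, iters=200):
    """Ternary search for the minimiser of a strictly unimodal f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


def mle(data, law):
    """Maximum likelihood estimate of theta for positive data under `law`."""
    term = NEG_LOG_LIK[law]
    return argmin_unimodal(lambda t: sum(term(x, t) for x in data),
                           min(data), max(data))


# Hypothetical positive sample of odd size (so that the median is unique).
data = [1.2, 2.5, 3.1, 4.8, 7.0]
averages = {
    "normal": statistics.mean(data),
    "log-normal": statistics.geometric_mean(data),
    "inverse-normal": statistics.harmonic_mean(data),
    "Laplace": statistics.median(data),
}
for law, average in averages.items():
    print(f"{law:15s} MLE = {mle(data, law):.6f}   average = {average:.6f}")
```

Each criterion is unimodal in $\theta$ and each average lies between the smallest and the largest observation, which is why a bracketing search over that interval suffices in this sketch.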
The latter form, proposed for the median, is equally puzzling because the ratio in the exponential is equal to the sign of $x - \theta$, leading to a possibly different weighting of $\exp\psi(x)$ when $x < \theta$ and when $x > \theta$.

8 A reactionary proposal

After an extensive criticism of the methods of the time and of the use of mathematical models as a basis for statistical inference (see Section 4), Keynes concludes the Treatise with a defence of the method advocated by the late Lexis (who died in 1914), at the very moment Fisher (1925) was defining statistics as “mathematics applied to data”. As analysed by Aldrich (2008a, Section 5), the defence is paired with Keynes’ attempt to link Lexis’ theory to his own principles of analogy in induction, as advanced in Part III of the Treatise. The following quote indicates why the attempt failed.

“I have experienced exceptional difficulty, as the reader may discover for himself in the following pages, both in clearing up my own mind about it and in expounding my conclusions precisely and intelligently.” A Treatise on Probability, page 409.

When considered from a modern perspective, Chapters XXXII and XXXIII advocate a very empirical approach to statistics (which, in an anachronistic way, prefigures the bootstrap), namely to derive the stability of a probability estimate by subdividing a series into a large enough number of subseries in order to assess the variability of the estimate or to spot heterogeneity. Keynes associates this approach with Lexis and appears quite supportive of the latter, even though he comments that “Lexis has not pushed his analysis far enough” (page 401), before complaining about von Bortkiewicz, “preferring algebra to earth” (page 404).
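The subdivision principle described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of the procedure of Lexis or of Keynes: the function name `dispersion_ratio`, the use of consecutive subseries of equal length, and the ratio of the observed variance of the subseries frequencies to the variance expected under a single binomial model are all assumptions made for the example, as are the simulated series.

```python
import random


def dispersion_ratio(series, n_groups):
    """Split a 0/1 series into n_groups consecutive subseries of equal length
    and return the observed variance of the subseries frequencies divided by
    the variance expected under one homogeneous binomial model."""
    m = len(series) // n_groups  # subseries length (any remainder is dropped)
    if n_groups < 2 or m < 1:
        raise ValueError("need at least two non-empty subseries")
    freqs = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_groups)]
    p_hat = sum(freqs) / n_groups
    if p_hat in (0.0, 1.0):
        raise ValueError("degenerate series: the binomial variance is zero")
    observed = sum((f - p_hat) ** 2 for f in freqs) / (n_groups - 1)
    binomial = p_hat * (1 - p_hat) / m
    return observed / binomial


# Simulated series (hypothetical): one with a constant success probability,
# one whose probability alternates between 0.3 and 0.7 every 200 trials.
rng = random.Random(0)
homogeneous = [int(rng.random() < 0.5) for _ in range(10_000)]
heterogeneous = [int(rng.random() < (0.3 if (i // 200) % 2 == 0 else 0.7))
                 for i in range(10_000)]

print("homogeneous  :", dispersion_ratio(homogeneous, 50))
print("heterogeneous:", dispersion_ratio(heterogeneous, 50))
```

A ratio close to one is what a stable frequency would produce, a ratio well above one signals heterogeneity between subseries, and a ratio well below one signals less variation than independent binomial sampling would allow.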
As highlighted by the above quote, the Treatise faces difficulties in building a general theory around this approach and the description of the mechanism for dividing the series remains unclear throughout the chapters, since it seems to depend on covariates. For instance, the sentence “all conceivable resolutions into partial groups” (page 395) is to be opposed to breaking “statistical material into groups by date, place, and any other characteristic which our generalisation proposes to treat as irrelevant” (page 397). The model thus constructed has a mixture flavour when the groups are made by chance, or a hierarchical one otherwise. Indeed, the description of “the probability p for the group made up as follows” (page 395)
$$p = \frac{z_1}{z}\, p_1 + \frac{z_2}{z}\, p_2 + \ldots$$
clearly corresponds to a mixture, the $z_i$'s being the component sizes. In any case, the description of Lexis' theory sums up as testing for variations between groups, i.e. by exposing a possible extra-binomial variation, called supra-normal by Keynes. Keynes also mentions the possibility of an insufficient variation (the subnormal case), which is attributed to dependence in the data, which “cannot be handled by purely statistical methods” (page 399).⁹ A modern account of Lexis' procedure for testing stability and of why the author “failed” (and is now largely forgotten) is given in Stigler (1986, Chapter 6), who adds on page 238 that Keynes, as one of the few followers of Lexis, missed the point that “simple urns models were insufficiently rich to support the needs of a modern statistical analysis”.

“Statistical induction is not really about the particular instance at all but a series.” A Treatise on Probability, page 411.

As the final chapter, Chapter XXXIII is Keynes' last attempt at defending his own views about a constructive theory of statistical inference.
However, it mostly sounds like a rephrasing of Lexis' views, Keynes' main point being that one should work with “series of series of instances” (page 407) in order to check for the stability of the assumed model. The point made in the above quote is valid at face value but the attempt at checking that all subdivisions of a dataset show the same variability (“until a prima facie case has been established for the existence of a stable probable frequency, we have but a flimsy basis for any statistical induction”, page 415) is doomed when pushed to its extreme, namely the division of the data into individual observations. Furthermore, we again stress that the Treatise never explicitly derives a testing methodology in the sense of Gosset or of Fisher,¹⁰ despite mentions made of “significant stability” (pages 408 and 415). When discussing Lexis' dispersion in Chapter XXXII, Keynes refers to a case when “the dispersion conformed approximately to the (...) normal law of error” (page 358), but, again, no entry is found on the contemporary Student's $t$ tests or Pearson's $\chi^2$ tests. Keynes' earlier criticisms about the extension of an observed model to future occurrences also apply in this setting, a fact acknowledged by the author: “it is not conclusive and I must leave to others its more exact elucidation” (page 419). Besides, the assessment of stability is not detailed and, while it seems to be based on normal approximations (to the binomial), the facts that the same data is used repeatedly and thus that the test statistics are dependent appear to have been overlooked by Keynes.

9 Inverse Probability

As already discussed in footnotes 1 and 4, the foundations of statistics were not sufficiently settled at the time Keynes wrote his book to allow for a clear distinction between frequentist and Bayesian philosophies.
The choice of prior distributions had already come under attack in the books of Chrystal, Venn and Bertrand, but the alternative construction of a non-Bayesian setting would have to wait a few more years for Fisher's (1925) new perspective.

⁹ The whole book considers handling dependent observations an impossible task, despite Markov's introduction of Markov chains a few years earlier. As pointed out by Aldrich (2008), the lack of feedback from Yule in A Treatise on Probability is apparent from the pessimistic views of the author about dependent series, despite the proximity of the authors in Cambridge, since Yule had already engaged in building a statistical analysis of time series.

¹⁰ Ronald Fisher also reviewed Keynes' book in 1923, concluding that Keynes' perspective on statistics was useless, as described in Aldrich (2008b).

“Bayes' enunciation is strictly correct and its method of arriving at it shows its true logical connection with more fundamental principles, whereas Laplace's enunciation gives it the appearance of a new principle specially introduced for the solution of causal problems.” A Treatise on Probability, page 175.

When discussing the history of Bayes' theorem in Chapter XVI, Keynes considers, as shown by the above quote, that only Bayes got his proof right and that subsequent writers, first and foremost Laplace, muddled the issue (except for Markov)! While the author of the Treatise rightly separates the mathematical result represented by Bayes' theorem from its use in statistical inference, Keynes misses the fact that Laplace independently derived Bayes' theorem from a purely mathematical perspective, before applying (much later) inverse probability principles in statistical problems.
(Confusing Bayes' theorem with not-yet Bayesian statistics seemed to be quite common at the time since, as reported in Stigler, 1999, Karl Pearson equates Bayes' theorem with Laplace's Principle of Non-Sufficient Reason covered below.) An interesting discussion in the Treatise revolves around the (obvious) fact that the prior probabilities of the different causes should be taken into account (“the necessity in general of taking into account the à priori probability of the different causes”, page 178). But one argument sheds light on the difficulty Keynes had with the updating of probabilities, as mentioned in the paragraph about the CLT: “how do we know that the possibilities admissible à posteriori are still, as they were assumed to be à priori, equal possibilities” (page 176).¹¹

This section considers the specific arguments Keynes advanced against Bayesian principles. (We note again that statistical practice at the time had both frequentist and Bayesian, i.e. sampling and posterior, arguments mostly mixed together, as detailed in Aldrich, 2008a.)

9.1 Against the Principle of Indifference

“My criticism will be purely destructive and I will not attempt to indicate my own way out of the difficulties.” A Treatise on Probability, page 42.

The Principle of Indifference is Keynes' renaming of the Principle of Non-Sufficient Reason advocated by Laplace and his followers for using (possibly improper) uniform prior distributions. Following the above preamble, Keynes (rightly) shows the inconsistency of this approach under (a) a refinement of the available alternative (pages 42-43) and (b) a non-linear reparameterisation of the model (page 45), the example being the change from $\nu$ into $1/\nu$.
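The inconsistency under reparameterisation can be checked with a short computation. The sketch below assumes, purely for illustration, a quantity $\nu$ known only to lie in $[1, 3]$, so that $1/\nu$ lies in $[1/3, 1]$; the interval and the helper name `uniform_prob` are hypothetical choices, not taken from the Treatise. A uniform prior on $\nu$ and a uniform prior on $1/\nu$ assign different probabilities to the same event $\nu \le 2$.

```python
from fractions import Fraction

def uniform_prob(lo, hi, a, b):
    """P(a <= X <= b) when X is uniform on [lo, hi], with [a, b] inside [lo, hi]."""
    return Fraction(b - a) / Fraction(hi - lo)

# Event nu <= 2, i.e. nu in [1, 2], under a uniform prior on nu over [1, 3].
p_nu = uniform_prob(Fraction(1), Fraction(3), Fraction(1), Fraction(2))

# Same event written in terms of 1/nu: 1/nu in [1/2, 1],
# under a uniform prior on 1/nu over [1/3, 1].
p_inv = uniform_prob(Fraction(1, 3), Fraction(1), Fraction(1, 2), Fraction(1))

print(p_nu, p_inv)  # 1/2 3/4
```

The two answers differ because a uniform density on $\nu$ induces the density proportional to $1/u^2$ on $u = 1/\nu$, which is not uniform.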
An extension of this argument on page 47 discusses the dependence of the uniform distribution on the dominating measure (although the book obviously does not dally with a measure theory not yet finalised at the time and anyway beyond Keynes' reach), as illustrated by Bertrand's paradox.¹² This paradox points out the lack of meaning of a “random chord” of a circle without a proper probability structure and is reanalysed in Jaynes (2003, page 386) from an objective Bayes perspective, where the author defends the maximum invariance principle. In the following and less convincing paragraphs of Chapter IV, Keynes finds fault with basic game examples (including the Monty Hall problem) where again the equidistribution depends on the reference measure.

¹¹ Note the accents used in à priori and à posteriori, although there are no accents in Latin. They may have stemmed from the way French writers first used those terms, even though à posteriori is also found in Jakob Bernoulli's Ars Conjectandi... The accent has vanished by the time of Jeffreys (1931).

“Who could suppose that the probability of a purely hypothetical event, of whatever complexity (...), and which has failed to occur on the one occasion on which the hypothetical conditions were fulfilled is no less than $1/3$?” A Treatise on Probability, page 378.

Similar arguments are advanced in Chapter XXX when debating Laplace's law of succession.¹³ Those are standard criticisms found for instance in the earlier Bertrand (1889). Namely, putting a uniform distribution on all possible alternatives is not coherent given that a subdivision of an alternative into further cases modifies the uniform prior. Furthermore, a non-linear reparameterisation of a probability $p$ into $q = p^n$ fails to carry uniformity from $p$ to $q$.
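For reference, the value $1/3$ in the quote above is what Laplace's law of succession returns after a single failed trial: under a uniform prior on the unknown probability, the predictive probability of a success after $s$ successes in $n$ trials is $(s+1)/(n+2)$. The sketch below, whose function names are illustrative assumptions, evaluates this rule and also checks that $q = p^n$ is not uniform when $p$ is, since $P(q \le t) = t^{1/n}$.

```python
from fractions import Fraction

def rule_of_succession(successes, trials):
    """Predictive probability of a success under a uniform prior
    on the unknown probability: (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)

def prob_q_below(t, n):
    """P(q <= t) for q = p**n when p is uniform on (0, 1), i.e. t**(1/n)."""
    return t ** (1.0 / n)

print(rule_of_succession(0, 1))  # 1/3, one trial and no success
print(rule_of_succession(0, 0))  # 1/2, no data at all
print(prob_q_below(0.5, 2))      # about 0.707, whereas a uniform q would give 0.5
```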
In concordance with the spirit of the time (Lhoste, 1923; Broemeling and Broemeling, 2003), the debate about whether or not the Principle of Indifference holds makes some sense, as shown by the subsequent defence by Jeffreys, but it does not hold much appeal nowadays because priors are recognised as reference tools for handling data rather than expressions of truth or of “objective probabilities”.

Chapter XXXI debates the inversion of Bernoulli's Theorem, a notion that we interpret as Bayes' formula applied to the Gaussian approximation to the distribution of an empirical frequency: on page 387,
$$\frac{f(q')\,|\,h \cdot f(q)}{\sum f(q')\,|\,h \cdot f(q)}\,,$$
apparently meaning
$$\frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,\mathrm{d}\theta}$$
in modern notations, is associated with the statement that “all the terms can be determined numerically by Bernoulli's Theorem”. Since this representation is somehow based on a flat reference prior (although the formula at the bottom of page 386, which seems to involve two distributions on the parameter $\theta$, is incomprehensible), and thus on the Principle of Indifference, it is rejected by Keynes who cannot see a “justification for the assumption that all possible values of q are à priori equally likely” (page 387).

¹² We note however that the general setup of modern measure theory had already been given by Henri Lebesgue in 1903 in Annali di Mathematica.

¹³ Although no black swan glides in, the section contains the obligatory example of the probability of the sun rising tomorrow that is found in almost every treatise on induction since Hume (Taleb, 2008).

9.2 Against probabilising the unknown

“Laplace's theory requires the employment of both of two inconsistent methods.” A Treatise on Probability, page 372.

The criticism of Bayesian (Laplacian) techniques goes further than the rather standard debate about the choice of the prior.
For Keynes, adopting a perspective that unknown probabilities can be modelled as random variables is beyond logical reasoning. Because an unknown probability is indeterminate, Keynes considers that “there is no such value” (page 373).¹⁴ Therefore, the Bayesian notion of setting a probability distribution over the unit interval is both illogical and impractical, since “if a probability is unknown, surely the probability, relative to the same evidence, of this probability has a given value, is also unknown” (page 373). Keynes then argues that, if the hyperprior probability is unknown, it should also be endowed with its own probability measure, inducing “an infinite regress” (page 373).

10 Conclusion

In conclusion, while Keynes' early interest in Probability and in Statistics is unarguable, A Treatise on Probability could not have made a lasting contribution to Statistics, even from an historical perspective, given the immense developments taking place in Statistics at the turn of the century or in the neighbouring decades. The Treatise appears in the end as a scholarly exercise focussing on past books and lacking a vision of developments that would have made Keynes a statistician of his time, while the aggressive tone adopted towards most of the writers quoted in the book is undeserved when comparing the achievements of both camps.
It is therefore no surprise that the book has had no influence on the probability and statistics communities: it would make no sense to advise students in the field to put aside major treatises to ponder through A Treatise on Probability as, to adopt Fisher's (1922) harsh but still relevant words, “they would be turned away, some in disgust, and most in ignorance, from one of the most promising branches of mathematics.”

¹⁴ A more positive perspective (see, e.g., Brady, 2004) is to consider that Keynes' stance prefigures the theory of imprecise probabilities à la Dempster–Shafer (see, e.g., Walley, 1991), as for instance when he states that “many probabilities can be placed between numerical limits” (page 160), but, to us, this mostly shows the same confusion between (interval) estimates and true probabilities found elsewhere in the book. The notion of replacing (pointwise) probabilities by intervals in the short Chapter XV is attributed to Boole (with another barb in the footnote of page 161) and it does not seem to be turned into any implementable version in the statistical inference section (Part V).

Acknowledgements

The author's research is partly supported by the Agence Nationale de la Recherche (ANR, 212, rue de Bercy 75012 Paris) through the 2007–2010 grant ANR-07-BLAN-0237 “SPBayes”. The first draft of this paper was written during the conference on Frontiers of Statistical Decision Making and Bayesian Analysis in San Antonio, Texas, March 17-20, held in honour of Jim Berger's 60th birthday, and the author would like to dedicate this review to him in conjunction with this event. He is also grateful to Eric Séré for his confirmation of the arithmetic mean distribution classification found in Chapter XVII of the Treatise. Detailed and constructive comments from a referee greatly helped in preparing the revision of the paper.
This paper was composed using the ba.cls macros from the International Society for Bayesian Analysis.

References

Aldrich, J. 2008a. Keynes among the Statisticians. History of Political Economy 40(2): 265–316.
Aldrich, J. 2008b. R.A. Fisher on Bayes and Bayes' Theorem. Bayesian Analysis 3(1): 161–170.
Bertrand, J. 1889. Calcul des Probabilités. Paris: Gauthier-Villars et fils.
Brady, M. 2004. J.M. Keynes' Theory. Xlibris Corporation.
Broemeling, L. and A. Broemeling. 2003. Studies in the history of probability and statistics XLVIII The Bayesian contributions of Ernest Lhoste. Biometrika 90(3): 728–731.
Dale, A. 1999. A History of Inverse Probability: From Thomas Bayes to Karl Pearson. 2nd ed. Sources and Studies in the History of Mathematics and Physical Sciences, Springer-Verlag, New York.
Darmois, G. 1935. Sur les lois de probabilité à estimation exhaustive. Comptes Rendus Acad. Sciences Paris 200: 1265–1266.
de Finetti, B. 1974. Theory of Probability, vol. 1. New York: John Wiley.
Fienberg, S. 2006. When did Bayesian inference become “Bayesian”? Bayesian Analysis 1(1): 1–40.
Fisher, R. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, A 222: 309–368.
Fisher, R. 1923. Review of J.M. Keynes's Treatise on Probability. Eugenic Review 14: 46–50.
Fisher, R. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Galton, F. 1889. Natural Inheritance. London: Macmillan.
Hald, A. 1999. On the history of maximum likelihood in relation to inverse probability and least squares. Statist. Sci. 14(2): 214–222.
Haldane, J. 1932. A note on inverse probability. Proc. Cambridge Philosophical Soc. 28: 55–61.
Hall, P. 1992. The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.
Jaynes, E. 2003. Probability Theory. Cambridge: Cambridge University Press.
Jeffreys, H. 1922. The Theory of Probability. Nature 109: 132–133.
Jeffreys, H. 1931. Scientific Inference. 1st ed. Cambridge: The University Press.
Jeffreys, H. 1937. Scientific Inference. 2nd ed. Cambridge: The University Press.
Jeffreys, H. 1939. Theory of Probability. 1st ed. Oxford: The Clarendon Press.
Keynes, J. 1911. The Principal Averages and the Laws of Error which Lead to Them. J. Royal Statistical Society 74: 322–331.
Keynes, J. 1919. The Economic Consequences of The Peace. New York: Harcourt Brace.
Kolmogorov, A. 1933. Grundbegriffe der Wahrscheinlichkeitsrechnung. Berlin: Springer.
Koopman, B. 1936. On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc. 39: 399–409.
Lhoste, E. 1923. Le Calcul des Probabilités Appliqué à L'artillerie 91: 405–423, 516–532, 58–82 and 152–179.
Lindley, D. 1968. John Maynard Keynes: Contributions to Statistics. In International Encyclopedia of the Social Sciences, vol. 8, 375–376. New York: Macmillan Company & The Free Press.
Pitman, E. 1936. Sufficient statistics and intrinsic accuracy. Proc. Cambridge Philos. Soc. 32: 567–579.
R Development Core Team. 2006. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org
Robert, C. 2001. The Bayesian Choice. Springer-Verlag, New York.
Robert, C., N. Chopin, and J. Rousseau. 2009. Theory of Probability revisited (with discussion). Statist. Science 24(2): 141–172 and 191–194.
Stigler, S. 1986. The History of Statistics. Cambridge: Belknap.
Stigler, S. 1999. Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, Massachusetts: Harvard University Press.
Stigler, S. 2002. Statisticians and the history of Economics. Journal of the History of Economic Thought 24(2): 155–164.
Taleb, N. 2008. Fooled by Randomness. 2nd ed. Random House.
Walley, P. 1991. Statistical Reasoning with Imprecise Probability. New York: Chapman and Hall.