Local Bayesian Regression
Nils Lid Hjort, University of Oslo

Abstract. This paper develops a class of Bayesian non- and semiparametric methods for estimating regression curves and surfaces. The main idea is to model the regression as locally linear, and then place suitable local priors on the local parameters. The method requires the posterior distribution of the local parameters given local data, and this is found via a suitably defined local likelihood function. When the width of the local data window is large the methods reduce to familiar fully parametric Bayesian methods, and when the width is small the estimators are essentially nonparametric. When noninformative reference priors are used the resulting estimators coincide with recently developed well-performing local weighted least squares methods for nonparametric regression. Each local prior distribution needs in general a centre parameter and a variance parameter. Of particular interest are versions of the scheme that are more or less automatic and objective in the sense that they do not require subjective specifications of prior parameters. We therefore develop empirical Bayes methods to obtain the variance parameter and a hierarchical Bayes method to account for uncertainty in the choice of centre parameter. There are several possible versions of the general programme, and a number of its specialisations are discussed. Some of these are shown to be capable of outperforming standard nonparametric regression methods, particularly in situations with several covariates.

Key words: Bayesian regression; empirical Bayes; hierarchical Bayes; kernel smoothing; local likelihood; locally linear models; Poisson regression; semiparametric estimation; Stein-type estimators

1. Introduction and summary. Suppose data pairs $(x_1, y_1), \ldots, (x_n, y_n)$ are available and that the regression of $y$ on $x$ is needed, in a situation where there is, a priori, no acceptable simple parametric form for this. We take the regression curve to be the conditional mean function $m(x) = \mathrm{E}(y \,|\, x)$, and choose to work in the slightly more specific model where the $y_i$'s are seen as regression curve plus i.i.d. zero mean residuals. In other words, given $x_1, \ldots, x_n$, we have

$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \quad \text{where } \mathrm{E}\,\varepsilon_i = 0 \text{ and } \mathrm{Var}\,\varepsilon_i = \sigma^2. \eqno(1.1)$$

This paper is about non- and semiparametric Bayesian approaches towards estimating the $m(\cdot)$ function.

1.1. Two standard estimators. Before embarking on the Bayesian journey we describe two standard frequentist estimation methods in some detail, since these will show up as important ingredients of some of the Bayes solutions to come. The first method uses

$$\widetilde m(x) = \text{the } a \text{ that minimises } \sum_{i=1}^n (y_i - a)^2 w_i(x), \eqno(1.2)$$

that is, $\widetilde m(x) = \sum_{i=1}^n w_i(x) y_i \big/ \sum_{i=1}^n w_i(x)$, for a suitable set of local weights $w_i(x)$. These are to attach most weight to pairs $(x_i, y_i)$ where $x_i$ is close to $x$ and little or zero weight to pairs where $x_i$ is some distance away from $x$. A simple way of defining such weights is via a kernel function $K(\cdot)$, some chosen probability density function. Typical kernels will be unimodal and symmetric and at least continuous at zero. To fix ideas we also take $K$ to have bounded support, which we may scale so as to be $[-\frac12, \frac12]$.
We choose to define

$$w_i(x) = \bar K(h^{-1}(x_i - x)), \quad \text{where } \bar K(z) = K(z)/K(0). \eqno(1.3)$$

Of course scale factors do not matter in (1.2), but it is helpful both for interpretation and for later developments to scale the weights in this way. We think of $w_i(x)$ as a measure of influence for data pair $(x_i, y_i)$ when estimation is carried out at location $x$, and with scaling as per (1.3), $w_i(x)$ is close to 1 for pairs where $x_i$ is near $x$ and equal to 0 for pairs where $|x_i - x| > \frac12 h$. The local weighted mean estimator (1.2) is the Nadaraya–Watson (NW) estimator; see for example Scott (1992, Ch. 8) for another derivation and for further discussion.

Method (1.2) fits a local constant to the local $(x_i, y_i)$ pairs. The second standard method is the natural extension which fits a local regression line to the local data, that is,

$$\widetilde m(x) = \widetilde a, \quad \text{where } (\widetilde a, \widetilde b) \text{ minimise } \sum_{|x_i - x| \le h/2} \{y_i - a - b(x_i - x)\}^2 w_i(x). \eqno(1.4)$$

See Wand and Jones (1994, Ch. xx) for performance properties and further discussion of this local linear (LL) regression estimator. Obviously the idea generalises further, to other local parametric forms for the regression curve, to other criterion functions to minimise, and to other regression models.

1.2. Local likelihood functions. The basic Bayesian idea to be developed is to place local priors on the local parameters, and use posterior means as estimators. This requires a 'local likelihood' function that expresses the information content of the local data. Suppose in general terms that there is some parametric model $f(y_i \,|\, x_i, \beta, \sigma)$, typically in terms of regression parameters $\beta$ and a scale parameter $\sigma$, that is trusted locally around a given $x$, that is, for $x_i \in N(x) = [x - \frac12 h, x + \frac12 h]$. The likelihood for these local data is then $\prod_{x_i \in N(x)} f(y_i \,|\, x_i, \beta, \sigma)$. This can also be written

$$L_n(x, \beta, \sigma) = \prod_{x_i \in N(x)} f(y_i \,|\, x_i, \beta, \sigma)^{w_i(x)}, \eqno(1.5)$$

with weights as in (1.3) for the uniform kernel on $[-\frac12, \frac12]$. We will also use (1.5) for more general kernel functions, requiring only that $\bar K$ is continuous at zero with 'correct level' $\bar K(0) = 1$, and call $L_n(x, \beta, \sigma)$ the local kernel smoothed likelihood at $x$. The argument is that it is a natural smoothing generalisation of the bona fide one that uses uniform weights, and that the local parametric model employed is sometimes to be trusted less a little distance away from $x$ than close to $x$. Further motivation is provided by the fact that the resulting maximum local likelihood estimators sometimes have better statistical behaviour for non-uniform choices of kernel function; the standard local likelihood with uniform weights is for example not continuous in $x$. See further comments in Sections 8.3–8.4 below. Also note that when $h$ grows large all weights become equal to 1, and we are back in familiar fully parametric territory. Observe that when the local model is normal with constant variance, then the 'local constant mean' model gives a local likelihood with maximum equal to the NW estimator (1.2), and the 'local linear mean' model correspondingly gives the LL estimator (1.4) an interpretation as maximum local likelihood estimator. This is discussed more fully in Sections 2 and 3.
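To make the two standard estimators concrete before the Bayesian machinery is added, here is a minimal NumPy sketch of (1.2)–(1.4); the function names and the default uniform kernel are our own illustrative choices, not part of the paper:

```python
import numpy as np

def kernel_weights(x, xs, h, K=lambda z: np.where(np.abs(z) <= 0.5, 1.0, 0.0)):
    """Scaled local weights w_i(x) = K((x_i - x)/h)/K(0) of (1.3);
    the default K is the uniform kernel on [-1/2, 1/2]."""
    return K((xs - x) / h) / K(0.0)

def nw_estimate(x, xs, ys, h):
    """Nadaraya-Watson estimator (1.2): the local weighted mean of the y_i."""
    w = kernel_weights(x, xs, h)
    return np.sum(w * ys) / np.sum(w)

def ll_estimate(x, xs, ys, h):
    """Local linear estimator (1.4): weighted least squares fit of
    a + b(x_i - x) around x; the fitted level a estimates m(x)."""
    w = kernel_weights(x, xs, h)
    X = np.column_stack([np.ones_like(xs), xs - x])
    S = X.T @ (w[:, None] * X)            # the 2x2 matrix S(x) of Section 3.1
    a, b = np.linalg.solve(S, X.T @ (w * ys))
    return a
```

Both functions return the fitted level at a single point $x$; sweeping $x$ over a grid traces out the estimated curve, and a non-uniform kernel of bounded support can simply be passed in as $K$.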
1.3. General Bayesian construction. A 'locally parametric Bayesian regression programme' can be outlined quite broadly, but we will presently do so in terms of a local vehicle model of the form $f(y \,|\, x, \beta, \sigma)$. It comprises several different and partly related steps.

(a) Come up with a complete 'start curve' function $m_0(x)$, and give a prior for the scale parameter $\sigma$. This start curve is thought of as any plausible candidate or 'prior estimate' for the regression function $m(x)$, and would also be called the 'prior guess function' in Bayesian parlance.

(b) For each given $x$, place a prior on the local regression parameter $\beta = \beta_x$, typically a normal with a centre parameter determined by the start curve function and a covariance matrix of the form $\sigma^2 W_{0,x}^{-1}$.

(c) Do the basic Bayesian calculation and obtain the Bayes estimate $\widehat m(x)$ as the local posterior mean, that is, the mean in the distribution of $\mathrm{E}(y \,|\, x, \beta, \sigma)$ given local data. Further inference (finding credibility intervals and so on) can be carried out using the same distribution.

This is 'so far, so good', and the problem is solved for the 'ideal Bayesians' who can accurately specify the $m_0(\cdot)$ function and the precision parameters $W_{0,x}$. Step (c) will indeed give estimators of interest. Not every practising statistician can come up with the required parameters of the local priors, however. We therefore add two more steps to the programme:

(d) Obtain estimates of local precision parameters $W_{0,x}$ using empirical Bayes calculations, and still using the start curve function $m_0(x)$.

(e) To account for uncertainty in specifying the start curve, and to obtain an estimate thereof, use a hierarchical Bayesian approach, by having a background prior on this curve.

There will be several possible versions of (d) and (e) here. We shall be concerned with versions of (e) that are in terms of a parametric start curve function $m_0(x, \xi)$, with a first-stage prior on the $\xi$ parameter. We shall also be primarily interested in versions of (d) and (e) that are reasonably 'automatic' and objective in the sense that they do not require subjective specification of prior parameters. That is, a flat prior will typically be used for the parameters present in $m_0(x, \xi)$, and the empirical Bayes methods in (d) will typically be based only on the unconditional distribution for collections of local data, and not, for example, on further priors for the parameters in question. Modifications are available under strong prior opinions, though.

1.4. Other work. Local regression methods go back at least to Stone (1977) and Cleveland (1979). See Fan and Gijbels (1992), Hastie and Loader (1993), Ruppert and Wand (1994) and Wand and Jones (1994) for recent developments. Local likelihood methods were first explicitly introduced by Tibshirani; see Tibshirani and Hastie (1987) and the brief discussion in Hastie and Tibshirani (1990, Ch. 6). Fully parametric Bayesian regression methods, corresponding in the present context to a large window width $h$, are well known; see for example Box and Tiao (1973). Bayesian versions of local regression methods do not seem to have been considered before.
Similarly spirited Bayesian locally parametric estimation methods for hazard rates and probability densities have however been proposed and discussed in Hjort (1994b), using suitable local likelihood constructions for such, developed in respectively Hjort (1994a) and Hjort and Jones (1994). The main Bayesian-like nonparametric regression method is that of splines; see Silverman (1985), Wahba (1990) and the discussion in Hastie and Tibshirani (1990, Ch. 3). Besides splines there does not seem to have been much work done in semi- and nonparametric Bayesian regression at all; a recent review paper on general Bayesian nonparametric methods by Ferguson, Phadia and Tiwari (1992) barely touches regression. Some methods have recently been developed using mixtures of Dirichlet processes; see Erkanli, Müller and West (1994) and West, Müller and Escobar (1994).

1.5. The present paper. Section 2 goes through the Bayesian regression programme for the 'local constant' model, which is the most transparent case. It gives specific proposals for interpolating between a given start curve and the NW estimator, with weighting schemes that come from empirical Bayes considerations and that have goodness of fit interpretations. In the end the start curve is averaged over with respect to its posterior distribution. Section 3 similarly treats the 'local linear' model, where somewhat more technical calculations are called for. Bayesian and empirical Bayesian generalisations of the LL estimator (1.4) are obtained. To illustrate the various ingredients in the complete estimation programme a case of a linearly structured start curve is studied in Section 4. Other specialisations of the scheme are considered in Section 5. One particular version of interest models the regression curve locally as being of the form a start curve, say with globally estimated parameters, times a local correction factor, and produces in the end estimators that resemble Bayesian relatives of a frequentist method recently proposed in Hjort and Glad (1994). The Bayesian methods might be especially fruitful in situations with several covariates, since many of the standard methods based on local smoothing have severe difficulties then. This is briefly discussed in Section 6. To show that the general apparatus also works well in regression models other than the traditional one, we consider Poisson regression in some detail in Section 7, and give brief pointers to other types of applications. Section 8 presents some additional results and remarks. Matters dealt with include fine-tuning of parameters, parallels to Stein-type shrinkage estimators, and discussion of the kernel smoothed local likelihood approach. Some of the comments suggest problems for further research. Finally conclusions are offered in Section 9.

2. Inference for the 'local level' model. Let the local model be the normal with constant variance and local mean function $m(t) = a$ for $t \in N(x)$. The local kernel smoothed likelihood becomes

$$L_n(x, a, \sigma) = \prod_{x_i \in N(x)} \Bigl[\frac{1}{\sigma}\exp\Bigl\{-\frac12 \frac{1}{\sigma^2}(y_i - a)^2\Bigr\}\Bigr]^{w_i(x)} = \frac{1}{\sigma^{s_0(x)}}\exp\Bigl\{-\frac12 \frac{1}{\sigma^2} Q(x, a)\Bigr\}$$

(ignoring constant factors), where $s_0(x) = \sum_{N(x)} w_i(x)$ and $Q(x, a) = \sum_{N(x)} (y_i - a)^2 w_i(x)$. Note that the maximum local likelihood estimator is equal to the NW estimator of (1.2), $\widetilde m(x) = \widetilde m_{\mathrm{NW}}(x)$.
2.1. Basic local Bayesian calculation. As the local prior for $a$ at $x$ we use a normal with mean $m_0(x)$ and variance $\sigma^2/w_0$. The precision parameter $w_0$ will be allowed to vary with $x$, see the following subsections, but at the moment $x$ is fixed. Start out rewriting

$$Q(x, a) = \sum_{N(x)}\{y_i - \widetilde m(x)\}^2 w_i(x) + \sum_{N(x)}\{a - \widetilde m(x)\}^2 w_i(x) = Q_0(x) + s_0(x)\{a - \widetilde m(x)\}^2. \eqno(2.1)$$

Using this it is not difficult to derive that $a$ given the local data, and conditional on $\sigma$, is another normal, centred at

$$\widehat m(x) = \mathrm{E}\{a \,|\, \text{local data}, \sigma\} = \frac{w_0 m_0(x) + s_0(x)\widetilde m(x)}{w_0 + s_0(x)} = \rho(x) m_0(x) + \{1 - \rho(x)\}\widetilde m(x). \eqno(2.2)$$

This is the Bayes estimator (since it does not depend on $\sigma$), as per Step (c) in the general scheme described in Section 1.3.

While the local posterior mean (2.2) is the essential ingredient, as far as computation of the Bayes estimate is concerned, we also go to the trouble of noting the following expressions for the simultaneous density of $a = a_x$ and local data, conditional on $\sigma$:

$$\frac{w_0^{1/2}}{\sigma}\frac{1}{\sigma^{s_0(x)}}\exp\Bigl(-\frac12 \frac{1}{\sigma^2}\bigl[w_0\{a - m_0(x)\}^2 + s_0(x)\{a - \widetilde m(x)\}^2 + Q_0(x)\bigr]\Bigr)$$
$$= \frac{\{w_0 + s_0(x)\}^{1/2}}{\sigma}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\{w_0 + s_0(x)\}\{a - \widehat m(x)\}^2\Bigr]\,\frac{w_0^{1/2}}{\{w_0 + s_0(x)\}^{1/2}}\frac{1}{\sigma^{s_0(x)}}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\Bigl\{Q_0(x) + \frac{w_0 s_0(x)}{w_0 + s_0(x)}\{\widetilde m(x) - m_0(x)\}^2\Bigr\}\Bigr].$$

This will be useful later in connection with estimation of $\sigma$ and specification of $w_0$. And, in particular, a fuller description of the local posterior is

$$m(x) \,|\, \text{local data}, \sigma \sim N\bigl[\widehat m(x),\; \sigma^2/\{w_0 + s_0(x)\}\bigr]. \eqno(2.3)$$

With a Gamma prior for $1/\sigma^2$ this also leads to a suitable $t$ distribution for $a$ given local data; see Section 8.2.

The Bayes estimator (2.2) is a convex combination of start curve and the NW estimator. It pushes the NW estimator towards the start curve with a strength determined by the prior precision parameter $w_0$. Note that the data strength parameter $s_0(x)$ can be expressed as $nh f_n(x)/K(0)$, where $f_n(x) = n^{-1}\sum_{i=1}^n h^{-1} K(h^{-1}(x_i - x))$ is the classical kernel estimator of the density $f$ for the $x$'s. The NW estimator corresponds to having $w_0$ close to or equal to zero, which means a flat and noninformative prior for the local level. Also note that when $h$ is large, then $\widetilde m(x)$ is the mean of the $y_i$'s and $s_0(x) = n$, and the Bayes solution is as for the familiar fully parametric case.
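In code, the posterior (2.2)–(2.3) is a one-line convex combination once the local weights are summed; the following sketch (our notation, uniform kernel weights for simplicity) returns the Bayes estimate together with the factor multiplying $\sigma^2$ in the posterior variance:

```python
import numpy as np

def local_level_bayes(x, xs, ys, h, m0_x, w0):
    """Local constant model: Bayes estimate (2.2) and the posterior
    variance factor from (2.3); m0_x = m0(x), w0 = prior precision."""
    w = (np.abs(xs - x) <= h / 2).astype(float)   # uniform-kernel w_i(x), cf. (1.3)
    s0 = w.sum()
    m_nw = (w * ys).sum() / s0                    # NW estimate = max local likelihood
    rho = w0 / (w0 + s0)                          # weight given to the start curve
    m_hat = rho * m0_x + (1.0 - rho) * m_nw       # posterior mean (2.2)
    return m_hat, 1.0 / (w0 + s0)                 # posterior variance = sigma^2/(w0+s0)
```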
It is intuitively clear that the Bayes solution $\widehat m(x)$ has better precision than the standard estimator $\widetilde m(x)$ as long as the true value $m(x)$ is not too far from the prior estimate value $m_0(x)$. A formal investigation of this can study their risk functions under squared error loss, that is, the two mean squared errors. This is done in Section 8.1 below. For the present moment note that

$$\mathrm{E}_{m(x)}\{\widetilde m(x) - m(x)\}^2 = \sigma^2 t_0(x)/s_0(x)^2, \quad \text{where } t_0(x) = \sum_{N(x)} w_i(x)^2.$$

This last quantity can be written $nh R_K/K(0)^2$ times a kernel estimate of $f$, where $R_K = \int K(z)^2\,\mathrm{d}z$, so the mean squared error is approximately $R_K\sigma^2/\{nh f(x)\}$ (under the constant local mean model). The point is that the standard estimator is perhaps acceptable in regions of high $x_i$-density, but quite variable in regions of low $x_i$-density. This indicates that the Bayes estimate, which has expected risk $\sigma^2/\{w_0 + s_0(x)\}$, is likely to make a significant improvement in situations where $f$ is small, where $w_0$ is not small, and where the start curve value $m_0$ is not far off. In the following, methods will be given that make both $w_0$ and $m_0$ dictated by data.

2.2. Estimating σ and prior precision parameters. The Bayes solution (2.2) needs specification of the prior strength parameter $w_0$. In some situations one could perceivably specify this number based on previous data sets or other prior considerations, as in purist Bayes analysis. It is of considerable interest to develop more automatic and data-dictated methods, however. Information about the scale parameter $\sigma$ is also needed, in the form of an estimate or a posterior density, in order to carry out further inference about $m(x)$; see (2.3).

The empirical Bayes idea is to infer parameters of the prior from the unconditional distribution of data. Given the local constant level $a$, the NW estimator has mean $a$ and variance $\sigma^2 t_0(x)/s_0(x)^2$. It follows that its unconditional mean is $m_0(x)$ and unconditional variance $\sigma^2\{t_0(x)/s_0(x)^2 + 1/w_{0,x}\}$, that is,

$$P_0(x) = s_0(x)\{\widetilde m(x) - m_0(x)\}^2 \quad \text{has} \quad \mathrm{E}\{P_0(x) \,|\, \sigma\} = \sigma^2\Bigl\{\frac{t_0(x)}{s_0(x)} + \frac{s_0(x)}{w_{0,x}}\Bigr\}, \eqno(2.4)$$

writing now $w_{0,x}$ for the precision parameter at $x$. With an estimate of $\sigma$ this can be used to assign values for different $w_{0,x}$'s. We shall arrive at versions of this scheme below.

The distribution of local data $D(x)$ alone, given $\sigma$ and $w_{0,x}$, is obtained by integrating out $a$ from the expression that led to (2.3). The result is

$$\rho(x)^{1/2}\sigma^{-s_0(x)}\exp[-(2\sigma^2)^{-1}\{Q_0(x) + \rho(x) P_0(x)\}],$$

in terms of the $\rho(x) = w_{0,x}/\{w_{0,x} + s_0(x)\}$ parameter. This can be maximised over possible values of $\sigma$ and $w_{0,x}$, with result

$$\widetilde\sigma_x^2 = \frac{Q_0(x)}{s_0(x) - 1} \quad \text{and} \quad \widetilde\rho(x) = \frac{\widetilde\sigma_x^2}{P_0(x)}.$$

The $\widetilde\rho(x)$ is set equal to 1 if $\widetilde m(x)$ is so close to $m_0(x)$ that the ratio exceeds 1. This $\sigma$ estimate is local to $x$ and of some separate interest, for example for checking of the constant variance assumption. A better estimate emerges by combining information over many neighbourhoods. Divide the interval in which data fall into say $k$ such cells $N(x)$, say with midpoints $x_{0,1} < \cdots < x_{0,k}$ and lengths $h_1, \ldots, h_k$. Let $I$ be the collection of these midpoints, so that

$$\text{full data interval} = \bigcup_{x \in I} N(x) \quad \text{and} \quad \{x_1, \ldots, x_n\} = \bigcup_{x \in I} D(x).$$

Assume now that the local levels $a_1, \ldots, a_k$ at $x_{0,1}, \ldots, x_{0,k}$ are taken to be independent in their joint prior distribution (and see Section 5.4 for an alternative). This leads to a combined likelihood of the form

$$\prod_{x \in I}\bigl\{\rho(x)^{1/2}\sigma^{-s_0(x)}\exp[-(2\sigma^2)^{-1}\{Q_0(x) + \rho(x) P_0(x)\}]\bigr\},$$

conditional on $\sigma$ and the $w_{0,x}$'s at the different midpoint locations. Some analysis reveals that this is maximised for

$$\widetilde\sigma^2 = \frac{\sum_{x \in I} Q_0(x)}{\sum_{x \in I}\{s_0(x) - 1\}} \quad \text{and} \quad \widetilde\rho(x) = \frac{\widetilde\sigma^2}{s_0(x)\{\widetilde m(x) - m_0(x)\}^2} = \frac{\widetilde\sigma^2}{P_0(x)}. \eqno(2.5)$$

Notice that this agrees very well with (2.4), especially when uniform weights are used, in which case $t_0(x)/s_0(x) = 1$.
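The recipe (2.5) is easy to mechanise. A sketch under uniform weights, with the interval pre-divided into cells around given midpoints (the function name and cell handling are our own choices):

```python
import numpy as np

def empirical_bayes_fit(xs, ys, m0, midpoints, h):
    """Empirical Bayes estimates (2.5): pooled sigma^2 over the k cells and
    rho(x) = sigma^2/P0(x) at each midpoint, truncated at 1.
    m0 is the start curve function; uniform kernel weights assumed."""
    Q0_sum, dof_sum, cell_stats = 0.0, 0.0, []
    for x in midpoints:
        w = (np.abs(xs - x) <= h / 2).astype(float)
        s0 = w.sum()
        m_nw = (w * ys).sum() / s0
        Q0_sum += (w * (ys - m_nw) ** 2).sum()   # local residual sum Q0(x)
        dof_sum += s0 - 1.0
        cell_stats.append((x, s0, m_nw))
    sigma2 = Q0_sum / dof_sum                    # pooled variance estimate, (2.5)
    rho = []
    for x, s0, m_nw in cell_stats:
        P0 = s0 * (m_nw - m0(x)) ** 2            # local goodness-of-fit statistic
        rho.append(1.0 if P0 <= sigma2 else sigma2 / P0)
    return sigma2, np.array(rho)
```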
Note that the $\sigma$ estimate is independent of the start curve $m_0$, whereas the $\rho(x)$ estimate quite naturally measures fit of data to the start curve, locally at $x$. Again the $\widetilde\rho(x)$ is truncated at 1; if $s_0(x)^{1/2}|\widetilde m(x) - m_0(x)| < \widetilde\sigma$, then the prior model fits excellently at $x$, and one uses $\widetilde\rho(x) = 1$, or $\widehat w_{0,x} = \infty$, in (2.5). If $P_0(x)^{1/2} > \widetilde\sigma$, then the Bayes-empirical-Bayes estimate becomes

$$\widehat m(x) = \frac{\widetilde\sigma^2}{P_0(x)} m_0(x) + \Bigl\{1 - \frac{\widetilde\sigma^2}{P_0(x)}\Bigr\}\widetilde m(x) = \widetilde m(x) - \frac{\widetilde\sigma^2}{s_0(x)\{\widetilde m(x) - m_0(x)\}}. \eqno(2.6)$$

Often the prior strength parameter $w_{0,x}$ would naturally be considered to be a smoothly changing function of $x$, and then the estimate implicitly given above will be too rugged. One may use various post-smoothing devices for $w_{0,x}$ or its relative $\rho(x)$. The local goodness of fit statistics $P_0(x)$ have been scaled so as to have approximately the same variance, which invites using $\widetilde\sigma^2$ divided by an average of these as a single measure of the overall faithfulness of observed data to the start curve used. This leads to an estimator of the form

$$\widehat m(x) = \frac{k\widetilde\sigma^2}{\sum_{x \in I} s_0(x)\{\widetilde m(x) - m_0(x)\}^2} m_0(x) + \Bigl[1 - \frac{k\widetilde\sigma^2}{\sum_{x \in I} s_0(x)\{\widetilde m(x) - m_0(x)\}^2}\Bigr]\widetilde m(x), \eqno(2.7)$$

where both weights are truncated to $[0, 1]$ if necessary. The Steinean overtones already discernible above are now more audible; see Section 8.1 below for further discussion.

Of course non-uniform averaging is sometimes more appropriate when forming the denominator here. A satisfactory solution would in many situations be to model the prior variance $\sigma^2/w_{0,x}$ at $x$ in a parametric fashion. The strategy is to estimate the necessary parameters either by regressing $P_0(x)/\widetilde\sigma^2$ against $t_0(x)/s_0(x) + s_0(x)/w_{0,x}$ at different $x$ positions, or by maximising the combined likelihood

$$\prod_{x \in I}\Bigl\{\frac{w_{0,x}}{w_{0,x} + s_0(x)}\Bigr\}^{1/2}\frac{1}{\sigma^{\sum_{x \in I} s_0(x)}}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\Bigl\{\sum_{x \in I} Q_0(x) + \sum_{x \in I}\frac{w_{0,x}}{w_{0,x} + s_0(x)} P_0(x)\Bigr\}\Bigr]$$

with respect to the parameters present in $w_{0,x}$. In the end one uses (2.2) with inferred values for $w_{0,x}$ (or, if one prefers, inferred values for $\rho(x)$).

For an example, suppose the prior variance at $x$ is modelled as $\sigma^2 r(x)/w_0$ for a known function $r(x)$. Plugging in $w_{0,x} = w_0/r(x)$ in the combined likelihood leads to one feasible solution, but a simpler one is based on $P_0(x)/\widetilde\sigma^2 \simeq 1 + s_0(x) r(x)/w_0$, assuming uniform weights to be used. Since the $P_0(x)$'s have the same variability it is natural to equate the clean average $\bar P_0/\widetilde\sigma^2$ to $1 + c/w_0$ to define the empirical Bayes estimate $\widehat w_0$, where $\bar P_0 = k^{-1}\sum_{x \in I} P_0(x)$ and $c = k^{-1}\sum_{x \in I} s_0(x) r(x)$. This leads to

$$\widehat m(x) = \frac{m_0(x)}{1 + s_0(x) r(x) c^{-1}\{\bar P_0/\widetilde\sigma^2 - 1\}} + \frac{s_0(x) r(x) c^{-1}\{\bar P_0/\widetilde\sigma^2 - 1\}}{1 + s_0(x) r(x) c^{-1}\{\bar P_0/\widetilde\sigma^2 - 1\}}\widetilde m(x). \eqno(2.8)$$

This is quite similar to (2.7). If in particular the prior variance at $x$ is seen as approximately inversely proportional to the density $f(x)$ at $x$, then $r(x) s_0(x)$ is approximately constant, and (2.8) reduces to (2.7). As earlier the weights of this Stein-type estimator are truncated to $[0, 1]$.
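Estimator (2.7) uses one overall goodness-of-fit number; a short sketch (our names; uniform weights; `sigma2` as produced by the (2.5) step above):

```python
import numpy as np

def stein_type_estimate(x, xs, ys, m0, midpoints, h, sigma2):
    """Global-shrinkage estimator (2.7): a single data-determined weight
    k*sigma2 / sum_x P0(x), truncated to [0, 1], pulls the NW estimate
    towards the start curve m0."""
    P0_total = 0.0
    for xm in midpoints:                          # overall lack of fit to m0
        w = (np.abs(xs - xm) <= h / 2).astype(float)
        s0 = w.sum()
        m_nw = (w * ys).sum() / s0
        P0_total += s0 * (m_nw - m0(xm)) ** 2
    lam = min(1.0, len(midpoints) * sigma2 / P0_total)   # weight on m0
    w = (np.abs(xs - x) <= h / 2).astype(float)
    m_nw = (w * ys).sum() / w.sum()
    return lam * m0(x) + (1.0 - lam) * m_nw
```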
The empirical Bayesian attitude in this subsection has perhaps been more 'classical' than 'pure Bayesian'; unconditional likelihoods have been maximised rather than used in connection with additional priors. Maximising the likelihood is equivalent to using the Bayes solution with a flat prior for the parameters, under a sharp 0–1 loss function, so the procedures suggested above for getting hold of the $w_{0,x}$'s can be viewed as Bayesian but, consciously, with no additional prior information about $\sigma$ or the $w_{0,x}$'s. Bayesian analysis with a Gamma prior for $1/\sigma^2$ is technically convenient, and it is reassuring to see that the noninformative prior version of this leads to exactly the same estimates as in (2.5). More generally, maximising the derived likelihood for data alone, with respect to parameters present in the $w_{0,x}$'s, gives the same results as when maximising over the joint likelihood that includes $\sigma$. Yet other ways of estimating $\sigma$ and the $w_{0,x}$'s are briefly discussed in Section 8.2.

2.3. Prior uncertainty around the start curve. The estimator that has so far been developed is the Bayes estimator (2.2) with inserted empirical Bayes estimate, say $\widehat w_{0,x}$, for prior precision. The start curve function $m_0(x)$ enters both (2.2) and the $\widehat w_{0,x}$ operation crucially. There is usually uncertainty around the choice of the $m_0$ curve. A two-stage prior framework is to view $m_0(\cdot)$ as the result of some background prior process. The final estimator is then, in principle, to average the estimator just described over the posterior distribution for $m_0(\cdot)$ given all data.

Suppose that $m_0(\cdot)$ is modelled parametrically, say $m_0(x, \xi)$, with a background prior $\pi_0(\xi)$ for $\xi$. The regression curve estimator is

$$\widehat m(x, \xi) = \frac{\widehat w_{0,x}(\xi) m_0(x, \xi) + s_0(x)\widetilde m(x)}{\widehat w_{0,x}(\xi) + s_0(x)},$$

conditional on $\xi$. The final estimator is accordingly

$$\widehat m(x) = \mathrm{E}\{\widehat m(x, \xi) \,|\, \text{all data}\} = \int \frac{\widehat w_{0,x}(\xi) m_0(x, \xi) + s_0(x)\widetilde m(x)}{\widehat w_{0,x}(\xi) + s_0(x)}\,\pi_0(\xi \,|\, \text{all data})\,\mathrm{d}\xi. \eqno(2.9)$$

This would have to be computed through numerical integration or simulation. In situations where a sufficient amount of good data gives a reasonably concentrated posterior density $\pi_0(\xi \,|\, \text{all data})$, say around the Bayes estimate $\widehat\xi$, a simple approximation is to plug this in, and use

$$\widehat m(x, \widehat\xi) = \frac{\widehat w_{0,x}(\widehat\xi) m_0(x, \widehat\xi) + s_0(x)\widetilde m(x)}{\widehat w_{0,x}(\widehat\xi) + s_0(x)}.$$

Further approximations could also be contemplated, for example using a quadratic approximation of the integrand around the Bayes estimate for $\xi$.

We need to comment on what is meant by the density of $\xi$ given all data here. One could base this on the parametric likelihood $\sigma^{-n}\exp[-(2\sigma^2)^{-1}\sum_{i=1}^n\{y_i - m_0(x_i, \xi)\}^2]$, but this is not entirely satisfactory, since the point in the end is to be nonparametric; the $m_0(x, \xi)$ model is only meant as an initial description. It is better, therefore, to view $\pi_0(\cdot)$ as a prior for the 'least false' parameter vector $\xi_0$ that gives the best parametric approximation to the true $m(x)$ curve, in the sense that it minimises the long-term version of $n^{-1}\sum_{i=1}^n\{m(x_i) - m_0(x_i, \xi)\}^2$.
The approximate distribution of the maximum likelihood estimator $\widetilde\xi$, say $L_n(\widetilde\xi \,|\, \xi_0)$, can often be worked out, even outside the conditions of the parametric model; see Hjort and Pollard (1994, Sections 3 and 4). This suggests employing the distribution of $\xi$ given its maximum likelihood estimate $\widetilde\xi$ in (2.9), that is, using $\pi_0(\xi) L_n(\widetilde\xi \,|\, \xi)\big/\int \pi_0(\xi) L_n(\widetilde\xi \,|\, \xi)\,\mathrm{d}\xi$ for $\pi_0(\xi \,|\, \text{all data})$.

For a specific example that illustrates these calculations, suppose that $m_0(x, \xi)$ is modelled as $\xi_1 + \xi_2 g_2(x) + \cdots + \xi_p g_p(x)$, in terms of given basis functions $g_2, \ldots, g_p$. The explicit maximum likelihood estimator is $\widetilde\xi = (\sum_{i=1}^n z_i z_i')^{-1}\sum_{i=1}^n z_i y_i$, where $z_i = z(x_i)$ is the $p$-vector $(1, g_2(x_i), \ldots, g_p(x_i))'$. Under mild technical assumptions the estimator is consistent for the least false parameter vector $\xi_0$ described above; indeed $\xi_0$ can be expressed as $\{\mathrm{E}\,z(x_i) z(x_i)'\}^{-1}\mathrm{E}\,z(x_i) y_i$. Results from Hjort and Pollard (1994, Section 3) imply furthermore that it is approximately a multinormal, centred at $\xi_0$ and with a covariance matrix $\widetilde V/n$, where

$$\widetilde V = \Bigl(n^{-1}\sum_{i=1}^n z_i z_i'\Bigr)^{-1}\Bigl[n^{-1}\sum_{i=1}^n\{y_i - m_0(x_i, \widetilde\xi)\}^2 z_i z_i'\Bigr]\Bigl(n^{-1}\sum_{i=1}^n z_i z_i'\Bigr)^{-1}. \eqno(2.10)$$

The present use of this is that if the parameter vector $\xi$ is given a suitable prior, for example a uniform reference prior, and viewed necessarily as a prior for the least false $\xi_0$, then the posterior distribution is approximately a multinormal, centred at the estimate $\widetilde\xi$ and with the same covariance matrix $\widetilde V/n$ as above. This follows from general arguments and results provided in Hjort and Pollard (1994, Section 4). The approximation is best when the prior is the noninformative uniform one, but is valid for any fixed prior as long as its covariance matrix is large compared to $\widetilde V/n$. Modifications are available to account for strong prior opinions about $\xi$.

The final Bayes regression estimator (2.9) can now be computed as follows. Draw say 100 values of $\xi$ from the $N_p\{\widetilde\xi, \widetilde V/n\}$ distribution. This gives 100 start curves $m_0(x, \xi)$ that are all likely given the full data information. For each of the 100 likely start curves the algorithm in question is used to compute the $\widehat m(x, \xi)$ curve. In the end these are averaged. See Section 4 below for illustrations.

The discussion above used $L_n(\widetilde\xi \,|\, \xi)$ to get to $\pi_0(\xi \,|\, \text{all data})$. An alternative route worthy of at least a brief discussion is to use the marginal distribution of the combined local data sets $D(x)$ as the likelihood for the $\xi$ parameters present in $m_0(\cdot)$. The relevant part of this likelihood becomes

$$\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\Bigl\{\sum_{x \in I}\frac{w_{0,x} s_0(x)}{w_{0,x} + s_0(x)}\{\widetilde m(x) - \xi' z(x)\}^2\Bigr\}\Bigr].$$

Some calculations transfer this to a multinormal likelihood, with mean parameter $B^{-1}\sum_{x \in I}\rho(x) s_0(x) z(x)\widetilde m(x)$ and with covariance matrix $\sigma^2 B^{-1}$, where $B = \sum_{x \in I}\rho(x) s_0(x) z(x) z(x)'$. One may now show that the posterior density of $\xi$ derived using this likelihood is approximately the same as the one used above, if the $m_0(x, \xi)$ model is approximately correct. We stick to the first method, mainly since we aim at valid inference outside the conditions of any narrow parametric model.
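A sketch of this simulation step, assuming the linear-in-basis start curve just described: it computes $\widetilde\xi$ and the sandwich matrix (2.10) and draws the 'likely' parameter vectors; averaging the resulting $\widehat m(\cdot, \xi)$ curves then gives (2.9). The helper names are ours.

```python
import numpy as np

def draw_start_parameters(xs, ys, basis, rng, draws=100):
    """Draw xi ~ N(xi_tilde, V_tilde/n) with V_tilde the sandwich (2.10);
    basis(x) returns z(x) = (1, g_2(x), ..., g_p(x))'."""
    Z = np.array([basis(x) for x in xs])             # n x p design matrix
    n = len(ys)
    xi_tilde = np.linalg.solve(Z.T @ Z, Z.T @ ys)    # maximum likelihood estimate
    A = Z.T @ Z / n
    resid2 = (ys - Z @ xi_tilde) ** 2
    B = (Z * resid2[:, None]).T @ Z / n              # middle matrix of (2.10)
    V = np.linalg.solve(A, np.linalg.solve(A, B).T)  # A^{-1} B A^{-1} (A symmetric)
    return rng.multivariate_normal(xi_tilde, V / n, size=draws)
```

Here `rng` is a NumPy generator such as `np.random.default_rng()`; each returned row is one likely start curve parameter $\xi$, and the Bayes-empirical-Bayes curve $\widehat m(\cdot, \xi)$ is computed per row and the curves averaged.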
Another nice facet of the first method, in view of the calculations just discussed, is that it does not need any extra binning of data.

Remark. The development above illustrates the crucial Step (e) of the Bayesian programme, as outlined in Section 1.3, with a prior for a parametrised start curve. This is carried further in Section 4 below. One might also use a nonparametric prior process for the start curve $m_0(\cdot)$, for example in the particular Gaußian form that leads to splines; see Hastie and Tibshirani (1990, Ch. 3) for some discussion. Given data the curve is Gaußian with specified parameters and it is possible to simulate from this posterior. This makes it possible to obtain the final Bayes regression curve in analogy to (2.9). This nonparametric posterior distribution would have larger variance than in the parametric case, however, and the whole procedure ends up being a nonparametric attempt at correcting a nonparametric construction. In many cases such a method would probably not accomplish very much, and we view the benefits of working with a parametrised start curve, where a nonparametric correction is performed on a parametric initial construction, as more promising.

3. Inference for the local linear model. The local model worked with in the present section is again the normal with constant variance, but this time with a locally linear regression, say $m(t) = a + b(t - x)$ for $t \in N(x)$. In this section $(\widetilde m(x), \widetilde b(x)) = (\widetilde m_{\mathrm{LL}}(x), \widetilde b_{\mathrm{LL}}(x))$ is the $(\widetilde a, \widetilde b)$ computed from the LL method (1.4) at $x$.

3.1. Prerequisites. The story to evolve is quite similar to that of Section 2, but with somewhat more involved algebraic calculations. We start with the local kernel smoothed likelihood, which is $\sigma^{-s_0(x)}\exp\{-(2\sigma^2)^{-1} Q(x, a, b)\}$, where

$$Q(x, a, b) = \sum_{N(x)}\{y_i - a - b(x_i - x)\}^2 w_i(x).$$

We need to be more technically specific about the maximum local likelihood estimators here, which are the ones already mentioned in (1.4). Introduce

$$S(x) = \begin{pmatrix} s_0(x) & s_1(x) \\ s_1(x) & s_2(x) \end{pmatrix},$$

where $s_0(x)$, $s_1(x)$ and $s_2(x)$ are the local sums of respectively $w_i(x)$, $w_i(x)(x_i - x)$ and $w_i(x)(x_i - x)^2$. Then

$$\begin{pmatrix} \widetilde a \\ \widetilde b \end{pmatrix} = \begin{pmatrix} s_0(x) & s_1(x) \\ s_1(x) & s_2(x) \end{pmatrix}^{-1}\begin{pmatrix} \sum_{N(x)} w_i(x) y_i \\ \sum_{N(x)} w_i(x)(x_i - x) y_i \end{pmatrix},$$

and the LL regression estimate itself is

$$\widetilde m(x) = \widetilde a = \{s_0(x) s_2(x) - s_1(x)^2\}^{-1}\sum_{N(x)} w_i(x)\{s_2(x) - s_1(x)(x_i - x)\} y_i.$$

Some comments on the relative sizes of $s_0$, $s_1$, $s_2$ here are given in Section 8.6.

3.2. Local posterior calculation. Start out with a normal prior for the local $(a, b)$ with mean $(a_0, b_0)$ and covariance matrix $\sigma^2 W_0^{-1}$. Here $a_0 = m_0(x)$ and $b_0 = m_0'(x)$, and again $W_0$ will be allowed to depend on $x$ later on. Rewrite the exponent in the local likelihood as

$$Q(x, a, b) = Q_0(x) + \begin{pmatrix} a - \widetilde a \\ b - \widetilde b \end{pmatrix}' S(x)\begin{pmatrix} a - \widetilde a \\ b - \widetilde b \end{pmatrix},$$

where $Q_0(x) = \sum_{N(x)}\{y_i - \widetilde a - \widetilde b(x_i - x)\}^2 w_i(x)$.
Calculations analogous to those that led to (2.4) give that $(a, b)$ and local data $D(x)$ have simultaneous distribution proportional to

$$\frac{|W_0 + S(x)|^{1/2}}{\sigma^2}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\begin{pmatrix} a - \widehat a \\ b - \widehat b \end{pmatrix}'\{W_0 + S(x)\}\begin{pmatrix} a - \widehat a \\ b - \widehat b \end{pmatrix}\Bigr]$$
$$\times \frac{|W_0|^{1/2}}{|W_0 + S(x)|^{1/2}}\frac{1}{\sigma^{s_0(x)}}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\Bigl\{Q_0(x) + \begin{pmatrix} \widetilde a - a_0 \\ \widetilde b - b_0 \end{pmatrix}'\{W_0^{-1} + S(x)^{-1}\}^{-1}\begin{pmatrix} \widetilde a - a_0 \\ \widetilde b - b_0 \end{pmatrix}\Bigr\}\Bigr],$$

where

$$\begin{pmatrix} \widehat a \\ \widehat b \end{pmatrix} = \{W_0 + S(x)\}^{-1}\Bigl\{W_0\begin{pmatrix} a_0 \\ b_0 \end{pmatrix} + S(x)\begin{pmatrix} \widetilde a \\ \widetilde b \end{pmatrix}\Bigr\} = \{I + W_0^{-1} S(x)\}^{-1}\begin{pmatrix} a_0 \\ b_0 \end{pmatrix} + \{I + W_0^{-1} S(x)\}^{-1} W_0^{-1} S(x)\begin{pmatrix} \widetilde a \\ \widetilde b \end{pmatrix}. \eqno(3.1)$$

In particular,

$$\begin{pmatrix} a \\ b \end{pmatrix}\Bigm|\text{local data}, \sigma \sim N_2\Bigl[\begin{pmatrix} \widehat a \\ \widehat b \end{pmatrix},\; \sigma^2\{W_0 + S(x)\}^{-1}\Bigr], \eqno(3.2)$$

and the Bayes solution is $\widehat m(x) = \widehat a$. This is the appropriate generalisation of (2.2) and (2.3), and the remarks made about the structure and characteristics of the Bayes solution, at the end of Section 2.1, are valid in the present case too, with suitable modifications. Note in particular that if $W_0$ tends to zero, signifying a noninformative prior for the local $(a, b)$, then the Bayes solution is simply the LL estimator. The second special case to note is that if $h$ is large, then we are again back in full parametric analysis in the linear normal model.

3.3. Estimating σ and local prior precision parameters. For a given start curve one needs to assign values to the prior precision matrices $W_0 = W_{0,x}$. In analogy to (2.4), consider the matrix

$$P_0(x) = \begin{pmatrix} \widetilde m(x) - m_0(x) \\ \widetilde b(x) - m_0'(x) \end{pmatrix}\begin{pmatrix} \widetilde m(x) - m_0(x) \\ \widetilde b(x) - m_0'(x) \end{pmatrix}' S(x) = d(x) d(x)' S(x). \eqno(3.3)$$

Conditional on the local $(a, b)$, the $(\widetilde m(x), \widetilde b(x))$ vector has mean $(a, b)$ and covariance matrix $S(x)^{-1} T(x) S(x)^{-1}$, where

$$T(x) = \begin{pmatrix} \sum_{N(x)} w_i(x)^2 & \sum_{N(x)} w_i(x)^2(x_i - x) \\ \sum_{N(x)} w_i(x)^2(x_i - x) & \sum_{N(x)} w_i(x)^2(x_i - x)^2 \end{pmatrix}.$$

Hence its unconditional mean and covariance matrix are respectively $(a_0, b_0)$ and $\sigma^2\{S(x)^{-1} T(x) S(x)^{-1} + W_{0,x}^{-1}\}$. Consequently, $P_0(x)$ is unbiased for

$$\sigma^2\{S(x)^{-1} T(x) + W_{0,x}^{-1} S(x)\}. \eqno(3.4)$$

Notice that $T(x) = S(x)$ when the uniform kernel is used in (1.3). Note also that the natural trace statistic $d(x)' S(x) d(x)/\sigma^2$ has expected value $2 + \mathrm{Tr}\{W_{0,x}^{-1} S(x)\}$, if such uniform weights are used. These facts can be utilised in various ways to obtain empirical Bayes estimates of parameters present in the $W_{0,x}^{-1}$ matrices, as commented on further below.

The arguments that led to (2.5) in the running one-parameter case cannot be immediately generalised to the present running two-parameter case. The combined likelihood for the $k$ groups of local data becomes

$$\prod_{x \in I}|\rho(x)|^{1/2}\cdot\sigma^{-\sum_{x \in I} s_0(x)}\exp\Bigl[-\frac12 \frac{1}{\sigma^2}\Bigl\{\sum_{x \in I} Q_0(x) + \sum_{x \in I} d(x)' S(x)\rho(x) d(x)\Bigr\}\Bigr], \eqno(3.5)$$

in terms of $\rho(x) = \{W_{0,x} + S(x)\}^{-1} W_{0,x}$. Taking partial derivatives to find the maximum here one ends up with $\widetilde\rho(x)^{-1} = (1/\widetilde\sigma^2) d(x) d(x)' S(x)$, giving a local estimate of $\rho(x)^{-1} = I + W_{0,x}^{-1} S(x)$ in good agreement with (3.4). The difficulty is that the estimator is a rank 1 matrix, and there is in general no unique maximand $\widetilde\rho(x)$. Estimating $\sigma$ is less difficult. By writing $\rho(x) = r_x B_x$ for a $B_x$ matrix with determinant 1 one can maximise over $\sigma$ and the $k$ values of $r_x$, to find

$$\widetilde\sigma^2 = \sum_{x \in I} Q_0(x)\Bigm/\sum_{x \in I}\{s_0(x) - 2\}. \eqno(3.6)$$
It is also easy to find Bayes estimates of $\sigma$ under a Gamma prior for $1/\sigma^2$. As in Section 2 a nice feature is that the $\sigma$ estimate is independent of the start curve and of the specific forms used for $\rho(x)$.

As similarly discussed in Section 2.2, satisfactory solutions are obtained by smoothing

$$\rho(x)^{-1} = I + W_{0,x}^{-1} S(x) \quad \text{against} \quad (1/\widetilde\sigma^2)\,d(x) d(x)' S(x) \eqno(3.7)$$

in suitable ways. For example, simple averaging suggests using the inverse of $(1/\widetilde\sigma^2)\bar P_0$ as start curve weight, where $\bar P_0 = k^{-1}\sum_{x \in I} d(x) d(x)' S(x)$, that is,

$$\begin{pmatrix} \widehat m(x) \\ \widehat b(x) \end{pmatrix} = \widetilde\sigma^2\bar P_0^{-1}\begin{pmatrix} m_0(x) \\ m_0'(x) \end{pmatrix} + (I - \widetilde\sigma^2\bar P_0^{-1})\begin{pmatrix} \widetilde m(x) \\ \widetilde b(x) \end{pmatrix}. \eqno(3.8)$$

The weight matrices are truncated to give weights in $[0, 1]$ for both level and slope, if necessary. This is the natural extension of method (2.7). Other suitable solutions emerge by modelling $W_{0,x}$ parametrically and then either using regression based on (3.6) or maximising (3.5) under this constraint. There does not appear to be a canonically unique way of doing this, and several options may be considered. An example is given in Section 4 where prior considerations suggest using $W_{0,x} = w_0 A_x$ in terms of a single parameter times a known matrix function. A solution can be given much as in (2.8). Another example could be to model level and slope as locally independent and with $W_{0,x} = \mathrm{diag}\{w_a r_a(x), w_b r_b(x)\}$, in terms of known functions $r_a(x)$ and $r_b(x)$ decided on by prior considerations. Again the combined likelihood (3.5) can be maximised with respect to $w_a$ and $w_b$, or a suitable regression analysis can produce estimates, based on

$$d(x)' S(x) d(x)/\widetilde\sigma^2 \simeq 2 + w_a^{-1} s_0(x)/r_a(x) + w_b^{-1} s_2(x)/r_b(x).$$

Presumably the exact specification of these parameters is of secondary importance compared to the specification of the start function $m_0$ or its prior distribution.

3.4. Accounting for uncertainty in the start curve. This quite crucial ingredient can be discussed very much as in Section 2.3. Again a feasible solution is to take a parametric $m_0(x, \xi)$ as starting point, place a prior $\pi_0(\xi)$ on these background parameters, and compute the exact or approximate posterior $\pi_0(\xi \,|\, \text{all data})$, leading in the end to

$$\begin{pmatrix} \widehat m(x) \\ \widehat b(x) \end{pmatrix} = \int\{\widehat W_{0,x}(\xi) + S(x)\}^{-1}\Bigl\{\widehat W_{0,x}(\xi)\begin{pmatrix} m_0(x, \xi) \\ m_0'(x, \xi) \end{pmatrix} + S(x)\begin{pmatrix} \widetilde m(x) \\ \widetilde b(x) \end{pmatrix}\Bigr\}\,\pi_0(\xi \,|\, \text{all data})\,\mathrm{d}\xi, \eqno(3.9)$$

for example through simulation. The final Bayes estimator is $\widehat m(x)$.
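Summing up Section 3 in code: the following sketch (our names; uniform weights) computes the local linear posterior mean (3.1), which for a given start curve and prior precision matrix is all that the estimators (3.8)–(3.9) need pointwise:

```python
import numpy as np

def local_linear_bayes(x, xs, ys, h, prior_mean, W0):
    """Posterior mean (3.1) and covariance factor from (3.2) for the local
    linear model; prior_mean = (m0(x), m0'(x)), W0 = 2x2 prior precision."""
    w = (np.abs(xs - x) <= h / 2).astype(float)
    X = np.column_stack([np.ones_like(xs), xs - x])
    S = X.T @ (w[:, None] * X)                      # S(x) of Section 3.1
    ab_tilde = np.linalg.solve(S, X.T @ (w * ys))   # LL estimates (a~, b~)
    ab_hat = np.linalg.solve(W0 + S, W0 @ np.asarray(prior_mean) + S @ ab_tilde)
    return ab_hat, np.linalg.inv(W0 + S)            # covariance = sigma^2 * 2nd output
```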
4. A particular construction. This section is meant to illustrate the general scheme of Section 3, while tending more carefully to some of the technical details. Suppose the start curve is simply a linear $m_0(x, \xi) = \xi_1 + \xi_2(x - \bar x)$, where $\bar x$ is the average of the $x_i$'s. The linear structure will be utilised to generate 'the 100 likely start curves' that are needed to find the final Bayes estimate, as well as to suggest a structure for, and then estimates of, prior precision parameters. A generalisation is noted in Section 4.3, while an attempt at making an automatic empirical Bayesian curve estimator is discussed in Section 4.4.

4.1. Obtaining prior parameters. Imagine that the Bayesian's prior opinion about $\xi_1$ and $\xi_2$ is based on a previous data set with characteristics similar to the present one. He would then have uncorrelated estimates $\xi_1^*$ and $\xi_2^*$ with variances respectively $n_0^{-1}\sigma_0^2$ and $n_0^{-1}\sigma_0^2/(v^*)^2$, say, in terms of prior sample size $n_0$ and prior sample $x$-variance $(v^*)^2$. This statistician would use prior variance $w_0^{-1}\{1 + (x - \bar x)^2/v^2\}$ for $a = \xi_1 + \xi_2(x - \bar x)$, prior variance $w_0^{-1}/v^2$ for $b = \xi_2$, with prior covariance $w_0^{-1}(x - \bar x)/v^2$ between the two, in terms of the constant $w_0^{-1} = \sigma_0^2/n_0$ and the sample variance $v^2 = n^{-1}\sum_{i=1}^n (x_i - \bar x)^2$ for the $x_i$'s. This suggests using a prior covariance matrix of the form $\sigma^2 W_{0,x}^{-1}$ for the local $(a, b)$, where

$$W_{0,x}^{-1} = w_0^{-1}\begin{pmatrix} 1 + (x - \bar x)^2/v^2 & (x - \bar x)/v^2 \\ (x - \bar x)/v^2 & 1/v^2 \end{pmatrix}, \quad \text{or} \quad W_{0,x} = w_0\begin{pmatrix} 1 & -(x - \bar x) \\ -(x - \bar x) & v^2 + (x - \bar x)^2 \end{pmatrix}. \eqno(4.1)$$

This is accordingly a situation where prior knowledge has led to a specific parametric form of the prior precision matrices, say $W_{0,x} = w_0 A_x$ or $W_{0,x}^{-1} = w_0^{-1} A_x^{-1}$ with known matrix function $A_x$. To estimate the $w_0$ quantity, one option is to maximise the combined marginal likelihood for the local data sets, that is, maximise

$$\sum_{x \in I}\log\frac{|w_0 A_x|}{|w_0 A_x + S(x)|} - \frac{1}{\widetilde\sigma^2}\sum_{x \in I} d(x, \xi_1, \xi_2)'\{w_0^{-1} A_x^{-1} + S(x)^{-1}\}^{-1} d(x, \xi_1, \xi_2)$$

with respect to $w_0$, which is an easy numerical task. Here

$$d(x, \xi_1, \xi_2) = \begin{pmatrix} \widetilde m(x) - \xi_1 - \xi_2(x - \bar x) \\ \widetilde b(x) - \xi_2 \end{pmatrix},$$

with $\widetilde m(x)$ and $\widetilde b(x)$ as in the LL method (1.4). Alternatively one may use the structure

$$d(x, \xi_1, \xi_2)' S(x) d(x, \xi_1, \xi_2)/\widetilde\sigma^2 = 2 + w_0^{-1}\mathrm{Tr}\{A_x^{-1} S(x)\} + \text{error term},$$

for several chosen values of $x$, assuming uniform weights in (1.3), to estimate $w_0^{-1}$ by a suitable regression analysis. This also invites a graphical check on the 'prior knowledge assumptions' that led to the $w_0 A_x$ structure. Either way a Bayes-empirical-Bayes estimator has been constructed, of the form

$$\begin{pmatrix} \widehat m(x, \xi_1, \xi_2) \\ \widehat b(x, \xi_1, \xi_2) \end{pmatrix} = \{\widehat w_0(\xi_1, \xi_2) A_x + S(x)\}^{-1}\Bigl\{\widehat w_0(\xi_1, \xi_2) A_x\begin{pmatrix} \xi_1 + \xi_2(x - \bar x) \\ \xi_2 \end{pmatrix} + S(x)\begin{pmatrix} \widetilde m(x) \\ \widetilde b(x) \end{pmatrix}\Bigr\}, \eqno(4.2)$$

for each possible start curve $\xi_1 + \xi_2(x - \bar x)$.

4.2. Posterior distribution for start curve. Next turn to the approximate distribution of start curves given all data. The maximum likelihood estimators are $\widetilde\xi_1 = \bar y$ and $\widetilde\xi_2 = n^{-1}\sum_{i=1}^n (x_i - \bar x) y_i/v^2$. The (2.10) matrix becomes

$$\widetilde V = \begin{pmatrix} n^{-1}\sum_{i=1}^n (y_i - \widetilde y_i)^2 & n^{-1}\sum_{i=1}^n (y_i - \widetilde y_i)^2(x_i - \bar x)/v^2 \\ n^{-1}\sum_{i=1}^n (y_i - \widetilde y_i)^2(x_i - \bar x)/v^2 & n^{-1}\sum_{i=1}^n (y_i - \widetilde y_i)^2(x_i - \bar x)^2/v^4 \end{pmatrix},$$

where $\widetilde y_i$ is the fitted value $\widetilde\xi_1 + \widetilde\xi_2(x_i - \bar x)$. Only under the additional assumption that the underlying $m(\cdot)$ curve is actually linear is $\widetilde V$ close to the familiar $\widetilde\sigma^2\,\mathrm{diag}\{1, 1/v^2\}$. The general result explained in Section 2.3 implies that $(\xi_1, \xi_2)$ given the data information is approximately a binormal, centred at $(\widetilde\xi_1, \widetilde\xi_2)$ and with covariance matrix $\widetilde V/n$.

The final estimator is reached as follows. Simulate say 100 values of $(\xi_1, \xi_2)$ from the posterior distribution just described. We may think of this as a way of generating 100 likely start curves. For each such curve, compute the (4.2) estimate. In the end compute the average of the 100 curves $\widehat m(x, \xi_1, \xi_2)$.
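The whole Section 4 pipeline can then be strung together; a sketch (hypothetical helper names; `local_linear_bayes` is the Section 3 sketch above) of the $A_x$ structure from (4.1) and the per-start-curve estimate (4.2):

```python
import numpy as np

def A_matrix(x, xbar, v2):
    """The known matrix function A_x in W_{0,x} = w0 * A_x, cf. (4.1)."""
    return np.array([[1.0, -(x - xbar)],
                     [-(x - xbar), v2 + (x - xbar) ** 2]])

def bayes_curve(xgrid, xs, ys, h, xi1, xi2, w0):
    """The (4.2) estimate m_hat(x, xi1, xi2) along a grid, for one simulated
    start line xi1 + xi2(x - xbar); w0 would come from the empirical Bayes
    step of Section 4.1."""
    xbar, v2 = xs.mean(), xs.var()
    out = []
    for x in xgrid:
        prior_mean = (xi1 + xi2 * (x - xbar), xi2)   # start level and slope at x
        ab_hat, _ = local_linear_bayes(x, xs, ys, h, prior_mean,
                                       w0 * A_matrix(x, xbar, v2))
        out.append(ab_hat[0])
    return np.array(out)
```

Averaging `bayes_curve` over 100 draws of $(\xi_1, \xi_2)$ from the binormal posterior of Section 4.2 gives the final estimate.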
4.3. General linearly structured start curve. Assume that the start curve is parametrised as $m_0(x, \xi) = \xi_1 + \xi_2 g_2(x) + \cdots + \xi_p g_p(x) = \xi' z(x)$, where $z(x)$ is the $p$-vector $(1, g_2(x), \ldots, g_p(x))'$. Its derivative is $m_0'(x, \xi) = \xi' z^*(x) = \sum_{j=2}^p \xi_j g_j'(x)$. Let $\widetilde K = \{n^{-1}\sum_{i=1}^n z(x_i) z(x_i)'\}^{-1}$. The same reasoning as above indicates that a reasonable structure for the covariance matrix of $(a, b)$ at $x$ is

$$W_{0,x}^{-1} = w_0^{-1}\begin{pmatrix} z(x)'\widetilde K z(x) & z(x)'\widetilde K z^*(x) \\ z(x)'\widetilde K z^*(x) & z^*(x)'\widetilde K z^*(x) \end{pmatrix}.$$

The scheme is otherwise quite similar to that above; decide on a strategy to determine the $w_0$ parameter, say $\widehat w_0(\xi)$, making it possible to arrive at a $\widehat m(x, \xi)$ as in (4.2) with a single algorithm, for each given $m_0(x, \xi)$. Then put up the $\widetilde V/n$ matrix as per (2.10), and finally compute the average of 100 curves $\widehat m(x, \xi)$.

4.4. Completely automatic empirical Bayesian regression. There are clearly many possible schemes to follow. This is inherent in the Bayesian perspective; each new application is different from previous ones and serious considerations are required to elicit the particular prior process, or at least its form. It is nevertheless useful to work out one or more methods that are completely automatic, in the sense of depending only on the data and not on specification of subjective parameters, at the price of being pragmatically empirical Bayesian rather than strictly Bayesian. Indeed the two standard estimators (1.2) and (1.4) can both be given such interpretations, using flat priors for the local parameters. Less drastic simplifications emerge by letting the data lead to a suitable linear parametric form $m_0(x, \xi)$, and then employing methods described in the preceding subsections.

One particular construction is as follows. Use a 'delete-knot regression spline' method to approximate the curve with a function of the spline form $m_0(x, \xi) = \sum_{j=1}^p \xi_j g_j(x)$, where $g_j(x) = \{(x - k_j)_+\}^3$, and the knots are placed at $k_1 < \cdots < k_p$. The method described in Breiman and Peters (1992, Section 2.3) is automatic and succeeds in selecting a rather small number $p$ of such well-placed knots. Then go through the method of the previous subsection.

5. Other local regression schemes. This section briefly develops and discusses further specialisations of the general Bayesian local regression strategy. The apparatus is applied to other kinds of models in Section 7.

5.1. Local polynomial Bayesian regression. It is clear that the methods and calculations of Section 3 can be generalised to for example local quadratic or cubic regression, without serious difficulties. The local quadratic model uses $m(t) = a + b(t - x) + c(t - x)^2$, and the estimator is $\mathrm{E}\{a \,|\, \text{local data}\}$. Choosing the local polynomial order must be seen in connection with the choice of local data window width; see Section 8.3. Allowing three or four local parameters would typically require somewhat larger data windows than for the one- and two-parameter methods of Sections 2 and 3. The method of splines uses local cubic polynomials, but the pieces are knotted together in a different way than with the present setup.
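For completeness, the local polynomial generalisation of Section 5.1 only changes the local design matrix; a sketch (our names; uniform weights; flat local prior, so the fitted intercept is the maximum local likelihood estimate, to which local priors can be attached exactly as in Section 3):

```python
import numpy as np

def local_poly_estimate(x, xs, ys, h, degree=2):
    """Local polynomial fit; degree=1 recovers the LL method (1.4) and
    degree=2 the local quadratic model. The intercept estimates m(x)."""
    w = (np.abs(xs - x) <= h / 2).astype(float)
    X = np.column_stack([(xs - x) ** j for j in range(degree + 1)])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * ys))
    return beta[0]
```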
5.2. Start curve function times correction. Suppose $m_0(t)$ is some initial description, perhaps containing 'global' parameters, and model the true curve locally as $m(t) = m_0(t) a$ for $t \in N(x)$, with a local correction factor $a$ not far from 1. If the local correction factor $a$ is modelled as $N\{1, \sigma^2/w_0\}$, then the Bayes solution becomes

$$\widehat m(x) = m_0(x)\frac{w_0 + u_0(x)\widetilde a(x)}{w_0 + u_0(x)},$$

where $u_0(x) = \sum_{N(x)} w_i(x) m_0(x_i)^2$ and $\widetilde a(x) = \sum_{N(x)} w_i(x) m_0(x_i) y_i/u_0(x)$. It is also of interest to note that if weights $\bar K(h^{-1}(x_i - x)) m(x)^2/m(x_i)^2$ are used instead of $w_i(x)$, that is, coming from the somewhat altered kernel $\bar K(z) m(x)^2/m(x + hz)^2$, then the method is a relative of one recently developed in Hjort and Glad (1994). An empirical Bayes scheme must be devised to estimate $w_0$, for example utilising $P_0(x) = u_0(x)\{\widetilde a(x) - 1\}^2$ at various $x$'s. Further variants emerge when the local model is $m(t) = m_0(t)\{a + b(t - x)\}$ for $t$ near $x$.

5.3. Other base densities in the local likelihood. Our kernel smoothed local likelihood has been of the form $\prod_{N(x)}[\sigma^{-1} g(\sigma^{-1}\{y_i - a - b(x_i - x)\})]^{w_i(x)}$, with $g$ equal to the standard normal. Other densities $g$ can be used as well, without seriously disturbing the general method, apart from more bothersome numerical computations. An adaptive version of the method would be to stick in an estimate of $g$ based on residuals $\{y_i - \widetilde m(x_i)\}/\widetilde\sigma$. Analysis in Section 8.4 suggests that the choice of $g$ is not crucial to the final results, so we might as well stick to the convenient normal.

5.4. Additional context in the prior. Some of the development in Section 2.2 used the assumption that the local constant levels at the $k$ midpoint positions $x_{0,1} < \cdots < x_{0,k}$, say $a_1, \ldots, a_k$, were taken independent in their joint prior distribution. One may also introduce an element of smoothness or context in the prior, by modelling these as positively correlated, say $a \sim N_k\{a_0, \sigma^2 T_0^{-1}\}$. In such a case,

$$\begin{pmatrix} \widehat m(x_{0,1}) \\ \vdots \\ \widehat m(x_{0,k}) \end{pmatrix} = (T_0 + D)^{-1} T_0\begin{pmatrix} m_0(x_{0,1}) \\ \vdots \\ m_0(x_{0,k}) \end{pmatrix} + (T_0 + D)^{-1}\begin{pmatrix} s_0(x_{0,1})\widetilde m(x_{0,1}) \\ \vdots \\ s_0(x_{0,k})\widetilde m(x_{0,k}) \end{pmatrix},$$

where $D$ is the diagonal matrix with elements $(s_0(x_{0,1}), \ldots, s_0(x_{0,k}))$ (assuming, for simplicity, that uniform weights are used in (1.3)). The arguments that led to (2.5) and some of its later relatives used a diagonal $T_0$ matrix. In the present setup the estimators again shrink the NW estimator towards the start curve values $m_0(\cdot)$, but in addition neighbouring estimates are pushed towards each other, with a strength determined from the correlation structure in the prior. This is indeed sensible, and would not be very different from what happens for the method of splines, which can be formulated in terms of a Gaußian prior process for $m(\cdot)$ with positive correlations. We still prefer the simpler independence framework, however. Context and smoothness are accounted for in any case through the use of a smooth start curve $m_0(\cdot)$ and direct signals from the data themselves. A correlation model is somewhat bothersome, not so much regarding the Bayes estimates above, but by requiring an explicitly modelled correlation structure and a more complicated empirical Bayes scheme to estimate its parameters.
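Returning to the Section 5.2 scheme for a moment, a minimal sketch of its Bayes solution (our names; uniform weights; `m0` is the start curve function with any global parameters already fixed):

```python
import numpy as np

def start_times_correction(x, xs, ys, h, m0, w0):
    """Section 5.2: local model m(t) = m0(t)*a with a ~ N(1, sigma^2/w0);
    the posterior mean shrinks the local correction factor towards 1."""
    w = (np.abs(xs - x) <= h / 2).astype(float)
    m0i = np.array([m0(t) for t in xs])
    u0 = (w * m0i ** 2).sum()                  # local information about a
    a_tilde = (w * m0i * ys).sum() / u0        # local least squares correction
    return m0(x) * (w0 + u0 * a_tilde) / (w0 + u0)
```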
6. Several covariates. Nonparametric regression is difficult in higher dimensions. It is easy to formally generalise many of the one-dimensional methods to $d$ covariates, including the NW method of (1.2) and the LL method of (1.4), but the curse of dimensionality leaves most neighbourhoods too empty to give good precision, and the convergence rate becomes increasingly unfavourable when $d$ grows. Successful methods, if say $d \ge 3$, are typically those that look for lower-dimensional structure in suitable ways. Friedman and Stuetzle (1981) develop a projection pursuit method. Cleveland and Devlin (1988) and Cleveland, Grosse and Shyu (1991) discuss multidimensional versions of the popular local regression method 'lowess'. Hjort and Glad (1994) propose and analyse an estimator that corrects nonparametrically on an initial parametric descriptor. Hastie and Tibshirani (1990) discuss models and methods that approximate the real regression surface with one with a simpler additive structure, and also (in their Ch. 6) give a Bayesian discussion of one version. Frank and Friedman (1993) study partial least squares methods and also point out Bayesian connections. Here the apparatus of previous sections is extended to the $d$-variate case. The Bayesian methods may turn out to be particularly fruitful in the multivariate case in view of the difficulties that standard methods have there.

The extension is quite straightforward in that the necessary linear algebra is very similar to that developed in Section 3, at least as concerns the structure of the Bayes estimator and supplementing empirical Bayes methods to obtain prior precision parameters. The model is that $y_i = m(x_i) + \varepsilon_i$ for a smooth surface $m(\cdot)$, in terms of say $x_i = (x_{i,1}, \ldots, x_{i,d})'$ for individual $i$, and i.i.d. error terms $\varepsilon_i$ with mean zero and variance $\sigma^2$. The local model is

$$m(t_1, \ldots, t_d) = a + b_1(t_1 - x_1) + \cdots + b_d(t_d - x_d) = a + b'(t - x) \quad \text{for } t \in N(x),$$

a suitably defined neighbourhood around $x$. Write

$$Q(x, a, b) = \sum_{N(x)}\{y_i - a - b'(x_i - x)\}^2 w_i(x) = Q_0(x) + \begin{pmatrix} a - \widetilde a \\ b - \widetilde b \end{pmatrix}' S(x)\begin{pmatrix} a - \widetilde a \\ b - \widetilde b \end{pmatrix},$$

where $\widetilde a, \widetilde b_1, \ldots, \widetilde b_d$ minimise $Q(x, a, b)$, and where

$$S(x) = \begin{pmatrix} \sum_{N(x)} w_i(x) & \sum_{N(x)} w_i(x)(x_i - x)' \\ \sum_{N(x)} w_i(x)(x_i - x) & \sum_{N(x)} w_i(x)(x_i - x)(x_i - x)' \end{pmatrix}.$$

Let $(a, b)$ be given a multinormal prior centred at $(a_0, b_0)$, where $a_0 = m_0(x)$ and $b_0$ has components $b_{0,j} = \partial m_0(x)/\partial x_j$, determined from a suitable start surface $m_0(x)$, and with a covariance matrix $\sigma^2 W_{0,x}^{-1}$. Then results (3.1) and (3.2) hold with only notational differences, and in particular this defines the Bayes solution $\widehat m(x) = \widehat a$.

A successful version of this scheme, especially with a respectable $d$, would typically require a good parametric modelling of the prior precision matrix $W_{0,x}$ and then a suitable empirical Bayes method to infer its parameters. To this end note that equation (3.4), minorly modified, is valid. This also means that the appropriate analogue of (3.8) should be a good estimator in the present $d$-dimensional case. And as in previous cases in this paper this $m_0$-dependent estimator should be averaged with respect to the posterior distribution of the start surface.
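The $d$-variate Bayes solution is structurally identical to the one-dimensional case; a sketch (our names; cubical neighbourhoods with uniform weights) of the analogue of (3.1):

```python
import numpy as np

def local_linear_bayes_multi(x, X, y, h, m0_val, grad0, W0):
    """d covariates: local model a + b'(t - x); m0_val = m0(x), grad0 =
    gradient of the start surface at x, W0 = (d+1)x(d+1) prior precision."""
    w = (np.abs(X - x).max(axis=1) <= h / 2).astype(float)  # cubical N(x)
    D = np.column_stack([np.ones(len(y)), X - x])           # rows (1, (x_i - x)')
    S = D.T @ (w[:, None] * D)                              # the (d+1)x(d+1) S(x)
    ab_tilde = np.linalg.solve(S, D.T @ (w * y))            # local linear fit
    prior_mean = np.concatenate([[m0_val], np.asarray(grad0)])
    ab_hat = np.linalg.solve(W0 + S, W0 @ prior_mean + S @ ab_tilde)
    return ab_hat[0]                                        # Bayes estimate of m(x)
```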
7. Local linear Bayes methods for Poisson regression and other regression models. To illustrate that the general empirical-hierarchical-Bayesian programme can be used in many other regression type models, we first consider the Poisson regression case in some detail and then give brief pointers to still other areas of application.

7.1. Local inference for Poisson regression. Let $y_i \,|\, x_i$ be a Poisson variable with mean parameter $m(x_i)$. The task is to estimate the mean function $m(x)$. Consider the local level model where $m(t) = a$ for $t \in N(x) = [x - \frac12 h, x + \frac12 h]$. The local likelihood becomes

$$\prod_{N(x)}\bigl\{e^{-a} a^{y_i}/(y_i!)\bigr\}^{w_i(x)} = a^{\sum_{N(x)} w_i(x) y_i}\exp\Bigl\{-a\sum_{N(x)} w_i(x)\Bigr\}\Bigm/\prod_{N(x)}(y_i!)^{w_i(x)}.$$

Suppose there is some prior estimate of the form $m_0(x) = m_0(x, \xi)$, in terms of suitable 'global' parameters $\xi$. If $a$ is given a Gamma prior with parameters $\{w_0 m_0(x, \xi), w_0\}$, that is, with mean value equal to $m_0(x, \xi)$ and variance equal to $m_0(x, \xi)/w_0$, then

$$a \,|\, \text{local data}, \xi \sim \mathrm{Gamma}\Bigl\{w_0 m_0(x, \xi) + \sum_{N(x)} w_i(x) y_i,\; w_0 + \sum_{N(x)} w_i(x)\Bigr\}.$$

The Bayes estimator, conditional on the start curve $m_0(\cdot, \xi)$, becomes

$$\widehat m(x, \xi) = \mathrm{E}\{a \,|\, \text{local data}, \xi\} = \frac{w_0}{w_0 + s_0(x)} m_0(x, \xi) + \frac{s_0(x)}{w_0 + s_0(x)}\widetilde m(x), \eqno(7.1)$$

where $s_0(x) = \sum_{N(x)} w_i(x)$ as before and $\widetilde m(x) = \sum_{N(x)} w_i(x) y_i/s_0(x)$ is the natural frequentist estimate (maximum local likelihood estimate under the constant level model). This is also the Bayes solution under the natural noninformative prior (where $w_0$ tends to zero).

As in Sections 2 and 3 empirical Bayes methods can be set up to estimate the prior precision parameter $w_0 = w_{0,x}$ at $x$. The marginal distribution of local data $D(x)$, still conditional on $w_{0,x}$ and the start curve, can be worked out to be

$$\frac{\Gamma(w_{0,x} m_0(x, \xi) + s_0(x)\widetilde m(x))}{\Gamma(w_{0,x} m_0(x, \xi))\, s_0(x)^{s_0(x)\widetilde m(x)}\prod_{N(x)}(y_i!)^{w_i(x)}}\Bigl\{\frac{w_{0,x}}{w_{0,x} + s_0(x)}\Bigr\}^{w_{0,x} m_0(x,\xi)}\Bigl\{\frac{s_0(x)}{w_{0,x} + s_0(x)}\Bigr\}^{s_0(x)\widetilde m(x)}.$$

The maximum likelihood estimator can be computed from this, and is a suitable function of the sufficient statistic $\widetilde m(x)$. It is simpler and perhaps equally reliable to utilise the easily obtainable facts that $\widetilde m(x)$ has mean value $m_0(x, \xi)$ and variance $m_0(x, \xi)\{t_0(x)/s_0(x)^2 + 1/w_{0,x}\}$, where $t_0(x)$ again is $\sum_{N(x)} w_i(x)^2$. This leads to forming

$$P_0(x, \xi) = \frac{s_0(x)}{m_0(x, \xi)}\{\widetilde m(x) - m_0(x, \xi)\}^2 \quad \text{with mean value} \quad \frac{t_0(x)}{s_0(x)} + \frac{s_0(x)}{w_{0,x}}, \eqno(7.2)$$

giving a prior precision estimate $\widehat w_{0,x} = \widehat w_{0,x}(\xi)$. This leads to the Bayes-empirical-Bayes estimator

$$m^*(x, \xi) = \frac{\widehat w_{0,x}(\xi) m_0(x, \xi) + s_0(x)\widetilde m(x)}{\widehat w_{0,x}(\xi) + s_0(x)} = \frac{1}{1 + P_0(x, \xi) - t_0(x)/s_0(x)} m_0(x, \xi) + \frac{P_0(x, \xi) - t_0(x)/s_0(x)}{1 + P_0(x, \xi) - t_0(x)/s_0(x)}\widetilde m(x). \eqno(7.3)$$

This is quite similar to the (2.6) estimator. The structure is simplest for the uniform kernel, where $t_0(x)/s_0(x) = 1$, and otherwise this ratio is close to $R_K/K(0)$, which for example is equal to 0.80 for the optimal Jepanetsjnikoff kernel (see Section 8.3). The reasoning that led to estimator (2.7) gives

$$m^*(x, \xi) = \bar P_0(\xi)^{-1} m_0(x, \xi) + \{1 - \bar P_0(\xi)^{-1}\}\widetilde m(x), \eqno(7.4)$$

where $\bar P_0(\xi) = k^{-1}\sum_{x \in I} P_0(x, \xi)$. Again the weights are truncated to the unit interval.
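A sketch of the Poisson case (our names; uniform weights): the Gamma-posterior mean (7.1), with the empirical Bayes weight of (7.2)–(7.3) used when no prior precision is supplied:

```python
import numpy as np

def poisson_local_bayes(x, xs, ys, h, m0_x, w0=None):
    """Gamma-posterior estimate (7.1) for Poisson regression; if w0 is None,
    the empirical Bayes weight of (7.3) is used (uniform kernel, so
    t0(x)/s0(x) = 1)."""
    w = (np.abs(xs - x) <= h / 2).astype(float)
    s0 = w.sum()
    m_loc = (w * ys).sum() / s0                     # maximum local likelihood
    if w0 is None:                                  # empirical Bayes route
        P0 = s0 * (m_loc - m0_x) ** 2 / m0_x        # statistic of (7.2)
        u = max(P0 - 1.0, 0.0)                      # estimate of s0/w0, truncated
        return m0_x / (1.0 + u) + u / (1.0 + u) * m_loc
    return (w0 * m0_x + s0 * m_loc) / (w0 + s0)     # posterior mean (7.1)
```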
The key step is to model the $w_{0,x}$ suitably as a function of $x$. Estimator (7.3) does not use any model at all for how $w_{0,x}$ changes, and is quite nonparametric on this account. Estimator (7.4) and suitable analogues of estimator (2.8) use parametric models for $w_{0,x}$; see the discussion that led to (2.7) and (2.8).

All this happened conditional on a start curve function $m_0(x,\xi)$. As in previous sections a natural two-stage Bayesian way to cope with uncertainty in the specification of the start curve is to place a prior on the parameters $\xi$, and then compute the exact or approximate posterior distribution $\pi_0(\xi \mid \text{all data})$. A natural scenario is a log-linear start curve model, say $m_0(x,\xi) = \exp(\xi_1 + \xi_2 x)$ or the more general
\[ \exp\{\xi' z(x)\} = \exp\{\xi_1 + \xi_2 g_2(x) + \cdots + \xi_p g_p(x)\}. \]
The posterior distribution of $\xi$ is approximately multinormal, centred at the maximum likelihood estimate $\tilde\xi$ and with covariance matrix of the form $\tilde V/n$, where in fact
\[ \tilde V = \Big\{ n^{-1} \sum_{i=1}^n e^{\tilde\xi' z_i} z_i z_i' \Big\}^{-1} \Big\{ n^{-1} \sum_{i=1}^n (y_i - e^{\tilde\xi' z_i})^2 z_i z_i' \Big\} \Big\{ n^{-1} \sum_{i=1}^n e^{\tilde\xi' z_i} z_i z_i' \Big\}^{-1}. \]
This can be shown using methods of Hjort and Pollard (1994, Sections 3 and 4). The form for $\tilde V$ given here assumes that $y_i$ given $x_i$ really is Poisson, but does not assume that the mean function has any particular form. The final estimator for $m(x)$ is of the form
\[ \hat m(x) = E\big[ E\{a \mid \text{local data}, \xi\} \mid \text{all data} \big] = \int \frac{\hat w_{0,x}(\xi)\, m_0(x,\xi) + s_0(x)\, \tilde m(x)}{\hat w_{0,x}(\xi) + s_0(x)} \, \pi_0(\xi \mid \text{all data}) \, d\xi. \tag{7.5} \]

7.2. Local log-linear analysis. This time take $m(t) = a \exp\{b(t - x)\}$ as the local model around $x$. Let $a$ be Gamma with parameters $\{w_0 m_0(x,\xi), w_0\}$ as above, and let $b$ have some prior $\pi(b)$. Analysis as in the previous subsection shows that $a$ given local data and $b$ is an updated Gamma, with
\[ E\{a \mid \text{local data}, b\} = \frac{w_0 m_0(x,\xi) + \sum_{N(x)} w_i(x) y_i}{w_0 + \sum_{N(x)} w_i(x) \exp\{b(x_i - x)\}}, \]
and the Bayes estimator is the average of this over the distribution of $b$ given local data,
\[ \hat m(x,\xi) = E\{m(x) \mid \text{local data}\} = \int \frac{w_0 m_0(x,\xi) + \sum_{N(x)} w_i(x) y_i}{w_0 + \sum_{N(x)} w_i(x) \exp\{b(x_i - x)\}} \, \pi(b \mid \text{local data}) \, db. \]
The posterior density of $b$ can be shown to be of the form
\[ \text{const.} \; \frac{\pi(b) \exp\{b n h^3 u_n(x)/k_0\}}{\{w_0 + n h f_n(x)/k_0 + n h^3 v_n(x, b)/k_0\}^{\,w_0 m_0(x,\xi) + n h f_n(x) \tilde m(x)/k_0}}, \]
where $u_n(x) = n^{-1} h^{-3} \sum_{N(x)} K(h^{-1}(x_i - x))(x_i - x) y_i$ essentially estimates $(mf)'$, and where $v_n(x, b) = n^{-1} h^{-3} \sum_{N(x)} K(h^{-1}(x_i - x))[\exp\{b(x_i - x)\} - 1]$ essentially estimates $f'' + 2bf' + b^2 f$. Normal approximations of interest can be worked out from this, but will not be pursued here. As in the previous subsection one must next supply estimates of the $w_0 = w_{0,x}$ values and in the end average the Bayes-empirical-Bayes estimate $m^*(x,\xi)$ over the posterior distribution of $\xi$ given all data.
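Schematically, this two-stage averaging, as in (7.5), can be carried out by simulation. Here is a sketch assuming the `poisson_local_bayes` routine above and an approximate multinormal posterior $N(\tilde\xi, \tilde V/n)$ for $\xi$; the number of draws, the seed, and all names are illustrative.

```python
import numpy as np

def hierarchical_average(x, xs, ys, h, m0, xi_hat, V_hat, n_draws=100, seed=0):
    """Approximate the hierarchical Bayes estimate (7.5): average the
    Bayes-empirical-Bayes estimate m*(x, xi) over draws of xi from its
    approximate multinormal posterior N(xi_hat, V_hat / n).

    Sketch only: m0(x, xi) is the user's parametric start curve."""
    rng = np.random.default_rng(seed)
    n = len(ys)
    draws = rng.multivariate_normal(xi_hat, V_hat / n, size=n_draws)
    vals = [poisson_local_bayes(x, xs, ys, h, lambda t, xi=xi: m0(t, xi))
            for xi in draws]
    return float(np.mean(vals))
```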
7.3. Start estimate times local correction. This time take $m(t) = m_0(t,\xi)\, a$ for $t \in N(x) = [x - \frac12 h, x + \frac12 h]$ as the local model, where $a = a_x$ is thought of as the local multiplicative correction factor to the start curve $m_0(x,\xi)$. The local likelihood becomes proportional to
\[ \prod_{N(x)} \big[ \exp\{-m_0(x_i,\xi) a\} \{m_0(x_i,\xi) a\}^{y_i} \big]^{w_i(x)} = \prod_{N(x)} m_0(x_i,\xi)^{y_i w_i(x)} \; a^{\sum_{N(x)} w_i(x) y_i} \exp\Big\{-a \sum_{N(x)} w_i(x) m_0(x_i,\xi)\Big\}. \]
Under present circumstances it is appropriate to give $a$ a Gamma prior centred at 1, say with parameters $(w_0, w_0)$. Then $a$ given local data and the background $\xi$ is seen to be an updated Gamma, and the Bayes estimator is
\[ \hat m(x,\xi) = E\{m_0(x,\xi)\, a \mid \text{local data}, \xi\} = m_0(x,\xi)\, \frac{w_0 + \sum_{N(x)} w_i(x) y_i}{w_0 + \sum_{N(x)} w_i(x) m_0(x_i,\xi)}. \]
Note that the noninformative prior version of this gives the interesting estimator
\[ \bar m(x,\xi) = \sum_{N(x)} w_i(x)\, m_0(x,\xi)\, y_i \Big/ \sum_{N(x)} w_i(x)\, m_0(x_i,\xi), \]
which is equal to the ordinary NW-type one only if the start curve function is constant in the neighbourhood. It has exactly the same 'start estimator times correction' structure as that of estimators worked with in Hjort and Glad (1994), provided the kernel $\bar K(z)\, m_0(x,\xi)/m_0(x + hz,\xi)$ is used.

The programme is once more to estimate $w_0 = w_{0,x}$ in a suitable fashion, for given $\xi$, ending up with a Bayes-empirical-Bayes estimator $m^*(x,\xi)$, and then averaging this over say 100 likely start curves drawn from the posterior density of $\xi$ given all data. Helpful for the first step here is to use
\[ P_0(x,\xi) = \Big\{\sum_{N(x)} m_0(x_i,\xi)\Big\} \Big\{\frac{\bar m(x,\xi)}{m_0(x,\xi)} - 1\Big\}^2, \quad \text{with mean value} \;\approx\; 1 + w_{0,x}^{-1} \sum_{N(x)} m_0(x_i,\xi), \]
with uniform kernel weighting as in (1.3).

7.4. Local linear Bayes analysis of other regression models. The ideas and methods of this paper can be applied in many other regression situations, such as logistic regression and Cox regression in survival analysis. Yet another situation where similar methods can be put forward is that of spatial interpolation of random fields; the result would be a local version of the Bayesian kriging method treated in Hjort and Omre (1994, Section 3).

8. Supplementing results and remarks.

8.1. Stein-type estimation and risk function comparison. The following discussion is pertinent also for methods developed in the later sections, but to keep matters simple we consider the situation in Section 2, where the local levels $m(x)$ at $k$ positions were to be estimated. Let us also for simplicity employ the uniform kernel in (1.3), so that $t_0(x)$ and $s_0(x)$ are both equal to the number of data points falling inside $x \pm \frac12 h$. Suppose the loss function involved is
\[ L(m, m^*) = \sum_{x \in I} \{m^*(x) - m(x)\}^2 s_0(x). \]
The standard estimator $\tilde m(x)$ has risk function equal to the constant $k\sigma^2$. In view of estimators (2.7) and (2.8), let us try out $m^*(x) = \tilde m(x) - c\{\tilde m(x) - m_0(x)\}/z$, where
\[ z = \sum_{x \in I} \{\tilde m(x) - m_0(x)\}^2 s_0(x). \]
Its loss can be written
\[ L(m, m^*) = L(m, \tilde m) + c^2/z - 2c \sum_{x \in I} s_0(x) \{\tilde m(x) - m(x)\}\{\tilde m(x) - m_0(x)\}/z. \]
Using partial integration and properties of the normal distribution one sees that
\[ E_m \{\tilde m(x) - m(x)\}\, q(\tilde m(x_{0,1}), \ldots, \tilde m(x_{0,k})) = \frac{\sigma^2}{s_0(x)}\, E_m \frac{\partial}{\partial \tilde m(x)}\, q(\tilde m(x_{0,1}), \ldots, \tilde m(x_{0,k})). \]
This implies, with some calculations, that the risk of $m^*$ equals the risk of $\tilde m$ plus $E_m \Delta$, where $\Delta = z^{-1}\{c^2 - 2c(k - 2)\sigma^2\}$. The best value for $c$ is $c_0 = (k - 2)\sigma^2$, making $\Delta$ negative for all values of $z$. This shows that $\tilde m$ can be improved upon not only in a region around a given prior estimate $m_0$, but uniformly, provided only $k \ge 3$. This is the Stein phenomenon; see for example Lehmann (1983, Ch. 4) for a discussion of this in a somewhat simpler framework. The development here suggests the estimator
\[ m^*(x) = \tilde m(x) - \frac{(k - 2)\tilde\sigma^2}{\sum_{x \in I} \{\tilde m(x) - m_0(x)\}^2 s_0(x)}\, \{\tilde m(x) - m_0(x)\}, \tag{8.1} \]
which is quite similar to (2.7).

Further studies are needed in order to single out practical versions of our schemes with good performance against traditional competitors. A simulation study along the lines of Breiman and Peters (1992) could be carried out, using for example proposals as outlined in Section 4.4 above.

8.2. Alternative estimators for sigma and prior precision. Other estimators for $\sigma$ than that of (2.5) can also be developed. With a prior $\pi_0(\sigma)$ for the $\sigma$ parameter the simultaneous distribution of $\sigma$ and the combined neighbourhood data sets $D(x)$ can be written down, following the expression that led to (2.5). A convenient choice is the Gamma prior with parameters say $(\frac12\alpha, \frac12\beta)$ for $\lambda = 1/\sigma^2$, with prior guess $\sigma_0^2 = \beta/\alpha$ for $\sigma^2$. Then $\lambda$, given the collection of all the local data sets, is still a Gamma, with updated parameters $\frac12\alpha + \frac12\sum_{x \in I} s_0(x)$ and $\frac12\beta + \frac12\sum_{x \in I} \{Q_0(x) + \rho(x) P_0(x)\}$. This leads to the Bayes estimator
\[ \hat\sigma^2 = \Big\{\alpha + \sum_{x \in I} s_0(x)\Big\}^{-1} \Big[\beta + \sum_{x \in I} Q_0(x) + \sum_{x \in I} \frac{w_{0,x}\, s_0(x)}{w_{0,x} + s_0(x)} \{\tilde m(x) - m_0(x)\}^2\Big]. \]
It is also of interest to note that if the unconditional distribution of the $k$ local data sets is deduced from the above by integrating out $\sigma$, then its maximisers are exactly as in (2.5). The noninformative prior version is the one where $\alpha$ and $\beta$ tend to zero, corresponding in fact to a uniform prior for $\log\sigma$. One may also let the $w_{0,x}$ parameters tend to zero, corresponding to a uniform prior for each of the $k$ local levels $a(x)$. This invites the estimator $\sum_{x \in I} Q_0(x) / \sum_{x \in I} s_0(x)$, which is quite similar to the one in (2.5). Yet other estimators of $\sigma$ are discussed in Hastie and Tibshirani (1990, Section 3.4).

We saw in (2.3) that the $\sigma$-conditional posterior distribution of the local constant $a$ was normal with variance proportional to $\sigma^2$. The real local posterior distribution of $a$ emerges by integrating this normal with respect to the posterior distribution of $\sigma$. With the Gamma prior $(\frac12\alpha, \frac12\beta)$ for $1/\sigma^2$ used above, some calculations show that
\[ a = m(x) \mid \text{local data} \;\sim\; \hat m(x) + \{w_{0,x} + s_0(x)\}^{-1/2}\, \hat\sigma\, t_\nu, \]
where $t_\nu$ is a t distribution with degrees of freedom $\nu = \alpha + \sum_{x \in I} s_0(x)$. With the noninformative prior on $\sigma$ and uniform weights we get a t with $n$ degrees of freedom. This leads to pointwise Bayesian credibility bands for the $m(x)$ curve.
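A minimal sketch of such pointwise bands, assuming $\hat m(x)$, $\hat\sigma$, $w_{0,x}$, $s_0(x)$ and the degrees of freedom $\nu$ have already been computed as above; the function and argument names are illustrative.

```python
import numpy as np
from scipy import stats

def credibility_band(m_hat, sigma_hat, w0, s0, nu, level=0.90):
    """Pointwise Bayesian credibility band for m(x) from the posterior
    m(x) | local data ~ m_hat(x) + {w0 + s0(x)}^{-1/2} sigma_hat t_nu.

    m_hat, w0, s0 are arrays over the grid of x values in I."""
    t_q = stats.t.ppf(0.5 + level / 2, df=nu)   # e.g. the 95% point for a 90% band
    half = t_q * sigma_hat / np.sqrt(w0 + s0)
    return m_hat - half, m_hat + half
```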
8.3. Choosing kernel, bandwidth and order. The local likelihood $L_n(x, \beta, \sigma)$ of (1.5) with a uniform kernel has discontinuities in $x$ when the endpoints of the $x \pm \frac12 h$ interval hit data points. This drawback is inherited by both the Bayes estimators and the maximum local likelihood estimators. Continuity of these requires continuity of the kernel function $\bar K$ on the full $[-\frac12, \frac12]$ support interval; in particular $\bar K(\pm\frac12) = 0$ is required.

Let $K$ be a probability density kernel with standard deviation $\sigma_K$ and $R_K = \int K^2 \, dz$, and let $\bar K$ be related to it via (1.3). The approximate or asymptotic mean squared error of the maximum local likelihood estimator, say the LL estimator (1.4), can be expressed as $\{\sigma_K R_K\}^{4/5}$ times a factor depending on the unknown $m(\cdot)$, divided by $n^{4/5}$. The best kernel in this sense is the one supported on $[-\frac12, \frac12]$ that minimises $\sigma_K R_K$. This is the Jepanetsjnikoff kernel; in terms of $\bar K$ it is $\bar K(z) = 1 - (2z)^2$ on $[-\frac12, \frac12]$ and zero outside.

Choosing the bandwidth and the order of the local model is a nontrivial problem, chiefly related to the variance–bias balancing act. Methods and insights reached in the non-Bayesian local regression context are mostly relevant also in the present Bayesian framework, and a pragmatic view would be to let the non-Bayesians decide on $h$ and the local polynomial order, as best they can for a given problem, and then use the Bayesian methods developed here with the chosen $h$ and order. Cleveland and Devlin (1988) use certain M-plots that resemble Mallows' $C_p$ statistics. Tibshirani and Hastie (1987) use Akaike's information criterion (with a uniform kernel and with symmetric nearest neighbour windows); the so-called Bayesian information criterion of Schwarz (1978) could as easily be used. Fan and Gijbels (1992) consider versions of 'plug-in' methods for deciding on $h$.

Various Bayesian methods can also be developed, including fanciful ones that for a given order start with a prior for $h$ or for the $h_x$ process. An easier method, which is Bayesian in the sense that it can incorporate prior knowledge, is to work with the widest practical model, say the third order local regression
\[ m(t) = a + b(t - x) + c(t - x)^2 + d(t - x)^3 \quad \text{for } t \in [x - \tfrac12 h, x + \tfrac12 h], \]
and model the local covariance matrix $W_{0,x}^{-1}$ so as to suitably penalise third and second order presence. The result would also resemble a special case of the general smoothing-between-models estimators discussed in Hjort (1994c). Rather than developing such ideas we are content here to describe a natural local goodness of fit method, which might also be useful for the non-Bayesian local regression methods. For the running local line case (local polynomial order 1), let
\[ Q_0(x) = \sum_{N(x)} \{y_i - \tilde a(x) - \tilde b(x)(x_i - x)\}^2, \]
using uniform weights for simplicity. Under the hypothesis that the regression really is linear in the $N(x) = x \pm \frac12 h$ interval, and with normal residuals, $Q_0(x)/\sigma^2 \sim \chi^2_{s_0(x) - 2}$, where $s_0(x)$ is the number of data points falling in the $N(x)$ window. A rough strategy is therefore to expand the $x \pm \frac12 h$ window, from a suitable minimum length $h_0(x)$ onwards, for as long as $Q_0(x)/\tilde\sigma^2 \le q(0.80, s_0(x) - 2)$, the upper 20% point (for example) of the $\chi^2$ with $s_0(x) - 2$ degrees of freedom. A kurtosis correction can readily be supplied if the residuals are non-normal, and the chosen values of $h = h_x$ should be post-smoothed somewhat to give a smooth curve.
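A sketch of this window-expansion rule in Python; the grid step, the maximal bandwidth and the function names are illustrative choices, and $\tilde\sigma^2$ is assumed supplied from, say, (2.5).

```python
import numpy as np
from scipy import stats

def expand_window(x, xs, ys, sigma2, h0, h_max, step=0.1, alpha=0.80):
    """Local bandwidth by the chi-square goodness-of-fit rule of Section 8.3:
    grow the window x +/- h/2 from h0 for as long as the local-line residual
    sum Q0(x)/sigma2 stays below the upper 20% chi-square point."""
    h = h0
    while h + step <= h_max:
        inside = np.abs(xs - x) <= (h + step) / 2
        s0 = inside.sum()
        if s0 <= 2:                 # too few points to test the fit; keep growing
            h += step
            continue
        # Ordinary least squares local line (uniform weights).
        Z = np.column_stack([np.ones(s0), xs[inside] - x])
        coef, *_ = np.linalg.lstsq(Z, ys[inside], rcond=None)
        Q0 = np.sum((ys[inside] - Z @ coef) ** 2)
        if Q0 / sigma2 > stats.chi2.ppf(alpha, df=s0 - 2):
            break                   # expanded window no longer fits a line
        h += step
    return h
```

The resulting $h_x$ values would then be post-smoothed over $x$, as suggested above.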
There are obvious modifications of this method for local polynomial orders say 2 and 3 (in particular, with appropriate versions of $\tilde\sigma^2$ of (3.6), subtracting respectively 3 and 4 in the degrees of freedom). In the end the resulting estimators, say of orders 1, 2 and 3, can be scrutinised, and one of them selected by a suitable overall criterion.

8.4. What is the kernel smoothed local likelihood aiming at? Consider the local kernel smoothed likelihood $L_n(x, a, b)$ that was the starting point of Section 3. Indirectly its use hinges on considering the normal $\{a + b(t - x), \sigma^2\}$ model, say $f(y \mid t, a, b, \sigma)$, to be a relevant approximation to the real $f(y \mid t)$ density, for $t$ in the vicinity of a given $x$. To quantify this, note that $n^{-1}$ times the log local likelihood tends to
\[ \int \bar K(h^{-1}(t - x)) f(t) \Big\{\int f(y \mid t) \log f(y \mid t, a, b, \sigma) \, dy\Big\} dt, \]
by the law of large numbers. This shows that estimation based on $L_n(x, a, b, \sigma)$ aims at approximating the real $f(y \mid t)$ as well as possible in the sense of minimising the locally weighted distance
\[ \int_{N(x)} K_h(t - x) f(t)\, \Delta[f(\cdot \mid t), f(\cdot \mid t, a, b, \sigma)] \, dt \]
between true and parametrically modelled distributions, where $K_h$ is the scaled version $K_h(z) = h^{-1} K(h^{-1} z)$ and where $\Delta[f, g]$ is the Kullback–Leibler distance $\int f \log(f/g) \, dy$. In the present case this can be seen to be the same as minimising
\[ \int_{N(x)} K_h(t - x) f(t) \Big( \log\sigma + \tfrac12 \big[\sigma_{\rm tr}^2 + \{m(t) - a - b(t - x)\}^2\big]/\sigma^2 \Big) dt, \]
where $\sigma_{\rm tr}$ is the underlying true standard deviation of $y_i - m(x_i)$. This shows that local linear modelling aims to provide the best approximation $a_x + b_x(t - x)$ to the true $m(t)$ around $x$, in the sense of minimising $\int K_h(t - x) f(t) \{m(t) - a_x - b_x(t - x)\}^2 \, dt$. This readily implies $a_x = m(x) + O(h^2 m''(x))$. Also, the least false $\sigma^2$ aimed at is
\[ \frac{\sum_{x \in I} \int_{N(x)} K_h(t - x) f(t) \big[\sigma_{\rm tr}^2 + \{m(t) - a_x - b_x(t - x)\}^2\big] dt}{\sum_{x \in I} \int_{N(x)} K_h(t - x) f(t) \, dt}, \]
which is slightly bigger than the true variance parameter. The difference is $O(h^2)$ and usually small. These results suggest that quite sensible best local approximating models are aimed for, and, in particular, that the indirectly utilised normality assumption cannot matter much. See Section 5.3 for non-normal local likelihoods.

8.5. Avoiding the binning of data. Several of the methods arrived at here depend at the outset on a particular binning of the data into cells $D(x)$, the full data set being $\cup_{x \in I} D(x)$. Approximations of the quantities involved can be developed to replace sums over $x \in I$ with sums over all data points, thus avoiding dependence on any particular binning.

8.6. Nearness of the two basic estimators. The following calculations indicate that estimators stemming from the local constant model (Section 2) can often be viewed as simpler approximations to those stemming from the local linear model (Sections 3 and 4). Those using the local $a + b(t - x)$ model essentially perform an $O(h^2)$ debiasing operation on those using the simpler $a$ model. With notation as in Section 3.1, start by noting that the sizes of $s_0$, $s_1$ and $s_2$ depend on the underlying density of the $x$s, say $f$, and on its first and second derivatives.
For the symmetric probability density kernel $K$ used in connection with the weights $w_i(x)$, see (1.3), let $k_j = \int z^j K(z) \, dz$ for $j \ge 1$ and let $k_0 = K(0)$. Consider
\[ f_n(x) = n^{-1} h^{-1} \sum_{i=1}^n K(h^{-1}(x_i - x)), \qquad g_n(x) = n^{-1} h^{-3} \sum_{i=1}^n (x_i - x) K(h^{-1}(x_i - x)), \]
\[ q_n(x) = n^{-1} h^{-3} \sum_{i=1}^n \{h^{-2}(x_i - x)^2 - k_2\} K(h^{-1}(x_i - x)). \]
These functions are essentially estimates of $f(x)$, $f'(x)$ and $f''(x)$; indeed $f_n$ has mean $f + O(h^2)$, $g_n$ has mean $k_2 f' + O(h^2)$, and $q_n$ has mean $\frac12(k_4 - k_2^2) f'' + O(h^2)$. And
\[ s_0(x) = n h f_n(x)/k_0, \qquad s_1(x) = n h^3 g_n(x)/k_0, \qquad s_2(x) = n h^5 q_n(x)/k_0 + k_2 n h^3 f_n(x)/k_0. \]
In the typical large-sample analysis the smoothing parameter $h$ has to tend slowly to zero in order to achieve consistent estimation of $f$, $f'$ and $f''$; in fact $nh^5 \to \infty$ is required here. This shows that the size of $s_0(x)$ is typically bigger than those of $s_1(x)$ and $s_2(x)$. Some analysis also shows that $\tilde m_{\rm LL}(x)$ essentially performs an $O(h^2)$ debiasing type 'correction' on $\tilde m_{\rm NW}(x)$. This also says that the NW estimator can be seen as a first order approximation to the LL estimator when $h$ is small. Corresponding remarks are valid for the Bayes estimators.

8.7. Bayes estimation of regression derivatives. Suppose the first derivative $m'(x)$ is to be estimated. A natural method is to use local models of the form $m(t) = a + b(t - x) + c(t - x)^2$ around each given $x$, then carry out Bayesian estimation of the local parameters by conditioning on local data, and in the end use the $\hat b = \hat b_x$ component.

9. Conclusions. We have described a general Bayesian/empirical Bayesian local regression method, comprising up to five steps (a)–(e), as detailed in Section 1.3. There is a bewildering plethora of possible implementations of the general idea, as witnessed in Sections 2–4. It is worth stressing that the class of local linear regression methods, which has enjoyed increased popularity recently, emerges as the special case of the Bayesian programme that corresponds to flat priors on the local parameters. The breadth of the spectrum of solutions is a consequence of the Bayesian paradigm; each new application could have a different prior on the start curve and a different structure of the prior covariance matrix $\sigma^2 W_{0,x}^{-1}$ (see Section 3). In addition there are several ways of carrying out the empirical Bayes step, that of estimating parameters in the $W_{0,x}$ matrices from data. The methods of Section 4 end up as quite definite proposals, though, in situations where the start curve can be parametrised linearly in basis functions. We have also made the point that the Bayes solutions have the potential of outperforming classical methods, not only in the vicinity of prior guesses, but uniformly. Crucial factors for successful applications include a good parametric representation of the start curve $m_0(x, \xi)$ and a good representation of the prior precision matrices $W_{0,x}$. Further study is needed in order to single out the best practical ways of doing local Bayesian regression.

References

Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Menlo Park.
Breiman, L. and Peters, S. (1992). Comparing automatic smoothers (a public service enterprise). International Statistical Review 60, 271–290.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74, 829–836.
Cleveland, W.S. and Devlin, S.J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83, 596–610.
Cleveland, W.S., Grosse, E. and Shyu, W.M. (1991). Local regression models. In Statistical Models in S (J. Chambers and T. Hastie, eds.), 309–376. Wadsworth, Pacific Grove, California.
Erkanli, A., Müller, P. and West, M. (1992). Bayesian curve fitting using multivariate normal mixtures. Biometrika, to appear.
Fan, J. and Gijbels, I. (1992). Variable bandwidth and local linear regression smoothers. Annals of Statistics 20, 2008–2036.
Ferguson, T.S., Phadia, E.G. and Tiwari, R.C. (1992). Bayesian nonparametric inference. In Current Issues in Statistical Inference: Essays in Honor of D. Basu (M. Ghosh and P.K. Pathak, eds.), 127–150. IMS Lecture Notes & Monograph Series 17.
Frank, I.E. and Friedman, J.H. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109–148.
Friedman, J.H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817–823.
Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman and Hall, London.
Hastie, T.J. and Loader, C.R. (1993). Local regression: Automatic kernel carpentry (with discussion). Statistical Science 8, 120–143.
Hjort, N.L. (1994a). Dynamic likelihood hazard rate estimation. To appear, hopefully, in Biometrika.
Hjort, N.L. (1994b). Bayesian approaches to non- and semiparametric density estimation. Invited paper presented at the Fifth Valencia International Meeting on Bayesian Statistics; to be published, hopefully, in Bayesian Statistics 5.
Hjort, N.L. (1994c). Estimation in moderately misspecified models. To appear, hopefully, in Journal of the American Statistical Association.
Hjort, N.L. and Glad, I.K. (1994). Nonparametric density estimation with a parametric start. To appear, hopefully, in Annals of Statistics.
Hjort, N.L. and Jones, M.C. (1994). Locally parametric nonparametric density estimation. To appear, hopefully, in Annals of Statistics.
Hjort, N.L. and Omre, H. (1994). Topics in spatial statistics (with discussion). To appear, hopefully, in Scandinavian Journal of Statistics.
Hjort, N.L. and Pollard, D.B. (1994). Asymptotics for minimisers of convex processes. To appear, hopefully, in Annals of Statistics.
Lehmann, E.L. (1983). Theory of Point Estimation. Wiley, New York.
Ruppert, D. and Wand, M.P. (1994). Multivariate locally weighted least squares regression. Annals of Statistics, to appear.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Silverman, B.W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting (with discussion). Journal of the Royal Statistical Society B 47, 1–52.
Stone, C.J. (1977). Consistent nonparametric regression (with discussion). Annals of Statistics 5, 595–645.
Tibshirani, R.J. and Hastie, T.J. (1987). Local likelihood estimation. Journal of the American Statistical Association 82, 559–567.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Wand, M.P. and Jones, M.C. (1994). Kernel Smoothing. Chapman and Hall, London. To exist.
West, M., Müller, P. and Escobar, M.D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of Uncertainty: A Tribute to D.V. Lindley (A.F.M. Smith and P.R. Freeman, eds.). Wiley, London.