Resolution and Scale Independent Function Matching Using a String Energy Penalized Spline Prior

The extension of the classical Bayesian penalized spline method to inference on vector-valued functions is considered, with an emphasis on characterizing the suitability of the method for general application.We show that the standard quadratic penalt…

Authors: David M. Rogers, Thomas L. Beck

1.1. Motivation and Problem Background. Function estimation from observational data is an important modeling tool because it allows us to explore the relationship between the independent variable and its response variable as well as predict observations yet to be made. Some simple example applications involve estimating continuous changes in a property with respect to time (e.g. helmet acceleration during impact [32] or dose-response curves [15]). More advanced applications can involve more than one dimensional input, for example estimating scalar functions of space and/or time variables [18,6]. Here the function is modeled as a linear combination of functions with one or two-dimensional arguments, giving it the classification of a generalized additive model [12]. Yet another generalization is possible when the function is vector-valued [38], for example the force on a particle moving in three dimensions [9]. We find in this last case the framework for analyzing a large class of novel applications such as stochastic integrators. In order for a function estimation method to be robust with respect to the experimental setup, it must satisfy two major requirements. First, the estimator must be resolution independent in the sense that it becomes stable in the limit of arbitrarily high resolution (where an acceptable approximation to true function is almost certainly in the parameter space). Standard least squares estimators (maximum likelihood estimation using Eq. ( 1) with λ = 0 and without log λ ti terms) [25] do not satisfy this criterion. Smoothing splines were shown to have this property [27] and even an asymptotic equivalence to convolution kernel-based smoothing in the early work of B. Silverman [31,32], which was carried on by K. Messer [21] and D. Nychka [24]. A subsequent improvement in the original formulation came through divorcing the measurement positions, {r l }, from the spline knots [8] which allows for more efficient calculations. The resolution independence of penalized splines allows us to overcome the difficult "bin-width" problem encountered in linear least-squares fitting, where too low or too high resolution cause unphysically smooth curves or overfitting. Second, the estimator must be scale independent in the sense that it predicts the correct function shape when both the function and its noise are subject to arbitrary scaling. The Akaike information criterion (AIC) method fails this test, while generalized cross-validation (GCV) is scale independent but fails to separately address estimation of the sample variance. The physical spline device already satisfies the infinite resolution limit sought above for a nonparametric method. It consisted of a section of rubber held in place with push-pins at several points along a desired smooth curve [37]. This device could be numerically implemented by simply minimizing the potential energy function for (a suitable computational description of) a string's position subject to the given constraint points. However, modern applications of function fitting are often based on noisy measurements so that we cannot assume the input data {y, r} M 1 to constitute absolute constraints. To modify the above device for this situation, imagine placing a push-pin at the location of each input measurement and making its connection to the spline by attaching a vertical spring. This is, in fact, the system we are led to by the Bayesian analysis of Sec. 2 -where the role of the spring constants are played by z, the inverse of the measurement variance. In addi-tion, the specification of an energy function leads to a complete description of the probability density for our spline's position if Boltzmann statistics are assumed. This assumption equates the minimum energy solution to the popular maximum likelihood estimate of statistics. Unfortunately, the system's potential energy function now depends jointly on the spring constants and the spline tension, λ. This causes the ratio between the two (λ/z) to decide the form of the final solution. Nevertheless, it is an important result of this report that when proper prior probabilities are chosen for the tension and spring constants, scale-independence is achieved. In order to motivate the choice of B-spline basis functions used to describe f (r), we note some of their properties here. In keeping with the nonparametric requirement we require that the basis set be able to approximate any sufficiently smooth function to good accuracy. Using B-spline basis functions to represent the function space of interest allows a choice of the function's continuity, via the spline order, as well as its resolution, via the number of free parameters or knots. These choices can be important in numerical applications of the fitted function. In addition, for all choices of the knot spacing and spline order, the best B-spline approximation can be shown to have the same approximation error as a polynomial fitting [26]. Several issues in using penalized splines have been addressed by previous studies. In attempts to understand scale dependence, several proposals for choosing the penalty scale parameter have been considered [28,18,15], including introduction of non-uniform (in r) penalty and/or variance parameters [10,5,30,4]. Adaptation to cases with non-Gaussian sampling error have been addressed by consideration of inference schemes with alternative likelihood functions [3]. Notably, several authors have interpreted the complete fitting process as a Bayesian inference problem [18]. This leads to consideration of the penalty function as a prior distribution on function space, with the attendant penalty parameter as a (nuisance) hyperparameter. 1.2. Material Covered. In view of the above considerations, it seems that each new application area of P-splines should reconsider the proper choice of prior probability for function space and likelihood function for the problem at hand. This report will attempt to address these issues in the Bayesian formulation of P-splines, so that the method can be used in new applications with confidence. First, section 2 presents a physically motivated re-derivation of the prior penalty by likening the function to a stretched string. The shortcomings of the standard, scale-invariant prior probability for the penalty parameter are re-examined and found to be due to the possibility of a degenerate polynomial solution. We show that we can exclude this possibility by introducing a string zero-point energy to modify the standard Jeffrey's prior, while retaining scale invariance for arbitrarily large measurement scalings. Next we derive important properties of the proposed prior distribution in section 3. Resolution independence is proved by finding that the approximation error of the n th derivative of a B-spline solution is of order h r-n . A review of the link between spline and convolution-based smoothing shows the role of higher-order derivatives in the penalty function. The marginal distribution of the penalty parameter serves to further clarify the role of the zero-point energy and greatly simplify sensitivity analysis to this fixed numerical value. The insight gained in these two sections allows automatic identification of the scale parameter in new application areas. Section 4 compares results from Markov chain Monte-Carlo (MCMC) simulations on several test applications to provide a numerical test of the scale independence sought. An implementation of the multi-dimensional method for matching the output of a stochastic integrator is also reviewed and tested. For the particular problem studied, it is found that use of the posterior average estimator gives significantly better results than maximum likelihood estimation for small sample sizes. Finally, notes on B-splines and computational implementation of the integrations necessary for calculating higher order derivatives in the penalty function are provided in the appendix. 1.3. Problem Formulation. We consider the regression problem where M (possibly noisy) measured values, {y l , r l } M 1 , of a property y = f (r)+noise are to be fitted to an arbitrary function f (r) = We allow for the possibility that y and r are vector-valued by choosing the design matrix D(r) ∈ M N ×p . In order to infer the value of the parameters θ ∈ R p , we derive the penalty function The indices are necessarily introduced by allowing each dimension of y to imsart-aos ver. 2009/08/13 file: stsplines.tex date: November 6, 2018 have its own measurement variance σ 2 i = z -1 i , and f (r) to be composed of the sum of N f B-spline functions B(t) of one-dimensional variables t k (r) ∈ S t k and parameters θ k such that N f k=1 p k = p. Each of these functions has its own roughness penalty of order n, whose scale parameter (or tension) is λ k . We can analyze this complicated form by considering the simple case where N and N f are both one, making t 1 (r) = r and 2. Derivation of the Method. Thermal Equilibrium of a Stretched String. The prior probability for the space of all possible functions has been widely taken to be a Gaussian distribution on the function's spline parameters. The penalty matrix (inverse of the variance-covariance) is either an n th order difference on successive spline parameters or derived from an integral over the squared n th order derivative of the function -corresponding to an n th order random walk of the control points or the function f (r), respectively [8,18]. Here, an appeal to the latter, traditional, penalty function is laid out based on consideration of the spline fit as a smooth string. In the tradition of the classical spline device [37], it seems appropriate to consider the function's n -1 th derivative f (n-1) (r) as such a stretched string with tension T (r). Once a suitable energy function for the string is available, the assumption of Boltzmann statistics specifies a probability distribution for the string's position (and hence its parameters). Under Dirichlet boundary conditions, the standard expression for a string's potential energy is given by [using the notation from Eq. (2)]: where ρ(r)dr is the volume element for integration (i.e. dr for a line, 4πr 2 dr for a spherically symmetric membrane, etc.). For uniform tension T 0 (or tension with unknown scale but known variation with respect to r) and a B-spline basis for f (r), this formula can be integrated to give the familiar P-spline penalty function: where we have made the definitions: The relation for the derivatives of f (r), Eq. ( 5), follows from the linearity of B-splined functions in their parameters, θ. As we allow a separate Q for each additive function of one variable, A is simply a row-vector. In this report, we distinguish matrix-multiplication from scalar multiplication with the symbol "•". Likening the penalty to the spline energy gives the following connection to statistical mechanics. A system allowed to exchange heat with its surroundings until thermal equilibrium is reached has a Boltzmann probability distribution: Where the inverse thermal energy β determines the scale of the energy fluctuations (and thus the spline roughness) allowed in the system, whose position is specified by coordinates x. Defining the product βV T 0 as λ recovers the traditional penalty function and assigns the meaning of "dimensionless tension" to the penalty parameter, λ. Applying Boltzmann statistics to our string system thus gives us an improper normal distribution, since the n string modes corresponding to coefficients of an n -1 degree polynomial have been assigned a uniform prior due to their absence from the energy function. It is interesting to note that the above formula starts from a string energy density which is dependent on the shape of the function, not the number of spline parameters, p. However, the prior becomes dependent on p through the rank of Q when normalized by integration over θ. From a mechanical perspective, this occurs simply because larger numbers of parameters allow the function much more free space to move. This makes the probability of any individual choice of θ correspondingly less. From a Bayesian perspective, this can be justified by considering how the number of functions within a small region of function space grows with p. We have thus recovered the common element of all previous formulations of P-splines via invoking Boltzmann statistics on an intuitive physical model system. As noted in the introduction, however, the penalty parameter and sample variance remain to be dealt with. The standard Bayesian choice is a simple scale-invariant prior for λ, giving Eq. 10 with E 0 = 0 as the complete prior. Although this approach works most of the time, it has one serious numerical issue which appears in the limit as λ approaches infinity. This situation occurs when the sample size is small and the derivative order is large, allowing the belief that an n -1 degree polynomial is a possible solution to the inference problem (i.e. a singularity at λ = ∞). Because MCMC sampling alternates between drawing values for θ and λ, it can get stuck once the algorithm encounters a λ large enough to force a polynomial solution. Because of the divergence at large λ, this parameter requires an alternative prior distribution as has been done by several authors. Indeed, the earlier non-Bayesian suggestions for penalty parameter selection implicitly suggest a prior on λ. Our analogy to stretched strings provides an alternative method for deriving such a prior. We gain a vital clue by the above noted degeneration of the sampling process into a least-squares polynomial fit. Since our problem formulation in terms of stochastic splines implies an aversion to such simple solutions, we need to find a way to explicitly add this aversion as prior information. Adding ν observations of small displacements, ∇ (n) f (r i ) 2 = w 2 i to our state of knowledge gives a Gamma distribution for λ| E 0 = ν 1 w 2 i with parameters a 0 = ν/2, b 0 = E 0 /2. This approach was tried by Lang and Brezger [18] with ν = 2 and found to give results dependent on the scale of the function measurements due to bias toward a 0 /b 0 -which is the prediction for λ implied by the displacements assumed in the new prior. This prediction for λ is effectively specifying an energy scale for measuring displacements in our string. Jullion and Lambert [15] propose to set ν = 2 and provide a hyperprior for E 0 /ν. The corresponding interpretation is that ν prior observations of the string displacement were made but subject to large uncertainty. However, these approaches assume some prior knowledge about both ends of the energy scale, where the original intent of the additional information was simply to eliminate numerical instability at the high tension (degenerate solution) end. The usual difficulty with scale-invariant priors is divergence toward zero due to lack of observations of the corresponding process' scale. Here, observations of the tension parameter are already made through the movements of the function away from a polynomial form. Divergence toward infinity as noted above should only occur if the input points actually lie on a polynomial and no noise occurs in the system. In all other cases, such divergence is an unwanted numerical artifact. In order to deal with this artifact, suppose that we arbitrarily set a value for θ T • Q • θ at which we consider the spline to be a degenerate polynomial solution. Since we are not interested in variations of the function below this threshold, we propose modifying the standard scale-invariant prior to impose this condition on the sampling process by adding E 0 to θ T •Q•θ. As noted in the discussion above, this is equivalent to the limit of specifying a string containing a zero-point energy, E 0 , without any explicit observations (i.e. ν = 0). This argument shows that E 0 is a measure of residual uncertainty that the solution is an n -1 degree polynomial. P (θλ|I) = P (θ|λI) P (λ|I) Setting ν = 0 has important consequences for previously reported problems with this prior [15], since the implied prior distribution for λ now approximates the scale independent prior, in particular its cumulants approach zero independent from E 0 . In fact, the new prior distribution on ln λ is sigmoid-like, going from 1 at -∞ to 1/2 at ln λ 1/2 = ln(2 ln 2) -ln E 0 (with slope -(ln 2)/2) to 0 at ∞. Gaussian likelihood functions with variance σ 2 ≡ 1/z, the posterior distribution is given by And the conditional posterior distributions are therefore imsart-aos ver. 2009/08/13 file: stsplines.tex date: November 6, 2018 where we have formed a matrix D ∈ M M ×p by stacking row-vectors of spline coefficients B(r) from all observations. Similarly, Y is a column-vector of all the observed function values. An additional parameter appears because the same type of situation described for λ can also occur for the inverse variance parameter when the number of measured data points is low compared to the spline resolution. In this case, the spline can exactly match the input measurements resulting in a singularity as z → ∞. This case has been studied in detail by Wahba and Wang [36]. Our remedy will be just as above, adding a minimum variance V 0 into the residual sum of squares. Using MCMC techniques [29,11] to sample the posterior distribution Eq. (12)(13)(14) is the direct analogue of observing the multiple positions of a spline device (as described in the introduction) in thermal equilibrium. Alternately, a maximum likelihood estimate can be carried out using relatively fewer iterations of the above steps by simply maximizing each conditional posterior and iterating until convergence. Both processes require a Cholesky decomposition for θ at each iteration, which is an O(p 3 ) operation. We can calculate the classical mean-squared error (MSE) of the function estimate f (r) using Eq. 12 to compare with standard results [23]. This procedure gives an indication of the convergence toward a known θ 0 . This is in contrast to the Bayesian a posteriori error estimate for θ, which is given by the covariance matrix (zD T • D + λQ) -1 ≡ K/(zM ). The squared error is expressed as 2 dr and its expectation is taken with respect to all possible variations of the random noise. For the one-dimensional case, we therefore assume the vector of observations is Y = D • θ 0 + ǫ, and the corresponding estimator and error PDF are: Where we have used The second term on the right hand side is reminiscent of the Bayesian estimate, while the first term represents the bias introduced by assuming function smoothness. For a fixed λ/z, the total dependence on M for this term (remembering Tr(D T • D) = M for periodic splines) is order M -2 . However, this is not a direct estimate of the bias of the full Bayesian inference scheme, which allows λ/z to vary. 2.4. Multi-dimensional generalization. We first generalize Eq. ( 12) to the case where f (r) = N f k=1 f k (r) is the sum of multiple additive scalar functions. We can do this by extending the basis for y(r) = D(r) • θ by tacking on additional coefficients. For the case of ] -taking the total number of parameters to be p = N f k=1 p k . Each additive function should have its own string energy prior, making λ a vector of dimension N f and Q a set of matrices penalizing their respective functions. The posterior probability for each element of λ replacing Eq. ( 14) is thus In addition, when N f > 1, each function can only be determined to within an additive constant, and identifiability constraints must be placed on all but one of them. Next, we generalize to vector-valued functions for models of the form f (r l ) = D l • θ, where D l is an N × p matrix. This can be constructed by summing matrices formed from the tensor product of any direction vector g k (r) ∈ R N with the familiar spline coefficients: Identifiability problems are worse for the multidimensional case. For definiteness, assume without loss of generality that each function has a unique argument B k (r) = B k (t k (r)). If two functions B k1 and B k2 have the same argument but different directions, they can combined to make a single unique function B k (r k ) = (g k1 (r) + g k2 (r)) ⊗ B(r k ) • θ k . Cases with the same argument but different ranges are irrelevant, since their ranges can be combined to make a single, larger spline function or considered as two independent variables for the present analysis. If, on the other hand, two functions with different arguments share the same directional vector for all samples encountered, an identifiability problem occurs. We therefore make the further assumption that the sample size is sufficiently large so that if two directional vectors differ in some region in the space of r, that region is included in the sample. Allowing multiple functions to share the same set of spline parameters collapses the total set of functions to identify via combining functions B k which share parameters θ K to make a total of N F unique functions. Note the use of capitalized indices for combined sets. This leaves us with genuine identifiability constraints which can be seen by examining Wherein θK denotes the constrained θ K (i.e. its average has been subtracted) and C K are arbitrary additive constants. The effect of constraining θ K during sampling is to force C K to zero. Any direction in which {C} could vary while leaving f (r) unchanged for every r must therefore have a corresponding constraint. The number of constraints is thus determined by the rank of R ∈ M M ×N f , formed by the column-vectors D * ,K •1 (associating K with the column). Using the definition (20 the assumptions above imply that the number of constraints is equal to the number of persistent linear dependencies among g K (r * ). It is worthwhile to notice that a multidimensional basis function can be constructed from differentiation of a scalar function of the familiar linear mixed model type (e.g. . When such derivative information is used, more data points than a corresponding sample from the scalar function are acquired with each sample. In this case identifiability problems only occur when the function can be simplified by reducing the dimension of {t K (r)} N F 1 in a linear algebraic way. Using a multivariate Normal likelihood function and assigning each dimension of the observation its own independent variance again gives a set of linear equations to solve during sampling, replacing the mean and variance of Eq. ( 12) with ( 22) . Where we have formed a list of matrices D ∈ M S×N ×p by stacking D l ∈ M N ×p for all observations and Q is understood to be an appropriate blockdiagonal matrix of Q k -s. Correlations between elements of each measurement can be incorporated by a trivial modification as long as they remain a known function of r. If N I dimensions share a common z I , the posterior probability for each element of z replacing Eq. ( 13) is (23) z 3. Properties of the Proposed String Energy Prior. Most of the properties of penalized splines can be analyzed analytically for a fixed ratio, α ≡ λ z . This is due to the existence of unique solutions to the onedimensional spline fitting problem for an unrestricted function in a Hilbert space [17,27]. In addition, Silverman [31] showed how the solution process can be understood in terms of a convolution kernel at the large sample size limit, establishing an analogy between convolution smoothing methods and spline smoothing [21,22,24,1]. These results allow us to investigate resolution independence and the effects of differing penalty function orders, n. To show how this result comes about and analyze its limitations, we will sketch a short and intuitive derivation here, while noting that more extensive research on this analogy has been carried out by others. Next, in order to expand these results for variable α, we derive the marginal likelihood of the variance and penalty parameters. This allows us to understand the properties of a Bayes' estimate for θ in terms of averages over α as well as study the trade-off between under or over smoothing. A general formula for the Fourier transform of the convolution kernel [Eq. (36)] corresponding to spline smoothing with arbitrary derivative order, n, was first given by Silverman [31]. Subsequently, several authors improved the results on asymptotic convergence of smoothing spline and convolution-based methods, including Messer [21] and Messer and Goldstein [22], where convenient approximate convolution kernels were derived for fixed sample spacing. Nychka [24] relaxed the equal spacing restriction and showed that the convolution kernel decays exponentially as long as the samples are spaced closely enough. Finally, Abramovich and Grinshtein [1] provided a systematic derivation of an asymptotically equivalent convolution kernel for arbitrary derivative order, sample spacing, and variable tension. We will briefly sketch the results of the latter here, while providing the connection to finite element solutions as proposed here in order to establish resolution-independence. We begin with the task of finding the function u ∈ H n (Ω) (where H n (Ω) is the standard Sobolev space of functions with square integrable derivatives on Ω up to n th order) which minimizes the functional (24) Φ In the following discussion we are considering both the infinite, Ω = R, and periodic domains, Ω = [0, V ). Next, note that we can re-formulate the sum imsart-aos ver. 2009/08/13 file: stsplines.tex date: November 6, 2018 appearing on the right hand side as an integral using the definitions: so that Φ can be conveniently expressed as a norm using the scalar product < a, b >= Ω ā(t)b(t)dt, where ā denotes complex conjugation. ( 26) To keep the notation simple, the definitions φ h ≡ φ 1/2 and ρ h ≡ ρ 1/2 are made. The solution can be found by setting the functional derivative to zero since the functional is positive and has a unique minimum as long as ρ h is bounded above zero everywhere in Ω and φ(t) contains at least n mass points [16]. We proceed as in Kimeldorf and Wahba [17] by using the scalar product invariance of the Fourier-Plancherel transform [34] Ω e iωt g(t)dt, as well as the convolution theorem Functional differentiation with respect to ũ(ω) yields the minimum as the solution of The last equation gives us the differential operator form of the minimization problem (24) [16,35,2]. The interpretation for continuous sample density φ(t) is immediate, however it also remains valid for finite M in the following sense. On any interval between sample points, φ(t) as defined above is formally zero, and hence the solution u(t) is that of the homogeneous equation -i.e. a 2n -1 order polynomial when ρ is constant or (by definition) an L-spline for general ρ [17]. When integrating equation (29) over small regions, we can treat all terms in the equation on equal footing at the sample points. The net effect of the sample points is thus to effect changes in the function's n th (and higher) order derivatives. The above argument (and references) show that the unconstrained solution of Eq. ( 24) is a unique element of H n (Ω), and therefore (by definition) nonparametric. To show precisely what is meant by a resolution independent spline estimate, we must prove that the spline function can approximate the nonparametric solution to arbitrary accuracy as the number of spline parameters increases. To do this, we draw an analogy between the B-spline solution θ|α of Eq. ( 12) and the Ritz-Galerkin finite element (weak) solution of Eq. ( 29). This lets us apply the typical error bounds for B-spline approximation [26]. Multiplying (29) through by a test function v(t) and integrating gives the bilinear form (30) a(u, v) To use standard procedures to get the approximation error in the induced norm u a = a(u, u) we first define the norms (31) |v| with |v| k,∞ as the maximum of v (k) over Ω. Now, let u h = D({t} M 1 ) • θ be the solution of Eq. ( 12) and note that it is the minimizer of u -u h a over all u h in the space of r th order Cardinal B-splines with knot spacing h (denoted S r h ). We thus have, for any v h ∈ S r h , Where we have defined ρ max = |ρ h (u -v h )| 0,∞ . Choosing v h as a projection of u onto S r h following Reif [26], we find where C 1 and C 2 are constants proportional to |u| n,∞ . The approximation error thus falls into two regimes depending on the "effective" samples per interval h n M/(αρ max ). For large effective sample sizes, the first term in Eq. ( 33) dominates and gives For small sample sizes, appropriate in the infinite resolution limit, the second term dominates and gives |u -u h | n,2 ∼ O(h r-n ) -both of which are of the same order as an optimal polynomial approximation to u. From Eq. ( 36) it is also easy to find a convolution kernel estimate for u. According to Eq. 12, this estimate is Which explicitly states our solution as a convolution with the kernel G (x, y), so that our function estimate is a linear combination of B-spline basis functions. Setting ρ = 1 in the large sample size limit, when φ is approximately constant over a large range of knots, G is symmetric and its Fourier transform [from Eq. ( 28)] approaches [31] (36) . Convolution kernels for other several derivative orders, n, are plotted in Figure 1, while for general ρ and φ, Abramovich and Grinshtein [1] have provided a method to systematically derive such asymptotic approximations. These asymptotic approximations give a sense of how the algorithm changes in response to differing penalty orders, n. From the discussion following (29), as n increases, higher order derivatives of the function are changing at the sample points, leading to a larger width convolution kernel. Note that the asymptotic approximation plotted in Fig. 1 breaks down when M < p. However resolution independence is still maintained; since in intervals without design points the solution will approximate the L-spline solution to (29). This provides a second interpretation to n as specifying the order of the "default" polynomial solution. As in the previous section, the change of variables from λ to zα uses the insight that the spline parameter estimate, θ, depends only on the ratio α. This change makes the posterior PDF of section 2.3 into Convolution kernels, G(x -y), plotted for α/V M = {1, 1/3, 1/9} (solid black, dashed, thick grey, respectively), and φ = 1. Higher sample numbers cause the kernel to steepen, placing more emphasis on x = 0, while higher α values do the opposite. The marginal probability density zα|DI is found by integrating over the pdimensional θ, assuming there are enough input measurements to determine the n improper dimensions of Q corresponding to an n-1 degree polynomial fit. This shows that the variance is Gamma distributed conditional on α. (39) P (z|αDI Interestingly, this estimate for the variance is at odds with most procedures suggested in the literature [35]. Taking the inverse of the average, we get E (z|α which resembles the square error estimate with M -n samples but includes the string energy, ǫ 2 Q . An explanation for this could be that specifying α gives information about the error scale, z relative to average string deviation where specifying λ alone would not. This reveals some of the subtle distinctions which are often missed between λ and α, especially in discussions of variance estimates. It also validates the interpretation of E 0 and V 0 as residual uncertainties and proves that the variance estimate is insensitive to them as long as they remain less than ǫ 2 Q and ǫ 2 f . Marginal distribution of the Penalty Parameter. Since z|αDI has a Gamma distribution, the normalizing constant for z is known and it can be integrated out of (38) to give The first term and the determinant in the denominator can be thought of as offsetting terms, pushing smoothness higher until the point where the p -n eigenvalues of Q dominate the determinant. The second term is the most interesting if we remember that α represents the ratio between the spline tension and spring constants. The above distribution says that we can learn about the over/under-smoothing tradeoff through comparing the relative magnitudes of the function deviations, ǫ 2 f (α) + V 0 and the roughness, ǫ 2 Q (α) + E 0 . As α increases, the function deviations increase and the roughness decreases, so that the two energy scales can reach a balance. The appearance of E 0 and V 0 is natural in this context, since it prevents choosing either extreme. The (twice negative log of the) posterior likelihood (40) can also be compared to the AIC and GCV functions for choosing α which have been extensively analyzed in the literature. Defining H(α) ≡ D • (D T • D + αQ)D T , these two functions are: Where we have set ẑ = 23.74 -2 , following Eilers and Marx [8] and ignored constant terms. First, we note that neither of these two functions has a straightforward generalization for inference on vector-valued functions, so we must consider alternatives here. This is because there is not a one-to-one correspondence between individual functions to be added together and the dimensions of y(r), rather all functions have the ability to contribute to all output dimensions. Examining the criteria specified in this report, the GCV has very good properties for the one-dimensional case. Although the interpretation of α as specifying the relative scale between λ and z is not clear with the GCV estimate, its minimum remains invariant to scaling both the function and its error (scale independence) since ǫ 2 f and H are functions of α only. However, alternatives such as the AIC or Mallow's C p criterion which require an a priori estimate for the variance do not necessarily have this property. Figure 2 compares the different functions of α for Härdle's motorcycle helmet data [8]. It can be seen that the GCV and AIC predict values in roughly the same neighborhood, while the present log-likelihood has a minimum at a much lower α. Despite this, the present method differs very little from the GCV estimate in this example, while the AIC estimate (not shown) is indistinguishable from the GCV curves. We can therefore conclude that averaging over a range of α values does not spoil the fit to this data set, but allows additional smoothing (as can be seen from Fig. 1) as well as a more cautious error estimate. 4. Example Problems. We consider some example cases which serve to test the scale independence of our method as well as illustrate the significance and usefulness of multidimensional function matching. First, the proposed prior is compared to other suggestions in the literature for a few simple test functions. Next, we progress to a high-dimensional example which attempts to describe the dynamics of 256 atoms. This system and others like it are numerically tractable since their behavior can be explained in terms of only a few additive functions, while the main difficulty of fitting these functions lies in separation of the measured sums. 4.1. Method Comparison. We investigated the properties of various Pspline formulations using a standard set of model functions proposed by Lang and Brezger [18]. As in that report, ρ is chosen to be constant, and the derivative order is set to n = 2. The most important test functions for our current purposes are the linear function f 1 (r) = r/1.758 and the sinusoidal function (here modified from the original for exact periodicity) f 3 (r) = sin(πx/3)/0.72. The linear function tests the algorithm's stability in the degenerate polynomial solution case, and the sinusoidal function tests the algorithm's over/undersmoothing trade-off. The sinusoid function was modified because for the small number of samples used here almost all fits were noticeably linear when periodic B-splines were not used. These functions are defined over the range [-3, 3] and fit to 20 4 th order (cubic) B-spline knots using 20 equidistantly spaced noisy samples in order to force the solver into the low-sample regime. In order to compare across different versions of the algorithm and different scales within each version, the same set of independent random noise values generated from the standard normal distribution were appropriately scaled and used in all tests. All mean squared error results in this subsection have been renormalized via division by the square of the initial scaling applied. This makes for easier comparison, but requires us to keep in mind that larger error scales still produce worse fits in an absolute sense. All systems used MCMC Gibbs sampling of λ,z,θ as described in section 2.3 with a burn-in of 2500 steps, followed by 25000 sampling steps, collecting one sample every 5 steps for a total of 5000. Correlation times for λ were always higher than those for z, and are included in our figures (right y-axis). Correlation times for θ are not as important, since the conditional average (E (θ|λ, z)) has been collected with each sample. Figures 3 and4 show boxplots of the (normalized) empirical root mean squared deviation (RMSE) of the spline from its respective function at the 20 input sample points -log scale, left axis. Each figure shows the RMSE for the input samples (far left) as well as results from four different prior distributions for λ. They are (from left to right): X: (λ|I) ∼ Γ(0, 10 -10 /2), (z|I) ∼ Γ(0, 10 -10 /2); Y : (λ|δI) ∼ Γ(1, δ), (δ|I) ∼ Γ(10 -4 , 10 -4 ); Z 2 : (λ|I) ∼ Γ(1, 10 -2 ), and Z 6 : (λ|I) ∼ Γ(1, 10 -6 ). For each function and prior distribution, several different noise scales were applied to the random deviations, with σ indicated on the x-axis. In order to compare the relative error achieved with the efficiency of the method, the autocorrelation function of λ was calculated and fit to an exponential for each test to estimate the correlation time. The average and standard deviation of this quantity over all 100 runs gives an indication of the number of MCMC steps required to draw an independent Boxplots of error distributions for (left to right) variance of input samples, X: this report's method, Y : Lambert and Jullion's method, and Lang and Brezger's method using Z2: b = 10 -2 and Z6: b = 10 -6 . For each method, several noise scales are shown as explained in the text. Green lines (offset for visual clarity) display decorrelation time. The right panel shows the ratio between the sample RMSE and the estimated RMSE (left scale, calculated by averaging z -1 ) and the magnitude of posterior average penalty parameter. The linear case is an interesting smoothing limit to investigate, since the best approximation we can make is fitting to a polynomial of order n-1, and the bias of the MSE derived in Eq. ( 17) is zero. A polynomial fit corresponds to λ → ∞ in Eq. (2), a case for which our numerical solver becomes unstable [see discussion in section 2.2], which indeed occurred during our test when we set E 0 = 0. Our choice for E 0 sets an effective maximum on λ, which could lead to undersmoothing. However, the MSE and RMSE results of Fig. 3 from different methods are almost identical with the exception of Z 2 , indicating that λ was large enough. This figure shows that the λ selected by each method is strongly influenced by their respective prior distribution. In particular, Z 2 and Z 6 show a clear preference for 10 -2 and 10 -6 , respectively. Methods X and Y seem to show more response to variations in the error scale, decreasing λ as σ 2 increases (implying a relatively stable choice for α). Fitting to the sinusoid function shows the under/oversmoothing tradeoff of each method. Figure 4 shows that all methods (with the exception of Z 2 and Z 6 ) perform equally well predicting both the function and error scale. Methods Z 2 and Z 6 break down at low signal to noise, where the strength of their prior distribution for λ begins to outweigh the data. The similar performance of X and Y in this test indicates that they are less sensitive to their prior for λ. This could have been predicted from fitting results, since (for X) the function error, ǫ 2 f , deviates by more than E 0 from a polynomial solution (see Sec. 3.3), and (for Y ) λ does not have too small a value. To see why the fitting results of Y may display sensitivity to the prior parameters, we can integrate the compound prior P Y (λδ|I) dδ to give P Y (λ|I) = (b/(λ + b)) a with a = b = 10 -4 as used here. This function deviates significantly from the Jeffrey's prior λ -1 at small λ, indicating a preference for larger λ. However, it has a smoother decay toward zero at λ → ∞ than X. This smoother decay at large λ could also be achieved by adding a similar hyperprior for E 0 in our method. The function f 3 was selected for a further test of scale-dependence. This was done by varying the overall scale of the function and its noise on a logarithmic scale from 10 -9 to 10 +9 . Each data set used for fitting contained 50 equally spaced samples. Although a scale-dependence of the spline fit was noted by Lang and Brezger with the suggested remedy of standardizing the input samples, this strategy is difficult to justify if adaptive penalties are considered, and impossible for multi-dimensional measurements, where no linearly independent set of directions to standardize is guaranteed. None of the tests reported here make use of such standardization of input values, ), other symbols as in Fig. 3 since an ideal method should produce identical (scaled) results for any choice of overall scale. The results of this test are shown in Fig. 5. As expected, Z 2 and Z 6 show a bias toward their respective prior λ values, giving good MSE results for scales of 10 -1 or 10 -3 . However, they show under/oversmoothing when the scale is below/above that value, respectively. Undersmoothing has a characteristically higher MSE and a larger range of σ estimates, while oversmoothing shows a wider range of MSEs and outliers where σ is overestimated along with a wide range of λ correlation times. For X, the method breaks down at scales less than 10 -5 . This happens because the scale of the input data is below the order of √ E 0 , violating our assumption in setting E 0 . To understand this in the context of our prior from Sec. 2.2, even as the data scale diminishes (reducing ǫ 2 Q ), λ gets stuck at E 0 and can't go higher. This can be seen directly in Fig. 5. The same process happens for z, which gets stuck at V 0 as the scale diminishes. This has caused log 10 σ2 to bottom out at -11.5 at a scale of 10 -6 and -11.6 at a scale of 10 -9 [off the scale of Fig. 5]. Based on these observations, we can ask if ǫ 2 Q /E 0 or ǫ 2 f /V 0 is too small (say less than 10) as an indicator of whether our method is predicting an exact polynomial fit or an exact fit of the input data. At large scales, these ratios are large, indicating that E 0 and V 0 have become irrelevant and scale-independence is achieved. As explained above for method Y , λ must decrease at large scales, but the prior for λ is too small close to zero, causing oversmoothing. 4.2. Multidimensional Problems. Modeling of dynamic processes is usually carried out via numerical integration in time (τ ) of, e.g. Newton's equation of motion for a given potential function, E(r). imsart-aos ver. 2009/08/13 file: stsplines.tex date: November 6, 2018 Where the system position is specified by r ∈ R N (N being three times the number of atoms, r being indexed by 3i + α). Standard molecular dynamics potential functions consist of multiple additive functions and are of the form where each t k (r) is a scalar function of r, e.g. a distance between two atoms. Attempting to match samples of the force {y, r} requires differentiating the above with respect to r to give the linear mixed model ( 45) Here -∂t k (r)/∂r is an N -dimensional vector with components -∂t k (r)/∂r 3i+α giving the direction of each contribution to the total force, identified with g k (r) in section 2.4, and each The most striking property of this model is that it holds to the simple linear mixed model form and is amendable to the same sorts of analyses carried out there. For a nontrivial example of this type of problem, we consider a system of 256 Lennard-Jones particles with unit mass interacting via the pair potential Using the notation |r i -r j | to mean the Euclidean distance between r i ∈ R 3 and the closest periodic image of r j (since all the atoms have been placed into a cubic cell of length 7.49). By setting c ij = 1/2 whenever 1 ≤ i ≤ 128 and 129 ≤ j ≤ 256, or c ij = 1 otherwise, we have set up a twocomponent (binary) mixture. Due to their relatively low cross-attraction, particles of type A (1 ≤ i ≤ 128) and those of type B (129 ≤ i ≤ 256) have been found to be immiscible under the types of conditions used here [20], separating into two liquid layers within the simulation cell. The liquid state may be metastable with respect to a solid crystal phase, but such crystallization was not observed in any of the simulations reported here. Numerical integration of Eq. ( 43) with a time-step of 1.461 × 10 -3 was used to simulate this system until steady-state behavior was observed. Five hundred configurations, {r} 500 1 , were obtained by sampling every 500 steps in a canonical ensemble simulation, which uses a slight (stochastic) modification of the equations of motion [33] to guarantee that the velocities ∂r i /∂τ are independently normally distributed about zero with variance β -1 , here chosen to be 0.7917. Although the units stated in this report make use of a reduced units convention, the problem remains unchanged if we set physical units of ǫ = 1.26 kJ/mol, σ = 3.7 Å, T = 120 K, m = 16 g/mol, ∆τ = 1.935 fs, and γ = 5.1677 ps -1 . For consistency with Eq. ( 45), the set of forces was generated from the set of configurations using the following procedure. First, the three pairwise potential functions 4c ij t -12 -t -6 were replaced with their spline representations on t ∈ (4/7, 17/7) and used to find the mean force corresponding to each configuration. Next, independent normally distributed random noise with magnitude σ F = 60.91 was added to each dimension of each generated force sample. This procedure generates the same sample distribution as would be expected from a Langevin dynamics simulation with a moderate damping coefficient of γ∆τ = 10 -2 . Since we have three functions to match (A:A, A:B, and B:B), we use Eq. ( 20) with N F = 3 to compute D l,k ∈ M N ×p k for each frame and calculate the posterior mean and covariance using Eq. (22). In this equation, the sets K are assumed to include all terms of Eq. ( 46) which share a common set of parameters θ K (i.e. interactions between atoms of type A:A, A:B or B:B). Similarly, we will assume each particle type has its own z i , so we use Eq. ( 23) with N I = 2 and each set I is simply the set of all coordinates belonging to atoms a common type. The pairwise functions were fit to 6 th order B-splines with 170 knots and h = 0.1/7 to give a range of (0, 17/7), forcing the function and all its derivatives to zero at 17/7. MCMC sampling was carried out as described for the one-dimensional test cases except for the use of the density function ρ(r) = r 2 , which is more appropriate for the spherically symmetric function domains considered here. Figure 6 shows the fitting results and a comparison between the posterior average and maximum likelihood estimators for a variety of sample sizes. Corresponding distributions of the observed pairwise distances for the M = 50 case are shown in Fig. 7 as a function of sample size (left scale). For this case the present approach is contrasted with the generalized least squares solution on the right scale. Since the actual system does not allow observations of all pairwise distances (particularly for small r), the average mean-squared error between the functions and their spline fits were calculated by averaging over the observed distances from all 500 samples. The range of observed distances is also the appropriate one for considering the approximation error estimates of section 3.1, which serve as upper bounds on any subset of Ω where ρ > 0. An unusually high error is observed for the function describing interac- tions between particles of type A and B. Inspection of the spline estimates reveals that this error is due to oversmoothing (the MLE estimate was essentially zero) caused by the relatively small number of samples for this function, and so does not occur when the prior is set to zero -even for the M = 50 case (GLS, Fig. 7). This failure of the MLE makes it a worse choice than GLS, and could not have been predicted from calculation of the posterior function error. The expected error conditional on the MLE estimate for α, z is (in units of Fig. 6) -5.0 for all functions at M = 50. At M = 100, the A-A and B-B error estimates jump to around -1.9 and remain constant as M increases, while the A-B error estimate remains at -5.0 until M = 350, where it jumps to -2.2 and slowly increases to -2.1 at M = 500. These numbers significantly underestimate the fitting error at small sample sizes due to their neglect of variations in α, z. On the other hand, the posterior average estimate behaves as expected, approximating the input function with accuracy increasing with sample size. The expected function error using this method is slightly over-estimated due to the uncertainty in α, z -smoothly decreasing from -1.5, -1.6 for A-A,A-B at M = 50 to -1.9, -1.8 at M = 500. Examining the M = 250 case, the likelihood ratio of the posterior average estimate θ to the MLE is 10 -585 , but the average estimate performs better, and must be used, because of the width of the posterior distribution. Considering θ as a 510-dimensional vector, if the probability distribution for α| θ is relatively flat over a range R away from θ, then there are R 510 "states," θ, α, z, for which θ = θ|α and which therefore have a very low, but similar posterior probability. Integrating over a region in state-space is essential in justifying the astronomical difference in pointwise probabilities. Finally, as the sample size increases, the posterior probability narrows, causing both estimators to converge. The dramatic failure of the MLE shown in Fig. 6 demonstrates the importance of averaging over the posterior parameter distribution for small sample sizes. 5. Discussion. This paper presents a re-derivation of the commonly employed Bayesian statistical model for penalized spline matching. Several questions are addressed relating to the applicability of the method for general problems. Its main limitations stem from the possibility of choosing an inadequate form for the function to be fitted, y(r). This can happen either by leaving out the dependence of y(r) on some important function of rcausing larger than necessary noise in the data set -or by choosing a resolution that is too low (too large bin width) in order to save computational time. This latter problem can result in an oversmoothed function and will again increase the fitted value of σ 2 . Physical concepts are important to justify use of the quadratic penalty parameter and give a useful interpretation to the fitting process. Using the energy function for states of a stretched string, we can apply the Boltzmann distribution to derive a simple prior probability on function space. This method can be used to justify further adaptations of Eq. ( 3) and could prove to be generally useful in other problems of inference. Also by analogy to the physical system, we can anticipate a unique solution to the optimization problem Eq. ( 26). This proof is a classical result of the smoothing spline literature, and has been invoked here to show that our B-spline representation also optimally approximates the unique minimum, converging as the resolution is increased. By explicitly considering the variation of the smoothing parameter with respect to function scale, we are able to clearly present and compare several choices for the smoothing parameter proposed in the literature. These classical choices are difficult to apply in the multidimensional context, and several have the additional possibility of scale dependence. Although the GCV criterion has the desired scale independence property, it also has the problem of predicting an exact match to the input data at low sample sizes [36]. This problem (and the related one of predicting an exact polynomial fit) was solved in this report by placing constraints on the ability of the smoothing parameter estimate to force such an exact fit. The result is an estimator which is asymptotically scale independent for input data satisfying these minimum variance and polynomial deviation properties. Special consideration has also been given to formulation of vector-valued function fitting problems. An important observation is that the dimensionality of these problems can make maximum likelihood estimation ineffective for small sample sizes. There have certainly been previous treatments of multidimensional fitting and integrator matching, both in considering vector generalized additive models [38] and in adapting generalized linear least squares to use Bayes' theorem in physical systems [14,19]. However to the author's knowledge, this is the first report of the extension of P-splines to fit the drift and diffusion terms of molecular Langevin dynamics simulations. The approach developed in this paper can be directly used to fit stochastic integration processes such as the Langevin systems considered here, giving a well-defined way to extrapolate forward in time. This would work particularly well since the inference result gives the average and variance of each step in the integration process. However, although normally distributed force error is usually assumed in applications of the Langevin equation (and Euler integration of stochastic equations in general), the question of correspondence between dynamics where this assumption does not hold requires separate consideration. Future research could also re-consider the creation of an "overall" scaleinvariant prior for smooth z(r) where random noise is an issue in this process. This would be particularly important for variable volatility in market data or inhomogeneous molecular systems. {I} = {(u 0 , ⌊u 0 ⌋ + 1), (⌊u 0 ⌋ + 1, ⌊u 0 ⌋ + 2), • • • , (⌊u 1 ⌋ , u 1 )} to give a sum of integrals from each interval identified by its unique ⌊u⌋. Assuming we have a general form for the integral of ρ times a power of x on each interval, we write (56) with [F] ij = I d i+j (d + ⌊u⌋ + ∆ + x 0 /h) k dd, i, j = 0, 1, . . . , r -n -1. The complete penalty matrix Q is then assembled according to (55) by shifting each of the matrices (56) into its proper location (multiplying θ I = {θ} ⌊u⌋ ⌊u⌋-r+1 ). For periodic splines, ρ must also be periodic and the edges of this matrix are wrapped to correspond to the correct parameters. For aperiodic splines, edges are simply discarded, since their corresponding parameters have been assumed to be zero. imsart-aos ver. 2009/08/13 file: stsplines.tex date: November 6, 2018

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment