Hybrid data regression modelling in measurement
Measurement involves the determination of quantitative estimates of physical quantities from experiment, along with estimates of their associated uncertainties. Herewith an experimental system model is the key to extracting information from the exper…
Authors: ** - Vladimir B. Bokov (NPP Automatica JSC, Russia) **
Hybrid data regress ion modelling in measu rement Vladimir B. Bokov ∗ NPP Automatica JSC, Russia Summ er y . Measurement involves the de termination of quantitative e stimates of physical quantities from experim ent, along with estimates of their associated uncertainties. Herewith an experimental system model is th e key to extracting informat ion from the experimental data. The measurement information obt ained depends directly on the quality of the model. With this concern novel regression modelling techniques have b een fashioned by data integration from com put er-simulation a nd physica l designed experiments. These techniques have allowed attaining the advanced le vel of model completeness, parsimony, and pre cision via a pproximation of the exact unknown model by mathe matical product of available theoretical and appropriate empirical functions. The purpose o f this approximation is to represent adequately the true m odel on the conside red region of fac tor spac e wit h all advantages of theoretical modelling. This allows a further f ocus on the mea surement scie nce of issue. Pneumatic gauge hybrid data ca ndidate models’ building, solving and validation reviled that such adequate models permit t o attain minimum discrepancy from em pirical evidence. Keywords : Physical model; Experimental design; Computer-simulati on experiment 1. Introduction Computer models or codes a re often used to perform computer-simulation experiments before physical experiments are undertaken. The codes may have high-dimensional i nputs, which can be scalars or funct ions, a nd output ma y also be multivariate. In this paper interest is focused on a relatively sma ll set of inputs or expla natory variables and on a single response. The design and analysis of com puter experiments has evolved as the power of computer has grown. Sacks e t al . (1989) provided a review of issues for design and tec hnique used in the analysis of response from c omputer code s. Maki ng a number of c omputer code runs a t various input values is what they c all a computer experiment. And the computer models cons idered here a re deterministic; replicate observations from running the c ode w ith the same input value s will be identic al. This lack of random error makes computer experiments different from physical experiments. Traditional statistical a pproaches consider c omputer and physical experiments separately with corresponding separate designs, analyses, and results. A com pelling argument can be made that be tter, m ore powerful statistical results can be obt ained if we simultaneously analyze the c ombined data of physical and com puter designed experiments. The analysis of such combined data permits the unknown coefficients in an assumed overall regression or response s urface model to be estimated more pr ec isely, thereby producing a better-fitting response surface model which is crucial in measurement. Measurement involves t he determination of qua ntitative e stimates of physical quantities from experiment, a long with e stimates of their associated uncertainties. In thi s e ndeavour, a mathematical model of me asurement system is required in order to extract inf ormation from the experimental data (Cox, Forbes, and Harris, 2002). This implies model building; developing a m athematical m odel of the experimental system in terms of equa tions i nvolving ∗ E-mail: bvb@uk2.net 2 parameters that describe all the relevant asp ects of the system, and model solving ; determining estimates of model parameters from the measured data by solv ing the e quations constructed as p art of the model. The response of ma ny mea surement systems depends on more than on e variable and it is important t o model the response of such sy stems as a function of all relevant expla natory variables or factors. Common appro ac hes to modelling multivariate data have been r eviewed by Boudjemaa et al . (2003) including approaches specific to data on a regular grid (e .g., tensor product polynomial and splines) and more general approaches (e.g., radial basis functions and support vector machines). Furthermore, it is useful to classify t he ty pes of data a rising i n metrology into t hree categories: discrete, continuous and hybrid (Cox, Forbes, a nd Harris, 2002). Discrete data represent t he measurement of a s ingle response variable at a finite number of set tings of factors. Cont inuous data de fine one or more attributes of the system over a continuum, while hybrid data have both discrete and c ontinuous components. Herewith, Reese et al . (2004) noted that it i s statistically eff icient and desirable to f it a s ingle c ommon response surface model that com bines the physical expe rimental data and the computer model out put data to express t he re lationship between the factors and response va riable. And in this paper we make us e of designed e xperiment approach for multi variate hybrid data modelling employing data from t wo s ources; compute r-simulation and physical designed experiments. In t he both types of e xperiments the same response variable is considered. Moreover, the c omputed and observed values of the response are c onsidered for the same s et of factors a nd at the sa me factor levels in the both expe riments. Such appr oac h is of particular importanc e for measurement situations where the system under study is imperfectly understood . With this advance it is possible to achieve the higher level of model precision a nd uncertainty reduction. Investigators of measurement systems are often concerned with eluc idation of some functional re lationship E ( y ) = η = F ( ξ 1 , ξ 2 ,…, ξ k )= F ( ξ ξ ξ ξ ) connect ing the e xpected value of response E ( y )= η with k quantitative explanatory varia bles ξ 1 , ξ 2 ,…, ξ k . Usually, the nat ure of this relationshi p is approximated by local polynomial and in f unction based statistical modelling it i s convenient not to have t o deal with t he actual numeric al measures of the variables ξ 1 , ξ 2 ,…, ξ k , but to work with coded variable s x 1 , x 2 ,…, x k . These varia bles are convenient linear transformations of the original ξ 1 , ξ 2 ,…, ξ k and, so expressions containing the x 1 , x 2 ,…, x k can always be rewritten in terms of the ξ 1 , ξ 2 ,…, ξ k . In ge neral, a polynomial φ ( x , θ θ θ θ ) in coded variables is a linear com bination of powe rs and products of the x ’s a nd θ θ θ θ is a vector of paramete rs. The polynomial expression of order s can be thought of as a Taylor’s serie s e xpansion of the true underlying function F ( ξ ξ ξ ξ ) truncated after t erms of s th order. The higher t he orde r of the approximating polynomial, the more closely Ta ylor se ries can approxima te the true function. The smaller the region o ver which the approximation needs to be done, the better approximation is possible by given order polynomial. And, in practice over li mited re gion of factor space, a polynomial of only first or second order might adequately represent the true function. On the other hand in measurement it is possible to use cont inuous modelling (Reader- Harris, e t al ., 2000; Esward and Wright, 2007) employing theoretical function to represent the response rather t han approximate it by polynomial. In this ca se a us eful theoretical representation takes account of the pri ncipal feature s of t he relationship under study. The theoretical function a rises from a pa rticular measurement process theory. This theory le ads to a set of differential equations which solution f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ) is the f unction in search. This f unction depends on the vector ξ ξ ξ ξ = ( ξ 1 , ξ 2 ,…, ξ k ) T of explanatory variables a nd on the vector ϕ ϕ ϕ ϕ of parameters or physical consta nts and in contrast to the polynomial serves as a n approximation for the true relationship F ( ξ ξ ξ ξ ) over the whole region of factor space. 3 Moreover, cont inuous theoretical model c ontributes to our scientific understandi ng of the measurement process under study, it provides a better ba sis for extrapolation, and it is parsimonious in the use of parameters a nd to give be tter estimates of the res ponse. So, if w e knew theoretical function but did not know the values of the parameters in ϕ ϕ ϕ ϕ , then it would be best to fit this f unction directly to data. However, ofte n these parameters a re known from previous experience but the continuous theoretical model is still inadequate. Essentially, all models are approximations but continuous theoretica l and discrete empirical models repre sent e xtremes (Box a nd Draper, 1987). The former would be appropriate in the ext reme case where a theory was accurately known about the relationship under study, and the latter would be appropriate in the other e xtreme case, where nothing could be assumed except the functional relationshi p be tween y a nd ξ ξ ξ ξ was l ocally smooth. However, generally a modelling sit uation exis ting f or the most real measurement systems i s somewhere in be tween a nd, as a n inv estigation proceeds and i nformation is g ained, the situation can c hange . So, it is natural to ask if pure statistical modelling can be me rged with theoretic computer model to produce improv ed predictions; the a nswer is yes (Bayarri et al ., 2007). Present paper discusses how to reduce modelling unce rtainty for real measurement system. When system model is formulated and fitted relying on theoret ical considerations only, inferences made from it will be biased and overoptimistic because they ignore missing knowledge and data processing ac tions which should prece de the inference. Thus, the main message of thi s paper i s that in m easurement sys tem analysis we can cope with m odel uncertainty by e mploying h ybrid data regression modelling. For that Section 2 introduces hybrid data mul tiple re gression modelli ng. Section 3 describes the distributional properties of hybrid data multiple regression. Sec tion 4 provide s an e xample of data a nalysis comparing multiple linear regression modelling to h ybrid da ta modelling and demonst rates how h ybrid data mul tiple regression techniques may be entertaine d for considered pneumatic gauge example. Finally, Section 5 provides conclusions. 2. Hybrid data multiple regression modelling Hybrid data regression model ling aims at obtaining overall response surface m odel, which for measurement system is related to t he production of more rele vant information of inc reased utility. Hybrid data model ca n represent a measurement process in grea ter detail a nd with les s uncertainty than what is obtainable s eparately from continuous theoretica l or discrete empirical models. Hybrid da ta modelling is a procedure whereby knowledge and da ta are fused into consistent, accurate, and intelligible whole. This is concerned with da ta modelling from two sources where the data come from computer-simulati on and physical de signed experiments. Ma ny model building and solving problem s in measurement involve hybrid data modelling. Until recently , these problems have bee n addre ssed using standard methods t hat do not ma ke the best use of a ll the data, pos sibly leading to bia ses and s tatistical inefficiency in re sults. Hybrid da ta modelling t echniques that impr ove upon e xisting methods need to be developed, validated, and imple mented in the context of and re lated t o integrated model building and solving procedure. Hybrid dat a modelling re fers to the application of techniques to c ombine data f rom computer- simulation and e mpirical sources with which the quality of information about a known ty pe of behaviour is to be improved. Employing this approach the modelling of measurement process can be enhanced if the exact unknown functional relationship between response and a set of e xplanatory variable s approximate simultaneously by theore tical function f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ), which e ncapsulates process theoretica l knowledge, and proper empirical function φ ( x , θ θ θ θ ), which locally approximate a discrepancy between theoretica l a nd empirical 4 data over the considered region of fac tor space . At this point, model functional structure is formulated by means of these two functions and the model may be written y = Φ { f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ), φ ( x , θ θ θ θ )} + δ ( ξ ξ ξ ξ ) + ε , (1) where systematic error δ ( ξ ξ ξ ξ ) = η – Φ { f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ), φ ( x , θ θ θ θ )} is a difference between the expected value of the response, E ( y ) = η , and t he value of model functional s tructure Φ { f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ), φ ( x , θ θ θ θ )}, and ε is a random error. The formulation of model functional structure may be accomplished in different ways (Bokov, 2007) and expressions discussed in this paper a re limited to those based on function multiplication. Therefore, here theoretic-empirical model is written y = f ( ξ ξ ξ ξ , ϕ ϕ ϕ ϕ ) x φ ( x , θ θ θ θ ) + e , (2) where e is an experimental error. This mode l involves p +1 param eters θ θ θ θ = ( θ 0 , θ 1 …, θ p ) T and two important ques tions that arise are: does the postulated model completely and adequat ely re present t he empirical data and if so, what are the best estimates of the parameters in θ θ θ θ ? With this concern we introduce hybrid data multiple regression and for fitting model to data w e e mploy the m ethod of least squares. 2.1. Hybrid data model building Suppose model (2) functional structure has been approximated by i e quations ( i =1, 2,…, n ) E ( y i ) = f ( ξ i 1 , ξ i 2 ,…, ξ ik ; ϕ ϕ ϕ ϕ ) x { θ 0 + θ 1 φ i 1 ( x )+ θ 2 φ i 2 ( x )+…+ θ p φ i p ( x )}, (3) where f ( ξ i 1 , ξ i 2 ,…, ξ ik ; ϕ ϕ ϕ ϕ ) is the representation of t heoretical function computation for the i th combination of actual factor levels ξ i 1 , ξ i 2 ,…, ξ ik and θ 0 + θ 1 φ i 1 ( x )+ θ 2 φ i 2 ( x )+…+ θ p φ i p ( x ) is the empirical function polynomial a pproximation for the i th linear c ombination of basis functions φ i 1 ( x ), φ i 2 ( x ),…, φ ip ( x ) which are powers and produc ts of the coded variables x =( x 1 , x 2 ,…, x k ) T . For this approxim ate model the experimental conditions ( ξ i 1 , ξ i 2 ,…, ξ ik ) have been run randomly, yie lding obs ervations y 1 , y 2 ,…, y n and for the sa me conditions the corresponding deterministic computer simulations ha ve been carried out, yielding computations z 1 = f ( ξ 11 , ξ 12 ,…, ξ 1 k , ϕ ϕ ϕ ϕ ), z 2 = f ( ξ 21 , ξ 22 ,…, ξ 2 k , ϕ ϕ ϕ ϕ ), …, (4) z n = f ( ξ n 1 , ξ n 2 ,…, ξ nk , ϕ ϕ ϕ ϕ ). By means of these simulations we e mploy the computer-simulation experiment (Sacks et al ., 1989) using t he t heoretical function and randomization theory of experimental design. Therefore, employing the results of computer-simulation and physical designed experiments we obt ain a model which relate s the obse rvations y i to the known z i , φ i 1 ( x ), φ i 2 ( x ),…, φ ip ( x ) and unknown θ 0 , θ 1 …, θ p by n equations y 1 = z 1 x [ θ 0 + θ 1 φ 11 ( x ) + θ 2 φ 12 ( x ) +…+ θ p φ 1 p ( x )] + e 1, y 2 = z 2 x [ θ 0 + θ 1 φ 21 ( x ) + θ 2 φ 22 ( x ) +…+ θ p φ 2 p ( x )] + e 2, … , (5) y n = z n x [ θ 0 + θ 1 φ n 1 ( x ) + θ 2 φ n 2 ( x )+…+ θ p φ np ( x )] + e n , where e i is the errors. This may be written in matrix notation as y =diag( z ) X θ θ θ θ + e (6) Here, in general, y is a n x 1 vector of e xperimental observations, z is a n x 1 vect or of theoretical function computations f or t he c orresponding com binations of actual fa ctor levels 5 from the experimental design, X = ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) x x x x x x x x x x x x np n n p p p φ φ φ φ φ φ φ φ φ φ φ φ L M O M M M L L L 2 1 3 32 31 2 22 21 1 12 11 1 1 1 1 is a n expe rimental design n x ( p+ 1) ful l column matrix of the basis func tions of coded factor levels with identity element, θ θ θ θ is a ( p+ 1) x 1 vector of polynomial coefficients, and e i s a n x 1 vector of errors for this model. This is a hybrid data multiple regression model and if the diagonal matrix D =diag( z ) equals to identity matr ix I then we have well-known multiple linear regression model y = X θ θ θ θ + e x , (7) where e x i s a n x 1 vector of e rrors for this model. Thus, the multiple linear regression model may be regarded as the special case of the hybrid da ta multiple regr ession model with such a meaning that no computer-simulation data are involved in model building. 2.2. Hybrid data model solving In ge neral, the goodness of the least-squares e stimates of the parameters i n θ θ θ θ depends on the nature of error dis tribution. T he least-squares estimates would be appropriate if experimental errors were statistically independent, with constant variance, and normally distributed. The Gauss-Markov theorem states that, for the m odel (7) with elements of e x pair-wise uncorrelated and having equa l variances σ x 2 , the least-squares esti mators of pa rameters in θ θ θ θ have the smallest variances of all linear unbiased estimators of these parameters. The least squares estimates obtained, ass uming the model (7) is true, are q= ( X T X ) -1 X T y (8) and will in general be biased, since E ( q ) = ( X T X ) -1 X T E ( y ) =( X T X ) -1 X T DX θ θ θ θ , (9) because the true model which should have been fitted is y = DX θ θ θ θ + e (10) Here the diagonal matrix D may be pre sented as a sum of identity matrix I and m atrix ∆ ∆ ∆ ∆ = D – I then equation (9) becomes E ( q ) = ( X T X ) -1 X T E ( y ) =( X T X ) -1 X T ( I + ∆ ∆ ∆ ∆ ) X θ θ θ θ = θ θ θ θ + A θ θ θ θ , (11) where A =( X T X ) -1 X T Y is t he alias ma trix a nd Y = ∆ ∆ ∆ ∆ X =( D – I ) X . The alias m atrix will be null only if Y = 0 , when D = I . Furthermore, if we suppose that p+ 1 columns of X split into two sets X 1 , X 2 of p 1 , p 2 columns and vector θ θ θ θ split into t wo vectors θ θ θ θ 1 and θ θ θ θ 2 , respectively, then the model (10) may be rewritten y = D X 1 θ θ θ θ 1 + D X 2 θ θ θ θ 2 + e (12) and, taking into account that D = I + ∆ ∆ ∆ ∆ , this model may be presented as y = X 1 θ θ θ θ 1 + X 2 θ θ θ θ 2 + Y 1 θ θ θ θ 1 + Y 2 θ θ θ θ 2 + e , (13) where Y 1 =( D – I ) X 1 and Y 2 =( D – I ) X 2 . Here, for the multiple li near regression model (7), as a part of the hybrid dat a model, the simpler model with f unction X 1 θ θ θ θ 1 mig ht be one which might be adequate and function X 2 θ θ θ θ 2 might represent further terms which perhaps would have to be adde d if the terms of X 1 θ θ θ θ 1 were ina dequate t o represent the res ponse. However , for the hybrid data model (13) the re is the other opportunity to overcome the i nadequacy of the simpler model wit h f unction X 1 θ θ θ θ 1 by adding the te rms of Y 1 θ θ θ θ 1 instead of X 2 θ θ θ θ 2 . Therefore, hybrid data model adequacy will s ignificantly depend on employed th eoretical function. 6 Moreover, if the te rms of X 1 θ θ θ θ 1 and Y 1 θ θ θ θ 1 were inadequate to re present the re sponse, then the terms of X 2 θ θ θ θ 2 and Y 2 θ θ θ θ 2 could also be added to represent the response adequately. Using the terms of t he multiple linear regression mode l explicit ly the hybrid data model may be formulated as y = X θ θ θ θ + Y θ θ θ θ + e (14) Such a model formulation let us an opport unity to ana lyse the hybrid data model in comparison with the multiple li near re gression model. And, because matrices Y and X have the sa me number of rows, t hen for model solving the matrix X may be augmented by Y and the model (14) is written y =[ X Y ] θ θ + e (15) Let us denote matrix [ X Y ] by Ψ Ψ Ψ Ψ and vect or θ θ by β β β β . T herefore, the model we deal with is y = Ψ Ψ Ψ Ψβ β β β + e , (16) where β β β β is a 2( p+ 1) x 1 vec tor of parameters, Ψ Ψ Ψ Ψ is a n x 2 ( p+ 1) ma trix of know n values and e is a vector of random error terms. These terms can be defined as e = y – E ( y ) so that E ( e ) = 0 and E ( y ) = Ψ Ψ Ψ Ψβ β β β . Every e lement in e has variance σ 2 and zero covariance with e very other element; i.e., var ( e )= E ( e e T )= I σ 2 . Thus, we have e ~ ( 0 , I σ 2 ) and y ~ ( Ψ Ψ Ψ Ψ β β β β , I σ 2 ). The more general cases where var( e )= V , whether matrix V be non-singular or singula r, are discussed by Searle (1971). The normal equations corresponding to the model (16) can be deri ved by least square s and they turn out to be Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ b = Ψ Ψ Ψ Ψ T y (17) They involve Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ which is square and symmetric. Als o, because matrix Ψ Ψ Ψ Ψ may be of less than full column rank, then Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ma y be of less than full rank as well. An d whenever Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ is not of full rank the norm al equations (17) cannot be so lved wit h one solution, m any solutions available. We use b to denote a solution to (17) b = ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y , (18) where ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – is a generalised inverse of Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ , and it is only solut ion and not an es timator of β β β β . Suppose that vector b is partitioned as 2 1 b b then the normal equations may be presented as [ ] Y X Y X T T 2 1 b b = y Y y X T T (19) So the corresponding solution is 2 1 b b = − y Y y X Y Y X Y Y X X X T T T T T T (20) The generalized inverse of symmetric and singular matrix has been derived by Rohde (1965). On defining G –1 = ( X T X ) –1 and Q – as generalized inverse of Q , where Q = C – B T G –1 B , C = Y T Y and B = X T Y , then the generalized inverse of partitioned matrix is − Y Y X Y Y X X X T T T T = − − + − − − − − − − − − Q G B Q BQ G G B BQ G G 1 1 1 1 1 T T 7 = ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − Q X X X Y Q YQ X X X X X X Y YQ X X X X X 1 1 1 1 1 T T T T T T T T T , (21) where Q – =[ Y T { Ι Ι Ι Ι – X ( X T X ) –1 X T } Y ] – . Therefore, equation (20) may be rewritten as 2 1 b b = ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − Q X X X Y Q YQ X X X X X X Y YQ X X X X X 1 1 1 1 1 T T T T T T T T T y Y y X T T (22) from which b 1 = {( X T X ) –1 X T + ( X T X ) –1 X T Y Q – Y T X ( X T X ) –1 X T –( X T X ) –1 X T Y Q – Y T } y = ( X T X ) –1 X T [ I – Y Q – Y T { I – X ( X T X ) –1 X T }] y (23) and b 2 = { Q – Y T – Q – Y T X ( X T X ) –1 X T } y = Q – Y T { I – X ( X T X ) –1 X T } y (24) Further, defining Z ={ I – X ( X T X ) –1 X T } Y (25) as the matrix of deviations of matrix Y from its calculated values with the use of experimental design ma trix a nd because matrix I – X ( X T X ) –1 X T is ide mpotent, hence jus t as in (18) we can write (24) as b 2 = ( Z T Z ) – Z T y (26) Then b 1 is given by b 1 = ( X T X ) –1 X T { I – Y ( Z T Z ) – Z T } y (27) 2.3. Consequences of solution for model For the hybrid dat a model the solution b is a function of observations y and for the generalized inverse ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – , E ( b ) =( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T E ( y ) =( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ β β β β = J β β β β , (28) i.e. b has expected value J β β β β where J =( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ . Hence b is an unbiased estimator of J β β β β . The uncertainty or variance-covariance matrix of b is var ( b ) = var (( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y ) = ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T var ( y ) Ψ Ψ Ψ Ψ {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – } T = ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – } T σ 2 (29) Thus, the matrix ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – } T determines the variances and covariance s of the elements in b . A similar result holds for b 2 : using the partitioning form of ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – shown in (21), with Q – = ( Z T Z ) – , result (29) becomes (see Appendi x A) var 2 1 b b = ( ) { } ( ) − − + − − − + − − − − − − − − − − − − − − − − YQ Z Q YQ X YG Z Q YQ Z YQ I X G XG Y Q YQ X YG Z YQ I X YXG YQ X I G T T 1 T T T 1 1 T T 1 T T 1 T 1 σ 2 (30) Hence var ( b 2 ) = Q – Z T Y Q – σ 2 (31) and var ( b 1 )= ( X T X ) -1 { I + X T Y Q – Y T X ( X T X ) -1 – X T ( I – Y Q – Z T ) Y ( X T X ) -1 X T Y Q – } σ 2 (32) plus cov ( b 1 , b 2 )= –{ Q – Y T X ( X T X ) -1 +( X T X ) -1 X T ( I – Y Q – Z T ) Y Q – } σ 2 . 2.4. Estimating E ( y ) for hybrid data model Corresponding to the vector of observations y we have the vector of estimated expected values ∧ ) ( y E = y ˆ = Ψ Ψ Ψ Ψ b= Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y. (33) 8 This vector is invariant to the cho ice of gene ralized inverse of Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ which is used for ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – , because matri x Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T is invariant (Sea rle, 1971). Hence, y ˆ is the vector of estimated expected value s corresponding to the vector y of observ ations. This means that no matter what solution of the normal equations is used for b the vector y ˆ will always be the same. It impli es that we can get a solutio n to the normal equations in any way, call it b , and no matter which solution it is, Ψ Ψ Ψ Ψ b will be correct v alue of y ˆ . The uncertain ty matrix of estim ator y ˆ = Ψ Ψ Ψ Ψ b is readily obtained using varianc e operator var ( y ˆ )= var ( Ψ Ψ Ψ Ψ b )= Ψ Ψ Ψ Ψ var ( b ) Ψ Ψ Ψ Ψ T = Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – } T Ψ Ψ Ψ Ψ T σ 2 = Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T σ 2 (34) On substit uting Ψ Ψ Ψ Ψ = [ X Y ] and usin g Q – , this conve rts (see Append ix B) to var ( y ˆ )={ X ( X T X ) -1 X T + Z ( Z T Z ) – Z T } σ 2 (35) 2.5. Residual sum of squares For hybrid data model the vector of fitted values correspondin g to the observ ed values can be obtained from equation (33). The residual sum of squar es in this cas e is SS E =( y – y ˆ ) T ( y – y ˆ ) = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y (36) because matrix I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T is symmetri c and idempo tent. Further, because Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T is invariant to ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – , so is SS E . Hence, SS E is inv ariant to whatever solution of the normal equations is used for b . This is another result invariant to the many solutions of the normal equations . In (36) SS E is written as a quad ratic form in y therefore, with y being di stributed ( Ψ Ψ Ψ Ψ β β β β , I σ 2 ) the expect ed value of SS E is E ( SS E ) = E [ y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y ] =tr[{ I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } I σ 2 ]+( Ψ Ψ Ψ Ψ β β β β ) T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } Ψ Ψ Ψ Ψ β β β β ( 37) Throu gh the properties of Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T and due to t he fact that the trace of an i demp otent matrix equals its rank E ( SS E )=rank{ I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } σ 2 =[ n –rank{ Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T }] σ 2 ={ n – rank( Ψ Ψ Ψ Ψ )} σ 2 . Hence, an unbias ed estimator of σ 2 is 2 ˆ σ = SS E / { n –rank( Ψ Ψ Ψ Ψ )} (38) The use of rank( Ψ Ψ Ψ Ψ ) in the e xpe ctation is becaus e matrix Ψ Ψ Ψ Ψ may be not of full column rank. The rank of Ψ Ψ Ψ Ψ depends on the nature of data available and design matrix X employed. When the choice of design matrix X follows to condit ion n ≤ 2( p +1 ) the rank of Ψ Ψ Ψ Ψ is alway s equal to n . In this case from equation (38) SS E =0 and the estimation of σ 2 becomes i mpossibl e without experimental arrangement in which two or m ore runs are made at an i dentical set of levels of the explanator y variables. These runs should be made in such a way t hat they are subject to all the sources of errors that beset r uns made at different sets of lev els of the explanatory variables. Thus, employing the exper imental designs which give rank( Ψ Ψ Ψ Ψ )= n the hybr id data model has minimum residual sum of squares which equal to its pure error sum of squares. An expr ession for SS E involv ing X and Z can also be establ ished. For (3 6) is equiv alent to SS E = y T { I – X ( X T X ) -1 X T – Z ( Z T Z ) – Z T } y (39) because f rom (34) and (35) Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T = X ( X T X ) -1 X T + Z ( Z T Z ) – Z T . In equation (39) SS E is written a s a quadratic form in y . Therefor e, with y being distributed ( Ψ Ψ Ψ Ψβ β β β , I σ 2 ) t hen the expected value of SS E is E ( SS E )=tr{ I – X ( X T X ) -1 X T – Z ( Z T Z ) – Z T } I σ 2 +( Ψ Ψ Ψ Ψ β β β β ) T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } Ψ Ψ Ψ Ψ β β β β ( 40) And due to the fact th at the trace of an id empotent matrix equals its rank we have E ( SS E ) =rank{ I – X ( X T X ) -1 X T – Z ( Z T Z ) – Z T } σ 2 =[ n –rank{ X ( X T X ) -1 X T }–rank{ Z ( Z T Z ) – Z T }] σ 2 = { n 9 –rank( X )–rank( Z )} σ 2 . Hence, because rank( X ) = p+ 1 then an unbia sed estimate of σ 2 is 2 ˆ σ = SS E / { n – p –1–rank( Z )} which is the same as (38) because rank( Ψ Ψ Ψ Ψ ) = rank( X )+rank( Z ). 2.6. Partitioning the total sum of squares The t otal sum of squares is SS T = y T y a nd t he residual sum of squares for the hybrid da ta model has been defined by equation (36). The difference SS R = SS T – SS E = y T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y = b T Ψ Ψ Ψ Ψ T y (41) represents the portion of SS T attributable t o having fit ted the regression a nd called the sum of squares due to regression. And w ith regard to equati on (39) the regression sum of squares for hybrid data model may be rewritten as SS R = y T X ( X T X ) -1 X T y + b 2 T Z T y , (42) because b 2 T = y T Z ( Z T Z ) – . Then we see that SS R = SS Rx + SS R c (43) where SS Rx = y T X ( X T X ) -1 X T y and, so we can call SS R c = SS R – SS Rx = b 2 T Z T y (44) the hybrid data regression sum of squares corrected for the sum of squa re s due to m ultiple linear regression. Similar to SS R c the corrected sum of squares of the y ’s is SS Tc = SS T – SS Rx = y T y – y T X ( X T X ) -1 X T y = y T { I – X ( X T X ) -1 X T } y= υ υ υ υ T υ υ υ υ , ( 45 ) where υ υ υ υ ={ I – X ( X T X ) -1 X T } y. The three forms of pa rtitioning the sums of squa res are shown in Table 1. The first c olumn shows t he s ums of squares attributable to fitting the model (16). In the second column SS Rx i s the sum of squares due to mul tiple linear regression and SS Rc is the sum of squa res f or fitting the model, c orrected for the sum of squares due to multiple linear regression. The third column is identical to the second except that SS Rx has been de leted from t he body of the table and subtracted from SS T to give SS Tc as the total s um of squares corrected for the sum of squares due to multi ple linear regression. Table 1 form the basis of analysis of variance table for hybrid data regression. Table 1. Partitioning the sums of squares for hybrid data regression. SS Rx = y T X ( X T X ) -1 X T y SS R = y T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y SS Rc = y T { Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T – X ( X T X ) -1 X T } y SS Rc = y T { Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T – X ( X T X ) -1 X T } y SS E = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y SS E = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y SS E = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y SS T = y T y SS T = y T y SS Tc = υ υ υ υ T υ υ υ υ 3. Distributional properties of multiple hybrid data regres sion If errors e are normally distributed e ~ N ( 0 , I σ 2 ) the n the dis tributional properties of y a nd functions of y follow directly. From (16) we have e = y – Ψ Ψ Ψ Ψ β β β β and, therefore y ~ N ( Ψ Ψ Ψ Ψβ β β β , I σ 2 ). The b solution is a linear function of y , hence it is normally distributed b ~ N [ J β β β β , ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – } T σ 2 ]. The b is i ndependent from 2 ˆ σ . To show this we have b a s defi ned by equation (18) and SS E – by equa tion (36). So, when y ~ N ( Ψ Ψ Ψ Ψβ β β β , I σ 2 ), then SS E and b a re distri buted i ndependently because ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T I σ 2 { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } = {( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T –( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } σ 2 = 0 (Searle, 1971). The sta tistic SS E / σ 2 has a χ 2 –distribution. From (36), SS E is a quadratic in y . Therefore, if I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T is idem potent, SS E / σ 2 = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T }/ σ 2 y a nd var ( y )= I σ 2 , hence (1/ σ 2 ) { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } I σ 2 is idempotent as well. So, when y ~ N ( Ψ Ψ Ψ Ψβ β β β , I σ 2 ), then y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y is non-central χ 2 [rank { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T }, ( Ψ Ψ Ψ Ψβ β β β ) T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } Ψ Ψ Ψ Ψβ β β β /2], and c onsequently SS E / σ 2 ~ χ 2 [rank { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T }, β β β β T Ψ Ψ Ψ Ψ T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } Ψ Ψ Ψ Ψβ β β β /2 σ 2 ], which, because of properties of a generalized inverse and with rank( Ψ Ψ Ψ Ψ )= m , reduces to 10 SS E / σ 2 ~ χ 2 n–m (46) 3.1. Non-central χ 2 ’s Having shown that SS E / σ 2 has a ce ntral χ 2 –distribution, it can be demonstrated that SS R ha s non-central χ 2 –distribution. Furthermore, this t erm is indepe ndent of SS E . Thus w e are led to F -statistic that has non-central F -distribution and t his i n turn is ce ntral F -distribution under certain null hypotheses. From (41) SS R = y T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y where matrix Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T is idempotent a nd its products with matri x I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T a re null. Therefore, when y ~ N ( Ψ Ψ Ψ Ψ β β β β , I σ 2 ), the n the quadratic f orms y T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y and y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y are distrib uted independently because of properties of Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T and Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T I σ 2 { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } = 0 (Searle, 1971). So, SS R / σ 2 is distributed independently of SS E , w ith SS R / σ 2 ~ χ 2 [rank { Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T }, ( Ψ Ψ Ψ Ψ β β β β ) T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ β β β β /2 σ 2 ] ~ χ 2 ( m , β β β β T Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ β β β β /2 σ 2 ) (47) Similarly SS Rx / σ 2 = y T X ( X T X ) -1 X T y / σ 2 , where X ( X T X ) -1 X T is idempotent and the products of X ( X T X ) -1 X T and I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T are null since I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T = I – X ( X T X ) -1 X T – Z ( Z T Z ) – Z T . Hence SS Rx is distributed independently of SS E and SS Rx / σ 2 ~ χ 2 [rank { X ( X T X ) -1 X T }, ( Ψ Ψ Ψ Ψ β β β β ) T X ( X T X ) -1 X T Ψ Ψ Ψ Ψ β β β β /2 σ 2 ] ~ χ 2 { p +1, β β β β T Ψ Ψ Ψ Ψ T X ( X T X ) -1 X T Ψ Ψ Ψ Ψ β β β β /2 σ 2 } (48) because rank{ X ( X T X ) -1 X T }= rank( X )= p +1. Analogous argument a pplies t o SS R c = b 2 T Z T y = y T Z ( Z T Z ) – Z T y , where Z ( Z T Z ) – Z T is idempotent and has nul l products with both X ( X T X ) -1 X T and I – X ( X T X ) -1 X T – Z ( Z T Z ) – Z T . Hence, SS R c is independent of SS Rx and SS E , and SS Rc / σ 2 ~ χ 2 [rank { Z ( Z T Z ) – Z T }, ( Ψ Ψ Ψ Ψ β β β β ) T Z ( Z T Z ) – Z T Ψ Ψ Ψ Ψ β β β β /2 σ 2 ] ~ χ 2 { m – p –1, β β β β T Ψ Ψ Ψ Ψ T Z ( Z T Z ) – Z T Ψ Ψ Ψ Ψ β β β β /2 σ 2 } (49) because rank{ Z ( Z T Z ) – Z T }= rank( Z ) = m – p –1. 3.2. F-distributions Applying the definit ion of t he non-central F -distribution to the foregoing results leads to the following F -statistics F R = ( ) m n SS m SS E R − / / ~ F ( m , n–m , β β β β T Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ β β β β /2 σ 2 ) (50) F Rx = ( ) ( ) m n SS p SS E Rx − + / 1 / ~ F { p +1 , n–m , β β β β T Ψ Ψ Ψ Ψ T X ( X T X ) -1 X T Ψ Ψ Ψ Ψ β β β β /2 σ 2 } (51) F R c = ( ) ( ) m n SS p m S S E Rc − − − / 1 / ~ F { m–p –1, n–m , β β β β T Ψ Ψ Ψ Ψ T Z ( Z T Z ) – Z T Ψ Ψ Ψ Ψ β β β β /2 σ 2 } (52) Under certain null hypotheses the non-centrality parameters i n (50)-(52) are zero, a nd these non-central F ’s t hen b ecome central F’ s , so providing the s tatistics for testing those hypotheses. 3.3. Analysis of varian ce Calculations of the above F -statistics can be summarized in analysis of variance tables. For this the calculations f or F R are resumed in Table 2. This table summ arizes not onl y the sums of squares but also the degrees of freedom of the as sociated χ 2 –distributions . In the mean squares, which are sums of s quares divided by degr ees of freedom, it also shows calculati on of the numerator and denominator of F . And then the calculation o f F itself is shown. 11 Table 2. Analysis of variance for fitting hybrid data regression. Source of Variation Sum of Squares Degrees of Freedom Mean Squa re F -statistic Regression SS R = y T Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y m MS R = SS R / m F R = MS R / MS E Residual SS E = y T { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y n–m MS E = SS E /( n–m ) Total SS T = y T y n In a manner similar to Table 2 t he F -ratios of (51) a nd (52) are summarized in Table 3. And the abbre viated form of t his, based on T able 1 a nd showing only the c alculation of (52), is as shown in Table 4. Table 3. Analysis of variance showing the term of multiple linear regression. Source of Variation Sum of Squares Degrees of Freedom Mean Squa re F -statistic Linear regression SS Rx = y T X ( X T X ) -1 X T y p +1 MS Rx = SS Rx / p F Rx = MS Rx / MS E Corrected Regression SS R c = b 2 T Z T y m–p –1 MS R c = SS Rc /( m–p –1) F Rc = MS Rc / MS E Residual SS E = y T y – y T X ( X T X ) -1 X T y – b 2 T Z T y n–m MS E = SS E /( n–m ) Total SS T = y T y n Table 4. Analysis of variance corrected for the term of multiple linear regression. Source of Variation Sum of Squares Degrees of Freedo m Mean Squ are F -s tatistic Corrected Regression SS R c = b 2 T Z T y m–p –1 MS R c = SS Rc /( m–p –1) F Rc = MS Rc / MS E Residual SS E = y T y – y T X ( X T X ) -1 X T y – b 2 T Z T y n–m MS E = SS E /( n–m ) Corrected Total SS Tc = y T y – y T X ( X T X ) -1 X T y = υ υ υ υ T υ υ υ υ n–p –1 Tables 3 and 4 show the same things for F Rx a nd F Rc of (51) and (52). But Table 4 presents the abbre viated form o f the c omplete analysis of variance ta ble shown in Table 3 . This abbreviated form is derived by rem oving SS Rx f rom the body of t he table and subtracting it from SS T to give SS Tc , as in Table 1 . Thus Table 4 does not contain F Rx , but it is i dentical to Table 3 insofar as F Rc = MS Rc / MS E is c oncerned. Although Table 4 i s t he form in which this analysis of variance may be s een, the most informative is Table 3 which de monstrates how SS R of Table 2 is partitioned into SS Rx and SS R c , so summarizing both F Rx and F Rc . 3.4. Pure errors In multiple hybrid data regression modelling a model is fitted to data from de signed experiment. Here if two or more observ ations on re sponse va riable at the same settings of predictor variables ha ve bee n completed then a f ormal test for the lac k of fit may be conducted. For this the mat rix Ψ Ψ Ψ Ψ should be of full column rank and follow the condi tion n >2( p +1). If thi s matrix has n =2( p +1) the n the de sign matrix X from which matrix Ψ Ψ Ψ Ψ is obtained may be augmented by additional set of setting s of e xplanatory variables. For example, the first-order design matri x ma y be augmented by c entre runs which do not impact the usual effect estimates in a 2 k design (Myers and Montgomery, 1995). The lack of fit test requires t hat we have true replicates on the response y for a t least one set of lev els on the factors. These replicate points are used to obtain a model -independent estimate of σ 2 . Hence, f or l true replicates in which ξ ξ ξ ξ 0 and x 0 are fixed, the hybrid dat a m odel may be writte n y u = f ( ξ ξ ξ ξ 0 , ϕ ϕ ϕ ϕ ) x φ ( x 0 , θ θ θ θ )+ e u , where u =1, 2,… , l , implying y ~ = f ( ξ ξ ξ ξ 0 , ϕ ϕ ϕ ϕ ) x φ ( x 0 , θ θ θ θ )+ e ~ , 12 where y ~ = ∑ = l u u y 1 / l and e ~ = ∑ = l u u e 1 / l . So, whatever theoretical and empirical functions may be, we have ∑ = − l u u y y 1 2 ) ~ ( = ∑ = − l u u e e 1 2 ) ~ ( . And if the e u are indepe ndent random variable s with variance σ 2 , E ∑ = − l u u y y 1 2 ) ~ ( = E ∑ = − l u u e e 1 2 ) ~ ( = ( l –1) σ 2 (Box and Draper, 1987). If we have n such sets of replicated runs with l i runs in the i th set made a t ξ ξ ξ ξ i , the individual internal sums of squares may be pooled together to f orm a pure error sum of squares having as its de grees of freedom the sum of the se parate degrees of freedom. Thus the pure error sum of squares is SS PE = ( ) ∑ ∑ = = − n i l u i iu i y y 1 1 2 ~ (53) with degrees of freedom ( ) ∑ = − n i i l 1 1 = N – n , N = ∑ = n i i l 1 , and an estimator of σ 2 is SS PE /( N – n ). In addition, when the hybrid data regression model is fitting to da ta, the pure error sum of squares i s also par t of the re sidual sum of squares. The residual from the u th observation a t ξ ξ ξ ξ i is y iu – i y ˆ = ( y iu – i y ~ ) – ( i y ˆ – i y ~ ); on squaring both sides and summing, ( ) ∑ ∑ = = − n i l u i iu i y y 1 1 2 ˆ = ( ) ∑ ∑ = = − n i l u i iu i y y 1 1 2 ~ + ( ) ∑ = − n i i i i y y t 1 2 ~ ˆ , (54) that is, residual sum of squares = pure error sum of squares + lack of fit sum of squares. The corresponding equa tion f or degrees of freedom is ∑ = n i i l 1 – m = ∑ = − n i i l 1 ) 1 ( + n – m . And, we must have n > m . If n = m , there will be no sum of squares nor de grees of freedom for lack of fit, since the model will always “fit pe rfectly” (Box and Draper, 1987). However this is feasible for the hybrid da ta mode l when the choice of design matrix follows the condi tion n ≤ 2( p +1). For suc h a model the rank of N x 2( p +1) m atrix Ψ Ψ Ψ Ψ is the smallest m for which there exis t N x m matri x Ω Ω Ω Ω and m x 2( p +1) matrix Ο Ο Ο Ο such th at Ψ Ψ Ψ Ψ = Ω Ω Ω Ω Ο Ο Ο Ο . In such full-rank decomposition m is a m aximum number of linearly independe nt rows of Ψ Ψ Ψ Ψ which equals to n . Thus f or this model there i s no lack of fit and the res idual sum of squares equals to pure error sum of squares. A m easure of t he goodness of fit of the regression is the m ultiple c orrelation c oefficient, estimated a s the product mome nt c orrelation between the observed y i ’s and the pre dicted i y ˆ ’s (Searle, 1971). Denoted by R , for the hybrid data regression model it can be calculated as R 2 = ( b T Ψ Ψ Ψ Ψ T y – n y 2 )/( y T y – n y 2 ), (55) where y = n –1 ∑ = n i i y 1 . This statistic repres ents the frac tion of the variation about the mean that is explained by the f itted m odels. It is often used as an ov erall measure of the fit a ttained (Box and Draper, 1987). No m odel can explain pure error, so that the maximum possible value of R 2 is R 2 max = ( y T y – n y 2 – SS PE )/( y T y – n y 2 ). (56) And the hybrid data regression model may have t his value f or its multi ple c orrelation coefficient because from (41) b T Ψ Ψ Ψ Ψ T y = y T y – SS E a nd if the c hoice of design matrix follows the condition n ≤ 2( p +1) then SS E = SS PE , he nce from (55) and (56) R 2 = R 2 max . 13 3.5. Test of hypotheses For the hybri d data model f ollowing equations (50-52) we indicated that those results provide statistics suitable for t esting c ertain hypotheses. As shown i n equation (50) the s tatistic F R is distributed as a non-central F wit h non-centrality parameter β β β β T Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ β β β β /2 σ 2 . This para meter i s zero under the null hypothesis H 0 : Ψ Ψ Ψ Ψ β β β β = 0 , when F R then has a central F -distribut ion, F m, n–m , and can be compared to tabulated values thereof to test that hypothesis. When F R ≥ tabula ted F m, n–m at the α -level, we reject the null hypothesis H 0 : Ψ Ψ Ψ Ψ β β β β = 0 , otherwise we do not rejec t it. Assuming the mode l E ( y ) = Ψ Ψ Ψ Ψ β β β β we mig ht then sa y tha t when F R i s significant the m odel accounts for a significant portion of the variation in the y -variable. B ut this doe s not mean that this model, for the partic ular set of factors, which is the set of e xplanatory variable s of theoretical function, is nec essarily the most suitable model: its empirical function may have a subset of those fac tors which are as s ignificant as the whole, or ma y include f urther factors which, when used in combination with some or all of the factors already used, are significantly be tter than those a lready used; or there may be the other non-linear functions of these factors that are a t least as s uitable as the f actors as used. None of these contingenc ies is inconsistent with F R being significant a nd the ensuing conc lusion that the data a re in concordance with the model E ( y ) = Ψ Ψ Ψ Ψ β β β β . Just as F R provides a tes t of the mode l E ( y ) = Ψ Ψ Ψ Ψ β β β β , so doe s F Rc provide a te st over and abo ve the multiple li near regression model. In general, therefore, F Rc must be l ooked on as providing a test of the model E ( y ) = Ψ Ψ Ψ Ψ β β β β over and a bove the model E ( y ) = X θ θ θ θ . Since the latter can be considered as fit ting the multiple linear regression mode l to data we look upon F Rc a s providing a t est of the model E ( y ) = Ψ Ψ Ψ Ψ β β β β over and above the m ultiple linea r regression model. When F Rc is significant we conclude that the model satisfactorily ac counts for varia tion in the y -variable. This i s not t o be tak en as evidence tha t all ele ments of β β β β are non-zero, but only at least one of them, or one linear combination of them, may be. If F Rx has first been found significant, t hen F Rc being significant indicates that a model with terms in it additional to the multiple linea r regression model explains significantly more of the variation in the y -variable than does the model E ( y )= X θ θ θ θ . The case of both F Rx and F Rc being significant has just been discusse d; a further poss ibility is that F Rx is si gnificant but F Rc is not. This is e vidence of the linear regre ssion model being relevant but that fitting the rest of the model does not explain variation in the y -variable. 3.6. Examining residuals The vector of residuals r = y – Ψ Ψ Ψ Ψ b (57) can be plotted in a variety of wa ys and othe rwise investigated to see if assumptions inherent in the assumed model are not being preserved. W ith matrix I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T , which is symmetric and idempotent, r = y – Ψ Ψ Ψ Ψ b = y – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T y = { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y . So, concerning distributional properties of residuals, their e xpected values are E ( r ) = E [{ I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y ] ={ I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } Ψ Ψ Ψ Ψ β β β β = 0 , because Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ = Ψ Ψ Ψ Ψ , and re sidual uncertainty matrix i s var ( r ) = var [{ I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } y ] = { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } σ 2 . These properties hold true for the residua ls of a ll c onsidered here hybrid data models. Consideration of the extent to which they satisfy ot her conditions is the means whereby assumptions of the model can be investigated. For example, in as suming normality of the error term s in the model we have E ( r ) ~ N [ 0 , { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } σ 2 ]. Plotting the values of r to see if they appear normal ly distributed therefore provides a me ans of seeing i f the assumption e ~ N ( 0 , I σ 2 ) might be wrong. In doing this we ignore the fac t that because var ( r ) = { I – Ψ Ψ Ψ Ψ ( Ψ Ψ Ψ Ψ T Ψ Ψ Ψ Ψ ) – Ψ Ψ Ψ Ψ T } σ 2 the r’s are correlated since, as Searle (1971) and Anscombe and Tukey 14 (1963) indic ate, f or at least a two-way ta ble with more t han three rows and columns, the effect of correlation among residuals upon graphical procedures is usually negligible. 4. Pneumatic gauge modelling Hybrid da ta modelling implies concurrent computer simulation a nd physical e xperimentation and with this concern let us considers pneumatic gauge example. Pneumatic gauging is an e stablished technique for determining the location of a quasi-rigid surface (see, for e xample, Evans and M organ, 1964; Jackson, 1978; Bri dge e t al. , 2001). The principle of the device i s s imple. A jet of gauging flui d, t ypically air, flows from a reservoir at known press ure a long a t ube fitted with a n ori fice and out of a c onvergent nozzle maintained at a known location regarding to a work-piece. The pressure drop betwe en the inlet to the nozzle and the surroundings (i.e., atmosphe re) is sensitive to the distance or clearance S be tween t he output t ip of t he nozzle a nd the work-piec e. The ba ck-pressure P measured by the pre ssure indicator placed between the orifice a nd nozzle-work-piece sensor, varies in inverse proportion to the clearance S . Pneumatic gauging involves the determination of dimensional deviations from measurement experiment a long wit h estimates of thei r associated uncertainties. And i n order to extract information from the experimental da ta, g auge mathe matical model is required. The model will need to specify how the gauge is expected to respond to input dat a and this is the key to extracting i nformation f rom the data. T he quality of information obtained depends directly on the ga uge mode l explanation of its observed behaviour and more dependable uncertainties associated with the quantities of interest. 4.1. Gauge theoretical modelling The pneumatic gauge can be modelled by algorithmic model which pave the way for application of iterative nume rical a pproximation methods. Continuous modelling may be used to create a representation of steady-flow physical phenomenon at t he gauge and therefore gain a better understanding of that phenomenon and enhance the knowledge base of measurement. G auge t heoretical model i s a stea dy-state relationship P λ = f ( S ) between calculated back-pressure P λ and clearance S. This model ma y be ela borated by c onsidering adiabatic air flows through the both orifice and nozzle-work-piece sensor. The steady-fl ow model of the orifice may be derived from the e nergy equa tion (Zalmanzon, 1965) and in Mathcad computer c ode notati on ( Mathcad User's Guide , 1999) is written ( ) ( ) ( ) otherwise P P if P P P P RT g BP C G S S S S O O 1 / 2 1 / / 1 / 2 1 2 1 1 2 1 2 − − + + + + ≥ − − = γ γ γ λ γ γ λ γ λ γ γ γ γ γ γ , ( 58) where G O is the weight f low rate through the orifi ce, B is the c ross-sectional area of the orifice, which i s a s imple transformation of its i nternal diameter d O , namely, B = π d O 2 /4 , g i s the acceleration of grav i ty, γ is t he ratio of spe cific heats for air, P S is the pressure in a chamber of compressed air supply, R is the gas constant, T is the abs olute temperature, and C O is the discharge c oefficient whic h takes into account the losses of mechanical energy at the flow entrance area of the orifice and which can be found experimentally only (Zalmanzon, 1965). The s teady-flow m odel of the nozzle-work-piece sensor is a relationship between response variable - we ight flow rate G through the sensor and predictor variables - clearance S and 15 back-pressure P λ . Considering the sensor a s a short restriction with a diabatic and turbulent flow it m ay be shown that sensor steady-flow model is derived f rom t he energy e quation a nd in Mathcad computer code notation is written ( ) ( ) ( ) otherwise P P if P P P P RT g CAP G a a a 1 / 2 1 / / 1 / 2 1 2 1 1 2 1 2 − − + + + + ≥ − − = γ γ γ λ γ γ λ γ λ λ γ γ γ γ γ γ , (59) where A is the cross-sectional area of the sensor, which is a simple transformation of the clearance S , namely, A = π d S , d is the inner diameter of nozzle, P a is the press ure of a ir in the outlet sec tion of the sensor, and C is the discharg e coefficient which ta kes into acc ount the losses of mechanical energy at the flow entrance area of the sensor. Gauge theoretical model may be derived by using equa tions desc ribing adiabatic air f lows through the both orifice and nozzle-work-piece sensor. In gaug ing sit uation the orifice and nozzle-work-piece sensor a re in series. So, for one-dimensional steady flow the weight flow rates through the se two restrictors must be equal in order for the total weight of air in the tube between the orifice and nozzle t o be consta nt. Therefore, as suming t hat the te mperature T is constant, the weight fl ow rat e equa lity throu gh the orifice a nd nozzle-work-piece s ensor in Mathcad notation may be written ( ) ( ) ( ) otherwise P P if P P P P BP C S S S S O 1 / 2 1 / / 1 / 2 1 2 1 1 2 1 − − + + + + ≥ − − γ γ γ λ γ γ λ γ λ γ γ γ γ γ γ ( ) ( ) ( ) otherwise P P if P P P P CAP a a a 1 / 2 1 / / 1 / 2 1 2 1 1 2 1 − − + + + + ≥ − − = γ γ γ λ γ γ λ γ λ λ γ γ γ γ γ γ (60) This theore tical model is given as an implicit algorithmi c model in which the variables P λ and S are link ed through equation (60). H owever, by s everal a uthors ( Zalmanzon 1965 , Nakayama 1964, Crnojevic, et al ., 1997) it was revealed t hat the discharge c oefficients C O and C are not c onstant even for a small variation of such variable quantities as S , P S , and P λ . So, in this circumstance it is impossible to find least squared estimates of the di scharge coefficients a nd to fit gauge theoretical model to data. Theref ore, it should be conside red that available gauge physical knowledge is substantially incomplete and, consequently it would be preferable t o apply the multiple linear regression model ling. With t his approach a multiple linear regression model is used instead of the theoretical model. Such a model does not express a ny physical knowledge unde rlying back-pressure variati on and ref lects only the local influence of explanatory variables on response variable. 4.2. Gauge linear regression modelling For hi gh prec ision measurement it is essential the employment of comprehensive, ade quate, and precise gauge model. This model is a relationship between observed response variable – 16 measured back-pressure P and expla natory variables; cross-sectional area A of the sensor, absolute pressure P S , and cross-sectional area B of the orifice. Initially, let us assume that t he f irst-order pol ynomial model mi ght describe the investigated relationship which is P 1 = θ 0 + θ 1 x 1 + θ 2 x 2 + θ 3 x 3 + ε 1 , (61) where x 1 represents the coded variable of sensor c ross-sectional area A , x 2 represents the coded variable of absolute pressure P S , x 3 represents the coded variable of orifice cross- sectional area B, and ε 1 is t he error for this model. Also, t his polynomial mode l is a linear function of the unknown parameters θ 0 , θ 1 , θ 2 , θ 3 and the method of least squares is t ypically used to e stimate th ese regression coefficients (Box a nd D raper, 1987; Myers and Montgomery, 1995). The back-pressure P 1 is related to three variables that can be controlled in experiment. Table 5 demonstrates the data resulting f rom an experimental investigation into effect of three variables ; cross-sectional area of the s ensor, c ross-sectional area of the orifice and absolute pre ssure supply, on gauge ba ck-pressure for the fir st-order model. Columns 2, 3, and 4 show the levels used for the e xplanatory va riables A, P S , and B in nat ural unit s of measurements and columns 5, 6 a nd 7 displa y the le vels in terms of the coded variables x 1 , x 2 and x 3 . The least squared estimator of effects θ θ θ θ 1 in the model is q 1 =( X 1 T X 1 ) -1 X 1 T P 1 , where X 1 i s the experimental design n x ( p +1) matrix of coded factor levels with identity e lement and P 1 is the n x 1 vector of back-pressure observations. The estimate of effects θ θ θ θ 1 is q 1 = (208.423 –34.409 36.616 18.277) T . So, the leas t squares fit of first-order regression model is P 1 fit =208.423–34.409 x 1 +36.616 x 2 +18.277 x 3 (62) The F -te st f or significance of regression is a test to dete rmine if there i s a linear relationship between the ba ck-pressure and a set of coded variables x 1 , x 2 , and x 3 . The test is summarized in T able 6. And , because α =0.05 h as been select ed, then we h ave F 0 =17.85> F 0.05, 3, 7 =4.35. So, we would reject the hypothesis H 0 : θ 1 = θ 2 = θ 3 =0, concluding that at least some of these parameters are nonzero. Table 5. Two-level factorial design and data for pneumatic gauge modelling Run A , mm 2 P S , MPa B , mm 2 x 1 x 2 x 3 P 1 , kPa P 1 fit , kPa P λ , kPa P λ fit , kPa P ν , kPa P ν fit , kPa 1 0.251 0.199 0.503 -1 -1 -1 189.487 187.939 187.986 188.345 187.410 188.704 2 1.257 0.199 0.503 1 -1 -1 126.332 119.121 115.955 126.913 116.513 126.729 3 0.251 0.297 0.503 -1 1 -1 283.336 261.170 280.554 283.472 279.595 283.427 4 1.257 0.297 0.503 1 1 -1 155.164 192.352 134.781 154.861 136.175 154.948 5 0.251 0.199 1.131 -1 -1 1 197.724 224.493 196.727 198.295 196.582 198.099 6 1.257 0.199 1.131 1 -1 1 167.422 155.675 155.951 166.684 155.431 166.922 7 0.251 0.297 1.131 -1 1 1 294 .516 297.724 293.607 294.225 293.388 294.322 8 1.257 0.297 1.131 1 1 1 240 .874 228.906 229.213 240.605 226.924 240.681 9 0.754 0.248 0.817 0 0 0 213 .317 208.423 206.223 213.083 204.463 212.938 10 0.754 0.248 0.817 0 0 0 212 .532 208.423 206.223 213.083 204.463 212.938 11 0.754 0.248 0.817 0 0 0 211 .944 208.423 206.223 213.083 204.463 212.938 Table 6. Analysis of variance for the first-order regr ession model Source of Variation Sum of Squares , (kPa) 2 Degrees of Freedom Mean Squa re, (kPa) 2 F 0 Regression S Sr =2.287 x 10 4 3 MSr=7. 623 x 10 3 17.85 Error SSe =2.99 x 10 3 7 MSe=427. 148 Lack of F it SS LoF =2.989 x 10 3 5 MS LoF =597.818 1 260 Pure Error SS PE =0.949 2 MS PE =0.474 Total Syy =2.586 x 10 4 14 17 As s hown in Table 5, the experimental desi gn has three replica ted centre runs . So, the residual sum of s quares can be partitioned into pure error a nd lack of fit c omponents. The lack-of-fit test in Table 6 tests the lack of fit for the first-order regression model. The F -value for this test is large ( F 0 =1260> F 0.05, 5, 2 =5.79), implying that this regression model i s inadequate. In terms of equation (13) for adequacy attainment t he simpler first-order model with function X 1 θ θ θ θ 1 of first-orde r may be added by the X 2 θ θ θ θ 2 terms of second-order which perhaps would r eprese nt t he response a dequately. So, it could be a ssumed th at the second-or der polynomial model might describe the investigated relationship which is P 2 = θ 0 + θ 1 x 1 + θ 2 x 2 + θ 3 x 3 + θ 11 x 1 2 + θ 22 x 2 2 + θ 33 x 3 2 + θ 12 x 1 x 2 + θ 13 x 1 x 3 + θ 23 x 2 x 3 + ε 2 (63) Table 7 demonstra tes the data resulting from an e mpirical investigation into e ffect of the three variables on gauge ba ck-pressure P 2 for the second-order model. The least squared estimator of effects θ θ θ θ in the s econd-order model is q =( X 2 T X 2 ) -1 X 2 T P 2 , whe re X 2 is the Box- Behnken design matrix of coded factor levels with identity element a nd P 2 is the n x 1 vector of back-pressure observati ons. The estimate of effects θ θ θ θ is q = (212.598 –34.274 38.221 21.697 0.286 –2. 362 –6.333 –9.561 13.288 6.227) T . So, t he least s quares fit of the second-order regression model is P 2 fit =212.598–34.274 x 1 +38.221 x 2 +21.697 x 2 +0.286 x 1 2 –2.362 x 2 2 –6.333 x 2 2 –9.561 x 1 x 2 +13.288 x 1 x 3 +6.227 x 2 x 3 (64) The F -test f or significance of regression is summarized in Table 8 and, because α =0. 05 has been sel ected, t hen F 0 =118.419> F 0.05, 9, 5 =4.77. So, we would reject the hypothesis H 0 : θ 1 = θ 2 = θ 3 = θ 11 = θ 22 = θ 33 = θ 12 = θ 13 = θ 23 =0, concluding that at least some of these parameters are nonzero. Table 7. Box-Behnken design and data for the second-order regression model Run A , mm 2 P S , MPa B , mm 2 x 1 x 2 x 3 P 2 , kPa P 2 fit , kPa 1 0.251 0.199 0.817 -1 -1 0 195.273 195.273 2 0.251 0.297 0.817 -1 1 0 292.260 292.260 3 0.546 0.199 0.817 1 -1 0 14 7.907 147.907 4 1.834 0.297 0.817 1 1 0 206.648 206.648 5 0.251 0.248 0.503 -1 0 -1 237.147 237.147 6 0.251 0.248 1.131 -1 0 1 246.561 246.561 7 1.257 0.248 0.503 1 0 -1 139.963 139.963 8 1.257 0.248 1.131 1 0 1 202.530 202.530 9 0.754 0.199 0.503 0 -1 -1 14 7.220 147.220 10 0.754 0.199 1.131 0 -1 1 18 5.564 185.564 11 0.754 0.297 0.503 0 1 -1 209.787 209.787 12 0.754 0.297 1.131 0 1 1 273.039 273.039 13 0.754 0.248 0.817 0 0 0 213.317 212.565 14 0.754 0.248 0.817 0 0 0 212.532 212.565 15 0.754 0.248 0.817 0 0 0 211.944 212.565 Table 8. Analysis of variance for gauge second-order regression model Source of Variation Sum of Squares , (kPa) 2 Degrees of Freedom Mean Squa re, (kPa) 2 F 0 Regression S Sr =2.624 x 10 4 9 MSr =2.916 x 10 3 118.419 Error SSe =123.114 5 MSe =24.623 Lack of F it SS LoF =122.165 3 MS LoF =40.722 85.831 Pure Error SS PE =0.949 2 MS PE =0.474 Total Syy =2.637 x 10 4 14 18 As shown in Table 7, the e xperimental design has three replicated c entre runs. So, the residual sum of s quares can be partitioned into pure error a nd lack of fit c omponents. The lack-of-fit test in T able 8 tests the lack of fit for quadratic m odel. The F -value for this test is large ( F 0 =85.831> F 0.05, 3, 2 =19.16), implying that the second-order regression model is inadequate as well. Numerical model valida tion indicates that the residual sum o f squares which remains unexplained by this model is 123.114(kPa) 2 a nd the sample standard de viation is 2. 965kPa. Hence, t his confirms that even with second-order linear regression model only the low level of pre cisi on may be a ttained. This regression model has l ow prec ision because implies big standard deviation which indicates t hat struc tural gauge model s hould be improved in some way. 4.3. Gauge hybrid data modelling From above cons ideration follows that i n terms of model (13) the simpler f irst-order model with function X 1 θ θ θ θ 1 was added by the X 2 θ θ θ θ 2 terms of second-order but it has not represented the response adequately. So, for model adequacy attainment the other option has been entertained by adding the terms of Y 1 θ θ θ θ 1 instead of X 2 θ θ θ θ 2 . Applying computer-simulation and em pirical data for model building and s olving it has been acce pted that the the oretical adiabatic mode l tolerably presents the availa ble physical knowledge. However, knowing t hat the second-order linear regression model doe s not fit a dequately the empirical data, the e xact unknown m odel has been a pproximated by the product of polynomial empirical f unction and given theoretical function (60). The theoretical implicit function may be represented explicitly by P λ = f ( A , P S , B ; λ λ λ λ ), where the vector λ λ λ λ =( γ C C 0 ) T . So, gauge hybrid data model may be written P 1 = f ( A , P S , B ; λ λ λ λ ) x φ ( x , θ θ θ θ 1 λ )+ e λ , (65) where x is t he ve ctor of coded variables x 1 , x 2 , x 3 and the f unction φ ( x , θ θ θ θ 1 λ ) is spe cified as the first-order polynomial which in coded variables is written φ ( x 1 , x 2 , x 3 ; θ θ θ θ 1 λ ) = θ 0 λ + θ 1 λ x 1 + θ 2 λ x 2 + θ 3 λ x 3 (66) For gauge hybrid da ta model building the physical designed experiment shown i n Table 5 has been e mployed. So, here the first-order experimenta l design w ith corre sponding levels of explanatory variables was use d for hybrid da ta modell ing and to ensure the design of full column ra nk it was supplied with three replicated centre runs. The same set of levels of explanatory variables was use d f or the corresponding det erministic computer-simulation experiment which has bee n carried out using the theoretical func tion (60). In this experiment we were interested in obtaining theoretical data, so the values of di scharge coefficients have been accepted a s C 0 = C =1, yielding the re sponse vect or P λ presented in Table 5. Then, vector P λ has been transformed into diagonal matrix D λ =diag( P λ ). Gauge hybrid dat a model ha s been rewritt en as P 1 = X θ θ θ θ 1 + Y λ θ θ θ θ 1 λ + e λ , where X is the t wo- level factorial design matrix with identity eleme nt and Y λ = ( D λ – I ) X . Next the model was transformed into e quation P 1 = Ψ Ψ Ψ Ψ λ β β β β λ + e λ , where Ψ Ψ Ψ Ψ λ =[ X Y λ ] and β β β β λ = λ 1 1 θ θ , and its l east squared s olution was obtained using equation b λ = ( Ψ Ψ Ψ Ψ λ T Ψ Ψ Ψ Ψ λ ) –1 Ψ Ψ Ψ Ψ λ T P 1 . The result is b λ = (27.044 4.607 6.614 3.894 0.907 −0.012 −0.010 −0. 016) T . Here the matrix X was selected so that the matrix Ψ Ψ Ψ Ψ λ T Ψ Ψ Ψ Ψ λ has an inverse and, hence the vector b λ is an e stimate of effects β β β β λ . For model validation the c omputation of fitted hybrid data model fit λ P = Ψ Ψ Ψ Ψ λ b λ was completed and the obtained result shown in Table 5. R esidual a nalysis was used for model validation and the residuals from the fitted model were com puted by equation R λ = P 1 – Ψ Ψ Ψ Ψ λ b λ . The normal plot of residuals for gauge hybrid data model is shown in Figure 1 a nd the point s 19 at t he plot f orm approximately straight line. So, we can accept the residuals are normally distributed and apply the elaborated hybrid data regression analysis. 2 1 0 1 2 2 1 0 1 2 N_Score i R λ i kPa 100 150 200 250 300 2 1 0 1 2 R λ kPa P λ fit kPa Figure 1. Normal probability plot of residuals Fig ure 2. Plot of residuals versus for gauge adiabatic hybrid data model respons e predicted values The scatter plot of residuals versus predi cted value s is t he pr imary plot which ma y be used to a ssess sufficiency of the functional part of the model (NIST /SEMATECH e-Handbook of Statistical Methods , 2006). The plot shown in Figure 2 dem onstrates the residuals approximate the random errors that make the relationship between predictor a nd response variables a statistical relationship and s uggest that main effects hybrid data model fits the data. Using equa tions presented in Table 3 the F -test for significance of regression has been summarized in Table 9 and, because α =0.05 was se lected, then F Rx =84730>> F 0.05, 4, 3 =9.12 and F Rc =505> F 0.05, 4, 3 =9.12. Here F Rx has first be en found significant and t hen F Rc being significant indica te that the model with t erms in it additional t o the first-order linear regression model explains significa ntly more of the va riation in t he P 1 -variable than the model E ( P 2 ) = X θ θ θ θ . Also, the lack-of-fit test in T able 9 tests the lack of f it for gauge m ain effects hybrid data model. The F -value f or this test ( F LoF =7.342< F 0.05, 1 , 2 =18.51) is small, implying that the main effects hybrid data model is adequate. Table 9. Analysis of variance for main effects adiabatic hybrid data model. Source of Variation Sum of Squares , (kPa) 2 Degree of Freedo m Mean Squa re, (kPa) 2 F Regression x SS Rx =5.007 x 10 5 4 MS Rx =1.252 x 10 5 84730 Regression c SS Rc = 2986 4 MS Rc = 746.402 505 Residual SS E = 4.432 3 MS E = 1.477 Lack of F it SS LoF = 3.483 1 MS LoF = 3.483 7.342 Pure Error SS PE = 0.949 2 MS PE = 0.474 Total SS T = 5.037 x 10 5 11 However, e arlier some ana lytical work has been done on developing criteria f or judging the adequacy of regression model from a prediction poi nt of view . The Box and Wetz (1973) work suggests t hat t he observed F -ratio must be at least four or five times t he cri tical value from the F -table if t he reg ression model is to be usef ul as a predictor (Myers and Montgomery, 1995). From above for a diabatic hybrid da ta m odel the observed F -ration is 2.5 times the critical value w hich is not follow to the Box a nd Wetz suggestion . Therefore, the way of hybrid data model improvement as a predictor may be connected with the other theoretical function employment. 20 The steady-flow model (58) of the orifice was derived f rom the energy equation assuming that the a ir-flow process i s a diabatic. But if isochoric air-flow process is presumed the n in Mathcad computer code notation the steady-flow model for the orifice is written ( ) otherwise P P P if P P P RT g B C G S S S O O 2 5 . 0 2 ≥ − = ν ν ν ν (67) and the isochoric steady-flow model of the nozzle-work-piece sensor is ( ) otherwise P P P if P P P RT g CA G a a a 2 5 . 0 2 ν ν ν ν ≥ − = , (68) where G O ν and G ν are the weight flow rates through the orifice and sensor, respectively, and P ν is the calculated back-pressure for isochoric air-flow. In this case the st eady-state model of the gau ge may be derived by using e quations describing the isochoric ai r flows through the both orifice and nozzle-work-piece sensor. And for one-di mensional steady flow the weight fl ow rate s t hrough these two restrictors must be equal in order for the total weight of air in the tube between the orifice and nozzle to be constant. Therefore, in this case the weight flow rate equality is written ( ) otherwise P P P if P P P B C S S S O 2 5 . 0 ≥ − ν ν ν ( ) otherwise P P P if P P P CA a a a 2 5 . 0 ν ν ν ≥ − = (69) This implicit algorithmic model may be represented explicitly by equa tion P ν = f ( A , P S , B ; ν ν ν ν ), where the vector ν ν ν ν =( C C 0 ) T . So, gauge isochoric hybrid data model may be written P 1 = f ( A , P S , B ; ν ν ν ν ) x φ ( x , θ θ θ θ 1 ν )+ e ν (70) Here for hybrid data m odel buil ding the physical experim ent shown in Tabl e 5 has been employed as well. And in parallel t o the physical e xperiment the corresponding dete rministic computer-simulation experiment has been carried out using algorithmic function (69) with C 0 = C =1 and yielding the response vector P ν presented in Table 5. Further, employing the matrix D ν =diag( P ν ), gauge hybrid data m odel has been rewritten as P 1 = X θ θ θ θ 1 + Y ν θ θ θ θ 1 ν + e ν , where Y ν = ( D ν – I ) X . After t hat the model ha s been transformed into equation P 1 = Ψ Ψ Ψ Ψ ν β β β β ν + e ν , where Ψ Ψ Ψ Ψ ν =[ X Y ν ] and β β β β ν = ν 1 1 θ θ , and its solution by equa tion b ν = ( Ψ Ψ Ψ Ψ ν T Ψ Ψ Ψ Ψ ν ) –1 Ψ Ψ Ψ Ψ ν T P 1 was b ν = (15.429 5.647 7.694 2.555 0.971 −0.006 −0.026 −0.013) T . For mode l validation the computati on of fitted hybrid data model fit ν P = Ψ Ψ Ψ Ψ ν b ν was completed and obtained result shown in Table 5 by vector fit ν P . Re sidual analysis was used for mode l validation and the residuals from the f itted model were computed by equa tion R ν = P 1 – Ψ Ψ Ψ Ψ ν b ν . The normal plot of residuals for gauge isochoric hybrid data mode l is shown in Figure 3 and the points a t the plot form approximately straight l ine. So , we can admit the residuals are normally distributed and apply the hybrid data regression analysis. The scatter pl ot of residuals ve rsus predic ted values s hown in Figure 4 d emonstrates the residuals a pproximate the random errors that make the relationship between predictor and response variables a statistical relationship a nd suggest that main ef fects hybrid da ta m odel fits the data. 21 1 0 1 2 1 0 1 2 N_Scor e i R ν i kPa 100 150 200 250 300 1 0.5 0 0.5 1 R ν kPa P ν fit kPa Figure 3. Normal probability plot of residuals Fig ure 4. Plot of residuals versus for gauge isochoric hybrid data model response pre dicted values The F -test f or significance of regression has b een summarized in Table 10 and, because α =0.05 was selected, t hen F Rx =145200>> F 0.05, 4, 3 =9.12 and F Rc =866> F 0.05, 4 , 3 =9.12. Here F Rx ha s fi rst been obtained significant and then F Rc bei ng significant indicate tha t the model with terms in it additional to the first-order linear regression m odel explains significantly more of the variation in the P 1 -variable than the model E ( P 2 ) = X θ θ θ θ . A lso, the lac k-of-fit test in Table 10 tests the lack of fit for gauge m ain eff ects hybrid da ta mode l. The F -value f or this test is small ( F LoF =3.45< F 0.05, 1, 2 =18.51), implying tha t the hybrid data model is adequate. Table 10. Analysis of variance for main effects isochoric hybrid data model. Source of Variation Sum of Squares , (kPa) 2 Degree of Freedo m Mean Squa re, (kPa) 2 F Regression x SS Rx =5.007 x 10 5 4 MS Rx =1.252 x 10 5 145200 Regression c SS Rc = 2987 4 MS Rc = 746.863 866 Residual SS E = 2.586 3 MS E = 0.862 Lack of F it SS LoF = 1.637 1 MS LoF = 1.637 3.45 Pure Error SS PE = 0.949 2 MS PE = 0.474 Total SS T = 5.037 x 10 5 11 Here f or isochoric hybrid data model the obse rved F -ration i s 5.4 times the cri tical value which is fol low to the Box and Wetz s uggestion. Also, this hybrid data model encapsulates the available physical gauge knowledge. For thi s adequate model the unexplained residual sum of squares remains only 2.586(kPa) 2 which is 47. 6 times less than f or the second-order multiple li near regression model and can be considered as an indicated quantity f or superior model quality . Furthe rmore, the sam ple standard devia tion here is only 0. 509kPa, which is 5.8 times less th an for gauge multiple lin ear regres sion model. Hence, the main effects isochoric hybrid data model has the best precision because implies the smallest standar d deviation and emulates the gauge with minimum uncertainty. 5. Discussion and conclusions Undoubtedly the subject of statistical model construction or identification is heavily dependent on the results of theoretical a nalysis of the ob ject under observation (Akaike, 1974). And s ince all models are wrong it is impossible to obtain a “correct” one by e xcessive elaboration (Box, 1976). On the contrary following W illiam of Occam the hybrid da ta modelling can represent an economical description of natural phenomena. 22 In iterative physical experimentation usually the initi al design is run and then, in t he light of the results obtained, it is de cide d what should be done next. But whatever is done must be done by us ing the expert subject-matter knowl edge of the investigator whos e input can contribute to possible changes in the model, in the i nput variables, in the measured responses and in the objective (Box, 1994). Therefore, in c oncurrent physical and com puter experimentation the initial design needs to be run physicall y and computationally and then, using stati stical inference it is decided wh at should be done next. He re, for adequacy attainment, instead of e mpirical com ponent complication the theoretical and simple em pirical components may be combined into hybrid data regression m odel. By means of hybrid da ta regression modelling it is possible to incorporate quant itativ ely the a vailable subject-matter knowledge int o measurement system regression model. This new type of model building and solving te chniques overcomes the main draw back of multiple linear regression modelling, connected with the lack of theoretical information and dat a a bout underlying mecha nisms, and let us a possibility to elaborate adequate, precise, and parsim onious models whic h ca n be used for better measurement system unde rstanding and its performance enha ncement. Hybrid data multiple regression modelling is a procedure for da ta fusion from computer-simulation and physical designed experiments with the aim of maximizing t he m easurement information content. The theory of inference re garding parameter estimation generally a ssumes that the true model for a given se t of physica l data is known and pre -specified. In pra ctice a model may be formulated from these data, and it is i ncreasingly common f or te ns or even hundreds of possible models to be entertained (data mining) (C hatfield, 1995). But even whe n a model is pre-specified on subject-matter grounds and then correcte d employing empirical data, it may be formulated inc orrectly. Thus, incorporating data from computer experiment let us a possibility to e laborate parsimonious and adequate models with the use of les s empirical data. Finally, a single hybrid data model which entertains s tatistically significant subject-matter knowledge and data may be selected a s a ‘winne r’ when t he other models give ina dequate or unsatisfactory fit. In hybrid data modelling the statistical i nference is broadened to include theoretical mode l formulation, but it is not c lear to what e xtent we c an formalize the steps taken by an analyst during hybrid data generation and model building. However, the hybrid data modelling adapts naturally to c ope with subject-matter knowledge uncertainty. This t ype of modelling gives an opportunity to start working with such uncertainty and t o give due re gard to the computer-based revolution in model formulation which ha s taken place. With hybrid da ta multiple regression modelling we have an a dvantage to extend the formal statistical inference beyond standard regression techniques to theoretical considerations. And, by incorporating statistically significant computer-sim ulation data it is pos sible to improve m odel performance significantly. The elaborated techniques have been successfully e mployed for adiabatic and isochoric hybrid da ta modelli ng of t he pneumatic gauge. The application of the se techniques, using the numerical correctness o f Mathcad software p ackage, has rev ealed their e ffectivenes s for elaboration of parsimonious and adequate models. The present i nvestigation has elucidated that if the modelling is accomplished e mploying c oncurrently the two sources of data from computer-simulation a nd physical designed e xperiments; i t is possible to achieve significant model improvement. The validat ion of gauge a diabatic and isochoric hybrid data multiple regression models has confirmed that these models fit the reference empirical data much better tha n the s econd-order multiple l inear regression model for the considered region of factor space and the isochoric model has become a winner from the prediction point of view. Appendix A 23 var 2 1 b b = ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − Q X X X Y Q YQ X X X X X X Y YQ X X X X X 1 1 1 1 1 T T T T T T T T T Y Y X Y Y X X X T T T T ( ) ( ) ( ) ( ) ( ) T T T T T T T T T T − − + − − − − − − − − − Q X X X Y Q YQ X X X X X X Y YQ X X X X X 1 1 1 1 1 σ 2 = ( ) ( ) { } [ ] ( ) { } − − − − − − − − Y X X X X I Y Q 0 Y X X X X Y YQ I X X X I T T T T T T T T I 1 1 1 ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − Q YQ X X X X X X Y Q X X X Y YQ X X X X X T T T T T T T T T 1 1 1 1 1 σ 2 = ( ) { } ( ) − − + − − − + − − − − − − − − − − − − − − − − YQ Z Q YQ X YG Z Q YQ Z YQ I X G XG Y Q YQ X YG Z YQ I X YXG YQ X I G T T 1 T T T 1 1 T T 1 T T 1 T 1 σ 2 Appendix B var ( y ˆ )=[ X Y ] − T T T T T T Y X Y Y X Y Y X X X σ 2 = [ X Y ] ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − Q X X X Y Q YQ X X X X X X Y YQ X X X X X 1 1 1 1 1 T T T T T T T T T T T Y X σ 2 = [ X Y ] ( ) ( ) ( ) ( ) ( ) − − + − − − − − − − − − T T T T T T T T T T T T T T Y Q X X X X Y Q Y YQ X X X X X X X Y Y Q X X X X X X 1 1 1 1 1 σ 2 = [ X ( X T X ) -1 X T { I + Y Q – Y T X ( X T X ) -1 X T – Y Q – Y T } + Y Q – Y T { I – X ( X T X ) -1 X T }] σ 2 = [ X ( X T X ) -1 X T + X ( X T X ) -1 X T Y Q – Y T X ( X T X ) -1 X T – X ( X T X ) -1 X T Y Q – Y T + Y Q – Y T { I – X ( X T X ) -1 X T }] σ 2 = [ X ( X T X ) -1 X T – X ( X T X ) -1 X T Y Q – Y T { I – X ( X T X ) -1 X T }+ Y Q – Z T ] σ 2 = { X ( X T X ) -1 X T – X ( X T X ) -1 X T Y Q – Z T + Y Q – Z T } σ 2 = [ X ( X T X ) -1 X T +{ I – X ( X T X ) -1 X T } Y Q – Z T ] σ 2 = { X ( X T X ) -1 X T + Z ( Z T Z ) – Z T } σ 2 References Akaike H. (1974) A ne w look at t he stati stical mode l identification. IEEE Transactions on Automatic Control , AC-19 , 716-723 Anscombe, F. J. and Tukey, J. W. (1963) The e xamination and analysis of residuals. Technometrics , 5 , 141–160 Bayarri, M. J., Berger, J. O., Higdon, D., Kennedy, M. C., Kottas, A., Paulo, R., Sac ks, J., Cafeo, J. A., Cavendish, J ., Lin, C. H. a nd Tu, J. (2007) A f ramework f or validation of computer models. Technometrics , 49 , 138-154 Bokov, V. B. (2007) Theoretical and empirical data-inclusive process cha racterization. Journal of the Royal Statistical Society A, 170 (3), 735-758 Boudjemaa, R., Forbes, A . B. a nd Harris, P. M. (2003) “ Multivariate empirical m odels and their use in metrology”, Na tional Physical Laboratory , Crown, Te ddington, NPL Report CMSC 32/03 Box, G. E. P. (1976) Science and statistics. J. Am. Statist. Ass. , 71 , 791–799 Box, G. E. P. (1994) Statistics and quality improvement. J. R. Statist. Soc. A, 157 , 209–229 Box, G. E . P. a nd Draper, N. R. (1987) Empirical Model-building and R esponse Surfaces . New York: Wiley. 24 Box, G. E. P. and Wet z, J. M. (1973) “Criterion for j udging the a dequacy of estimation by an approximation response polynomial”, Department of Statistics, University of W isconsin, Madison, Technical report No. 9 Bridge, S. P., Robbins, P. T., Paterson, W. R. and W ilson, D. I. (2001) A pneumatic gauging sensor for mea suring the thickness of s oft films. Proc Instn M ech. Ingrs ., 215 , Part E , 19- 27. Chatfield, C. (1995) Mode l uncertainty, data mining and st atistical inference (with discussion). J. R. Statist. Soc. A, 158 , 419–466 Cox, M. G., Forbes, A. B. and Harris, P . M. (2002) Discrete modelling. In Software Support for Metrology, Best Practice Guide 4 . Teddington: National Physical Laboratory Crnojevic, C., Roy, G., Florent, P. and Bettaha r, A. (1997) Influence of regulator di ameter and i njection nozzle geometry on flow structure in pne umatic dimensional control systems. Journal of Fluids Engineering , 119 , 609-615 Esward, T. J. and W right, L. (2007) Conti nuous modelling. In Software Support for Metrology, Good Practice Guide 15 . Teddington: National Physical Laboratory Evans, J. C. a nd Morgan, I. G. (1964) Principle s of pneumatic gauging , Notes on Applied Science, NPL, No 34 Jackson, G. F. (1978) An a ssessment of the a pplication of pneumatic ga uging to the measurement of mass transfer coefficients. J. Phys. E: Sci. Instrum ., 11 , 569-575 Mathsoft (1999) Mathcad User’s Guide . Cambridge: MathSoft. Myers, R. H. a nd Montgomery, D. C. (1995) Response Surf ace Me thodology: P rocess and Product in Optimization using Designed Experiments . New York: Wiley. Nakayama, Y. (1964) Action of the fluid in the air-micrometer. Bulletin of JSME , 7 , 698-720 National Ins titute of Standards and Technology (2006) Engineer ing St atistics Handbook . Gaithersburg: Na tional Insti tute of Standards and Technolog y. (Available f rom www.itl.nist.gov/div898/handbook/ ) Reader-Harris, M., Stewart, C. W., Lord, G. J. and Forbes, A. B. (2000) Continuous modelling in metrology”, National Engineering Laboratory, Crown, Teddington, Tech. Rep. 058/2000 Reese, C. S., Wils on, A. G., Hamada, M., Martz, H. F . and Ryan, K. J. (2004) Integrated analysis of computer and physical experiments. Technometrics , 46 , 153-164 Rohde, C. A. (1965) Generalized inverses of partitioned matrices, J. Soc. Ind. App. Math., 13 , 1033-1035 Sacks, J., W elch, W. J., Mitchell, T. J. and Wynn, H. P. (1989) Design and a nalysis of computer experiments. Statistical Science , 4 (4), 409-435 Searle, S. R. (1971) Linear Models . New York: Wiley. Zalmanzon, L . A. (1965) Compone nts for Pneumatic Control Instruments; the Static and Dynamic Characteristics of Pneumatic R esistances, Capacitances and Transmission Lines . Oxford: Pergamon
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment