Fuzzy sets in nonparametric Bayes regression

IMS Collectio ns Pushing the Limits of Con temp orary Statist ics: Contributions in Honor of Jay an ta K. Ghosh V ol. 3 ( 2008) 89–104 c  Institute of Mathe matical Statistics , 2008 DOI: 10.1214/ 07492170 80000000 84 F uzzy sets in nonparametric Ba y es regress ion Jean-F ran¸ cois Angers 1 and Mohan Delampady 2 Universit´ e de M ontr ´ eal and Indian Statistic al Institute, Bangalor e Abstract: A simple Bay esian approach to nonparametric regression is de- scrib ed using fuzzy sets and membership functions. Membership functions are int erpreted as likelihoo d functions for the unkno wn r egression function, so that with the help of a reference prior they can be transf ormed to prior den- sity functions. The unkno wn regressi on function i s decomposed into wa velets and a hierarchica l Bay esian approach is empl oy ed f or making inferences on the resulting w a v elet coeﬃcient s. Con ten ts 1 Int ro duction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 2 Nonparametric regr ession a nd wa v elets . . . . . . . . . . . . . . . . . . . . 91 3 Prior information and members hip functions . . . . . . . . . . . . . . . . 92 4 Posterior calcula tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 4 5 Mo del chec king and Bayes factors . . . . . . . . . . . . . . . . . . . . . . . 9 7 5.1 Pr ior robustness of Bayes facto rs . . . . . . . . . . . . . . . . . . . . 9 8 6 Examples, simulations a nd illustrations . . . . . . . . . . . . . . . . . . . 100 6.1 Mo del chec king . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 02 7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 03 Ac knowledgmen ts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 04 1. In tro duction Consider the mo del (1.1) y i = g ( x i ) + ε i , i = 1 , . . . , n, and x i ∈ T , where ε = ( ε 1 , ε 2 , . . . , ε n ) ′ ∼ N (0 , σ 2 I ), σ 2 is unknown and g ( · ) is a function de ﬁned on some index set T ⊂ R 1 . Inferences ab out g such a s its estimation and es tima tio n error as well a s mo del chec king ar e of in terest. Without par ametric a s sumptions such as thos e leading to linear reg ression, this is a no npa rametric regr ession problem. A Bayesian appr oach to (fully) nonparametric 1 D´ epartemen t de math´ ematiques et de statistique, U nive rsit´ e de Montr ´ eal, M on tr ´ eal H3C 3J7, Canada, e-mail: angers@d ms.umont real.ca 2 Statistics and Mathematics Unit, Indian Statistical In stitute, Bangalore 560 059, India, e-mail: mohan@is ibang.ac .in AMS 2000 subje c t classiﬁc ations: Pr imary 62G08; secondary 62A15, 62F15. Keywor ds and phr ases: function estimation, hierarchical Ba y es, membership function, model c hoice, wa velet. 89 90 J.-F. Angers and M. Delamp ady regres s ion problems t ypically requires spec ifying pr ior distributions on function spaces whic h is rather diﬃcult to handle. The extent of the complexity of this approach can be gauged from sources suc h as Ghosh and Ramamoo rthi [ 9 ], Lenk [ 10 ], a nd so on. F ur thermore, quantifying useful prior informatio n such as “ g is close to (a sp eciﬁed function) g 0 ” is diﬃcult pro babilistically , whereas this se ems quite stra ightforw ard if instead an a ppropria te metric o n the concerned function space is used. This is where fuzzy sets or member ship functions can be made use of. Before pr o ceeding to this pro blem, let us reca ll the following details on fuzzy sets and membership f unctions. Deﬁnition 1.1. A fuzzy subset A of a space G (or just a fuzzy set A ) is deﬁned by a member ship function h A : G − → [0 , 1]. The mem ber ship function, h A ( g ), is supp ose d to express the degree o f compati- bilit y of g with A . F or example, if G is the r eal line and A is the set o f po ints “close to 0 ”, then h A (0) = 1 indicates that 0 is cer tainly included in A , but h A ( . 05) = . 01 says that .05 is not really “clo se” to 0 in this c o ntext. Similarly , if G is a se t of functions and A ⊂ G is a set of functions “close” to a given function g 0 , then h A ( g 0 ) = 1 indicates that g 0 is certainly included in A ; howev er, if h A ( g 1 ) = . 01 with g 1 ( x ) = 10 g 0 ( x ) + 100 then g 1 is not really “ close” to g 0 in this case. Note that even when G = Θ is the pa rameter space , a members hip function h A ( θ ) is not a probability density or mass function deﬁned on Θ, and he nc e canno t b e used to o btain a prior distribution directly . Instea d, as we hav e done in [ 4 ], we prop ose that a reasonable in terpretation for a fuzzy subset A o f Θ is that it is a likeliho o d function f or θ giv en A . (See also [ 12 , 1 3 ]). This in ter pretation s eems to be able to answer so me of the questions reg arding how a ppropriate the co ncept of fu zziness is in mo deling our p erception of imprecisio n. F r ench [ 7 ] discusses a few of these ques - tions. First of all, likeliho o d is an accepted means for mo deling imprecision. Another impo rtant question is how to deﬁne h A ∩ B from h A and h B for incorp ora ting h A and h B in Bay esian inference . If A and B are indep endent, then interpreting h A and h B as lik eliho o d functions leads to the res ult that h A ∩ B = h A h B , for this purpo se. F urther, the qualitative ordering that underlies a mem bers hip function can also b e inv e s tigated with this interpretation, in conjunction with a pr ior distribution, as we study later. See [ 4 ] for an application to hierarchical and r obust Bay es inference. This pap er is closely rela ted to Angers and Delampady [ 3 , 4 ]. In the latter , we used fuzzy sets to help sp ecify a pr ior density on a ﬁnite para meter space, wher eas in the for mer a nonpar ametric function estimator using wav elet decomp os ition w as presented. It is therefor e natur al to explore whether a combination of these tw o ideas can b e fruitfully employed. T ow ards this end, we adopt the semi-par ametric approach of using the w av elet decomp os itio n of a regress ion function a long with a r emainder, thus e ﬀecting a dra stic reduction of the dimension of the parameter space. Next we prop ose to incorp or ate the impre cise prior information o f the k ind “the true r egressio n function g is close to g 0 ” using membership functions. W av elet decomp osition of g 0 provides the w av elet co eﬃcients θ 0 which we take as the prior mean of the wa velet co eﬃcients θ cor resp onding to g . Precisio n o f this prior mea n estimate is unclear, so we trea t the prior v ar iance of these co eﬃcients as a hyper- parameter with a second sta ge (reference) prio r. This appro ach is also tied to the question of Bayesian robustnes s. If a membership function stating “ we think g is close to g 0 ” is the only prior input that we intend to incor p o rate, how sho uld we then pro ceed with the Bay esian inference related to g ? W e claim that the only natural appro ach is to study the ro bustness of the inferences resulting fro m the Nonp ar ametric r e gr e ssion 91 class o f prior densities c ompatible with this mem ber ship function (see Section 5.1). This appro ach is further explained in the following sections. This pa pe r is o rganized as follows. In Sections 2 and 3, a brief s ummary o f our previous work [ 3 , 4 ] is pr e s ented. In Sec tio n 4, details on the pos ter ior ca lculations are given. The main fo c us of this pap er, namely , model c hecking using Bayes factors and fuzzy mo dels, is discusse d in Section 5 . Sim ulations and practical examples are presented in the last section to illustr a te this theme. 2. Nonparametric regressio n and w a v elets T o pro ceed further, we as sume that the regress io n function g and its prior guess g 0 are in L 2 and imp os e a wav elet structure o n bo th of them. T o do this, we consider a co mpactly supp orted wa v elet function ψ ∈ C s , the set of real-v alued functions with contin uous der iv atives up to order s . Thus, g ( x ) can b e written as g ( x ) = X | k |≤ K 0 α k φ k ( x ) + ∞ X j =0 X | k |≤ K j β j,k ψ j,k ( x ) = g J ( x ) + R J ( x ) , where g J ( x ) = X | k |≤ K 0 α k φ k ( x ) + J X j =0 X | k |≤ K j β j,k ψ j,k ( x ) , R J ( x ) = ∞ X j = J +1 X | k |≤ K j β j,k ψ j,k ( x ) , (2.1) φ k ( x ) = φ ( x − k ) , ψ j,k ( x ) = 2 j / 2 ψ (2 j x − k ) , α k = Z T φ ( x − k ) g ( x ) dx and β j,k = Z T 2 j / 2 ψ (2 j x − k ) g ( x ) dx. Here K j is such that φ k ( x ) and ψ j,k ( x ) v a nis h on T whenever | k | > K j , and φ is the scaling function (“father wa velet”) cor resp onding to the “mother wa v elet” ψ . Such K j ’s ex is t (and are ﬁnite) since the w av elet function that we hav e chosen has compact supp ort. (See [ 3 ] o r [ 8 ] for details.) Since the num b er of o bs erv a tions is ﬁnite, o nly a ﬁnite n um ber of parameters c a n b e estimated. Therefore, as suggested in [ 3 ], the r esolution level J should b e chosen such that the total num ber of un- known parameter s which need to b e estimated is no larger than n , the num b er of observ ations. Since the total num b er of α and β parameter s is b ounded by l X 2 J +1 + J ( l ψ + 1) + ( l φ + l ψ + 2) where l X , l ψ and l φ represent the length of the supp or t of T , ψ ( · ) a nd φ ( · ) resp ec- tively , w e r equire this upper b ound to be less than n . The o ptimal resolution lev el J , how ev er, should b e chosen using a Bay esian mo del selection technique, for example [ 3 ]. Consequently , we prop ose that the function g J ( · ) b e estimated from the da ta 92 J.-F. Angers and M. Delamp ady and the function R J ( · ) b e consider ed a s a “ nuisance par ameter” to b e eliminated by integrating it out. F urther we assume that g 0 ( x ) = X | k |≤ K 0 α 0 ,k φ k ( x ) + ∞ X j =0 X | k |≤ K j β 0 ,j,k ψ j,k ( x ) , where α 0 ,k = Z T φ ( x − k ) g 0 ( x ) dx, β 0 ,j,k = Z T 2 j / 2 ψ (2 j x − k ) g 0 ( x ) dx. Since g 0 is given, the wav elet co eﬃcients can b e assumed known. F ro m [ 1 ], it is known that β j,k = O (2 − j s ) where s denotes the degree of “smo oth- ness” of ψ , that is ψ ( · ) ∈ C s ( s > 1 / 2 ). F urther, some of the a priori information wa v elet co e ﬃcient s can be translated into E [ α k ] = α 0 ,k ; E [ β j,k ] = β 0 ,j,k ; (2.2) V ar( α k ) = τ 2 ; V ar( β j,k ) = τ 2 / 2 2 j s ; (2.3) (see [ 3 ] for details). Here , τ 2 is a ﬁrs t stag e hyper-par ameter, a suitable prior on which will b e sp eciﬁed later . (It is also supp os ed that E [ β j,k ] = β 0 ,j,k = 0 for j ≥ J + 1.) How ever, the prio r probability distribution on the co eﬃcients α k and β j,k itself will b e sp eciﬁed by a membership function to b e introduced later. Since there are only a ﬁnite num ber of β j,k that can be estimated, we cannot exp ect to have a very informa tive pr io r distribution o n the remainder term, R J . How ev er, to b e able to pro ce e d with the Bay e s ian approach, we need to ass ume that R J ( · ) is a sto chastic pr o cess. F or simplicity a nd computatio nal ease, we co nsider a Gaussian pro ces s with zer o mean (compatible with E [ β j,k ] = β 0 ,j,k = 0 for j ≥ J + 1). Spec iﬁc a lly , following [ 3 ], we now as sume that R J ( x ) is a Ga us sian pro cess with mean function 0 and cov aria nce kernel τ 2 Q ( x, y ) where Q ( x, y ) = ∞ X j = J +1 1 2 2 j s X | k |≤ K j ψ j,k ( x ) ψ j,k ( y ) . (2.4) Note that a prior assumption such as β j,k being indep endent normal r andom qua nti- ties will natura lly lead to this prior distribution as can be seen from the Ka rhunen– Lo eve expansion [ 2 ]. W e, howev er, ma ke the stronger assumption that the remainder R J ( · ) itself is a zero- mean Gaussia n but with an unknown prior v ar iance co mp one nt τ 2 . Our r eason for this a ssumption is that, o nce the optimal r esolution level J is chosen using a p ow erful mec hanism such as a Bayes factor , as we suggest later, the wa v elet co eﬃcients at hig her r esolutions are not exp ected to hav e substantial inﬂu- ence o n the wa v elet smo other. Hence R J ( · ) w hich is made up of these higher-order wa v elet co e ﬃcient s will also hav e negligible inﬂuence. 3. Prior information and me m b ership functions W e have ex plained in the prev io us s ection that we would like to make use of im- precise pr io r information such as “ g is close to g 0 ” by using a membership function Nonp ar ametric r e gr e ssion 93 which tra ns lates this in to a measure of distance b etw een the co rresp onding wa v elet co eﬃcients. L et us examine the implications o f as s uming that the av ailable prior information is qua nt iﬁed in terms of a membership function (3.1) h A ( g ) = ξ ( ρ ( g , g 0 )) , where ρ is a mea sure o f distance. Due to the wa v elet deco mpo sition ass umed on g as well as g 0 (see Section 2), a na tural choice for ρ is the L 2 distance given by (3.2) ρ 2 ( g , g 0 ) = || g − g 0 || 2 = X ( θ j,k − θ 0 j,k ) 2 , where θ j,k and θ 0 j,k are, resp ectively , the wa velet co eﬃcients of g and g 0 . Using Parsev al’s iden tit y , note that ρ 2 ( g , g 0 ) = k g − g 0 k 2 = X | k |≤ K 0 ( α k − α 0 ,k ) 2 + ∞ X j =0 X | k |≤ K j ( β j,k − β 0 ,j,k ) 2 . Since we canno t estimate β j,k for j = J + 1 , . . . , and b eca use β j,k = O (2 − j s ) (see [ 1 ]) we then have ρ 2 ( g , g 0 ) = X | k |≤ K 0 ( α k − α 0 ,k ) 2 + J X j =0 X | k |≤ K j ( β j,k − β 0 ,j,k ) 2 (3.3) + ∞ X j = J +1 X | k |≤ K j ( β j,k − β 0 ,j,k ) 2 = X | k |≤ K 0 ( α k − α 0 ,k ) 2 + J X j =0 X | k |≤ K j ( β j,k − β 0 ,j,k ) 2 + ∞ X j = J +1 X | k |≤ K j O(2 − 2 j s ) = X | k |≤ K 0 ( α k − α 0 ,k ) 2 + J X j =0 X | k |≤ K j ( β j,k − β 0 ,j,k ) 2 + O(2 − 2( J +1) s ) = ρ 2 J ( g , g 0 ) + O (2 − 2( J +1) s ) . (3.4) F or this rea son, to quantify the imprecis e prior infor mation, we will use a member- ship function that will dep end o nly o n ρ 2 J ( g , g 0 ). Some po ssibilities for h A are the following: (i) The Gaussia n mem ber ship function given by (3.5) h A ( g ) = exp( − ρ 2 J ( g , g 0 )) = exp( − α || θ − θ 0 || 2 ) . This membership function can b e explained as follows. Suppo se we have a v ailable some past data o f the form y ∗ i = g ( x ∗ i ) + ε i , i = 1 , . . . , n ∗ , with ε i denoting i.i.d. normal e r rors , and supp ose g is estimated from this data b y ˆ g . Then the informa tio n in this data may be q uantiﬁed using a member ship function of the type h A ( g ) = exp( − α || g − ˆ g || 2 ) = exp( − c X ( θ j − ˆ θ j ) 2 ) . 94 J.-F. Angers and M. Delamp ady g 0 may then b e identiﬁed with ˆ g . If we ha ve multiple past data sets, we may then hav e av aila ble h A 1 ( g ) = exp( − α 1 || g − ˆ g 1 || 2 ), h A 2 ( g ) = exp( − α 2 || g − ˆ g 2 || 2 ), and so on, which may be combined into h A ( g ) = h A 1 ∩ A 2 ( g ) = h A 1 ( g ) h A 2 ( g ) = exp( −  α 1 || g − ˆ g 1 || 2 + α 2 || g − ˆ g 2 || 2  ) . As an exa mple one could consider ﬁtting regres sion lines to tw o (or more) s ets of past data with p ossibly diﬀerent er ror v ariances a nd use the ﬁtted reg ression lines along with the es tima ted v ariance s for constructing the membership functions. This justiﬁes to some extent our previous sug g estion that member ship functions quantif y prior infor mation in the s e ns e of the likeliho o d. The constants α 1 and α 2 provide additional scop e for as signing diﬀerent weigh ts to the tw o sources of info r mation, which is ano ther app ealing feature of this approa ch. (ii) The multiv a riate t members hip function (3.6) h A ( g ) =  1 + ρ 2 J ( g , g 0 )  − ( p + q ) / 2 =  1 + ( θ − θ 0 ) ′ V − 1 ( θ − θ 0 ) /q  − ( p + q ) / 2 , where q > 2 is the de g rees o f freedo m a nd p denotes the dimensio n of θ . This is a contin uo us s cale mixture o f Gaussian member ship functions with the same g 0 for each of the membership functions. Since this v anishes more slowly than ( 3.5 ), one could exp ect b etter robustness with this. (iii) The uniform function h A ( g ) =  1 , if ρ J ( g , g 0 ) ≤ δ ; 0 , otherwise. This is an ex treme case where g is restricted to a neig hborho o d of g 0 . In order to pro ceed with Bayesian inference o n g , we need to co nv ert the mem- ber ship function in to a prio r density . This is done as in [ 4 ] with the aid of a reference prior density π 0 . Th us we obtain the pr ior density π ( g ) ∝ h A ( g ) π 0 ( g ) , or, upo n utilizing the wav elet deco mpo sition for g , we have a n equiv alen t prior density π ( θ , σ 2 ) ∝ h A ( θ ) π 0 ( θ, σ 2 ) . 4. Posterior calculations As in [ 3 ], let Q n = ( Q n ) il , 1 ≤ i, l ≤ n , wher e ( Q n ) il = Q ( x i , x l ), which was int ro duced in ( 2 .4 ). No te that ( Q n ) il = X j ≥ J +1 X | k |≤ K j 2 − 2 j s ψ j k ( x i ) ψ j k ( x l ) . Let X = (Φ ′ , S ′ ) with the i th row of Φ ′ being { φ k ( x i ) } ′ | k |≤ K 0 and the i th r ow of S ′ being { ψ j k ( x i ) } ′ | k |≤ K j , 0 ≤ j ≤ J . Then we have the mo del (4.1) y | θ , σ 2 , τ 2 ∼ N ( X θ, σ 2 I n + τ 2 Q n ) . Unless θ has a normal prior distribution or a hierarchical prior with a conditionally normal prior distributio n, analytical simpliﬁcations in the computation of pos terior Nonp ar ametric r e gr e ssion 95 quantities ar e no t exp ected. F o r such case s , we have the joint p osterio r de ns it y o f the wa v elet co eﬃcien ts θ a nd the err o r v ariance s σ 2 and τ 2 given by the expression π ( θ , σ 2 , τ 2 | y ) ∝ f ( y | θ, σ 2 , τ 2 ) h A ( θ ) π 0 ( θ, σ 2 , τ 2 ) , where f is the likeliho o d. F ro m ( 4.1 ), f can b e expre s sed as f ( y | θ , σ 2 , τ 2 ) ∝ | σ 2 I n + τ 2 Q n | − 1 / 2 × exp( − 1 2 { ( y − X θ ) ′ ( σ 2 I n + τ 2 Q n ) − 1 ( y − X θ ) } ) . Pro ceeding further, s upp o s e π 0 of the form (4.2) π 0 ( θ, σ 2 , τ 2 ) = π 1 ( σ 2 , τ 2 ) , which is cons tant in θ , is chosen. MCMC based appr oaches to p osterior computations are now readily av ailable. F or exa mple, Gibbs s ampling is s tr aightforw ard. Note that the conditional pos ter ior densities ar e g iven b y (4.3) π ( θ | y , σ 2 , τ 2 ) ∝ exp( − 1 2 { ( y − X θ ) ′ ( σ 2 I n + τ 2 Q n ) − 1 ( y − X θ ) } ) h A ( θ ) , π ( σ 2 | y , θ , τ 2 ) ∝ | σ 2 I n + τ 2 Q n | − 1 / 2 exp( − 1 2 { ( y − X θ ) ′ ( σ 2 I n + τ 2 Q n ) − 1 ( y − X θ ) } ) π 1 ( σ 2 , τ 2 ) , (4.4) π ( τ 2 | y , θ , σ 2 ) ∝ | σ 2 I n + τ 2 Q n | − 1 / 2 exp( − 1 2 { ( y − X θ ) ′ ( σ 2 I n + τ 2 Q n ) − 1 ( y − X θ ) } ) π 1 ( σ 2 , τ 2 ) . (4.5) How ev er, ma jor simpliﬁcatio ns are p oss ible with the Gaussian h A as in (i). In this case, the po sterior a nalysis can pro ceed along the lines of [ 3 ]. Spec iﬁca lly , as s uming that h A ( θ ) is prop o r tional to the density o f N ( θ 0 , τ 2 Γ) with Γ =  I 2 K 0 +1 0 0 ∆ M β  , where M β = P J j =0 (2 K j + 1) and with τ 2 ∆ b eing the v ar iance-cov ar ia nce matrix of β (whic h is a ls o diagonal, having the diagonal entries speciﬁed b y V ar( β j k ) = τ 2 / 2 2 j s ), we o btain y | θ , σ 2 , τ 2 ∼ N ( X θ , σ 2 I n + τ 2 Q n ) , (4.6) θ | τ 2 ∼ N ( θ 0 , τ 2 Γ) . Therefore, it follows that y | σ 2 , τ 2 ∼ N ( X θ 0 , σ 2 I n + τ 2 ( X Γ X ′ + Q n )) , (4.7) θ | y , σ 2 , τ 2 ∼ N ( θ 0 + A ( y − X θ 0 ) , B ) , (4.8) where A = τ 2 Γ X ′  σ 2 I n + τ 2 ( X Γ X ′ + Q n )  − 1 , B = τ 2 Γ − τ 4 Γ X ′  σ 2 I n + τ 2 ( X Γ X ′ + Q n )  − 1 X Γ . 96 J.-F. Angers and M. Delamp ady Now pro ceeding as in [ 3 ], w e employ sp ectra l decomp osition to o bta in X Γ X ′ + Q n = H DH ′ , where D = dia g( d 1 , d 2 , . . . , d n ) is the matrix of eig env alues a nd H is the orthogo nal ma trix of eigenv ectors. Thus, σ 2 I n + τ 2 ( X Γ X ′ + Q n ) = H  σ 2 I n + τ 2 D  H ′ = σ 2 H ( I n + uD ) H ′ , where u = τ 2 /σ 2 . Then, the ﬁr st stage (co nditional) marg inal density of y given σ 2 and u can b e written as m ( y | σ 2 , u ) = 1 (2 π σ 2 ) n/ 2 1 det( I n + uD ) 1 / 2 × exp  − 1 2 σ 2 ( y − X θ 0 ) ′ H ( I n + uD ) − 1 H ′ ( y − X θ 0 )  = 1 (2 π σ 2 ) n/ 2 1 Q n i =1 (1 + ud i ) 1 / 2 exp ( − 1 2 σ 2 n X i =1 s 2 i 1 + ud i ) , (4.9) where s = ( s 1 , . . . , s n ) ′ = H ′ ( y − X θ 0 ). W e cho ose the prior on σ 2 and u = τ 2 /σ 2 qualitatively similar to that used in [ 3 ]. Sp eciﬁca lly , we take π 1 ( σ 2 , u ) to be pr o p or- tional to the pr o duct of an inv erse gamma densit y { k c − 1 / Γ( c − 1) } exp( − k / σ 2 )( σ 2 ) − c for σ 2 and the density of a F ( b, a ) distribution for u (for suitable choice of k , c , a and b ). Conditions apply on a and b as indicated in [ 3 ]. Once π 1 ( σ 2 , u ) is chosen a s a b ove, w e obtain the p osterio r mean and cov ariance matrix of θ a s in the following r e sult. Theorem 4.1. (4.10) E ( θ | y ) = θ 0 + Γ X ′ H E  ( I n + uD ) − 1 | y  s, wher e the exp e ctation is taken with r esp e ct t o (4.11) π 22 ( u | y ) ∝ u b/ 2 ( a + bu ) ( a + b ) / 2 n Y i =1 (1 + ud i ) ! − 1 / 2 2 k + n X i =1 s 2 i 1 + ud i ! − ( n +2 c ) / 2 . and V ar( θ | y ) = 1 n + 2 c E " 2 k + n X i =1 s 2 i 1 + ud i | y # Γ − 1 n + 2 c Γ X ′ H E " 2 k + n X i =1 s 2 i 1 + ud i ! ( I n + uD ) − 1 | y # H ′ X Γ + E [ M ( u ) M ( u ) ′ | y ] , (4.12) wher e M ( u ) = Γ X ′ H ( I n + uD ) − 1 s . The pro of of Theorem 4.1 readily follows up on using standa r d hier a rchical Bay esian mo del techniques (see [ 8 ], Section 9.1). The s econd part of Equatio n ( 4.10 ) can b e viewed as a co r rection term added to the prior guess after o bserving the data y . Coming to the m ultiv ariate t form of h A as in (ii), we note that the multiv ariate t densit y is a contin uo us sca le mixture of multiv ariate norma l densities as, for Nonp ar ametric r e gr e ssion 97 example, shown in [ 11 ], i.e., θ has the multiv ariate t distribution with lo ca tion θ 0 and scale matr ix V with density o f the form ( 3.6 ) if a nd only if θ | δ 2 ∼ N ( θ 0 , q δ 2 V ) , ( δ 2 ) − 1 ∼ χ 2 q . This implies that with a π 0 as in ( 4.2 ), we obtain a hiera rchical normal prior str uc - ture for θ with a n additiona l hyper-par ameter δ 2 . Conse q uent ly , we can pr o ceed with the p oster ior co mputations exactly as with the Gaussian h A , except that we will hav e a tw o-dimensional integration a fter we simplify our calculations as ab ov e. How ev er, MCMC techniques can easily handle this case. Computations with the h A given in (iii) ar e more diﬃcult. Analytical simpliﬁca - tions as shown for the cases (i) a nd (ii) ar e not av ailable here. Ther efore, we utilize the MCMC c o mputational schem e o utlined in (4.3)–(4.5) ab ov e. Alternatively , the Metrop olis–Has tings (M-H) algorithm may b e employ ed. 5. M o del c hec king and Ba y es factors An impo rtant and useful mo del chec king problem in the prese nt setup is checking the tw o mo dels M 0 : g = g 0 versus M 1 : g 6 = g 0 . Under M 1 , ( g = g ( θ ) , τ 2 , σ 2 ) is given the prior h A ( θ ) π 0 ( θ, τ 2 , σ 2 ) I ( g 6 = g 0 ), wherea s under M 0 , π 0 ( σ 2 ) induced by π 0 ( θ, τ 2 , σ 2 ) is the only part needed. In order to conduct the mo del chec king, we compute the Bay es facto r , B 01 , of M 0 relative to M 1 : (5.1) B 01 ( y ) = m ( y | M 0 ) m ( y | M 1 ) , where m ( y | M i ) is the predictive (marginal) densit y of y under mo del M i , i = 0 , 1. W e have m ( y | M 0 ) = Z f ( y | g 0 , σ 2 ) π 0 ( σ 2 ) dσ 2 and m ( y | M 1 ) = Z f ( y | θ , τ 2 , σ 2 ) h A ( θ ) π 0 ( θ, τ 2 , σ 2 ) dθ dτ 2 dσ 2 . Since improp er priors can lead to diﬃculties in mo del chec king problems, here we m ust employ prope r pr iors. (Note that fo r es timation purp oses, the noninformative, improp er prio r ( σ 2 ) − c corres p o nding to k = 0 would have work ed in the previo us section.) W e will dev elop the metho do logy here for the Gaussian membership func- tion only; the other cases are similar but co mputationally more intensiv e. Recall from the pr e v ious section that in the cas e of a Gaussian membership function h A , the p osterior a nalysis is similar to that discussed in [ 3 ], and for the same rea son, the co mputatio n of the Bay es factor is also similar to what was dis- cussed there. As in the prev io us section π 0 ( θ, τ 2 , σ 2 ) will b e constant in θ , while σ 2 is in verse gamma and is indep endent of v = σ 2 /τ 2 which is given the F a,b prior distribution. (Equiv alently , u = 1 /v = τ 2 /σ 2 is given the F b,a prior as b e fore.) Spec iﬁc a lly , π 0 ( σ 2 ) = { k c − 1 / Γ( c − 1 ) } exp( − k /σ 2 )( σ 2 ) − c , where c and k (small) are suitably chosen. Therefore, m ( y | M 0 ) = Z f ( y | g 0 , σ 2 ) π 0 ( σ 2 ) dσ 2 = (2 π ) − n/ 2 k c − 1 Γ( c − 1) Γ( n/ 2 + c − 1) ( k + 1 2 n X i =1 ( y i − g 0 ( x i )) 2 ) − ( n/ 2+ c − 1) . 98 J.-F. Angers and M. Delamp ady F urther, using (4.7), it follows that (5.2) m ( y | M 1 , σ 2 , u ) = (2 π σ 2 ) − n/ 2 n Y i =1 (1 + ud i ) − 1 / 2 exp ( − 1 2 σ 2 n X i =1 s 2 i 1 + ud i ) , where s = ( s 1 , . . . , s n ) ′ = H ′ ( y − X θ 0 ) as b efore. Therefor e, m ( y | M 1 ) = Z m ( y | M 1 , σ 2 , u ) π 0 ( σ 2 , u ) dσ 2 du = (2 π ) − n/ 2 k c − 1 Γ( c − 1) Z n Y i =1 (1 + ud i ) − 1 / 2 π 0 ( u ) ( Z exp ( − 1 σ 2 k + 1 2 n X i =1 s 2 i 1 + ud i !) ( σ 2 ) − ( n/ 2+ c ) dσ 2 ) du = (2 π ) − n/ 2 k c − 1 Γ( c − 1) Γ( n/ 2 + c − 1) (5.3) Z ( k + 1 2 n X i =1 s 2 i 1 + ud i ) − ( n/ 2+ c − 1) n Y i =1 (1 + ud i ) − 1 / 2 π 0 ( u ) du, where π 0 ( u ) denotes the F b,a density o f u . Note that this inv olv es o nly a straightfor- ward single-dimensio nal integration. The resulting Bayes factor is illustr ated la ter in our examples . 5.1. Prior r obustness of Bayes factors Note that the mo st informa tive part of the prior density that we have used is contained in the membership function h A . Since a mem be r ship function h A ( θ ) is to be tre a ted only as a likelihoo d for θ , any constant multiple ch A ( θ ) also contributes the same prio r informatio n ab out θ . Therefor e, a study of the robustnes s of the Bay es fa c tor that we obtained ab ov e with r e s pe ct to a cla ss of priors compa tible with h A is o f interest. Her e we cons ider a sensitivity study using the density r atio class deﬁned as follows. Since the prio r π that we use has the for m π ( θ , τ 2 , σ 2 ) ∝ h A ( θ ) π 0 ( θ, τ 2 , σ 2 ), we consider the cla ss of prior s C A =  π : c 1 h A ( θ ) π 0 ( θ, τ 2 , σ 2 ) ≤ απ ( θ , τ 2 , σ 2 ) ≤ c 2 h A ( θ ) π 0 ( θ, τ 2 , σ 2 ) , α > 0  , for sp eciﬁed 0 < c 1 < c 2 . W e would like to inv e s tigate how the Bay es facto r ( 5.1 ) behaves as the prior π v aries in C A . W e note that for any π ∈ C A , the Bay es factor B 01 has the form B 01 = R f ( y | g 0 , σ 2 ) π ( θ, τ 2 , σ 2 ) dθ dτ 2 dσ 2 R f ( y | θ , τ 2 , σ 2 ) π ( θ, τ 2 , σ 2 ) dθ dτ 2 dσ 2 . Even though the in tegration in the numerator ab ove need not inv olv e θ and τ 2 , we do so to apply the fo llowing re s ult (see [ 8 ], Theorem 3.9, or [ 4 ], Theorem 4.1). Consider the density-ratio class Γ DR = { π : L ( η ) ≤ απ ( η ) ≤ U ( η ) for so me α > 0 } , for sp eciﬁed no n-negative functions L and U . F urther, let q ≡ q + + q − be the usual decomp osition of q into its p o sitive and ne g ative parts, i.e., q + ( u ) = max { q ( u ) , 0 } and q − ( u ) = − max {− q ( u ) , 0 } . Then we hav e the following theorem (see [ 6 ]). Nonp ar ametric r e gr e ssion 99 Theorem 5. 1. F or functions q 1 and q 2 such that R | q i ( η ) | U ( η ) dη < ∞ , for i = 1 , 2 , and with q 2 p ositive a.s. with r esp e ct t o al l π ∈ Γ DR , inf π ∈ Γ DR R q 1 ( η ) π ( η ) dη R q 2 ( η ) π ( η ) dη is the unique solution λ of (5.4) Z ( q 1 ( η ) − λq 2 ( η )) − U ( η ) dη + Z ( q 1 ( η ) − λq 2 ( η )) + L ( η ) dη = 0 , sup π ∈ Γ DR R q 1 ( η ) π ( η ) dη R q 2 ( η ) π ( η ) dη is the unique solution λ of (5.5) Z ( q 1 ( η ) − λq 2 ( η )) + U ( η ) dη + Z ( q 1 ( η ) − λq 2 ( η )) − L ( η ) dη = 0 . W e shall discuss this result for the Gaussia n membership function only . Then, since the prior π tha t we us e has the for m π ( θ , τ 2 , σ 2 ) ∝ h A ( θ ) π 0 ( τ 2 , σ 2 ), and we don’t intend to v a ry π 0 ( τ 2 , σ 2 ) in our analysis , w e redeﬁne C A as C A = { π ( θ ) : c 1 h A ( θ ) ≤ απ ( θ ) ≤ c 2 h A ( θ ) , α > 0 } , for sp eciﬁed 0 < c 1 < c 2 . Now, we re-expr e s s B 01 as B 01 ( π ) = R R f ( y | g 0 , σ 2 ) π 0 ( σ 2 ) dσ 2  π ( θ ) dθ R R f ( y | θ , τ 2 , σ 2 ) π 0 ( τ 2 , σ 2 ) dτ 2 dσ 2  π ( θ ) dθ = R q 1 ( θ ) π ( θ ) dθ R q 2 ( θ ) π ( θ ) dθ , where q 1 ( θ ) ≡ Z f ( y | g 0 , σ 2 ) π 0 ( σ 2 ) dσ 2 = m ( y | M 0 ) , q 2 ( θ ) = Z f ( y | θ , τ 2 , σ 2 ) π 0 ( τ 2 , σ 2 ) dτ 2 dσ 2 . Then, Theorem 5.1 is readily applicable, and we obtain Theorem 5.2. inf π ∈C A B 01 ( π ) is the u nique solut ion λ of (5.6) c 2 Z ( q 1 ( θ ) − λq 2 ( θ )) − h A ( θ ) dθ + c 1 Z ( q 1 ( θ ) − λq 2 ( θ )) + h A ( θ ) dθ = 0 , sup π ∈C A B 01 ( π ) is the u nique solut ion λ of (5.7) c 2 Z ( q 1 ( θ ) − λq 2 ( θ )) + h A ( θ ) dθ + c 1 Z ( q 1 ( θ ) − λq 2 ( θ )) − h A ( θ ) dθ = 0 . 100 J.-F. Angers and M. Delamp ady 6. E xampl es, simulations and ill u s trations Illustrative examples, simulated a s well as those inv olving r eal-life data, are dis- cussed b elow. In all examples we have used tw o members hip functions: 1. Gaussian mem ber ship function: h A ( g ) prop ortio nal to the density of N ( θ 0 , τ 2 Γ), where θ 0 is obtained from the w av elet decomposition of g 0 . The hyper -parameter s a and b (se e (4.1 1) a nd (4.12 )) ar e b = 3 and a = 8( b + 2) / ( b − 2 ). Sensitivity analysis shows that the v alues of these hyper- parameters do not inﬂuence the results v ery muc h. T o show that the other h yper -parameter s c and k (see ( 4.11 ) and ( 4.12 )) have so me eﬀect, though not substantial, we hav e display ed the wa v elet smo others (see Figures 1 (a), 2 (a) a nd 3 (a)) fo r ( c, k ) = (2 , 1 . 5) , (1 . 5 , 0 . 5 ) and (1 . 05 , 0 . 05). 2. Uniform o n the ellipsoid (see (iii) in Section 3): h A ( g ) = I { ρ J ( g,g 0 ) ≤ δ } . W e have used three diﬀerent v alues (0.5, 1 and 5) for δ . F or the s im ulated ex amples, w e gener ated obser v ations from the mode l ( 1.1 ) with the reg ression function g ( x ) = cos(2 π x ) wher e x is drawn from a uniform density on the unit interv al. Then we consider ed three diﬀerent prior guesse s for g 0 : (i) g 0 ( x ) = cos(2 π x ) (see Figure 1 ), (ii) g 0 ( x ) = 4 | x − 0 . 5 | − 1 (see Figure 2 ), and (iii) g 0 ≡ 0 (see Figure 3 ). (a) Gaussian (b) Ellipsoid Fig 1 . Wavelet smo other for g ( x ) = cos(2 π x ) and g 0 ( x ) = cos (2 πx ) . Nonp ar ametric r e gr e ssion 101 (a) Gaussian (b) Ellipsoid Fig 2 . Wavelet smo other for g ( x ) = cos(2 π x ) and g 0 ( x ) = 4 | x − 0 . 5 | − 1 . Note that in (i) the chosen g 0 is the b est p ossible pr ior guess. F urther, since the normal pr ior is very informa tive, a s exp ected the smo other (see Figure 1 (a)) do es a n excellent job of extracting the true regress ion function. The b ehavior of the smo o ther obtained from the ellipsoid mem bers hip function (see Figure 1 (b)) is similar to that seen in Figure 1 (a), even though the prio r is diﬀerent. The smo o ther pr esented in Figure 2 b ehaves very similar to what was see n in Figure 1 . The prior g uess, g 0 , is slightly diﬀere nt from the true g . The b ehavior seen in Figure 3 emphasizes our comment following Figur e 2 . In fact, the smo o ther here lo oks better than the one in Fig ure 2 . This is p erha ps bec ause the prior is les s infor mative (concentrated) than the normal prior used there, so that the smo other can follow the data more closely than the prior . W e next applied our wa v elet smo other to the “Humidity da ta” example from (see [ 3 ]; also [ 8 ], Ex ample 10.2). The v ariable of in terest y that we hav e chosen from the data s et is the w eekly av erage humidit y level. The obs erv a tions w ere made from June 1, 19 9 5 to Decem ber 13, 1998. W e hav e chosen time (da y of recor ding the observ ation) as the cov ariate x . Since w e have 185 observ ations here, the maxim um po ssible v alue for J is 6. In this da ta (see Figure 4 (a)) a se a sonality eﬀect is present, so we hav e chosen g 0 ( x ) = 22 . 5 co s (2 π ( x + 0 . 1) / 0 . 2) + 62 . 5 , where x = (day − June 1, 19 95) / (Decem ber 13, 1998 – June 1, 1995). This choice of g 0 is ra ther arbitr ary but it seems to follow the seasona l v ar iations. F or the analys is which led to Figur e 4 (a) the Gaussian 102 J.-F. Angers and M. Delamp ady (a) Gaussian (b) Ellipsoid Fig 3 . Wavelet smo other for g ( x ) = cos(2 π x ) and g 0 ≡ 0 . mem ber ship function was used while in Figure 4 (b), it was the ellipsoid one. In this latter ﬁgur e, w e hav e also a dded the low er and upper env elopes obtained fro m the prior (lab eled Min/Max). F rom these t w o ﬁgures , it can b e se en that the prop ose d estimator ﬁts the data well and the ﬁnal result does no t dep end muc h on the mem ber ship function used. 6.1. Mo del che cking The mo del chec king approach bas e d on Bayes factors developed in the previ- ous s ection has b een tes ted on simulated examples. F or this, we gener ated ob- serv a tions form Equatio n ( 1.1 ) with the true function g ( x i ) = cos(2 π x i ) where x i , i = 1 , 2 , . . . , 2 0 were sa mpled from U (0 , 1). F or the err or ter m, ǫ i , we used σ 2 = 0 . 1. Then, for illustration purp os es, we considered three diﬀerent g 0 functions in ( M 0 : g = g 0 ): (i) g 0 ( x ) = cos(2 π x ), (ii) g 0 ( x ) = 4 | x − 0 . 5 | − 1 and (iii) g 0 ≡ 0 . Note that, the g 0 function in (i) cor resp onds to the true function. The g 0 functions in both (i) and (ii) are similar while the la st one is very diﬀerent from the true function g . Therefore, it is fair to a s sume that the Bay es factor (se e equation ( 5.1 )) Nonp ar ametric r e gr e ssion 103 (a) Gaussian (b) Ellipsoid Fig 4 . Wavelet smo other for the “Humidity data”. T a ble 1 Bayes factor for M 0 : g = g 0 vs M 1 : g 6 = g 0 g 0 B 01 (y) Evidence cos(2 π x ) 933.4275 v ery strongly fav ors M 0 4 | x − 0 . 5 | − 1 57.4735 strongly f av ors M 0 0 7 . 2845 × 10 − 6 v ery strongly fav ors M 1 for the ﬁrst tw o case s s hould not provide evidence aga inst the mo del M 0 : g = g 0 while we can exp ect str o ng ev idence in the ca se of the third function. These Bay es factors are g iven in T able 1 . F rom this table, it can b e seen that the mo del corres p o nding to the c o rrect function ( g 0 ( x ) = cos(2 π x )) obtains the largest B ay es factor follow ed by that for g 0 ( x ) = 4 | x − 0 . 5 | − 1. Moreov er, if we test M 0 : g 0 ( x ) ≡ 0 against M 1 : g 0 ( x ) 6 = 0, the Bayes facto r fav ors M 1 with strong evidence . 7. Co ncl usions In this pap er we suggest a simple approa ch to nonparametr ic regr ession by prop os- ing an alternative to dealing with complicated a nalyses on function spaces . The prop osed technique uses fuzzy sets to qua n tify the av a ilable prior information on a function space by starting with a “prio r guess” baseline regr ession function g 0 . Fir s t, wa v elet deco mp os ition is used to represent b oth the unknown regre s sion function g as well as the prior gues s g 0 . Then the prior uncertaint y of g rela tive to its dista nce 104 J.-F. Angers and M. Delamp ady from g 0 is sp eciﬁed in the form of a membership function which trans lates this distance into a mea sure of dis tance b etw een the corr esp onding w av elet c o eﬃcients. F urthermo re, a Bay esian test is prop osed to chec k whether the baseline function g 0 is compatible with the data or not. Ac kno wle dgment s. W e thank the editors and t w o referees for providing critical comments which hav e broug h t signiﬁcant improvem ents to our presentation. References [1] Abramovich, F. and Sa p a tinas, T. (1999 ). Bay esian appro ach to wav elet decomp osition and s hrink ag e. In Bayesian Infer enc e in Wavelet-Base d Mo d- els. L e ctur e N otes in S tatist. 141 (P . M ¨ uller and B. Vidako vic, eds.) 33 – 50. Springer, New Y o r k. MR16998 72 [2] Adler, R. J. (1981). The Ge ometry of R ando m Fields . Wiley , New Y ork. MR06118 57 [3] Angers, J.-F. and Delamp ady, M. (2001). B ayesian nonparametric reg res- sion using wa v elets. Sankhy¯ a Ser. B 63 28 7–308 . MR18970 44 [4] Angers, J.-F. and Delam p ady, M. (2004). F uzzy sets in hierarchical Bay es and robust Bay es inference . J. Statist. R ese ar ch 38 1–1 1. MR2 1252 5 9 [5] Ber ger, J. O. (1985) . S tatistic al De cision The ory and Bayesian Analysis . Springer, New Y o r k. MR08046 11 [6] DeR ober tis, L. and Har tigan, J. A. (1 981). Bay esian inference using in- terv als o f measures. Ann. Statist. 9 2 35–2 44. MR06066 09 [7] French, S. (198 6). De cisi on The ory . Ellis Hor woo d, Chichester, Engla nd. MR08493 27 [8] Ghosh, J. K., Delamp ady, M. and Samant a, T. (20 06). Intr o duction to Bayesian Analysis: The ory and Metho ds . Springer , New Y or k. MR22474 3 9 [9] Ghosh, J. K. and Ramamoor thi, R. V. (20 03). Bayesian Nonp ar ametrics . Springer, New Y o r k. MR19922 45 [10] Lenk, P. J. (198 8 ). The logis tic normal distribution for Bayesian, nonpara- metric, predictive densities. J. Amer. Statist. Asso c. 8 3 509 –516. MR09713 80 [11] Muirhead, R. J. (198 2 ). Asp e cts of Multivariate Statistic al The ory . Wiley , New Y or k. MR06529 32 [12] Singpur w alla, N ., Boo ker, J. M. and Bement, T. R. (2002). Probability theory . In F u zzy L o gic and Pr ob abi lity A pplic ations (T. J. Ross et al., eds.) 55– 71. SIAM, Philadelphia . MR19617 67 [13] Singpur w alla, N. and Booker, J. M. (2004). Membership functions and probability measur es o f fuzzy s ets (with discus s ion). J . Amer. St atist. Asso c. 99 867– 8 89. MR20909 19

Fuzzy sets in nonparametric Bayes regression

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment