Issues in Stacked Generalization


Authors: K. M. Ting, I. H. Witten

Journal of Articial In telligence Researc h 10 (1999) 271-289 Submitted 11/98; published 5/99 Issues in Stac k ed Generalization Kai Ming Ting kmting@deakin.edu.a u Scho ol of Computing and Mathematics De akin University, A ustr alia. Ian H. Witten ihw@cs.w aika to.a c.nz Dep artment of Computer Scienc e University of Waikato, New Ze aland. Abstract Stac k ed generalization is a general metho d of using a high-lev el mo del to com bine lo w er- lev el mo dels to ac hiev e greater predictiv e accuracy . In this pap er w e address t w o crucial issues whic h ha v e b een considered to b e a `blac k art' in classication tasks ev er since the in tro duction of stac k ed generalization in 1992 b y W olp ert: the t yp e of generalizer that is suitable to deriv e the higher-lev el mo del, and the kind of attributes that should b e used as its input. W e nd that b est results are obtained when the higher-lev el mo del com bines the condence (and not just the predictions) of the lo w er-lev el ones. W e demonstrate the eectiv eness of stac k ed generalization for com bining three dieren t t yp es of learning algorithms for classication tasks. W e also compare the p erformance of stac k ed generalization with ma jorit y v ote and published results of arcing and bagging. 1. In tro duction Stac k ed generalization is a w a y of com bining m ultiple mo dels that ha v e b een learned for a classication task (W olp ert, 1992), whic h has also b een used for regression (Breiman, 1996a) and ev en unsup ervised learning (Sm yth & W olp ert, 1997). T ypically , dieren t learning algorithms learn dieren t mo dels for the task at hand, and in the most common form of stac king the rst step is to collect the output of eac h mo del in to a new set of data. F or eac h instance in the original training set, this data set represen ts ev ery mo del's prediction of that instance's class, along with its true classication. 
During this step, care is taken to ensure that the models are formed from a batch of training data that does not include the instance in question, in just the same way as in ordinary cross-validation. The new data are treated as the data for another learning problem, and in the second step a learning algorithm is employed to solve this problem. In Wolpert's terminology, the original data and the models constructed from them in the first step are referred to as level-0 data and level-0 models, respectively, while the set of cross-validated data and the second-stage learning algorithm are referred to as the level-1 data and the level-1 generalizer.

In this paper, we show how to make stacked generalization work for classification tasks by addressing two crucial issues which Wolpert (1992) originally described as 'black art' and which have not been resolved since: (i) the type of attributes that should be used to form the level-1 data, and (ii) the type of level-1 generalizer needed to obtain improved accuracy from the stacked generalization method.

(c) 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

Breiman (1996a) demonstrated the success of stacked generalization in the setting of ordinary regression. The level-0 models are regression trees of different sizes or linear regressions using different numbers of variables. But instead of selecting the single model that works best as judged by (for example) cross-validation, Breiman used the different level-0 regressors' output values for each member of the training set to form the level-1 data. Then he used least-squares linear regression, under the constraint that all regression coefficients be non-negative, as the level-1 generalizer.
The non-negativity constraint turned out to be crucial to guarantee that the predictive accuracy would be better than that achieved by selecting the single best predictor.

Here we show how stacked generalization can be made to work reliably in classification tasks. We do this by using the output class probabilities generated by the level-0 models to form the level-1 data, and for the level-1 generalizer we use a version of least-squares linear regression adapted for classification tasks. We find the use of class probabilities to be crucial for the successful application of stacked generalization in classification tasks. However, the non-negativity constraints found necessary by Breiman in regression turn out to be irrelevant to improved predictive accuracy in our classification setting.

In Section 2, we formally introduce the technique of stacked generalization and describe pertinent details of each learning algorithm used in our experiments. Section 3 describes the results of stacking three different types of learning algorithms. Section 4 compares stacked generalization with arcing and bagging, two recent methods that employ sampling techniques to modify the data distribution in order to produce multiple models from a single learning algorithm. The following section describes related work, and the paper ends with a summary of our conclusions.

2. Stacked Generalization

Given a data set L = { (y_n, x_n), n = 1, ..., N }, where y_n is the class value and x_n is a vector representing the attribute values of the n-th instance, randomly split the data into J almost equal parts L_1, ..., L_J. Define L_j and L^(-j) = L - L_j to be the test and training sets for the j-th fold of a J-fold cross-validation.
Given K learning algorithms, which we call level-0 generalizers, invoke the k-th algorithm on the data in the training set L^(-j) to induce a model M_k^(-j), for k = 1, ..., K. These are called level-0 models. For each instance x_n in L_j, the test set for the j-th cross-validation fold, let z_kn denote the prediction of the model M_k^(-j) on x_n. At the end of the entire cross-validation process, the data set assembled from the outputs of the K models is

    L_CV = { (y_n, z_1n, ..., z_Kn), n = 1, ..., N }.

These are the level-1 data. Use some learning algorithm, which we call the level-1 generalizer, to derive from these data a model M̃ for y as a function of (z_1, ..., z_K). This is the level-1 model. Figure 1 illustrates the cross-validation process. To complete the training process, the final level-0 models M_k, k = 1, ..., K, are derived using all the data in L.

Now consider the classification process, which uses the models M_k, k = 1, ..., K, in conjunction with M̃. Given a new instance, the models M_k produce a vector (z_1, ..., z_K). This vector is input to the level-1 model M̃, whose output is the final classification result for that instance. This completes the stacked generalization method as proposed by Wolpert (1992) and also used by Breiman (1996a) and LeBlanc & Tibshirani (1993).

Figure 1: The J-fold cross-validation process at level 0; the level-1 data set L_CV assembled at the end of this process is used to produce the level-1 model M̃.

As well as the situation described above, which results in the level-1 model M̃, the present paper also considers a further situation in which the output from the level-0 models is a set of class probabilities rather than a single class prediction.
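The level-0 cross-validation loop described above can be sketched in a few lines. This is a minimal NumPy sketch, not the authors' implementation: the two toy learners and the function name are our own, and any object offering fit/predict would serve as a level-0 generalizer.

```python
import numpy as np

class MajorityClass:
    """Toy level-0 learner: always predicts the most frequent training class."""
    def fit(self, X, y):
        vals, counts = np.unique(y, return_counts=True)
        self.c = vals[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.c)

class OneNN:
    """Toy level-0 learner: predicts the class of the nearest training instance."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        return self.y[np.argmin(d, axis=1)]

def stack_level1_data(X, y, learners, J=10, seed=None):
    """Assemble L_CV: for each fold j, train every level-0 learner on L^(-j)
    and record its predictions z_kn on the held-out instances of L_j."""
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), J)
    Z = np.empty((N, len(learners)))
    for test_idx in folds:
        train = np.ones(N, dtype=bool)
        train[test_idx] = False
        for k, make in enumerate(learners):
            model = make().fit(X[train], y[train])
            Z[test_idx, k] = model.predict(X[test_idx])
    return Z  # level-1 attributes; the level-1 target is the original y
```

A level-1 generalizer is then trained on (Z, y), and the final level-0 models are refit on all of L before classifying new instances.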
If model M_k^(-j) is used to classify an instance x in L_j, let P_ki(x) denote the probability of the i-th output class, so that the vector

    P_kn = ( P_k1(x_n), ..., P_ki(x_n), ..., P_kI(x_n) )

gives the model's class probabilities for the n-th instance, assuming that there are I classes. As the level-1 data, assemble the class probability vectors from all K models, along with the actual class:

    L'_CV = { (y_n, P_1n, ..., P_kn, ..., P_Kn), n = 1, ..., N }.

Denote the level-1 model derived from these data as M̃' to contrast it with M̃.

The following two subsections describe the algorithms used as level-0 and level-1 generalizers in the experiments reported in Section 3.

2.1 Level-0 Generalizers

Three learning algorithms are used as level-0 generalizers: C4.5, a decision tree learning algorithm (Quinlan, 1993); NB, a re-implementation of a Naive Bayesian classifier (Cestnik, 1990); and IB1, a variant of a lazy learning algorithm (Aha, Kibler & Albert, 1991) which employs the p-nearest-neighbor method with a modified value-difference metric for nominal and binary attributes (Cost & Salzberg, 1993). For each of these learning algorithms we now give the formula used for the estimated output class probabilities P_i(x) for an instance x (where, in all cases, Σ_i P_i(x) = 1).

C4.5: Consider the leaf of the decision tree at which the instance x falls. Let m_i be the number of (training) instances with class i at this leaf, and suppose the majority class at the leaf is Î. Let E = Σ_{i≠Î} m_i. Then, using a Laplace estimator,

    P_Î(x) = 1 - (E + 1) / (Σ_i m_i + 2),
    P_i(x) = (1 - P_Î(x)) · m_i / E,   for i ≠ Î.

Note that only pruned trees and the default settings of C4.5 are used in our experiments.

NB: Let P(i|x) be the posterior probability of class i, given instance x.
Then

    P_i(x) = P(i|x) / Σ_i P(i|x).

Note that NB uses a Laplacian estimate of the conditional probabilities of each nominal attribute when computing P(i|x). For each continuous-valued attribute a normal distribution is assumed, in which case the conditional probabilities can be conveniently represented entirely in terms of the mean and variance of the observed values for each class.

IB1: Suppose p nearest neighbors are used; denote them by { (y_s, x_s), s = 1, ..., p } for instance x. (We use p = 3 in the experiments.) Then

    P_i(x) = ( Σ_{s=1}^p f(y_s) / d(x, x_s) ) / ( Σ_{s=1}^p 1 / d(x, x_s) ),

where f(y_s) = 1 if i = y_s and 0 otherwise, and d is the Euclidean distance function.

In all three learning algorithms, the predicted class of the level-0 model, given an instance x, is the Î for which P_Î(x) > P_i(x) for all i ≠ Î.

2.2 Level-1 Generalizers

We compare the effect of four different learning algorithms as the level-1 generalizer: C4.5, IB1 (using p = 21 nearest neighbors) [1], NB, and a multi-response linear regression algorithm, MLR. Only the last needs further explanation.

MLR is an adaptation of a least-squares linear regression algorithm that Breiman (1996a) used in regression settings. Any classification problem with real-valued attributes can be transformed into a multi-response regression problem. If the original classification problem has I classes, it is converted into I separate regression problems, where the problem for class ℓ has instances with responses equal to one when they have class ℓ and zero otherwise.

The input to MLR is level-1 data, and we need to consider the situation for the model M̃', where the attributes are probabilities, separately from that for the model M̃, where they are classes.
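The two closed-form probability estimates above (for C4.5 leaves and for IB1) translate directly into code. This is a sketch, with function names of our own choosing; the paper leaves two corner cases unspecified, so the even split at a pure leaf (where E = 0 makes the formula undefined) and the small eps guarding against zero distances are our own assumptions.

```python
import numpy as np

def c45_leaf_probs(counts):
    """Laplace-corrected class probabilities at a C4.5 leaf.
    counts[i] = number of training instances of class i at the leaf."""
    total = sum(counts)
    i_hat = max(range(len(counts)), key=counts.__getitem__)  # majority class
    E = total - counts[i_hat]               # instances not of the majority class
    p_hat = 1.0 - (E + 1) / (total + 2)     # Laplace estimate for the majority class
    if E:
        probs = [(1.0 - p_hat) * c / E for c in counts]
    else:  # pure leaf: the formula divides by E = 0, so split the remainder evenly
        probs = [(1.0 - p_hat) / (len(counts) - 1)] * len(counts)
    probs[i_hat] = p_hat
    return probs

def ib1_probs(x, X_train, y_train, n_classes, p=3, eps=1e-12):
    """Distance-weighted class probabilities from the p nearest neighbors."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(d)[:p]
    w = 1.0 / (d[nearest] + eps)                    # weights 1/d(x, x_s)
    probs = np.zeros(n_classes)
    for s, weight in zip(nearest, w):
        probs[y_train[s]] += weight                 # f(y_s) selects the class of x_s
    return probs / w.sum()
```

Both functions return vectors that sum to one, as the text requires of every level-0 generalizer.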
[1] A large p value is used following Wolpert's (1992) advice that "... it is reasonable that 'relatively global, smooth ...' level-1 generalizers should perform well."

In the former case, where the attributes are already real-valued, the linear regression for class ℓ is simply

    LR_ℓ(x) = Σ_{k=1}^K α_kℓ · P_kℓ(x).

In the latter case, the classes are unordered nominal attributes. We map them into binary values in the obvious way, setting P_kℓ(x) to 1 if the class of instance x is ℓ and to zero otherwise, and then use the above linear regression. The linear regression coefficients { α_kℓ } are chosen to minimize

    Σ_j Σ_{(y_n, x_n) ∈ L_j} ( y_n - Σ_k α_kℓ · P_kℓ^(-j)(x_n) )².

The coefficients { α_kℓ } are constrained to be non-negative, following Breiman's (1996a) discovery that this is necessary for the successful application of stacked generalization to regression problems. The non-negative-coefficient least-squares algorithm described by Lawson & Hanson (1995) is employed here to derive the linear regression for each class. We show later that, in fact, the non-negativity constraint is unnecessary in classification tasks.

With this in place, we can now describe the working of MLR. To classify a new instance x, compute LR_ℓ(x) for all I classes and assign the instance to the class ℓ with the greatest value [2]:

    LR_ℓ(x) > LR_ℓ'(x) for all ℓ' ≠ ℓ.

In the next section we investigate the stacking of C4.5, NB and IB1.

3. Stacking C4.5, NB and IB1

3.1 When Does Stacked Generalization Work?

The experiments in this section show that:

- for successful stacked generalization it is necessary to use output class probabilities rather than class predictions, that is, M̃' rather than M̃;
- among the four algorithms used, only MLR is suitable as the level-1 generalizer.
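The MLR generalizer of Section 2.2 can be sketched with SciPy's non-negative least-squares routine, which implements the Lawson & Hanson algorithm the paper cites. This is a sketch of the main configuration (no intercept, non-negative coefficients); the (N, K, I) array layout and the function names are our own conventions.

```python
import numpy as np
from scipy.optimize import nnls

def fit_mlr(P, y):
    """P has shape (N, K, I): P[n, k, l] is model k's probability of class l
    for instance n. The regression for class l uses only the K attributes
    P[:, :, l], matching LR_l(x) = sum_k alpha_kl * P_kl(x); its response is
    1 for instances of class l and 0 otherwise. Coefficients come from
    non-negative least squares (Lawson & Hanson)."""
    N, K, I = P.shape
    A = np.empty((K, I))
    for l in range(I):
        A[:, l], _ = nnls(P[:, :, l], (y == l).astype(float))
    return A

def predict_mlr(P, A):
    """Assign each instance to the class l with the greatest LR_l value."""
    return np.argmax(np.einsum('nkl,kl->nl', P, A), axis=1)
```

On a toy problem with one confident level-0 model and one uninformative one, the fitted weights concentrate on the confident model, which is exactly the interpretability property discussed below.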
W e use t w o articial datasets and eigh t real-w orld datasets from the UCI Rep ository of mac hine learning databases (Blak e, Keogh & Merz, 1998). Details of these are giv en in T able 1. F or the articial datasets|Led24 and W a v eform|eac h training dataset L of size 200 and 300, resp ectiv ely , is generated using a dieren t seed. The algorithms used for the exp erimen ts are then tested on a separate dataset of 5000 instances. Results are expressed as the a v erage error rate of ten rep etitions of this en tire pro cedure. F or the real-w orld datasets, W -fold cross-v alidation is p erformed. In eac h fold of this cross-v alidation, the training dataset is used as L , and the mo dels deriv ed are ev aluated 2. The pattern recognition comm unit y calls this t yp e of classier a line ar machine (Duda & Hart, 1973). 275 Ting & Witten Datasets # Samples # Classes # A ttr & T yp e Led24 200/5000 10 10N W a v eform 300/5000 3 40C Horse 368 2 3B+12N+7C Credit 690 2 4B+5N+6C V o w el 990 11 10C Euth yroid 3163 2 18B+7C Splice 3177 3 60N Abalone 4177 3 1N+7C Nettalk(s) 5438 5 7N Co ding 20000 2 15N N-nominal; B-binary; C-Con tin uous. T able 1: Details of the datasets used in the exp erimen t. on the test dataset. The result is expressed as the a v erage error rate of the W -fold cross- v alidation. Note that this cross-v alidation is used for ev aluation of the en tire pro cedure, whereas the J -fold cross-v alidation men tioned in Section 2 is the in ternal op eration of stac k ed generalization. Ho w ev er, b oth W and J are set to 10 in the exp erimen ts. In this section, w e presen t results of mo del com bination using lev el-1 mo dels ~ M and ~ M 0 , as w ell as a mo del selection metho d, emplo ying the same J -fold cross-v alidation pro- cedure. Note that the only dierence b et w een mo del com bination and mo del selection here is whether the lev el-1 learning is emplo y ed or not. 
Table 2 shows the average error rates, obtained using W-fold cross-validation, of C4.5, NB and IB1, and of BestCV, the best of the three selected using J-fold cross-validation. As expected, BestCV is almost always the classifier with the lowest error rate. [3]

Table 3 shows the results of stacked generalization using the level-1 model M̃, for which the level-1 data comprise the classifications generated by the level-0 models, and M̃', for which the level-1 data comprise the probabilities generated by the level-0 models. Results are shown for all four level-1 generalizers in each case, along with BestCV. Table 4 summarizes the results of Table 3 as a comparison of each level-1 model with BestCV, totaled over all datasets.

Clearly, the best level-1 model is M̃' derived using MLR. It performs better than BestCV on nine datasets and equally well on the tenth. The best-performing M̃ is derived from NB, which performs better than BestCV on seven datasets but significantly worse on two (Waveform and Vowel). We regard a difference of more than two standard errors as significant (95% confidence). The standard error figures are omitted from Table 3 to improve readability.

The datasets are shown in order of increasing size. MLR performs significantly better than BestCV on the four largest datasets. This indicates that stacked generalization is more likely to give significant improvements in predictive accuracy when the volume of data is large, a direct consequence of more accurate estimation using cross-validation.

[3] Note that BestCV does not always select the same classifier in all W folds. That is why its error rate is not always equal to the lowest error rate among the three classifiers.
Table 2: Average error rates of C4.5, NB and IB1, and of BestCV, the best among them selected using J-fold cross-validation. Standard errors are shown in the last column.

    Dataset      C4.5   NB     IB1    BestCV
    Led24        35.4   35.4   32.2   32.8 ± 0.6
    Waveform     31.8   17.1   26.2   17.1 ± 0.3
    Horse        15.8   17.9   15.8   17.1 ± 1.6
    Credit       17.4   17.3   28.1   17.4 ± 1.2
    Vowel        22.7   51.0    2.6    2.6 ± 0.2
    Euthyroid     1.9    9.8    8.6    1.9 ± 0.3
    Splice        5.5    4.5    4.7    4.5 ± 0.4
    Abalone      41.4   42.1   40.5   40.1 ± 0.6
    Nettalk(s)   17.0   15.9   12.7   12.7 ± 0.4
    Coding       27.6   28.8   25.0   25.0 ± 0.3

Table 3: Average error rates for stacking C4.5, NB and IB1.

    Dataset      BestCV |  Level-1 model M̃           |  Level-1 model M̃'
                        |  C4.5   NB     IB1    MLR   |  C4.5   NB     IB1    MLR
    Led24        32.8   |  34.0   32.4   35.0   33.3  |  41.7   35.7   32.1   31.3
    Waveform     17.1   |  17.7   19.2   18.7   17.2  |  20.6   17.6   17.8   16.8
    Horse        17.1   |  16.9   14.9   17.6   16.3  |  18.0   18.5   17.7   15.2
    Credit       17.4   |  18.4   16.1   16.9   17.4  |  15.4   15.9   14.3   16.2
    Vowel         2.6   |   2.6    3.8    3.6    2.6  |   2.7    7.2    3.3    2.5
    Euthyroid     1.9   |   1.9    1.9    1.9    1.9  |   2.2    4.3    2.0    1.9
    Splice        4.5   |   3.9    3.9    3.8    3.8  |   4.0    3.9    3.8    3.8
    Abalone      40.1   |  38.5   38.5   38.2   38.1  |  43.3   37.1   39.2   38.3
    Nettalk(s)   12.7   |  12.4   11.9   12.4   12.6  |  14.0   14.6   12.0   11.5
    Coding       25.0   |  23.2   23.1   23.2   23.2  |  22.3   21.2   21.2   20.7

Table 4: Summary of Table 3: #Win vs. #Loss of BestCV against each level-1 model, totaled over all datasets.

                      Level-1 model M̃          Level-1 model M̃'
                      C4.5  NB   IB1  MLR      C4.5  NB   IB1  MLR
    #Win vs. #Loss    3-5   2-7  4-5  2-5      7-3   6-4  4-6  0-9

When one of the level-0 models performs significantly better than the rest, as in the Euthyroid and Vowel datasets, MLR performs either as well as BestCV, by selecting the best-performing level-0 model, or better. MLR also has an advantage over the other three level-1 generalizers in that its model can easily be interpreted.
Examples of the combination weights it derives (for the probability-based model M̃') appear in Table 5 for the Horse, Credit, Splice, Abalone, Waveform, Led24 and Vowel datasets. The weights indicate the relative importance of the level-0 generalizers for each prediction class. For example, in the Splice dataset (Table 5(b)), NB is the dominant generalizer for predicting class 2, NB and IB1 are both good at predicting class 3, and all three generalizers make a worthwhile contribution to the prediction of class 1. In contrast, in the Abalone dataset all three generalizers contribute substantially to the prediction of all three classes. Note that the weights for each class do not sum to one, because no such constraint is imposed on MLR.

3.2 Are Non-negativity Constraints Necessary?

Both Breiman (1996a) and LeBlanc & Tibshirani (1993) use the stacked generalization method in a regression setting and report that it is necessary to constrain the regression coefficients to be non-negative in order to guarantee that stacked regression improves predictive accuracy. Here we investigate this finding in the domain of classification tasks. To assess the effect of the non-negativity constraint on performance, three versions of MLR are employed to derive the level-1 model M̃':

i. each linear regression in MLR is calculated with an intercept constant (that is, I + 1 weights for the I classes) but without any constraints;

ii. each linear regression is derived with neither an intercept constant (I weights for I classes) nor constraints;

iii. each linear regression is derived without an intercept constant, but with non-negativity constraints (I non-negative weights for I classes).

The third version is the one used for the results presented earlier. Table 6 shows the results of all three versions. They all have almost indistinguishable error rates.
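The three versions can be compared directly in code. This is a small sketch for a single class's regression, with a hypothetical function name: F is the matrix of level-1 attributes for that class (one column per level-0 model) and t is the 0/1 response.

```python
import numpy as np
from scipy.optimize import nnls

def fit_class_weights(F, t, version):
    """Fit one class's regression under the three versions of Section 3.2:
    (1) intercept, no constraints; (2) no intercept, no constraints;
    (3) no intercept, non-negative weights (Lawson-Hanson NNLS)."""
    if version == 1:
        A = np.hstack([np.ones((len(F), 1)), F])        # prepend intercept column
        return np.linalg.lstsq(A, t, rcond=None)[0]     # [beta0, beta1, ..., betaK]
    if version == 2:
        return np.linalg.lstsq(F, t, rcond=None)[0]
    return nnls(F, t)[0]                                # clamps weights at zero
```

Version (3) returns exact zeros where the unconstrained fit would go negative, which is what makes its weights easier to read in Table 7.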
We conclude that in classification tasks, non-negativity constraints are not necessary to guarantee that stacked generalization improves predictive accuracy.

However, there is another reason why employing non-negativity constraints is a good idea. Table 7 shows an example of the weights derived by the three versions of MLR on the Led24 dataset. The third version, shown in column (iii), supports a more perspicuous interpretation of each level-0 generalizer's contribution to the class predictions than the other two do. In this dataset, IB1 is the dominant generalizer in predicting classes 4, 5 and 8, and both NB and IB1 make a worthwhile contribution in predicting class 2, as evidenced by their high weights. However, the negative weights used in predicting these classes render the interpretation of the other two versions much less clear.

Table 5: (a) Weights generated by MLR (model M̃') for the Horse and Credit datasets.

             Horse                 Credit
    Class    C4.5   NB     IB1     C4.5   NB     IB1
    1        0.36   0.20   0.42    0.63   0.30   0.04
    2        0.39   0.19   0.41    0.65   0.28   0.07

    (In Table 7, β1 denotes the weight for C4.5, β2 for NB, and β3 for IB1.)

Table 5: (b) Weights generated by MLR (model M̃') for the Splice, Abalone and Waveform datasets.

             Splice                Abalone               Waveform
    Class    C4.5   NB     IB1     C4.5   NB     IB1     C4.5   NB     IB1
    1        0.23   0.43   0.36    0.25   0.25   0.39    0.16   0.59   0.34
    2        0.15   0.72   0.12    0.27   0.20   0.25    0.14   0.72   0.07
    3        0.08   0.52   0.40    0.30   0.18   0.39    0.04   0.65   0.23

Table 5: (c) Weights generated by MLR (model M̃') for the Led24 and Vowel datasets.

             Led24                 Vowel
    Class    C4.5   NB     IB1     C4.5   NB     IB1
    1        0.46   0.65   0.00    0.04   0.00   0.96
    2        0.00   0.37   0.43    0.03   0.00   0.97
    3        0.47   0.00   0.54    0.01   0.00   1.00
    4        0.00   0.13   0.65    0.05   0.25   0.86
    5        0.00   0.19   0.64    0.01   0.08   0.97
    6        0.35   0.14   0.35    0.15   0.00   0.92
    7        0.15   0.43   0.36    0.03   0.01   1.02
    8        0.00   0.00   0.68    0.04   0.01   0.96
    9        0.00   0.38   0.29    0.03   0.00   1.02
    10       0.00   0.50   0.24    0.08   0.01   0.93
    11       -      -      -       0.00   0.04   0.96
Table 6: Average error rates of three versions of MLR.

    Dataset      No Constraints   No Intercept   Non-Negativity
    Led24        31.4             31.4           31.3
    Waveform     16.8             16.8           16.8
    Horse        15.8             15.8           15.2
    Credit       16.2             16.2           16.2
    Vowel         2.4              2.4            2.5
    Euthyroid     1.9              1.9            1.9
    Splice        3.7              3.8            3.8
    Abalone      38.3             38.3           38.3
    Nettalk(s)   11.6             11.5           11.5
    Coding       20.7             20.7           20.7

Table 7: Weights generated by three versions of MLR for the Led24 dataset: (i) no constraints, (ii) no intercept, and (iii) non-negativity constraints.

             (i)                           (ii)                   (iii)
    Class    β0     β1     β2     β3       β1     β2     β3       β1     β2     β3
    1        0.00   0.45   0.65   0.00     0.46   0.65   0.00     0.46   0.65   0.00
    2        0.02  -0.42   0.47   0.56    -0.40   0.49   0.56     0.00   0.37   0.43
    3        0.00   0.46  -0.01   0.54     0.47  -0.01   0.54     0.47   0.00   0.54
    4        0.04  -0.33   0.15   0.84    -0.29   0.21   0.81     0.00   0.13   0.65
    5        0.03  -0.37   0.26   0.84    -0.32   0.26   0.84     0.00   0.19   0.64
    6        0.01   0.35   0.12   0.35     0.36   0.14   0.35     0.35   0.14   0.35
    7        0.01   0.15   0.43   0.36     0.15   0.43   0.36     0.15   0.43   0.36
    8        0.02  -0.05  -0.25   0.72    -0.03  -0.19   0.72     0.00   0.00   0.68
    9        0.04  -0.08   0.32   0.32    -0.05   0.40   0.30     0.00   0.38   0.29
    10       0.04  -0.06   0.43   0.25    -0.01   0.50   0.24     0.00   0.50   0.24

Table 8: Average error rates of BestCV, Majority Vote and MLR (model M̃'), along with the number of standard errors (#SE) between BestCV and the worst-performing level-0 generalizer.

    Dataset      #SE     BestCV   Majority   MLR
    Horse         0.5    17.1     15.0       15.2
    Splice        2.5     4.5      4.0        3.8
    Abalone       3.3    40.1     39.0       38.3
    Led24         8.7    32.8     31.8       31.3
    Credit        8.9    17.4     16.1       16.2
    Nettalk(s)   10.8    12.7     12.2       11.5
    Coding       12.7    25.0     23.1       20.7
    Waveform     18.7    17.1     19.5       16.8
    Euthyroid    26.3     1.9      8.1        1.9
    Vowel       242.0     2.6     13.0        2.5

3.3 How Does Stacked Generalization Compare To Majority Vote?

Let us now compare the error rate of M̃', derived using MLR, with that of majority vote, a simple decision combination method which requires neither cross-validation nor level-1 learning.
Table 8 shows the average error rates of BestCV, majority vote and MLR. In order to see whether the relative performance of the level-0 generalizers has any effect on these methods, the number of standard errors (#SE) between the error rates of the worst-performing level-0 generalizer and BestCV is given, and the datasets are re-ordered according to this measure. Since BestCV almost always selects the best-performing level-0 generalizer, small values of #SE indicate that the level-0 generalizers perform comparably to one another, and vice versa.

MLR compares favorably with majority vote, with eight wins versus two losses. Of the eight wins, six are significant differences (the two exceptions are the Splice and Led24 datasets), whereas both losses (on the Horse and Credit datasets) are insignificant. Thus the extra computation for cross-validation and level-1 learning seems to have paid off.

It is interesting to note that the performance of majority vote is related to the size of #SE. Majority vote compares favorably with BestCV on the first seven datasets, where the values of #SE are small. On the last three, where #SE is large, majority vote performs worse. This indicates that if the level-0 generalizers perform comparably, it is not worth using cross-validation to determine the best one, because the result of majority vote, which is far cheaper, is not significantly different. Although small values of #SE are a necessary condition for majority vote to rival BestCV, they are not a sufficient condition; see Matan (1996) for an example. The same applies when majority vote is compared with MLR: MLR performs significantly better on the five datasets that have large #SE values, but in only one of the other cases.
Table 9: M̃ versus M̃' for each level-1 generalizer (#Win vs. #Loss, summarized from Table 3).

                      C4.5   NB    IB1   MLR
    #Win vs. #Loss    8-2    5-4   3-6   1-7

It is worth mentioning a method that averages P_i(x) for each i over all level-0 models, yielding P̄_i(x), and then predicts the class Î for which P̄_Î(x) > P̄_i(x) for all i ≠ Î. According to Breiman (1996b), this method produces an error rate almost identical to that of majority vote.

3.4 Why Does Stacked Generalization Work Best With M̃' Generated From MLR?

We have shown that stacked generalization works best when output class probabilities (rather than class predictions) are used with the MLR algorithm (rather than C4.5, IB1 or NB). In retrospect this is not surprising, and it can be explained intuitively as follows. The level-1 model should provide a simple way of combining all the evidence available. This evidence includes not just the predictions but the confidence of each level-0 model in its predictions. A linear combination is the simplest way of pooling the level-0 models' confidence, and MLR provides just that.

The alternative methods, NB, C4.5 and IB1, each have shortcomings. A Bayesian approach could form the basis of a suitable alternative way of pooling the level-0 models' confidence, but the independence assumption central to Naive Bayes hampers its performance on some datasets, because the evidence provided by the individual level-0 models is certainly not independent. C4.5 builds trees that can model interactions amongst attributes, particularly when the tree is large, but this is not desirable for combining confidences. Nearest-neighbor methods do not really provide a way of combining confidences; moreover, the similarity metric employed could misleadingly judge two different sets of confidence levels to be similar.
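Incidentally, the probability-averaging combiner mentioned at the end of Section 3.3 is a one-liner once the level-0 probabilities are collected in an array; the (N, K, I) layout is our own convention, as is the function name.

```python
import numpy as np

def average_vote(P):
    """P[n, k, i]: model k's probability of class i for instance n.
    Average over the K models, then predict the class with the largest mean."""
    return np.argmax(P.mean(axis=1), axis=1)
```

Unlike MLR, this combiner has no trained weights, which is why (per Breiman, 1996b) it behaves much like majority vote.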
Table 9 summarizes the results in Table 3 by comparing M̃ with M̃' for each level-1 generalizer, across all datasets. C4.5 is clearly better off with a label-based representation, because discretizing continuous-valued attributes creates intra-attribute interaction in addition to interactions between different attributes. The evidence from Table 9 is that NB is indifferent to the use of labels or confidences; the normal distribution assumption that it embodies in the latter case could be another reason why it is unsuitable for combining confidence measures. Both MLR and IB1 handle continuous-valued attributes better than label-based ones, since this is the domain in which they are designed to work.

Summary

We summarize the findings of this section as follows.

- None of the four learning algorithms used to obtain model M̃ performs satisfactorily.
- MLR is the best of the four learning algorithms to use as the level-1 generalizer for obtaining the model M̃'.
- When obtained using MLR, M̃' has a lower predictive error rate than the best model selected by J-fold cross-validation, for almost all datasets used in the experiments.
- Another advantage of MLR over the other three level-1 generalizers is its interpretability: the weights α_kℓ indicate the different contributions that each level-0 model k makes to each prediction class ℓ.
- Model M̃' can be derived by MLR with or without non-negativity constraints; such constraints make little difference to the model's predictive accuracy.
- The use of non-negativity constraints in MLR nevertheless has the advantage of interpretability: non-negative weights α_kℓ support easier interpretation of the extent to which each model contributes to each prediction class.
- When derived using MLR, model M̃' compares favorably with majority vote.
- MLR provides a method of combining the confidence generated by the level-0 models into a final decision. For various reasons, NB, C4.5 and IB1 are not suitable for this task.

4. Comparison With Arcing And Bagging

This section compares the results of stacking C4.5, NB and IB1 with the results of arcing (called boosting by its originator, Schapire, 1990) and bagging reported by Breiman (1996b; 1996c). Both arcing and bagging employ sampling techniques to modify the data distribution in order to produce multiple models from a single learning algorithm. To combine the decisions of the individual models, arcing uses a weighted majority vote and bagging an unweighted one. Breiman reports that both arcing and bagging can substantially improve the predictive accuracy of a single model derived using a base learning algorithm.

4.1 Experimental Results

First we describe the differences between the experimental procedures. Our results for stacking are averaged over ten-fold cross-validation for all datasets except Waveform, which is averaged over ten repeated trials; standard errors are also shown. Results for arcing and bagging are those obtained by Breiman (1996b; 1996c), which are averaged over 100 trials. In Breiman's experiments, each trial uses a random 9:1 split to form the training and test sets for all datasets except Waveform. Also note that the Waveform dataset we used has 19 irrelevant attributes, whereas Breiman used a version without irrelevant attributes (their presence would be expected to degrade the performance of the level-0 generalizers in our experiments). In both cases 300 training instances were used for this dataset, but we used 5000 test instances whereas Breiman used 1800. Arcing and bagging are done with 50 decision tree models derived from CART (Breiman et al., 1984) in each trial.
Dataset        #Samples   stacking       arcing   bagging
Waveform          300     16.8 ± 0.2      17.8     19.3
Glass             214     28.4 ± 2.9      22.0     23.2
Ionosphere        351      9.7 ± 1.5       6.4      7.9
Soybean           683      4.3 ± 1.1       5.8      6.8
Breast Cancer     699      2.7 ± 0.8       3.2      3.7
Diabetes          768     24.2 ± 1.2      26.6     23.9

Table 10: Comparing stacking with arcing and bagging classifiers.

The results on six datasets are given in Table 10, and indicate that the three methods are very competitive.⁴ Stacking performs better than both arcing and bagging on three datasets (Waveform, Soybean and Breast Cancer), and is better than arcing but worse than bagging on the Diabetes dataset. Note that stacking performs very poorly on Glass and Ionosphere, two small real-world datasets. This is not surprising, because cross-validation inevitably produces poor estimates for small datasets.

4.2 Discussion

Like bagging, stacking is ideal for parallel computation: the construction of each level-0 model proceeds independently, no communication with the other modeling processes being necessary. Arcing and bagging require a considerable number of member models because they rely on varying the data distribution to get a diverse set of models from a single learning algorithm. Using a level-1 generalizer, stacking can work with only two or three level-0 models. Suppose the computation time required for a learning algorithm is C, and arcing or bagging needs h models. The learning time required is T_a = hC. Suppose stacking requires g models and each model employs J-fold cross-validation. Assuming that time C is needed to derive each of the g level-0 models and the level-1 model, the learning time for stacking is T_s = (g(J + 1) + 1)C. For the results given in Table 10, h = 50, J = 10, and g = 3; thus T_a = 50C and T_s = 34C. However, in practice the learning times required by the level-0 and level-1 generalizers may differ.
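The timing formulas above can be checked mechanically. A throwaway sketch with our own function names, expressing times in units of C:

```python
def arcing_or_bagging_time(h, C=1.0):
    # T_a = h * C: h member models, each costing C to train.
    return h * C

def stacking_time(g, J, C=1.0):
    # T_s = (g * (J + 1) + 1) * C: each of the g level-0 models is
    # trained J times during cross-validation plus once on the full
    # data, and the level-1 model adds one more training run.
    return (g * (J + 1) + 1) * C

print(arcing_or_bagging_time(h=50))  # 50.0, i.e. T_a = 50C
print(stacking_time(g=3, J=10))      # 34.0, i.e. T_s = 34C
```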
Users of stacking have a free choice of level-0 models. They may either be derived from a single learning algorithm, or from a variety of different algorithms. The example in Section 3 uses different types of learning algorithms, while bag-stacking, that is, stacking bagged models (Ting & Witten, 1997), uses data variation to obtain a diverse set of models from a single learning algorithm. In the former case, performance may vary substantially between the level-0 models; for example, NB performs very poorly on the Vowel and Euthyroid datasets compared to the other two models (see Table 2). Stacking copes well with this situation. The performance variation among the member models in bagging is rather small because they are derived from the same learning algorithm using bootstrap samples. Section 3.3 shows that a small performance variation among member models is a necessary condition for majority vote (as employed by bagging) to work well.

It is worth noting that arcing and bagging can be incorporated into the framework of stacked generalization by using arced or bagged models as level-0 models. Ting & Witten (1997) show one possible way of incorporating bagged models with level-1 learning, employing MLR instead of voting. In this implementation, L is used as a test set for each of the bagged models to derive the level-1 data, rather than the cross-validated data. This is viable because each bootstrap sample leaves out about 37% of the examples. Ting & Witten (1997) show that bag-stacking almost always has higher predictive accuracy than bagging models derived from either C4.5 or NB.

4. The heart dataset used by Breiman (1996b, 1996c) is omitted because it was very much modified from the original one.
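The figure of about 37% follows from the bootstrap itself: a sample of n draws with replacement misses any given instance with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368. A quick empirical check (our own sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Draw a bootstrap sample: n indices sampled with replacement.
boot = rng.integers(0, n, size=n)
# Instances never drawn are "out of bag"; in bag-stacking these
# left-out instances can serve as level-1 data for the bagged model.
out_of_bag = np.setdiff1d(np.arange(n), boot)
print(len(out_of_bag) / n)  # close to 1/e, roughly 0.37
```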
Note that the only difference here is whether an adaptive level-1 model or a simple majority vote is employed.

According to Breiman (1996b, 1996c), arcing and bagging can only improve the predictive accuracy of learning algorithms that are `unstable'.⁵ An unstable learning algorithm is one for which small perturbations in the training set can produce large changes in the derived model. Decision trees and neural networks are unstable; NB and IB1 are stable. Stacking works with both.

While MLR is the most successful candidate for level-1 learning that we have found, other algorithms might work equally well. One candidate is neural networks. However, we have experimented with back-propagation neural networks for this purpose and found that they train much more slowly than MLR. For example, on the nettalk dataset MLR took only 2.9 seconds compared to 4790 seconds for the neural network, while both achieve the same error rate. Other possible candidates are the multinomial logit model (Jordan & Jacobs, 1994), which is a special case of generalized linear models (McCullagh & Nelder, 1983), and the supra-Bayesian procedure (Jacobs, 1995), which treats the level-0 models' confidences as data that may be combined with a prior distribution over the level-0 models via Bayes' rule.

5. Related Work

Our analysis of stacked generalization was motivated by that of Breiman (1996a), discussed earlier, and LeBlanc & Tibshirani (1993). LeBlanc & Tibshirani (1993) examine the stacking of a linear discriminant and a nearest neighbor classifier and show that, for one artificial dataset, a method similar to MLR performs better with non-negativity constraints than without. Our results in Section 3.2 show that these constraints are irrelevant to MLR's predictive accuracy in the classification setting.
LeBlanc & Tibshirani (1993) and Ting & Witten (1997) use a version of MLR that employs all class probabilities from each level-0 model to induce each linear regression. In this case, the linear regression for class ℓ is

    LR_ℓ(x) = Σ_{k=1}^{K} Σ_{i=1}^{I} α_{kiℓ} P_{ki}(x).

This implementation requires fitting K × I parameters, as compared to K parameters for the version used in this paper (see the corresponding formula in Section 2.2). Both versions give comparable results in terms of predictive accuracy, but the version used in this paper runs considerably faster because it needs to fit fewer parameters.

The limitations of MLR are well known (Duda & Hart, 1973). For an I-class problem, it divides the description space into I convex decision regions. Every region must be singly connected, and the decision boundaries are linear hyperplanes. This means that MLR is most suitable for problems with unimodal probability densities. Despite these limitations, MLR still performs better as a level-1 generalizer than IB1, its nearest competitor in deriving ~M'. These limitations may hold the key to a fuller understanding of the behavior of stacked generalization. Jacobs (1995) reviews linear combination methods like that used in MLR.

Previous work on stacked generalization, especially as applied to classification tasks, has been limited in several ways. Some only applies to a particular dataset (e.g., Zhang, Mesirov & Waltz, 1992). Others report results that are less than convincing (Merz, 1995). Still others have a different focus and evaluate the results on just a few datasets (LeBlanc & Tibshirani, 1993; Chan & Stolfo, 1995; Kim & Bartlett, 1995; Fan et al., 1996).

5. Schapire, R.E., Y. Freund, P. Bartlett & W.S. Lee (1997) provide an alternative explanation for the effectiveness of arcing and bagging.
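To make the two parameter counts concrete, here is an illustrative sketch on synthetic probabilities; the shapes and variable names are ours, not the original implementation:

```python
import numpy as np

K, I, n = 3, 4, 200  # 3 level-0 models, 4 classes, 200 instances
rng = np.random.default_rng(2)
# P[j, k, i]: level-0 model k's probability for class i on instance j.
P = rng.dirichlet(np.ones(I), size=(n, K))
y = rng.integers(0, I, size=n)
Y = np.eye(I)[y]  # 0/1 indicator targets, one column per class

# Full variant (LeBlanc & Tibshirani, 1993; Ting & Witten, 1997):
# every LR_ell sees all K*I probabilities, so K*I coefficients
# per class regression.
X_full = P.reshape(n, K * I)
W_full, *_ = np.linalg.lstsq(X_full, Y, rcond=None)

# Variant used in the paper: LR_ell sees only each model's probability
# for class ell itself, so just K coefficients per class regression.
W_small = np.zeros((K, I))
for ell in range(I):
    w, *_ = np.linalg.lstsq(P[:, :, ell], Y[:, ell], rcond=None)
    W_small[:, ell] = w

print(W_full.size, W_small.size)  # 48 vs 12 fitted parameters in total
```

Over all I classes the full variant fits K × I × I coefficients (48 here) against K × I (12) for the paper's version, which is why the latter runs faster.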
One might consider a degenerate form of stacked generalization that does not use cross-validation to produce data for level-1 learning. Then, level-1 learning can be done `on the fly' during the training process (Jacobs et al., 1991). In another approach, level-1 learning takes place in batch mode, after all level-0 models are derived (Ho et al., 1994). Several researchers have worked on a still more degenerate form of stacked generalization without any cross-validation or learning at level 1. Examples are neural network ensembles (Hansen & Salamon, 1990; Perrone & Cooper, 1993; Krogh & Vedelsby, 1995), multiple decision tree combination (Kwok & Carter, 1990; Buntine, 1991; Oliver & Hand, 1995), and multiple rule combination (Kononenko & Kovačič, 1992). The methods used at level 1 are majority voting, weighted averaging and Bayesian combination. Other possible methods are distribution summation and likelihood combination. There are various forms of re-ordering class rank, and Ali & Pazzani (1996) study some of these methods for a rule learner. Ting (1996) uses the confidence of each prediction to combine a nearest neighbor classifier and a Naive Bayesian classifier.

6. Conclusions

We have addressed two crucial issues for the successful implementation of stacked generalization in classification tasks. First, class probabilities should be used instead of the single predicted class as input attributes for higher-level learning; the class probabilities serve as a confidence measure for the prediction made. Second, the multi-response least-squares linear regression technique should be employed as the high-level generalizer. This technique provides a method of combining the level-0 models' confidences. The other three learning algorithms either have algorithmic limitations or are not suitable for aggregating confidences.
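For contrast with these degenerate forms, the standard cross-validated construction of level-1 data can be sketched as follows. `MeanModel` is a toy stand-in for a real level-0 learner, and the deterministic fold split is deliberately simplistic; none of this is the authors' code:

```python
import numpy as np

class MeanModel:
    """Toy two-class level-0 learner: its 'confidence' for each class
    is based on distance to that class's mean (illustration only)."""
    def fit(self, X, y):
        self.m0, self.m1 = X[y == 0].mean(0), X[y == 1].mean(0)
        return self
    def predict_proba(self, X):
        d0 = np.linalg.norm(X - self.m0, axis=1)
        d1 = np.linalg.norm(X - self.m1, axis=1)
        p1 = d0 / (d0 + d1 + 1e-12)
        return np.column_stack([1 - p1, p1])

def level1_data(make_models, X, y, J=5):
    """J-fold cross-validation: each instance's level-1 attributes are
    the confidences of models trained WITHOUT that instance's fold."""
    n = len(y)
    folds = np.arange(n) % J  # simplistic deterministic fold assignment
    blocks = []
    for make in make_models:
        probs = np.zeros((n, 2))
        for j in range(J):
            train, test = folds != j, folds == j
            probs[test] = make().fit(X[train], y[train]).predict_proba(X[test])
        blocks.append(probs)
    return np.hstack(blocks)  # one block of confidences per level-0 model

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = (X[:, 0] > 0).astype(int)
Z = level1_data([MeanModel, MeanModel], X, y)
print(Z.shape)  # (60, 4): two models, two class confidences each
```

The level-1 generalizer (MLR, in the paper) is then trained on Z with the true classes y as targets.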
When combining three different types of learning algorithms, this implementation of stacked generalization was found to achieve better predictive accuracy than both model selection based on cross-validation and majority vote; it was also found to be competitive with arcing and bagging. Unlike stacked regression, non-negativity constraints in the least-squares regression are not necessary to guarantee improved predictive accuracy in classification tasks. However, these constraints are still preferred because they increase the interpretability of the level-1 model.

The implication of our successful implementation of stacked generalization is that earlier model combination methods employing (weighted) majority vote, averaging, or other computations that make no use of level-1 learning can now apply such learning to improve their predictive accuracy.

Acknowledgment

The authors are grateful to the New Zealand Marsden Fund for financial support for this research. This work was conducted while the first author was in the Department of Computer Science, University of Waikato. The authors are grateful to J. Ross Quinlan for providing C4.5 and to David W. Aha for providing IB1. The anonymous reviewers and the editor provided many helpful comments.

References

Aha, D.W., D. Kibler & M.K. Albert (1991). Instance-Based Learning Algorithms. Machine Learning, 6, pp. 37-66.

Ali, K.M. & M.J. Pazzani (1996). Error Reduction through Learning Multiple Descriptions. Machine Learning, Vol. 24, No. 3, pp. 173-206.

Blake, C., E. Keogh & C.J. Merz (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

Breiman, L. (1996a). Stacked Regressions. Machine Learning, Vol. 24, pp. 49-64.

Breiman, L.
(1996b). Bagging Predictors. Machine Learning, Vol. 24, No. 2, pp. 123-140.

Breiman, L. (1996c). Bias, Variance, and Arcing Classifiers. Technical Report 460. Department of Statistics, University of California, Berkeley, CA.

Breiman, L., J.H. Friedman, R.A. Olshen & C.J. Stone (1984). Classification And Regression Trees. Belmont, CA: Wadsworth.

Cestnik, B. (1990). Estimating Probabilities: A Crucial Task in Machine Learning. In Proceedings of the European Conference on Artificial Intelligence, pp. 147-149.

Chan, P.K. & S.J. Stolfo (1995). A Comparative Evaluation of Voting and Meta-learning on Partitioned Data. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 90-98, Morgan Kaufmann.

Cost, S. & S. Salzberg (1993). A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10, pp. 57-78.

Fan, D.W., P.K. Chan & S.J. Stolfo (1996). A Comparative Evaluation of Combiner and Stacked Generalization. In Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models, pp. 40-46.

Hansen, L.K. & P. Salamon (1990). Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, pp. 993-1001.

Ho, T.K., J.J. Hull & S.N. Srihari (1994). Decision Combination in Multiple Classifier Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 1, pp. 66-75.

Jacobs, R.A. (1995). Methods of Combining Experts' Probability Assessments. Neural Computation, 7, pp. 867-888, MIT Press.

Jacobs, R.A., M.I. Jordan, S.J. Nowlan & G.E. Hinton (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3, pp. 79-87.

Jordan, M.I. & R.A. Jacobs (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6, pp. 181-214.

Kim, K. & E.B. Bartlett (1995).
Error Estimation by Series Association for Neural Network Systems. Neural Computation, 7, pp. 799-808, MIT Press.

Kononenko, I. & M. Kovačič (1992). Learning as Optimization: Stochastic Generation of Multiple Knowledge. In Proceedings of the Ninth International Conference on Machine Learning, pp. 257-262, Morgan Kaufmann.

Krogh, A. & J. Vedelsby (1995). Neural Network Ensembles, Cross Validation, and Active Learning. In Advances in Neural Information Processing Systems 7, G. Tesauro, D.S. Touretzky & T.K. Leen (Editors), pp. 231-238, MIT Press.

Kwok, S. & C. Carter (1990). Multiple Decision Trees. In Uncertainty in Artificial Intelligence 4, R. Shachter, T. Levitt, L. Kanal & J. Lemmer (Editors), pp. 327-335, North-Holland.

Lawson, C.L. & R.J. Hanson (1995). Solving Least Squares Problems. SIAM Publications.

LeBlanc, M. & R. Tibshirani (1993). Combining Estimates in Regression and Classification. Technical Report 9318. Department of Statistics, University of Toronto.

Matan, O. (1996). On Voting Ensembles of Classifiers (extended abstract). In Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models, pp. 84-88.

McCullagh, P. & J.A. Nelder (1983). Generalized Linear Models. London: Chapman and Hall.

Merz, C.J. (1995). Dynamic Learning Bias Selection. In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, pp. 386-395.

Oliver, J.J. & D.J. Hand (1995). On Pruning and Averaging Decision Trees. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 430-437, Morgan Kaufmann.

Perrone, M.P. & L.N. Cooper (1993). When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. In Artificial Neural Networks for Speech and Vision, R.J. Mammone (Editor), Chapman-Hall.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Schapire, R.E. (1990). The Strength of Weak Learnability. Machine Learning, 5, pp. 197-227, Kluwer Academic Publishers.

Schapire, R.E., Y. Freund, P. Bartlett & W.S. Lee (1997). Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 322-330, Morgan Kaufmann.

Smyth, P. & D. Wolpert (1997). Stacked Density Estimation. In Advances in Neural Information Processing Systems.

Ting, K.M. (1996). The Characterisation of Predictive Accuracy and Decision Combination. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 498-506, Morgan Kaufmann.

Ting, K.M. & I.H. Witten (1997). Stacking Bagged and Dagged Models. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 367-375, Morgan Kaufmann.

Weiss, S.M. & C.A. Kulikowski (1991). Computer Systems That Learn. Morgan Kaufmann.

Wolpert, D.H. (1992). Stacked Generalization. Neural Networks, Vol. 5, pp. 241-259, Pergamon Press.

Zhang, X., J.P. Mesirov & D.L. Waltz (1992). Hybrid System for Protein Secondary Structure Prediction. Journal of Molecular Biology, 225, pp. 1049-1063.
