Analyzing Error Sources in Global Feature Effect Estimation
Timo Heiß^{1,2} [0009-0002-0392-4308], Coco Bögel^{1,2} [0009-0002-0683-4957], Bernd Bischl^{1,2} [0000-0001-6002-6980], and Giuseppe Casalicchio^{1,2} [0000-0001-5324-5966]

1 LMU Munich, Munich, Germany, {timo.heiss,giuseppe.casalicchio}@stat.uni-muenchen.de
2 Munich Center for Machine Learning (MCML)

Abstract. Global feature effects such as partial dependence (PD) and accumulated local effects (ALE) plots are widely used to interpret black-box models. However, they are only estimates of true underlying effects, and their reliability depends on multiple sources of error. Despite the popularity of global feature effects, these error sources are largely unexplored. In particular, the practically relevant question of whether to use training or holdout data to estimate feature effects remains unanswered. We address this gap by providing a systematic, estimator-level analysis that disentangles sources of bias and variance for PD and ALE. To this end, we derive a mean-squared-error decomposition that separates model bias, estimation bias, model variance, and estimation variance, and analyze their dependence on model characteristics, data selection, and sample size. We validate our theoretical findings through an extensive simulation study across multiple data-generating processes, learners, estimation strategies (training data, validation data, and cross-validation), and sample sizes. Our results reveal that, while using holdout data is theoretically the cleanest, potential biases arising from the training data are empirically negligible and dominated by the impact of the usually higher sample size. The estimation variance depends on both the presence of interactions and the sample size, with ALE being particularly sensitive to the latter.
Cross-validation-based estimation is a promising approach that reduces the model variance component, particularly for overfitting models. Our analysis provides a principled explanation of the sources of error in feature effect estimates and offers concrete guidance on choosing estimation strategies when interpreting machine learning models.

Keywords: Interpretable Machine Learning · Feature Effects · Partial Dependence Plot · Accumulated Local Effects

1 Introduction

Many machine learning models are black boxes whose internal structure is not, or only partly, compatible with human reasoning, complicating explanations of both individual predictions and overall model behavior. This lack of transparency is especially problematic in high-stakes domains such as healthcare, law, and finance, where decisions must be transparent. To address this challenge, the field of eXplainable AI (XAI) has proposed a wide range of methods to explain machine learning models [16]. However, these methods must be used correctly to avoid misleading conclusions, as there are many pitfalls to be aware of [18].

Global feature effect methods such as partial dependence (PD) [7] and accumulated local effects (ALE) [1] visualize how one or more features affect predictions. In practice, they are estimated from finite data, and their reliability depends on various error sources. Despite their widespread adoption, the error components of feature effect estimates remain largely unexplored. Prior works focus on extrapolation under feature dependence [1,9], aggregation bias [9,10,12,13], or quantifying uncertainty [1,5,19]. A formal bias–variance decomposition w.r.t. a true underlying effect has only been derived for the theoretical PD [17], leaving estimator-level errors introduced by finite data largely unaddressed.

A related practical question is whether to estimate explanations using training or holdout data.
This has been studied for methods like permutation feature importance (PFI) [18], mean decrease in impurity (MDI), and SHAP [14], but remains open for PD and ALE. While most works compute feature effects on training data [1,7,11,16], other works use holdout data [17]. Practitioners still debate whether to estimate PD and ALE on training or holdout data (see §2), trading larger training sample sizes against potential overfitting bias. These open questions point to a lack of an estimator-level understanding of error in global feature effect estimation, which our work addresses. Our main contributions are:

– We provide the first estimator-level analysis of PD and ALE, deriving a full mean squared error (MSE) decomposition that separates model bias, estimation bias, model variance, and estimation variance.
– We theoretically analyze these components, showing how sample size and interactions affect estimation bias and variance differently for PD and ALE, and formally relating the remaining bias and variance to model bias and variance.
– We empirically validate our findings in an extensive simulation study across multiple data-generating processes, learners, sample sizes, and estimation strategies (training, validation, and cross-validation (CV)), using dedicated estimators for the error components. We find negligible bias differences between training and holdout data, a strong sample-size effect – especially for ALE – and that CV is often preferable due to variance reduction.

2 Related Work

Many issues in feature effects have been analyzed. The PD plot [7] is known to suffer from extrapolation under dependent features by evaluating the model on implausible feature combinations [18]. ALE plots avoid exactly that issue [1]. Another issue is aggregation bias: global effects can obscure interaction-induced heterogeneity.
Individual conditional expectation (ICE) curves [10] and RHALE [9] visualize this heterogeneity, while regional methods like REPID [12] and GADGET [13] report effects in regions with reduced heterogeneity. Several works quantify uncertainty via the variance of feature effects, with model-specific PD confidence bands existing for probabilistic models [19], as well as model-agnostic approaches that consider PD or ALE across multiple model fits [5,17,1]. Only a few works explicitly study bias–variance trade-offs in feature effect estimators. RHALE [9] optimizes ALE binning to balance bias and variance but does not relate this trade-off to a ground-truth effect. In [8], bias and variance of the proposed DALE estimator w.r.t. ALE are analyzed. In [3], variance reduction and consistency of their proposed ALE estimator A2D2E are shown, and the errors of PD, ALE, and A2D2E estimators against a derivative-based ground truth are compared in simulations. Moreover, [17] formalize PD as a statistical estimator of a target estimand, derive a formal MSE decomposition for the theoretical PD, and propose variance estimators for confidence intervals. In contrast, we provide the first full estimator-level MSE decomposition of empirical PD and ALE w.r.t. their corresponding ground-truth feature effects, and analyze the error and its components both theoretically and empirically.

For many interpretability methods, computing explanations on training vs. holdout data can affect conclusions: it matters for loss-based methods like permutation feature importance (PFI) [18] and for mean decrease in impurity (MDI) and SHAP, which can be biased on training data [14]. For feature effects, this issue is largely unstudied. PD [7], ALE [1], and common references and software often use training data without justification [11,16], while others use holdout data [17].
Practitioners likewise disagree: some prefer training data for more reliable estimates due to the larger sample size³, others prefer holdout data as "the model may overfit"⁴, and some use both to diagnose distribution differences.⁵ To date, no systematic study addresses this. We do so by comparing different estimation strategies (training/holdout/CV) through a bias–variance decomposition.

³ https://forums.fast.ai/t/partial-dependence-plot/98465 (last accessed: 01/28/2026)
⁴ https://github.com/sosuneko/PDPbox/issues/68 (01/28/2026)
⁵ https://www.mathworks.com/help/stats/use-partial-dependence-plots-to-interpret-regression-models-trained-in-regression-learner-app.html (01/28/2026)

3 Notation & Background

Assume a data-generating process that is characterized by a joint distribution $P_{XY}$ over features $X = (X_1, \ldots, X_p)^\top$ and target $Y$, with true underlying function $f(x) = \mathbb{E}[Y \mid X = x]$. A random dataset $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^{n}$ consists of $n$ i.i.d. samples from $P_{XY}$. Concrete realizations $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ are the training, test, and validation sets $D_{\mathrm{train}}$, $D_{\mathrm{test}}$, and $D_{\mathrm{val}}$. A learning algorithm induces a fitted model $\hat{f}$ on $D_{\mathrm{train}}$, viewed either as a fixed function or as a random variable with distribution $P_F$ due to training-sample and algorithmic randomness. Model performance is measured by the risk $\mathcal{R}_L(\hat{f}) = \mathbb{E}_{XY}[L(Y, \hat{f}(X))]$ with a point-wise loss $L$, and can be estimated on an $n$-sized dataset $D$ as the empirical risk $\mathcal{R}_{\mathrm{emp}}(\hat{f}; D) = \frac{1}{n} \sum_{i=1}^{n} L\big(y^{(i)}, \hat{f}(x^{(i)})\big)$. For $D_{\mathrm{train}}$, this measures the in-sample error; for $D_{\mathrm{test}}$ or $D_{\mathrm{val}}$, the out-of-sample error. Expectations and variances w.r.t. $\hat{f} \sim P_F$ are denoted by subscript $F$, w.r.t. $\mathcal{D} \sim P_{XY}$ by subscript $\mathcal{D}$, and w.r.t. $(X, Y) \sim P_{XY}$ by subscript $XY$ (analogously for marginals and conditionals). Note that $\mathcal{D}$ and $\hat{f}$ are independent (denoted $\mathcal{D} \perp \hat{f}$) only if $\mathcal{D}$ is not used for training, i.e., only for holdout data.

We consider feature effects for a subset of features $X_S$ with $S \subseteq \{1, \ldots, p\}$. Throughout this work, we restrict attention to a single feature of interest ($|S| = 1$) and denote the complement feature set by $X_{\bar{S}}$.

Definition 1 (PD [7]). Let $h: \mathcal{X} \to \mathcal{Y}$ be a prediction function and let $X_S$ denote a feature of interest. The PD of $h$ w.r.t. $X_S$ is defined as

  $PD_{h,S}(x_S) := \mathbb{E}_{X_{\bar{S}}}\big[h(x_S, X_{\bar{S}})\big] = \int_{\mathcal{X}_{\bar{S}}} h(x_S, x_{\bar{S}}) \, dP_{X_{\bar{S}}}(x_{\bar{S}})$.   (1)

Given an $n$-sized dataset $D$, it can be estimated via Monte Carlo integration as $\widehat{PD}_{h,S}(x_S) := \frac{1}{n} \sum_{i=1}^{n} h(x_S, x_{\bar{S}}^{(i)}) = \frac{1}{n} \sum_{i=1}^{n} h_S^{(i)}(x_S)$.

Here, $h_S^{(i)}(x_S) = h(x_S, x_{\bar{S}}^{(i)})$ are the ICE curves [10]. $PD_{\hat{f},S}$ denotes the theoretical PD of the model $\hat{f}$, and $\widehat{PD}_{\hat{f},S}$ is its estimator. $PD_{f,S}$ is the ground-truth PD. In practice, PD is visualized using a grid of $G$ feature values $\{x_S^{(g)}\}_{g=1}^{G}$. Quantile-based grids rather than equidistant ones are recommended in [18].

Definition 2 (ALE [1]). Let $h: \mathcal{X} \to \mathcal{Y}$ be a prediction function and let $X_S$ denote a feature of interest. The uncentered ALE of $h$ w.r.t. $X_S$ is defined as

  $\widetilde{ALE}_{h,S}(x_S) = \lim_{K \to \infty} \sum_{k=1}^{k_S^K(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}\big[\Delta_{h,S}^K(k, X_{\bar{S}})\big]$,   (2)

where feature $X_S$ is partitioned into $K$ intervals $\{I_S^K(k) = (z_{k-1,S}^K, z_{k,S}^K]\}_{k=1}^{K}$, $k_S^K(x_S)$ denotes the index of the interval into which a value $x_S$ falls, and the maximum interval width converges to zero as $K \to \infty$. The finite differences in the $k$-th interval are given by $\Delta_{h,S}^K(k, x_{\bar{S}}) := h(z_{k,S}^K, x_{\bar{S}}) - h(z_{k-1,S}^K, x_{\bar{S}})$.
The uncentered ALE for a finite dataset $D$ and finite $K$ can be estimated as follows, where $n_S^K(k)$ denotes the number of observations in the $k$-th interval:

  $\widehat{\widetilde{ALE}}_{h,S}(x_S) = \sum_{k=1}^{k_S^K(x_S)} \frac{1}{n_S^K(k)} \sum_{i:\, x_S^{(i)} \in I_S^K(k)} \Delta_{h,S}^K(k, x_{\bar{S}}^{(i)})$.

Centered versions $ALE_{h,S}$ and $\widehat{ALE}_{h,S}$ are obtained by subtracting a constant such that they have zero mean w.r.t. the marginal distribution of $X_S$ or the empirical distribution of $\{x_S^{(i)}\}_{i=1}^{n}$, respectively. For the grid $\{z_{k,S}^K\}_{k=0}^{K}$, empirical quantiles of $\{x_S^{(i)}\}_{i=1}^{n}$ are recommended [1]. For readability, we omit the superscript $K$ when $K$ is fixed, and suppress the subscript $S$ for PD and ALE.

In [17], it is shown that the MSE of the theoretical PD of $\hat{f}$ w.r.t. the theoretical ground-truth PD can be decomposed into squared bias and variance:

  $\mathbb{E}_F\big[(PD_f(x_S) - PD_{\hat{f}}(x_S))^2\big] = \big(PD_f(x_S) - \mathbb{E}_F[PD_{\hat{f}}(x_S)]\big)^2 + \mathrm{Var}_F\big[PD_{\hat{f}}(x_S)\big]$.   (3)

The bias term relates to systematic model bias, and the variance term captures variability across model fits. For the empirical estimator $\widehat{PD}_{\hat{f}}$, Molnar et al. [17] further argue that Monte Carlo integration introduces an additional source of variance. Moreover, when estimated on holdout data, $\widehat{PD}_{\hat{f}}$ is unbiased w.r.t. $PD_{\hat{f}}$, and unbiasedness of the model implies unbiasedness of the PD.⁶

⁶ Proofs can be found in Appendices C & D of the arXiv version of [17].

While an analogous decomposition for ALE is not available, consistency results for the ALE estimator exist in [1]. To this end, they define a population version of the binned uncentered ALE for a prediction function $h$ as $\widetilde{ALE}_h^K(x_S) := \sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S(k)}\big[\Delta_h(k, X_{\bar{S}})\big]$. For the same $K$ fixed bins, $\widehat{\widetilde{ALE}}_h$ on an i.i.d. $\mathcal{D} \sim P_{XY}$ converges to $\widetilde{ALE}_h^K$ pointwise as $n \to \infty$ almost surely, under mild integrability conditions. Moreover, $\widetilde{ALE}_h^K$ converges pointwise to $\widetilde{ALE}_h$ as the bin resolution $K \to \infty$. Thus, $\widehat{\widetilde{ALE}}_h$ is jointly consistent for $\widetilde{ALE}_h$ if $n$ grows sufficiently fast relative to $K$.⁷

⁷ These consistency results can be found in Theorem 3 of the arXiv version of [1].

4 Theoretical Considerations & Estimators

For our theoretical analysis, we adopt several assumptions listed in §A.1.

4.1 Full Error Decomposition of the PD Estimator

Previous work [17] considers only the error decomposition of $PD_{\hat{f}}$. Since $PD_{\hat{f}}$ cannot be determined when $P_{X_{\bar{S}}}$ is unknown and is estimated via Monte Carlo integration, we instead study the estimator's error. While [17] notes that this adds a variance term, we formally derive the MSE decomposition at fixed $x_S$, integrating over both $\hat{f} \sim P_F$ and the dataset $\mathcal{D} \sim P_{XY}$ used for PD estimation:

  $\mathbb{E}_F \mathbb{E}_{\mathcal{D}|\hat{f}}\big[(PD_f(x_S) - \widehat{PD}_{\hat{f}}(x_S))^2\big] = \big(PD_f(x_S) - \mathbb{E}_F \mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)]\big)^2 + \mathrm{Var}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)]\big] + \mathbb{E}_F\big[\mathrm{Var}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)]\big]$.   (4)

The proof is given in §A.2. The decomposition has three distinct terms, which we will analyze in more detail below: the squared bias and two variances.

Bias. The bias of the PD estimator (cf. the first term in Eq. (4)) decomposes as

  $\mathbb{E}_F \mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)] - PD_f(x_S) = \mathbb{E}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)] - PD_{\hat{f}}(x_S)\big] + \big(\mathbb{E}_F[PD_{\hat{f}}(x_S)] - PD_f(x_S)\big)$   (5)

by adding and subtracting $\mathbb{E}_F[PD_{\hat{f},S}(x_S)]$. The first part vanishes by the unbiasedness of the PD estimator w.r.t. the theoretical model PD [17], provided that the data used for Monte Carlo integration in the PD estimation is independent of the model $\hat{f}$. This is true for holdout data, but not necessarily for training data. For the second part, exchanging expectations (Fubini) yields (proof in §A.3):

  $\mathbb{E}_F[PD_{\hat{f}}(x_S)] - PD_f(x_S) = \mathbb{E}_{X_{\bar{S}}}\big[\mathbb{E}_F[\hat{f}(x_S, X_{\bar{S}})] - f(x_S, X_{\bar{S}})\big]$.   (6)

Consequently, the term reduces to the model's bias averaged over $P_{X_{\bar{S}}}$. Thus, for estimation on holdout data, the PD estimator's bias reduces to the average model bias, whereas estimation on training data may introduce additional bias.

Model variance. The second term in Eq. (4) reflects the variance of the PD estimator w.r.t. the model distribution $P_F$. When the PD is estimated on holdout data, it reduces to $\mathrm{Var}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)]\big] = \mathrm{Var}_F\big[PD_{\hat{f}}(x_S)\big]$ by the unbiasedness of the PD estimator w.r.t. the theoretical model PD. This is exactly the variance of the theoretical PD in Eq. (3). By exchanging expectations (Fubini/Tonelli) and applying Jensen's inequality, we obtain an upper bound (proof in §A.4):

  $\mathrm{Var}_F\big[PD_{\hat{f}}(x_S)\big] \le \mathbb{E}_{X_{\bar{S}}}\big[\mathrm{Var}_F[\hat{f}(x_S, X_{\bar{S}})]\big]$.   (7)

Thus, the theoretical PD variance at $x_S$ is controlled by the average (pointwise) model variance w.r.t. the marginal distribution of $X_{\bar{S}}$.

Estimation variance. The third term in Eq. (4) captures the variance w.r.t. the samples used to estimate the PD via Monte Carlo integration, which equals

  $\mathbb{E}_F\big[\mathrm{Var}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(x_S)]\big] = \frac{1}{n}\, \mathbb{E}_F\big[\mathrm{Var}_{\mathcal{D}|\hat{f}}[\hat{f}(x_S, X_{\bar{S}})]\big]$   (8)

for any $X_{\bar{S}} \sim \mathcal{D} \mid \hat{f}$. The proof is given in §A.5. Consequently, the pointwise estimation variance of the PD depends on the sample size $n$ and decreases at the rate $O(1/n)$. It also depends on the expected variance of the ICE curves at $x_S$. Centering the PD via $\widehat{PD}_{\hat{f}}(x_S) - \mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{PD}_{\hat{f}}(X_S)]$ yields the variance of the centered ICE curves $\mathrm{Var}_{\mathcal{D}|\hat{f}}\big[\hat{f}(x_S, X_{\bar{S}}) - \mathbb{E}_{\mathcal{D}|\hat{f}}[\hat{f}(X_S, X_{\bar{S}})]\big]$ instead (proof analogous to §A.5). Since centered ICE curves fulfill local decomposability [13], their variance at $x_S$ is solely due to interactions involving feature $X_S$.
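The two properties of the estimation variance derived here — the $O(1/n)$ decay and its dependence on interactions involving $X_S$ — can be checked numerically. The following is a minimal NumPy sketch (our own toy example with fixed prediction functions, not code from the paper), which re-estimates the centered PD of a fixed function on fresh Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def pd_estimate(h, X, s, grid):
    """Monte Carlo PD estimate: pointwise mean of the ICE curves,
    centered over the grid for simplicity."""
    out = np.empty(len(grid))
    for g, x_s in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, s] = x_s              # replace the feature of interest
        out[g] = h(X_mod).mean()       # average over the ICE curves
    return out - out.mean()

# Two fixed toy "models": without and with an interaction in x1.
h_add = lambda X: X[:, 0] + X[:, 1] ** 2
h_int = lambda X: X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]

grid = np.linspace(-1.5, 1.5, 20)

def est_var(h, n, reps=200):
    """Estimation variance: variance of the centered PD estimate across
    fresh n-sized datasets (model held fixed), averaged over the grid."""
    curves = np.stack([pd_estimate(h, rng.normal(size=(n, 2)), 0, grid)
                       for _ in range(reps)])
    return curves.var(axis=0, ddof=1).mean()

print(est_var(h_add, 100))                       # no interaction: ~0
print(est_var(h_int, 100), est_var(h_int, 400))  # interaction: shrinks with n
```

With the interaction present, quadrupling $n$ reduces the estimation variance by roughly a factor of four, matching the $O(1/n)$ rate of Eq. (8); without it, the centered PD estimate does not vary across datasets at all.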
Thus, the estimation variance of the centered PD is zero when $X_S$ has no interactions in $\hat{f}$. An overview of all four derived error components is provided in Fig. 1.

[Fig. 1: Conceptual overview of the four error components.]

4.2 Full Error Decomposition of the ALE Estimator

We now provide an error analysis for ALE, which is missing in the current literature. The MSE of $\widehat{\widetilde{ALE}}_{\hat{f}}$ w.r.t. $\widetilde{ALE}_f$ can be decomposed into bias and variance analogously to PD in Eq. (4) (proof in §A.2, analogous to the one for PD). The only property required for the argument is that all expectations exist and that $f$, and hence $\widetilde{ALE}_f$, is non-random w.r.t. $P_F$. For the theoretical analysis of the error components, we focus on uncentered ALE, as centering is a linear post-processing step that only affects the offset. All sources of bias and variance originate in the uncentered ALE and propagate deterministically under centering.

Bias. Similar to PD, we can decompose the bias of ALE further into

  $\mathbb{E}_F \mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)] - \widetilde{ALE}_f(x_S) = \mathbb{E}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)] - \widetilde{ALE}_{\hat{f}}(x_S)\big] + \big(\mathbb{E}_F[\widetilde{ALE}_{\hat{f}}(x_S)] - \widetilde{ALE}_f(x_S)\big)$,   (9)

by adding and subtracting $\mathbb{E}_F[\widetilde{ALE}_{\hat{f}}(x_S)]$. The first part can be viewed as the average "estimation bias" and can be further decomposed into

  $\mathbb{E}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)] - \widetilde{ALE}_{\hat{f}}(x_S)\big] = \mathbb{E}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)] - \widetilde{ALE}_{\hat{f}}^K(x_S)\big] + \mathbb{E}_F\big[\widetilde{ALE}_{\hat{f}}^K(x_S) - \widetilde{ALE}_{\hat{f}}(x_S)\big]$.   (10)

The first term is due to estimation on finite samples. For ALE estimation on holdout data, under mild regularity (Assumptions (iii) and (v)) and $n_S(k) > 0$ for all relevant bins, the ALE estimator is unbiased w.r.t. the binned population ALE and the term becomes zero (proof in §A.6).
We refer to the second term as "discretization bias": the term inside $\mathbb{E}_F$ goes to 0 as $K \to \infty$ (pointwise in $\hat{f}$) by the definition of ALE (Eq. (2)). Convergence of the entire expectation term can be shown by applying the dominated convergence theorem exactly as in §A.7. For the second part in Eq. (9), we show in §A.7 that it arises from an (infinite) sum of the local average biases in the finite differences of $\hat{f}$ w.r.t. those of $f$. Thus, it vanishes when the model's conditional finite differences are unbiased w.r.t. those of $f$. In particular, this is satisfied if $\hat{f}$ is unbiased w.r.t. $f$ for all $x$.

Model variance. When ALE is estimated on holdout data, it holds that $\mathrm{Var}_F\big[\mathbb{E}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)]\big] = \mathrm{Var}_F\big[\widetilde{ALE}_{\hat{f}}^K(x_S)\big]$ by the ALE estimator's unbiasedness (see above). For this expression, we obtain an upper bound for the variance of ALE w.r.t. the model distribution (proof in §A.8):

  $\mathrm{Var}_F\big[\widetilde{ALE}_{\hat{f}}^K(x_S)\big] \le k_S(x_S) \sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S(k)}\big[\mathrm{Var}_F[\Delta_{\hat{f}}(k, X_{\bar{S}})]\big]$.   (11)

Thus, the theoretical ALE variance at $x_S$ is controlled by the local average variability of the finite differences across models along the intervals up to $x_S$.

Estimation variance. By conditioning on the samples' bin assignments, $B = (B^{(1)}, \ldots, B^{(n)})$ with $B^{(i)} = k_S(X_S^{(i)})$, and using the law of total variance, the estimation variance of ALE, $\mathbb{E}_F\big[\mathrm{Var}_{\mathcal{D}|\hat{f}}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)]\big]$, decomposes into

  $\mathbb{E}_F\big[\mathrm{Var}_{B|\hat{f}}\, \mathbb{E}_{\mathcal{D}|\hat{f},B}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)]\big] + \mathbb{E}_F\big[\mathbb{E}_{B|\hat{f}}\, \mathrm{Var}_{\mathcal{D}|\hat{f},B}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)]\big]$.

Table 1: Estimators of the feature effect error components at $x_S$.
Component            Estimator
MSE                  $\widehat{MSE}(x_S) = \frac{1}{M} \sum_{m=1}^{M} \big(PD_f(x_S) - \widehat{PD}_{\hat{f}^{(m)}}(x_S)\big)^2$
Bias                 $\widehat{Bias}(x_S) = PD_f(x_S) - \frac{1}{M} \sum_{m=1}^{M} \widehat{PD}_{\hat{f}^{(m)}}(x_S)$
Variance             $\widehat{Var}(x_S) = \frac{1}{M-1} \sum_{m=1}^{M} \big(\widehat{PD}_{\hat{f}^{(m)}}(x_S) - \frac{1}{M} \sum_{m'=1}^{M} \widehat{PD}_{\hat{f}^{(m')}}(x_S)\big)^2$
Estimation variance  $\widehat{Var}_{Est}(x_S) = \frac{1}{M(R-1)} \sum_{m=1}^{M} \sum_{r=1}^{R} \big(\widehat{PD}^{[r]}_{\hat{f}^{(m)}}(x_S) - \frac{1}{R} \sum_{r'=1}^{R} \widehat{PD}^{[r']}_{\hat{f}^{(m)}}(x_S)\big)^2$

The first term captures variance from random bin assignments along $X_S$, i.e., variability in the expected ALE estimate due to the random allocation of samples to bins. For the second term, assuming $n_S(k) > 0$ for all relevant bins $k \le k_S(x_S)$:

  $\mathbb{E}_F\big[\mathbb{E}_{B|\hat{f}}\, \mathrm{Var}_{\mathcal{D}|\hat{f},B}[\widehat{\widetilde{ALE}}_{\hat{f}}(x_S)]\big] = \sum_{k \le k_S(x_S)} \mathbb{E}_F \mathbb{E}_{B|\hat{f}}\Big[\frac{1}{n_S(k)}\, \sigma_k^2(\hat{f})\Big]$   (12)

with $\sigma_k^2(\hat{f}) := \mathrm{Var}_{X_{\bar{S}} \mid X_S \in I_S(k), \hat{f}}\big[\Delta_{\hat{f}}(k, X_{\bar{S}})\big]$. The proof is given in §A.9. Thus, it scales with the expected inverse number of observations per bin. Assuming deterministic equal-frequency binning for simplicity, this factor becomes $\frac{K}{n}$. It also depends on the local variance of the estimated finite differences. Finite differences also fulfill local decomposability [13]. Thus, the variance $\sigma_k^2(\hat{f})$ is only due to interaction effects involving feature $X_S$ and is zero when $X_S$ has no interaction effects in $\hat{f}$.

4.3 Estimators of the Error Components

To investigate the error components empirically in §5–6, we propose estimators for each. $\mathbb{E}_F$ and $\mathrm{Var}_F$ can be estimated by averaging over multiple models $\hat{f}^{(m)}$ of the same learner fitted to $M$ different training sets, each independently sampled from $P_{XY}$. Translating the standard estimators for MSE, bias, and variance [20] to our setting yields the estimators in Tab. 1. The variance estimator is the same as in [17], capturing the variance in both model fits and Monte Carlo integration.
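In code, the estimators of Tab. 1 amount to a few array reductions. The sketch below uses our own naming (it is not taken from the released code) and assumes `pd_curves` stacks $\widehat{PD}_{\hat{f}^{(m)}}(x_S)$ over $M$ model refits on a common grid, with `pd_reps` optionally stacking $R$ Monte Carlo re-estimates per fixed model:

```python
import numpy as np

def error_components(pd_truth, pd_curves, pd_reps=None):
    """Estimators of Tab. 1, evaluated at each grid point.

    pd_truth:  (G,)      ground-truth effect PD_f on a grid of size G
    pd_curves: (M, G)    effect estimates for M independent model refits
    pd_reps:   (M, R, G) optional R Monte Carlo re-estimates per fixed model
    """
    mse = ((pd_truth - pd_curves) ** 2).mean(axis=0)
    bias = pd_truth - pd_curves.mean(axis=0)
    var = pd_curves.var(axis=0, ddof=1)            # 1/(M-1) normalization
    out = {"mse": mse, "bias": bias, "var": var}
    if pd_reps is not None:
        # 1/(M(R-1)) sum of squared deviations within each fixed model
        out["var_est"] = pd_reps.var(axis=1, ddof=1).mean(axis=0)
        out["var_model"] = var - out["var_est"]    # remainder: model variance
    return out
```

At each grid point the identity $\widehat{MSE} = \widehat{Bias}^2 + \frac{M-1}{M}\widehat{Var}$ holds exactly, mirroring the bias-variance structure of Eq. (4) up to the finite-$M$ correction.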
To estimate the estimation variance separately, consider $R$ Monte Carlo iterations $D_r$ to estimate the feature effect $\widehat{PD}^{[r]}$, again for multiple model refits. While Tab. 1 is based on PD, estimators for ALE follow analogously. These estimators are unbiased according to standard statistical results. The Monte Carlo standard errors of these estimates can again be estimated as a function of the number of repetitions [20]. Importantly, most estimators in Tab. 1 are impractical as they require knowledge of the ground truth. We rather use them to understand the behavior of feature effect estimation.

5 Experimental Set-Up

Table 2: Data settings with underlying functions and feature structure.

Setting                   Function                                                                      Feature distribution
Simple-Normal-Correlated  $f_1(x) = x_1 + \frac{x_2^2}{2} + x_1 x_2$                                    $X_1, \ldots, X_4 \sim N(0, 1)$, $\rho_{12} = 0.9$, $\rho_{ij} = 0$ for all $(i, j) \ne (1, 2)$, $i \ne j$
Friedman1 [6]             $f_2(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - \frac{1}{2})^2 + 10 x_4 + 5 x_5$   $X_1, \ldots, X_7$ i.i.d. $\sim U(0, 1)$
Feynman I.29.16 [15]      $f_3(x, \theta) = \sqrt{x_1^2 + x_2^2 + 2 x_1 x_2 \cos(\theta_1 - \theta_2)}$  $X_1, X_2 \sim \mathrm{LogU}(0.1, 10)$, $\theta_1, \theta_2 \sim U(0, 2\pi)$, $D_1, D_2 \sim U(0, 1)$, all indep.

We now empirically validate our findings from §4 with an extensive simulation study. While the theoretical analysis requires holdout data at multiple points (e.g., to avoid estimation bias), we aim to gain deeper insights into what empirically happens when this is violated. For this, we compare feature effect errors across different estimation strategies: on training data, on validation data, and via CV. We break this overarching goal down into three specific research questions:

RQ1: How does overfitting empirically affect MSE, bias, and variance of PD and ALE when estimated on training vs. validation data vs. CV?
RQ2: How do the variance sources (model and estimation variance) empirically behave for PD and ALE on training vs. validation data vs. in CV?
RQ3: How does sample size affect the estimation error for PD and ALE empirically?

Data settings. We consider three settings of varying complexity. Ground-truth functions and feature structures are given in Tab. 2. All settings include two independent dummy features. The first setting has correlations and interactions, the second different non-linearities, and the third real-world relevance.⁸ Features are generated i.i.d., and the target as $y = f(x) + \varepsilon$ with i.i.d. $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ and a signal-to-noise ratio of 5.⁹ We consider two sample sizes, $n = 1250$ and $n = 10000$.

Models. We consider a generalized additive model (GAM) with spline bases for main and pairwise interaction effects, and XGBoost as learners. Each of them is once configured with "optimally tuned" (OT) hyperparameters and once with hyperparameters chosen to overfit (OF). These hyperparameters are pre-selected per setting and sample size, i.e., carefully hand-picked for OF (e.g., small penalty / large learning rate) and tuned on separate data samples for OT. For details on the hyperparameters and model performances, see §B.1.

Feature effect estimation. We estimate the feature effects $\widehat{PD}_{\hat{f}}$ and $\widehat{ALE}_{\hat{f}}$ per feature and model by (a) training a model on all $n$ samples and estimating the feature effect on the same $n$ samples, (b) splitting the $n$ samples into an 80% training and a 20% validation set, fitting a model on the training set and estimating the effect on the validation set, and (c) a CV-based estimation strategy on all $n$ samples.¹⁰ We compute all feature effects at the same 100 grid points, defined by the theoretical quantiles of $P_{X_S}$ for comparability. Additionally, we center the curves after estimation, and omit the first and last grid point to avoid boundary effects, particularly for ALE. To enable ground-truth comparisons, we additionally construct ground-truth effect estimators in the same manner, but on 10,000 fresh samples and directly on $f$. This eliminates the discretization bias from our error, as the ground truth and the estimate use the same intervals.¹¹

⁸ This dataset is based on the physics-grounded Feynman equation I.29.16 for wave interference, addressing the limited realism of standard test functions.
⁹ The scale parameter is set such that $\hat{\sigma}_Y / \sigma_\varepsilon = 5$.
¹⁰ We use 5-fold CV: in each iteration, a model is fitted on four folds, effects are estimated on the held-out fold, and the five resulting effects are averaged pointwise.

Experiments. To address RQ1, we compute the MSE, bias, and variance of all estimated model feature effects w.r.t. the estimated ground-truth effects via the estimators in Tab. 1, with $\widehat{PD}_f$ for $PD_f$ (analogously for ALE). We repeat each setting-size-model combination $M = 30$ times to estimate the error terms. Each repetition involves drawing new data, fitting the models, and estimating the effects. We report MSE, bias, and variance averaged over the grid points. To address RQ2, we estimate the estimation variance according to Tab. 1. In each iteration $m \in \{1, \ldots, M\}$, we fix the trained models, draw $R = 30$ new datasets, and estimate the effects with them for the fixed models. By subtracting this from the total variance (cf. RQ1), we estimate the model variance. For this analysis, we focus on XGBoost (OF & OT) and a sample size of $n = 1250$. To address RQ3, we compute the MSE between analytical ground-truth effects (available for the first two settings) and estimated ground-truth effects across 50 different sample sizes ranging from $10^1$ to $10^6$ (on log scale) with 50 repetitions per size. This isolates the estimation error, as no model is involved.

Reproducibility.
Experiments are implemented in Python and use fixed random seeds. We release all experimental code and raw results on GitHub (link).

¹¹ Except for discretization bias, this estimate is unbiased w.r.t. the true theoretical effect (cf. §4), and adds negligible estimation variance at this sample size (cf. §6.3).

6 Empirical Results

We report results for a single setting and a representative feature subset. Further results are provided in §B.2 and are consistent with the findings reported here.

6.1 Bias-Variance Analysis

Table 3: Results for PD on SimpleNormalCorrelated averaged over 100 grid points. Bold numbers are the minimum per metric-feature-model-size combination.

                          x1                        x2                        x3
                   MSE    Bias   Var        MSE    Bias   Var        MSE    Bias   Var
n = 1250
GAM_OF   train   0.0695 0.0296 0.0710    0.0699 0.0276 0.0716    0.0090 0.0161 0.0091
         val     0.0928 0.0206 0.0955    0.0849 0.0345 0.0865    0.0114 0.0190 0.0114
         CV      0.0609 0.0264 0.0622    0.0562 0.0292 0.0573    0.0086 0.0157 0.0086
GAM_OT   train   0.0051 0.0323 0.0042    0.0039 0.0247 0.0034    0.0005 0.0022 0.0005
         val     0.0089 0.0336 0.0081    0.0093 0.0305 0.0086    0.0005 0.0040 0.0005
         CV      0.0053 0.0356 0.0042    0.0041 0.0287 0.0034    0.0005 0.0022 0.0005
XGB_OF   train   0.1780 0.3147 0.0817    0.4217 0.4505 0.2263    0.0010 0.0043 0.0010
         val     0.1880 0.3242 0.0857    0.3888 0.3941 0.2416    0.0015 0.0098 0.0014
         CV      0.1458 0.3114 0.0505    0.3043 0.4236 0.1291    0.0007 0.0042 0.0007
XGB_OT   train   0.2807 0.5194 0.0113    0.1690 0.3941 0.0141    0.0014 0.0052 0.0014
         val     0.3008 0.5375 0.0123    0.1666 0.3894 0.0155    0.0019 0.0073 0.0019
         CV      0.2950 0.5351 0.0090    0.1691 0.3987 0.0104    0.0013 0.0051 0.0013
n = 10000
GAM_OF   train   0.1878 0.0674 0.1895    0.1995 0.0890 0.1982    0.0008 0.0050 0.0008
         val     0.2248 0.0601 0.2289    0.2371 0.0678 0.2406    0.0011 0.0055 0.0011
         CV      0.1777 0.0778 0.1775    0.1875 0.0996 0.1837    0.0008 0.0050 0.0008
GAM_OT   train   0.0011 0.0206 0.0007    0.0012 0.0204 0.0008    0.0000 0.0009 0.0000
         val     0.0020 0.0228 0.0016    0.0018 0.0222 0.0014    0.0001 0.0012 0.0001
         CV      0.0012 0.0232 0.0007    0.0013 0.0230 0.0008    0.0000 0.0010 0.0000
XGB_OF   train   0.0915 0.2638 0.0227    0.2625 0.4404 0.0710    0.0003 0.0029 0.0003
         val     0.1029 0.2675 0.0324    0.2700 0.4465 0.0730    0.0004 0.0037 0.0004
         CV      0.0828 0.2641 0.0135    0.2226 0.4277 0.0411    0.0002 0.0022 0.0002
XGB_OT   train   0.1752 0.4142 0.0037    0.1210 0.3418 0.0043    0.0002 0.0022 0.0002
         val     0.1802 0.4197 0.0042    0.1226 0.3439 0.0045    0.0003 0.0022 0.0003
         CV      0.1806 0.4215 0.0030    0.1237 0.3471 0.0033    0.0002 0.0021 0.0002

PD decomposition. Our empirical results on SimpleNormalCorrelated in Tab. 3 reveal that the MSE of the PD estimator is mostly lowest for CV- and training-set-based estimation. Bias is mostly similar across the estimation strategies, and we observe no systematic trends w.r.t. estimation strategy or sample size. Thus, a potential bias introduced by estimation on training data, which could not be ruled out in our theoretical analysis (§4.1), appears empirically negligible. Variance is generally lowest for CV-based estimation. We hypothesize that this is due to two effects: (1) model variance may decrease as CV averages out fitting variability across multiple models, and (2) estimation variance may decrease compared to estimation on a single validation set due to the increased effective sample size. Variance is generally slightly higher for validation than for training-set-based estimation, likely due to the smaller sample size. The higher variance of overfitting models is reflected in a higher PD variance. These empirical results agree with our theoretical findings.
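The CV-based strategy behind these numbers follows footnote 10: fit on four folds, estimate the effect on the held-out fold, and average the resulting curves pointwise. A minimal sketch of this strategy (with ordinary least squares as our own stand-in learner instead of GAM/XGBoost):

```python
import numpy as np

rng = np.random.default_rng(2)

def pd_estimate(h, X, s, grid):
    """Monte Carlo PD estimate on dataset X (pointwise mean of ICE curves)."""
    out = np.empty(len(grid))
    for g, x_s in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, s] = x_s
        out[g] = h(X_mod).mean()
    return out

def fit_ols(X, y):
    """Stand-in learner: ordinary least squares with intercept."""
    A = np.c_[np.ones(len(X)), X]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ coef

def cv_pd(X, y, s, grid, k=5):
    """CV-based PD: fit on k-1 folds, estimate on the held-out fold,
    average the k curves pointwise (cf. footnote 10)."""
    folds = np.array_split(rng.permutation(len(X)), k)
    curves = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_ols(X[train], y[train])
        curves.append(pd_estimate(model, X[folds[i]], s, grid))
    return np.mean(curves, axis=0)

# Toy DGP: y = 2*x1 + x2 + noise; the PD of x1 should have slope ~2.
X = rng.normal(size=(500, 2))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)
grid = np.linspace(-1, 1, 9)
print(cv_pd(X, y, s=0, grid=grid))
```

Because each of the $k$ models sees a different training set and each curve is estimated on disjoint holdout data, the pointwise average damps both fitting variability and Monte Carlo noise, which is the variance-reduction effect hypothesized above.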
An additional finding is that, for well-generalizing models, differences in MSE across estimation strategies are mostly negligible, while for overfitting models, CV-based estimation yields a substantial reduction.

ALE decomposition. Similar results for ALE in Tab. 4 show that the MSE is again lowest for training-set- or CV-based estimation. In contrast to PD, bias is often considerably lower for training data (the largest set) when $n$ is small, and generally decreases with increasing sample size, likely as the probability of $n_S(k) > 0 \;\forall k$ grows (required for unbiasedness, cf. §4.2). Again, the variance is mostly lowest for CV-based estimation, but now substantially higher on the smaller validation set, supporting our theoretical result that ALE (estimation) variance is more sensitive to sample size than that of PD. Generally, this confirms our theoretical findings and again shows that CV-based estimation is promising.

6.2 Variance Decomposition Analysis

For the variance decomposition into model and estimation variance, we consider the results on Friedman1 in Tab. 5, including a non-linear feature with interactions ($X_1$), a linear one without ($X_4$), and a dummy feature ($X_7$). The estimation variance is consistently highest when feature effects are estimated on the smaller validation set. This is more pronounced for ALE, and we generally observe higher

Table 4: Results for ALE on SimpleNormalCorrelated averaged over 100 grid points. Bold numbers are the minimum per metric-feature-model-size combination.
| n | Model | Data | x1: MSE | x1: Bias | x1: Var | x2: MSE | x2: Bias | x2: Var | x3: MSE | x3: Bias | x3: Var |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1250 | GAM_OF | train | 0.0123 | 0.0205 | 0.0123 | 0.0108 | 0.0236 | 0.0106 | 0.0091 | 0.0161 | 0.0092 |
| 1250 | GAM_OF | val | 0.0277 | 0.0771 | 0.0225 | 0.0304 | 0.0711 | 0.0262 | 0.0129 | 0.0176 | 0.0131 |
| 1250 | GAM_OF | CV | 0.0178 | 0.0857 | 0.0108 | 0.0190 | 0.0858 | 0.0120 | 0.0077 | 0.0153 | 0.0077 |
| 1250 | GAM_OT | train | 0.0022 | 0.0100 | 0.0021 | 0.0019 | 0.0061 | 0.0019 | 0.0005 | 0.0022 | 0.0005 |
| 1250 | GAM_OT | val | 0.0148 | 0.0808 | 0.0086 | 0.0198 | 0.0918 | 0.0117 | 0.0005 | 0.0036 | 0.0005 |
| 1250 | GAM_OT | CV | 0.0103 | 0.0899 | 0.0023 | 0.0109 | 0.0890 | 0.0031 | 0.0004 | 0.0020 | 0.0004 |
| 1250 | XGB_OF | train | 0.1159 | 0.1343 | 0.1012 | 0.1120 | 0.1196 | 0.1011 | 0.0724 | 0.0279 | 0.0741 |
| 1250 | XGB_OF | val | 0.5238 | 0.1153 | 0.5281 | 0.6017 | 0.1415 | 0.6017 | 0.0848 | 0.0354 | 0.0864 |
| 1250 | XGB_OF | CV | 0.1531 | 0.0784 | 0.1520 | 0.1025 | 0.0803 | 0.0994 | 0.0118 | 0.0126 | 0.0121 |
| 1250 | XGB_OT | train | 0.0162 | 0.0875 | 0.0088 | 0.0174 | 0.0768 | 0.0119 | 0.0065 | 0.0077 | 0.0066 |
| 1250 | XGB_OT | val | 0.0471 | 0.1470 | 0.0263 | 0.0386 | 0.1094 | 0.0275 | 0.0033 | 0.0116 | 0.0033 |
| 1250 | XGB_OT | CV | 0.0315 | 0.1549 | 0.0077 | 0.0234 | 0.1158 | 0.0103 | 0.0020 | 0.0053 | 0.0020 |
| 10000 | GAM_OF | train | 0.0011 | 0.0071 | 0.0011 | 0.0012 | 0.0071 | 0.0011 | 0.0008 | 0.0050 | 0.0008 |
| 10000 | GAM_OF | val | 0.0017 | 0.0075 | 0.0017 | 0.0017 | 0.0085 | 0.0017 | 0.0011 | 0.0057 | 0.0011 |
| 10000 | GAM_OF | CV | 0.0011 | 0.0068 | 0.0011 | 0.0011 | 0.0072 | 0.0011 | 0.0008 | 0.0050 | 0.0008 |
| 10000 | GAM_OT | train | 0.0002 | 0.0030 | 0.0002 | 0.0003 | 0.0045 | 0.0002 | 0.0000 | 0.0009 | 0.0000 |
| 10000 | GAM_OT | val | 0.0006 | 0.0037 | 0.0006 | 0.0005 | 0.0056 | 0.0005 | 0.0001 | 0.0012 | 0.0001 |
| 10000 | GAM_OT | CV | 0.0002 | 0.0029 | 0.0002 | 0.0003 | 0.0046 | 0.0002 | 0.0000 | 0.0010 | 0.0000 |
| 10000 | XGB_OF | train | 0.0164 | 0.0346 | 0.0157 | 0.0197 | 0.0520 | 0.0176 | 0.0103 | 0.0189 | 0.0103 |
| 10000 | XGB_OF | val | 0.0652 | 0.0416 | 0.0657 | 0.0780 | 0.0597 | 0.0770 | 0.0155 | 0.0166 | 0.0157 |
| 10000 | XGB_OF | CV | 0.0130 | 0.0180 | 0.0131 | 0.0147 | 0.0198 | 0.0148 | 0.0023 | 0.0098 | 0.0023 |
| 10000 | XGB_OT | train | 0.0046 | 0.0538 | 0.0017 | 0.0039 | 0.0401 | 0.0023 | 0.0005 | 0.0037 | 0.0005 |
| 10000 | XGB_OT | val | 0.0054 | 0.0544 | 0.0025 | 0.0047 | 0.0412 | 0.0031 | 0.0006 | 0.0037 | 0.0007 |
| 10000 | XGB_OT | CV | 0.0044 | 0.0555 | 0.0014 | 0.0034 | 0.0417 | 0.0017 | 0.0003 | 0.0027 | 0.0003 |

estimation variance than for PD. As hypothesized, CV (1) reduces model variance compared to the other estimation strategies, which is most pronounced for overfitting models (which generally have higher model variance).
As expected, it also (2) reduces estimation variance compared to a single validation set. These empirical findings agree with our theoretical results. Although estimation variance is sometimes slightly higher for features with interactions ($X_1$), XGBoost may also learn interactions that are not in $f$, which is why this effect is not as clear as expected.

6.3 Effect of the Sample Size

Our results for the effect of the sample size are shown in Fig. 2 for Friedman1. For PD on holdout data, we know that there is no estimation bias (cf. §4.1), and Fig. 2a thus contains only estimation variance. We observe the expected polynomial decrease at roughly $1/n$ for features with interactions ($X_1$) and a negligible error across sample sizes for features without interactions ($X_4$), as expected for centered PD. For ALE, we know from §4.2 that there are also estimation biases in addition to variance, due to discretization and when $n_S(k) = 0$. With interactions ($X_1$), the observed estimation error in Fig. 2b is closer to $K/n$ for small sample sizes but gets closer to $1/n$ as the sample size increases. For features without interactions, we observe a sharp drop in the estimation error at $n = K$, reducing it to the expected negligible level. This may be due to reduced estimation bias at this sample size, as each interval could, in principle, now contain at least one sample.

Table 5: Decomposition results of the total variance $\mathrm{Var}_{\mathrm{Tot}}$ into model variance $\mathrm{Var}_{\mathrm{Mod}}$ and estimation variance $\mathrm{Var}_{\mathrm{Est}}$ on Friedman1, averaged over 100 grid points. Bold numbers indicate the estimation strategy with minimal variance.
| Effect | Model | Data | x1: VarTot | x1: VarMod | x1: VarEst | x4: VarTot | x4: VarMod | x4: VarEst | x7: VarTot | x7: VarMod | x7: VarEst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PD | XGB_OF | train | 0.0862 | 0.0833 | 0.0029 | 0.1269 | 0.1240 | 0.0029 | 0.0027 | 0.0025 | 0.0002 |
| PD | XGB_OF | val | 0.1120 | 0.0974 | 0.0146 | 0.1811 | 0.1661 | 0.0150 | 0.0059 | 0.0046 | 0.0012 |
| PD | XGB_OF | CV | 0.0460 | 0.0432 | 0.0028 | 0.0814 | 0.0785 | 0.0029 | 0.0014 | 0.0011 | 0.0002 |
| PD | XGB_OT | train | 0.0217 | 0.0209 | 0.0008 | 0.0232 | 0.0231 | 0.0001 | 0.0071 | 0.0070 | 0.0000 |
| PD | XGB_OT | val | 0.0280 | 0.0241 | 0.0038 | 0.0283 | 0.0278 | 0.0006 | 0.0083 | 0.0082 | 0.0001 |
| PD | XGB_OT | CV | 0.0190 | 0.0182 | 0.0008 | 0.0205 | 0.0204 | 0.0001 | 0.0062 | 0.0062 | 0.0000 |
| ALE | XGB_OF | train | 1.0248 | 0.4286 | 0.5962 | 1.1178 | 0.4220 | 0.6958 | 0.1852 | 0.1556 | 0.0296 |
| ALE | XGB_OF | val | 4.6600 | 1.5201 | 3.1399 | 4.9819 | 1.1399 | 3.8420 | 0.3004 | 0.1614 | 0.1390 |
| ALE | XGB_OF | CV | 0.6573 | 0.0846 | 0.5727 | 0.8568 | 0.0858 | 0.7710 | 0.0262 | -* | 0.0287 |
| ALE | XGB_OT | train | 0.0337 | 0.0229 | 0.0108 | 0.0340 | 0.0265 | 0.0075 | 0.0125 | 0.0113 | 0.0012 |
| ALE | XGB_OT | val | 0.0937 | 0.0200 | 0.0737 | 0.1113 | 0.0463 | 0.0651 | 0.0154 | 0.0075 | 0.0080 |
| ALE | XGB_OT | CV | 0.0299 | 0.0147 | 0.0151 | 0.0273 | 0.0143 | 0.0129 | 0.0067 | 0.0051 | 0.0016 |

\* Due to instabilities in variance estimation for ALE, likely caused by unstable bin assignments, the estimated $\mathrm{Var}_{\mathrm{Est}}$ sometimes exceeds the estimated $\mathrm{Var}_{\mathrm{Tot}}$. We omit $\mathrm{Var}_{\mathrm{Mod}}$ in these cases.

[Fig. 2 panels: mean estimation error vs. $n$ (log-log) for features $x_1$ and $x_4$, with reference lines $1/n$ and $K/n$; panel (a) for PD, panel (b) for ALE.]

Fig. 2: Mean estimation errors on Friedman1 for $X_1$ and $X_4$. (a) Mean estimation error for PD; (b) mean estimation error for ALE. For each sample size $n$, the variances are averaged over all grid points. Both axes are log-scale.
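The ALE estimator analyzed in this section can be sketched as follows: a minimal NumPy illustration with $K$ equal-frequency bins, not the authors' released implementation (the function name `ale_estimate` and the toy model are ours). The empty-bin guard corresponds to the $n_S(k) = 0$ case discussed in §6.3, and for simplicity centering is done over bins rather than w.r.t. the distribution of $x_S$:

```python
import numpy as np

def ale_estimate(predict, X_est, feature_idx, K):
    """ALE estimate with K equal-frequency bins: per bin, average the finite
    difference of predictions between the bin edges, accumulate, center."""
    x = X_est[:, feature_idx]
    edges = np.quantile(x, np.linspace(0.0, 1.0, K + 1))
    bin_idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, K - 1)
    local = np.zeros(K)
    for k in range(K):
        mask = bin_idx == k
        if not mask.any():
            continue  # empty bin: the n_S(k) = 0 case that introduces bias
        lo, hi = X_est[mask].copy(), X_est[mask].copy()
        lo[:, feature_idx] = edges[k]
        hi[:, feature_idx] = edges[k + 1]
        local[k] = (predict(hi) - predict(lo)).mean()  # mean finite difference
    ale = np.cumsum(local)  # accumulate local effects over bins
    return ale - ale.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
predict = lambda X: 2.0 * X[:, 0] + X[:, 1]  # additive toy model
ale = ale_estimate(predict, X, feature_idx=0, K=10)
```

For the additive toy model, each per-bin finite difference is a constant, so the accumulated curve increases monotonically with slope 2 in the feature; with interactions, the finite differences vary with $X_{\bar{S}}$, which is the source of the per-bin estimation variance discussed above.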
7 Conclusion & Future Work

Summary. In this work, we presented an estimator-level analysis of global feature effect estimation for PD and ALE. We derived a full MSE decomposition that separates model bias, estimation bias, model variance, and estimation variance, and we analyzed these components theoretically. Our results show that the model bias component follows directly from systematic biases in the model $\hat{f}$ for PD or in the finite differences of $\hat{f}$ for ALE. For PD, the estimation bias is zero on holdout data. For ALE, it consists of a bias when $n_S(k) = 0$ and a discretization bias due to binning. For both PD and ALE, we derived upper bounds on the model variance in terms of the pointwise variance of $\hat{f}$ (PD) or of its finite differences (ALE). Estimation variance is governed by (1) the sample size, scaling as $1/n$ for PD and as the expected inverse bin counts for ALE, explaining ALE's sensitivity in small-sample regimes, and (2) the variance of the ICE curves / finite differences. For centered PD, and generally for ALE, the latter depends only on interactions with $X_S$.

We validated our theoretical findings through simulations by comparing feature effect estimation on training and validation data and a CV-based strategy. Our empirical results showed that the potential bias from estimating feature effects on the training data is negligible across our settings, including when models overfit (RQ1). Differences are instead dominated by sample-size effects: validation-set-based estimation yields higher variance (for ALE, also higher bias), while training-set- and CV-based estimation attain the lowest MSE. The variance decomposition (RQ2) shows that the estimation variance is consistently highest on the smaller validation set and more pronounced for ALE, in line with our theoretical results.
CV reduces model variance by averaging across fits (particularly beneficial for overfitting models) and estimation variance through a higher effective sample size compared to a single validation set. The observed estimation error (RQ3) confirms our theoretical results on sample size and interaction effects.

Practical implications. While using holdout data is theoretically cleaner, our results indicate that training-set-based feature effect estimation is empirically safe and often preferable because it has a larger sample size. CV-based estimation emerges as a robust alternative, particularly for overfitting models. Although such models are typically identified via appropriate performance estimation and ruled out, this may not always be perfectly possible in all applications, in which case CV-based feature effect estimation can be a safer option.

Limitations & future work. Our empirical results are themselves estimates with errors from finite repetitions, which can lead to artifacts such as estimation variance estimates exceeding the total variance (cf. Tab. 5). Additionally, the empirical analysis considers only low-dimensional settings and two model classes. Further, our theoretical analysis leaves room for future research, including: (1) a tighter bias analysis for cases in which $D \perp \hat{f}$ does not hold, to formalize what happens for training-set-based estimation, (2) a formal theory for CV-based estimation, and (3) extensions of our analysis to distribution shifts between training and estimation data, which is another practically relevant issue.

References

1. Apley, D.W., Zhu, J.: Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society Series B: Statistical Methodology 82(4), 1059-1086 (2020)
2. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization.
In: Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 2546-2554. NIPS'11, Curran Associates Inc., Red Hook, NY, USA (2011)
3. Chang, C.Y., Chang, M.C.: Accelerated aggregated D-optimal designs for estimating main effects in black-box models. arXiv:2510.08465 [stat.ML] (2025)
4. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794. KDD '16, ACM, New York, NY, USA (2016)
5. Cook, T.R., Modig, Z.D., Palmer, N.M.: Explaining machine learning by bootstrapping partial marginal effects and Shapley values. FEDS Working Paper No. 2024-75 (2024)
6. Friedman, J.H.: Multivariate adaptive regression splines. The Annals of Statistics 19(1), 1-67 (1991)
7. Friedman, J.H.: Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189-1232 (2001)
8. Gkolemis, V., Dalamagas, T., Diou, C.: DALE: Differential accumulated local effects for efficient and accurate global explanations. In: Proceedings of The 14th Asian Conference on Machine Learning, pp. 375-390. PMLR (2023)
9. Gkolemis, V., Dalamagas, T., Ntoutsi, E., Diou, C.: RHALE: Robust and heterogeneity-aware accumulated local effects. In: ECAI 2023, pp. 859-866. IOS Press (2023)
10. Goldstein, A., Kapelner, A., Bleich, J., Pitkin, E.: Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics 24(1), 44-65 (2015)
11. Greenwell, B.: pdp: An R package for constructing partial dependence plots. The R Journal 9/1 (2017)
12. Herbinger, J., Bischl, B., Casalicchio, G.: REPID: Regional effect plots with implicit interaction detection.
In: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 10209-10233. PMLR (2022)
13. Herbinger, J., Wright, M.N., Nagler, T., Bischl, B., Casalicchio, G.: Decomposing global feature effects based on feature interactions. Journal of Machine Learning Research 25(381), 1-65 (2024)
14. Loecher, M.: Debiasing MDI feature importance and SHAP values in tree ensembles. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds.) Machine Learning and Knowledge Extraction, pp. 114-129. Springer International Publishing, Cham (2022)
15. Matsubara, Y., Chiba, N., Igarashi, R., Ushiku, Y.: Rethinking symbolic regression datasets and benchmarks for scientific discovery. Journal of Data-centric Machine Learning Research 1 (2024)
16. Molnar, C.: Interpretable machine learning: a guide for making black box models explainable. Christoph Molnar, Munich, Germany, 2nd edn. (2022)
17. Molnar, C., Freiesleben, T., König, G., Herbinger, J., Reisinger, T., Casalicchio, G., Wright, M.N., Bischl, B.: Relating the partial dependence plot and permutation feature importance to the data generating process. In: Longo, L. (ed.) Explainable Artificial Intelligence, pp. 456-479. Springer Nature Switzerland, Cham (2023)
18. Molnar, C., König, G., Herbinger, J., Freiesleben, T., Dandl, S., Scholbeck, C.A., Casalicchio, G., Grosse-Wentrup, M., Bischl, B.: General pitfalls of model-agnostic interpretation methods for machine learning models. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W. (eds.) xxAI - Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers, pp. 39-68. Springer International Publishing, Cham (2022)
19. Moosbauer, J., Herbinger, J., Casalicchio, G., Lindauer, M., Bischl, B.: Explaining hyperparameter optimization via partial dependence plots.
In: Proceedings of the 35th International Conference on Neural Information Processing Systems. NIPS '21, Curran Associates Inc., Red Hook, NY, USA (2021)
20. Morris, T.P., White, I.R., Crowther, M.J.: Using simulation studies to evaluate statistical methods. Statistics in Medicine 38(11), 2074-2102 (2019)
21. Probst, P., Boulesteix, A.L., Bischl, B.: Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research 20(53), 1-32 (2019)
22. Servén, D., Brummitt, C.: pygam: Generalized additive models in python (2018)

Appendix A  Theoretical Evidence

A.1 Assumptions

(i) Independence of feature effect definition and model. Feature effects (both PD and ALE) are defined using expectations independent of the randomness in the training of $\hat{f} \sim \mathbb{P}_F$, meaning $X_{\bar{S}} \perp \hat{f}$ for the theoretical PD/ALE.

(ii) Integrability of $\hat{f}$ and $f$. For all features $X_S$ and points $x_S$, the random variable $\hat{f}(x_S, X_{\bar{S}})$ is integrable and square-integrable under the joint law $\mathbb{P}_{(F, X_{\bar{S}})}$, i.e., $\mathbb{E}_{(F, X_{\bar{S}})}[|\hat{f}(x_S, X_{\bar{S}})|] < \infty$ and $\mathbb{E}_{(F, X_{\bar{S}})}[\hat{f}(x_S, X_{\bar{S}})^2] < \infty$. In particular, this implies that for $\mathbb{P}_F$-almost all realizations of $\hat{f}$ we have $\mathbb{E}_{X_{\bar{S}} \mid \hat{f}}[\hat{f}(x_S, X_{\bar{S}})^2] < \infty$ by Tonelli's theorem for conditional probabilities / stochastic kernels (since $\hat{f}(x_S, X_{\bar{S}})^2 \ge 0$). Likewise $\mathbb{E}_{X_{\bar{S}} \mid \hat{f}}[|\hat{f}(x_S, X_{\bar{S}})|] < \infty$. Analogously, $f(x_S, X_{\bar{S}})$ is integrable and square-integrable under $\mathbb{P}_{X_{\bar{S}}}$, i.e., $\mathbb{E}_{X_{\bar{S}}}[|f(x_S, X_{\bar{S}})|] < \infty$ and $\mathbb{E}_{X_{\bar{S}}}[f(x_S, X_{\bar{S}})^2] < \infty$. These conditions hold in particular for the special case that $X_{\bar{S}}$ and $\hat{f}$ are independent, e.g., under Assumption (i). They ensure that the relevant expectations and variances are well defined (in particular, PD, ALE, and their estimators) and justify interchanging the order of integration (Fubini/Tonelli) where needed.
(iii) Finite conditional probabilities. For all bins $I_S(k)$ considered, $\mathbb{P}(X_S \in I_S(k)) > 0$, and the conditional law $\mathbb{P}_{X_{\bar{S}} \mid X_S \in I_S(k)}$ is well-defined.

(iv) Uniformly bounded total variation in $x_S$. For all features $X_S$, there exists a nonnegative random variable $V(\hat{f})$ such that, with probability 1 over $\hat{f} \sim \mathbb{P}_F$,
$$\operatorname*{ess\,sup}_{x_{\bar{S}}} \operatorname{TV}\big(t \mapsto \hat{f}(t, x_{\bar{S}})\big) \le V(\hat{f}), \qquad \mathbb{E}_F[V(\hat{f})] < \infty.$$

(v) Integrability of finite differences. For all features $X_S$, all $K$, and all $k$ considered, the finite differences of $\hat{f}$ are integrable and square-integrable w.r.t. the joint law given the $k$-th bin, i.e., under $\mathbb{P}_{(F, X) \mid X_S \in I_S^K(k)}$. In other words, $\mathbb{E}_{(F, X) \mid X_S \in I_S^K(k)}[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})^2] < \infty$, and analogously to (ii) this implies $\mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k), \hat{f}}[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})^2] < \infty$ for $\mathbb{P}_F$-almost all $\hat{f}$. For integrability, we likewise have $\mathbb{E}_{(F, X) \mid X_S \in I_S^K(k)}[|\Delta_{\hat{f},S}^K(k, X_{\bar{S}})|] < \infty$. Note that this follows from (iv) plus the square-integrability condition in (ii). As in (ii), the same integrability conditions hold for $\Delta_{f,S}^K(k, X_{\bar{S}})$ (with $f$ in place of $\hat{f}$).

(vi) Existence and square-integrability of ALE targets. For all features $X_S$ and $h \in \{\hat{f}, f\}$, the theoretical uncentered ALE $\widetilde{\operatorname{ALE}}_h(x_S)$ (Eq. (2)) exists for all $x_S$. Moreover, the centered theoretical ALE $\operatorname{ALE}_h(x_S)$ exists and has finite second moment w.r.t. $X_S$, i.e., $\mathbb{E}_{X_S}\big[\operatorname{ALE}_h(X_S)^2\big] < \infty$.

A.2 Bias-Variance Decomposition of the Estimators (Eq. (4))

Proof. Fix a feature index $S$ and an evaluation point $x_S$. Let $D$ denote a random dataset used to estimate the feature effect. For better readability, we omit the point $x_S$.
Under Assumption (ii), for either dependent or independent $D$ and $\hat{f}$,
$$
\begin{aligned}
\mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big(PD_f - \widehat{PD}_{\hat f}\big)^2
&= \mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big[PD_f^2 - 2\,PD_f\,\widehat{PD}_{\hat f} + \widehat{PD}_{\hat f}^2\big] \\
&= PD_f^2 - 2\,PD_f\,\mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}^2\big] \\
&= PD_f^2 - 2\,PD_f\,\mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F \operatorname{Var}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F\Big[\mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big]^2\Big] \\
&= PD_f^2 - 2\,PD_f\,\mathbb{E}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F \operatorname{Var}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \operatorname{Var}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F\big[\mathbb{E}_{D\mid\hat f}[\widehat{PD}_{\hat f}]\big]^2 \\
&= \big(PD_f - \mathbb{E}_F \mathbb{E}_{D\mid\hat f}[\widehat{PD}_{\hat f}]\big)^2 + \operatorname{Var}_F \mathbb{E}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big] + \mathbb{E}_F \operatorname{Var}_{D\mid\hat f}\big[\widehat{PD}_{\hat f}\big].
\end{aligned}
$$
The proof for ALE follows by replacing $PD_f$ with $\operatorname{ALE}_f$ and $\widehat{PD}_{\hat f}$ with $\widehat{\operatorname{ALE}}_{\hat f}$ in the equations above. For this replacement argument, it suffices that all involved quantities are well-defined (Assumption (vi)) and have finite second moments (Assumption (v)), and that $f$, and hence $\operatorname{ALE}_f$, is non-random w.r.t. $\mathbb{P}_F$. ⊓⊔

A.3 Model Bias of the PD (Eq. (6))

Proof. Fix $x_S$. Consider the product probability space $(\hat{f}, X_{\bar{S}}) \sim \mathbb{P}_F \otimes \mathbb{P}_{X_{\bar{S}}}$ (cf. Assumption (i)). By Assumption (ii), $\hat{f}(x_S, X_{\bar{S}})$ is integrable under $\mathbb{P}_F \otimes \mathbb{P}_{X_{\bar{S}}}$, hence $PD_{\hat{f}}(x_S) = \mathbb{E}_{X_{\bar{S}}}[\hat{f}(x_S, X_{\bar{S}})]$ exists for $\mathbb{P}_F$-almost all realizations of $\hat{f}$ and is integrable w.r.t. $\mathbb{P}_F$. Now $\mathbb{E}_F PD_{\hat{f}}(x_S) - PD_f(x_S)$ equals
$$\mathbb{E}_F\Big[\mathbb{E}_{X_{\bar{S}}} \hat{f}(x_S, X_{\bar{S}})\Big] - \mathbb{E}_{X_{\bar{S}}} f(x_S, X_{\bar{S}}) = \mathbb{E}_{X_{\bar{S}}}\Big[\mathbb{E}_F \hat{f}(x_S, X_{\bar{S}}) - f(x_S, X_{\bar{S}})\Big].$$
This follows by an application of Fubini's theorem (exchange of integrals) since $\mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[|\hat{f}(x_S, X_{\bar{S}})|] < \infty$ by Assumption (ii). ⊓⊔

A.4 Model Variance of the PD (Eq. (7))

Proof. As in §A.3, fix $x_S$ and suppose Assumptions (i) and (ii) hold. Then, the variance of the theoretical model PD exists and is:
$$\operatorname{Var}_F PD_{\hat{f}}(x_S) = \mathbb{E}_F\Big[\big(\mathbb{E}_{X_{\bar{S}}}[\hat{f}(x_S, X_{\bar{S}})] - \mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[\hat{f}(x_S, X_{\bar{S}})]\big)^2\Big].$$
Applying Fubini's theorem, as $\mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[|\hat{f}(x_S, X_{\bar{S}})|] < \infty$ by Assumption (ii), we have $\mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[\hat{f}(x_S, X_{\bar{S}})] = \mathbb{E}_{X_{\bar{S}}} \mathbb{E}_F[\hat{f}(x_S, X_{\bar{S}})]$. By linearity of expectations,
$$\operatorname{Var}_F PD_{\hat{f}}(x_S) = \mathbb{E}_F\Big[\big(\mathbb{E}_{X_{\bar{S}}}\big[\hat{f}(x_S, X_{\bar{S}}) - \mathbb{E}_F[\hat{f}(x_S, X_{\bar{S}})]\big]\big)^2\Big].$$
Defining $U := \hat{f}(x_S, X_{\bar{S}}) - \mathbb{E}_F[\hat{f}(x_S, X_{\bar{S}})]$ and applying Jensen's inequality to $\varphi(t) = t^2$ yields $\mathbb{E}_{X_{\bar{S}}}[U]^2 \le \mathbb{E}_{X_{\bar{S}}}[U^2]$. Therefore, $\operatorname{Var}_F PD_{\hat{f}}(x_S) \le \mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[U^2]$. Since $U^2 \ge 0$, Tonelli's theorem completes the proof:
$$\operatorname{Var}_F PD_{\hat{f}}(x_S) \le \mathbb{E}_F \mathbb{E}_{X_{\bar{S}}}[U^2] = \mathbb{E}_{X_{\bar{S}}} \mathbb{E}_F[U^2] = \mathbb{E}_{X_{\bar{S}}} \operatorname{Var}_F \hat{f}(x_S, X_{\bar{S}}). \qquad ⊓⊔$$

A.5 Estimation Variance of the PD (Eq. (8))

Proof. Fix $x_S$ and a model $\hat{f}$. Suppose Assumption (ii) holds, and consider i.i.d. draws $D = \{X_{\bar{S}}^{(i)}\}_{i=1}^n$ from $\mathbb{P}_{X_{\bar{S}} \mid \hat{f}}$, conditional on $\hat{f}$. Define $\hat{Y}_S^{(i)} := \hat{f}(x_S, X_{\bar{S}}^{(i)})$; then by construction the $\hat{Y}_S^{(i)}$ are i.i.d. conditional on $\hat{f}$, and by Assumption (ii), we have $\mathbb{E}_{X_{\bar{S}}^{(i)} \mid \hat{f}}\big[(\hat{Y}_S^{(i)})^2\big] < \infty$ for $\mathbb{P}_F$-almost all $\hat{f}$. Thereby
$$\operatorname{Var}_{D \mid \hat{f}} \widehat{PD}_{\hat{f}}(x_S) = \operatorname{Var}_{D \mid \hat{f}}\Big[\frac{1}{n}\sum_{i=1}^n \hat{Y}_S^{(i)}\Big] = \frac{1}{n^2}\operatorname{Var}_{D \mid \hat{f}}\Big[\sum_{i=1}^n \hat{Y}_S^{(i)}\Big] = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}_{D \mid \hat{f}}\big[\hat{Y}_S^{(i)}\big] = \frac{1}{n^2}\, n \operatorname{Var}_{D \mid \hat{f}}\big[\hat{Y}_S^{(1)}\big] = \frac{1}{n}\operatorname{Var}_{D \mid \hat{f}} \hat{f}(x_S, X_{\bar{S}}),$$
with $X_{\bar{S}}$ in the last step being drawn conditional on $\hat{f}$, so from $D \mid \hat{f}$. Taking the expectation over the training randomness of $\hat{f}$ completes the proof. ⊓⊔

A.6 ALE Estimator Bias w.r.t. Binned Population ALE (§4.2)

Proof. Fix $x_S$, bins $\{I_S(k)\}_{k=1}^K$, and a realization of $\hat{f}$. Suppose Assumptions (iii) and (v) hold. Moreover, assume $D = \{X^{(i)}\}_{i=1}^n$ (used for estimation) are i.i.d. draws from $\mathbb{P}_X$, independent of $\hat{f}$, and assume $n_S(k) > 0$ for all bins $k \le k_S(x_S)$. Let $B = (B^{(1)}, \dots, B^{(n)})$ be the bin assignments of the samples, where $B^{(i)} = k_S(X_S^{(i)})$. Then, conditional on the full assignment vector $B$, i.e., on the product event $\bigcap_{i=1}^n \{X_S^{(i)} \in I_S(B^{(i)})\}$, the samples remain independent; in particular, for each $k$ the subcollection $\{X_{\bar{S}}^{(i)} : B^{(i)} = k\}$ is i.i.d. with $X_{\bar{S}}^{(i)} \sim \mathbb{P}_{X_{\bar{S}} \mid X_S \in I_S(k)}$. Therefore, using the law of total expectation and $D \perp \hat{f}$, we obtain
$$
\begin{aligned}
\mathbb{E}_{D \mid \hat{f}}\, \widehat{\widetilde{\operatorname{ALE}}}_{\hat{f}}(x_S)
&= \mathbb{E}_B \mathbb{E}_{D \mid B}\Bigg[\sum_{k=1}^{k_S(x_S)} \frac{1}{n_S(k)} \sum_{i:\, X_S^{(i)} \in I_S(k)} \Delta_{\hat{f}}(k, X_{\bar{S}}^{(i)})\Bigg]
= \mathbb{E}_B\Bigg[\sum_{k=1}^{k_S(x_S)} \frac{1}{n_S(k)} \sum_{i:\, X_S^{(i)} \in I_S(k)} \mathbb{E}_{D \mid B}\big[\Delta_{\hat{f}}(k, X_{\bar{S}}^{(i)})\big]\Bigg] \\
&= \mathbb{E}_B\Bigg[\sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S(k)}\big[\Delta_{\hat{f}}(k, X_{\bar{S}})\big]\Bigg]
= \mathbb{E}_B\, \widetilde{\operatorname{ALE}}_{\hat{f}}^K(x_S) = \widetilde{\operatorname{ALE}}_{\hat{f}}^K(x_S).
\end{aligned}
$$
Thus, $\widehat{\widetilde{\operatorname{ALE}}}_{\hat{f}}$ is unbiased for $\widetilde{\operatorname{ALE}}_{\hat{f}}^K$: $\mathbb{E}_{D \mid \hat{f}}\, \widehat{\widetilde{\operatorname{ALE}}}_{\hat{f}}(x_S) - \widetilde{\operatorname{ALE}}_{\hat{f}}^K(x_S) = 0$. ⊓⊔

A.7 Model Bias of the ALE (§4.2)

Proof. Fix $x_S$ and let $\hat{f} \sim \mathbb{P}_F$. Suppose Assumptions (i), (iii), (iv), and (v) hold, so $(X_S, X_{\bar{S}}) \sim \mathbb{P}_X$ is an independent population draw, i.e., $(X_S, X_{\bar{S}}) \perp \hat{f}$, and $\widetilde{\operatorname{ALE}}_{\hat{f}}(x_S)$ and $\widetilde{\operatorname{ALE}}_f(x_S)$ exist by Assumption (vi). For $K \in \mathbb{N}$, define $g_K(\hat{f}) := \sum_{k=1}^{k_S^K(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)} \Delta_{\hat{f},S}^K(k, X_{\bar{S}})$, so that $\widetilde{\operatorname{ALE}}_{\hat{f}}(x_S) = \lim_{K \to \infty} g_K(\hat{f})$ by definition. Then,
$$
\mathbb{E}_F\big[\widetilde{\operatorname{ALE}}_{\hat{f}}(x_S)\big]
= \int_F \lim_{K \to \infty} g_K(\hat{f}) \, d\mathbb{P}_F(\hat{f})
= \lim_{K \to \infty} \int_F g_K(\hat{f}) \, d\mathbb{P}_F(\hat{f})
= \lim_{K \to \infty} \sum_{k=1}^{k_S^K(x_S)} \int_F \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}\big[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})\big] \, d\mathbb{P}_F(\hat{f})
= \lim_{K \to \infty} \sum_{k=1}^{k_S^K(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}\Big[\mathbb{E}_F\big[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})\big]\Big].
$$
The second equality follows from the Dominated Convergence Theorem.
Indeed, $g_K(\hat{f}) \to \widetilde{\operatorname{ALE}}_{\hat{f}}(x_S)$ pointwise in $\hat{f}$ by definition, and for every $K$,
$$
|g_K(\hat{f})|
\le \sum_{k=1}^{k_S^K(x_S)} \Big|\mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}\big[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})\big]\Big|
\le \sum_{k=1}^{k_S^K(x_S)} \operatorname*{ess\,sup}_{x_{\bar{S}}} \big|\Delta_{\hat{f},S}^K(k, x_{\bar{S}})\big|
\le \operatorname*{ess\,sup}_{x_{\bar{S}}} \sum_{k=1}^{k_S^K(x_S)} \big|\hat{f}(z_{k,S}^K, x_{\bar{S}}) - \hat{f}(z_{k-1,S}^K, x_{\bar{S}})\big|
\le \operatorname*{ess\,sup}_{x_{\bar{S}}} \operatorname{TV}\big(t \mapsto \hat{f}(t, x_{\bar{S}})\big)
\le V(\hat{f}),
$$
where $V(\hat{f})$ is the integrable bound from Assumption (iv). The last equality above follows from Fubini's theorem, using Assumption (v) (integrability) for $\mathbb{P}_{X_{\bar{S}} \mid X_S \in I_S^K(k), \hat{f}} = \mathbb{P}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}$ since $(X_S, X_{\bar{S}}) \perp \hat{f}$. By linearity of expectation,
$$\mathbb{E}_F\big[\widetilde{\operatorname{ALE}}_{\hat{f}}(x_S)\big] - \widetilde{\operatorname{ALE}}_f(x_S) = \lim_{K \to \infty} \sum_{k=1}^{k_S^K(x_S)} \mathbb{E}_{X_{\bar{S}} \mid X_S \in I_S^K(k)}\Big[\mathbb{E}_F\big[\Delta_{\hat{f},S}^K(k, X_{\bar{S}})\big] - \Delta_{f,S}^K(k, X_{\bar{S}})\Big].$$
Thus, this bias term vanishes if the finite differences of $\hat{f}$ are unbiased w.r.t. those of $f$. One sufficient condition is that $\hat{f}$ is a pointwise unbiased estimator of $f$. ⊓⊔

A.8 Model Variance of the ALE (Eq. (11))

Proof. Fix $x_S$, let $\hat{f} \sim \mathbb{P}_F$, and suppose Assumptions (i), (iii), and (v) hold, so that as in §A.7, for each bin index $k$, all expectations w.r.t. $\mathbb{P}_{X_{\bar{S}} \mid \mathcal{I}_k}$ are independent of $\hat{f}$. For readability, we abbreviate the event $\{X_S \in I_S(k)\}$ by $\mathcal{I}_k$. Then:
$$
\begin{aligned}
\operatorname{Var}_F \widetilde{\operatorname{ALE}}_{\hat{f}}^K(x_S)
&= \operatorname{Var}_F\Bigg[\sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k} \Delta_{\hat{f}}(k, X_{\bar{S}})\Bigg]
= \mathbb{E}_F\Bigg[\Bigg(\sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k} \Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F\Bigg[\sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k} \Delta_{\hat{f}}(k, X_{\bar{S}})\Bigg]\Bigg)^2\Bigg] \\
&= \mathbb{E}_F\Bigg[\Bigg(\sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k}\Big[\Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F \Delta_{\hat{f}}(k, X_{\bar{S}})\Big]\Bigg)^2\Bigg].
\end{aligned}
$$
The last line follows from the linearity of expectations and exchanging $\mathbb{E}_F$ and $\mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k}$ by an application of Fubini's theorem. Indeed, using Assumption (v) for $\hat{f} \perp (X_S, X_{\bar{S}})$ ensures that the resulting joint integrand is integrable.
Define $U_k := \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k}\big[\Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F \Delta_{\hat{f}}(k, X_{\bar{S}})\big]$, and let $U = \big(U_1, \dots, U_{k_S(x_S)}\big)^\top$. By the Cauchy-Schwarz inequality, $\big(\sum_{k=1}^{k_S(x_S)} U_k\big)^2 = \langle \mathbf{1}, U \rangle^2 \le \|\mathbf{1}\|^2 \|U\|^2 = k_S(x_S) \sum_{k=1}^{k_S(x_S)} U_k^2$. Moreover, $\mathbb{E}_F\big[k_S(x_S) \sum_{k=1}^{k_S(x_S)} U_k^2\big] = k_S(x_S) \sum_{k=1}^{k_S(x_S)} \mathbb{E}_F U_k^2$ by linearity of expectation. By Assumption (v), applying Jensen's inequality to $\varphi(t) = t^2$, we obtain $U_k^2 \le \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k}\big[\big(\Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F[\Delta_{\hat{f}}(k, X_{\bar{S}})]\big)^2\big]$. Putting everything together and applying Tonelli's theorem (the integrand $U_k^2$ is nonnegative) gives the proof:
$$
\operatorname{Var}_F \widetilde{\operatorname{ALE}}_{\hat{f}}^K(x_S)
\le k_S(x_S) \sum_{k=1}^{k_S(x_S)} \mathbb{E}_F \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k}\Big[\big(\Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F \Delta_{\hat{f}}(k, X_{\bar{S}})\big)^2\Big]
= k_S(x_S) \sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k} \mathbb{E}_F\Big[\big(\Delta_{\hat{f}}(k, X_{\bar{S}}) - \mathbb{E}_F \Delta_{\hat{f}}(k, X_{\bar{S}})\big)^2\Big]
= k_S(x_S) \sum_{k=1}^{k_S(x_S)} \mathbb{E}_{X_{\bar{S}} \mid \mathcal{I}_k} \operatorname{Var}_F\big[\Delta_{\hat{f}}(k, X_{\bar{S}})\big]. \qquad \square
$$

A.9 Estimation Variance of the ALE (Eq. (12))

Proof. Fix $x_S$ and bins $I_S(k)$, and suppose Assumptions (iii), (v), and (vi) hold, in particular with $D = \{X^{(i)}\}_{i=1}^n$ being i.i.d. draws from $\mathbb{P}_{X \mid \hat{f}}$, conditional on $\hat{f}$, and assume $n_S(k) > 0 \;\forall k$. For any fixed $k$, the variance of the finite differences $\sigma_k^2(\hat{f}) := \operatorname{Var}_{X_{\bar{S}} \mid X_S \in I_S(k), \hat{f}}[\Delta_{\hat{f}}(k, X_{\bar{S}})]$ is finite by Assumption (v): $\sigma_k^2(\hat{f}) < \infty$. Now, for each $k$, conditional on a model $\hat{f}$ and on the full bin assignment vector $B$ of all samples, the variables $\{\Delta_{\hat{f}}(k, X_{\bar{S}}^{(i)}) : B^{(i)} = k\}$ (all those falling into $I_S(k)$) are i.i.d. with variance $\sigma_k^2(\hat{f})$.
Since also $n_S(k)$ and the events $\{X_S^{(i)} \in I_S(k)\}$ are fixed conditional on $B$ and $\hat{f}$ (similar to §A.6), this gives us $\operatorname{Var}_{D \mid \hat{f}, B}[\hat{\mu}_k(\hat{f})] = \sigma_k^2(\hat{f}) / n_S(k)$, where we define the unaccumulated value of the $k$-th bin as $\hat{\mu}_k(\hat{f}) := \frac{1}{n_S(k)} \sum_{i:\, X_S^{(i)} \in I_S(k)} \Delta_{\hat{f}}\big(k, X_{\bar{S}}^{(i)}\big)$. Moreover, conditional on $B$ and $\hat{f}$, different bins use disjoint subsets of the i.i.d. sample, hence $\operatorname{Cov}_{D \mid \hat{f}, B}[\hat{\mu}_k(\hat{f}), \hat{\mu}_\ell(\hat{f})] = 0$ for $k \neq \ell$. Altogether we get:
$$
\mathbb{E}_F \mathbb{E}_{B \mid \hat{f}} \operatorname{Var}_{D \mid \hat{f}, B}\big[\widehat{\widetilde{\operatorname{ALE}}}_{\hat{f}}(x_S)\big]
= \mathbb{E}_F \mathbb{E}_{B \mid \hat{f}} \operatorname{Var}_{D \mid \hat{f}, B}\Bigg[\sum_{k \le k_S(x_S)} \hat{\mu}_k(\hat{f})\Bigg]
= \sum_{k \le k_S(x_S)} \mathbb{E}_F \mathbb{E}_{B \mid \hat{f}} \operatorname{Var}_{D \mid \hat{f}, B}\big[\hat{\mu}_k(\hat{f})\big]
= \sum_{k \le k_S(x_S)} \mathbb{E}_F \mathbb{E}_{B \mid \hat{f}}\Big[\frac{1}{n_S(k)}\, \sigma_k^2(\hat{f})\Big]. \qquad \square
$$

B Simulation Details & Results

B.1 Models, Hyperparameters, and Model Performances

The XGBoost implementation from xgboost [4] and the GAM implementation from pyGAM [22] were used. For the GAM, we considered the number of basis functions ($\in [5, 50]$) and the penalization control parameter ($\in [0.001, 1000]$, log-uniform) as tuning parameters. For XGBoost, we used the parameter spaces from [21] (confined to trees as base learners). Hyperparameter configurations for the overfitting models (OF) were carefully hand-picked to achieve strong performance on the training data while performing relatively poorly on holdout data. The optimal hyperparameters (OT) were selected by tuning the models on a separate data sample of size $n$ (1250 or 10000) for 200 trials using a Tree-structured Parzen Estimator (TPE) [2] with the MSE on separate holdout data (10000 samples for reliable performance estimates) as the minimization objective. For both OF and OT models, the final selected hyperparameter configurations can be found on GitHub (link).
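The per-bin form of the ALE estimation variance derived in §A.9, a sum of $\sigma_k^2(\hat{f})/n_S(k)$ terms over the accumulated bins, can also be checked numerically. The following sketch draws zero-mean finite differences directly instead of fitting a model; the bin counts and per-bin variances are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(5)

# toy per-bin sample counts n_S(k) and finite-difference variances sigma_k^2
# (illustrative values only, not taken from the experiments)
n_k = np.array([20, 40, 80])
sigma2_k = np.array([0.5, 1.0, 2.0])

def ale_uncentered_draw():
    # per-bin mean finite difference, accumulated over all bins; differences
    # are simulated as zero-mean normals since only their variance matters
    mu_hat = [rng.normal(0.0, np.sqrt(s2), size=n).mean()
              for n, s2 in zip(n_k, sigma2_k)]
    return float(np.sum(mu_hat))

draws = np.array([ale_uncentered_draw() for _ in range(5000)])
empirical = float(draws.var())
predicted = float(np.sum(sigma2_k / n_k))  # sum_k sigma_k^2 / n_S(k)
```

With these toy values, `predicted` is approximately 0.075, and the empirical variance of the simulated estimator matches it up to Monte Carlo error, since the per-bin means are independent given the bin counts.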
We evaluate model performance across all repetitions on both the training and test data, with 10000 test samples to obtain reliable performance estimates. As intended, the overfitting models show better training performance than the optimally tuned models, but underperform on test data and exhibit higher variance in their generalization performance. A linear regression model is used as a baseline and is outperformed by all OT models. These performance metrics can be found on GitHub (link).

B.2 Further Results: Bias-Variance Analysis

Table 6: Results for PD on Friedman1, averaged over 100 grid points. Bold numbers indicate the minimum per metric-feature-model-size combination.

                    ----- Feature x1 -----  ----- Feature x3 -----  ----- Feature x5 -----
                     MSE    Bias    Var      MSE    Bias    Var      MSE    Bias    Var
n = 1250
GAM_OF   train     0.1184  0.0720  0.1173  0.1285  0.0639  0.1289  0.1210  0.0631  0.1212
         val       0.1968  0.0929  0.1949  0.2084  0.0907  0.2073  0.2302  0.0955  0.2289
         CV        0.1062  0.0697  0.1049  0.1094  0.0650  0.1090  0.1057  0.0652  0.1051
GAM_OT   train     0.0166  0.0190  0.0168  0.0146  0.0238  0.0145  0.0145  0.0166  0.0147
         val       0.0264  0.0266  0.0266  0.0187  0.0236  0.0188  0.0199  0.0221  0.0201
         CV        0.0165  0.0182  0.0167  0.0144  0.0236  0.0143  0.0143  0.0166  0.0146
XGB_OF   train     0.1361  0.2297  0.0862  0.4853  0.6748  0.0310  0.1142  0.2767  0.0390
         val       0.1915  0.2885  0.1120  0.6402  0.7744  0.0419  0.1718  0.3097  0.0786
         CV        0.1196  0.2742  0.0460  0.5429  0.7258  0.0167  0.1055  0.2853  0.0249
XGB_OT   train     0.0327  0.1081  0.0217  0.0393  0.1511  0.0170  0.0208  0.0854  0.0140
         val       0.0416  0.1206  0.0280  0.0509  0.1752  0.0209  0.0245  0.0925  0.0165
         CV        0.0334  0.1227  0.0190  0.0439  0.1718  0.0149  0.0209  0.0941  0.0125
n = 10000
GAM_OF   train     0.0094  0.0178  0.0094  0.0086  0.0173  0.0086  0.0096  0.0192  0.0096
         val       0.0137  0.0212  0.0137  0.0117  0.0211  0.0117  0.0129  0.0235  0.0128
         CV        0.0095  0.0178  0.0095  0.0087  0.0173  0.0086  0.0097  0.0189  0.0096
GAM_OT   train     0.0005  0.0082  0.0005  0.0004  0.0024  0.0004  0.0004  0.0035  0.0004
         val       0.0013  0.0076  0.0013  0.0005  0.0021  0.0005  0.0005  0.0049  0.0005
         CV        0.0005  0.0082  0.0005  0.0004  0.0024  0.0004  0.0004  0.0036  0.0004
XGB_OF   train     0.0232  0.1194  0.0092  0.0675  0.2523  0.0040  0.0302  0.1594  0.0049
         val       0.0326  0.1370  0.0143  0.0924  0.2949  0.0057  0.0386  0.1776  0.0073
         CV        0.0235  0.1314  0.0065  0.0860  0.2886  0.0028  0.0344  0.1753  0.0038
XGB_OT   train     0.0060  0.0266  0.0055  0.0060  0.0374  0.0047  0.0044  0.0274  0.0038
         val       0.0078  0.0318  0.0070  0.0073  0.0428  0.0057  0.0055  0.0327  0.0046
         CV        0.0052  0.0293  0.0045  0.0056  0.0427  0.0039  0.0041  0.0310  0.0033

Table 7: Results for ALE on Friedman1, averaged over 100 grid points. Bold numbers indicate the minimum per metric-feature-model-size combination.

                    ----- Feature x1 -----  ----- Feature x3 -----  ----- Feature x5 -----
                     MSE    Bias    Var      MSE    Bias    Var      MSE    Bias    Var
n = 1250
GAM_OF   train     0.1280  0.0809  0.1258  0.1290  0.0629  0.1295  0.1246  0.0642  0.1248
         val       0.4060  0.1738  0.3892  0.4199  0.2139  0.3875  0.4212  0.1265  0.4196
         CV        0.1745  0.1855  0.1451  0.1497  0.1479  0.1324  0.1212  0.0951  0.1162
GAM_OT   train     0.0206  0.0336  0.0201  0.0146  0.0235  0.0145  0.0144  0.0165  0.0146
         val       0.0795  0.1668  0.0535  0.0403  0.1219  0.0263  0.0316  0.1086  0.0205
         CV        0.0400  0.1449  0.0196  0.0273  0.1159  0.0144  0.0257  0.1147  0.0130
XGB_OF   train     3.6571  1.6329  1.0248  1.2235  0.8780  0.4683  2.6986  1.4763  0.5371
         val       4.8262  0.5670  4.6600  1.5224  0.7511  0.9914  1.9319  0.4747  1.7654
         CV        0.8781  0.4926  0.6573  0.7944  0.7932  0.1710  0.5237  0.4235  0.3562
XGB_OT   train     0.0591  0.1629  0.0337  0.0344  0.0961  0.0260  0.0482  0.1701  0.0199
         val       0.1411  0.2250  0.0937  0.1314  0.2818  0.0538  0.0770  0.1795  0.0463
         CV        0.0878  0.2428  0.0299  0.0920  0.2690  0.0203  0.0563  0.1989  0.0173
n = 10000
GAM_OF   train     0.0107  0.0272  0.0103  0.0087  0.0171  0.0087  0.0096  0.0191  0.0096
         val       0.0177  0.0312  0.0173  0.0127  0.0214  0.0127  0.0135  0.0237  0.0133
         CV        0.0107  0.0267  0.0103  0.0088  0.0171  0.0088  0.0097  0.0189  0.0097
GAM_OT   train     0.0016  0.0239  0.0011  0.0004  0.0023  0.0004  0.0004  0.0036  0.0004
         val       0.0036  0.0233  0.0031  0.0005  0.0023  0.0005  0.0005  0.0048  0.0005
         CV        0.0016  0.0239  0.0011  0.0004  0.0024  0.0004  0.0004  0.0036  0.0004
XGB_OF   train     0.6796  0.8044  0.0336  0.4915  0.6802  0.0299  0.8473  0.9057  0.0279
         val       0.2098  0.1518  0.1932  0.1711  0.3057  0.0803  0.1157  0.1788  0.0866
         CV        0.0485  0.1308  0.0324  0.1091  0.3024  0.0183  0.0499  0.1821  0.0173
XGB_OT   train     0.0070  0.0252  0.0066  0.0055  0.0209  0.0052  0.0045  0.0164  0.0044
         val       0.0166  0.0392  0.0155  0.0112  0.0483  0.0091  0.0075  0.0291  0.0069
         CV        0.0064  0.0276  0.0058  0.0062  0.0436  0.0044  0.0045  0.0307  0.0037

Table 8: Results for PD on Feynman I.29.16, averaged over 100 grid points. Bold numbers indicate the minimum per metric-feature-model-size combination.

                    ----- Feature x1 -----  ----- Feature θ1 -----  ----- Feature d1 -----
                     MSE    Bias    Var      MSE    Bias    Var      MSE    Bias    Var
n = 1250
GAM_OF   train     0.2916  0.1357  0.2826  0.0660  0.0482  0.0659  0.0693  0.0497  0.0692
         val       0.4493  0.1492  0.4417  0.0988  0.0617  0.0982  0.1011  0.0528  0.1017
         CV        0.1577  0.0749  0.1574  0.0611  0.0458  0.0610  0.0650  0.0485  0.0648
GAM_OT   train     0.0318  0.0358  0.0316  0.0199  0.0281  0.0197  0.0179  0.0194  0.0182
         val       0.0430  0.0370  0.0431  0.0255  0.0310  0.0254  0.0215  0.0224  0.0217
         CV        0.0299  0.0344  0.0297  0.0186  0.0273  0.0185  0.0170  0.0193  0.0172
XGB_OF   train     0.0681  0.0623  0.0664  0.0032  0.0209  0.0029  0.0030  0.0105  0.0030
         val       0.0878  0.0715  0.0855  0.0049  0.0245  0.0044  0.0039  0.0178  0.0037
         CV        0.0429  0.0408  0.0426  0.0021  0.0217  0.0016  0.0014  0.0044  0.0014
XGB_OT   train     0.0316  0.0746  0.0270  0.0095  0.0229  0.0093  0.0074  0.0138  0.0074
         val       0.0404  0.0813  0.0349  0.0117  0.0227  0.0116  0.0083  0.0162  0.0083
         CV        0.0297  0.0865  0.0230  0.0079  0.0221  0.0077  0.0063  0.0129  0.0064
n = 10000
GAM_OF   train     0.0146  0.0221  0.0146  0.0072  0.0214  0.0070  0.0071  0.0160  0.0071
         val       0.0192  0.0253  0.0192  0.0090  0.0217  0.0088  0.0090  0.0184  0.0089
         CV        0.0142  0.0219  0.0142  0.0070  0.0212  0.0068  0.0069  0.0156  0.0068
GAM_OT   train     0.0061  0.0135  0.0061  0.0036  0.0179  0.0034  0.0033  0.0123  0.0033
         val       0.0075  0.0166  0.0075  0.0040  0.0156  0.0039  0.0039  0.0134  0.0039
         CV        0.0058  0.0134  0.0058  0.0034  0.0178  0.0032  0.0032  0.0121  0.0032
XGB_OF   train     0.0103  0.0198  0.0103  0.0014  0.0293  0.0006  0.0007  0.0080  0.0006
         val       0.0124  0.0207  0.0124  0.0017  0.0293  0.0008  0.0007  0.0078  0.0007
         CV        0.0066  0.0156  0.0065  0.0012  0.0290  0.0004  0.0004  0.0069  0.0004
XGB_OT   train     0.0057  0.0181  0.0056  0.0028  0.0203  0.0024  0.0023  0.0098  0.0023
         val       0.0072  0.0204  0.0070  0.0033  0.0202  0.0030  0.0026  0.0106  0.0025
         CV        0.0045  0.0185  0.0043  0.0024  0.0204  0.0020  0.0018  0.0093  0.0018

Table 9: Results for ALE on Feynman I.29.16, averaged over 100 grid points. Bold numbers indicate the minimum per metric-feature-model-size combination.

                    ----- Feature x1 -----  ----- Feature θ1 -----  ----- Feature d1 -----
                     MSE    Bias    Var      MSE    Bias    Var      MSE    Bias    Var
n = 1250
GAM_OF   train     0.3122  0.1226  0.3075  0.0701  0.0591  0.0689  0.0738  0.0506  0.0737
         val       0.7270  0.2335  0.6956  0.2011  0.0703  0.2029  0.1729  0.0797  0.1723
         CV        0.2393  0.2227  0.1963  0.0712  0.0503  0.0711  0.0786  0.0556  0.0781
GAM_OT   train     0.0337  0.0405  0.0332  0.0214  0.0400  0.0205  0.0179  0.0195  0.0181
         val       0.0827  0.1494  0.0625  0.0322  0.0437  0.0314  0.0220  0.0230  0.0222
         CV        0.0604  0.1697  0.0327  0.0185  0.0409  0.0174  0.0146  0.0183  0.0148
XGB_OF   train     0.5272  0.3080  0.4473  0.1359  0.0539  0.1376  0.1118  0.0677  0.1109
         val       1.8846  0.2021  1.9073  0.1266  0.0616  0.1270  0.0653  0.0428  0.0656
         CV        0.3804  0.2820  0.3113  0.0381  0.0520  0.0366  0.0338  0.0502  0.0323
XGB_OT   train     0.0362  0.0421  0.0356  0.0186  0.0379  0.0177  0.0129  0.0176  0.0130
         val       0.1037  0.1760  0.0752  0.0318  0.0277  0.0321  0.0189  0.0240  0.0189
         CV        0.0790  0.2141  0.0344  0.0120  0.0397  0.0108  0.0076  0.0146  0.0076
n = 10000
GAM_OF   train     0.0149  0.0257  0.0147  0.0079  0.0304  0.0072  0.0072  0.0162  0.0072
         val       0.0219  0.0294  0.0218  0.0114  0.0308  0.0108  0.0096  0.0190  0.0096
         CV        0.0148  0.0255  0.0147  0.0077  0.0303  0.0070  0.0070  0.0156  0.0070
GAM_OT   train     0.0063  0.0191  0.0062  0.0041  0.0279  0.0035  0.0033  0.0123  0.0033
         val       0.0081  0.0230  0.0078  0.0052  0.0268  0.0046  0.0040  0.0134  0.0039
         CV        0.0061  0.0191  0.0059  0.0040  0.0278  0.0033  0.0032  0.0120  0.0032
XGB_OF   train     0.0470  0.0386  0.0471  0.0199  0.0541  0.0175  0.0105  0.0331  0.0098
         val       0.2569  0.0855  0.2582  0.0197  0.0568  0.0171  0.0129  0.0196  0.0129
         CV        0.0358  0.0354  0.0357  0.0059  0.0459  0.0039  0.0037  0.0178  0.0035
XGB_OT   train     0.0069  0.0154  0.0069  0.0045  0.0241  0.0041  0.0029  0.0116  0.0029
         val       0.0154  0.0212  0.0154  0.0087  0.0293  0.0081  0.0043  0.0132  0.0043
         CV        0.0058  0.0256  0.0054  0.0043  0.0293  0.0035  0.0022  0.0098  0.0022

Table 10: Decomposition results of total variance VarTot into model variance VarMod and estimation variance VarEst on SimpleNormalCorrelated, averaged over 100 grid points. Bold numbers indicate the estimation strategy with minimal variance.

                    ----- Feature x1 -----  ----- Feature x2 -----  ----- Feature x3 -----
                    VarTot  VarMod  VarEst  VarTot  VarMod  VarEst  VarTot  VarMod  VarEst
PD
XGB_OF   train     0.0817  0.0813  0.0004  0.2263  0.2259  0.0004  0.0010  0.0009  0.0001
         val       0.0857  0.0837  0.0020  0.2416  0.2395  0.0021  0.0014  0.0011  0.0003
         CV        0.0505  0.0501  0.0004  0.1291  0.1287  0.0004  0.0007  0.0006  0.0001
XGB_OT   train     0.0113  0.0112  0.0001  0.0141  0.0141  0.0001  0.0014  0.0014  0.0000
         val       0.0123  0.0119  0.0004  0.0155  0.0151  0.0004  0.0019  0.0019  0.0000
         CV        0.0090  0.0089  0.0001  0.0104  0.0103  0.0001  0.0013  0.0013  0.0000
ALE
XGB_OF   train     0.1012  0.0161  0.0851  0.1011  0.0249  0.0762  0.0741  0.0636  0.0105
         val       0.5281  0.0359  0.4923  0.6017  0.1743  0.4275  0.0864  0.0424  0.0439
         CV        0.1520  0.0620  0.0900  0.0994  0.0110  0.0884  0.0121  0.0003  0.0118
XGB_OT   train     0.0088  0.0077  0.0011  0.0119  0.0107  0.0011  0.0066  0.0062  0.0005
         val       0.0263  0.0124  0.0140  0.0275  0.0089  0.0186  0.0033  0.0003  0.0029
         CV        0.0077  0.0049  0.0029  0.0103  0.0065  0.0038  0.0020  0.0014  0.0007

Table 11: Decomposition results of total variance VarTot into model variance VarMod and estimation variance VarEst on Feynman I.29.16, averaged over 100 grid points. Bold numbers indicate the estimation strategy with minimal variance.

                    ----- Feature x1 -----  ----- Feature θ1 -----  ----- Feature d1 -----
                    VarTot  VarMod  VarEst  VarTot  VarMod  VarEst  VarTot  VarMod  VarEst
PD
XGB_OF   train     0.0664  0.0652  0.0013  0.0029  0.0027  0.0002  0.0030  0.0029  0.0002
         val       0.0855  0.0794  0.0061  0.0044  0.0035  0.0009  0.0037  0.0029  0.0008
         CV        0.0426  0.0414  0.0012  0.0016  0.0015  0.0002  0.0014  0.0013  0.0001
XGB_OT   train     0.0270  0.0266  0.0004  0.0093  0.0091  0.0002  0.0074  0.0074  0.0000
         val       0.0349  0.0334  0.0015  0.0116  0.0108  0.0007  0.0083  0.0081  0.0001
         CV        0.0230  0.0226  0.0004  0.0077  0.0076  0.0002  0.0064  0.0064  0.0000
ALE
XGB_OF   train     0.4473  0.0498  0.3975  0.1376  0.1135  0.0240  0.1109  0.0864  0.0245
         val       1.9073    - *   2.3566  0.1270    - *   0.1507  0.0656    - *   0.1164
         CV        0.3113    - *   0.4167  0.0366  0.0099  0.0267  0.0323  0.0071  0.0252
XGB_OT   train     0.0356  0.0302  0.0054  0.0177  0.0141  0.0037  0.0130  0.0112  0.0017
         val       0.0752  0.0186  0.0565  0.0321  0.0096  0.0226  0.0189  0.0076  0.0113
         CV        0.0344  0.0227  0.0116  0.0108  0.0065  0.0043  0.0076  0.0053  0.0024

* Due to instabilities in variance estimation for ALE, likely caused by unstable bin assignments, the estimated VarEst sometimes exceeds the estimated VarTot. We omit VarMod in these cases.

[Figure 3: four log-log panels plotting mean estimation error against n (from 10^1 to 10^6) for features x1 and x3, for (a) PD and (b) ALE, each with 1/n and K/n reference lines.]

Fig. 3: Mean estimation errors on SimpleNormalCorrelated for X1 and X3. For each sample size n, variances are averaged over all grid points. Axes are log-scale.
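The stronger sample-size sensitivity of ALE visible in Fig. 3 stems from its bin structure: each of the K bins averages local finite differences over only about n/K observations before accumulation, matching the K/n reference line. The following numpy sketch implements a first-order ALE estimator with equal-frequency bins; the names and details are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def ale_1d(predict, X, s, K=10):
    """First-order ALE for feature s: average finite differences of the
    prediction within K equal-frequency bins, then accumulate and center."""
    z = np.quantile(X[:, s], np.linspace(0.0, 1.0, K + 1))  # bin edges
    z[0] -= 1e-9                             # make the lowest bin inclusive
    bins = np.clip(np.searchsorted(z, X[:, s], side="right") - 1, 0, K - 1)
    mu = np.zeros(K)                         # unaccumulated bin values
    for k in range(K):
        mask = bins == k
        if mask.any():
            lo, hi = X[mask].copy(), X[mask].copy()
            lo[:, s], hi[:, s] = z[k], z[k + 1]
            mu[k] = (predict(hi) - predict(lo)).mean()  # mean local effect
    ale = np.cumsum(mu)                      # accumulate over bins
    return z, ale - ale.mean()               # center to mean zero

# Toy check with a linear model: each bin's local effect is exactly
# 3 * (bin width), so the accumulated effect is linear in x_s.
rng = np.random.default_rng(1)
X = rng.uniform(size=(5000, 2))
f = lambda Z: 3.0 * Z[:, 0] + Z[:, 1]
edges, ale = ale_1d(f, X, s=0, K=5)
```

Because each of the K bin means `mu[k]` is computed from a disjoint subset of roughly n/K points, the variances of the bin means add up under accumulation, which is the mechanism behind the sigma_k^2 / n_S(k) terms in the variance decomposition above.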