Robust Learning via Cause-Effect Models

Bernhard Schölkopf, Dominik Janzing, Jonas Peters & Kun Zhang
Max Planck Institute for Intelligent Systems
Spemannstr. 38, Tübingen, Germany
{first.last@tuebingen.mpg.de}

Abstract

We consider the problem of function estimation in the case where the data distribution may shift between training and test time, and additional information about it may be available at test time. This relates to popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. This working paper discusses how these tasks could be tackled depending on the kind of changes of the distributions. It argues that knowledge of an underlying causal direction can facilitate several of these tasks.

1 Introduction

By and large, statistical machine learning exploits statistical associations or dependences between variables to make predictions about certain variables. This is a very powerful concept, especially in situations where we have sizable training sets but no detailed model of the underlying data generating process. This process is usually modelled as an unknown probability distribution, and machine learning excels whenever this distribution does not change. Most of the theoretical analysis assumes that the data are i.i.d. (independent and identically distributed) or at least exchangeable.

On the other hand, practical problems often do not have these favorable properties, forcing us to leave the comfort zone of i.i.d. data. Sometimes distributions shift over time, sometimes we might want to combine data recorded under different conditions or from different but related regularities. Researchers have developed a number of modifications of statistical learning methods to handle various scenarios of changing distributions; for an overview, see [1].
The present paper attempts to study these problems from the point of view of causal learning. As some other recent work in the field [2, 3], it will build on the assumption that in causal structures, the distribution of the cause and the mechanism relating cause and effect tend to be independent (in the mentioned references, independence is meant in the sense of algorithmic independence, but other notions of independence can also make sense). For instance, in the problem of predicting splicing patterns from genomic sequences, the basic splicing mechanism (driven by the spliceosome) may be assumed stable between different species [4], even though the genomic sequences and their statistical properties might differ in several respects. This is important information constraining causal models, and it can also be useful for robust predictive models, as we try to show in the present paper. Intuitively, if we learn a causal model of splicing, we could hope to be more robust with respect to changes of the input statistics, and we may be able to combine data collected from different species to get a more accurate statistical model of the splicing mechanism.

Causal graphical models as pioneered by [5, 6] are usually thought of as joint probability distributions over a set of variables X_1, ..., X_n, along with directed graphs (for simplicity, we assume acyclicity) with vertices X_i and arrows indicating direct causal influences. The causal Markov assumption [5] states that each vertex X_i is independent of its non-descendants in the graph, given its parents. Here, independence is usually meant in a statistical sense, although alternative views have been developed, e.g., using algorithmic independence [3]. Crucially, the causal Markov assumption links the semantics of causality to something that has empirically measurable consequences (e.g., conditional statistical independence).
Given a sufficient set of observations from a joint distribution, it allows us to test conditional independence statements and thus infer (subject to a genericity assumption referred to as "faithfulness") which causal models are consistent with an observed distribution. However, this will typically not lead us to a unique causal model, and in the case of graphs with only two variables, there are no conditional independence statements to test and we cannot do anything.

There is an alternative view of causal models, which does not start from a joint distribution. Instead, it assumes a set of jointly independent noise variables, one at each vertex, and each vertex computes a deterministic function of its noise variables and its parents. This view, referred to as a functional causal model (or nonlinear structural equation model), entails a joint distribution which, along with the graph, satisfies the causal Markov assumption [5]. Vice versa, each causal graphical model can be expressed as a functional causal model [3, e.g.] (see footnote 2). The functional point of view is rather useful in that it allows us to come up with assumptions on causal models that would be harder to conceive in a purely probabilistic view. It has recently been shown [7] that an assumption of nonlinear functions with additive noise renders the two-variable case (and thus the multivariate case [8]) identifiable, i.e., we can distinguish between the causal structures X → Y and X ← Y, given that one and only one of these two alternatives is true (which implicitly excludes a common cause of X and Y). Hence, we can tackle the case where conditional independence tests do not provide any information. This opens up the possibility to identify the causal direction for input-output learning problems.
The present paper assays whether this can be helpful for machine learning, and it argues that in many situations, a causal model can be more robust under distribution shifts than a purely statistical model. Perhaps somewhat surprisingly, learning problems need not always predict effect from cause, and the direction of the prediction has consequences for which tasks are easy and which tasks are hard. In the remainder of the paper, we restrict ourselves to the simplest possible case, where we have two variables only and there are no unobserved confounders.

Footnote 2: As an aside, note that the functional point of view is more specific than the graphical model view [5]. To see this, consider X → Y and the following two functional models that lead to the same joint distribution: (1) Y = X xor N with P(N = 0) = 2/3, P(N = 1) = 1/3, and (2) Y = f(X, N) = f_N(X) with f_0 ≡ 0, f_1 ≡ 1, f_2 = id and P(N = 0) = P(N = 1) = P(N = 2) = 1/3. Suppose one observes the sample (0, 0). Models (1) and (2) give different answers to the counterfactual question "What would have happened if X had been one?". The causal graph and the joint distribution do not provide sufficient information to give any answer.

Figure 1: A simple functional causal model, where C is the cause variable, ϕ is a deterministic mechanism, and E is the effect variable. N_C is a noise variable influencing C (without restricting generality, we can equate this with C), and N_E influences E via E = ϕ(C, N_E). We assume that N_C and N_E are independent, in which case we may restrict our attention to the present graph (causal sufficiency).

Notation. We consider the causal structure shown in Fig. 1, with two observables, modeled by random variables. When using the notation C and E, the variable C stands for the cause and E for the effect.
We denote their domains by 𝒞 and ℰ and their distributions by P(C) and P(E) (overloading the notation P). When using the notation X and Y, the variable X will always be the input and Y the output, from a machine learning point of view (but the input can be either cause or effect — more below). For simplicity, we assume that their distributions have a joint density with respect to some product measure. We write the values of this density as P(c, e) and the values of the marginal densities as P(c) and P(e), again keeping in mind that these three P are different functions — we can always tell from the argument which function is meant. We identify a training set of size l with a uniform mixture of Dirac measures, denoted as P(C, E), and use an analogous notation for an additional data set of size m (e.g., a set of test inputs). E.g., P′(C) could be a set of test inputs sampled from a distribution P′ that need not be identical with P. The following assumptions are used throughout the paper; the subsections below only mention additional assumptions that are task specific.

Causal sufficiency. We further assume that there are two independent noise variables N_C and N_E, modeled as random variables with domains 𝒩_C and 𝒩_E and distributions P(N_C) and P(N_E). In some places, we will use conditional densities, always implicitly assuming that they exist. The function ϕ and the noise term P(N_E) jointly determine the conditional P(E | C) via E = ϕ(C, N_E). We think of P(E | C) as the mechanism transforming cause C into effect E.

Independence of mechanism and input. We finally assume that the mechanism is "independent" of the distribution of the cause (i.e., independent of C = N_C in Fig. 1), in the sense that P(E | C) contains no information about P(C) and vice versa; in particular, if P(E | C) changes at some point in time, there is no reason to believe that P(C) changes at the same time (see footnote 3). This assumption has been used by [2, 3]. It encapsulates our belief that ϕ is a mechanism of nature that does not care what we feed into it. The assumption introduces an important asymmetry between cause and effect, since it will usually be violated in the backward direction, i.e., the distribution of the effect E will inherit properties from ϕ [3, 9].

Richness of functional causal models. It turns out that the two-variable functional causal model is so rich that it cannot be identified. The causal Markov condition is trivially satisfied both by the forward model and the backward model, and thus both graphs allow a functional model. To understand the richness of the class intuitively, consider the simple case where the noise N_E can take only a finite number of values, say {1, ..., v}. This noise could affect ϕ for instance as follows: there is a set of functions {ϕ_n : n = 1, ..., v}, and the noise randomly switches one of them on at any point, i.e., ϕ(c, n) = ϕ_n(c). The functions ϕ_n could implement arbitrarily different mechanisms, and it would thus be very hard to identify ϕ from empirical data sampled from such a complex model (see footnote 4).

As an aside, recall that for acyclic causal graphs with more than two variables, the graph structure will typically imply conditional independence properties via the causal Markov condition.
However, the above construction with noises randomly switching between mechanisms is still valid, and it is thus surprising that conditional independence alone does allow us to do some causal inference of practical significance, as implemented by the well-known PC and FCI algorithms [6, 5]. It should be clear that additional assumptions preventing the noise-switching construction should significantly facilitate the task of identifying causal graphs from data. Intuitively, such assumptions need to control the complexity with which the noise N_E can act on the mechanism.

Footnote 3: A stronger condition, which we do not need in the present context, would be to require that P(N_E), ϕ and P(C) be jointly "independent."

Footnote 4: A similar construction, with the range of the noise having the cardinality of the function class, can be used [3] to argue that every causal graphical model can be expressed as a functional causal model.

Additive noise models. One such assumption is referred to as ANM, standing for additive noise model [7]. This model assumes ϕ(C, N_E) = φ(C) + N_E for some function φ:

E = φ(C) + N_E,    (1)

and it has been shown that φ and N_E can be identified in the generic case, provided that N_E is assumed to have zero mean. This means that, apart from some exceptions such as the case where φ is linear and N_E is Gaussian, a given joint distribution of two real-valued random variables X and Y can be fit by an ANM in at most one direction (which we then consider the causal one). A similar statement has been shown for discrete data [10] and for the post-nonlinear model [11] E = ψ(φ(C) + N_E), where ψ is an invertible function. In practice, an ANM can be fit by regressing the effect on the cause while enforcing that the residual noise variable is independent of the cause [12].
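This regression-plus-independence recipe can be sketched in a few lines. The Nadaraya-Watson regressor, the HSIC-style dependence score, the bandwidths, and the cubic toy mechanism below are illustrative assumptions, not the exact procedure of [12]:

```python
import numpy as np

def rbf_gram(v, bw):
    """Gram matrix of a Gaussian RBF kernel on a 1-d sample."""
    return np.exp(-((v[:, None] - v[None, :]) ** 2) / (2.0 * bw ** 2))

def hsic(u, v):
    """Biased HSIC estimate; close to zero when u and v look independent."""
    n = len(u)
    bu = np.median(np.abs(u[:, None] - u[None, :])) + 1e-12  # median heuristic
    bv = np.median(np.abs(v[:, None] - v[None, :])) + 1e-12
    H = np.eye(n) - np.ones((n, n)) / n                      # centering matrix
    return np.trace(H @ rbf_gram(u, bu) @ H @ rbf_gram(v, bv)) / n ** 2

def anm_dependence(cause, effect, bw=0.3):
    """Regress effect on cause (Nadaraya-Watson) and score how strongly
    the residuals still depend on the cause."""
    c = (cause - cause.mean()) / cause.std()
    e = (effect - effect.mean()) / effect.std()
    w = rbf_gram(c, bw)
    residual = e - (w @ e) / w.sum(axis=1)
    return hsic(c, residual)

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 300)
y = x ** 3 + rng.normal(0.0, 1.0, 300)  # additive noise model holds for x -> y

forward = anm_dependence(x, y)   # residuals roughly independent of the cause
backward = anm_dependence(y, x)  # wrong direction: residuals stay dependent
```

In the true direction the residuals carry little information about the cause, so `forward` comes out smaller than `backward`; comparing the two scores is the usual way ANM-based tests decide between X → Y and X ← Y.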
If this is impossible, the model is incorrect (e.g., cause and effect are interchanged, the noise is not additive, or there are confounders). ANM plays an important role in this paper; first, because all the methods below will presuppose that we know what is cause and what is effect, and second, because we will generalize ANM to handle the case where we have several models of the form (1) that share the same φ.

2 Predicting Effect from Cause

Let us consider the case where we are trying to estimate a function f : 𝒳 → 𝒴 or a conditional distribution P(Y | X) in the causal direction, i.e., where X is the cause and Y the effect. Intuitively, this situation of causal prediction should be the 'easy' case, since there exists a functional mechanism ϕ which f should try to mimic. We are interested in the question how robust (or invariant) the estimation is with respect to changes in the noise variables of the underlying functional causal model.

Figure 2: Predicting effect Y from cause X.

2.1 Additional information about the input

2.1.1 Robustness w.r.t. input changes (distribution shift)

Given: training points sampled from P(X, Y) and an additional set of inputs sampled from P′(X), with P(X) ≠ P′(X).
Goal: estimate P′(Y | X).
Assumption: none.
Solution: by independence of mechanism and input, there is no reason to assume that the observed change in P(X) (i.e., in P(N_X)) entails a change in P(Y | X), and we thus conclude P′(Y | X) = P(Y | X). This scenario is referred to as covariate shift [1].

2.1.2 Semi-supervised learning

Given: training points sampled from P(X, Y) and an additional set of inputs sampled from P(X).
Goal: estimate P(Y | X).
Note: by independence of the mechanism, P(X) contains no information about P(Y | X).
A more accurate estimate of P(X), as may be possible by the addition of the test inputs, does thus not influence an estimate of P(Y | X), and semi-supervised learning (SSL) is pointless for the scenario in Figure 2.

2.2 Additional information about the output

2.2.1 Robustness w.r.t. output changes

Given: training points sampled from P(X, Y) and an additional set of outputs sampled from P′(Y), with P′(Y) ≠ P(Y).
Goal: estimate P′(Y | X).
Assumption: various options, e.g., an additive Gaussian noise model where P(φ(X)) is indecomposable and P′(φ(X)) is also indecomposable if it is different from P(φ(X)).
Solution: first we need to decide whether P(X) or P(Y | X) has changed. This can be done using the method Localizing Distribution Change (Subsection 4.2) under appropriate assumptions (see above). If P(X) has changed, proceed as in Subsubsection 2.1.1. If P(Y | X) has changed, we can estimate P′(Y | X) via Estimating Causal Conditionals (Subsection 4.3). Here, additive noise is a sufficient assumption.

2.2.2 Semi-supervised learning

Given: training points sampled from P(X, Y) and an additional set of outputs sampled from P(Y).
Goal: estimate P(Y | X).
Assumption: P(X, Y) has an additive noise model from X to Y, and P(Y) has a unique decomposition as a convolution of two distributions, say P(Y) = Q ∗ R. This is, for instance, satisfied if the noise is Gaussian and P(φ(X)) is indecomposable.
Solution: the additional outputs help because the decomposition tells us that either P(N_Y) = Q or P(N_Y) = R. The additive noise model learned from the (x, y)-pairs will probably tell us which of the alternatives is true. Knowing P(N_Y), estimating the conditional P(Y | X) reduces to learning φ from the (x, y)-pairs, which is certainly a weaker problem than learning P(Y | X) would be in general.
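The covariate-shift claim of Subsubsection 2.1.1 — that the conditional learned under P(X) can be reused unchanged under P′(X) — is easy to probe numerically. A minimal simulation, in which the mechanism φ(x) = sin(2x), the noise level, and the kernel regressor are all hypothetical choices:

```python
import numpy as np

def kernel_regression(x_train, y_train, x_eval, bw=0.15):
    """Nadaraya-Watson regressor trained once on the training sample."""
    w = np.exp(-(x_eval[:, None] - x_train[None, :]) ** 2 / (2.0 * bw ** 2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)

# Training data: X ~ P(X), Y = phi(X) + N_Y with phi(x) = sin(2x).
x_train = rng.uniform(-2.0, 2.0, 500)
y_train = np.sin(2.0 * x_train) + rng.normal(0.0, 0.1, 500)

# Test inputs from a shifted P'(X) on the same support; the mechanism is unchanged.
x_test = rng.uniform(0.0, 2.0, 500)
y_test = np.sin(2.0 * x_test) + rng.normal(0.0, 0.1, 500)

# The conditional learned under P(X) keeps working under P'(X).
mse = np.mean((kernel_regression(x_train, y_train, x_test) - y_test) ** 2)
```

The prediction error on the shifted inputs stays near the noise level even though P′(X) ≠ P(X); this invariance is what makes causal prediction the 'easy' direction.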
2.3 Additional information about input and output

2.3.1 Transfer learning (only noise changes)

Given: training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal: estimate P′(Y | X).
Assumption: additive noise where φ is invariant, but the noises can change.
Solution: run Conditional ANM to output a single function, only enforcing independence of residuals separately for the two data sets (Subsection 4.4).

There is also a semi-supervised learning variant of this scenario: given a training set plus two unpaired sets from the two original marginals, the extra sets help to better estimate P(X, Y), because we have argued in Subsubsection 2.2.2 that additional y-values sampled from P(Y) already help.

2.3.2 Concept drift (only mechanism changes)

Given: training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal: estimate P′(Y | X).
Assumption: N_X, N_Y invariant, but φ has changed to φ′.
Solution: apply ANM to the points sampled from P′(X, Y) to obtain φ′. Then P′(Y | X) is given by P′(Y | X) = P_{N_Y}(Y − φ′(X)).

3 Predicting Cause from Effect

We now turn to the opposite direction, where we consider the effect as observed and we try to predict the value of the cause variable that led to it. This situation of anticausal prediction may seem unnatural, but it is actually ubiquitous in machine learning. Consider, for instance, the task of predicting the class label of a handwritten digit from its image. The underlying causal structure is as follows: a person intends to write the digit 7, say, and this intention causes a motor pattern producing an image of the digit 7 — in that sense, it is justified to consider the class label Y the cause of the image X.

3.1 Additional information about the input

3.1.1 Robustness w.r.t. input changes (distribution shift)

Given: training points sampled from P(X, Y) and an additional set of inputs sampled from P′(X), with P′(X) ≠ P(X) (see footnote 5).
Goal: estimate P′(Y | X).

Footnote 5: A related scenario is that we do not have additional data from P′(X), but we still want to use our knowledge of the causal direction to learn a model that is somewhat robust w.r.t. changes of P(X) due to changes in either P(Y) or P(X | Y).

Figure 3: Predicting cause Y from effect X.

Assumption: additive Gaussian noise with invertible function φ and indecomposable P(φ(Y)) is sufficient. Other assumptions are also possible, but invertibility of the causal conditional P(X | Y) is necessary in any case.
Solution: we apply Localizing Distribution Change (Subsection 4.2) to decide whether P(Y) or P(X | Y) has changed. In the first case, we can estimate P′(Y) via Inverting Conditionals (Subsection 4.1) if we assume that P(X | Y) is an injective conditional (this term is introduced in Subsection 4.1; injectivity means that the input distribution can be uniquely computed from the output distribution, and we will give examples of injective conditionals later). From this we get P′(X, Y), and then

P′(Y | X) = P′(X, Y) / ∫ P′(X, Y) dY.

If, on the other hand, P(X | Y) has changed, we can estimate P′(X | Y) via Estimating Causal Conditionals (Subsection 4.3).

3.1.2 Semi-supervised learning

Given: training points sampled from P(X, Y) and an additional set of inputs sampled from P(X).
Goal: estimate P(Y | X).
Assumption: unclear.
Note: in the anticausal direction, P(X) does contain information about P(Y | X). The additional inputs thus may allow a more accurate estimate of P(X) (see footnote 7). Known methods for semi-supervised learning can indeed be viewed in this way.
For instance, the cluster assumption says that points that lie in the same cluster of P(X) should have the same Y; and the low density separation assumption says that the decision boundary of a classifier (i.e., the point where P(Y | X) crosses 0.5) should lie in a region where P(X) is small. The semi-supervised smoothness assumption says that the estimated function (which we may think of as the expectation of P(Y | X)) should be smooth in regions where P(X) is large (for an overview of the common assumptions, see [13]). Some algorithms assume a model for the causal mechanism P(X | Y), which is usually a Gaussian distribution or mixture of Gaussians, and learn it on both labeled and unlabeled data [14]. Note that all these assumptions translate properties of P(X) into properties of P(Y | X). Using a more accurate estimate of P(X), we could also try to proceed as in Subsubsection 3.1.1 (see footnote 8).

3.2 Additional information about the output

3.2.1 Robustness w.r.t. output changes

Given: training points sampled from P(X, Y) and an additional set of outputs sampled from P′(Y), with P′(Y) ≠ P(Y).
Goal: estimate P′(Y | X).
Assumption: none.
Solution: independence of the mechanism implies P′(X | Y) = P(X | Y), hence P′(X, Y) = P(X | Y) P′(Y). From this, we compute

P′(Y | X) = P′(X | Y) P′(Y) / ∫ P′(X, Y) dY.

There may also be room for a semi-supervised learning variant: suppose we have additional output observations rather than additional inputs as in standard SSL — in which situations does this help?

Footnote 7: Note that a weak form of SSL could roughly work as follows: after learning a generative model for P(X, Y) from the first part of the sample, we can use the additional samples from P(X) to double-check whether our model generates the right distribution for P(X).
Footnote 8: However, in this case we do not have the two alternatives of whether P(Y) or P(X | Y) has changed. The question now should be: given a better estimate of P(X), does that change our estimate of P(Y), or of P(X | Y)?

3.3 Additional information about input and output

3.3.1 Robustness w.r.t. changes of input and output noise (transfer learning)

Given: training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal: estimate P′(Y | X).
Assumption: additive noise where φ is invariant, but the noises can change.
Solution: analogous to Subsubsection 2.3.1, but use the model backwards in the end.

3.3.2 Concept drift (changes of the mechanism)

Given: training points sampled from P(X, Y) and an additional set of points sampled from P′(X, Y), with P′(X, Y) ≠ P(X, Y).
Goal: estimate P′(Y | X).
Assumption: N_X, N_Y invariant, but φ has changed to φ′.
Solution: we can learn φ′ from P′(X, Y) and then estimate the entire distribution P′(X, Y) using the estimates of the distributions P(N_X) and P(N_Y) obtained from the (x, y)-pairs sampled from P(X, Y).

4 Modules

4.1 Inverting Conditionals

We can think of a conditional P(Y | X) as a mechanism that transforms P(X) into P(Y). In some cases, we do not lose any information by this mechanism:

Definition 1 (injective conditionals). A conditional distribution P(Y | X) is called injective if there are no two distributions P(X) ≠ P′(X) such that

∫ P(y | x) P(x) dx = ∫ P(y | x) P′(x) dx.

Example 1 (full rank stochastic matrix). Let X, Y have finite range. Then P(Y | X) is given by a stochastic matrix M, and it is injective if and only if M has full rank. Note that this is only possible if |𝒳| ≤ |𝒴|.
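Example 1 can be checked directly: with a full-rank stochastic matrix, P(X) is recovered from P(Y) by solving a linear system. The matrix M and the input distribution below are arbitrary illustrative choices:

```python
import numpy as np

# Hypothetical conditional P(Y | X) as a column-stochastic matrix M[y, x].
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.1],
              [0.1, 0.2, 0.8]])
p_x = np.array([0.5, 0.3, 0.2])   # an arbitrary input distribution P(X)

p_y = M @ p_x                     # the induced output distribution P(Y)

# M has full rank, so the conditional is injective and P(X) is recovered
# uniquely from P(Y).
p_x_recovered = np.linalg.solve(M, p_y)
```

For a non-square M with |𝒴| > |𝒳|, the same recovery works via `np.linalg.lstsq` as long as M has full column rank.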
Example 2 (post-nonlinear model). Let X, Y be real-valued and Y = ψ(φ(X) + N_Y) with N_Y ⊥⊥ X be a post-nonlinear model where φ and ψ are injective. Then the distribution of Y uniquely determines the distribution of φ(X) + N_Y, because ψ is invertible. This, in turn, uniquely determines the distribution of φ(X), provided that the convolution with P(N_Y) is invertible. Since φ is injective, this determines the distribution of X uniquely. Note that additive noise models with injective φ are a special case of a post-nonlinear model, obtained by setting ψ := id.

4.2 Localizing distribution change

Given data points sampled from P(C, E) and additional points from P′(E) ≠ P(E), we wish to decide whether P(C) or P(E | C) has changed. Assume

E = φ(C) + N_E,

with the same φ for both distributions P(C, E) and P′(C, E), but where the distribution of the noise N_E or the distribution of C changes. Let P(φ(C)) denote the distribution of φ(C) (see footnote 9). Then the distributions of the effect are given by

P(E) = P(φ(C)) ∗ P(N_E),
P′(E) = P′(φ(C)) ∗ P′(N_E),

where either P′(φ(C)) = P(φ(C)) or P′(N_E) = P(N_E). To decide which of these cases is true, we first estimate φ from the first data set, and then apply a deconvolution with P(φ(C)) (denoted by P(φ(C)) ∗⁻¹ ·) or with P(N_E) to P′(E), and check whether (1) P(φ(C)) ∗⁻¹ P′(E) or (2) P(N_E) ∗⁻¹ P′(E) is a probability distribution. Below we will discuss one possible set of assumptions ensuring that exactly one of the alternatives is true. In case (1), P(E | C) has changed. In case (2), P(C) has changed.

To show that there are (not too artificial) assumptions that render the problem solvable, assume that P(φ(C)) and P′(φ(C)) are indecomposable and that P(N_E) and P′(N_E) are Gaussian with zero mean.
Then the distribution P(E) = P(φ(C)) ∗ P(N_E) uniquely determines P(φ(C)), by deconvolving P(E) with the Gaussian of maximal possible width that still yields a probability distribution.

Footnote 9: Explicitly, it is derived from the distribution of C by P(φ(C) ∈ A) = P(C ∈ φ⁻¹(A)).

We are aware that there exist situations where both cases are possible. For instance, consider the example in which P(φ(C)) follows a uniform distribution and P(N_E) ∼ N(0, 1), while when generating P′(E), P′(φ(C)) = P(φ(C)) and P′(N_E) ∼ N(0, 2). That is, when generating the new data, only P(E | C) was changed. However, applying the deconvolution with P(N_E) to P′(E) results in

P′(E) ∗⁻¹ P(N_E) = P(φ(C)) ∗ (P′(N_E) ∗⁻¹ P(N_E)) = P(φ(C)) ∗ N(0, 2 − 1) = P(φ(C)) ∗ N(0, 1),

which still corresponds to a valid distribution. Consequently, we have to conclude that both cases are possible.

Despite the examples where the proposed method fails, it still works in — hopefully — many situations. For instance, let us now switch the roles of P(E) and P′(E) in the example above, or in other words, suppose P(N_E) ∼ N(0, 2) and P′(N_E) ∼ N(0, 1). In this example, deconvolving P′(E) with P(N_E) gives

P′(E) ∗⁻¹ P(N_E) = P(φ(C)) ∗ P′(N_E) ∗⁻¹ P(N_E) = P(φ(C)) ∗⁻¹ N(0, 1),

which is not a valid distribution. That is, in this example we can make the decision that P(E | C) has changed. We are working on conditions that guarantee that only one of the two cases is possible.

4.3 Estimating causal conditionals

Given P′(E), estimate P′(E | C) under the assumption that P(C) has remained constant. Assume that P(C, E) and P′(C, E) have been generated by the additive noise model E = φ(C) + N_E, with the same P(C) and φ, while the distribution of N_E has changed.
We have

P(E) = P(φ(C)) ∗ P(N_E),
P′(E) = P(φ(C)) ∗ P′(N_E).

Hence, P′(N_E) can be obtained by the deconvolution

P′(N_E) = P(φ(C)) ∗⁻¹ P′(E).

This way, we can compute the new conditional P′(E | C).

4.4 Conditional ANM

Given two data sets generated by

E = φ(C) + N_E    (2)

and

E′ = φ(C′) + N′_E,    (3)

respectively, we apply the algorithm of [12] to obtain the shared function φ, enforcing the independences C ⊥⊥ N_E and C′ ⊥⊥ N′_E separately. This can be interpreted as an ANM enforcing conditional independence in

E|i = φ(C|i) + N_E|i,    (4)

where i ∈ {1, 2} is an index and C|i ⊥⊥ N_E|i.

Acknowledgement. We thank Joris Mooij, Bob Williamson, Vladimir Vapnik, Jakob Zscheischler and Eleni Sgouritsa for helpful discussions.

References

[1] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments. MIT Press, Cambridge, MA, 2012.
[2] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/~jan/, 2007.
[3] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
[4] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 1433–1440, 2009.
[5] J. Pearl. Causality. Cambridge University Press, 2000.
[6] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993 (2nd edition MIT Press, 2000).
[7] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21, pages 689–696, 2009.
[8] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Identifiability of causal graphs using functional models. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 589–598, 2011.
[9] P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), Corvallis, OR, USA, 2010. AUAI Press.
[10] J. Peters, D. Janzing, and B. Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:2436–2450, 2011.
[11] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, 2009.
[12] J. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In A. Danyluk, L. Bottou, and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, New York, NY, USA, 2009. ACM Press.
[13] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.
[14] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, volume 3, pages 1–130. Morgan & Claypool Publishers, 2009.
