Directed Information and Pearl's Causal Calculus
Authors: Maxim Raginsky
Abstract—Probabilistic graphical models are a fundamental tool in statistics, machine learning, signal processing, and control. When such a model is defined on a directed acyclic graph (DAG), one can assign a partial ordering to the events occurring in the corresponding stochastic system. Based on the work of Judea Pearl and others, these DAG-based "causal factorizations" of joint probability measures have been used for characterization and inference of functional dependencies (causal links). This mostly expository paper focuses on several connections between Pearl's formalism (and in particular his notion of "intervention") and information-theoretic notions of causality and feedback (such as causal conditioning, directed stochastic kernels, and directed information). As an application, we show how conditional directed information can be used to develop an information-theoretic version of Pearl's "back-door" criterion for identifiability of causal effects from passive observations. This suggests that the back-door criterion can be thought of as a causal analog of statistical sufficiency.

I. INTRODUCTION

The problems of causality in engineered and natural systems have recently attracted the attention of information theorists and signal processing researchers [1]–[6]. The well-worn but nonetheless true maxim stating that "correlation does not imply causation" means that causal relationships cannot be captured by standard information-theoretic quantities like mutual information, conditional entropy, or divergence, because all of these are measures of statistical dependence (i.e., correlation). The first information-theoretic studies of causality were concerned with feedback communication systems and led to the development of the notion of directed information by Massey [7], with subsequent extensions and generalizations by Kramer, Tatikonda, and Mitter [8]–[10].
Connections between directed information and sequential prediction, source coding, and hypothesis testing have also been extensively investigated [11]–[14].

However, causality has also been the subject of vigorous study in the statistics, artificial intelligence, and machine learning communities [15]–[18]. The key idea advanced in these works, particularly by Pearl, is that causality is synonymous with functional (rather than statistical) dependence. In other words, causal relationships correspond to stable deterministic mechanisms, by which one set of variables (the causes), together with some possibly unobserved exogenous disturbances, may affect another set of variables (the effects). Thus, inferring causal relationships requires active experimentation that intervenes into some of these mechanisms. In very schematic terms (this discussion will be made precise in the sequel), an ideal setting for identifying or estimating the "causal effect" of one observable (say, $X$) on another (say, $Y$) would permit the experimenter to disconnect $X$ from all mechanisms that influence it, force $X$ to take on some value(s) of interest, and then to estimate the probability distribution of $Y$ as a result of this intervention, while controlling for all possible spurious influences and factors. This is quite different from estimating the statistical effect of $X$ on $Y$, i.e., the conditional distribution $P_{Y|X}$, by means of passive observations, e.g., from a large number of independent samples from the joint distribution of $X, Y$.

(This work was supported by NSF grant CCF-1017564 and by AFOSR grant FA9550-10-1-0390. The author is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC. E-mail: m.raginsky@duke.edu.)
The purpose of this mostly expository paper is to introduce the information theory, control, and signal processing communities to several key concepts of the probabilistic theory of causality and, along the way, to elucidate several connections between Pearl's treatment of interventions on the one hand, and information-theoretic concepts pertaining to causality (such as directed information [7], causal conditioning [8], or directed stochastic kernels [9], [10]) on the other. In particular, the representation of causal relationships by Markov factorizations of joint probability distributions w.r.t. directed acyclic graphs (DAGs) [15]–[18], such that the natural partial ordering of the vertices of the DAG corresponds to the causal ordering of the events in the system under consideration, should be very congenial to systems theorists, who naturally think in terms of block diagrams, interconnections, and sequential recursive models.

Let us give a brief overview of the remainder of the paper. We first motivate the functional view of causality in Section II by means of a simple example of a point-to-point communication system. Next, in Section III, we develop the general framework for studying causality in Markovian dynamical systems. In particular, we motivate Pearl's definition of intervention as "surgery" on a sequential recursive representation of such a system, whereby the relations defining the intervened-upon variables are deleted, and all instances of these variables in the remaining relations are assigned some fixed value. This operation has a natural diagrammatic representation on the DAG inducing the Markov factorization of the joint probability distribution of the system observables according to the sequential model.
We also show that the probability distributions induced by this operation (i.e., what Pearl calls the causal effects) are in one-to-one correspondence with the directed stochastic kernels of Tatikonda and Mitter [9], [10]. This correspondence is then used in Section IV to show how directed information (and certain generalizations, such as conditional directed information) can be used to quantify the strength of causal effects by comparing them with ordinary (observational) conditional distributions. Section V develops an information-theoretic interpretation of Pearl's "back-door" criterion [18, Sec. 3.3.1] (a sufficient condition for identifiability of causal effects from observational data) in terms of conditional directed information, showing in effect that the back-door criterion can be viewed as a natural causal analog of statistical sufficiency.

II. REVEALING CAUSALITY THROUGH FUNCTIONAL DEPENDENCE

[Fig. 1. A generic communication system without feedback: encoder, channel, decoder.]

To illustrate the difference between statistical dependence and causal dependence, consider the standard diagram of a point-to-point communication system without feedback, as shown in Figure 1. A message $W$ is mapped into a channel input symbol $X = e(W)$, $X$ is transmitted over a channel with transition kernel $P_{Y|X}$, and the resulting channel output symbol $Y$ is processed at the receiver into a decoded message $\tilde W = d(Y)$, where $e$ and $d$ are some deterministic encoding and decoding functions. It is intuitively clear that the message $W$ "causes" the decoded message $\tilde W$ and not the other way around, but we cannot tell this from the joint distribution of $W$, $X$, $Y$, and $\tilde W$.
Indeed, we have
\[
P_{W X Y \tilde W}(w, x, y, \tilde w) = P_W(w)\,\mathbf{1}\{e(w) = x\}\,P_{Y|X}(y|x)\,\mathbf{1}\{d(y) = \tilde w\},
\]
so that the joint distribution of $W$ and $\tilde W$, given by
\[
P_{W \tilde W}(w, \tilde w) = P_W(w) \sum_{x,y} \mathbf{1}\{e(w) = x\}\,P_{Y|X}(y|x)\,\mathbf{1}\{d(y) = \tilde w\}
= P_W(w) \sum_{y} P_{Y|X}(y|e(w))\,\mathbf{1}\{d(y) = \tilde w\}
\equiv P_W(w)\,P_{\tilde W|W}(\tilde w|w),
\]
can also be factored as $P_{W \tilde W}(w, \tilde w) = P_{\tilde W}(\tilde w)\,P_{W|\tilde W}(w|\tilde w)$, which merely shows that $W$ and $\tilde W$ are statistically dependent on one another. Indeed, to quote Massey [7], "statistical dependence, unlike causality, has no inherent directivity." If the encoder, the channel, and the decoder are nondegenerate, so that $I(W; \tilde W) > 0$, then the dependence between the message $W$ and the decoded message $\tilde W$ is completely symmetric: $W$ depends on $\tilde W$, and $\tilde W$ depends on $W$.

In order to elicit the causal influence of the transmitted message on the decoded message, as well as the lack of causal influence in the opposite direction, we need to break this symmetry. To that end, let us represent the stochastic transformation $X \to Y$ effected by the channel $P_{Y|X}$ as a deterministic mapping $Y = f(X, U)$, where $U$ is random channel noise, assumed to be independent of $W$ and $X$. (Indeed, any stochastic kernel $P_{Y|X}$ can be represented in this form for a suitable choice of $f$ and $P_U$.) This representation is shown in Figure 2.

[Fig. 2. An equivalent diagram of the system in Figure 1: encoder, channel, decoder.]

Now we can represent our communication system in the following sequential form:
\[
W \sim P_W, \quad U \sim P_U, \quad X = e(W), \quad Y = f(X, U), \quad \tilde W = d(Y). \tag{1}
\]
What happens if we make a hard assignment $W \leftarrow w$ of a specific value $w$ to the transmitted message? Looking at the sequential model in (1), we see that this action will influence the "downstream" variables $U, X, Y, \tilde W$ as follows:
\[
U \sim P_U, \quad X = e(w), \quad Y = f(e(w), U), \quad \tilde W = d(f(e(w), U)).
\]
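The sequential model (1) and the surgery $W \leftarrow w$ can be simulated directly. The following is a minimal sketch (all numbers, and the trivial encoder/decoder/channel maps, are illustrative, not from the paper): a binary message, a bit-flip channel in the functional form $Y = f(X, U)$, and a comparison of the observational conditional of $\tilde W$ given $W = w$ with the interventional distribution obtained by deleting the equation for $W$ and substituting $w$.

```python
from collections import defaultdict

# Illustrative sketch (numbers and maps are made up): binary message W,
# trivial encoder/decoder, and a binary symmetric channel Y = f(X, U).
P_W = {0: 0.5, 1: 0.5}
P_U = {0: 0.9, 1: 0.1}           # U = 1 flips the transmitted bit
e = lambda w: w                  # encoder X = e(W)
f = lambda x, u: x ^ u           # channel Y = f(X, U)
d = lambda y: y                  # decoder W~ = d(Y)

# Joint distribution of (W, W~) induced by the sequential model (1).
joint = defaultdict(float)
for w, pw in P_W.items():
    for u, pu in P_U.items():
        joint[(w, d(f(e(w), u)))] += pw * pu

def conditional(w):
    """Observational P(W~ = . | W = w), read off the joint by conditioning."""
    pw = sum(p for (w_, _), p in joint.items() if w_ == w)
    out = defaultdict(float)
    for (w_, wt), p in joint.items():
        if w_ == w:
            out[wt] += p / pw
    return dict(out)

def interventional(w):
    """P(W~ = . | W <- w): delete the equation for W and substitute w."""
    out = defaultdict(float)
    for u, pu in P_U.items():
        out[d(f(e(w), u))] += pu
    return dict(out)

def marginal_W_after_decoder_intervention():
    """P(W | W~ <- w~): replacing d by a constant map leaves P_W untouched."""
    return dict(P_W)
```

In this feedforward system the two distributions of $\tilde W$ agree for every $w$, while intervening on $\tilde W$ (replacing $d$ by a constant map) leaves the upstream marginal $P_W$ unchanged.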
The corresponding joint distribution of $U$, $X$, $Y$, and $\tilde W$ resulting from the action $W \leftarrow w$, which we will denote by $P_{U X Y \tilde W | W \leftarrow w}$, has the form
\[
P_{U X Y \tilde W | W \leftarrow w}(u, x, y, \tilde w) = P_U(u)\,\mathbf{1}\{e(w) = x\}\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \tilde w\}.
\]
Marginalizing out the channel noise $U$, the channel input $X$, and the channel output $Y$, we get
\[
P_{\tilde W | W \leftarrow w}(\tilde w) = \sum_{u, y} P_U(u)\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \tilde w\}.
\]
This distribution is, in fact, equal to the ordinary conditional distribution $P_{\tilde W | W = w}$, given by
\[
P_{\tilde W | W = w}(\tilde w) = \sum_{y} P_{Y|X}(y|e(w))\,\mathbf{1}\{d(y) = \tilde w\}
= \sum_{u, y} P_U(u)\,\mathbf{1}\{f(e(w), u) = y\}\,\mathbf{1}\{d(f(e(w), u)) = \tilde w\}.
\]
Again, assuming that the mappings $e$, $f$, $d$ are nondegenerate, there exist at least two values $w$, $w'$ for the transmitted message, for which $P_{\tilde W | W = w} \neq P_{\tilde W | W = w'}$ and, consequently, $P_{\tilde W | W \leftarrow w} \neq P_{\tilde W | W \leftarrow w'}$. In other words, the downstream effect of the hard assignment $W \leftarrow w$ is different from that of $W \leftarrow w'$.

Now let us consider what happens if we make a hard assignment $\tilde W \leftarrow \tilde w$ of the decoded message. One way to do this would be to replace the original decoding map $d$ with the constant map $d_{\tilde w}(y) = \tilde w$ for all $y$. The effect of this hard assignment on the remaining variables can be represented as
\[
W \sim P_W, \quad U \sim P_U, \quad X = e(W), \quad Y = f(e(W), U).
\]
This clearly shows that the joint distribution of the "upstream" random variables $W, U, X, Y$ is unaffected by the action $\tilde W \leftarrow \tilde w$; in fact, exactly the same conclusion would hold if we replaced the original decoding map $d$ with any other decoding map $d'$. In other words,
\[
P_{W U X Y | \tilde W \leftarrow \tilde w} = P_{W U X Y}, \qquad P_{W | \tilde W \leftarrow \tilde w} = P_W,
\]
which shows the absence of causal influence of $\tilde W$ on $W$.

[Fig. 3. A generic stochastic dynamical system with multiple feedback loops and exogenous disturbances.]

III.
CAUSALITY IN SEQUENTIAL DYNAMICAL SYSTEMS

The simple example of the preceding section illustrates the general treatment of causality advocated by Pearl. To motivate it, let us consider a stochastic dynamical system with multiple feedback loops and exogenous influences (or disturbances), shown in Figure 3. The exogenous disturbances are modeled by $n$ random variables $U_1, \ldots, U_n$ with a fixed joint distribution $P_{U^n} = P_{U_1 \ldots U_n}$, while the system observables are represented by $n$ variables $X_1, \ldots, X_n$, related to $U^n$ and to one another by $n$ coupled equations
\[
X_i = f_i(X^n, U^n), \qquad i \in [n]. \tag{2}
\]
We assume that the system specification is sound in the sense that the equations (2) have a unique solution $X^n = x^n$ for any realization $U^n = u^n$ of the exogenous variables. This representation of stochastic dynamical systems as multiple feedback loops was used by Witsenhausen [19]–[21] in his seminal work on distributed control systems. This description allows for arbitrary dependencies between the observables $X_1, \ldots, X_n$, including cycles of the form
\[
X_j = f_j(X_i, U_j), \quad X_k = f_k(X_j, U_k), \quad X_i = f_i(X_k, U_i).
\]
In order to study causality, we will limit ourselves to sequential dynamical systems, in which the observables $X_1, \ldots, X_n$ are ordered in such a way that, for each $i \in [n]$, there exists a set $\Pi_i \subseteq [i-1]$ such that the function $f_i$ depends essentially only on $X_{\Pi_i} \triangleq (X_j : j \in \Pi_i)$ and on $U_i$:
\[
X_i = f_i(X_{\Pi_i}, U_i), \qquad i \in [n]. \tag{3}
\]
Moreover, if for each $i$ the exogenous variable $U_i$ is independent of $(X^{i-1}, U^{i-1})$, then the sequential model (3) specifies the joint distribution $P_{X^n}$ via the Markov factorization
\[
P_{X^n}(x^n) = \prod_{i=1}^n P_{X_i | X_{\Pi_i}}(x_i | x_{\Pi_i}), \tag{4}
\]
where, for each $i \in [n]$,
\[
P_{X_i | X_{\Pi_i}}(x_i | x_{\Pi_i}) = P_{U_i}\big(f_i(x_{\Pi_i}, U_i) = x_i\big), \tag{5}
\]
and $X_{[i-1] \setminus \Pi_i} \to X_{\Pi_i} \to X_i$ is a Markov chain.
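The passage from the functional model (3) to the factorization (4) via the kernels (5) can be checked numerically. Here is a small sketch in which the mechanisms $f_i$ and noise distributions are hypothetical: we enumerate the exogenous variables to get the joint law of $X^3$ and verify that it matches the product of the kernels.

```python
import itertools
from collections import defaultdict

# Hypothetical three-variable sequential system (3): Pi_1 = {}, Pi_2 = {1},
# Pi_3 = {2}, with independent binary disturbances U_1, U_2, U_3.
P_U = [{0: 0.5, 1: 0.5}, {0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3}]
f = [
    lambda pa, u: u,           # X1 = U1
    lambda pa, u: pa[0] ^ u,   # X2 = X1 xor U2
    lambda pa, u: pa[0] & u,   # X3 = X2 and U3
]
Pi = [(), (0,), (1,)]          # parent index sets (0-based)

def solve(us):
    """Run the sequential recursion (3) for one realization of U^3."""
    xs = []
    for fi, pi, u in zip(f, Pi, us):
        xs.append(fi([xs[j] for j in pi], u))
    return tuple(xs)

# Joint P_{X^3} by brute-force enumeration over the disturbances.
joint = defaultdict(float)
for us in itertools.product((0, 1), repeat=3):
    p = 1.0
    for u, pu in zip(us, P_U):
        p *= pu[u]
    joint[solve(us)] += p

def kernel(i, xi, x_pa):
    """Kernel (5): P_{X_i | X_Pi_i}(x_i | x_Pi_i) = P_{U_i}(f_i(x_Pi_i, U_i) = x_i)."""
    return sum(p for u, p in P_U[i].items() if f[i](list(x_pa), u) == xi)

def factorized(xs):
    """Markov factorization (4): the product of the kernels along the ordering."""
    p = 1.0
    for i in range(3):
        p *= kernel(i, xs[i], [xs[j] for j in Pi[i]])
    return p
```

For every realization $x^3$, `factorized(x)` agrees with the enumerated joint, which is exactly the content of (4).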
We will refer to any stochastic dynamical system specified by (3) with independent disturbances $U_1, \ldots, U_n$ as a Markovian dynamical system. Apparently, one of the earliest attempts to study causality by means of simple Markovian models of this sort was made in the 1920's by the geneticist Sewall Wright [22].

The Markov factorization (4) can also be represented in graphical form by means of a directed graph with $n$ vertices, where vertex $i$ is associated with $X_i$, and there is a directed edge from vertex $j$ to vertex $i$ if and only if $j \in \Pi_i$. Because $\Pi_i \subseteq [i-1]$, we end up with a DAG. Since we will use this graphical representation rather heavily in the sequel, let us pause to define some concepts associated with DAGs. Given $i \in [n]$, we let $\Delta_i \subset [n]$ denote the set of all descendants of $i$, i.e., the set of all $j \in [n] \setminus \{i\}$ such that there is a directed path from $i$ to $j$. Similarly, we let $A_i$ denote the set of all ancestors of $i$, i.e., all $j \in [n] \setminus \{i\}$ connected to $i$ by directed paths. We also let $\Delta_i^+ \triangleq \Delta_i \cup \{i\}$, so that $N_i \triangleq [n] \setminus \Delta_i^+$ is the set of all nondescendants of $i$. Note that
\[
\bigcup_{j \in N_i} A_j \subset N_i. \tag{6}
\]
Indeed, if for some $j \in N_i$ there existed some $k \in A_j \cap \Delta_i^+$, then there would be a directed path from $i$ to $j$ going through $k$, which is impossible by the definition of $N_i$.

A. Interventions in Markovian dynamical systems

Consider a Markovian dynamical system specified according to (3). Just as we did in the simple example of Section II, we can study the causal effect of one set of variables $X_S$, $S \subset [n]$, on another set $X_T$ with $S \cap T = \emptyset$ by examining the impact of hard assignments of the form $X_S \leftarrow x_S$ on $X_T$. The main idea is to start with the recursive representation (3), delete all equations defining the variables $X_i$, $i \in S$, and replace all other instances of these variables with the assigned values.
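The surgery just described can be simulated on a toy chain (all mechanisms and numbers below are hypothetical): clamping $X_2$ leaves the marginal of its nondescendant $X_1$ intact, while the distribution of the descendant $X_3$ responds to the assigned value.

```python
import itertools
from collections import defaultdict

# Hypothetical chain X1 -> X2 -> X3 with bit-flip noise mechanisms.
P_U = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}]
f = [lambda pa, u: u, lambda pa, u: pa[0] ^ u, lambda pa, u: pa[0] ^ u]
Pi = [(), (0,), (1,)]

def run(us, do=None):
    """Recursion (3); `do` maps a vertex to a clamped value (the surgery)."""
    do = do or {}
    xs = []
    for i, (fi, pi, u) in enumerate(zip(f, Pi, us)):
        xs.append(do[i] if i in do else fi([xs[j] for j in pi], u))
    return tuple(xs)

def dist(do=None):
    """Joint law of X^3, optionally under an intervention."""
    out = defaultdict(float)
    for us in itertools.product((0, 1), repeat=3):
        p = 1.0
        for u, pu in zip(us, P_U):
            p *= pu[u]
        out[run(us, do)] += p
    return out

def marginal(d, i):
    return {v: sum(p for xs, p in d.items() if xs[i] == v) for v in (0, 1)}

base = dist()                   # intervention-free model (3)
surg = dist(do={1: 0})          # atomic intervention X2 <- 0
```

Here `marginal(surg, 0)` coincides with `marginal(base, 0)`, since vertex 1 is a nondescendant of the clamped vertex, while `marginal(surg, 2)` becomes the noise law of the clamped input, $\{0: 0.8,\ 1: 0.2\}$.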
For example, the effect of what Pearl calls an atomic intervention $X_i \leftarrow x_i$ can be represented as the following modification of (3):
\[
X_j = \begin{cases} f_j(X_{\Pi_j}, U_j)\big|_{X_i = x_i}, & \text{if } j \in \Delta_i, \\ f_j(X_{\Pi_j}, U_j), & \text{if } j \in N_i, \end{cases} \tag{7}
\]
where the equation defining $X_i$ is deleted and every remaining instance of $X_i$ is set to $x_i$. Now, for any set $T \subseteq [n] \setminus \{i\}$, let $P_{X_T | X_i \leftarrow x_i}$ denote the probability distribution of $X_T$ induced by the modified model (7). Other notation used by Pearl and coauthors includes $P_{X_T | \hat X_i = \hat x_i}$ (where hats are added to the intervened-upon variables and the values assigned to them) and $P_{X_T | \mathrm{do}(X_i = x_i)}$; we will use some of these interchangeably. The main claim is that these interventional distributions describe the causal effect of $X_i$ upon $X_T$. Let us see some illustrations in support of this claim.

First of all, we would intuitively expect that the intervention $X_i \leftarrow x_i$ would only affect the descendants of $i$. This is indeed true:

Lemma 1. For any $T \subseteq N_i$ and any intervention $X_i \leftarrow x_i$, $P_{X_T | X_i \leftarrow x_i} = P_{X_T}$, where the distribution $P_{X_T}$ on the right-hand side is induced by the original model (3).

Proof. Because of (6), no $X_k$ with $k \in \Delta_i^+$ appears in any of the equations defining $X_{N_i}$ in (7). Hence, the joint distribution of $X_{N_i}$ in the modified model (7) is the same as in the original model (3).

Since $\Pi_i \subseteq N_i$, we have

Corollary 1. For any $i \in [n]$ and any intervention $X_i \leftarrow x_i$, $P_{X_{\Pi_i} | X_i \leftarrow x_i} = P_{X_{\Pi_i}}$.

The extension to multiple interventions of the form $X_S \leftarrow x_S$ is immediate: defining the sets
\[
\Delta_S \triangleq \bigcup_{i \in S} \Delta_i, \qquad \Delta_S^+ \triangleq \Delta_S \cup S, \qquad N_S \triangleq [n] \setminus \Delta_S^+,
\]
we can represent the effect of the intervention $X_S \leftarrow x_S$ on $X_{S^c} = (X_j : j \notin S)$ by
\[
X_j = \begin{cases} f_j(X_{\Pi_j}, U_j)\big|_{X_S = x_S}, & \text{if } j \in \Delta_S, \\ f_j(X_{\Pi_j}, U_j), & \text{if } j \in N_S, \end{cases} \tag{8}
\]
and, for any $T \subset [n] \setminus S$, the interventional distribution $P_{X_T | X_S \leftarrow x_S}$ is given by the joint distribution of $X_T$ induced by (8). Going through the same reasoning as before, we obtain the following generalization of Lemma 1:

Lemma 2.
For any $S \subseteq [n]$, any $T \subseteq N_S$, and any intervention $X_S \leftarrow x_S$, $P_{X_T | X_S \leftarrow x_S} = P_{X_T}$.

On the other hand, let us pick some $i \in [n]$ and consider the causal effect of the intervention $X_{\Pi_i} \leftarrow x_{\Pi_i}$ upon $X_i$:

Lemma 3. For any $S \subset [n]$ and any intervention $X_{\Pi_S} \leftarrow x_{\Pi_S}$, we have $P_{X_S | X_{\Pi_S} \leftarrow x_{\Pi_S}} = P_{X_S | X_{\Pi_S} = x_{\Pi_S}}$. Moreover, for any $T \subseteq (S \cup \Pi_S)^c$ and any intervention $X_T \leftarrow x_T$, we have
\[
P_{X_S | X_{\Pi_S} \leftarrow x_{\Pi_S}, X_T \leftarrow x_T} = P_{X_S | X_{\Pi_S} \leftarrow x_{\Pi_S}} = P_{X_S | X_{\Pi_S} = x_{\Pi_S}}.
\]

Proof. Observe that, as a result of the intervention $X_{\Pi_S} \leftarrow x_{\Pi_S}$, we have
\[
X_j = f_j(x_{\Pi_j}, U_j), \qquad \forall j \in S,
\]
which means that, for any $x_S$ and any additional intervention $X_T \leftarrow x_T$, where $T$ is disjoint from $S \cup \Pi_S$, we have
\[
P_{X_S | X_{\Pi_S} \leftarrow x_{\Pi_S}, X_T \leftarrow x_T}(x_S) = P_{U_S}\big(f_j(x_{\Pi_j}, U_j) = x_j, \ \forall j \in S\big) = P_{X_S | X_{\Pi_S} \leftarrow x_{\Pi_S}}(x_S) = P_{X_S | X_{\Pi_S} = x_{\Pi_S}}(x_S).
\]
In other words, the joint distribution of $X_S$ induced by (8) is unaffected by $X_T \leftarrow x_T$.

In terms of the Markov factorization (4), we can express the interventional distributions $P_{X_T | X_S \leftarrow x_S}$ for any pair of disjoint sets $S, T \subset [n]$ as follows. First, we write down the "global" interventional distribution of $X_{S^c}$ given the action $X_S \leftarrow x_S$,
\[
P_{X_{S^c} | X_S \leftarrow x_S}(x_{S^c}) = \prod_{i \in S^c} P_{X_i | X_{\Pi_i}}(x_i | x_{\Pi_i}), \tag{9}
\]
and then marginalize out all variables outside of $T$:
\[
P_{X_T | X_S \leftarrow x_S}(x_T) = \sum_{x_{S^c \cap T^c}} P_{X_{S^c} | X_S \leftarrow x_S}(x_{S^c}). \tag{10}
\]
Note that, in general, this is different from the ordinary conditional distribution $P_{X_T | X_S = x_S}$, which has the following standard interpretation in Bayesian terms: suppose we can only observe $X_S$, but not $X_{S^c}$. If we let the system evolve freely according to (3) and then observe that $X_S = x_S$, then $P_{X_T | X_S = x_S}$ represents our posterior beliefs about $X_T$ based on the observed evidence $X_S = x_S$.

B.
Interventions in graphical models

Graphical model representations of Markovian dynamical systems offer a convenient visual way of computing interventional distributions. Essentially, if we wish to write down the interventional distribution $P_{X_{S^c} | X_S \leftarrow x_S}$, we draw the corresponding DAG, remove all edges incident upon the vertices in $S$, and write down the joint distribution of $X_{S^c}$ induced by the resulting DAG, while setting $X_S$ to the assigned values $x_S$. Let us see this on a couple of examples.

Consider the graphical model on six vertices with edges $1 \to 3$, $2 \to 3$, $1 \to 4$, $3 \to 5$, $3 \to 6$, $4 \to 6$, and $5 \to 6$. It specifies the joint distribution of $X^6 = (X_1, \ldots, X_6)$ via
\[
P_{X^6}(x^6) = P_{X_1}(x_1)\,P_{X_2}(x_2)\,P_{X_3 | X^2}(x_3 | x^2)\,P_{X_4 | X_1}(x_4 | x_1)\,P_{X_5 | X_3}(x_5 | x_3)\,P_{X_6 | X_3^5}(x_6 | x_3^5).
\]
The effect of the intervention $X_3 \leftarrow x_3$ can be represented graphically as follows: the intervened-upon variable $X_3$ is disconnected from its direct causes in $\Pi_3 = \{1, 2\}$ (the edges $1 \to 3$ and $2 \to 3$ are deleted), and an additional arrow is added to indicate the hard assignment $X_3 \leftarrow x_3$. The resulting interventional distribution can be read off directly from the diagram:
\[
P_{X_1, X_2, X_4^6 | X_3 \leftarrow x_3}(x_1, x_2, x_4^6) = P_{X_1}(x_1)\,P_{X_2}(x_2)\,P_{X_4 | X_1}(x_4 | x_1)\,P_{X_5 | X_3}(x_5 | x_3)\,P_{X_6 | X_3^5}(x_6 | x_3^5).
\]
As another example, consider communication over a discrete memoryless channel $P_{Y|X}$ using a sequence of possibly randomized feedback encoders $P_{X_i | X^{i-1}, Y^{i-1}}$, $i \in [n]$: in the corresponding DAG, each channel input $X_i$ has incoming edges from $X^{i-1}$ and $Y^{i-1}$, and each channel output $Y_i$ has an incoming edge from $X_i$.
The effect of the intervention $Y_1 \leftarrow y_1, \ldots, Y_n \leftarrow y_n$ is represented graphically by deleting the edges into each $Y_i$ and adding arrows for the hard assignments $Y_i \leftarrow y_i$; the corresponding interventional distribution is
\[
P_{X^n | Y^n \leftarrow y^n}(x^n) = \prod_{i=1}^n P_{X_i | X^{i-1}, Y^{i-1}}(x_i | x^{i-1}, y^{i-1}).
\]

C. Interventional distributions as directed stochastic kernels

As it turns out, Pearl's construction of interventional distributions has been developed independently by Tatikonda and Mitter [9], [10] under the name of directed stochastic kernels in their work on the capacity of channels with feedback. Tatikonda and Mitter consider an $n$-tuple of causally ordered random variables $X_1, \ldots, X_n$ with joint distribution
\[
P_{X^n}(x^n) = \prod_{i=1}^n P_{X_i | X^{i-1}}(x_i | x^{i-1})
\]
(of course, we are free to factor $P_{X^n}$ along any other ordering of the variables, but the subsequent definitions depend on a fixed ordering). Then for any $S \subset [n]$ they define the directed stochastic kernel $\vec P_{X_{S^c} | X_S = x_S}$ by
\[
\vec P_{X_{S^c} | X_S = x_S}(x_{S^c}) \triangleq \prod_{i \in S^c} P_{X_i | X^{i-1}}(x_i | x^{i-1}). \tag{11}
\]
It is easy to see that this definition is equivalent to Pearl's. Indeed, if we consider the DAG with $n$ vertices that has $\Pi_i = [i-1]$ for each $i \in [n]$, then $\vec P_{X_{S^c} | X_S = x_S}$ defined in (11) is equal to $P_{X_{S^c} | X_S \leftarrow x_S}$ defined in (9). Conversely, if the variables $X_1, \ldots, X_n$ are ordered in such a way that for each $i \in [n]$ there exists some $\Pi_i \subseteq [i-1]$ such that $X_{[i-1] \setminus \Pi_i} \to X_{\Pi_i} \to X_i$ is a Markov chain, then
\[
P_{X_{S^c} | X_S \leftarrow x_S}(x_{S^c}) = \prod_{i \in S^c} P_{X_i | X_{\Pi_i}}(x_i | x_{\Pi_i}) = \prod_{i \in S^c} P_{X_i | X^{i-1}}(x_i | x^{i-1}) = \vec P_{X_{S^c} | X_S = x_S}(x_{S^c}),
\]
where the first step uses (9), and the remaining steps use the above Markov chain condition together with (11).

D.
Interventions as channels

The interventional distribution $P_{X_T | X_S \leftarrow x_S}$ can be viewed as a mapping from the set of all tuples $x_S = (x_i : i \in S)$ into the set of all probability distributions for $X_T$. Any such mapping defines a channel [23] with input variable $X_S$ and output variable $X_T$. If $S = \Pi_T$, then Lemma 3 shows that this channel coincides with the specification of the conditional distribution of $X_T$ given $X_{\Pi_T}$ in the intervention-free system. This equality of the originally prescribed stochastic kernels and the directed stochastic kernels holds whenever $X_S$ (resp., $X_T$) is the complete input (resp., output) variable of an encoder, decoder, or controller. By contrast, whenever $P_{X_T | X_S \leftarrow x_S} \neq P_{X_T | X_S = x_S}$ for some $x_S$, we can conclude that there are some additional causal or statistical relationships between $X_S$ and $X_T$.

IV. DIRECTED INFORMATION AS A MEASURE OF CAUSALITY

Now that we have motivated the notion of a causal effect, we can proceed to define various information-theoretic quantities that capture causality as opposed to dependence. Assuming, as before, a Markovian dynamical system of the form (3), let us consider the interventional distribution $P_{X_T | \hat X_S}(\cdot | \hat x_S)$ for disjoint sets $S, T \subset [n]$. As we have pointed out already, this distribution is, in general, different from the conditional distribution $P_{X_T | X_S}(\cdot | x_S)$. In particular, if $P_{X_T | \hat X_S}(\cdot | \hat x_S) = P_{X_T}(\cdot)$ for any intervention $X_S \leftarrow x_S$, then the variables in $S$ have no causal influence on those in $T$. On the opposite end of the spectrum, if $P_{X_T | \hat X_S}(\cdot | \hat x_S) = P_{X_T | X_S}(\cdot | x_S)$, then the causal effect of $X_S$ on $X_T$ coincides with ordinary conditioning.
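The contrast between $P_{X_T | X_S \leftarrow x_S}$ and $P_{X_T | X_S = x_S}$ can be made concrete with a hypothetical three-variable model in which $Z$ is a common cause of $X$ and $Y$ (all numbers below are made up): the truncated factorization (9) yields an interventional distribution of $Y$ that differs from the observational conditional, precisely because of the extra path through $Z$.

```python
import itertools

# Hypothetical confounded model: Z -> X, Z -> Y, X -> Y (so Pi_X = {Z}).
P_Z = {0: 0.5, 1: 0.5}
P_X1_given_Z = {0: 0.2, 1: 0.9}                     # P(X = 1 | z)
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.6,
                 (1, 0): 0.4, (1, 1): 0.8}          # P(Y = 1 | x, z)

def bern(q, v):                                     # P(V = v) for Bernoulli(q)
    return q if v == 1 else 1.0 - q

joint = {(z, x, y): P_Z[z] * bern(P_X1_given_Z[z], x) * bern(P_Y1_given_XZ[(x, z)], y)
         for z, x, y in itertools.product((0, 1), repeat=3)}

def p_y_do_x(y, x):
    """P(Y = y | X <- x) via the truncated factorization (9): drop X's factor."""
    return sum(P_Z[z] * bern(P_Y1_given_XZ[(x, z)], y) for z in (0, 1))

def p_y_cond_x(y, x):
    """Observational P(Y = y | X = x), read off the joint."""
    num = sum(joint[(z, x, y)] for z in (0, 1))
    den = sum(joint[(z, x, yy)] for z in (0, 1) for yy in (0, 1))
    return num / den
```

With these numbers, $P(Y{=}1 \mid X \leftarrow 1) = 0.6$ while $P(Y{=}1 \mid X{=}1) = 8/11 \approx 0.727$: conditioning on $X = 1$ also carries evidence about the confounder $Z$, whereas the intervention severs that link.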
This observation suggests that, for each realization $x_S$ of $X_S$, we may measure the average "strength" of the causal effect of the intervention $X_S \leftarrow x_S$ on $X_T$ by the divergence
\[
D\big(P_{X_T | X_S = x_S} \,\big\|\, P_{X_T | \hat X_S = \hat x_S}\big) = \mathbb{E}\left[\log \frac{P_{X_T | X_S}(X_T | x_S)}{P_{X_T | \hat X_S}(X_T | \hat x_S)}\right],
\]
where the expectation is w.r.t. the conditional distribution $P_{X_T | X_S = x_S}$. If we now average this w.r.t. the marginal distribution of $X_S$ induced by (3), then we obtain
\[
D\big(P_{X_T | X_S} \,\big\|\, P_{X_T | \hat X_S} \,\big|\, P_{X_S}\big) = \mathbb{E}\left[\log \frac{P_{X_T | X_S}(X_T | X_S)}{P_{X_T | \hat X_S}(X_T | \hat X_S)}\right], \tag{12}
\]
where $D(P_{B|A} \| Q_{B|A} | P_A)$ denotes the conditional divergence [24]. If $T = S^c$, then we have
\[
D\big(P_{X_{S^c} | X_S} \,\big\|\, P_{X_{S^c} | \hat X_S} \,\big|\, P_{X_S}\big) = \mathbb{E}\left[\log \frac{P_{X_{S^c} | X_S}(X_{S^c} | X_S)}{P_{X_{S^c} | \hat X_S}(X_{S^c} | \hat X_S)}\right] = \mathbb{E}\left[\log \frac{P_{X_{S^c} | X_S}(X_{S^c} | X_S)}{\vec P_{X_{S^c} | X_S}(X_{S^c} | X_S)}\right],
\]
where the second step uses the equivalence between the interventional distribution $P_{X_{S^c} | \hat X_S}$ and the directed stochastic kernel $\vec P_{X_{S^c} | X_S}$. We can now recognize the last expression as the directed information $I(X_{S^c} \to X_S)$ from $X_{S^c}$ to $X_S$, as defined by Tatikonda and Mitter [10, p. 327]. This definition, in turn, generalizes the one proposed by Massey [7] in the context of communication over noisy channels with feedback. Thus, directed information arises naturally as an information-theoretic measure of causality: if $I(X_{S^c} \to X_S)$ is small, then the interventional distributions of $X_{S^c}$ based on $X_S$ are close to observational (i.e., conditional) distributions of $X_{S^c}$ given $X_S$, which means that the causal effects of $X_S$ on $X_{S^c}$ can be reliably identified without the need for active experimentation.
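The divergence (12) can be evaluated explicitly on the simplest chain $X \to Y$ (numbers hypothetical). Since $X$ is a root, intervening on $X$ coincides with conditioning, so the divergence measuring the flow from $Y$ back to $X$ vanishes; since $X$ is a nondescendant of $Y$, intervening on $Y$ leaves $P_X$ alone, and (12) collapses to the ordinary mutual information $I(X; Y)$.

```python
import math

# Hypothetical two-variable chain X -> Y.
P_X = {0: 0.3, 1: 0.7}
P_Y_given_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

joint = {(x, y): P_X[x] * P_Y_given_X[x][y] for x in (0, 1) for y in (0, 1)}
P_Y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}

def kl(p, q):
    """Divergence D(p || q) for distributions on a common finite alphabet."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

# I(Y -> X): the interventional and observational conditionals of Y coincide
# because X is a root, so each divergence term in (12) vanishes.
I_Y_to_X = sum(P_X[x] * kl(P_Y_given_X[x], P_Y_given_X[x]) for x in (0, 1))

# I(X -> Y): P_{X | Y <- y} = P_X (X is a nondescendant of Y), so (12)
# becomes the average divergence D(P_{X|Y} || P_X | P_Y) = I(X; Y).
P_X_given_Y = {y: {x: joint[(x, y)] / P_Y[y] for x in (0, 1)} for y in (0, 1)}
I_X_to_Y = sum(P_Y[y] * kl(P_X_given_Y[y], P_X) for y in (0, 1))

# Mutual information computed directly from the joint, for comparison.
mutual_info = sum(p * math.log(p / (P_X[x] * P_Y[y])) for (x, y), p in joint.items())
```

This is a numerical preview of the chain example in Section IV-C: the directed information is asymmetric even though the mutual information is not.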
On the other hand, if $I(X_{S^c} \to X_S)$ is equal to the ordinary mutual information $I(X_{S^c}; X_S)$, then the variables in $S$ have no causal effect on the remaining variables in $S^c$, and any statistical dependence between $X_S$ and $X_{S^c}$ must be along the (not necessarily directed) paths in the DAG that have some edges pointing toward $S$.

The definitions of Massey and Tatikonda–Mitter apply only to the causal effect of $X_S$ on the entire complementary set $X_{S^c}$. We can, however, consider an arbitrary set $T \subseteq S^c$ and use (12) as our definition of the directed information from $X_T$ to $X_S$:
\[
I(X_T \to X_S) \triangleq D\big(P_{X_T | X_S} \,\big\|\, P_{X_T | \hat X_S} \,\big|\, P_{X_S}\big). \tag{13}
\]
Note that for $I(X_T \to X_S)$ to be well-defined, we need to specify an appropriate Markovian dynamical system, where the interventional distribution $P_{X_T | \hat X_S}$ is computed according to (10).

An expression for the directed information $I(X_{S^c} \to X_S)$ can be obtained from the underlying graphical model. Indeed, note that we can write
\[
I(X_{S^c} \to X_S) = \mathbb{E}\left[\log \frac{P_{X_S, X_{S^c}}(X_S, X_{S^c})}{P_{X_{S^c} | \hat X_S}(X_{S^c} | \hat X_S)\,P_{X_S}(X_S)}\right].
\]
Now, the probability distribution in the numerator is equal to $P_{X^n}$ and can be assembled from the original Markov factorization, while the one in the denominator is the product of the interventional distribution $P_{X_{S^c} | \hat X_S}$ (which can be read off from the transformed DAG obtained using the procedure illustrated in Section III-B) and the marginal distribution $P_{X_S}$ according to the original model. The directed edges that are common to the original DAG and the transformed DAG correspond to the factors in the numerator and the denominator that can be cancelled. The remaining expression can then be represented as a sum of conditional mutual informations by exploiting appropriate conditional independence relations encoded in the original DAG.¹

A.
Combining interventions and passive observations: conditional directed information

We have already pointed out the different status of active interventions of the form $X_S \leftarrow x_S$ and conditioning on passive observations $X_S = x_S$. Many problems pertaining to causality involve a combination of the two: given three disjoint sets $S, S', T \subset [n]$, we may want to consider a mixed quantity $P_{X_T | X_S \leftarrow x_S, X_{S'} = x_{S'}}$. In order for such an object to be well-defined, the conditioning on $X_{S'}$ must be done w.r.t. the interventional distribution $P_{X_{S' \cup T} | X_S \leftarrow x_S}$:
\[
P_{X_T | X_S \leftarrow x_S, X_{S'} = x_{S'}}(x_T) \triangleq \frac{P_{X_{S' \cup T} | X_S \leftarrow x_S}(x_{S' \cup T})}{P_{X_{S'} | X_S \leftarrow x_S}(x_{S'})}.
\]
In fact, this is the only sensible definition, because performing the conditioning first may destroy the Markov structures that are needed to construct the interventional distribution. With the above definition, we may define the conditional directed information
\[
I(X_T \to X_S | X_{S'}) \triangleq D\big(P_{X_T | X_S, X_{S'}} \,\big\|\, P_{X_T | \hat X_S, X_{S'}} \,\big|\, P_{X_S, X_{S'}}\big) \tag{14a}
\]
\[
\equiv \mathbb{E}\left[\log \frac{P_{X_T | X_S, X_{S'}}(X_T | X_S, X_{S'})}{P_{X_T | \hat X_S, X_{S'}}(X_T | \hat X_S, X_{S'})}\right]. \tag{14b}
\]

B. Some properties of directed information

Let us illustrate the role of the directed information (13) and the conditional directed information (14) in quantifying the causal flow of information in Markovian dynamical systems. We start with the following:

Lemma 4. For any $S \subset [n]$ and any $T \subseteq N_S$, $I(X_T \to X_S) = I(X_T; X_S)$. Moreover, for any $T \subseteq (S \cup \Pi_S)^c$, $I(X_S \to X_{\Pi_S \cup T}) = I(X_S \to X_{\Pi_S}) = 0$.

Proof. This is just a restatement of Lemmas 2 and 3 in the language of directed information.

¹We would like to thank Yury Polyanskiy for clarifications regarding this procedure.
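Lemma 4 can be checked numerically on a small DAG $A \to B$, $A \to C$, $B \to C$ with hypothetical kernels: the directed information (13) from the nondescendant $A$ to $B$ equals the mutual information $I(A; B)$, the directed information from $C$ to its parent set $\{A, B\}$ vanishes, and for the mixed target set $\{A, C\}$ the conditional (descendant) contribution happens to vanish as well, leaving only the mutual-information part.

```python
import itertools, math

# Hypothetical DAG A -> B, A -> C, B -> C (vertex indices 0, 1, 2).
Pi = {0: (), 1: (0,), 2: (0, 1)}
K = {0: {(): 0.4},
     1: {(0,): 0.3, (1,): 0.8},
     2: {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.9}}  # P(X_i = 1 | parents)

def factor(i, x):
    q = K[i][tuple(x[j] for j in Pi[i])]
    return q if x[i] == 1 else 1.0 - q

space = list(itertools.product((0, 1), repeat=3))
P = {x: factor(0, x) * factor(1, x) * factor(2, x) for x in space}

def marg(dist, idxs):
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def directed_info(T, S):
    """I(X_T -> X_S) per (13); the interventional law comes from the
    truncated factorization (9), i.e., the factors of S are dropped."""
    P_do = {x: math.prod(factor(i, x) for i in range(3) if i not in S) for x in space}
    m_S, m_ST = marg(P, S), marg(P, S + T)
    mdo_ST = marg(P_do, S + T)       # P(x_T | X_S <- x_S), indexed by (x_S, x_T)
    total = 0.0
    for x, p in P.items():
        ks, kst = tuple(x[i] for i in S), tuple(x[i] for i in S + T)
        total += p * math.log((m_ST[kst] / m_S[ks]) / mdo_ST[kst])
    return total

def mutual_info(T, S):
    m_S, m_T, m_ST = marg(P, S), marg(P, T), marg(P, S + T)
    return sum(p * math.log(m_ST[tuple(x[i] for i in S + T)]
                            / (m_S[tuple(x[i] for i in S)] * m_T[tuple(x[i] for i in T)]))
               for x, p in P.items())
```

Here `directed_info((0,), (1,))` matches `mutual_info((0,), (1,))` (first part of Lemma 4, with $\{A\} \subseteq N_{\{B\}}$), and `directed_info((2,), (0, 1))` is zero (second part, with $\Pi_{\{C\}} = \{A, B\}$).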
We can also show that there are two contributions to the directed flow of information from $X_T$ to $X_S$: (1) the ordinary mutual information between the variables in $S$ and any nondescendants of $S$ that happen to lie in $T$, and (2) the conditional directed information from the descendants of $S$ in $T$ to $S$, given the nondescendants of $S$ in $T$:

Proposition 1 (chain rule). For any two disjoint sets $S, T \subset [n]$, we have
\[
I(X_T \to X_S) = I(X_{T \cap N_S}; X_S) + I(X_{T \cap \Delta_S} \to X_S | X_{T \cap N_S}). \tag{15}
\]

Proof. For brevity, let us denote $T_1 = T \cap N_S$ and $T_2 = T \cap \Delta_S^+$ (which is equal to $T \cap \Delta_S$ since $S \cap T = \emptyset$). Then
\[
P_{X_T | X_S \leftarrow x_S}(x_T) = P_{X_{T_1} | X_S \leftarrow x_S}(x_{T_1})\,P_{X_{T_2} | X_S \leftarrow x_S, X_{T_1} = x_{T_1}}(x_{T_2}) = P_{X_{T_1}}(x_{T_1})\,P_{X_{T_2} | X_S \leftarrow x_S, X_{T_1} = x_{T_1}}(x_{T_2}),
\]
where the second step uses Lemma 2. Similarly,
\[
P_{X_T | X_S = x_S}(x_T) = P_{X_{T_1} | X_S = x_S}(x_{T_1})\,P_{X_{T_2} | X_S = x_S, X_{T_1} = x_{T_1}}(x_{T_2}).
\]
Therefore,
\[
I(X_T \to X_S) = \mathbb{E}\left[\log \frac{P_{X_{T_1} | X_S}(X_{T_1} | X_S)}{P_{X_{T_1}}(X_{T_1})}\right] + \mathbb{E}\left[\log \frac{P_{X_{T_2} | X_S, X_{T_1}}(X_{T_2} | X_S, X_{T_1})}{P_{X_{T_2} | \hat X_S, X_{T_1}}(X_{T_2} | \hat X_S, X_{T_1})}\right] = I(X_{T_1}; X_S) + I(X_{T_2} \to X_S | X_{T_1}),
\]
which gives us (15).

Corollary 2. For any set $S \subset [n]$,
\[
I(X_{S^c} \to X_S) = I(X_{N_S}; X_S) + I(X_{\Delta_S} \to X_S | X_{N_S}).
\]

Proof. Immediate from the proposition with $T = S^c$.

C. Examples: three canonical causal structures

Many fundamental questions pertaining to causality (including the possibility of discovering causal influences from observational data) can be reduced to the study of three canonical causal structures involving three random variables $X, Y, Z$: the chain $X \to Y \to Z$; the fork $X \leftarrow Y \to Z$; and the collider $X \to Y \leftarrow Z$ [16], [18]. We have the following examples of directed information relations for these structures:

Chain. Since $X$ is a nondescendant of $Y$, we have $I(X \to Y) = I(X; Y)$; since $X$ is the direct cause of $Y$, we have $I(Y \to X) = 0$.
Similarly, we have $I(Y \to Z) = I(Y; Z)$ and $I(Z \to Y) = 0$. Moreover, since $X$ is a nondescendant of $Z$, we have $I(X \to Z) = I(X; Z)$. On the other hand, $I(Z \to X) = 0$.

Fork. $Y$ is the direct cause of $X$, so $I(X \to Y) = 0$, and it is a nondescendant of $X$, so $I(Y \to X) = I(X; Y)$. The same goes for $Y$ and $Z$: $I(Z \to Y) = 0$ and $I(Y \to Z) = I(Y; Z)$. Finally, we have $I(X \to Z) = I(Z \to X) = I(X; Z)$, since there is no directed path from $X$ to $Z$ or from $Z$ to $X$.

Collider. The direction of the links between $X$ and $Y$ and between $Z$ and $Y$ is the reverse of that in the fork, so we have $I(X \to Y) = I(X; Y)$, $I(Y \to X) = 0$, $I(Y \to Z) = 0$, and $I(Z \to Y) = I(Y; Z)$. Finally, since $X$ is a nondescendant of $Z$, we have $I(X \to Z) = I(X; Z) = 0$; similarly, $I(Z \to X) = I(X; Z) = 0$, where we have also used the fact that $X$ and $Z$ are independent.

V. APPLICATION TO IDENTIFICATION OF CAUSAL EFFECTS

One active area of interest in the studies of causality concerns identification of causal effects based on passive observations only. In the context of Markovian dynamical system models, this problem arises whenever only a subset of the variables $X^n$ is available for observation, the goal is to determine the causal effect of one group of variables in this subset upon another, and it is not possible or feasible to actively intervene into the system. Then the relevant question becomes: given a set $V \subset [n]$ that indexes the variables available for observation, is it possible to express the causal effect $P_{X_T | \hat{X}_S}$ for some disjoint sets $S, T \subset V$ in terms of ordinary (noninterventional) probabilities? More precisely, let us assume that we know the structure of the underlying DAG (i.e., the sets $\Pi_i$, $i \in [n]$), but not the functions $f_i$ or the distributions $P_{U_i}$ of the exogenous disturbances.
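The identification problem can be made concrete with a small numerical sketch. Everything below is an assumption for illustration, not taken from the paper: a toy confounded DAG $Z \to X$, $Z \to Y$, $X \to Y$ (so $\Pi_X = \{Z\}$) with arbitrary binary conditional probability tables. Naive conditioning $P(y \mid x)$ is biased by the confounder, while averaging the observational conditional $P(y \mid x, z)$ over the marginal of the direct cause recovers the causal effect from passively observable quantities alone.

```python
import numpy as np

# Assumed toy SCM with a confounder: Z -> X, Z -> Y, X -> Y, so Pi_X = {Z}.
# All CPT entries are arbitrary illustrative numbers.
pZ = np.array([0.4, 0.6])
pXgZ = np.array([[0.9, 0.1], [0.2, 0.8]])        # pXgZ[z, x] = P(X = x | Z = z)
pYgXZ = np.array([[[0.7, 0.3], [0.6, 0.4]],      # pYgXZ[x, z, y] = P(Y = y | X = x, Z = z)
                  [[0.2, 0.8], [0.05, 0.95]]])

joint = np.einsum('z,zx,xzy->zxy', pZ, pXgZ, pYgXZ)   # observational P(z, x, y)
pX = joint.sum(axis=(0, 2))
pXY = joint.sum(axis=0)                               # P(x, y)

# Causal effect via the truncated factorization:
# P(z, y | do(x)) = P(z) P(y | x, z)  =>  P(y | do(x)) = sum_z P(z) P(y | x, z).
pY_do = np.einsum('z,xzy->xy', pZ, pYgXZ)

# Naive conditioning is biased by the confounder: P(y | x) != P(y | do(x)),
# because it averages P(y | x, z) against P(z | x) rather than P(z).
pYgX = pXY / pX[:, None]
assert not np.allclose(pYgX, pY_do)

# Controlling for the direct cause Z, i.e. averaging the observational
# conditional P(y | x, z) over the marginal P(z), recovers the causal effect
# from quantities estimable by passive observation only.
pYgZX_obs = joint / joint.sum(axis=2, keepdims=True)  # P(y | z, x), indexed (z, x, y)
pY_adj = np.einsum('z,zxy->xy', pZ, pYgZX_obs)
assert np.allclose(pY_adj, pY_do)
```

Here `pY_adj` uses only the observational distribution of $(X, Y, Z)$, which is exactly the kind of plug-in rule discussed next.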
What other variables besides those in $S$ and $T$ do we need to observe in order to estimate the causal effect $P_{X_T | \hat{X}_S}$? The idea is that the ordinary conditional probabilities relating the variables in $V$ can be estimated from passive observations, and so $P_{X_T | \hat{X}_S}$ can be estimated using a plug-in rule in terms of these conditional probabilities.

One obvious answer is that it is sufficient to observe $S$, $T$, and all direct causes of the variables in $S$, i.e., those in $\Pi_S$. To see this, let us write down the interventional distribution $P_{X_T | X_S \leftarrow x_S}$ and condition on $X_{\Pi_S}$:
$$P_{X_T | X_S \leftarrow x_S}(x_T) = \sum_{x_{\Pi_S}} P_{X_T | X_S \leftarrow x_S, X_{\Pi_S} = x_{\Pi_S}}(x_T) P_{X_{\Pi_S} | X_S \leftarrow x_S}(x_{\Pi_S}) = \sum_{x_{\Pi_S}} P_{X_T | X_S \leftarrow x_S, X_{\Pi_S} = x_{\Pi_S}}(x_T) P_{X_{\Pi_S}}(x_{\Pi_S}),$$
where the second step uses the fact that $\Pi_S \subseteq N_S$ and Lemma 2. Now, it can be shown that $P_{X_T | X_S \leftarrow x_S, X_{\Pi_S} = x_{\Pi_S}} = P_{X_T | X_S = x_S, X_{\Pi_S} = x_{\Pi_S}}$ [18, Thm. 3.2.2], which is equivalent to $I(X_T \to X_S \mid X_{\Pi_S}) = 0$. This gives
$$P_{X_T | X_S \leftarrow x_S}(x_T) = \sum_{x_{\Pi_S}} P_{X_T | X_S = x_S, X_{\Pi_S} = x_{\Pi_S}}(x_T) P_{X_{\Pi_S}}(x_{\Pi_S}). \quad (16)$$
Thus, if we observe the variables in $T$, $S$, and $\Pi_S$, then we can use (16) to develop an estimate of the causal effect $P_{X_T | \hat{X}_S}$ in terms of the conditional distribution $P_{X_T | X_S, X_{\Pi_S}}$ and the marginal distribution $P_{X_{\Pi_S}}$. Both of these quantities can, in turn, be estimated from passive observations. The intuitive meaning of (16) is that we can estimate the causal effect of $X_S$ on $X_T$ without any need for active experimentation if we can control for the direct causes of $X_S$, i.e., $X_{\Pi_S}$.

Whenever this is not possible, we would still like to know what other variables it suffices to observe in order for the causal effect $P_{X_T | \hat{X}_S}$ to be identifiable. One sufficient condition is due to Pearl, who termed it the "back-door criterion" [18, Sec.
3.3.1], says that certain subsets of the nondescendants of $S$ can be used instead:

Theorem 1 (the back-door criterion: directed information form). Let $S, T \subset [n]$ be such that $T$ is disjoint from $S \cup \Pi_S$. Then for any set $Z \subseteq N_S$ the relation
$$P_{X_T | X_S \leftarrow x_S}(x_T) = \sum_{x_Z} P_{X_T | X_{S \cup Z} = x_{S \cup Z}}(x_T) P_{X_Z}(x_Z) \quad (17)$$
holds if and only if $I(X_T \to X_S \mid X_Z) = 0$.

Proof. Let us condition on $X_Z$:
$$P_{X_T | X_S \leftarrow x_S}(x_T) = \sum_{x_Z} P_{X_T | X_S \leftarrow x_S, X_Z = x_Z}(x_T) P_{X_Z | X_S \leftarrow x_S}(x_Z) = \sum_{x_Z} P_{X_T | X_S \leftarrow x_S, X_Z = x_Z}(x_T) P_{X_Z}(x_Z),$$
where the second step uses the fact that $Z \subseteq N_S$ and Lemma 2. The proof is finished using the fact that $P_{X_T | X_S \leftarrow x_S, X_Z = x_Z} = P_{X_T | X_{S \cup Z} = x_{S \cup Z}}$ for all $x_S, x_Z$ if and only if $I(X_T \to X_S \mid X_Z) = 0$.

The original back-door criterion [18, Sec. 3.3.1] is stated in graphical terms using the notion of d-separation (a graph-based criterion for identifying conditional independence relations), so it can be checked without knowing $\{f_i\}_{i=1}^n$ or $\{P_{U_i}\}_{i=1}^n$. Conceptually, its equivalent information-theoretic form given by the above theorem is similar to statistical sufficiency: if $Z \subseteq N_S$, then $X_Z$ may only depend functionally on $X_T$ (but not on $X_S$ or on any of the descendants of $X_S$), and if $I(X_S; X_T \mid X_Z) = 0$, then $X_Z$ is sufficient for $X_S$ in the ordinary Bayesian sense.

ACKNOWLEDGMENT

The author would like to thank Todd Coleman, Prakash Ishwar, Tara Javidi, Donatello Materassi, and Yury Polyanskiy for stimulating discussions.

REFERENCES

[1] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, "Motif discovery in tissue-specific regulatory sequences using directed information," EURASIP J. Bioinf. Sys. Biol., 2007, art. no. 13853.
[2] P. Mathai, N. C. Martins, and B. Shapiro, "On the detection of gene network interconnections using directed mutual information," in Proc. Inform. Th. Appl.
Workshop, La Jolla, CA, January/February 2007, pp. 274–283.
[3] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," J. Bioinf. Comput. Biol., vol. 6, no. 3, pp. 493–519, 2008.
[4] P.-O. Amblard and O. J. J. Michel, "On directed information and Granger causality graphs," J. Comput. Neurosci., vol. 30, no. 1, pp. 7–16, 2011.
[5] C. J. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," J. Comput. Neurosci., vol. 30, no. 1, pp. 17–44, 2011.
[6] C. J. Quinn, T. P. Coleman, and N. Kiyavash, "Causal dependence tree approximations of joint distributions for multiple random processes," IEEE Trans. Inform. Theory, 2011, submitted. [Online]. Available: http://arxiv.org/abs/1101.5108
[7] J. Massey, "Causality, feedback, and directed information," in Proc. Int. Symp. Inf. Theory Appl., 1990, pp. 303–305.
[8] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology, Zurich, Switzerland, 1998.
[9] S. Tatikonda, "Control under communication constraints," Ph.D. dissertation, MIT, Cambridge, MA, August 2000.
[10] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inform. Theory, vol. 53, no. 1, pp. 323–349, 2009.
[11] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Trans. Inform. Theory, vol. 53, no. 6, pp. 2154–2179, June 2007.
[12] H. Permuter, P. Cuff, B. Van Roy, and T. Weissman, "Capacity of the trapdoor channel with feedback," IEEE Trans. Inform. Theory, vol. 54, no. 7, pp. 3150–3165, July 2008.
[13] S. Gorantla and T. Coleman, "Information-theoretic viewpoints on optimal causal coding-decoding problems," IEEE Trans. Inform.
Theory, 2011, submitted. [Online]. Available: http://arxiv.org/abs/1102.0250
[14] H. H. Permuter, Y.-H. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inform. Theory, vol. 57, no. 6, pp. 3248–3259, June 2011.
[15] J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Francisco, CA: Morgan Kaufmann, 1988.
[16] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. MIT Press, 2000.
[17] J. Pearl, "Causal inference in statistics: an overview," Statist. Surv., vol. 3, pp. 96–146, 2009.
[18] ——, Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, 2009.
[19] H. S. Witsenhausen, "On information structures, feedback and causality," SIAM J. Control, vol. 9, no. 2, pp. 149–160, 1971.
[20] ——, "Separation of estimation and control for discrete time systems," Proc. IEEE, vol. 59, no. 11, pp. 1557–1566, November 1971.
[21] ——, "A standard form for sequential stochastic control," Math. Sys. Theory, vol. 7, no. 1, pp. 5–11, 1973.
[22] S. Wright, "Correlation and causation," J. Agricultural Res., vol. 20, pp. 557–585, 1921.
[23] R. L. Dobrushin, "A general formulation of the basic Shannon theorem in information theory," Uspekhi Math. Nauk, vol. 14, no. 6, pp. 3–103, 1959.
[24] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Sources. Budapest: Akadémiai Kiadó, 1981.