MetaCI: Meta-Learning for Causal Inference in a Heterogeneous Population
Ankit Sharma (TCS Research, Delhi, ankit.sharma16@tcs.com), Garima Gupta (TCS Research, Delhi, gupta.garima1@tcs.com), Ranjitha Prasad (TCS Research, Delhi, ranjitha.prasad@tcs.com), Arnab Chatterjee (TCS Research, Delhi, arnab.chatterjee4@tcs.com), Lovekesh Vig (TCS Research, Delhi, lovekesh.vig@tcs.com), Gautam Shroff (TCS Research, Delhi, gautam.shroff@tcs.com)

Abstract

Performing inference on data obtained through observational studies is becoming extremely relevant due to the widespread availability of data in fields such as healthcare, education, retail, etc. Furthermore, this data is accrued from multiple homogeneous subgroups of a heterogeneous population, and hence, generalizing the inference mechanism over such data is essential. We propose the MetaCI framework with the goal of answering counterfactual questions in the context of causal inference (CI), where the factual observations are obtained from several homogeneous subgroups. While the CI network is designed to generalize from the factual to the counterfactual distribution in order to tackle covariate shift, MetaCI employs the meta-learning paradigm to tackle the shift in data distributions between the training and test phases due to the presence of heterogeneity in the population, and due to drifts in the target distribution, also known as concept shift. We benchmark the performance of the MetaCI algorithm using the mean absolute percentage error over the average treatment effect as the metric, and demonstrate that meta initialization yields significant gains compared to randomly initialized networks and other methods.

1 Introduction

Learning causal relationships is at the heart of several domains such as healthcare, advertising, education, economics, etc.
For instance, personalized and targeted treatment considering an individual's health indicators is crucial in healthcare [1, 2], and a targeted advertising campaign is essential to achieve higher profit margins in channel attribution [3-5]. Causal inference (CI) aims to infer the unbiased causal effect of the treatment from observational data by factoring in the impact of the confounding variables of patients/users. In the context of observational studies, confounding variables affect both the treatment and the outcome, and hence, disentangling the effect of these variables is the key to assessing treatment effectiveness. In this work, we tackle the fact that the study population is heterogeneous, and hence, developing CI-based systems that generalize to new unseen subgroups in the data is essential in order to provide better targeted interventions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Toy example. For the joint distribution of the confounding variables (left), task ω consists of samples in X belonging to a region in the joint pdf (centre). This region is contaminated by smaller non-overlapping regions of the joint pdf, in order to bring in commonality among tasks. These tasks are then the input to the MetaCI framework, which produces the meta initialization Ψ̃_0 (right).

Classical approaches in CI estimate the average treatment effect from observational data by accounting for the selection bias using propensity scores, hence creating unbiased estimators of the average treatment effect (ATE) [6]. More recently, deep neural network based CI approaches have been proposed with different mechanisms to handle the bias.
These include latent variable modeling using VAEs [7], a GAN-based technique [8], and the DNN-based Deep IV [9]. In [10, 11], the authors propose to view the causal inference problem as a covariate shift problem, and propose algorithms that balance between the factual and the counterfactual populations.

Often, observational data is scarce, and the study population is heterogeneous. Subgroup analysis has been proposed in the literature for coping with heterogeneity in the population [12, 13], especially in the context of establishing the effect of the treatment for each subgroup [2]. Our goal is to design a deep neural network based causal inference model that is capable of adapting/generalizing to new subgroups in the input data that may not have been encountered during training. To achieve this goal, we use the novel 'learning to learn' paradigm, also known as the meta-learning framework. Unlike conventional deep neural networks that require large amounts of data for training, meta-learning or few-shot learning learns to learn from previous tasks, by discovering the structure among tasks to enable fast learning of new tasks [14]. In this work, we employ the algorithmic framework for CI proposed in [10, 11], since it is a flexible framework in the context of meta-learning.

Contributions: We apply the meta-optimization based technique known as Reptile to a well-known causal inference model [10]. A crucial design challenge is to define tasks, in the meta-learning sense, appropriately for a given problem. Specifically, we define tasks based on features of the subgroups in such a way that tasks share some commonality w.r.t. the subgroups. For scenarios that have multiple substructures in the deep neural network model, we propose 'multi-Reptile', which employs different learning rates for the parameters of the different substructures.
As in [10], we assume that there is no hidden confounding. We demonstrate the results on two datasets: (a) a synthetic dataset from the advertisement domain [3], and (b) a semi-synthetic dataset based on the IHDP dataset [15]. We employ the mean absolute percentage error (MAPE) defined on the ATE as the metric, and demonstrate that our MetaCI framework counters the effect of heterogeneity in the input population and handles the change in target distributions during inference time, while the CI network counters the issue of covariate shift.

2 Preliminaries

In this section, we describe Reptile, an optimization based meta-learning paradigm, followed by a description of the CI framework proposed in [10].

2.1 Meta-optimization preliminaries: Reptile

Reptile is an optimization based approach to meta-learning, where the model parameters are adapted for fast learning with a few examples. In [16], the authors state the optimization problem in this context for an initial set of parameters Ψ, and a randomly sampled task ω with corresponding loss L_ω, as follows:

    Ψ̂ = arg min_Ψ E_ω [ L_ω( U^L_ω(Ψ) ) ],    (1)

where U^L_ω(·) is an update operator, and L represents the number of stochastic gradient descent (SGD) epochs. As an algorithm, Reptile involves repeatedly sampling a task ω, learning the parameters using an update operator (e.g., SGD) on the data pertaining to ω, and updating these parameters by learning on different tasks. The training phase of this framework provides a meta-initialization for the parameters Ψ of the network, such that, for a new unseen task, the network can be fine-tuned using this meta-initialization and a small amount of data from the new task.
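The inner/outer structure of Reptile can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the quadratic per-task loss, the task target vectors, and all rates and sizes are assumptions that stand in for the CI network's counterfactual loss and real subgroup data.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_update(psi, task_target, lr=0.02, epochs=32):
    """Update operator U^L_w: L epochs of gradient descent on one task.

    Toy per-task loss: L_w(psi) = ||psi - task_target||^2, standing in
    for the counterfactual-inference loss of the CI network."""
    for _ in range(epochs):
        grad = 2.0 * (psi - task_target)
        psi = psi - lr * grad
    return psi

# Each "task" is summarized by a target vector; real tasks would be
# homogeneous subgroups of a heterogeneous population.
tasks = [rng.normal(loc=m, scale=0.1, size=3) for m in (1.0, 1.2, 0.8)]

psi = np.zeros(3)    # initial meta-parameters Psi
eps = 0.5            # outer (meta) learning rate epsilon
for r in range(200):
    target = tasks[rng.integers(len(tasks))]   # sample task w
    psi_tilde = inner_update(psi.copy(), target)
    psi = psi + eps * (psi_tilde - psi)        # move toward the task solution
```

After training, `psi` hovers near the common structure of the task optima, so any single task can be reached with a few inner-loop steps, which is the meta-initialization property exploited by MetaCI.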
We employ the parallel version of Reptile, where the solution to the optimization problem in (1) is approached via the update

    Ψ ← Ψ + ε (Ψ̃ − Ψ),    (2)

where ε is an adaptive learning rate, and Ψ̃ is obtained after applying the update operator to the ω-th task's data. In this work, we consider tasks pertaining to causal inference, where the goal is to learn a model for counterfactual inference. Hence, U^L_ω(Ψ) is a stochastic gradient descent operator which optimizes a cost function pertaining to counterfactual inference as given in [10]. We use the meta-optimization framework to tackle both the prior shift that occurs due to a drift in the feature distribution across tasks, and the concept shift that occurs due to a drift in the probability distribution of the target variables [17]. In the sequel, we provide the basic setting of a causal inference problem, and describe the CI network which we use as the update operator U^L_ω(Ψ).

2.2 Causal inference preliminaries

In this subsection, we describe the problem of counterfactual inference in the meta-optimization framework. The CI network that we employ was proposed in [10, 11]. Let T represent the set of treatments, X_ω the set of contexts, and Y_ω the set of possible outcomes w.r.t. the ω-th task. We assume that the treatment is binary, that is T ∈ {0, 1}, where we refer to t = 1 as treated and t = 0 as control. Note that, for a given context x_ω ∈ X_ω, we observe only one of the potential outcomes y_ω ∈ Y_ω, according to the treatment provided, i.e., if t_ω = 0 we observe y_ω = Y_0^ω, and if t_ω = 1 we observe y_ω = Y_1^ω. Accordingly, the individual treatment effect (ITE) for a context x_ω in task ω is given by

    ITE(x_ω) = Y_1^ω(x_ω) − Y_0^ω(x_ω).
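The potential-outcome prediction underlying the ITE can be sketched with a minimal two-head network in the style of [10]: a shared representation Φ followed by treatment-specific hypothesis heads. The single ReLU layer, the layer sizes, and the random (untrained) weights below are illustrative assumptions; the real network is trained with the balancing objective of [10, 11].

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared representation Phi(x) plus per-treatment hypothesis heads h_0, h_1.
d_in, d_rep = 10, 8
W_phi = rng.normal(scale=0.3, size=(d_in, d_rep))   # representation weights
W_h = {0: rng.normal(scale=0.3, size=d_rep),        # head for control (t = 0)
       1: rng.normal(scale=0.3, size=d_rep)}        # head for treated (t = 1)

def phi(x):
    """Representation layer Phi (one ReLU layer, for illustration)."""
    return np.maximum(x @ W_phi, 0.0)

def predict(x, t):
    """h(Phi(x), t): predicted outcome under treatment t."""
    return phi(x) @ W_h[t]

def ite_hat(x):
    """Estimated individual treatment effect Y_1(x) - Y_0(x)."""
    return predict(x, 1) - predict(x, 0)

x = rng.normal(size=d_in)    # one context from task w
effect = ite_hat(x)          # one head is factual, the other counterfactual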
Furthermore, we are also interested in the average treatment effect (ATE) averaged over all tasks and contexts, defined as

    ATE = E_{ω∼p(ω)} E_{x_ω∼p(x_ω)} [ ITE(x_ω) ].

In [10], the authors perform counterfactual inference by generalizing from the factual to the counterfactual distribution. To this end, they learn a representation Φ_ω and a function h_ω, such that one term optimizes the prediction error w.r.t. the observed outcomes over the factual representation, while the other term ensures that the distributions of the treatment populations are similar, or balanced, for a given task ω, thus tackling the issue of covariate shift [11]. Accordingly, the objective to minimize is

    L(α_ω, γ) = (1/N_ω) Σ_{i=1}^{N_ω} w_i^ω L( h_ω(Φ_ω(x_i^ω), t_i^ω), y_i^{F,ω} ) + α_ω disc( P̂^F_{Φ_ω}, P̂^{CF}_{Φ_ω} ) + γ R(h_ω),    (3)

where α_ω, γ > 0 are hyper-parameters that control the strength of the imbalance penalties, the weights w_i compensate for the difference in treatment group sizes, R(h_ω) is a model complexity term, P̂^F(·) and P̂^{CF}(·) represent the factual and counterfactual distributions, respectively, and disc(·,·) is the discrepancy measure defined in [11].

3 MetaCI Model

In this section, we present the process of task creation, and describe the proposed MetaCI model.

Figure 2: Block diagram describing the MetaCI framework for a given task ω.

3.1 Task creation

It is well known that a good meta-learning model should be trained on a diverse set of learning tasks and optimized based on the probability distribution over tasks, including potentially unseen tasks. Defining task similarity is the key overarching challenge in meta-learning.
In the presence of heterogeneity in the population, we employ our knowledge of the features specific to subgroups, which are also the confounding variables, in order to define tasks. We create tasks by combining a majority of samples from one subgroup with a few samples from other subgroups in fixed proportions. Mathematically, using the joint distribution of the confounding variables, we ensure that we choose a subgroup that lies in a given region of the joint distribution, and mix it with samples from smaller disjoint regions of the same joint probability distribution, as depicted for a toy example in Fig. 1.

4 Proposed Model

In this section, we propose the novel MetaCI algorithm, where we combine a variant of the Reptile framework with the causal inference framework of [10]. As depicted in the neural network model in Fig. 2, the sampling of tasks and the update of weights using the multi-Reptile meta-learning algorithm occur outside the CI block. The CI block constitutes the update operator in the context of the meta-learning framework, and L SGD epochs are used per meta-iteration. We term the meta-learning variant multi-Reptile, since it employs multiple adaptive learning rates for different subsets of parameters of the update operator U^L_ω(W). Specifically, in the case of the CI network, we employ different learning rates for the representation and the hypotheses layers. The MetaCI algorithm is formally stated in Algorithm 1.

Algorithm 1 MetaCI algorithm
1: procedure META-CI(arguments)
2:   Sample a test task ω ∈ Ω_te; the remaining tasks Ω_tr constitute the pool of train tasks.
3:   for R iterations do
4:     Sample task ω ∈ Ω_tr.
5:     Compute the weights W̃_Φ and W̃_h using U^L_ω(W_Φ) and U^L_ω(W_h), respectively.
6:     Meta-update the weights of the representation layer: W_Φ^{r+1} = W_Φ^r + ε_Φ (W̃_Φ − W_Φ^r)
7:     Meta-update the weights of the hypotheses layer: W_h^{r+1} = W_h^r + ε_h (W̃_h − W_h^r)
8:   end for
9:   return W_Φ and W_h.
10: end procedure

5 Experiments

In this section, we describe the datasets, the mechanism used for creating tasks for each dataset as described in Sec. 3.1, the metrics we employ for evaluation, and finally the experimental results.

5.1 Datasets

We demonstrate the performance of the proposed algorithms on a synthetically generated advertisement dataset [3] and the semi-synthetic IHDP dataset [15].*

5.1.1 Synthetic advertisement dataset

We use a synthetic data generating process (DGP) to generate data relevant to the advertisement domain, as described in [3]. We set the sample size N = 2000 and the number of features p = 10. We generate features q_1, ..., q_p ∼ N(0, 1), and the basis functions f_1(x), ..., f_10(x) as described in [3]. We restrict the treatment T to be binary, and generate the treatment as T | X = 1 if a draw from N(Σ_{j=1}^5 f_j(q_j), 1) is greater than 0, and 0 otherwise. Further, we generate the response as Y | T, X ∼ N(Σ_{j=1}^5 f_{j+5}(q_j) + η_T T, θ). We set θ = 1 to generate data for demonstrating the effect of covariate shift, and set θ to 1, 10 and 20 to generate data for demonstrating the effect of concept shift. Note that the features q_1, ..., q_5 have confounding effects on both the treatment and the outcome, while the remaining features contribute to the noise in the model.

5.1.2 Semi-synthetic IHDP dataset

The Infant Health Development Program (IHDP) [18] dataset consists of measurements of mothers and children for studying the effect of specialist home visits on future cognitive test scores. The dataset comprises 4302 infants with 25 features.
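The synthetic advertisement DGP above can be sketched as follows. The basis functions f_1, ..., f_10 of [3] are not reproduced here; the simple nonlinear placeholders and the effect size η_T = 1.0 are illustrative assumptions, while the sample size, feature count, and the treatment/response mechanisms follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

N, p = 2000, 10
theta = 1.0      # response noise variance (set to 1, 10, 20 for concept shift)
eta_T = 1.0      # treatment effect size (assumed value)

# Placeholder nonlinearities standing in for the basis functions of [3].
f = [np.sin, np.tanh, np.abs, np.square, np.cos,      # f_1 .. f_5
     np.sin, np.tanh, np.abs, np.square, np.cos]      # f_6 .. f_10

Q = rng.normal(size=(N, p))                           # q_1, ..., q_p ~ N(0, 1)

# Treatment: T = 1 iff a draw from N(sum_{j<=5} f_j(q_j), 1) is positive.
mu_t = sum(f[j](Q[:, j]) for j in range(5))
T = (rng.normal(loc=mu_t, scale=1.0) > 0).astype(int)

# Response: Y | T, X ~ N(sum_{j<=5} f_{j+5}(q_j) + eta_T * T, theta).
mu_y = sum(f[j + 5](Q[:, j]) for j in range(5))
Y = rng.normal(loc=mu_y + eta_T * T, scale=np.sqrt(theta))

# q_1..q_5 confound both T and Y; q_6..q_10 are pure noise features.
```

Because the same q_1, ..., q_5 drive both the treatment assignment and the response, naive outcome regression is confounded, which is exactly the setting the CI network is designed for.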
Out of these, 8 features are selected based on the ACIC challenge (2017) to obtain the context information X. Specifically, these features form the basis of the meta-learning tasks obtained using the DGP [15].

5.2 Task creation for Reptile

Here we describe the process of task creation used to demonstrate the performance of the MetaCI framework in the presence of covariate and concept shift, for the datasets described in the previous section.

5.2.1 Covariate shift

Tasks in the synthetic dataset: In order to appropriately provide tasks to the MetaCI framework in the presence of covariate shift, we generate 2000 users distinguished based on the set of features, with the number of tasks given by the cardinality of Ω. We consider these |Ω| disjoint chunks, and mix each with samples from other chunks in the ratio 3:2, i.e., each task consists of 60% of samples from a given chunk, and 40% of samples in equal proportion from k other chunks. For every subgroup, T | X and Y | T, X are generated using the generating process specified in [3]. In the single-feature case, the data is split on the basis of the first feature, which is one of the confounding variables. In the case of multiple confounding features, the data is split on the basis of the first two features, which are confounding. We create tasks based on the joint distribution of the confounding features as outlined in Sec. 3.1.

Tasks in the IHDP dataset: Here we create tasks for the MetaCI framework for the IHDP dataset, with the end goal of demonstrating the performance of the proposed algorithm in the presence of covariate shift. We define tasks by dividing the entire population of infants, given as a finite number of contexts in the ACIC challenge dataset (2017), into |Ω| equal-sized chunks. We create these chunks based on the joint distribution of multiple confounding features.
Specifically, we consider the mother's age, the child's bilirubin level and the mother's place of birth. Each chunk is mixed with samples from other chunks in the ratio 3:2, i.e., each task dataset X_ω consists of 60% of samples from a given chunk, and 40% of samples in equal proportion from k other chunks. For each of the tasks, T and Y_ω are generated synthetically using the heteroskedastic, additive-error DGP given in [15]. In both the above cases, the number of chunks used for mixing (k) is an experimental variable and lies in the range [1, |Ω| − 1].

* The simulated datasets will be available upon request from the authors post publication of the paper.

5.2.2 Concept and covariate shift

Tasks in the synthetic and IHDP dataset scenarios: In order to demonstrate the performance of MetaCI in the presence of concept shift, we use two different generation processes which differ in the generation of the response variable Y. Accordingly, we describe two types of task creation as follows:

1. Case 1 - concept shift using 2 DGPs: Based on the confounding features of the datasets, we consider 4 chunks per DGP and 3 chunks per DGP, in the synthetic and IHDP datasets, respectively.
2. Case 2 - concept shift using 3 DGPs: We consider 3 chunks per DGP and 2 chunks per DGP, in the synthetic and IHDP datasets, respectively.

In both the above cases, the chunks are mixed within and across groups by retaining 60% of the samples of one chunk, and replacing the remaining 40% with samples from other chunks, to create tasks. The mixed chunks contribute to generating the responses as dictated by the number of DGPs. Across DGPs, the parameters of the distribution used to sample Y | T, X are varied to demonstrate concept shift.

5.3 Metrics

In this subsection we describe the performance metrics used for evaluating the proposed causal meta model.
We use the average treatment effect (ATE_{ω,r}) for the r-th test iteration and test task ω as the performance metric, defined as

    ATE_{ω,r} = (1 / 2N_{ω,t_1}) Σ_{i=1}^{N_{ω,t_1}} (y_{i,1} − ŷ_{i,0}) + (1 / 2N_{ω,t_0}) Σ_{i=1}^{N_{ω,t_0}} (ŷ_{i,1} − y_{i,0}),    (4)

where y_{i,1} (y_{i,0}) is the factual response to treatment t_i = 1 (t_i = 0), ŷ_{i,0} (ŷ_{i,1}) is the corresponding counterfactual response, and N_{ω,t_1} (N_{ω,t_0}) is the number of samples in task ω that are offered treatment 1 (0). In order to eliminate any bias in the test set, we report the averaged ATE corresponding to the iteration that has the least averaged validation objective across the test sets of the meta-test tasks. In the following section, we report the mean absolute percentage error defined on the ground truth ATE, denoted ATE_G, and the ATE obtained as above:

    MAPE = | ATE_G − ATE | / ATE_G,    (5)

i.e., lower values of MAPE indicate that the obtained ATE values are closer to the ground truth ATE.

5.4 Experimental details and results

In this section, we report the experimental details and the results obtained. We split the |Ω| tasks into |Ω| − 1 train tasks and one test task, as shown in Fig. 3. Every train task is divided in the ratio 1:1 into training and validation sets, and the test task is divided in the ratio 2:1:1 into training, validation and test sets. The MetaCI framework is trained for 1000 iterations by sampling a train task in each iteration. For each iteration r, the weights W^r of the causal meta model are computed after L = 64 epochs of mini-batch stochastic gradient descent (SGD) over the batches of the train set of the train task. These weights W^i (where r = i during training of MetaCI) are then used to update the initial weights W^{0,i} present at the start of each iteration using the Reptile update in Eq. (2).
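The evaluation metric can be sketched directly from its definition: the ATE estimate combines factual outcomes with model-predicted counterfactuals, and MAPE compares it to the ground truth. The dummy arrays below are illustrative, and the halved group-wise averages reflect the reconstruction of the garbled Eq. (4).

```python
import numpy as np

def ate_estimate(y1_fact, y0_hat, y0_fact, y1_hat):
    """ATE estimate: factual-minus-counterfactual averaged over treated
    units (y1_fact, y0_hat) and over control units (y0_fact, y1_hat)."""
    treated_term = np.mean(y1_fact - y0_hat) / 2.0
    control_term = np.mean(y1_hat - y0_fact) / 2.0
    return treated_term + control_term

def mape(ate_gt, ate_hat):
    """Relative error of the estimated ATE against the ground truth."""
    return abs(ate_gt - ate_hat) / ate_gt

# Toy example: the true effect is 2.0 and the counterfactual predictions
# are exact, so the estimate matches the ground truth and MAPE is 0.
y0 = np.array([1.0, 2.0, 3.0, 4.0])
y1 = y0 + 2.0
err = mape(2.0, ate_estimate(y1[:2], y0[:2], y0[2:], y1[2:]))  # first two units treated
```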
We pick the best train-task hyper-parameters (learning rate, dropout, ε) corresponding to the least value of the validation loss function averaged across all iterations. We evaluate the performance on the test set of the test task (refer to Fig. 3) by tuning the meta causal model's weights (W^{0,j}, where j is every 100th iteration) for 64 epochs on the test task's train set. The best hyper-parameters for the test task are obtained in the same manner as discussed for the training phase. We repeat each experiment by considering each of the |Ω| tasks as the meta-test task, and report the MAPE averaged across the test sets of each test task, as in Fig. 3.

Figure 3: Block diagram describing the training procedure of MetaCI.

We consider two baselines for MetaCI. The first baseline is the meta-learning based Reptile algorithm that uses the NN4 causal inference network presented in [10]. NN4 does not incorporate a representation layer Φ, as compared to the CI neural network in [10], and hence it is a good baseline. The authors demonstrate the superiority of their proposed network as compared to NN4. In the tables that follow in the next section, we employ two variants of this baseline, namely MetaNN4, which uses meta-initialization, and RandomNN4, which uses random initialization, both along with NN4. By adopting NN4 along with meta-learning, we verify that the gains obtained by using the CI network as compared to NN4 carry over when we use meta-learning. In addition, we provide another baseline which consists of the CI network trained for a large number of epochs over data from each task, but initialized using random initialization.
This baseline helps us gauge the performance of the CI network when the data is not provided in a meta-learning fashion. We denote this baseline as CI_Ω in the tables that follow in the next section.

5.5 Results

We demonstrate the performance of MetaCI for varying numbers of tasks (|Ω|), varying k, and varying ε, using different settings for task creation, in the context of the synthetic and semi-synthetic datasets discussed in the previous section. We present the results pertaining to data that sees a covariate shift, and the combined effect of both concept and covariate shift. Convergence is demonstrated in Fig. 4.

5.5.1 Covariate shift

Varying number of subgroups (|Ω|): We study the performance by measuring the MAPE for a varying number of tasks, to study the effect of meta-initialization. In the context of the synthetic dataset, we have the flexibility of generating as many samples as we require per task. Hence, in Tables 1 and 2 we set the number of samples per task to be the same. However, the number of users is fixed in the case of the IHDP dataset, and hence, the number of samples per task goes down as the number of tasks increases. Furthermore, we set k = |Ω| − 1, i.e., as the number of tasks increases, the number of mixing chunks also increases, hence decreasing the commonality between tasks. Hence, we expect to observe a trade-off between the data per task N_ω and k. From Tables 1 and 2, we see that this is indeed true, since we get the best MAPE for |Ω| = 7 when a single feature is used for task creation in the synthetic dataset, and |Ω| = 4 and |Ω| = 6 when multiple features are used for task creation in the IHDP and synthetic datasets, respectively. Furthermore, we see that the proposed technique performs better compared to the baselines described in the previous section.
Table 1: MAPE, varying |Ω| (k = |Ω| − 1), using a single feature for task creation (synthetic dataset).

|Ω|   N_ω    CI_Ω     MetaCI   MetaNN4  RandomCI  RandomNN4
4     2000   0.7305   0.3513   1.9400   1.5563    1.9991
7     2000   0.7088   0.2473   2.0993   1.3160    1.9348
9     2000   0.9487   0.4284   1.9995   1.2832    1.4622
11    2000   0.8036   0.3475   0.7929   1.2855    0.9831

Table 2: MAPE, varying |Ω| (N_ω), using multiple features for task creation.

IHDP                                 Synthetic dataset
|Ω|   N_ω    MetaCI   RandomCI      |Ω|   N_ω    MetaCI   RandomCI
4     1144   0.5164   1.6896        4     2000   0.3276   1.1786
6     764    0.5112   1.8492        7     2000   0.5528   0.6976
8     498    1.7422   2.2367        9     2000   0.5762   0.6714

Figure 4: Comparison of the validation objective (on test) across a varying number of training epochs. (Left) IHDP dataset; (right) synthetic dataset.

Table 3: MAPE, varying k, using a single feature (|Ω| = 7) and multiple features (|Ω| = 6 and |Ω| = 7) for task creation.

Synthetic (single feature)    IHDP (multiple features)    Synthetic (multiple features)
k    MetaCI   RandomCI        k    MetaCI   RandomCI      k    MetaCI   RandomCI
3    0.4030   1.6603          2    0.7537   1.1847        3    0.8658   0.9750
4    0.2490   1.4428          4    0.4546   1.7456        4    0.8596   0.9263
6    0.2473   1.3160          5    0.5112   1.8492        6    0.5528   0.6976

Table 4: Performance of the MetaCI framework for three scenarios, where the relative speeds of weight adaptation of the representation and hypotheses layers are varied (|Ω| = 4, k = 3).

Scenario                              ε_h > ε_Φ   ε_h = ε_Φ   ε_h < ε_Φ
Multiple-covariate IHDP dataset       0.4994      0.5164      0.7169
Single-covariate synthetic dataset    0.3230      0.3513      0.5840
Multiple-covariate synthetic dataset  0.3621      0.3276      0.5966

Varying the number of chunks used for mixing (k): We vary the number of mixing chunks k, for a fixed number of tasks |Ω|, to study the effect of mixing on the performance as measured by MAPE. For |Ω| = 7 and |Ω| = 6, we see in Table 3 that varying k leads to ATE values closer to the ground truth ATE.
Varying the meta learning rate ε: We demonstrate the relative performance of multi-Reptile, where we vary the relative weights (ε) assigned to the parameters of the representation layer (W_Φ) vis-à-vis the weights assigned to the parameters of the hypotheses layer (W_h). Across several scenarios and datasets, as shown in Table 4, we observe that adopting a slower learning rate for the representation layer as compared to the hypotheses layer leads to an ATE very close to the ground truth ATE. Intuitively, the representation layer minimizes the discrepancy between distributions, which may vary slowly across tasks.

5.5.2 Concept and covariate shift

In this section, we present results for datasets in which we synthetically simulate concept and covariate shift at the same time. While covariate shift is inherent to the CI setting and arises due to confounding variables, concept shift arises due to the change in the probability distribution of the response variable conditioned on the input and treatment. In Table 5, we demonstrate the performance of the MetaCI algorithm when there are 2 and 3 DGPs for generating the response, as discussed in Sec. 5.2.2. The mean (μ_d) and variance (σ_d²) of the ATEs per DGP for both datasets are shown in Table 5, where d = 1, 2, .... We observed that MetaCI converges faster as compared to RandomCI for both datasets.

Table 5: Performance of MetaCI in the case of covariate and concept shift, using the synthetic and IHDP datasets.
           # DGPs   (μ_1, σ_1²)        (μ_2, σ_2²)        (μ_3, σ_3²)        MetaCI   RandomCI
Synthetic  2        (0.4822, 0.0003)   (0.5356, 0.1739)   -                  0.4559   1.2523
           3        (0.5400, 0.0004)   (0.4733, 0.0337)   (0.7433, 0.4994)   1.3153   2.0104
IHDP       2        (0.1515, 0.0002)   (0.9294, 0.0006)   -                  0.9419   1.3600
           3        (0.1521, 0.0007)   (0.9000, 0.0006)   (1.0288, 0.0080)   1.6135   1.8699

6 Conclusions

In this work, we demonstrated the efficacy of the meta-learning based Reptile framework in a causal inference setting for a heterogeneous population. We showed that the meta-learning approach is a modern alternative to classical subgroup analysis, where the subgroups can be translated into tasks in the meta-learning framework. We provided a novel approach to create tasks based on the confounding features, and showed that it is possible to obtain a good meta initialization which leads to significant improvement in ATE on unseen data. We also showed that the MetaCI framework adapts its parameters in the presence of both covariate and concept shift in the dataset, and outperforms the baselines by large margins. We also provide specific details regarding training meta-learning based deep neural network models, which is itself a contribution to the current literature.

References

[1] Samir M Hanash, Christina S Baik, and Olli Kallioniemi. Emerging molecular biomarkers: blood-based strategies to detect and monitor cancer. Nature Reviews Clinical Oncology, 8(3):142, 2011.
[2] Heidi Seibold, Achim Zeileis, and Torsten Hothorn. Model-based recursive partitioning for subgroup analyses. The International Journal of Biostatistics, 12(1):45-63, 2016.
[3] Wei Sun, Pengyuan Wang, Dawei Yin, Jian Yang, and Yi Chang. Causal inference via sparse additive models with application to online advertising. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[4] Matt Taddy, Matt Gardner, Liyun Chen, and David Draper. A nonparametric Bayesian analysis of heterogenous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661-672, 2016.
[5] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207-3260, 2013.
[6] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55, 1983.
[7] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446-6456, 2017.
[8] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. 2018.
[9] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1414-1423. JMLR.org, 2017.
[10] Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020-3029, 2016.
[11] Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3076-3085. JMLR.org, 2017.
[12] Tyler J VanderWeele, Alex R Luedtke, Mark J van der Laan, and Ronald C Kessler.
Selecting optimal subgroups for treatment using many covariates. Epidemiology, 30(3):334-341, 2019.
[13] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228-1242, 2018.
[14] Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.
[15] P Richard Hahn, Vincent Dorie, and Jared S Murray. Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017. arXiv preprint arXiv:1905.09515, 2019.
[16] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[17] Wouter M Kouw. An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806, 2018.
[18] Ruth T. Gross. Infant Health and Development Program (IHDP): Enhancing the outcomes of low birth weight, premature infants in the United States, 1985-1988. MI: Inter-university Consortium for Political and Social Research, 1993.