Learning Cost-Effective Treatment Regimes using Markov Decision Processes
Authors: Himabindu Lakkaraju (Stanford University), Cynthia Rudin (Duke University)
Abstract

Decision makers, such as doctors and judges, make crucial decisions such as recommending treatments to patients and granting bail to defendants on a daily basis. Such decisions typically involve weighing the potential benefits of taking an action against the costs involved. In this work, we aim to automate this task of learning cost-effective, interpretable and actionable treatment regimes. We formulate this as a problem of learning a decision list – a sequence of if-then-else rules – which maps characteristics of subjects (e.g., diagnostic test results of patients) to treatments. We propose a novel objective to construct a decision list which maximizes outcomes for the population and minimizes overall costs. We model the problem of learning such a list as a Markov Decision Process (MDP) and employ a variant of the Upper Confidence Bound for Trees (UCT) strategy which leverages customized checks for pruning the search space effectively. Experimental results on real-world observational data capturing judicial bail decisions and treatment recommendations for asthma patients demonstrate the effectiveness of our approach.

1 Introduction

Medical and judicial decisions can be complex: they involve careful assessment of the subject's condition, analysis of the costs associated with the possible actions, and consideration of the nature of the consequent outcomes. Further, there might be costs associated with the assessment of the subject's condition itself (e.g., physical pain endured during medical tests, monetary costs, etc.).
For instance, a doctor first diagnoses the patient's condition by studying the patient's medical history and ordering a set of relevant tests that are crucial to the diagnosis. In doing so, she also factors in the physical, mental and monetary costs incurred due to each of these tests. Based on the test results, she carefully deliberates various treatment options and analyzes the potential side-effects as well as the effectiveness of each of these options. Analogously, a judge deciding if a defendant should be granted bail studies the criminal records of the defendant and enquires about additional information (e.g., the defendant's personal life or economic status) if needed. She then recommends a course of action that trades off the risk of granting bail to the defendant (the defendant may commit a new crime when out on bail) against the cost of denying bail (adverse effects on the defendant or the defendant's family, cost of jail to the county).

    If Spiro-Test = Pos and Prev-Asthma = Yes and Cough = High then C
    Else if Spiro-Test = Pos and Prev-Asthma = No then Q
    Else if Short-Breath = Yes and Gender = F and Age ≥ 40 and Prev-Asthma = Yes then C
    Else if Peak-Flow = Yes and Prev-RespIssue = No and Wheezing = Yes then Q
    Else if Chest-Pain = Yes and Prev-RespIssue = Yes and Methacholine = Pos then C
    Else Q

Figure 1: Regime for treatment recommendations for asthma patients output by our framework; Q refers to milder forms of treatment used for quick relief, and C corresponds to more intense treatments such as controller drugs (C is higher cost than Q); attributes in blue are least expensive.

In practical situations, human decision makers often leverage personal experience to make decisions, without considering data, even if massive amounts of it exist for the problem at hand.
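In code, a regime like the one in Figure 1 is simply an ordered cascade of checks. The sketch below is illustrative only: the attribute names and the dictionary encoding of a patient are our own assumptions, not part of the learned regime.

```python
# Illustrative sketch of the Figure 1 decision list; attribute names and the
# dictionary encoding of a patient are assumptions, not part of the framework.
# "C" = controller drugs (higher cost), "Q" = quick-relief medication.
def recommend(p):
    if p["spiro_test"] == "Pos" and p["prev_asthma"] and p["cough"] == "High":
        return "C"
    if p["spiro_test"] == "Pos" and not p["prev_asthma"]:
        return "Q"
    if p["short_breath"] and p["gender"] == "F" and p["age"] >= 40 and p["prev_asthma"]:
        return "C"
    if p["peak_flow"] and not p["prev_resp_issue"] and p["wheezing"]:
        return "Q"
    if p["chest_pain"] and p["prev_resp_issue"] and p["methacholine"] == "Pos":
        return "C"
    return "Q"  # default rule
```

Note that the order of the rules matters: a subject is assigned the treatment of the first rule whose condition they satisfy.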
There exist domains where machine learning models could potentially help – but they would need to consider all three aspects discussed above: predictions of counterfactuals, costs of gathering information, and costs of treatments. Further, these models must be interpretable in order to create any reasonable chance of a human decision maker actually using them. In this work, we address the problem of learning such cost-effective, interpretable treatment regimes from observational data.

Prior research addresses various aspects of the problem at hand in isolation. For instance, there exists a large body of literature on estimating treatment effects [8, 24, 7], recommending optimal treatments [1, 34, 9], and learning intelligible models for prediction [19, 16, 21, 4]. However, an effective solution for the problem at hand should ideally incorporate all of the aforementioned aspects. Furthermore, existing solutions for learning treatment regimes account for neither the costs associated with gathering the required information nor the treatment costs. The goal of this work is to propose a framework which jointly addresses all of the aforementioned aspects.

We address the problem at hand by formulating it as a task of learning a decision list that maps subject characteristics to treatments (such as the one shown in Figure 1) such that it: 1) maximizes the expectation of a pre-specified outcome when used to assign treatments to a population of interest; 2) minimizes the costs associated with assessing subjects' conditions; and 3) minimizes the costs associated with the treatments themselves. We choose decision lists to express the treatment regimes because they are highly intelligible and, therefore, readily employable by decision makers. We propose a novel objective function to learn a decision list optimized with respect to the criteria highlighted above.
We prove that optimizing the proposed objective is NP-hard, by reduction from the weighted exact cover problem. We then optimize this objective by modeling it as a Markov Decision Process (MDP) and employing a variant of the Upper Confidence Bound for Trees (UCT) strategy which leverages customized checks for pruning the search space effectively. We empirically evaluate the proposed framework on two real-world datasets: 1) judicial bail decisions, and 2) treatment recommendations for asthma patients. Our results demonstrate that the regimes output by our framework result in improved outcomes compared to state-of-the-art baselines at much lower costs. Further, the treatment regimes output by our approach are less complex and require fewer diagnostic checks to determine the optimal treatment.

2 Related Work

Below, we provide an overview of related research on learning treatment regimes, dynamic optimal treatment regimes, subgroup analysis, and interpretable models.

Treatment Regimes. The problem of learning treatment regimes has been extensively studied in the context of medicine and health care. Along the lines of [36], the literature on treatment regimes can be categorized into regression-based methods and policy-search-based methods. Regression-based methods [28, 32, 29, 33, 40, 26] model the conditional distribution of the outcomes given the treatment and characteristics of patients, and choose the treatment resulting in the best possible outcome for each individual. Policy-search-based methods search for a policy (a function which assigns treatments to individuals) within a pre-specified class of policies. The policy is chosen to optimize the expected outcome across the population of interest. Examples of such estimators include marginal structural mean models [29], outcome weighted learning [39, 38], and robust marginal mean models [35, 36].
Very few of the aforementioned solutions [36, 25] produce regimes which are intelligible, and none of them explicitly account for treatment costs or the costs associated with gathering information pertaining to patient characteristics. While most work on learning treatment regimes has been done in the context of medicine, the same ideas apply to policies in other fields. To the best of our knowledge, this work is the first attempt at extending work on treatment regimes to judicial bail decisions.

Dynamic Treatment Regimes. Recent research in personalized medicine has focused on developing dynamic treatment regimes [15, 37, 34, 9]. The goal is to learn treatment regimes that maximize outcomes for patients in a given population by recommending a sequence of appropriate treatments over time, based on the state of the patient. Little attention has been paid to interpretability in this literature (with the exception of [37]), and none of the prior solutions for this problem consider treatment costs or the costs associated with diagnosing a patient's condition.

Subgroup Analysis. The goal of this line of research is to find out whether there exist subgroups of individuals in which a given treatment exhibits heterogeneous effects, and if so, how the treatment effect varies across them. This problem has been well studied [31, 10, 20, 3, 11]. However, identifying subgroups with heterogeneous treatment effects does not readily provide us with regimes.

Interpretable Models. A large body of machine learning literature has focused on developing interpretable models for classification [19, 16, 21, 4] and clustering [12, 18, 17]. To this end, various classes of models such as decision lists [19], decision sets [16], prototype (case) based models [4], and generalized additive models [21] were proposed. These classes of models were not conceived to model treatment effects.
There has been recent work on leveraging decision lists to describe estimated treatment regimes [25, 14, 36]. These solutions do not account for the treatment costs or the costs involved in gathering patient characteristics. They are also constructed using greedy methods, which compromises the quality of the resulting models.

3 Our Framework

First, we formalize the notion of treatment regimes and discuss how to represent them as decision lists. We then propose an objective function for constructing cost-effective treatment regimes.

3.1 Input Data and Cost Functions

Consider a dataset D = {(x_1, a_1, y_1), (x_2, a_2, y_2), ··· (x_N, a_N, y_N)} comprised of N independent and identically distributed observations, each of which corresponds to a subject (individual), potentially from an observational study. Let x_i = ⟨x_i^(1), x_i^(2), ··· x_i^(p)⟩ ∈ V_1 × V_2 × ··· × V_p denote the characteristics of subject i. V_f denotes the set of all possible values that can be assumed by a characteristic f ∈ F = {1, 2, ··· p}. Each characteristic f ∈ F can be a binary, categorical or real-valued variable. In the medical setting, example characteristics include a patient's age, BMI, gender, red blood cell count, glucose level, etc. Let a_i ∈ A = {1, 2, ··· m} and y_i ∈ ℝ denote the treatment assigned to subject i and the corresponding outcome respectively. We assume that y_i is defined such that higher values indicate better outcomes. For example, the outcome of a patient can be regarded as a wellness improvement score that indicates the effectiveness of the assigned treatment.

It can be much more expensive to determine certain subject characteristics compared to others. For instance, a patient's age can be easily retrieved either from previous records or by asking the patient.
On the other hand, determining her glucose level requires more comprehensive testing, and is therefore more expensive in terms of the monetary costs, time, and effort required from both the patient and the clinicians. We assume access to a function d : F → ℝ which returns the cost of determining any characteristic in F. The cost associated with a given characteristic f ∈ F is assumed to be the same for all the subjects in the population, though the framework can be extended to have patient-specific costs. Analogously, each treatment a ∈ A incurs a cost, and we assume access to a function d′ : A → ℝ which returns the cost associated with treatment a ∈ A.

We now discuss the notion of a treatment regime formally, and then introduce the class of models that we employ to express such regimes.

3.2 Treatment Regimes

A treatment regime is a function which takes as input the characteristics of any given subject x and maps them to an appropriate treatment a ∈ A. As discussed, prior studies [30, 23] suggest that decision makers such as doctors and judges who make high-stakes decisions are more likely to trust, and therefore employ, models which are interpretable and transparent. We thus employ decision lists to express treatment regimes (see the example in Figure 1). A decision list is an ordered list of rules embedded within an if-then-else structure. A treatment regime¹ expressed as a decision list π is a sequence of L + 1 rules [r_1, r_2, ···, r_{L+1}]. The last one, r_{L+1}, is a default rule which applies to all those subjects who do not satisfy any of the previous L rules. Each rule r_j (except the default rule) is a tuple of the form (c_j, a_j) where a_j ∈ A, and c_j represents a pattern, which is a conjunction of one or more predicates. Each predicate takes the form (f, o, v) where f ∈ F, o ∈ {=, ≠, ≤, ≥, <, >}, and v ∈ V_f denotes some value that can be assumed by the characteristic f.
For instance, "Age ≥ 40 ∧ Gender = Female" is an example of such a pattern. A subject i is said to satisfy rule j if his/her characteristics x_i satisfy all the predicates in c_j. Let us formally denote this using an indicator function, satisfy(x_i, c_j), which returns 1 if x_i satisfies c_j and 0 otherwise.

The rules in π partition the dataset D into L + 1 groups: {R_1, R_2, ··· R_L, R_default}. A group R_j, where j ∈ {1, 2, ··· L}, is comprised of those subjects that satisfy c_j but do not satisfy any of c_1, c_2, ··· c_{j−1}. This can be formally written as:

$$\mathcal{R}_j = \left\{ x \in V_1 \times \cdots \times V_p \;\middle|\; \text{satisfy}(x, c_j) \wedge \bigwedge_{t=1}^{j-1} \neg\,\text{satisfy}(x, c_t) \right\}. \tag{1}$$

The treatment assigned to each subject by π is determined by the group that he/she belongs to. For instance, if subject i with characteristics x_i belongs to group R_j induced by π, i.e., x_i ∈ R_j, then subject i will be assigned the corresponding treatment a_j under the regime π. More formally,

$$\pi(x_i) = \sum_{l=1}^{L} a_l \, \mathbb{1}(x_i \in \mathcal{R}_l) + a_{\text{default}} \, \mathbb{1}(x_i \in \mathcal{R}_{\text{default}}) \tag{2}$$

where 𝟙 denotes an indicator function that returns 1 if the condition within the brackets evaluates to true and 0 otherwise. Thus, π returns the treatment assigned to x_i. Similarly, the cost incurred when we assign a treatment to subject i (treatment cost) according to the regime π is given by:

$$\phi(x_i) = d'(\pi(x_i)) \tag{3}$$

where the function d′, defined in Section 3.1, takes as input a treatment a ∈ A and returns its cost.

We can also define the cost incurred in assessing the condition of a subject i (assessment cost) as per the regime π. Note that a subject i belongs to the group R_j if and only if the subject does not satisfy the conditions c_1 ··· c_{j−1} but satisfies the condition c_j (refer to Eqn. 1).

¹ We use the terms decision list and treatment regime interchangeably from here on.
To reach this conclusion, all the characteristics present in the corresponding antecedents c_1 ··· c_j must have been measured for subject i and evaluated against the appropriate predicate conditions. This implies that the assessment cost incurred for this subject i is the sum of the costs of all the characteristics that appear in c_1 ··· c_j. If N_l denotes the set of all the characteristics that appear in c_1 ··· c_l, the assessment cost of subject i as per the regime π can be written as:

$$\psi(x_i) = \sum_{l=1}^{L} \left[ \mathbb{1}(x_i \in \mathcal{R}_l) \times \left( \sum_{e \in \mathcal{N}_l} d(e) \right) \right]. \tag{4}$$

3.3 Objective Function

We now formulate the objective function for learning a cost-effective treatment regime. We first formalize the notions of the expected outcome, assessment cost, and treatment cost of a treatment regime π with respect to the dataset D.

Expected Outcome. Recall that the treatment regime π assigns a subject i with characteristics x_i to a treatment π(x_i) (Equation 2). The quality of the regime π is partly determined by the expected outcome when all the subjects in D are assigned treatments according to π. The higher the value of this expected outcome, the better the quality of the regime π. There is, however, one caveat to computing the value of this expected outcome – we only observe the outcome y_i resulting from assigning x_i to a_i in the data D, and not any of the counterfactuals. If the regime π, on the other hand, assigns a different treatment a′ ≠ a_i to x_i, we cannot evaluate the policy on x_i directly.

The solutions proposed to compute expected outcomes in settings such as ours can be categorized as: adjustment by regression modeling, adjustment by inverse propensity score weighting, and doubly robust estimation. A detailed treatment of each of these approaches is presented in Lunceford et al. [22].
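The per-subject quantities defined above – the assigned treatment π(x_i) (Eqn. 2), the treatment cost φ(x_i) (Eqn. 3), and the assessment cost ψ(x_i) (Eqn. 4) – can be sketched in a few lines. The encoding of rules as (predicate-list, treatment) pairs and all names below are our own illustrative assumptions; the sketch also charges subjects who fall through to the default rule for the antecedents they were evaluated against.

```python
# Sketch: applying a decision list and computing per-subject costs.
# `rules` is a list of (predicates, treatment); each predicate is (f, op, v).
# `d_char` and `d_treat` play the roles of d and d' from Section 3.1.
def apply_regime(x, rules, default, d_char, d_treat):
    seen = set()                                  # characteristics measured so far
    for preds, treatment in rules:
        seen |= {f for (f, op, v) in preds}       # antecedent's characteristics
        if all(op(x[f], v) for (f, op, v) in preds):
            assess = sum(d_char(f) for f in seen)           # psi(x)
            return treatment, assess, d_treat(treatment)    # pi(x), phi(x)
    # Default rule: every antecedent was evaluated on the way down.
    return default, sum(d_char(f) for f in seen), d_treat(default)
```

A rule such as "Age ≥ 40 ∧ Gender = Female → T1" would be encoded with comparison callables, e.g. `([("age", operator.ge, 40), ("gender", operator.eq, "F")], "T1")`.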
The success of regression-based modeling and inverse propensity weighting depends heavily on the postulated regression model and the postulated propensity score model respectively. In either case, if the postulated model is not identical to the true model, we have biased estimates of the expected outcome. Doubly robust estimation, on the other hand, combines the above approaches in such a way that the estimated value of the expected outcome is unbiased as long as at least one of the postulated models is identical to the true model. The doubly robust estimator for the expected outcome of the regime π, denoted by g_1(π), can be written as:

$$g_1(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} o(i, a), \quad \text{where} \tag{5}$$

$$o(i, a) = \left[ \frac{\mathbb{1}(a_i = a)}{\hat{\omega}(x_i, a)} \left( y_i - \hat{y}(x_i, a) \right) + \hat{y}(x_i, a) \right] \mathbb{1}(\pi(x_i) = a).$$

Here, ω̂(x_i, a) denotes the probability that subject i with characteristics x_i is assigned to treatment a in the data D; ω̂ represents the propensity score model. In practice, we fit a multinomial logistic regression model on D to learn this function. Our framework does not impose any constraints on the functional form of ω̂. Similarly, ŷ(x_i, a) denotes the predicted outcome obtained as a result of assigning a subject characterized by x_i to treatment a. ŷ corresponds to the outcome regression model, and is learned in our experiments by fitting a linear regression model on D prior to optimizing for the treatment regimes. ŷ and ω̂ could be modeled using any other method; this is an entirely separate step from the algorithm discussed here.

Expected Assessment Cost. Recall that there are assessment costs associated with each subject. These costs are governed by the characteristics that will be used in assessing the subject's condition and recommending a treatment. The assessment cost of a subject i treated using regime π is given in Eqn. 4.
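The doubly robust estimate g_1(π) of Eqn. 5 can be sketched as follows; the fitted propensity model ω̂ and outcome model ŷ are assumed to be supplied as callables, and all names are illustrative.

```python
# Sketch of the doubly robust estimator g1(pi) from Eqn. 5. `omega_hat`
# (propensity model) and `y_hat` (outcome regression) are assumed fitted
# elsewhere and passed in as callables.
def g1(data, pi, omega_hat, y_hat, treatments):
    total = 0.0
    for x, a_obs, y in data:                 # data rows: (x_i, a_i, y_i)
        for a in treatments:
            if pi(x) != a:                   # indicator 1(pi(x_i) = a)
                continue
            # augmentation term: 1(a_i = a) / omega_hat * (y_i - y_hat)
            aug = (1.0 if a_obs == a else 0.0) / omega_hat(x, a) * (y - y_hat(x, a))
            total += aug + y_hat(x, a)
    return total / len(data)
```

When the regime agrees with the observed treatment and ω̂ is accurate, the augmentation term corrects the model-based prediction ŷ toward the observed outcome.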
The expected assessment cost across the entire population can be computed as:

$$g_2(\pi) = \frac{1}{N} \sum_{i=1}^{N} \psi(x_i). \tag{6}$$

It is important to ensure that our learning process favors regimes with smaller values of the expected assessment cost. Keeping this cost low also ensures that the full decision list is sparse, which assists with interpretability.

Expected Treatment Cost. There is a cost associated with assigning a treatment to any given subject. The treatment cost for a subject i who is assigned a treatment using regime π is given in Eqn. 3. The expected treatment cost across the entire population can be computed as:

$$g_3(\pi) = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i). \tag{7}$$

The smaller the expected treatment cost of a regime, the more desirable it is in practice. We present the complete objective function below.

Complete Objective. We assume access to the following inputs: 1) the observational data D; 2) a set FP of frequently occurring patterns in D – recall that each pattern corresponds to a conjunction of one or more predicates (e.g., "Age ≥ 40 ∧ Gender = Female"); in practice, such patterns can be obtained by running a frequent pattern mining algorithm such as Apriori [2] on the set D; and 3) the set of all possible treatments A.

We define the set of all possible (pattern, treatment) tuples as L = {(c, a) | c ∈ FP, a ∈ A} and C(L) as the set of all possible combinations of elements of L. An element of L can be thought of as a rule in a decision list, and an element of C(L) can be thought of as a list of rules in a decision list (without the default rule). We then search over all elements in the set C(L) × A to find a regime which maximizes the expected outcome (Eqn. 5) while minimizing the expected assessment cost (Eqn. 6) and treatment cost (Eqn. 7), all of which are computed over D. Our objective function can be formally written as:

$$\arg\max_{\pi \in C(\mathcal{L}) \times \mathcal{A}} \; \lambda_1 g_1(\pi) - \lambda_2 g_2(\pi) - \lambda_3 g_3(\pi) \tag{8}$$

where g_1, g_2, g_3 are defined in Eqns. 5, 6, and 7 respectively, and λ_1, λ_2, and λ_3 are non-negative weights that scale the relative influence of the terms in the objective.

Theorem 1. Optimizing the objective function in Eqn. 8 is NP-hard. (Please see the appendix for details.)

Note that NP-hardness is a worst-case categorization only; with an efficient search procedure, it is practical to obtain a good approximation on most reasonably-sized datasets.

3.4 Optimizing the Objective

We optimize our objective by modeling it as a Markov Decision Process (MDP) and then employing the Upper Confidence Bound on Trees (UCT) algorithm to find a treatment regime which maximizes Eqn. 8. We also propose and leverage customized checks for guiding the exploration of the UCT algorithm and pruning the search space effectively.

Markov Decision Process Formulation. Our goal is to find a sequence of rules which maximizes the objective function in Eqn. 8. To this end, we formulate a fully observable MDP such that the optimal policy of the posited formulation provides a solution to our objective function. A fully observable MDP is characterized by a tuple (S, A, T, R) where S denotes the set of all possible states, A denotes the set of all possible actions, and T and R represent the transition and reward functions respectively. Below, we define each of these in the context of our problem. Figure 2 shows a snapshot of the state space and transitions for a small dataset.

State Space. Conceptually, each state in our state space captures the effect of some partially or fully constructed decision list. To illustrate, consider a partial decision list with just one rule: "if Age ≥ 40 ∧ Gender = Female, then T1". This partial list induces that: (i) all those subjects that satisfy the condition of the rule are assigned treatment T1, and (ii) the age and gender characteristics will be required in determining treatments for all the subjects in the population.
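The effect of a partial list on the population can thus be tracked with two per-subject records: which characteristics are required so far, and which treatment (if any) has been assigned. A minimal sketch of this bookkeeping (all names are ours):

```python
# Minimal sketch of the MDP state bookkeeping (names are illustrative):
# tau[i][j] = 1 if characteristic j is needed to determine subject i's
# treatment; sigma[i] is subject i's assigned treatment, 0 if none yet.
class State:
    def __init__(self, n_subjects, n_chars):
        self.tau = [[0] * n_chars for _ in range(n_subjects)]  # start: all zeros
        self.sigma = [0] * n_subjects                          # no treatments yet

    def is_terminal(self):
        # Terminal once every subject has a treatment assigned.
        return all(t != 0 for t in self.sigma)
```

The start state corresponds to the empty decision list; applying a rule flips the relevant entries of `tau` to 1 and fills in `sigma` for the newly covered subjects.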
[Figure 2: Sample observational data and the corresponding Markov Decision Process representation; T stands for Treatment.]

To capture such information, we represent a state s̃ ∈ S by a list of tuples [(τ_1(s̃), σ_1(s̃)), ··· (τ_N(s̃), σ_N(s̃))] where each tuple corresponds to a subject in D. τ_i(s̃) is a binary vector of length p defined such that τ_i^(j)(s̃) = 1 if characteristic j will be required for determining subject i's treatment, and 0 otherwise. Further, σ_i(s̃) captures the treatment assigned to subject i; if no treatment has been assigned to i, then σ_i(s̃) = 0.

Note that we have a single start state s̃_0 which corresponds to an empty decision list: τ_i(s̃_0) is a vector of 0s and σ_i(s̃_0) = 0 for all i in D, indicating that no treatments have been assigned to any subject and no characteristics have been deemed requirements for assigning treatments. Furthermore, a state s̃ is regarded as a terminal state if σ_i(s̃) is non-zero for all i, indicating that treatments have been assigned to all the subjects.

Actions. Each action takes one of the following forms: 1) a rule r ∈ L, which is a tuple of the form (pattern, treatment), e.g., (Age ≥ 40 ∧ Gender = Female, T1); this specifies that subjects who obey the conditions in the pattern are prescribed the treatment, and such an action leads to a non-terminal state. 2) a treatment a ∈ A, which corresponds to the default rule; this action leads to a terminal state.

Transition and Reward Functions. We have a deterministic transition function which ensures that taking an action ã = (c̃, t̃) from state s̃ always leads to the same state s̃′.
Let U denote the set of all those subjects i to which treatments have already been assigned in state s̃, i.e., σ_i(s̃) ≠ 0, and let U^c denote the set of all those subjects who have not been assigned a treatment in state s̃. Let U′ denote the set of all those subjects i which do not belong to the set U and which satisfy the condition c̃ of action ã. Let Q denote the set of all those characteristics in F which are present in the condition c̃ of action ã. If action ã corresponds to a default rule, then Q = ∅ and U′ = U^c. With this notation in place, the new state s̃′ can be characterized as follows: 1) τ_i^(j)(s̃′) = τ_i^(j)(s̃) and σ_i(s̃′) = σ_i(s̃) for all i ∈ U, j ∈ F; 2) τ_i^(j)(s̃′) = 1 for all i ∈ U^c, j ∈ Q; 3) σ_i(s̃′) = t̃ for all i ∈ U′.

Similarly, the immediate reward obtained when we reach s̃′ by taking ã = (c̃, t̃) from the state s̃ can be written as:

$$\frac{\lambda_1}{N} \sum_{i \in U'} o(i, \tilde{t}) \;-\; \frac{\lambda_2}{N} \sum_{i \in U^c,\, j \in Q} d(j) \;-\; \frac{\lambda_3}{N} \sum_{i \in U'} d'(\tilde{t})$$

where o is defined in Eqn. 5, and d and d′ are the cost functions for characteristics and treatments respectively (see Section 3.1).

UCT with Customized Pruning. The basic idea behind the Upper Confidence Bound on Trees (UCT) [13] algorithm is to iteratively construct a search tree for some predetermined number of iterations. At the end of this procedure, the best performing policy, or sequence of actions, is returned as the output. Each node in the search tree corresponds to a state in the MDP state space, and the links in the tree correspond to the actions. UCT employs the UCB-1 metric [6] for navigating through the search space.

We employ a UCT-based algorithm for finding the optimal policy of our MDP formulation, though we leverage customized checks to further guide the exploration process and prune the search space. Recall that each non-terminal state in our state space corresponds to a partial decision list.
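The immediate reward above amounts to three averaged sums; a minimal sketch, assuming the sets U′ (newly covered subjects), U^c (previously untreated subjects) and Q (characteristics in the pattern) have been computed, with all names illustrative:

```python
# Sketch of the immediate reward for taking action (c, t) from a state
# (Section 3.4). U_new plays the role of U', U_untreated of U^c, and Q is
# the set of characteristics appearing in the pattern c. o, d_char (d) and
# d_treat (d') are as defined earlier.
def immediate_reward(U_new, U_untreated, Q, t, o, d_char, d_treat, N, lam1, lam2, lam3):
    outcome_gain = (lam1 / N) * sum(o(i, t) for i in U_new)
    assess_cost = (lam2 / N) * sum(d_char(j) for i in U_untreated for j in Q)
    treat_cost = (lam3 / N) * sum(d_treat(t) for i in U_new)
    return outcome_gain - assess_cost - treat_cost
```

Note that the assessment term charges every still-untreated subject for every characteristic in the pattern, since evaluating the rule requires measuring those characteristics for all of them.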
We exploit the fact that we can upper-bound the value of the objective for any given partial decision list. The upper bound on the objective for any given non-terminal state s̃ can be computed by approximating the reward as follows: 1) all the subjects who have not been assigned treatments will get the best possible treatments without incurring any treatment cost; 2) no additional assessments are required for any subject (and hence no additional assessment costs are levied) in the population. The upper bound on the incremental reward is thus:

$$\text{upper\_bound}(U^c) = \lambda_1 \frac{1}{N} \sum_{i \in U^c} \max_t \; o(i, t).$$

During the execution of the UCT procedure, whenever there is a choice to be made about which action should be taken, we employ checks based on the upper bound of the objective value of the resulting state. Consider a scenario in which the UCT procedure is currently in state s̃ and needs to choose an action. For each possible action ã (that does not correspond to a default rule²) from state s̃, we determine the upper bound on the objective value of the resulting state s̃′. If this value is less than either the highest value encountered previously for a complete rule list, or the objective value corresponding to the best default action from the state s̃, then we block the action ã from the state s̃. This state is provably suboptimal.

² We can compute exact values of the objective function if the action is a default rule, because the corresponding decision list is fully constructed.

4 Experimental Evaluation

Here, we discuss the detailed experimental evaluation of our framework. First, we analyze the outcomes obtained and the costs incurred when recommending treatments using our approach. Then, we present an ablation study which explores the contributions of each of the terms in our objective, followed by an analysis on real data.
Dataset Descriptions. Our first dataset consists of information pertaining to the bail decisions of about 86K defendants (see Table 1). It captures various defendant characteristics such as demographic attributes, past criminal history, and personal and health related information for each of the 86K defendants. Further, the decisions made by judges in each of these cases (release without/with conditions) and the corresponding outcomes (e.g., whether a defendant committed another crime when out on bail) are also available.

We assigned costs to characteristics and treatments based on discussions with subject matter experts. The characteristics that were harder to obtain were assigned higher costs compared to the ones that were readily available. Similarly, the treatment that placed a higher burden on the defendant (release on conditions) was assigned a higher cost. When assigning scores to outcomes, undesirable scenarios (e.g., a violent crime when released on bail) received lower scores.

Our second dataset (see Table 1) captures details of about 60K asthma patients [16]. For each of these 60K patients, various attributes such as demographics, symptoms, past health history, and test results have been recorded. Each patient in the dataset was prescribed either quick-relief medications or long-term controller drugs. Further, the outcomes, in the form of the time to the next asthma attack (after the treatment began), were recorded. The longer this interval, the better the outcome, and the higher the outcome score. We assigned costs to characteristics and treatments based on the inconvenience (physical/mental/monetary) they caused to patients.

Baselines. We compared our framework to the following state-of-the-art treatment recommendation approaches: 1) Outcome Weighted Learning (OWL) [39]; 2) Modified Covariate Approach (MCA) [32]; 3) Interpretable and Parsimonious Treatment Regime Learning (IPTL) [36].
While none of these approaches explicitly account for treatment costs or the costs required for gathering the subject characteristics, MCA and IPTL minimize the number of characteristics/covariates required for deciding the treatment of any given subject. OWL, on the other hand, utilizes all the characteristics available in the data when assigning treatments.

Table 1: Summary of datasets.

Bail Dataset:
- # of data points: 86152
- Characteristics & costs: age, gender, previous offenses, prior arrests, current charge, SSN (cost = 1); marital status, kids, owns house, pays rent, addresses in past years (cost = 2); mental illness, drug tests (cost = 6)
- Treatments & costs: release on personal recognizance (cost = 20); release on conditions/bond (cost = 40)
- Outcomes & scores: no risk (score = 100); failure to appear (score = 66); non-violent crime (score = 33); violent crime (score = 0)

Asthma Dataset:
- # of data points: 60048
- Characteristics & costs: age, gender, BMI, BP, short breath, temperature, cough, chest pain, wheezing, past allergies, asthma history, family history, has insurance (cost = 1); peak flow test (cost = 2); spirometry test (cost = 4); methacholine test (cost = 6)
- Treatments & costs: quick relief (cost = 10); controller drugs (cost = 15)
- Outcomes & scores: no asthma attack for ≥ 4 months (score = 100); no asthma attack for 2 months (score = 66); no asthma attack for 1 month (score = 33); asthma attack in less than 2 weeks (score = 0)

Experimental Setting. The objective function that we proposed in Eqn. 8 has three parameters: λ_1, λ_2, and λ_3. These parameters could either be specified by an end user or learned using a validation set. We set aside 5% of each of our datasets as a validation set to estimate these parameters.
We automatically searched the parameter space for a setting that produced a decision list with the maximum average outcome on the validation set (discussed in detail later) while satisfying some simple constraints: 1) average assessment cost ≤ 4 on both datasets; 2) average treatment cost ≤ 30 for the bail data and ≤ 12 for the asthma data. We used a coordinate ascent strategy to search the parameter space, updating each parameter λj while holding the other two parameters constant. The value of each parameter was chosen via a binary search on the interval (0, 1000). We ran the UCT procedure for our approach for 50K iterations. We used both Gaussian and linear kernels for OWL and employed the tuning strategy discussed in Zhao et al. [39]. For IPTL, we set the parameter that limits the number of rules in the treatment regime to 20. We evaluated the performance of our model and the baselines using 10-fold cross-validation.

4.1 Quantitative Evaluation

We analyzed the performance of our approach, CITR (Cost-effective, Interpretable Treatment Regimes), along several dimensions: outcomes obtained, costs incurred, and intelligibility. We computed the following metrics:

Avg. Outcome Recall that a treatment regime assigns a treatment to every subject in the population. We used the prediction model ŷ (defined in Section 3.3) to obtain an outcome score given the characteristics of the subject and the assigned treatment (we used ground-truth outcome scores whenever they were available in the data). We then computed the average outcome score over all subjects in the population.

Avg. Assess Cost We determined the assessment cost incurred by each subject based on which characteristics were used to determine their treatment. We then averaged these per-subject assessment costs to obtain the average assessment cost.
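The coordinate-ascent parameter search described above can be sketched as follows. The evaluation function, the bisection probing scheme, and the omission of the cost constraints are all illustrative assumptions; this is not the authors' actual tuning code.

```python
def coordinate_ascent(score_fn, n_params=3, lo=0.0, hi=1000.0, sweeps=5, steps=20):
    """Tune the lambda parameters one coordinate at a time.

    `score_fn` is a hypothetical stand-in for evaluating the decision list
    induced by a parameter setting on the validation set (average outcome).
    Each coordinate is searched by repeated bisection of (lo, hi), holding
    the other coordinates fixed, mirroring the binary search in the paper.
    """
    params = [(lo + hi) / 2.0] * n_params
    for _ in range(sweeps):
        for j in range(n_params):
            a, b = lo, hi
            for _ in range(steps):
                mid = (a + b) / 2.0
                # probe the midpoints of the two halves of the interval
                left = params[:j] + [(a + mid) / 2.0] + params[j + 1:]
                right = params[:j] + [(mid + b) / 2.0] + params[j + 1:]
                if score_fn(left) >= score_fn(right):
                    b = mid  # keep the left half
                else:
                    a = mid  # keep the right half
            params[j] = (a + b) / 2.0
    return params

# toy separable objective with optimum at (100, 300, 700), purely for illustration
target = (100.0, 300.0, 700.0)
score = lambda p: -sum((x - t) ** 2 for x, t in zip(p, target))
best = coordinate_ascent(score)
```

In practice the score function would also reject parameter settings whose regimes violate the assessment- and treatment-cost constraints above.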
Avg. # of Characs We determined the number of characteristics used when assigning a treatment to each subject in the population, and computed the average of these numbers.

Avg. Treat Cost We computed the average of the treatment costs incurred by all subjects in the population.

List Len Our approach, CITR, and the baseline IPTL express treatment regimes as decision lists. To compare the complexity of the resulting decision lists, we computed the number of rules in each list.

While higher values of average outcome are preferred, lower values are desirable on all of the other metrics.

Results Table 2 (top panel) presents the values of these metrics for our approach and the baselines. The treatment regimes produced by our approach result in better average outcomes with lower costs on both datasets. While IPTL and MCA do not explicitly reduce costs, they do minimize the number of characteristics required to determine the treatment of any given subject. Our approach produces regimes with the least cost for a given average number of characteristics required to determine treatment (Avg. # of Characs). It is also interesting that our approach produces more concise lists with fewer rules than the baselines. While the treatment costs of all the baselines are similar, there is some variation in the average assessment costs and the outcomes. IPTL turns out to be the best-performing baseline in terms of average outcome, average assessment cost, and average number of characteristics. The last line of Table 2 shows the average outcomes and average treatment costs computed empirically on the observational data. Both of our datasets comprise decisions made by human experts, and it is interesting that the regimes learned by algorithmic approaches outperform the human experts on both datasets.

Bail Dataset:
                    Avg.      Avg.          Avg.         Avg. # of    List
                    Outcome   Assess Cost   Treat Cost   Characs.     Len
  CITR              79.2      8.88          31.09        6.38         7
  IPTL              77.6      14.53         35.23        8.57         9
  MCA               73.4      19.03         35.48        12.03        -
  OWL (Gaussian)    72.9      28            35.18        13           -
  OWL (Linear)      71.3      28            34.23        13           -
  CITR - No Treat   80.5      8.93          34.48        7.57         7
  CITR - No Assess  81.3      13.83         32.02        9.86         10
  CITR - Outcome    81.7      13.98         34.49        10.38        10
  Human             69.37     -             33.39        -            -

Asthma Dataset:
                    Avg.      Avg.          Avg.         Avg. # of    List
                    Outcome   Assess Cost   Treat Cost   Characs.     Len
  CITR              74.38     13.87         11.81        7.23         6
  IPTL              71.88     18.58         11.83        7.87         8
  MCA               70.32     19.53         12.01        10.23        -
  OWL (Gaussian)    71.02     25            12.38        16           -
  OWL (Linear)      71.02     25            12.38        16           -
  CITR - No Treat   77.39     14.02         12.87        7.38         7
  CITR - No Assess  78.32     18.28         12.02        8.97         9
  CITR - Outcome    79.37     18.28         12.88        9.21         9
  Human             68.32     -             12.28        -            -

Table 2: Results for treatment regimes. Our approach: CITR; baselines: IPTL, MCA, OWL; ablations of our approach: CITR - No Treat, CITR - No Assess, CITR - Outcome; Human refers to the setting where judges and doctors assigned treatments.

4.1.1 Ablation Study

We also analyzed the effect of the various terms of our objective function on the outcomes and the costs incurred. To this end, we experimented with three ablations of our approach: 1) CITR - No Treat, obtained by excluding the expected treatment cost term (g3(π) in Eqn. 9); 2) CITR - No Assess, obtained by excluding the expected assessment cost term (g2(π) in Eqn. 9); 3) CITR - Outcome, obtained by excluding both the assessment and treatment cost terms.

Table 2 (second panel) shows the values of the metrics discussed above for all the ablations of our model. Naturally, removing the treatment cost term increases the average treatment cost on both datasets, and removing the assessment cost term results in regimes with much higher assessment costs (8.88 vs. 13.83 on bail data; 13.87 vs. 18.28 on asthma data).
The length of the list also increases on both datasets when we exclude the assessment cost term. These results demonstrate that each term in our objective function is crucial to producing a cost-effective, interpretable regime.

4.2 Qualitative Analysis

The treatment regimes produced by our approach on the asthma and bail datasets are shown in Figures 1 and 3, respectively.

It can be seen in Figure 1 that the methacholine test, which is more expensive, appears at the end of the regime. This ensures that only a small fraction of the population (8.23%) is burdened by its cost. Furthermore, although the spirometry test is slightly more expensive than patient demographics and symptoms, it would be harder to determine the treatment for a patient without this test. This aligns with research on asthma treatment recommendations [27, 5]. It is also interesting to note that the regime not only accounts for the spirometry and peak flow test results but also checks whether the patient has a previous history of asthma or respiratory issues. If the test results are positive and the patient has no previous history of asthma or respiratory disorders, the patient is recommended quick-relief drugs. On the other hand, if the test results are positive and the patient has suffered previous asthma or respiratory issues, controller drugs are recommended.

If Gender = F and Current-Charge = Minor and Prev-Offense = None then RP
Else if Prev-Offense = Yes and Prior-Arrest = Yes then RC
Else if Current-Charge = Misdemeanor and Age ≤ 30 then RC
Else if Age ≥ 50 and Prior-Arrest = No then RP
Else if Marital-Status = Single and Pays-Rent = No and Current-Charge = Misd. then RC
Else if Addresses-Past-Yr ≥ 5 then RC
Else RP

Figure 3: Treatment regime for the bail data; RP refers to the milder form of treatment, release on personal recognizance, and RC is release on conditions, which is comparatively harsher.
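The decision list in Figure 3 can be applied programmatically. The sketch below also records which characteristics are examined before a treatment is reached, which is how the per-subject assessment costs behind the Avg. Assess Cost metric can be tallied. Field names and value encodings are hypothetical, not the authors' actual schema.

```python
def bail_regime(d):
    """Apply the decision list of Figure 3 to one defendant.

    `d` maps characteristic names to values. Returns the treatment
    ("RP" or "RC") together with the set of characteristics that were
    actually examined; short-circuit evaluation means later conditions
    in a rule are not assessed (and not paid for) if an earlier one fails.
    """
    used = set()
    def get(k):
        used.add(k)
        return d[k]
    if get("gender") == "F" and get("current_charge") == "Minor" and get("prev_offense") == "None":
        return "RP", used
    if get("prev_offense") == "Yes" and get("prior_arrest") == "Yes":
        return "RC", used
    if get("current_charge") == "Misdemeanor" and get("age") <= 30:
        return "RC", used
    if get("age") >= 50 and get("prior_arrest") == "No":
        return "RP", used
    if get("marital_status") == "Single" and get("pays_rent") == "No" and get("current_charge") == "Misdemeanor":
        return "RC", used
    if get("addresses_past_yr") >= 5:
        return "RC", used
    return "RP", used

defendant = {"gender": "M", "prev_offense": "Yes", "prior_arrest": "Yes",
             "current_charge": "Felony", "age": 40, "marital_status": "Married",
             "pays_rent": "Yes", "addresses_past_yr": 1}
treatment, examined = bail_regime(defendant)
```

Note that for this defendant the second rule fires, so only three inexpensive characteristics are examined and the costly mental illness and drug tests are never incurred.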
In the case of the bail dataset, the constructed regime achieves good outcomes without even using the most expensive characteristics, such as mental illness tests and drug tests. Personal information characteristics, which are slightly more expensive than defendant demographics and prior criminal history, appear only toward the end of the list, and these checks apply to only 21.23% of the population. It is interesting that the regime uses the defendant's criminal history as well as personal and demographic information to make recommendations. For instance, females with minor current charges (such as driving offenses) and no prior criminal record are typically released on bail without conditions such as bonds or checking in with the police. On the other hand, defendants who have committed crimes earlier are granted only conditional bail.

5 Conclusions

In this work, we proposed a framework for learning cost-effective, interpretable treatment regimes from observational data. To the best of our knowledge, this is the first solution to the problem that addresses all of the following aspects: 1) maximizing outcomes; 2) minimizing treatment costs and the costs of gathering the information required to determine the treatment; 3) expressing regimes using an interpretable model. We modeled the problem of learning a treatment regime as an MDP and employed a variant of UCT which prunes the search space using customized checks. We demonstrated the effectiveness of our framework on real-world data from the judicial and health care domains.

References

[1] Eva-Maria Abulesz and Gerasimos Lyberatos. Novel approach for determining optimal treatment regimen for cancer chemotherapy. International Journal of Systems Science, 19(8):1483–1497, 1988.

[2] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules.

[3] James O Berger, Xiaojing Wang, and Lei Shen.
A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics, 24(1):110–129, 2014.

[4] Jacob Bien and Robert Tibshirani. Classification by set cover: The prototype vector machine. arXiv preprint arXiv:0908.2284, 2009.

[5] Louis-Philippe Boulet, Marie-Ève Boulay, Guylaine Gauthier, Livia Battisti, Valérie Chabot, Marie-France Beauchesne, Denis Villeneuve, and Patricia Côté. Benefits of an asthma education program provided at primary care sites on asthma outcomes. Respiratory Medicine, 109(8):991–1000, 2015.

[6] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

[7] Johannes AN Dorresteijn, Frank LJ Visseren, Paul M Ridker, Annemarie MJ Wassink, Nina P Paynter, Ewout W Steyerberg, Yolanda van der Graaf, and Nancy R Cook. Estimating treatment effects for individual patients based on the results of randomised clinical trials. BMJ, 343:d5888, 2011.

[8] Ralph B D'Agostino. Estimating treatment effects using observational data. JAMA, 297(3):314–316, 2007.

[9] Ailin Fan, Wenbin Lu, Rui Song, et al. Sequential advantage selection for optimal treatment regime. The Annals of Applied Statistics, 10(1):32–53, 2016.

[10] Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011.

[11] Kosuke Imai, Marc Ratkovic, et al. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.

[12] Been Kim, Cynthia Rudin, and Julie A Shah. The Bayesian case model: A generative approach for case-based reasoning and prototype classification.
In Advances in Neural Information Processing Systems, pages 1952–1960, 2014.

[13] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

[14] EB Laber and YQ Zhao. Tree-based methods for individualized treatment regimes. Biometrika, 102(3):501–514, 2015.

[15] Eric B Laber, Daniel J Lizotte, Min Qian, William E Pelham, and Susan A Murphy. Dynamic treatment regimes: Technical challenges and applications. Electronic Journal of Statistics, 8(1):1225, 2014.

[16] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. 2016.

[17] Himabindu Lakkaraju and Jure Leskovec. Confusions over time: An interpretable Bayesian model to characterize trends in decision making. In Advances in Neural Information Processing Systems (NIPS), 2016.

[18] Himabindu Lakkaraju, Jure Leskovec, Jon Kleinberg, and Sendhil Mullainathan. A Bayesian framework for modeling human evaluations. In SIAM SDM, 2015.

[19] Benjamin Letham, Cynthia Rudin, Tyler H McCormick, David Madigan, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.

[20] Wei-Yin Loh, Xu He, and Michael Man. A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34(11):1818–1833, 2015.

[21] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–158. ACM, 2012.

[22] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.
[23] Douglas B Marlowe, David S Festinger, Karen L Dugosh, Kathleen M Benasutti, Gloria Fox, and Jason R Croft. Adaptive programming improves outcomes in drug court: An experimental trial. Criminal Justice and Behavior, 39(4):514–532, 2012.

[24] James J McGough and Stephen V Faraone. Estimating the size of treatment effects: moving beyond p values. Psychiatry (1550-5952), 6(10), 2009.

[25] Erica EM Moodie, Bibhas Chakraborty, and Michael S Kramer. Q-learning for estimating optimal dynamic treatment rules from observational data. Canadian Journal of Statistics, 40(4):629–645, 2012.

[26] Erica EM Moodie, Nema Dean, and Yue Ru Sun. Q-learning: Flexible learning about useful utilities. Statistics in Biosciences, 6(2):223–243, 2014.

[27] Jorge Pereira, Priscilla Porto-Figueira, Carina Cavaco, Khushman Taunk, Srikanth Rapole, Rahul Dhakne, Hampapathalu Nagarajaram, and José S Câmara. Breath analysis as a potential and non-invasive frontier in disease diagnosis: an overview. Metabolites, 5(1):3–55, 2015.

[28] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[29] James M Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.

[30] Richard N Shiffman. Representation of clinical practice guidelines in conventional and augmented decision tables. Journal of the American Medical Informatics Association, 4(5):382–393, 1997.

[31] Xiaogang Su, Chih-Ling Tsai, Hansheng Wang, David M Nickerson, and Bogong Li. Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb):141–158, 2009.

[32] Lu Tian, Ash A Alizadeh, Andrew J Gentles, and Robert Tibshirani. A simple method for estimating interactions between a treatment and a large number of covariates.
Journal of the American Statistical Association, 109(508):1517–1532, 2014.

[33] Stijn Vansteelandt, Marshall Joffe, et al. Structural nested models and g-estimation: The partially realized promise. Statistical Science, 29(4):707–731, 2014.

[34] Michael P Wallace and Erica EM Moodie. Personalizing medicine: a review of adaptive treatment strategies. Pharmacoepidemiology and Drug Safety, 23(6):580–585, 2014.

[35] Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018, 2012.

[36] Yichi Zhang, Eric B Laber, Anastasios Tsiatis, and Marie Davidian. Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71(4):895–904, 2015.

[37] Yichi Zhang, Eric B Laber, Anastasios Tsiatis, and Marie Davidian. Interpretable dynamic treatment regimes. arXiv preprint arXiv:1606.01472, 2016.

[38] Ying-Qi Zhao, Donglin Zeng, Eric B Laber, Rui Song, Ming Yuan, and Michael Rene Kosorok. Doubly robust learning for estimating individualized treatment with censored data. Biometrika, 102(1):151–168, 2015.

[39] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

[40] Yufan Zhao, Michael R Kosorok, and Donglin Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26):3294–3315, 2009.

6 Appendix

6.1 Proof of Theorem 1

Statement: The objective defined in Eqn. 8 is NP-hard.

Proof: The rough idea behind this proof is to establish a connection between the objective in Eqn. (8) and the weighted exact-cover problem.
Our objective function is given by:

\arg\max_{\pi \in \mathcal{C}(\mathcal{L}) \times \mathcal{A}} \; \lambda_1 g_1(\pi) - \lambda_2 g_2(\pi) - \lambda_3 g_3(\pi) \quad (9)

The goal is to find a sequence of (c, a) pairs, where c ∈ FP ∪ {∅} and a ∈ A, which not only covers all the data points in the dataset but also maximizes the objective above. Note that c = ∅ denotes a default rule. FP represents a set of frequently occurring patterns, each of which is a conjunction of one or more predicates. Examples: (1) Age ≥ 40 ∧ Gender = Female; (2) BMI = High; (3) Gender = M ∧ BP = High ∧ Age ≤ 25. Such patterns are provided as input to us.

We have defined the set L as L = FP × A. This implies that an element of L is of the form (Age ≥ 40 ∧ Gender = Female, T1), i.e., each element of L is a rule. Our goal is now to find an ordered list of rules from L (setting aside the default rule for the moment) which maximizes the objective in Eqn. 9.

Let us assume the set L comprises the following candidate rules:
(1) (Age ≥ 40 ∧ Gender = Female, T1)
(2) (Age ≥ 40 ∧ Gender = Female, T2)
(3) (BMI = High, T1)
(4) (BMI = High, T2)

Let us create a new set L' from L as follows: for each rule (c, a) in L, append the negations of the conditions of all possible combinations of the other rules in L. Also include in the new set L' all possible combinations of negations of the conditions of the rules in L.
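The construction of L' just described can be sketched in code. Rule conditions are represented as plain strings here, purely for illustration; the enumeration mirrors the two parts of the construction and reproduces the 14-element example that follows.

```python
from itertools import combinations

def build_Lprime(conditions, treatments):
    """Enumerate the augmented rule set L' used in the reduction.

    For each condition c we add c conjoined with every subset of
    negations of the *other* conditions (paired with each treatment);
    we also add every nonempty subset of negations of all conditions,
    again paired with each treatment.
    """
    def neg(c):
        return "NOT(" + c + ")"
    Lp = []
    # part 1: each original condition, prefixed by negated combinations of the others
    for c in conditions:
        others = [x for x in conditions if x != c]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                clause = [neg(x) for x in subset] + [c]
                for a in treatments:
                    Lp.append((" AND ".join(clause), a))
    # part 2: every nonempty combination of negations of all conditions
    for r in range(1, len(conditions) + 1):
        for subset in combinations(conditions, r):
            clause = [neg(x) for x in subset]
            for a in treatments:
                Lp.append((" AND ".join(clause), a))
    return Lp

rules = build_Lprime(["Age>=40 AND Gender=F", "BMI=High"], ["T1", "T2"])
```

With the two conditions and two treatments of the running example, this yields exactly the 14 rules listed below.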
Following our example above, the new set L' will look like this:
(1) (Age ≥ 40 ∧ Gender = Female, T1)
(2) (Age ≥ 40 ∧ Gender = Female, T2)
(3) (¬(Age ≥ 40 ∧ Gender = Female), T1)
(4) (¬(Age ≥ 40 ∧ Gender = Female), T2)
(5) (¬(BMI = High) ∧ Age ≥ 40 ∧ Gender = Female, T1)
(6) (¬(BMI = High) ∧ Age ≥ 40 ∧ Gender = Female, T2)
(7) (BMI = High, T1)
(8) (BMI = High, T2)
(9) (¬(BMI = High), T1)
(10) (¬(BMI = High), T2)
(11) (¬(Age ≥ 40 ∧ Gender = Female) ∧ BMI = High, T1)
(12) (¬(Age ≥ 40 ∧ Gender = Female) ∧ BMI = High, T2)
(13) (¬(Age ≥ 40 ∧ Gender = Female) ∧ ¬(BMI = High), T1)
(14) (¬(Age ≥ 40 ∧ Gender = Female) ∧ ¬(BMI = High), T2)

The problem of finding an ordered sequence of rules over L (plus a default rule a ∈ A) can now be posed as the problem of finding an unordered set of rules over L'. To illustrate, consider a decision list constructed using L in the above example:
(1) (Age ≥ 40 ∧ Gender = Female, T1)
(2) T2

This list can be expressed as an unordered set using the elements of L' as follows:
(Age ≥ 40 ∧ Gender = Female, T1)
(¬(Age ≥ 40 ∧ Gender = Female), T2)

We have thus reduced the problem of finding an ordered list of rules to that of finding an unordered set of rules over L'. More specifically, the problem is now reduced to choosing a set of rules from L' such that 1) each data point in the dataset is covered exactly once, and 2) the objective function in Eqn. 9 is maximized. This problem can be written formally as:

\min \sum_{j \in \mathcal{L}'} \Psi(j)\, \phi(j)
\text{s.t.} \quad \sum_{j :\, \mathrm{satisfy}(x_i, c_j)} \phi(j) = 1 \quad \forall i : (x_i, a_i, y_i) \in \mathcal{D}
\phi(j) \in \{0, 1\} \quad \forall j : r_j \in \mathcal{L}' \quad (10)

where φ(j) is an indicator which is 1 if the rule r_j is chosen to be in the set cover.
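The program in Eqn. 10 can be made concrete with a tiny brute-force solver. This is only feasible for toy instances and is shown to illustrate the exact-cover structure, not as a practical algorithm; the rule weights are hypothetical.

```python
from itertools import combinations

def weighted_exact_cover(points, rules, psi):
    """Brute-force solver for the weighted exact-cover program (Eqn. 10).

    `rules` maps rule id -> set of covered point ids; `psi` maps rule id
    -> weight Psi(j). Returns the cheapest subset of rules covering every
    point exactly once, or (None, inf) if no exact cover exists.
    """
    ids = list(rules)
    best, best_cost = None, float("inf")
    for r in range(1, len(ids) + 1):
        for subset in combinations(ids, r):
            covered = [p for j in subset for p in rules[j]]
            # exact cover: each point appears exactly once across chosen rules
            if sorted(covered) == sorted(points):
                cost = sum(psi[j] for j in subset)
                if cost < best_cost:
                    best, best_cost = subset, cost
    return best, best_cost

points = [1, 2, 3]
cover_rules = {"a": {1, 2}, "b": {3}, "c": {1, 2, 3}, "d": {2, 3}}
weights = {"a": 5.0, "b": 1.0, "c": 4.0, "d": 2.0}
best, best_cost = weighted_exact_cover(points, cover_rules, weights)
```

Here the candidate covers are {a, b} (cost 6.0) and {c} (cost 4.0), so the solver picks {c}.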
Ψ(j) is the weight associated with choosing the rule r_j = (c_j, a_j), defined as:

\Psi(j) = \sum_{i :\, \mathrm{satisfy}(x_i, c_j)} \left[ -\frac{\lambda_1}{N}\, o(i, a_j) + \frac{\lambda_2}{N} \sum_{e \in c_j} d(e) + \frac{\lambda_3}{N}\, d'(a_j) \right]

Note that we have essentially split our complete objective function across the rules chosen to be part of the final set cover. Further, since we are now dealing with a minimization problem, we flip the signs of the terms of the (maximization) objective. Eqn. 10 is the weighted exact-cover problem. Since this problem is NP-hard, maximizing our objective function is also NP-hard.
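The weight Ψ(j) above translates directly into code. The outcome scores, feature costs, treatment cost, and λ values below are illustrative placeholders, and the sign convention follows the minimization form of Eqn. 10.

```python
def rule_cost(covered_outcomes, feature_costs, treat_cost, lam, N):
    """Weight Psi(j) of a candidate rule in the exact-cover instance.

    covered_outcomes: outcome scores o(i, a_j) for the points the rule covers;
    feature_costs: d(e) for each characteristic e appearing in c_j;
    treat_cost: d'(a_j); lam: (lambda1, lambda2, lambda3); N: dataset size.
    Signs are flipped relative to Eqn. 9 because exact cover minimizes.
    """
    l1, l2, l3 = lam
    assess = sum(feature_costs)  # total assessment cost of the rule's condition
    return sum((-l1 * o + l2 * assess + l3 * treat_cost) / N
               for o in covered_outcomes)

# toy instance: a rule covering two points with outcome scores 100 and 50,
# condition costs 1 + 2, treatment cost 10, all lambdas 1, N = 2
cost = rule_cost([100.0, 50.0], [1.0, 2.0], 10.0, (1.0, 1.0, 1.0), 2)
```

Per covered point the contribution is (−o + 3 + 10)/2, so the two points contribute −43.5 and −18.5, giving Ψ = −62.0.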