On Combining Machine Learning with Decision Making
Mach Learn manuscript No. (will be inserted by the editor)

Theja Tulabandhula · Cynthia Rudin

Received: date / Accepted: date

Abstract We present a new application and covering number bound for the framework of "Machine Learning with Operational Costs (MLOC)," which is an exploratory form of decision theory. The MLOC framework incorporates knowledge about how a predictive model will be used for a subsequent task, thus combining machine learning with the decision that is made afterwards. In this work, we use the MLOC framework to study a problem that has implications for power grid reliability and maintenance, called the Machine Learning and Traveling Repairman Problem (ML&TRP). The goal of the ML&TRP is to determine a route for a "repair crew," which repairs nodes on a graph. The repair crew aims to minimize the cost of failures at the nodes, but as in many real situations, the failure probabilities are not known and must be estimated. The MLOC framework allows us to understand how this uncertainty influences the repair route. We also present new covering number generalization bounds for the MLOC framework.

Keywords decision theory · generalization bound · constrained linear function classes · covering numbers · traveling repairman · mixed-integer programming

Funding for Theja Tulabandhula was provided by a Fulbright Fellowship and a Xerox Fellowship. Cynthia Rudin's work on this project was funded in part by Con Edison, by the MIT Energy Initiative Seed Fund, and NSF grant IIS-1053407.

Theja Tulabandhula
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: theja@mit.edu

Cynthia Rudin
MIT Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: rudin@mit.edu

1 Introduction

In many domains, it is essential to understand how uncertainty in predictions influences decision-making. In that sense, one would like to explore the space of possible reasonable predictions and understand the range of reasonable policies and their costs. The new framework of Machine Learning with Operational Costs (MLOC) (Tulabandhula and Rudin, 2013) provides a mechanism to do this, and is a type of exploratory decision theory. Where usual decision theories provide a single policy that minimizes expected costs, the MLOC framework is able to produce a range of reasonable policies that span the full set of reasonable costs. To do this, the operational cost becomes a regularization term within the machine learning model, and adjusting the regularization constant allows us to explore solutions for all reasonable costs. This gives decision makers a way to understand the uncertainty in their predictive model in terms of something they can grasp: uncertainty in the cost to solve the problem. The MLOC framework can also be used in another way, namely to incorporate prior knowledge about the cost in order to produce a better predictive model. In that sense, knowledge about the cost translates into a more restricted hypothesis space, which potentially translates into better generalization. In particular, if the hypothesis space is restricted, then upper bounds on the complexity of the hypothesis space are smaller, leading to better generalization bounds.

In this work, we provide an application of the MLOC framework to power grid engineering and reliability. This problem, called the Machine Learning and Traveling Repairman Problem (ML&TRP), has a machine learning component and a decision-making component. The machine learning component is to predict future power grid failures before they occur, where these failures occur at equipment that is distributed throughout the city.
The decision-making component is to determine in what order the equipment should be inspected. We could use the MLOC framework in either of the two ways outlined above: either to understand the range of reasonable costs for the power company, or to use prior knowledge that the costs are high or low in order to choose a more predictive and cost-effective route.

To be more precise, the ML&TRP prediction problem is to determine the failure probability for each node on a graph, using features of each node and past failure data. The decision problem is to determine a route for a "repair crew" on the graph, where there is some travel time between each pair of nodes. There are many possible applications of the ML&TRP, including the scheduling of safety inspections or repair work for the electrical grid, oil rigs, underground mining, machines in a factory, or airlines. In our experiments, we use data from an ongoing project with Con Edison, which is NYC's power utility company.

We also provide a generalization bound for the MLOC framework based on covering numbers. These bounds are different from those of Tulabandhula and Rudin (2013), which use concentration of Rademacher complexity and Dudley's entropy integral, and are not directly comparable. The bounds here have a much more geometric flavor, viewing the hypothesis space as a volumetric object. Neither of the two bounds is tighter in all situations. We find the bounds here to be more intuitive, as the geometry is more transparent.

The ML&TRP relates to literature on both machine learning and optimization (time-dependent traveling salesman problems). In machine learning, our work bears a slight resemblance to work on graph-based regularization (Agarwal, 2006, Belkin et al., 2006, Zhou et al., 2004), but their goal is to obtain probability estimates that are smoothed on a graph with suitably designed edge weights.
On the other hand, our goal is to obtain, in addition to probability estimates, a low-cost route for traversing a very different graph with edge weights that are physical distances. Our regularization is vastly different from popular ones ($\ell_1$ or $\ell_2$ norm) because our regularization comes from beliefs on decision-making costs. We use unlabeled data as does semi-supervised learning (Chapelle et al., 2006) but differ in the motivation as well as the way we use these additional data. For example, we do not extract distributional information from the unlabeled data. Our work contributes to the literature on the TRP (Traveling Repairman Problem) and related problems by adding the new dimension of probabilistic estimation at the nodes. We create new adaptations of modern techniques (Fischetti et al., 1993, van Eijl, 1995, Lechmann, 2009) within our work for solving the TRP part of the ML&TRP.

There is a body of literature regarding cost models for maintenance in the reliability modeling literature, though the emphasis in those works is usually to design a model that accurately represents the stochastic process for the failures. In that literature, for instance, a maintenance schedule would be created from the predicted condition of the equipment (but not based on the cost of performing the repairs in a certain order or routing a vehicle between the equipment). Barbera et al. (1996) develop a model that assumes that equipment have exponential rates of failure and fail only once in an inspection interval, and they use this model to determine a maintenance schedule. Marseguerra et al. (2002) introduce a model for degradation leading to failure for a continuous complex system, and use Monte Carlo simulations to determine the optimal degradation level at which to perform an inspection.
Their work uses a very different cost model from ours; their cost is the long-run average maintenance cost plus the cost of failures. A neural-network-based maintenance model was developed by Heng et al. (2009). A related work on routing for emergency maintenance on the electrical grid is the heuristic algorithm of Weintraub et al. (1999), which dispatches vehicles to areas where there are currently breakdowns and where there are likely to be breakdowns in the future. Ertekin et al. (2013) propose a model for failures of power grid equipment and use this model to simulate the cost of various inspection policies. One can view the MLOC framework as somewhat Bayesian, in the sense that prior knowledge is being used when not enough data are available.

In Section 2 we review the MLOC framework. In Section 3 we motivate and outline the new application of the MLOC framework to the ML&TRP, providing two ways of modeling failure cost. In Section 4 we provide mixed-integer nonlinear programming (MINLP) formulations and discuss algorithms and an illustrative example. Section 5 gives experimental results on data from the NYC power grid, showing the benefit of the ML&TRP over traditional methods. Section 6 contains the theoretical generalization result for the MLOC framework with proofs. Section 8 concludes the paper. The conference paper of Tulabandhula et al. (2011) contains a summary of work on the ML&TRP, and the paper Tulabandhula and Rudin (2013) provides a more complete explanation of the MLOC framework, with other illustrations and connections to robust optimization.

2 Review of Framework for Machine Learning with Operational Costs

In the MLOC framework we have the standard supervised training set of labeled instances $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$. For simplicity, $\mathcal{X} \subset \mathbb{R}^d$.
To have nonlinear functions, we could simply have the $j$th component of $x$ replaced by a nonlinear function $h_j(x)$. Also $\mathcal{Y} \subset \mathbb{R}$. We wish to learn a function $f^* : \mathcal{X} \to \mathcal{Y}$. This is ordinarily done by solving a minimization problem:
\[
f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^m l(f(x_i), y_i) + C_2 R(f) \right), \tag{1}
\]
for some loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, regularizer $R : \mathcal{F}^{unc} \to \mathbb{R}$, constant $C_2$, and function class $\mathcal{F}^{unc}$. $\mathcal{F}^{unc}$ is the set of all linear functionals, where $f \in \mathcal{F}^{unc}$ is of the form $\lambda \cdot x$, $\lambda \in \mathbb{R}^d$. The superscript "unc" refers to the word "unconstrained."

Consider an organization making a policy decision regarding a new collection of unlabeled instances $\{\tilde{x}_i\}_{i=1}^M \in \mathcal{X}^M$. The cost to enact a policy is not exactly known, because the labels for the $\{\tilde{x}_i\}_i$ are not known. Instead the model's predictions are used, which are the $f^*(\tilde{x}_i)$'s. The goal of the organization is then to create a policy $\pi^*$ that minimizes the operational cost $\mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$. The operational cost $\mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$ is how much will be spent if policy $\pi$ is chosen in response to the $\{f^*(\tilde{x}_i)\}_i$'s. When there is uncertainty in $f^*$, there is uncertainty in the cost to enact the optimal policy $\pi^*$. This uncertainty is what we would like to explore. A typical way that companies make decisions is using what we call the sequential process, which computes the policy according to two steps:

Step 1: Create function $f^*$ based on $\{(x_i, y_i)\}_i$ according to (1). That is:
\[
f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left( \sum_{i=1}^m l(f(x_i), y_i) + C_2 R(f) \right).
\]

Step 2: Choose policy $\pi^*$ to minimize the operational cost,
\[
\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i).
\]

On the other hand, the MLOC framework is based around a simultaneous process, which combines Steps 1 and 2 of the sequential process.
To do this, the operational cost becomes a regularization term, and its regularization parameter $C_1$ controls the amount of optimism or pessimism about the operational cost.

Step 1: Choose a model $f^*$ obeying the following:
\[
f^* \in \operatorname{argmin}_{f \in \mathcal{F}^{unc}} \left[ \sum_{i=1}^m l(f(x_i), y_i) + C_2 R(f) + C_1 \min_{\pi \in \Pi} \mathrm{OpCost}(\pi, f, \{\tilde{x}_i\}_i) \right].
\]

Step 2: Compute the policy:
\[
\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i).
\]

The case $C_1 = 0$ of the simultaneous process is precisely the sequential process; thus, the sequential process is a special case of the simultaneous process. Our ability to solve the MLOC simultaneous process depends on the tractability of the optimization problem $\operatorname{argmin}_{\pi \in \Pi} \mathrm{OpCost}(\pi, f^*, \{\tilde{x}_i\}_i)$. However, if this problem is intractable, then the sequential process is also intractable, and the organization will not be able to choose an optimized policy at all. The simultaneous process requires this subproblem to be solved several times, whereas the sequential process only requires the subproblem to be solved once. If the number of unlabeled instances is small, then Step 1 can be solved without a problem, even if the training set is large.

As $C_1$ varies over its full range, it maps out the full range of costs for all reasonable solutions. If $C_1$ is set to a number that is too large (either positive or negative), the solution of the simultaneous process will have empirical error that is too high to be reasonable. In that case, we know that varying $C_1$ within a smaller range will lead to the full range of costs for reasonable predictive models.

As with any regularization term, the new operational cost term can be interpreted as a prior belief about the model; in this case, a belief that the operating costs should be lower or higher on the current set of unlabeled instances $\{\tilde{x}_i\}_i$.
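As an illustration only, the simultaneous process can be sketched on toy data. The quadratic regularizer, the position-weighted stand-in for OpCost, and all data below are our own assumptions, not the paper's formulation or experiments; the sketch only shows the mechanics of sweeping $C_1$.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy instance of the simultaneous process. All data, the quadratic
# regularizer R(f) = ||lambda||^2, and the position-weighted stand-in
# for OpCost are illustrative assumptions.
m, M, d = 40, 4, 3
X = rng.normal(size=(m, d))                    # labeled features x_i
y = np.where(rng.random(m) < 0.5, 1.0, -1.0)   # labels y_i in {-1, +1}
X_tilde = rng.normal(size=(M, d))              # unlabeled instances x~_i
C2 = 0.1

def min_op_cost(lam):
    # min over routes pi of a toy position-weighted failure cost:
    # sum_i (position of i on the route) * p(x~_i), p = logistic prob.
    p = 1.0 / (1.0 + np.exp(-X_tilde @ lam))
    return min(
        sum((k + 1) * p[i] for k, i in enumerate(pi))
        for pi in itertools.permutations(range(M))
    )

def objective(lam, C1):
    # Step 1: training loss + C2*R(f) + C1 * min_pi OpCost(pi, f, {x~_i}).
    margins = y * (X @ lam)
    return float(np.sum(np.log1p(np.exp(-margins))) + C2 * lam @ lam
                 + C1 * min_op_cost(lam))

# Sweeping C1 maps out the operational costs of the fitted models;
# C1 = 0 recovers the sequential process.
costs = {C1: min_op_cost(minimize(objective, np.zeros(d), args=(C1,)).x)
         for C1 in (-0.5, 0.0, 0.5)}
```

A positive $C_1$ (pessimism that high costs should be avoided) drives the fitted model toward lower operational cost, while a negative $C_1$ drives it higher; in between lies the cost range of reasonable models.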
In that sense, MLOC regularization may have a closer connection to reality than typical (e.g., $\ell_1$ or $\ell_2$ norm) regularizers. If one asks a manager at a company what prior belief they have about the estimation model, it is not likely they would give an answer in terms of coefficients for a linear model. Even managers who are not mathematicians or computer scientists might have some belief: they could perhaps believe that they are expecting to spend a certain amount to enact the policy. It is possible that this type of belief, which relies on direct experience, might be more practical, and more accurate, than the more abstract prior information that we are typically used to dealing with.

In the ML&TRP, the training error term is derived from data from the past, and the OpCost term is calculated on data from the present. The OpCost term is the only term that deals with routing.

3 The Machine Learning and Traveling Repairman Problem

The US Department of Energy's Grid 2030 document states that "America's electric system, 'the supreme engineering achievement of the 20th century,' is aging, inefficient, and congested, and incapable of meeting the future energy needs of the Information Economy without operational changes and substantial capital investment over the next several decades" (United States Department of Energy, Office of Electric Transmission and Distribution, 2003). Since 2004, many power utility companies have been implementing new inspection and repair programs for preemptive maintenance, whereas in the past, all repair work was done reactively (Urbina, 2004). New York City has the oldest power system in the world, and the largest underground electric system, with enough electrical cable to go three and a half times around the world.
In New York City, there are several separate new preemptive maintenance programs, including the targeted inspection program for electrical service structures (manholes), programs that perform extensive repairs that were placed on a waiting list after the manhole was inspected, and the vented cover replacement program, where each manhole cover is replaced with a vented cover that allows gases to escape, mitigating the possibility and effects of serious events including fires and explosions. Con Edison, the power company in NYC, has the ability to use machine learning models in Manhattan, Brooklyn and the Bronx for scheduling of manhole inspection and repair work (Rudin et al., 2010, 2012, 2011, 2014). This project was the motivation for the development of the ML&TRP, and we use data from the NYC power grid for our experiments. Features for the NYC model are derived from physical characteristics of the manhole (e.g., number of electrical cables entering the manhole), and from its history of involvement in past events. Repeat failures (serious and non-serious events) can occur on the same manhole. We take the possibility of repeat failures into account in the ML&TRP (in Cost 1, given below). That said, failures are rare events, and it is not easy to accurately estimate the probability that a given manhole will fail within a given period of time. Because of this uncertainty, we can use the MLOC framework to assist in decision-making. The result $\pi^* \in \Pi$ from the algorithm would be a route that could be used for the repair crew to fix a pre-specified set of manholes corresponding to $\{\tilde{x}_i\}_{i=1}^M$, which are assumed to need a particular repair.

3.1 Learning

In what follows, we will use descriptions and terminology that match the power grid application.
In the ML&TRP, data from the past, denoted $\{(x_i, y_i)\}_{i=1}^m$, will be used to train the model, whereas the $\tilde{x}_i$ are calculated from the present, whose labels are from the future and thus not known. Let $x_i^j$ indicate the $j$-th coordinate of the feature vector for manhole $i$ calculated at a time period from the past. The $x_i$ vector encodes the number and types of electrical cables, number and types of previous events, etc. The label for manhole $i$ from the past is denoted $y_i$, where $y_i \in \{-1, 1\}$ indicates whether the manhole had a failure (fire, explosion, smoking manhole) within a specific period of time in the past. More details about the features and labels can be found in Section 5.

The other instances $\{\tilde{x}_i\}_{i=1}^M$ (with $M$ unrelated to $m$) are unlabeled data that are each associated with a node on a graph $G$. The nodes of the graph $G$, indexed by $i = 1, \ldots, M$, represent manholes on which we want to design a route. Note that $M$ can be substantially smaller than $m$, e.g., $M < 10$ and $m > 20{,}000$; e.g., for a repair truck that carries supplies for at most $M$ repairs. We are also given physical distances $d_{i,j} \in \mathbb{R}_+$ between each pair of nodes $i$ and $j$. A route on $G$ is represented by a permutation $\pi$ of the node indices $1, \ldots, M$. Let $\Pi$ be the set of all permutations of $\{1, \ldots, M\}$. Failure probabilities will be estimated at each of the nodes, and these estimates will be based on a function of the form $f_\lambda(x) = \lambda \cdot x$. The class of possible functions $\mathcal{F}$ is chosen to be:
\[
\mathcal{F} := \{ f_\lambda : \lambda \in \mathbb{R}^d, \ \|\lambda\|_2 \leq B_b \},
\]
where $B_b$ is a fixed positive real number. We choose the logistic loss:
\[
l(f_\lambda(x), y) := \ln\left(1 + e^{-y f_\lambda(x)}\right),
\]
so that the probability of failure $P(y = 1 \mid x)$ is estimated as in logistic regression by:
\[
P(y = 1 \mid x) \ \text{or} \ p(x) := \frac{1}{1 + e^{-f_\lambda(x)}}. \tag{2}
\]
Note that the routing problem is done in batch: once the route is determined, the repair truck is sent out, and changes to the route are no longer possible.

3.2 Two Options for the OpCost

The operational cost can be defined to match the application. In the first option (denoted Cost 1), for each node there is a cost for (possibly repeated) failures prior to a visit by the repair crew. In this case, temporary repairs are made to fix each node before the repair crew comes to make permanent repairs. In the second option (denoted Cost 2), for each node there is a cost for the first failure prior to visiting it. In this case, permanent repairs are made when there is an event, or when the repair crew arrives, whichever is sooner. There is a natural interpretation of the failures as being generated by a continuous random process at each of the nodes. When discretized in time, this is approximated by a Bernoulli process with parameter $p(\tilde{x}_i)$. Both Cost 1 and Cost 2 are appropriate for power grid applications. Cost 2 is also appropriate for delivery truck routing applications, where perishable items can fail (once an item has spoiled, it cannot spoil again).

For convenience, we assume that after the repair crew visits all the nodes, it returns to the starting node (node 1), which is fixed beforehand. Scenarios where one is not interested in beginning from or returning to the starting node would be modeled slightly differently (the computational complexity remains the same). Let a route be represented by $\pi : \{1, \ldots, M\} \to \{1, \ldots, M\}$; this means that $\pi(i)$ is the $i$th node to be visited. For example, let $M = 4$, $\pi = [2, 3, 4, 1]$. This means $\pi(1) = 2$: node 2 is the first node to be visited; $\pi(2) = 3$: node 3 is the second node on the route; and so on.
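The permutation representation in the example above can be read off mechanically; the following minimal sketch (our own illustration) recovers the visit order from $\pi = [2, 3, 4, 1]$ and closes the tour at the starting node.

```python
# Route pi over M = 4 nodes, as in the example above: pi(i) is the
# i-th node to be visited (1-indexed node labels).
pi = [2, 3, 4, 1]
M = len(pi)

# Position of each node on the route: visit_position[node] = i means
# the node is the i-th stop of the repair crew.
visit_position = {node: i + 1 for i, node in enumerate(pi)}

# Appending pi(M+1) = pi(1) closes the tour at the starting node.
closed_tour = pi + [pi[0]]
```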
Since the final node visited is the first node, we append the following to the definition of $\pi$: $\pi(M+1) = \pi(1)$. Let the distances be scaled appropriately so that a unit of distance is traversed in a unit of time. Given a route, the latency of a node $\pi(i)$ is the time (or, equivalently, distance) from the start at which node $\pi(i)$ is visited. It is the sum of distances traversed before position $i$ on the route:
\[
L_\pi(\pi(i)) := \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k < i]}.
\]
Under the Bernoulli failure process at node $\pi(i)$, the first failure time follows a geometric distribution: writing $p$ for $p(\tilde{x}_{\pi(i)})$, the probability that the first failure occurs at time $t > 0$ is $p(1-p)^{t-1}$. The probability that the first failure for node $\pi(i)$ occurs before time $L_\pi(\pi(i))$ is then the sum of the failure probabilities for $t = 1, \ldots, L_\pi(\pi(i))$:
\[
\sum_{t=1}^{L_\pi(\pi(i))} p(1-p)^{t-1} = 1 - (1-p)^{L_\pi(\pi(i))}.
\]
Thus, substituting the expression (2) for $p$, we have:
\[
P\left(\text{first failure occurs before time } L_\pi(\pi(i))\right) = 1 - \left(1 - p(\tilde{x}_{\pi(i)})\right)^{L_\pi(\pi(i))} = 1 - \left(1 - \frac{1}{1 + e^{-f_\lambda(\tilde{x}_{\pi(i)})}}\right)^{L_\pi(\pi(i))} = 1 - \left(1 + e^{f_\lambda(\tilde{x}_{\pi(i)})}\right)^{-L_\pi(\pi(i))}.
\]
The cost of visiting node $\pi(i)$ will be proportional to this quantity:
\[
\text{Cost of node } \pi(i) \propto 1 - \left(1 + e^{f_\lambda(\tilde{x}_{\pi(i)})}\right)^{-L_\pi(\pi(i))}. \tag{7}
\]
As with Cost 1, $L_\pi(\pi(i))$ influences the cost at each node. If we visit a node early in the route, then the cost incurred is small, because the node is less likely to fail before we reach it. Similarly, if we schedule a visit later in the tour, the cost is higher, because the node has a higher chance of failing prior to the repair crew's visit. The total failure cost is thus:
\[
\mathrm{OpCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^M, \{d_{i,j}\}_{i,j=1}^M) = \sum_{i=1}^{M} \left[ 1 - \left(1 + e^{f_\lambda(\tilde{x}_{\pi(i)})}\right)^{-L_\pi(\pi(i))} \right]. \tag{8}
\]
This cost is not directly related to a weighted TRP cost in its present form. That is, when the failure probabilities of the nodes are all the same, the total cost is not linear in the latencies, as is the case for Cost 1.
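To make the definitions above concrete, here is a small sketch that computes latencies $L_\pi(\pi(i))$ and the Cost 2 objective (8) for a route, and finds the best route by brute force over all permutations of a small node set. The distance matrix and model scores are arbitrary illustrations, not data from the paper, and for simplicity the sketch neither fixes the starting node nor adds the return leg.

```python
import itertools
import numpy as np

# Toy instance: M = 4 nodes with model scores f_lambda(x~_i) and a
# symmetric distance matrix (arbitrary illustrative values).
f = np.array([1.5, -1.0, 0.2, -0.5])    # f_lambda(x~_i), nodes 0..3
d = np.array([
    [0.0, 2.0, 9.0, 4.0],
    [2.0, 0.0, 3.0, 7.0],
    [9.0, 3.0, 0.0, 1.0],
    [4.0, 7.0, 1.0, 0.0],
])
M = len(f)

def latencies(pi):
    # L_pi(pi(i)): distance traversed before reaching position i,
    # so the first node visited has latency 0.
    out = np.zeros(M)
    for i in range(1, M):
        out[i] = out[i - 1] + d[pi[i - 1], pi[i]]
    return out

def op_cost2(pi):
    # Eq. (8): sum_i [1 - (1 + exp(f_pi(i)))^(-L_pi(pi(i)))].
    L = latencies(pi)
    return float(np.sum(1.0 - (1.0 + np.exp(f[list(pi)])) ** (-L)))

# Brute-force search over all routes (feasible for small M).
pi_star = min(itertools.permutations(range(M)), key=op_cost2)
```

The brute-force search stands in for the mixed-integer formulations developed later in the paper, which scale beyond toy sizes.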
Building on this cost, we will derive a cost that is the same as a weighted TRP in Section 4.2, of the form:
\[
\text{Cost of node } \pi(i) \propto L_\pi(\pi(i)) \log\left(1 + e^{f_\lambda(\tilde{x}_{\pi(i)})}\right), \tag{9}
\]
as an alternative to (7). There is a slightly more general version of this formulation (as there was for Cost 1), which is to take the cost for each node to be a function of two quantities: the probability of failure before the visit, and the probability of failure after the visit. Let us redefine $\beta$ to be a constant of proportionality for the cost of visiting before the failure event. From the geometric distribution,
\[
P\left(\text{failure occurs after time } L_\pi(\pi(i))\right) = \left(1 - p(\tilde{x}_{\pi(i)})\right)^{L_\pi(\pi(i))},
\]
and the cost of visiting node $\pi(i)$ becomes:
\[
\text{Cost of node } \pi(i) \propto P\left(\text{failure before } L_\pi(\pi(i))\right) + \beta \times P\left(\text{failure after } L_\pi(\pi(i))\right).
\]
If $\beta = 1$, then the sum above is 1 for all nodes, regardless of node failures or latencies. More realistically, the cost of visiting the node after the failure is greater than the cost of visiting proactively, i.e., $\beta \ll 1$, leading to (7). We could again have written the summation to hide the dependence on $\pi$:
\[
\mathrm{OpCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^M, \{d_{i,j}\}_{i,j=1}^M) = \sum_{i=1}^{M} \left[ 1 - \left(1 + e^{f_\lambda(\tilde{x}_i)}\right)^{-L_\pi(i)} \right].
\]

Remark 1. The costs defined above are by no means exhaustive. We chose to define operational costs this way because they mimic the well-known minimum latency objective in routing problems. For instance, we could have used a Poisson failure model at each node instead of the binomial or geometric models of Costs 1 and 2. Let us assume that the Poisson rate parameter $\mu(\tilde{x}_{\pi(i)})$ is the output of the estimation problem (say, proportional to $p(\tilde{x}_{\pi(i)})$). Then
\[
P\left(k \text{ failures occur in time } L_\pi(\pi(i))\right) = \frac{\left(\mu(\tilde{x}_{\pi(i)}) L_\pi(\pi(i))\right)^k e^{-\mu(\tilde{x}_{\pi(i)}) L_\pi(\pi(i))}}{k!}.
\]
From this we can get the probability that at least one failure occurs in the time interval $[0, L_\pi(\pi(i))]$ at node $\pi(i)$. Now we can define the operational cost to be the sum of these probabilities, which depend on the routing, and proceed in the same way as for Cost 2. That is, we can minimize this cost to get the optimal routing $\pi^*$.

Remark 2. The operational cost must depend on graph properties like latency. We would not like to minimize an objective of the form $\sum_{i=1}^{M} \frac{1}{p(\tilde{x}_{\pi(i)})}$ (or any other function of just $p(\tilde{x}_{\pi(i)})$, the output of the estimation problem), as this does not lead to an operational cost in the true sense. Such an objective does not make use of latency information or other graph properties related to routing, unless $p(\tilde{x}_{\pi(i)})$ implicitly depends on them (which is not the case here).

Now that the major steps for both formulations have been defined, we will discuss methods for optimizing the objectives.

4 Optimization

We start by formulating mixed-integer linear programs (MILPs) for the TRP subproblem.

4.1 Mixed-integer optimization for Cost 1

For either the sequential or the simultaneous process, we need the solution of the subproblem $\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \mathrm{OpCost}(\pi, f_\lambda^*, \{\tilde{x}_i\}_{i=1}^M, \{d_{i,j}\}_{i,j=1}^M)$, or equivalently,
\[
\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \sum_{i=2}^{M} p(\tilde{x}_{\pi(i)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k < i]}.
\]

Theorem 1 For any $\epsilon > 0$,
\[
P\left( \exists f \in \mathcal{F}_0 : \left| R^{emp}(f_\lambda, \{x_i, y_i\}_1^m) - R^{true}(f_\lambda) \right| > \epsilon \right) \leq 4\, \alpha\left(d, a_{budget}(C_{budget})\right) \left( \frac{32 B_b X_b}{\epsilon} + 1 \right)^d \exp\left( \frac{-m\epsilon^2}{128 M_{bound}^2} \right),
\]
where $\alpha(d, a_{budget}(C_{budget}))$ is equal to
\[
\frac{1}{2} + \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right) \frac{\Gamma\left(1 + \frac{d}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{d+1}{2}\right)} \, {}_2F_1\left( \frac{1}{2}, \frac{1-d}{2}; \frac{3}{2}; \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right)^2 \right) \tag{24}
\]
or equivalently,
\[
1 - \frac{1}{2} I_{1 - \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right)^2}\left( \frac{d+1}{2}, \frac{1}{2} \right), \tag{25}
\]
and where ${}_2F_1(a, b; c; z)$ and $I_x(a, b)$ are the hypergeometric function and the regularized incomplete beta function, respectively.

The term $\alpha(d, a_{budget}(C_{budget}))$ comes directly from formulae for the volume of spherical caps. As $C_{budget}$ decreases, the norm $\|a_{budget}\|_2$ increases, thus $\|a_{budget}\|_2^{-1}$ decreases, (24) and (25) decrease, and the whole bound decreases. This is the mechanism by which decreasing $C_{budget}$ may improve generalization ability.

Theorem 1 is specific to the ML&TRP, because $\mathcal{F}_0$ was defined based on the ML&TRP and $a_{budget}$ was defined in (22) for Cost 1 and (23) for Cost 2. The technique of Theorem 1, however, applies much more broadly than the ML&TRP. In fact, we can derive a general bound that applies to any problem with a similar hypothesis space constraint. Specifically, the hypothesis space should be bounded by the intersection of a ball with a half-space.

Corollary 1 (Bound for General MLOC Framework) Consider any operational cost constraint such that the hypothesis space lies within $\mathcal{F}_2$, defined by $\mathcal{F}_2 = \{ f_\lambda \in \mathcal{F} : a_{budget} \cdot \lambda \leq 1 \}$ for some $a_{budget} \in \mathbb{R}^d$. Then, for any $\epsilon > 0$,
\[
P\left( \exists f \in \mathcal{F}_2 : \left| R^{emp}(f_\lambda, \{x_i, y_i\}_1^m) - R^{true}(f_\lambda) \right| > \epsilon \right) \leq 4\, \alpha(d, a_{budget}) \left( \frac{32 B_b X_b}{\epsilon} + 1 \right)^d \exp\left( \frac{-m\epsilon^2}{128 M_{bound}^2} \right),
\]
where $\alpha(d, a_{budget})$ equals
\[
\frac{1}{2} + \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right) \frac{\Gamma\left(1 + \frac{d}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{d+1}{2}\right)} \, {}_2F_1\left( \frac{1}{2}, \frac{1-d}{2}; \frac{3}{2}; \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right)^2 \right),
\]
or equivalently,
\[
1 - \frac{1}{2} I_{1 - \left( \frac{\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}}{B_b + \frac{\epsilon}{32 X_b}} \right)^2}\left( \frac{d+1}{2}, \frac{1}{2} \right),
\]
and where ${}_2F_1(a, b; c; z)$ and $I_x(a, b)$ are the hypergeometric function and the regularized incomplete beta function, respectively.

The quantity $\alpha(d, a_{budget})$ is influenced by our belief about the operational cost.
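The equivalence of the hypergeometric and incomplete-beta forms of $\alpha$ can be verified numerically. In the sketch below, `r` stands for the ratio $\left(\|a_{budget}\|_2^{-1} + \frac{\epsilon}{32 X_b}\right) / \left(B_b + \frac{\epsilon}{32 X_b}\right)$, evaluated at illustrative values in $[0, 1]$; SciPy's `betainc` implements the regularized incomplete beta function $I_x(a, b)$.

```python
import numpy as np
from scipy.special import betainc, gamma, hyp2f1

# Cross-check of the two closed forms for alpha: the hypergeometric
# form (24) and the regularized-incomplete-beta form (25).
def alpha_hyp(d, r):
    coef = gamma(1.0 + d / 2.0) / (np.sqrt(np.pi) * gamma((d + 1.0) / 2.0))
    return 0.5 + r * coef * hyp2f1(0.5, (1.0 - d) / 2.0, 1.5, r ** 2)

def alpha_beta(d, r):
    # betainc(a, b, x) is the regularized incomplete beta I_x(a, b).
    return 1.0 - 0.5 * betainc((d + 1.0) / 2.0, 0.5, 1.0 - r ** 2)

pairs = [(alpha_hyp(d, r), alpha_beta(d, r))
         for d in (1, 2, 3, 10) for r in (0.0, 0.3, 0.7, 0.99)]
```

Both forms express the fraction of the ball's volume left after cutting off a spherical cap: at `r = 0` the hyperplane passes through the center and $\alpha = 1/2$, while as `r` approaches 1 the cap vanishes and $\alpha$ approaches 1.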
Thus, by being able to specify something about the operational cost, we are able to obtain a better guarantee on generalization. In the case where we are not able to specify anything about the operational cost, the quantity $\alpha(d, a_{budget})$ is equal to 1, giving us the standard generalization result for norm-constrained linear function classes.

6.3 Proof

The proof outline is as follows. We will construct two classes, $\mathcal{F}_1$ and $\mathcal{F}_2$, that are slightly larger than $\mathcal{F}_0$ but smaller than $\mathcal{F}$ when $C_{budget}$ is small enough. Then we will use a volumetric argument to bound the covering number of $\mathcal{F}_2$, which uses the volumes of spherical caps; the idea is to show that the value of $C_{budget}$ affects the volume of the hypothesis space, and thus the covering number. The covering number bound is then applied to a uniform bound of Pollard (1984) to obtain a generalization bound. The fact that the covering number of $\mathcal{F}_2$ can be below that of $\mathcal{F}$ indicates that using functions from $\mathcal{F}_2$ may provide improvements in generalization over using the full set $\mathcal{F}$. Let us lead up to the proof of Theorem 1.

Definition 1. Let $A \subseteq \mathcal{X}$ be an arbitrary set and $(\mathcal{X}, \mathrm{dist})$ a (pseudo) metric space. Let $|\cdot|$ denote set size.
– For any $\epsilon > 0$, an $\epsilon$-cover for $A$ is a finite set $U \subseteq \mathcal{X}$ (not necessarily $\subseteq A$) such that $\forall x \in A, \exists u \in U$ with $\mathrm{dist}(x, u) \leq \epsilon$.
– $A$ is totally bounded if $A$ has a finite $\epsilon$-cover for all $\epsilon > 0$. The covering number of $A$ is $N(\epsilon, A, \mathrm{dist}) := \inf_{U \in \mathcal{U}} |U|$, where $\mathcal{U}$ is the set of all $\epsilon$-covers for $A$.
– A set $R \subseteq \mathcal{X}$ is $\epsilon$-separated if $\forall x, y \in R$, $\mathrm{dist}(x, y) > \epsilon$. The packing number is $M(\epsilon, A, \mathrm{dist}) := \sup_{R \in \mathcal{R}} |R|$, where $\mathcal{R}$ is the set of all $\epsilon$-separated subsets of $A$.

Consider Cost 1.
Since, for any collection of values $p(\tilde{x}_i) \geq 0$, we have $\sum_i d_i\, p(\tilde{x}_i) \leq \sum_i L_\pi(i)\, p(\tilde{x}_i) \leq C_{budget}$, the class of functions which obey the constraint $\sum_i d_i\, p(\tilde{x}_i) \leq C_{budget}$ is larger than the class obeying $\sum_i L_\pi(i)\, p(\tilde{x}_i) \leq C_{budget}$. That is, $\mathcal{F}_0 \subseteq \mathcal{F}_1$, where
\[
\mathcal{F}_1 := \left\{ f_\lambda : f_\lambda \in \mathcal{F}, \ \sum_{i=1}^{M} d_i \frac{1}{1 + e^{-f_\lambda(\tilde{x}_i)}} \leq C_{budget} \right\}.
\]
As long as $C_{budget} \leq \sum_{i=1}^{M} d_i$, the constraint in $\mathcal{F}_1$ is not vacuous. The choice of the vector $a_{budget}$ ensures that $\mathcal{F}_1$ is a subset of $\mathcal{F}_2$, as we will prove below.

Lemma 1 ($\mathcal{F}_0$ is contained in $\mathcal{F}_2$)
\[
N(\epsilon, \mathcal{F}_0, \|\cdot\|_{L_2(\mu_{\mathcal{X}}^m)}) \leq N(\epsilon, \mathcal{F}_1, \|\cdot\|_{L_2(\mu_{\mathcal{X}}^m)}) \leq N(\epsilon, \mathcal{F}_2, \|\cdot\|_{L_2(\mu_{\mathcal{X}}^m)}).
\]

Proof. It is sufficient to show $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2$. The first inclusion was discussed earlier; since $d_i = \inf_{\pi \in \Pi} L_\pi(i)$, this implies:
\[
\sum_{i=1}^{M} d_i\, p(\tilde{x}_i) \leq \sum_{i=1}^{M} L_\pi(i)\, p(\tilde{x}_i) \leq C_{budget} \ \Rightarrow \ \mathcal{F}_0 \subseteq \mathcal{F}_1.
\]
We now show $\mathcal{F}_1 \subseteq \mathcal{F}_2$. We first lower bound $p(\tilde{x}_i)$ by a line with slope
\[
m_1 := \frac{e^{B_b X_b}}{(1 + e^{B_b X_b})^2} \quad \text{and intercept} \quad m_0 := \frac{B_b X_b\, e^{B_b X_b}}{(1 + e^{B_b X_b})^2} + \frac{1}{1 + e^{B_b X_b}},
\]
such that $m_1 f_\lambda(\tilde{x}_i) + m_0 \leq p(\tilde{x}_i)$ within the function range $[-B_b X_b, B_b X_b]$. This leads to the definition of $a_{budget}$, as we show now:
\[
\sum_i d_i\, p(\tilde{x}_i) \geq \sum_i d_i \left( m_1 (\lambda \cdot \tilde{x}_i) + m_0 \right) = \tilde{a} \cdot \lambda + a_0, \tag{26}
\]
where
\[
\tilde{a}^j := m_1 \sum_i d_i \tilde{x}_i^j = \frac{e^{B_b X_b}}{(1 + e^{B_b X_b})^2} \sum_i d_i \tilde{x}_i^j \quad \text{for } j = 1, \ldots, d, \tag{27}
\]
and
\[
a_0 := m_0 \sum_i d_i = \left( \frac{B_b X_b\, e^{B_b X_b}}{(1 + e^{B_b X_b})^2} + \frac{1}{1 + e^{B_b X_b}} \right) \sum_i d_i.
\]
Thus, for all $f_\lambda \in \mathcal{F}_1$,
\[
\tilde{a} \cdot \lambda + a_0 \leq \sum_{i=1}^{M} d_i\, p(\tilde{x}_i) \leq C_{budget}, \tag{28}
\]
which implies $\tilde{a} \cdot \lambda \leq C_{budget} - a_0$, or equivalently, $\frac{1}{C_{budget} - a_0} \tilde{a} \cdot \lambda \leq 1$. This allows us to define $a_{budget}$ using (27) as
\[
a_{budget}^j = \frac{1}{C_{budget} - a_0} \left( \frac{e^{B_b X_b}}{(1 + e^{B_b X_b})^2} \sum_i d_i \tilde{x}_i^j \right) \quad \text{for } j = 1, \ldots, d,
\]
which is the same as (22). This vector is such that the set $\mathcal{F}_2$ is larger than $\mathcal{F}_1$. □
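Both linear lower bounds can be checked numerically: the tangent-line bound on $p(\cdot)$ used above for Cost 1, and the analogous bound on $\log(1 + e^z)$ used for Cost 2 in Remark 5 below. In this sketch, `B = 2.0` stands in for $B_b X_b$ and is an arbitrary illustration.

```python
import numpy as np

B = 2.0                      # stands in for B_b * X_b (illustrative)
z = np.linspace(-B, B, 10001)

# Cost 1 bound: m1*z + m0 <= sigmoid(z) on [-B, B]. The line is the
# tangent to the sigmoid at z = -B, so the gap is zero there.
m1_sig = np.exp(B) / (1.0 + np.exp(B)) ** 2
m0_sig = B * np.exp(B) / (1.0 + np.exp(B)) ** 2 + 1.0 / (1.0 + np.exp(B))
gap_sig = 1.0 / (1.0 + np.exp(-z)) - (m1_sig * z + m0_sig)

# Cost 2 bound: m1*z + m0 <= log(1 + exp(z)) on [-B, B]. Since
# log(1+e^z) is convex, its tangent at z = -B stays below it.
m1_sp = np.exp(-B) / (1.0 + np.exp(-B))
m0_sp = B * np.exp(-B) / (1.0 + np.exp(-B)) + np.log1p(np.exp(-B))
gap_sp = np.log1p(np.exp(z)) - (m1_sp * z + m0_sp)
```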
□

Remark 5 (Deriving $a_{\text{budget}}$ for Cost 2): The above lemma can be adapted to Cost 2 to give the corresponding $a_{\text{budget}}$ that we defined earlier. In particular, for any collection of values $\log(1 + e^{\lambda \cdot \tilde{x}_i}) \geq 0$ for all $i$,
$$\sum_i d_i \log(1 + e^{\lambda \cdot \tilde{x}_i}) \leq \sum_i L_\pi(i) \log(1 + e^{\lambda \cdot \tilde{x}_i}).$$
Thus the class of functions obeying the constraint $\sum_i d_i \log(1 + e^{\lambda \cdot \tilde{x}_i}) \leq C_{\text{budget}}$ is larger than the class obeying $\sum_i L_\pi(i) \log(1 + e^{\lambda \cdot \tilde{x}_i}) \leq C_{\text{budget}}$, which is $\mathcal{F}_0$. $\mathcal{F}_1$ will be the set corresponding to the former constraint:
$$\mathcal{F}_1 := \left\{ f_\lambda \in \mathcal{F} : \sum_{i=1}^{M} d_i \log(1 + e^{\lambda \cdot \tilde{x}_i}) \leq C_{\text{budget}} \right\}.$$
We now define $\mathcal{F}_2$ and $a_{\text{budget}}$ as follows. The function $\log(1 + e^{\lambda \cdot \tilde{x}_i})$ can be lower bounded by a line with slope $m_1 := \frac{e^{-B_b X_b}}{1 + e^{-B_b X_b}}$ and intercept $m_0 := \frac{B_b X_b\, e^{-B_b X_b}}{1 + e^{-B_b X_b}} + \log(1 + e^{-B_b X_b})$ in the function range $[-B_b X_b, B_b X_b]$, giving us the definition of $a_{\text{budget}}$ for Cost 2 as follows:
$$C_{\text{budget}} \geq \sum_i d_i \log(1 + e^{\lambda \cdot \tilde{x}_i}) \geq \sum_i d_i \left( m_1 (\lambda \cdot \tilde{x}_i) + m_0 \right) = \tilde{a} \cdot \lambda + a_0,$$
where
$$\tilde{a}^j := m_1 \sum_i d_i \tilde{x}^j_i = \frac{e^{-B_b X_b}}{1 + e^{-B_b X_b}} \sum_i d_i \tilde{x}^j_i \quad \text{for } j = 1, \ldots, d$$
and $a_0 := m_0 \sum_i d_i = \left( \frac{B_b X_b\, e^{-B_b X_b}}{1 + e^{-B_b X_b}} + \log(1 + e^{-B_b X_b}) \right) \sum_i d_i$. Thus $\frac{1}{C_{\text{budget}} - a_0}\, \tilde{a} \cdot \lambda \leq 1$, and since we wanted $a_{\text{budget}} \cdot \lambda \leq 1$, we define $a_{\text{budget}}$ element-wise as
$$a^j_{\text{budget}} = \frac{1}{C_{\text{budget}} - a_0} \left( \frac{e^{-B_b X_b}}{1 + e^{-B_b X_b}} \sum_i d_i \tilde{x}^j_i \right) \quad \text{for } j = 1, \ldots, d.$$
Note that we have now produced an $a_{\text{budget}}$ vector for each of the two costs, Cost 1 and Cost 2.

Let $B(0, B_b) := \{\lambda \in \mathbb{R}^d : \|\lambda\|_2 \leq B_b\}$. Let the halfspace corresponding to $\mathcal{F}_2$ be $H_{\|a_{\text{budget}}\|_2^{-1}} := \{\lambda : a_{\text{budget}} \cdot \lambda \leq 1\}$. The lemma below relates covering numbers of $\mathcal{F}$ and $\mathcal{F}_2$ in function space to covering numbers of $B(0, B_b)$ and $B(0, B_b) \cap H_{\|a_{\text{budget}}\|_2^{-1}}$ in $\mathbb{R}^d$.
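Remark 5's lower bound can be checked the same way as for Cost 1. Since $g(f) = \log(1 + e^f)$ is convex, its tangent at $f = -B_b X_b$ is a global lower bound, in particular on $[-B_b X_b, B_b X_b]$. A minimal numeric check, with an illustrative value of $B_b X_b$ (not from the paper):

```python
import math

# Tangent of the softplus g(f) = log(1 + e^f) at f = -t (Remark 5, Cost 2).
t = 1.5  # illustrative value of B_b * X_b
m1 = math.exp(-t) / (1.0 + math.exp(-t))     # slope g'(-t)
m0 = t * m1 + math.log(1.0 + math.exp(-t))   # intercept, so m1*(-t) + m0 = g(-t)

# Convexity makes the tangent a lower bound everywhere; check it on [-t, t].
for k in range(1001):
    f = -t + 2.0 * t * k / 1000.0
    assert m1 * f + m0 <= math.log(1.0 + math.exp(f)) + 1e-12
```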
Lemma 2 (Relating covering numbers in $\|\cdot\|_{L_2(\mu^m_X)}$ to $\|\cdot\|_2$)

a. $\sup_{\mu^m_X} N(\epsilon, \mathcal{F}, \|\cdot\|_{L_2(\mu^m_X)}) \leq N(\epsilon / X_b, B(0, B_b), \|\cdot\|_2)$, and

b. $\sup_{\mu^m_X} N(\epsilon, \mathcal{F}_2, \|\cdot\|_{L_2(\mu^m_X)}) \leq N(\epsilon / X_b, B(0, B_b) \cap H_{\|a_{\text{budget}}\|_2^{-1}}, \|\cdot\|_2)$.

Proof Each element $f \in \mathcal{F}$ corresponds to at least one element of $B(0, B_b)$ by definition of $\mathcal{F}$. Choose any distribution $\mu^m_X$. Consider two elements $\lambda_f, \lambda_g \in B(0, B_b)$ corresponding to functions $f, g \in \mathcal{F} \subset L_2(\mu^m_X)$. Then,
$$\|f - g\|^2_{L_2(\mu^m_X)} = \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - g(x_i))^2 = \frac{1}{m} \sum_{i=1}^{m} \left( (\lambda_f - \lambda_g) \cdot x_i \right)^2$$
$$\leq \frac{1}{m} \sum_{i=1}^{m} \|\lambda_f - \lambda_g\|_2^2 \, \|x_i\|_2^2 \quad \text{(Cauchy–Schwarz applied to each term)}$$
$$\leq \|\lambda_f - \lambda_g\|_2^2 \left( \frac{1}{m} \sum_{i=1}^{m} X_b^2 \right) \quad \text{(since } \sup_{x \in \mathcal{X}} \|x\|_2 \leq X_b \text{)}$$
$$= \|\lambda_f - \lambda_g\|_2^2 \, X_b^2.$$
Consider a minimal $\epsilon / X_b$-cover $\{\lambda_r\}_r$ for $B(0, B_b)$, where $\lambda_r$ corresponds to a function $r \in \mathcal{F}$. Then by definition, $\forall \lambda \in B(0, B_b), \exists \lambda_r : \|\lambda - \lambda_r\|_2 \leq \epsilon / X_b$. Thus, picking any two elements $\lambda_f, \lambda_g$ in a ball of radius $\epsilon / X_b$ around $\lambda_r$, we see by the inequality above that the corresponding functions $f, g$ belong to a ball of radius $\epsilon$ measured in the $L_2(\mu^m_X)$ distance. The centers of these $\epsilon$-balls in $L_2(\mu^m_X)$ form an $\epsilon$-cover for $\mathcal{F}$. The size of this set is equal to $N(\epsilon / X_b, B(0, B_b), \|\cdot\|_2)$ (the size of the minimal $\epsilon / X_b$-cover for $B(0, B_b)$). The size of the minimal $\epsilon$-cover of $\mathcal{F}$ is at most this size. Hence, $N(\epsilon, \mathcal{F}, \|\cdot\|_{L_2(\mu^m_X)}) \leq N(\epsilon / X_b, B(0, B_b), \|\cdot\|_2)$. Taking a supremum over all $\mu^m_X$, we obtain the first inequality of the lemma. The same argument also works for the second inequality. □

Because of the rotational symmetry of $B(0, B_b)$, the volume cut off from $B(0, B_b)$ by a hyperplane $a_{\text{budget}} \cdot \lambda = 1$ is determined only by the hyperplane's distance from the origin, which is $1 / \|a_{\text{budget}}\|_2$.
Such a portion of a ball (or its complement, if smaller), obtained by slicing the ball with a hyperplane, is called a spherical cap. It can be parameterized by the distance of its (hyper)plane base from the center of the ball, as shown below. For notation, let the volume of a set $A \subset \mathbb{R}^d$ be denoted $Vol(A)$. For example, $Vol(B_1) = \frac{\pi^{d/2}}{\Gamma[d/2 + 1]}$.

Lemma 3 (Volume of spherical caps) Let the volume of ball $B(0, B_b)$ in $\mathbb{R}^d$ be denoted $Vol(B(0, B_b))$. Given a $d$-dimensional vector $a$, let $z = \|a\|_2^{-1}$ be a number and $H_z = \{\lambda : a \cdot \lambda \leq 1\}$ be a halfspace parameterized by $z$. Let the spherical cap be denoted by $B(0, B_b) \cap H'_z$, where the cap is at a distance $z$ (measured from the base of the cap to the center of the ball), and $H'_z$ represents the complement halfspace ($H_z \cup H'_z = \mathbb{R}^d$). Then $Vol(B(0, B_b) \cap H'_z) / Vol(B(0, B_b))$ is equal to the two expressions
$$\frac{1}{2} - \frac{z}{B_b} \frac{\Gamma[1 + \frac{d}{2}]}{\sqrt{\pi}\, \Gamma[\frac{d+1}{2}]} \, {}_2F_1\!\left( \frac{1}{2}, \frac{1 - d}{2}; \frac{3}{2}; \frac{z^2}{B_b^2} \right) = \frac{1}{2} I_{1 - z^2 / B_b^2}\!\left( \frac{d + 1}{2}, \frac{1}{2} \right),$$
where ${}_2F_1(a, b; c; d)$ and $I_x(e, f)$ are the hypergeometric and regularized incomplete beta functions respectively.

Proof See Li (2011) and references therein. □

Next, we need the relationship between packing numbers and covering numbers to prove Theorem 2.

Lemma 4 (Packing and covering numbers) For every (pseudo) metric space $(X, \mathrm{dist})$, $A \subseteq X$, and $\epsilon > 0$, $N(\epsilon, A, \mathrm{dist}) \leq M(\epsilon, A, \mathrm{dist})$.

Proof See Theorem 4 in Kolmogorov and Tikhomirov (1959) or Theorem 12.1 in Anthony and Bartlett (1999) for a proof of this classical result. □

We use the above lemmas to obtain bounds for the covering numbers of the subsets of $\mathbb{R}^d$ that appeared in Lemma 2.
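Lemma 3's ratio can be sanity-checked numerically without special functions. Each slice of the cap at height $tB_b$ is a $(d-1)$-ball of radius $B_b\sqrt{1 - t^2}$, so the volume ratio reduces to a one-dimensional integral. The sketch below (our own illustration; the numerical values are arbitrary) evaluates that integral with Simpson's rule and compares it with the closed-form circular-segment area at $d = 2$.

```python
import math

def cap_ratio(d, z, B, n=20000):
    """Vol(spherical cap at distance z) / Vol(ball of radius B) in R^d.
    A slice at height t*B (t in [-1, 1]) is a (d-1)-ball of radius
    B*sqrt(1 - t^2), so the ratio is a 1-D integral of (1 - t^2)^((d-1)/2),
    evaluated here with composite Simpson's rule (n even)."""
    f = lambda t: (1.0 - t * t) ** ((d - 1) / 2.0)
    def simpson(a, b):
        h = (b - a) / n
        s = f(a) + f(b)
        for i in range(1, n):
            s += (4 if i % 2 else 2) * f(a + i * h)
        return s * h / 3.0
    return simpson(z / B, 1.0) / simpson(-1.0, 1.0)

# d = 2 check: the cap is a circular segment with a known area formula.
B, z = 1.0, 0.4
segment = (math.acos(z / B) - (z / B) * math.sqrt(1 - (z / B) ** 2)) / math.pi
assert abs(cap_ratio(2, z, B) - segment) < 1e-6
# A cap whose base passes through the center is half the ball in any dimension.
assert abs(cap_ratio(7, 0.0, 1.0) - 0.5) < 1e-9
```

The same quantity equals $\frac{1}{2} I_{1 - z^2/B_b^2}\big(\frac{d+1}{2}, \frac{1}{2}\big)$ from Lemma 3, and its complement ($1$ minus the ratio, for the enlarged ball and halfspace) is the kind of shrinkage factor $\alpha(d, a_{\text{budget}})$ that appears in the main bound.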
Theorem 2 (Bound on covering numbers)
$$N(\epsilon / X_b, B(0, B_b), \|\cdot\|_2) \leq \left( \frac{2 B_b X_b}{\epsilon} + 1 \right)^d, \quad \text{and}$$
$$N\!\left( \epsilon / X_b, B(0, B_b) \cap H_{\|a\|_2^{-1}}, \|\cdot\|_2 \right) \leq \frac{Vol\!\left( B_{B_b + \frac{\epsilon}{2 X_b}} \cap H_{\|a\|_2^{-1} + \frac{\epsilon}{2 X_b}} \right)}{Vol\!\left( B_{B_b + \frac{\epsilon}{2 X_b}} \right)} \left( \frac{2 B_b X_b}{\epsilon} + 1 \right)^d.$$

Proof Both statements involve a volumetric argument. For a proof of the first inequality, see Section 3 of Kolmogorov and Tikhomirov (1959), Lemma 4.10 in Pisier (1989), Lorentz (1966), or Lemma 3 in Cucker and Smale (2002). To show the second part, let the volume of the complement of the spherical cap be $Vol(B(0, B_b) \cap H_{\|a\|_2^{-1}})$; we need an upper bound on the size of a minimal $\epsilon / X_b$-cover of this set. We can obtain one by scaling a minimal $\epsilon$-cover, which we find now. By extending the boundary of $B(0, B_b) \cap H_{\|a\|_2^{-1}}$ by $\epsilon / 2$, we can bound the maximal packing number $M(\epsilon, B(0, B_b) \cap H_{\|a\|_2^{-1}}, \|\cdot\|_2)$: balls of radius $\epsilon/2$ centered at the points of an $\epsilon$-separated set are disjoint and lie within the extended body, so
$$M(\epsilon, B(0, B_b) \cap H_{\|a\|_2^{-1}}, \|\cdot\|_2) \times Vol(B_1)\,(\epsilon/2)^d \leq Vol\!\left( B_{B_b + \epsilon/2} \cap H_{\|a\|_2^{-1} + \epsilon/2} \right).$$
Hence,
$$M(\epsilon, B(0, B_b) \cap H_{\|a\|_2^{-1}}, \|\cdot\|_2) \leq \frac{Vol\!\left( B_{B_b + \epsilon/2} \cap H_{\|a\|_2^{-1} + \epsilon/2} \right)}{Vol(B_1)} \frac{1}{(\epsilon/2)^d} = \frac{Vol\!\left( B_{B_b + \epsilon/2} \cap H_{\|a\|_2^{-1} + \epsilon/2} \right)}{Vol\!\left( B_{B_b + \epsilon/2} \right)} \frac{(B_b + \epsilon/2)^d}{(\epsilon/2)^d},$$
where we multiplied and divided by $(B_b + \epsilon/2)^d$ and used $Vol(B_{B_b + \epsilon/2}) = Vol(B_1)(B_b + \epsilon/2)^d$. Scaling $\epsilon$ to $\epsilon / X_b$ and using the relationship between $N(\epsilon, A, \mathrm{dist})$ and $M(\epsilon, A, \mathrm{dist})$ in Lemma 4 yields the second result. □

Thus we have so far shown the relationship between the covering numbers of $\mathcal{F}_0$, $\mathcal{F}_1$, and $\mathcal{F}_2$ in a certain metric in Lemma 1; we have shown how those covering numbers are related to covering numbers in $\ell_2(\mathbb{R}^d)$ in Lemma 2; we have shown how the latter covering numbers relate to volumes in $\ell_2(\mathbb{R}^d)$ in Theorem 2; and we have shown how to compute one of these volumes in Lemma 3.
To complete the proof of Theorem 1, we will use a relation between the covering number of a class of loss functions over some set $G$ and the covering number of the set $G$ itself. We will also use a uniform convergence bound of Pollard (1984).

Theorem 3 (Pollard 1984) Let $l_G$ be a set of functions on $\mathcal{X} \times \mathcal{Y}$ with $0 \leq l(f_\lambda(x), y) \leq M_{\text{bound}}$ for all $l \in l_G$ and all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Let $\{x_i, y_i\}_1^m$ be a sequence of $m$ instances drawn independently according to $\mu_{\mathcal{X} \times \mathcal{Y}}$. Then for any $\epsilon > 0$,
$$P\left( \exists l \in l_G : \left| R_{\text{emp}}(f_\lambda, \{x_i, y_i\}_1^m) - R_{\text{true}}(f_\lambda) \right| > \epsilon \right) \leq 4\, \mathbb{E}\!\left[ N\!\left( \epsilon / 16, l_G, \|\cdot\|_{L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})} \right) \right] \exp\!\left( \frac{-m \epsilon^2}{128 M_{\text{bound}}^2} \right).$$

Proof See Theorem 24 in Pollard (1984) (also in Zhang, 2002, Theorem 1). □

We can relate the covering number of Pollard's loss function set $l_G$ to the covering number of the set $G$ as follows.

Lemma 5 (Relating $l_G$ to $G$) If every function from function class $l_G$, represented as $l : f(\mathcal{X}) \times \mathcal{Y} \mapsto \mathbb{R}$ with $f \in G$, is Lipschitz in its first argument with Lipschitz constant $\mathcal{L}$, then the covering number of $l_G$ is related to the covering number of $G$ by
$$\sup_{\mu^m_{\mathcal{X} \times \mathcal{Y}}} N\!\left( \epsilon, l_G, \|\cdot\|_{L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})} \right) \leq N\!\left( \epsilon / \mathcal{L}, G, \|\cdot\|_{L_1(\mu^m_{\mathcal{X}})} \right).$$

Proof Consider two functions $f, g \in G$. Let the corresponding functions in class $l_G$ be $l_f = l(f(x), y)$ and $l_g = l(g(x), y)$. Then
$$\|l_f - l_g\|_{L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})} = \frac{1}{m} \sum_{i=1}^{m} |l(f(x_i), y_i) - l(g(x_i), y_i)| \leq \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\, |f(x_i) - g(x_i)| = \mathcal{L}\, \|f - g\|_{L_1(\mu^m_{\mathcal{X}})}.$$
This implies that, given $\{x_i, y_i\}_{i=1}^m$, if $\hat{G}$ is a minimal $\epsilon / \mathcal{L}$-cover of $G$ in $L_1(\mu^m_{\mathcal{X}})$, we can construct an $\epsilon$-cover of $l_G$ in $L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})$ as $\hat{l}_G = \{l_{f_i} : f_i \in \hat{G}\}$. The size of the minimal $\epsilon$-cover will be at most the size of this $\epsilon$-cover. Taking the supremum over all empirical distributions, we get the desired result.
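Lemma 5 is applied below with the logistic loss and $\mathcal{L} = 1$: since $\left|\frac{d}{ds}\log(1 + e^{-ys})\right| = \sigma(-ys) \leq 1$ for $y \in \{-1, +1\}$, the loss is 1-Lipschitz in its score argument. A minimal numeric check of this property (our own sketch, written in the $\pm 1$-label form of the logistic loss; the grid of scores is arbitrary):

```python
import math

def logistic_loss(score, y):
    """Logistic loss as a function of the real-valued score f(x), with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))

# |l(s1, y) - l(s2, y)| <= 1 * |s1 - s2| for all score pairs on a grid.
grid = [-5.0 + 0.1 * k for k in range(101)]
for y in (-1, 1):
    for s1 in grid:
        for s2 in grid:
            assert abs(logistic_loss(s1, y) - logistic_loss(s2, y)) \
                <= abs(s1 - s2) + 1e-12
```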
□

Theorem 3 and Lemma 5 involve $L_1$ covering numbers, but our covering number bounds start with an $L_2$ metric in Lemma 2, so we need to switch from the $L_1$ to the $L_2$ metric. The following lemma uses the inequality $\|f - g\|_{L_1(\mu^m_X)} \leq \|f - g\|_{L_2(\mu^m_X)}$ (true by Jensen's inequality applied to norms) to relate the two.

Lemma 6 $N(\epsilon, A, \|\cdot\|_{L_1(\mu^m_X)}) \leq N(\epsilon, A, \|\cdot\|_{L_2(\mu^m_X)})$.

Proof See Lemma 10.5 in Anthony and Bartlett (1999) for a version. □

Finally, we can prove the main result.

Proof (of Theorem 1) In our setting, the loss function is logistic with Lipschitz constant $\mathcal{L} = 1$ (when viewed as a function of $f(x)$). The class of loss functions is thus defined by $l_{\mathcal{F}_0} := \{l : f_\lambda \in \mathcal{F}_0\}$. Each $l \in l_{\mathcal{F}_0}$ is also non-negative and bounded, as needed in the statement of Theorem 3. Starting from the expectation term on the right hand side of Theorem 3, using $\mathcal{F}_0$ as $G$, we get
$$\mathbb{E}\!\left[ N(\epsilon / 16, l_{\mathcal{F}_0}, \|\cdot\|_{L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})}) \right] \leq \sup_{\mu^m_{\mathcal{X} \times \mathcal{Y}}} N(\epsilon / 16, l_{\mathcal{F}_0}, \|\cdot\|_{L_1(\mu^m_{\mathcal{X} \times \mathcal{Y}})}) \quad \text{(bounding expectation by supremum)}$$
$$\leq \sup_{\mu^m_X} N\!\left( \frac{\epsilon}{16 \mathcal{L}}, \mathcal{F}_2, \|\cdot\|_{L_2(\mu^m_X)} \right) \quad \text{(from Lemmas 5, 6 and 1 respectively)}$$
$$\leq N\!\left( \frac{\epsilon}{16 \cdot 1 \cdot X_b}, B(0, B_b) \cap H_{\|a_{\text{budget}}\|_2^{-1}}, \|\cdot\|_2 \right) \quad \text{(from Lemma 2, substituting } \mathcal{L} = 1\text{)}$$
$$\leq \frac{Vol\!\left( B_{B_b + \frac{\epsilon}{32 X_b}} \cap H_{\|a_{\text{budget}}\|_2^{-1} + \frac{\epsilon}{32 X_b}} \right)}{Vol\!\left( B_{B_b + \frac{\epsilon}{32 X_b}} \right)} \left( \frac{32 B_b X_b}{\epsilon} + 1 \right)^d \quad \text{(from Theorem 2)}$$
$$= \alpha(d, a_{\text{budget}}(C_{\text{budget}})) \left( \frac{32 B_b X_b}{\epsilon} + 1 \right)^d \quad \text{(from Lemma 3)}.$$
The last step uses Lemma 3 together with the relation between a spherical cap and its complement,
$$Vol\!\left( B(0, B_b) \cap H'_{\|a_{\text{budget}}\|_2^{-1}} \right) = Vol(B(0, B_b)) - Vol\!\left( B(0, B_b) \cap H_{\|a_{\text{budget}}\|_2^{-1}} \right).$$
Using the derived inequality within Theorem 3 completes the proof. □

7 Future work

We suggest several avenues for future work.

– Other graph applications: The MLOC framework is a general tool that can help decision makers translate uncertainty in prediction into uncertainty in operational costs.
The ML&TRP itself is a specific application of the MLOC framework that can be applied to the power grid (as we did), but also to delivery truck routing and other physical routing problems; it can further be used for more abstract routing problems, such as network routing, where distances on the graph do not necessarily correspond to physical distances. In the future it would be interesting to explore some of these applications.

– Relaxing the cost constraints in the MLOC: Our generalization bound for the ML&TRP applied to a hypothesis space that was the intersection of an $\ell_2$ ball with a halfspace. It would be interesting to consider more general operational cost constraints, such as quadratic constraints and other convex functions. As it turns out, there are many applications where such constraints naturally arise. In current work, we are constructing bounds for these types of constraints, which lead to exotic hypothesis spaces, such as the intersection of an $\ell_2$ ball with an ellipsoid (for quadratic constraints) or with a general convex body (for convex constraints).

– Sequential MLOC: Currently the MLOC framework applies to one-shot decision problems. It would be interesting to extend it to sequential decision problems, perhaps where multiple decisions are made in a sequence of decision epochs and training data arrive incrementally. In this case, the baseline technique analogous to the "sequential process" would be a Markov decision process (MDP). The MLOC framework would then assist in understanding the reasonable range of costs for various sequential decision policies.

8 Conclusion

In this work, we evaluated the MLOC framework in the context of a real application and demonstrated improvements over current standards. In particular, we presented an application in the domain of transportation routing called the ML&TRP.
Our framework takes advantage of uncertainty in statistical modeling to explore the decision space and find potentially more practical solutions. We provided experiments quantifying the improvements and the scalability of the framework with respect to routing problem size. We also provided a generalization bound for the ML&TRP (and for the general MLOC framework) indicating that a prior belief about the operational cost can potentially benefit prediction ability in general.

References

Shivani Agarwal. Ranking on graph data. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Aaron Archer and Anna Blasiak. Improved approximation algorithms for the minimum latency problem via prize-collecting strolls. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 429–447, 2010.

Aaron Archer, Asaf Levin, and David P. Williamson. A faster, better approximation algorithm for the minimum latency problem. SIAM Journal on Computing, 37(5):1472–1498, 2008.

Sanjeev Arora and George Karakostas. A $2 + \epsilon$ approximation algorithm for the $k$-MST problem. Mathematical Programming, 107(3):491–504, 2006.

Fran Barbera, Helmut Schneider, and Peter Kelle. A condition based maintenance model with exponential failures and fixed inspection intervals. The Journal of the Operational Research Society, 47(8):1037–1045, 1996.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

Avrim Blum, Prasad Chalasani, Don Coppersmith, Bill Pulleyblank, Prabhakar Raghavan, and Madhu Sudan. On the minimum latency problem. ArXiv Mathematics e-prints, September 1994.
Pierre Bonami, Lorenz T. Biegler, Andrew R. Conn, Gérard Cornuéjols, Ignacio E. Grossmann, Carl D. Laird, Jon Lee, Andrea Lodi, François Margot, Nicolas W. Sawaya, and Andreas Wächter. An algorithmic framework for convex mixed integer nonlinear programs. Discrete Optimization, 5(2):186–204, 2008.

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

Imre Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, 1(Suppl.):205–237, 1984.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–50, 2002.

Şeyda Ertekin, Cynthia Rudin, and Tyler McCormick. Predicting power failures with reactive point processes. In Proceedings of AAAI Late Breaking Track, 2013.

Matteo Fischetti, Gilbert Laporte, and Silvano Martello. The delivery man problem and cumulative matroids. Operations Research, 41:1055–1064, November 1993.

Michel Goemans and Jon Kleinberg. An improved approximation ratio for the minimum latency problem. Mathematical Programming, 82:111–124, 1998.

Aiwina Heng, Andy C.C. Tan, Joseph Mathew, Neil Montgomery, Dragan Banjevic, and Andrew K.S. Jardine. Intelligent condition-based prediction of machinery reliability. Mechanical Systems and Signal Processing, 23(5):1600–1614, 2009.

Waltraud Huyer and Arnold Neumaier. Global optimization by multilevel coordinate search. Journal of Global Optimization, 14:331–355, June 1999.

Andrey Kolmogorov and Vladimir Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.

Miriam Lechmann. The traveling repairman problem – an overview. Diplomarbeit, Universität Wien, pages 1–79, 2009.

Shengqiao Li.
Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics, 4(1):66–70, 2011.

George G. Lorentz. Metric entropy and approximation. Bulletin of the American Mathematical Society, 72:903–937, 1966.

Marzio Marseguerra, Enrico Zio, and Luca Podofillini. Condition-based maintenance optimization by means of genetic algorithms and Monte Carlo simulation. Reliability Engineering & System Safety, 77(2):151–165, 2002.

Isabel Méndez-Díaz, Paula Zabala, and Abilio Lucena. A new formulation for the traveling deliveryman problem. Discrete Applied Mathematics, 156(17):3223–3237, 2008.

John Ashworth Nelder and Roger Mead. A simplex method for function minimization. Computer Journal, 7(4):308–313, 1965.

Gilles Pisier. The Volume of Convex Bodies and Banach Space Geometry, volume 94. Cambridge University Press, Cambridge, 1989.

David Pollard. Convergence of Stochastic Processes. Springer, 1984.

Luis Miguel Rios. Algorithms for derivative-free optimization. PhD thesis, University of Illinois at Urbana-Champaign, pages 1–133, 2009.

Cynthia Rudin, Rebecca Passonneau, Axinia Radeva, Haimonti Dutta, Steve Ierome, and Delfina Isaac. A process for predicting manhole events in Manhattan. Machine Learning, 80:1–31, 2010.

Cynthia Rudin, Rebecca Passonneau, Axinia Radeva, Steve Ierome, and Delfina Isaac. 21st-century data miners meet 19th-century electrical cables. IEEE Computer, 44(6):103–105, June 2011.

Cynthia Rudin, David Waltz, Roger N. Anderson, Albert Boulanger, Ansaf Salleb-Aouissi, Maggie Chow, Haimonti Dutta, Philip Gross, Bert Huang, Steve Ierome, Delfina Isaac, Arthur Kressner, Rebecca J. Passonneau, Axinia Radeva, and Leon Wu. Machine learning for the New York City power grid. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):328–345, February 2012.
Cynthia Rudin, Şeyda Ertekin, Rebecca Passonneau, Axinia Radeva, Ashish Tomar, Boyi Xie, Stanley Lewis, Mark Riddle, Debbie Pangsrivinij, and Tyler McCormick. Analytics for power grid distribution reliability in New York City. Accepted for publication, 2014.

Theja Tulabandhula and Cynthia Rudin. Machine learning with operational costs. Journal of Machine Learning Research, 14:1989–2028, 2013.

Theja Tulabandhula, Cynthia Rudin, and Patrick Jaillet. The machine learning and traveling repairman problem. In Proceedings of the Second International Conference on Algorithmic Decision Theory, 2011.

United States Department of Energy, Office of Electric Transmission and Distribution. Grid 2030: A national vision for electricity's second 100 years. Technical report, United States, July 2003.

Ian Urbina. Mandatory safety rules are proposed for electric utilities. New York Times, August 21, 2004. Late Edition, Section B, Column 3, Metropolitan Desk, Page 2.

C. A. van Eijl. A polyhedral approach to the delivery man problem. Technical report, Memorandum COSOR 95-19, Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands, 1995.

Andrés Weintraub, J. Aboud, C. Fernandez, G. Laporte, and E. Ramirez. An emergency vehicle dispatching system for an electric utility in Chile. Journal of the Operational Research Society, pages 690–696, 1999.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf. Ranking on data manifolds. In Advances in Neural Information Processing Systems 16, pages 169–176. MIT Press, 2004.