Lifted Graphical Models: A Survey

This article presents a survey of work on lifted graphical models. We review a general form for a lifted graphical model, a par-factor graph, and show how a number of existing statistical relational representations map to this formalism. We discuss i…

Authors: Lilyana Mihalkova, Lise Getoor

Lifted Graphical Models: A Survey
Lilyana Mihalkova & Lise Getoor
University of Maryland College Park
December 25, 2013

1 Motivation and Scope

Multi-relational data, in which entities of different types engage in a rich set of relations, is ubiquitous in many domains of current interest. For example, in social network analysis the entities are individuals who relate to one another via friendships, family ties, or collaborations; in molecular biology, one is frequently interested in modeling how a set of chemical substances, the entities, interact with, inhibit, or catalyze one another; in web and social media applications, a set of users interact with each other and with a set of web pages or other online resources, which may themselves be related via hyperlinks; in natural language processing tasks, it is often necessary to reason about the relationships between documents, or words within a sentence or a document. By incorporating such relational information into learning and reasoning, rather than relying solely on entity-specific attributes, it is usually possible to achieve higher predictive accuracy for an unobserved entity attribute, e.g., [SNB+08]. For example, by exploiting hyperlinks between web pages, one can improve categorization accuracy [CSN98]. Developing algorithms and representations that can effectively deal with relational information is also important because in many cases it is necessary to predict the existence of a relation between the entities. For example, in an online social network application, one may be interested in predicting friendship relations between people in order to suggest new friends to the users; in molecular biology domains, researchers may be interested in predicting how newly-developed substances interact.
Given the diversity of applications that involve learning from and reasoning about multi-relational information, it is not surprising that the field of statistical relational learning (SRL) [DGM04, FGM06, DK09, GT07, KRK+10] has recently experienced significant growth. This survey provides a detailed overview of developments in the field. We limit our discussion to representations that can be seen as defining a graphical model using a relational language, or alternatively as "lifted" analogs of graphical models. Although in this way we omit discussions of several important representations, such as stochastic logic programs [Mug96] and ProbLog [DKT07], which are based on imposing a probabilistic interpretation on logical reasoning, by limiting the scope of the survey we are able to provide a more focused and unified discussion of the representations that we do cover. For these and other models, we refer the reader to [DK03]. Because of the great variety of existing SRL applications, we cannot possibly do justice to all of them; therefore, the focus is on representations and techniques, and applications are mentioned in passing where they help illustrate our point.

The survey is structured as follows. In Section 2, we define SRL and introduce preliminaries. In Section 3, we describe several recently introduced SRL representations that are based on lifting a graphical model. Our goal in this section is to establish a unified view on the available representations by defining a generic, or template, SRL model and discussing how particular models implement its various aspects.
Concept                        Representation                    Example
Random variable (RV)           Upper-case letters                X, Y
Set of RVs                     Bold upper-case letters           X, Y
Value assigned to RV           Lower-case letters                x, y
Set of values assigned to RVs  Bold lower-case letters           x, y
Logical variable               Typewriter upper-case letters     X, Y
Entity/constant                Typewriter lower-case letters     x, y
Set of items other than RVs    Calligraphic upper-case letters   X, Y

Table 1: Notation used throughout this survey

In this way, we establish not just criteria for comparisons of the models, but also a common framework in which to discuss inference (Section 4), parameter learning (Section 5.1), and structure learning (Section 5.2) algorithms.

2 Preliminaries

2.1 What is SRL?

Statistical relational learning (SRL) studies knowledge representations and their accompanying learning and inference techniques that allow for efficient modeling and reasoning in noisy and uncertain multi-relational domains. In classical machine learning settings, the data consists of a single table of feature vectors, one for each entity in the data. A crucial assumption made is that the entities in the data represent independent and identically distributed (IID) samples from the general population. In contrast, multi-relational domains contain entities of potentially different types that engage in a variety of relations. Thus, a multi-relational domain can be seen as consisting of several tables: a set of attribute tables, one for each entity type, that contain feature-vector descriptions for each corresponding entity, and a set of relationship tables that establish relationships among two or more of the entities in the domain. As a consequence of the relationships among the entities, they are no longer independent, and the IID assumption is violated. A further characteristic of multi-relational domains is that they are typically noisy or uncertain.
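To make the multi-table view concrete, the following sketch (entity types, attribute names, and values are hypothetical) represents a tiny multi-relational domain as attribute tables plus a relationship table, and shows how a relationship makes entities non-IID:

```python
# Attribute tables: one per entity type, mapping entity -> feature description.
person_attrs = {
    "ann": {"age": 34},
    "bob": {"age": 29},
}
paper_attrs = {
    "p1": {"category": "AI"},
    "p2": {"category": "Bio"},
}

# Relationship table: tuples relating entities of (possibly) different types.
author_of = {("ann", "p1"), ("bob", "p1"), ("bob", "p2")}

# Entities are no longer independent: ann and bob are linked through paper p1.
coauthors = {(a1, a2) for (a1, p) in author_of
             for (a2, q) in author_of if p == q and a1 != a2}
```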
For example, there frequently is uncertainty regarding the presence or absence of a relation between a particular pair of entities. To summarize, an effective SRL representation needs to support the following two essential aspects: a) it needs to provide a language for expressing dependencies between different types of entities and their diverse relations; and b) it needs to allow for probabilistic reasoning in a potentially noisy environment.

2.2 Background and Notation

Here we establish the notation and terminology to be used in the rest of this survey. SRL draws both on probability theory and on logic programming, which sometimes use the same term to describe different concepts. For example, the word "variable" could mean a random variable (RV) or a logical variable. To avoid confusion, we distinguish between the different meanings using different fonts, as summarized in Table 1.

2.2.1 Terminology of Relational Languages

This section provides an overview of several commonly-used relational languages, with a focus on the aspects that are most important to the rest of our discussion.

First-order Logic. First-order logic (FOL) provides a flexible and expressive language for describing typed objects and relations. FOL distinguishes among four kinds of symbols: constants, variables, predicates, and functions [RN03]. Constants describe the objects in the domain, and we will alternatively call them entities. For example, in the notation of Table 1, x and y are entities. Entities are typically typed. Logical variables act as placeholders and allow for quantification, e.g., X and Y.
Predicates represent attributes or relationships and evaluate to true or false; e.g., Publication(paper, person), which establishes a relation between a paper and an author, and Category(paper, category), which provides the category of a paper, are predicates, and the strings in the parentheses specify the types of entities on which these predicates operate. Functions evaluate to an entity in the domain, e.g., MotherOf. We will adopt the convention that the names of predicates and functions start with a capital letter. The number of arguments of a predicate or a function is called its arity. A term is a constant, a variable, or a function on terms. A predicate applied to terms is called an atom, e.g., Publication(X, Y). A positive literal is an atom, and a negative literal is a negated atom. A formula consists of a set of positive or negative literals connected by conjunction (∧) or disjunction (∨) operators, e.g., ¬Friends(X, Y) ∨ Friends(Y, X). The variables in formulas are quantified, either by an existential quantifier (∃) or by a universal quantifier (∀). Here we follow the typical assumption that when no quantifier is specified for a variable, ∀ is understood by default. A formula expressed as a disjunction with at most one positive literal is called a Horn clause; if a Horn clause contains exactly one positive literal, then it is a definite clause. The positive literal in a definite clause is called the head, whereas the remaining literals constitute the body. Definite clauses can alternatively be rewritten as an implication, Body ⇒ Head. Terms, literals, and formulas are called grounded if they contain no variables. Otherwise, they are ungrounded. Grounding, also called instantiation, is carried out by replacing variables with constants in all possible type-consistent ways.
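As an illustrative sketch (the entity names and the helper function are hypothetical, not part of any particular SRL system), grounding can be implemented by substituting entities for logical variables in every type-consistent way:

```python
from itertools import product

# Typed entities in a small domain.
entities_by_type = {"person": ["ann", "bob"], "paper": ["p1"]}

def ground(formula_vars, var_types):
    """Enumerate all type-consistent substitutions for the logical variables."""
    domains = [entities_by_type[var_types[v]] for v in formula_vars]
    return [dict(zip(formula_vars, combo)) for combo in product(*domains)]

# Groundings of Publication(X, Y) with X of type paper and Y of type person.
groundings = ground(["X", "Y"], {"X": "paper", "Y": "person"})
# -> [{'X': 'p1', 'Y': 'ann'}, {'X': 'p1', 'Y': 'bob'}]
```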
The set of all groundings of a formula f will be denoted by G_f^C, where C is a set of constraints that specify which groundings are allowed. In general, if predicate and function arguments are typed, type constraints are present by default. To connect this FOL terminology to probability theory, we note that when the value of a ground atom is governed by a random process, it becomes a random variable with values in the set {true, false}. For example, let x and y represent two entities, i.e., specific individuals; then the grounded atom Friends(x, y) represents the assertion that x and y are friends. If we are given a probability distribution that governs the value of Friends(x, y), we can reason about it in the same way in which we reason about ordinary random variables. In addition, it is helpful to treat unground atoms as parameterized RVs [Poo03], in the sense that once their variables, or parameters, are replaced by constants, they become RVs. For example, if X and Y are logical variables, Friends(X, Y) is a parameterized RV because once we ground it by replacing the parameters X and Y with actual entities, we obtain RVs. We will refer to parameterized RVs as par-RVs for short; e.g., Friends(X, Y) is a par-RV.

Object-Oriented Representations. As an alternative to FOL, the attributes and relations of entities can be described using an object-oriented representation (OOR). Here again, x and y represent specific entities in the domain, whereas X and Y are variables, or entity placeholders. As in FOL, entities are typed. Attributes and relations are expressed using a notation analogous to that commonly used in object-oriented languages. For example, x.Category refers to the category of paper x, whereas x.Author refers to its authors. Inverse relations are also allowed; e.g., y.Author⁻¹ refers to the papers of which y is an author. Using this notation, chains of relations can be conveniently specified; e.g., x.Author.Author⁻¹.Category gives the set of categories of all papers written by the authors of x. Note that because the Author relation is typically one-to-many, in general, x.Author refers to a set of entities, i.e., the set of all authors of the paper. Because of this, object-oriented languages allow for aggregation functions, such as mean, mode, max, or sum. For example, we can write mode(x.Author.Author⁻¹.Category). As in FOL, OOR statements can be grounded, or instantiated, by replacing variables with entities from the domain. Analogously to FOL, we will view ungrounded relation/attribute chains, as well as aggregations thereof, as par-RVs.

Structured Query Language (SQL). It is natural to manipulate relational data, which is often stored in a relational database, using SQL. Thus, not surprisingly, SQL has been used as a representation in some of the SRL models discussed in this survey. For self-sufficiency, we provide a brief overview. The attributes and relations of objects can be viewed as defining a relational schema, in which an attribute table corresponds to each entity type and a relationship table corresponds to each relation type in which entities can engage. Here we review the select statement, which has been used to represent relational dependencies in SRL models. For our purposes, the most useful form of the select statement is expressed as follows:

SELECT ... FROM ... WHERE ...

2.2.2 Terminology of Probabilistic Graphical Models

SRL also draws heavily on graphical models. Therefore, we next introduce basic concepts from that area. For a detailed introduction to graphical models, we refer the reader to [KF09]. In general, to describe a probability distribution on n binary RVs, one needs to store 2^n parameters, one for each possible configuration of value assignments to the RVs.
However, sets of RVs are frequently conditionally independent of one another, and thus many of the parameters will be repeated. To avoid such redundancy of representation, several graphical models have been developed that explicitly represent conditional independencies. One of the most general representations is the factor graph [KFL01]. A factor graph consists of a tuple ⟨X, F⟩, where X is a set of RVs and F is a set of arbitrary but strictly positive functions called factors. It is typically drawn as a bipartite graph (Figure 1). The two partitions of vertices in the factor graph consist of the RVs in X and the factors in F, respectively. There is an edge between an RV X and a factor f if and only if X is necessary for the computation of f; i.e., each factor is connected to its arguments. As a result, the structure of a factor graph defines conditional independencies between the variables. In particular, a variable is conditionally independent of all variables with which it does not share factors, given the variables with which it participates in common factors.

Figure 1: Factor graph. Circular nodes correspond to variables, whereas square nodes correspond to factors. Variables are connected to the factors of which they are arguments.

A factor graph ⟨X, F⟩ defines a probability distribution over X as follows. Let x be a particular assignment of values to X. Then,

P(X = x) = (1/Z) ∏_{f ∈ F} f(x_f).

Above, x_f represents the values of those variables in X that are necessary for computing f's value. Z is a normalizing constant that sums over all possible value assignments y to X and is given by:

Z = ∑_y ∏_{f ∈ F} f(y_f).    (1)

As before, y_f represents the values of only those variables in y that are necessary to compute f. Factor graphs generalize two very common graphical models: Bayesian and Markov networks.
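As a concrete illustration of Equation 1 (a minimal sketch; the two variables and the factor values are made up), the following snippet builds a tiny factor graph over Boolean RVs and computes Z and a joint probability by brute-force enumeration:

```python
from itertools import product

# A factor graph over Boolean RVs A and B with two factors:
# f1 depends on A alone, f2 depends on both A and B.
variables = ["A", "B"]
factors = [
    (("A",),     lambda a:    2.0 if a else 1.0),       # f1(A)
    (("A", "B"), lambda a, b: 3.0 if a == b else 1.0),  # f2(A, B)
]

def unnormalized(assignment):
    """Product of all factors, each applied to the sub-assignment x_f it needs."""
    p = 1.0
    for args, f in factors:
        p *= f(*(assignment[v] for v in args))
    return p

# Z sums the factor product over all 2^n value assignments (Equation 1).
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in product([False, True], repeat=len(variables)))

p_true_true = unnormalized({"A": True, "B": True}) / Z
```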
A Bayesian network [Pea88] is represented as a directed acyclic graph whose vertices are the RVs in X. The probability distribution over X is specified by providing the conditional probability distribution for each node given the values of its parents. The simplest way of expressing these conditional probabilities is via conditional probability tables (CPTs), which list the probability associated with each configuration of values of the nodes. A Bayesian network can be converted to a factor graph in a straightforward way as follows. For each node X, we introduce a factor f_X to represent the conditional probability distribution of X given its parents. Thus, f_X is computed as a function of only X and its parents. In this case, the product is automatically normalized, i.e., the normalization constant Z is equal to 1.

A Markov network [Pea88] is an undirected graphical model whose nodes correspond to the variables in X. It computes the probability distribution over X as a product of strictly positive potential functions defined over cliques in the graph; i.e., for any set of variables that are connected in a maximal clique, there is a potential function that takes them as arguments. A convenient representation for potential functions is the log-linear model, in which each potential function φ(X_1 ... X_n) computed as a function of n variables X_1 ... X_n is represented as the exponentiated product exp(λ · f(X_1 ... X_n)). In this expression, λ ∈ R is a learnable parameter, and f is a feature that captures characteristics of the variables and can evaluate to any value in R. In general, there may be more than one potential function defined over a clique. In this way, a variety of feature functions, each with its own learnable parameter λ, can be defined for the same set of variables.
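To see why the factor product of a converted Bayesian network is automatically normalized, consider this sketch (the CPT numbers are made up) of a two-node network A → B, where each CPT becomes one factor:

```python
from itertools import product

# Hypothetical CPTs for a Bayesian network A -> B.
p_a = {True: 0.3, False: 0.7}                            # P(A)
p_b_given_a = {(True, True): 0.9, (False, True): 0.1,    # keys are (b, a):
               (True, False): 0.2, (False, False): 0.8}  # P(B = b | A = a)

# One factor per node: f_A(a) = P(a), f_B(a, b) = P(b | a).
def joint(a, b):
    return p_a[a] * p_b_given_a[(b, a)]

# Because each factor is itself a (conditional) distribution, Z = 1 already.
Z = sum(joint(a, b) for a, b in product([False, True], repeat=2))
```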
Markov networks map directly to factor graphs: to convert a Markov network to a factor graph, for each maximal clique in the Markov network, we include a factor that evaluates to the product of the potentials defined over that clique. The advantage of discussing factor graphs, rather than Bayesian networks, Markov networks, and others, is that by describing algorithms for factor graphs, we make them immediately available to any representation that can be viewed as a specialization of factor graphs. This is especially true of inference algorithms. On the other hand, it will be beneficial to discuss at least some aspects of learning techniques separately for directed and undirected models.

3 Overview of SRL models

Existing SRL representations can be split into two major groups. The first group consists of "lifted" graphical models, i.e., representations that use a structured language to define a probabilistic graphical model. Representations in the second group impose a probabilistic interpretation on logical inference. As already discussed, to allow for greater depth, here we limit ourselves to the first group of languages. To provide a convenient representation that describes the common core of "lifted" graphical models, we start with par-factor graphs, short for parameterized factor graphs, defining them in the terminology of [Poo03]. A par-factor graph is analogous to a factor graph [KFL01] in that it generalizes a large class of SRL models and allows us, as much as possible, to present a unified treatment of them regardless of whether they are based on directed or undirected representations.

3.1 Par-Factor Graphs

A par-factor graph is a "lifted" factor graph in the sense that, when instantiated, a par-factor graph defines a factor graph. It consists of a set of par-factors.
Each par-factor is represented as a triple (A, φ, C), where A is a set of parameterized random variables, φ is a function that operates on these variables and evaluates to a strictly positive value, and C is a set of constraints on how the variables may be instantiated. The par-factor graph is then just a set of par-factors F = {(A_i, φ_i, C_i)}. Each of the "lifted" graphical model representations can be viewed as derived from this generic par-factor graph by specifying the language used to express the φ_i-s, the A_i-s, and the C_i-s. The probability distribution defined by a par-factor graph is given by the following expression:

P(X = x) = F(x) = (1/Z) ∏_{Φ_i ∈ F} Φ_i(x) = (1/Z) ∏_{Φ_i ∈ F} ∏_{g ∈ G_{Φ_i(A_i)}^{C_i}} g(x_g)    (2)

The last expression considers each par-factor in the graph and multiplies together the values of all of its instantiations g ∈ G_{Φ_i(A_i)}^{C_i} on x_g, the subset of the given assignment x needed to evaluate g. In this expression, g(x_g) uses just those values from x that are relevant to its computation, i.e., those that correspond to instantiations of the parameterized RVs in A_i. If we compare Equations 1 and 2, we see that, indeed, by instantiating a par-factor graph, we obtain a factor graph. However, here all the factors that are instantiations of the same par-factor share common structure and parameters; thus, par-factor graphs allow for better generalization.

In the remainder of this section, we flesh out this description by considering several popular SRL representations, discussing how they can be viewed as specific instantiations of par-factor graphs, and grouping them according to the type of graphical model they define, i.e., directed, undirected, or hybrid. This is not meant to be an exhaustive list; rather, our goal is to highlight some of the different flavors of representations.
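A minimal sketch of this instantiation process (predicate names, the potential, and the constraint are all illustrative): one par-factor over the par-RV Friends(X, Y), with a constraint X ≠ Y, unrolls into one ground factor per allowed grounding, all sharing the same potential φ:

```python
from itertools import product

entities = ["a", "b"]

# One par-factor (A, phi, C): A = {Friends(X, Y)}, a shared potential phi,
# and the constraint C that X != Y (no self-friendship).
def phi(friends_value):             # shared across ALL instantiations
    return 4.0 if friends_value else 1.0

constraint = lambda x, y: x != y    # the constraint set C

# Instantiating the par-factor yields one ground factor per allowed grounding.
groundings = [(x, y) for x, y in product(entities, repeat=2) if constraint(x, y)]

def unnormalized(world):
    """world maps each ground atom ('Friends', x, y) to True/False."""
    p = 1.0
    for (x, y) in groundings:
        p *= phi(world[("Friends", x, y)])  # parameter tying: same phi each time
    return p

world = {("Friends", "a", "b"): True, ("Friends", "b", "a"): False}
```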
3.2 Undirected SRL Representations

This subsection describes SRL representations that can be viewed as "lifted" undirected graphical models.

Relational Markov Networks. As their name suggests, relational Markov networks (RMNs) [TAK02] define Markov networks through a relational representation. RMNs use an object-oriented language and SQL (see Section 2.2.1) to specify par-factors. In particular, each par-factor Φ = (A, φ, C) is defined using an SQL select statement in which the select...from part establishes the par-RVs in A, and the where part establishes the constraints C over instantiations. The RVs resulting from instantiating the par-RVs are multinomial, i.e., they can take on one of multiple discrete values. Each tuple returned by the select statement constitutes an instantiation of the par-factor; thus, the variables that appear in a returned tuple form a clique in the Markov network defined by the RMN. All of the cliques that are instantiations of the same par-factor share a potential function φ, which takes a log-linear form. In particular, let A be a particular instantiation of the par-RVs in A, i.e., A is one of the returned tuples. Then, for specific values a assigned to the variables in A, φ(A = a) = exp(λ · f(a)), where, as in Markov networks, f is an arbitrary feature function over A. Note, however, that unlike in Markov networks, here the potential function φ, i.e., its feature function and parameter, is shared across all instantiations of the par-factor Φ. This property is common to all of the SRL languages we consider here and is, in fact, one of their main defining characteristics: through relational languages, they allow for generalization via parameter tying. To illustrate, we provide an example from collective classification of hyperlinked documents, as presented in [TAK02].
The goal of the following par-factor is to set up a clique between the label assignments of any two hyperlinked documents, in order to capture the intuition that documents on the web that link to one another typically have correlated labels.

SELECT D1.Category, D2.Category
FROM Document D1, Document D2, Link L
WHERE L.From = D1.Key AND L.To = D2.Key

This select statement sets up the cliques but does not specify the potential function φ to be used over them. The definition of φ can be used to incorporate further domain knowledge in the model. For example, if we know that most pages tend to link to pages of the same category, we can define φ(D1.Category, D2.Category) = exp(λ · 1[D1.Category = D2.Category]), where the feature function takes the form of the indicator function 1[x], which returns 1 if the proposition x is true and 0 otherwise. A positive λ encourages hyperlinked pages to be assigned the same category, while a negative λ discourages this. Taskar et al. [TCK04] have shown that for associative RMNs, in which the factors favor the same labels for RVs in the same clique, inference and learning are tractable for binary random variables and closely approximated in the non-binary case.

Markov Logic Networks. Markov logic networks (MLNs) [RD06, DL09] also define a Markov network when instantiated. Par-factors in MLNs are specified using first-order logic. Each par-factor Φ = (A, φ, C) is represented by a first-order logic rule R_Φ with an attached weight λ_Φ. A consists of all par-RVs that appear in the rule; therefore, in the instantiated Markov network, each instantiation, or grounding, of R_Φ establishes a clique among the RVs that appear in that instantiation. The instantiated RVs are Boolean-valued. The potential function φ is implicit in the rule, as we describe next.
Let A be the set of RVs in a particular instantiation, and a be a particular assignment of truth values to A; then, φ(A = a) = exp(λ_Φ · R_Φ(a)), where R_Φ(a) = 1 if the rule is true on the given truth assignment a, and R_Φ(a) = 0 otherwise. In other words, clique potentials in MLNs are represented using log-linear functions in which the first-order logic rule itself acts as a feature function, whereas the weight associated with the rule is the parameter.

As an illustration, we present an example from [RD06], in which the patterns of human interactions and smoking habits are considered. One regularity in this domain is that friends have similar smoking habits, i.e., that if two people are friends, then they will either both be smokers or both be non-smokers. This can be captured with the following first-order logic rule (where λ is the weight associated with it):

λ : Friends(X, Y) ⇒ (Smokes(X) ⇔ Smokes(Y))

The par-RVs in the par-factor defined by this rule are A = {Friends(X, Y), Smokes(X), Smokes(Y)}, and every possible instantiation of these par-RVs establishes a clique in the instantiated Markov network. E.g., if there are only two entities, a and b, then the instantiated Markov network will contain the cliques {Friends(a, a), Smokes(a)}, {Friends(a, b), Smokes(a), Smokes(b)}, {Friends(b, a), Smokes(a), Smokes(b)}, and {Friends(b, b), Smokes(b)} among the RVs Smokes(a), Smokes(b), Friends(a, a), Friends(a, b), Friends(b, a), and Friends(b, b).

So far, we have not discussed how MLNs specify the constraints C of a par-factor. MLNs do not have a special mechanism for describing constraints, but constraints can be implicit in the rule structure. Two ways of doing this are as follows. First, we can allow only instantiations that ground a given variable with a specific entity by replacing this variable with the entity name.
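This grounding translates into a distribution as follows (a minimal sketch; the entity names and weight value are illustrative): the unnormalized weight of a world is exp(λ · n), where n is the number of satisfied ground instances of the rule:

```python
import math
from itertools import product

entities = ["a", "b"]
lam = 1.5  # illustrative rule weight

def rule(friends_xy, smokes_x, smokes_y):
    """Friends(X,Y) => (Smokes(X) <=> Smokes(Y)), as a 0/1 feature."""
    return 1 if (not friends_xy) or (smokes_x == smokes_y) else 0

def weight_of(world):
    """Unnormalized log-linear weight: exp(lam * #satisfied groundings)."""
    n = sum(rule(world["Friends", x, y], world["Smokes", x], world["Smokes", y])
            for x, y in product(entities, repeat=2))
    return math.exp(lam * n)

# A world where everyone is friends and both smoke: all 4 groundings satisfied.
world = {("Friends", x, y): True for x, y in product(entities, repeat=2)}
world.update({("Smokes", "a"): True, ("Smokes", "b"): True})
```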
For example, if we want to constrain X = a in the rule above, we simply write it as Friends(a, Y) ⇒ (Smokes(a) ⇔ Smokes(Y)). A second and more general way of introducing constraints is via predicates whose values will be known at inference time, as in discriminatively trained models. For example, suppose we know that at inference time we will observe as evidence the truth values of all groundings of the Friends predicate, and the goal will be to infer people's smoking habits. Then, the rule Friends(X, Y) ⇒ (Smokes(X) ⇔ Smokes(Y)) can be seen as setting up a clique between the Smokes values only of entities that are friends. This is because, if for a particular pair of entities x and y, Friends(x, y) is false, then the corresponding instantiation of the rule is trivially satisfied, regardless of the assignments to groundings of Smokes. Therefore, such instantiations can be ignored when instantiating the MLN.

A variant of MLNs is the Hybrid MLN model [WD08], which extends MLNs to allow for real-valued predicates. In Hybrid MLNs, the same formula can contain both binary-valued and real-valued terms. Such formulas are evaluated by interpreting conjunction as a multiplication of values.

Probabilistic Similarity Logic. Another "lifted" Markov network model is probabilistic similarity logic (PSL) [BMG09], which allows for reasoning about similarities. In PSL, par-factors are defined as weighted rules expressed in a mix of first-order logic and object-oriented languages. Thus, similarly to MLNs, each par-factor Φ = (A, φ, C) consists of a rule R_Φ with attached weight λ_Φ, whose par-RVs determine the set A and whose structure determines the potential function φ. However, unlike in RMNs and MLNs, the RVs resulting from instantiating a par-RV are continuous-valued in the interval [0, 1]. Thus, instantiating a set of PSL rules results in a continuous-valued Markov network.
Constraints in C can be specified in a similar manner as in MLNs. Unlike the previous models, PSL additionally supports reasoning about similarities, both between entity attributes and between sets of entities. Similarity functions can be any real-valued functions whose range is the interval [0, 1], and in formulas they can be mixed with regular relational terms. To illustrate, consider an example from [BMG09] in which the task is to infer document similarities in Wikipedia based on document attributes and user interactions with the documents. One potentially useful rule states that two documents are similar if the sets of their editors are similar and their text is similar:

{A.editor} ≈_s1 {B.editor} ∧ A.text ≈_s2 B.text ⇒ A ≈_s3 B

Above, the ≈_si represent similarity functions, and a term enclosed in curly braces, as in {A.editor}, refers to the set of all entities related to the variable through the relation. The par-RVs in the above rule are A = {{A.editor} ≈_s1 {B.editor}, A.text ≈_s2 B.text, A ≈_s3 B}, and, as before, the rule defines a clique among each possible instantiation of these par-RVs. The evaluation R_Φ(a) of a rule R_Φ on an assignment a to an instantiation A of the par-RVs involves combining Boolean values and similarity values. In PSL, this is done using t-norms/t-conorms, which generalize the first-order logic operations of conjunction/disjunction. While any t-norm/t-conorm pair can be used, in [BMG09] the Lukasiewicz t-(co)norm was preferred due to its property of being linear in the values combined. Letting x, y ∈ [0, 1] be two Boolean or similarity values, the Lukasiewicz t-(co)norm is defined as follows:

x ∧ y = max{0, x + y − 1}
x ∨ y = min{x + y, 1}
¬x = 1 − x

Thus, each instantiated rule has a value in the interval [0, 1], and one can talk about the distance to satisfaction of a rule instantiation and define it as d(R_Φ(a)) = 1 − R_Φ(a).
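These operations are straightforward to implement. The sketch below (with made-up similarity values) evaluates a rule of the form p ∧ q ⇒ r, encoded as in classical logic as ¬(p ∧ q) ∨ r, and computes its distance to satisfaction:

```python
def t_and(x, y):  # Lukasiewicz t-norm (conjunction)
    return max(0.0, x + y - 1.0)

def t_or(x, y):   # Lukasiewicz t-conorm (disjunction)
    return min(x + y, 1.0)

def t_not(x):     # negation
    return 1.0 - x

def implies(body, head):
    """body => head, rewritten as (not body) or head."""
    return t_or(t_not(body), head)

# Illustrative similarity values for the two body atoms and the head atom.
editors_sim, text_sim, doc_sim = 0.9, 0.8, 0.4

rule_value = implies(t_and(editors_sim, text_sim), doc_sim)
distance_to_satisfaction = 1.0 - rule_value
```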
PSL generalizes the standard formulation of the joint distribution (e.g., Equation 2) by interpreting it as a penalty on the distance from satisfaction of a set of PSL rules on a given assignment of values to the instantiated RVs. The probability of observing a given joint assignment of values is given by:

P(X = x) = (1/Z) exp(−d_δ(x))

Above, δ is an arbitrary distance function, and d_δ(x) is the distance-from-satisfaction function, computed as δ(V(x), 0), where V(x) is the vector of weighted distances to satisfaction of all rule instantiations. If we pick δ to be the L1-norm distance, δ(x, y) = ‖x − y‖_1, we get back the standard formulation, in which the potential function φ associated with a clique is given by φ(A = a) = exp(−λ_Φ · d(R_Φ(a))).

Imperatively Defined Factor Graphs. Par-factor graphs can also be defined using an imperative programming language, as is done in FACTORIE, an implementation of imperatively defined factor graphs (IDFs) [MSS09]. FACTORIE uses Scala, a strongly-typed functional programming language, and in this way provides the model designer with enormous flexibility. In FACTORIE, variable types are represented as typed objects that can be sub-classed, and potential relations between variables of particular types are represented as instance variables in the corresponding classes. In this way, novel types of variables, such as set variables, can be defined, and standard data structures, such as linked lists, can be used in variable type implementations. Each par-factor Φ = (A, φ, C) is defined as a factor template class, whose arguments determine the set of par-RVs A. The factor template class contains a set of unroll procedures, one for each par-RV, that return the set of instantiations of this par-factor corresponding to a given instantiation of one of the arguments. The instantiation constraints C are therefore encoded in the unroll methods.
The potential function φ is implemented via a statistics method in the factor template class and thus can have arbitrary form (as long as it returns a strictly positive value). Consider a simplified version of an example from [MSS09] that defines a factor template for the task of coreference resolution. The goal of the factor template is to evaluate the compatibility of an entity Mention and the canonical representation of the particular underlying Entity to which it is assigned. The unroll1 method produces a factor between a given mention and the entity to which it is assigned. Given an entity, the unroll2 method produces a set of factors, one for each mention associated with this entity.

val corefTemplate = new Template[Mention, Entity] {
  def unroll1(m: Mention) = Factor(m, m.entity)
  def unroll2(e: Entity) =
    for (mention <- e.mentions) yield Factor(mention, e)
  def statistics(m: Mention, e: Entity) =
    Bool(distance(m.string, e.canonical) < 0.5)
}

3.3 Directed SRL Representations

This subsection describes SRL representations that define Bayesian networks when instantiated. Par-factors in these representations have a special form. In particular, for a par-factor Φ = (A, φ, C), A consists of a child par-RV C ∈ A and a set of parent par-RVs Pa = A \ {C}. Because the function φ represents a conditional probability distribution for C given Pa, the expression in Equation 2 is automatically normalized, i.e., Z = 1. When specifying directed SRL models, care must be taken to ensure that their instantiations in different worlds result in cycle-free directed graphs. However, as discussed in Section 3.2.1 of [Jae02], this problem is undecidable in general, and guarantees exist only for restricted cases.

Bayesian Logic Programs.
In Bayesian logic programs (BLPs) [KD01], par-RVs are expressed as logical atoms, and the dependency structure of C on its parents Pa is represented as a definite clause, called a Bayesian clause, in which the head consists of C, the body consists of Pa, and the implication is replaced by a | to indicate probabilistic dependency. A further distinction between ordinary logical clauses and Bayesian clauses is that in the latter, logical atoms are not restricted to evaluating to just true or false, as illustrated in the example below. Par-factors are formed by coupling a Bayesian clause with a conditional probability distribution (CPD) over values for C given values for Pa. Kersting and De Raedt [KD01] give an example from genetics (originally from [FGKP99]), in which the blood type bt(X) of person X depends on the inheritance of a single gene, one copy of which, mc(X), is inherited from X's mother (mother(Y, X)), while the other copy, pc(X), is inherited from X's father (father(Z, X)). In BLPs this dependency is expressed as

bt(X) | mc(X), pc(X)

In this example, mc(X) and pc(X) can take on values from {a, b, 0}, whereas bt(X) can take on values from {a, b, ab, 0}. The specification of this par-factor is completed by providing, for each possible combination of values assigned to mc(X) and pc(X), the probability distribution over values of bt(X), e.g., as a conditional probability table.

Using BLPs, we next illustrate another aspect common to most directed SRL representations: the use of combining rules. Following the example from [KD01], suppose that in the genetics domain we have the following two rules:

bt(X) | mc(X)
bt(X) | pc(X)

Each of these rules comes with a CPD, the first for bt(X) given mc(X) and the second for bt(X) given pc(X). However, what we need is a single CPD for predicting bt(X) given both of these quantities.
Such a CPD is obtained by using combining rules, which are functions mapping one or more CPDs to a single CPD. For example, one commonly used combining function is the noisy-or.

Relational Bayesian Networks. Relational Bayesian networks (RBNs) [Jae02] also represent par-RVs as logical atoms. A separate par-factor Φ_R is defined for each unknown relation R in the domain, such that the child par-RV C is an atom of R. Par-factors are represented as probabilistic formulas in a syntax that bears a close correspondence to first-order logic. The conditional probability distribution of C given Pa is implicit in this probabilistic formula, which has range [0, 1] and is evaluated as a function of the values of the variables in Pa. In particular, probabilistic formulas in RBNs are recursively defined to consist of (i) constants in [0, 1], which in the extreme cases of 1 and 0 correspond to true and false respectively; (ii) indicator functions, which take tuples of logical variables as arguments and correspond to relational atoms; (iii) convex combinations of formulas, which correspond to Boolean operations on formulas; and, finally, (iv) combination functions, such as mean, that combine the values of several formulas. To illustrate, consider a slight adaptation of an example from [Jae02], where the task is, given the pedigree of an individual x, to reason about the values of two relations, FA(x) and MA(x), which indicate whether x has inherited a dominant allele A from its father and mother respectively.
The probabilistic formula for FA(X) may be:

Φ_FA(X) = Φ_knownFather(X) · Φ_A-from-father(X) + (1 − Φ_knownFather(X)) · θ

Above, Φ_knownFather(X) is an auxiliary sub-formula that evaluates to true if the father of X is included in the pedigree and to false otherwise; Φ_A-from-father(X) is an auxiliary sub-formula defined as the mean over the FA and MA values of X's father:

Φ_A-from-father(X) = mean{ FA(Y), MA(Y) | father(Y, X) }

and θ is a learnable parameter that can take values in the range [0, 1]. As in some of the undirected models discussed earlier, RBNs do not provide a dedicated mechanism for specifying the constraints C. However, constraints may be specified either by replacing logical variables with actual entity names in formulas, or by including tests on the background information, as is the case with the Φ_knownFather sub-formula above.

Probabilistic Relational Models. Probabilistic relational models (PRMs) [KP98, GFK+07] take a relational database perspective and use an object-oriented language, akin to that described in Section 2.2.1, to specify the schema of a relational domain. Both entities and relations are represented as classes, each of which comes with a set of descriptive attributes and a set of reference slots through which classes refer to one another. Using an example from [GFK+07], consider a document citation domain that consists of two classes: the Paper class, with attributes Paper.Topic and Paper.Words, and the Cites class, which establishes a citation relation between two papers via the reference slots Cites.Cited and Cites.Citing. The par-RVs in PRMs correspond to attributes of objects, possibly reached through chains of reference slots. Each par-factor is defined by specifying the par-RVs corresponding to the child node C and the parent nodes Pa respectively, and providing a conditional probability distribution for C given Pa.
For example, one possible par-factor in the document citation domain may set C = P.Topic and Pa = {P.Cited.Topic, P.Citing.Topic}. Thus, this par-factor establishes a dependency between the topic of a paper and the topics of papers that it cites or that cite it. As with other directed SRL models, PRMs use aggregation functions in cases when the number of parents obtained from instantiating a par-factor varies. For instance, since papers cite varying numbers of papers, one would need to aggregate over the topics of papers corresponding to the instantiations of P.Cited.Topic.

In the above example, reasoning is performed only about the attributes of objects, whereas their relations are assumed given. However, in some applications it may be necessary to reason about the presence of a relation between two objects. For example, there may be uncertainty regarding Cites.Citing. PRMs provide extensions for dealing with two such situations: when the number of links is known but not the specific objects that are linked (reference uncertainty), and when neither the number of links nor the linked objects are known (existence uncertainty).

BLOG. BLOG, short for Bayesian LOGic, is a relational language for specifying generative models [MMR+05]. Par-RVs in BLOG are represented as first-order logic atoms. The dependence of a par-RV on its parents is expressed by listing the parent par-RVs and specifying the distribution from which the child is drawn, given the parents. For example, Milch et al. [MMR+05] present a BLOG model for entity resolution. This model views the set of citations of a given paper as being drawn uniformly at random from the set of known publications. This is captured by the following BLOG statement:

PubCited(C) ∼ Uniform({Publication P})
Similarly, the citation string is viewed as being generated by a string corruption model CitDistrib as a function of the authors and title of the paper being cited:

CitString(C) ∼ CitDistrib(TitleString(C), AuthorString(C))

A unique characteristic of BLOG is that it does not assume that the set of entities in a domain is known in advance and instead allows reasoning over variable numbers of entities. This functionality is supported by allowing number statements, in which the number of entities of a given type is drawn from a given distribution. For example, in the entity resolution task, the number of researchers is not known in advance and is instead drawn from a user-defined distribution. Standard distributions such as the Poisson can also be used.

3.4 Directed versus Undirected Models

Most SRL representations discussed in this survey define either a directed or an undirected graphical model when instantiated. These representations have relative advantages and disadvantages, analogous to those of directed and undirected graphical models [KF09]. In terms of representation, directed models are appropriate when one needs to express a causal dependence. On the other hand, undirected models are better suited to domains containing cyclic dependencies. While causal structure might be easier to elicit from experts, when model structure is learned from data, special care needs to be taken with directed models to ensure acyclicity. Structure revisions in directed models can be evaluated much faster for typical scoring functions, which are decomposable, because parameters change locally, only in the places where the dependency structure has changed. In other words, when the set of parents of a node C changes, the only parameters that typically need to be adjusted are those of the conditional distribution of C given its new set of parents.
This contrasts with undirected models, in which scoring the revision of a single par-factor requires adjusting the parameters of the entire model. Issues pertaining to structure learning are further discussed in Section 5.2. Undirected models have a straightforward mechanism for combining the par-factors shared by a single par-RV, simply by multiplying them, whereas directed models require the use of separately defined combining functions, such as noisy-or, or aggregation functions, such as count, mode, max, and average. On the other hand, the use of combining functions in directed models allows multiple independent causes of a given par-RV to be learned separately and then combined at prediction time [HB94]; this kind of causal independence cannot be exploited in undirected models. Finally, because factors in directed models represent conditional probability distributions, they are automatically normalized. In contrast, in undirected models one needs to find efficient ways of computing, or estimating, the normalization constant Z (in Equation 2).

3.4.1 Hybrid SRL Representations

Hybrid SRL representations combine the positive aspects of directed and undirected models. One such model is relational dependency networks (RDNs) [NJ07], which can be viewed as a lifted dependency network model. Dependency networks [HCM+00] are similar to Bayesian networks in that, for each variable X, they contain a factor f_X that represents the conditional probability distribution of X given its parents, or immediate neighbors, Pa_X. Unlike Bayesian networks, however, dependency networks can contain cycles and do not necessarily represent a coherent joint probability distribution. Marginals are recovered via sampling, e.g., Gibbs sampling (see Section 4). In this respect, dependency networks are similar to Markov networks, i.e., the set of parents Pa_X of a variable X renders it independent of all other variables in the network.
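Marginal recovery by Gibbs sampling in a cyclic dependency network can be sketched as follows; the two-variable network and its conditional distributions are hypothetical, chosen only to exhibit the cycle that dependency networks permit:

```python
import random

# Toy dependency network over Boolean variables A and B, each conditioned
# on the other (a cycle). The conditionals below are assumed for
# illustration; they need not correspond to any coherent joint distribution.

def p_a_given_b(b):
    return 0.8 if b else 0.3   # P(A = 1 | B = b)

def p_b_given_a(a):
    return 0.7 if a else 0.2   # P(B = 1 | A = a)

def gibbs_marginals(n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(A = 1) and P(B = 1) by repeatedly resampling each
    variable from its conditional given the current value of the other."""
    rng = random.Random(seed)
    a, b = 0, 0                 # arbitrary initialization
    count_a = count_b = 0
    for t in range(burn_in + n_samples):
        a = 1 if rng.random() < p_a_given_b(b) else 0
        b = 1 if rng.random() < p_b_given_a(a) else 0
        if t >= burn_in:
            count_a += a
            count_b += b
    return count_a / n_samples, count_b / n_samples

print(gibbs_marginals())
```

For these particular conditionals the sampler's stationary distribution gives P(A = 1) ≈ 0.53 and P(B = 1) ≈ 0.47, which the estimates approach as the number of samples grows.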
RDNs [NJ07] lift dependency networks to relational domains. Par-factors in RDNs are similar to those in PRMs and are represented as conditional probability distributions over values for a child par-RV C and the set of its parents Pa. Analogous to dependency networks, however, cycles are allowed, and thus, as in dependency networks, RDNs do not always represent a consistent joint probability distribution. There has also been an effort to unify directed and undirected models by providing an algorithm that converts a given directed model to an equivalent MLN [NKL+10]. In this way, one can model multiple causes of the same variable independently while taking advantage of the variety of inference algorithms that have been implemented for MLNs. Bridging directed and undirected models is important also as a step towards representations that combine both directed and undirected sub-components.

4 Inference

An SRL model can be used to draw two types of inferences, which are analogous to those supported by graphical models. In the first type, the goal is to compute the marginal probabilities of variables; we will refer to this as computing marginals. In the second type, MAP (maximum a posteriori) inference, the task is to find the most likely joint assignment of values to the unknown variables, also known as the MAP state. This section is structured as follows. We start with an overview of algorithms that directly port inference techniques developed in the graphical models literature by instantiating the given SRL model and operating over the resulting factor graph. Then, we survey "lifted" inference approaches that exploit the symmetries present in an SRL model. Stronger emphasis will be placed on approaches developed specifically for SRL representations.
4.1 Inference on the Instantiated Factor Graph

One advantage of the SRL representations discussed in this survey is that, since they are instantiated to factor graphs, inference algorithms developed in the graphical models literature can be directly ported. One of the earliest techniques used to efficiently instantiate a given SRL model in a given domain is knowledge-based model construction (KBMC) [WBG92], which dynamically instantiates a knowledge base only to the extent necessary to answer a particular query of interest. KBMC has been adapted to instantiate both directed, e.g., [KP97, PKMT99, GFKT02], and undirected models, e.g., [RD06]. Applications of KBMC in these and other frameworks exploit the conditional independence properties implied by the graph structure of the instantiated model; in particular, the fact that in answering a query about a set of random variables X, one only needs to reason about variables that are not rendered conditionally independent of X given the values of observed variables. Next, we briefly review some of the most commonly used inference techniques. For more detail, we refer the reader to [KF09].

Variable Elimination. One of the earliest and simplest algorithms that can be used to perform exact inference over the grounded factor graph is variable elimination (VE) [PZ03]. Suppose we would like to compute the marginal probability of a particular instantiation X of a particular par-RV (the instantiation of a par-RV is an ordinary RV). To do that, we need to sum out all the other variables (i.e., all other instantiations of par-RVs), which we will call Y. VE proceeds in iterations, summing out variables from the set Y one by one. An ordering over the variables in Y is established, and in each iteration the next Y ∈ Y is selected, and the set of factors is split into two groups: the ones that contain Y and the ones that do not.
All factors containing Y are multiplied together and Y is summed out, thus effectively removing Y from Y. The efficiency of the algorithm is affected by the ordering over Y that is used; heuristics for selecting better orderings are available. In the end, the result is normalized. This algorithm can be adapted to find the MAP state by maximizing over variables rather than summing over them. In particular, suppose we would like to find the most likely joint assignment to a set of variables X. As before, we impose an ordering over X and proceed in iterations, this time, however, eliminating each variable X ∈ X by maximizing the product of all factors that contain X and remembering which value of X gave the maximum value.

Belief Propagation. Another algorithm that can be used to compute marginals over a factor graph is belief propagation (BP) [Pea88], also known as the sum-product algorithm [KFL01] because it consists of a series of summations and products. BP computes the marginals exactly on graphs that contain no cycles. A special case that frequently arises in practice is that of chain graphs, on which BP is known as the forward-backward algorithm [Rab89]. BP's operation is based on a series of "messages" that are sent from variable nodes to factor nodes and vice versa. For a complete derivation of these messages, we refer the reader to [KFL01]; here we provide only the result, using the notation of that paper. Messages sent from a variable node X to a factor node f are denoted µ_{X→f}(X), and messages sent from a factor node f to a variable node X are denoted µ_{f→X}(X) (note that both are functions of X). The messages sent are defined as shown below. In the following expressions, n(X) denotes the set of neighbors of node X, i.e.
these are the factors in which it participates; analogously, n(f) is the set of variables used to calculate the factor f; and ∼{X} denotes the set of all variables except X:

µ_{X→f}(X) = ∏_{h ∈ n(X)\{f}} µ_{h→X}(X)    (3)

µ_{f→X}(X) = ∑_{∼{X}} [ f(n(f)) · ∏_{Y ∈ n(f)\{X}} µ_{Y→f}(Y) ]    (4)

In other words, before sending a message to a neighbor, a variable waits to receive messages from all of its other neighbors, and then simply sends the product of these messages. If the variable is a leaf node, thus having a single neighbor, it sends it a trivial message of 1. Similarly, before a factor sends a message to a variable, it waits to receive messages from all other variables and then multiplies these messages together with itself and sums out all variables except the one to which the message is sent. If the factor is a leaf node, it simply sends itself. In the end, the marginal at a variable X is calculated as the product of all incoming messages from the neighboring factors. BP can also be used on loopy graphs, where it is run for a sequence of iterations. Although in such cases it is not guaranteed to output correct results, in practice it frequently converges and, when this happens, the values obtained are typically correct [MWJ99, YFW01]. As in variable elimination, BP can be easily adapted to compute the MAP state by replacing the summation operator in Equation 4 with a maximization operator. This is called the max-product algorithm, or, if the underlying graph is a chain, the Viterbi algorithm.

Sampling. Exact inference is intractable in general. An alternative approach is to perform approximate inference based on sampling. One of the most popular techniques is Gibbs sampling, a Markov chain Monte Carlo (MCMC) sampling algorithm. At the onset, Gibbs sampling initializes the unknown variables. This can be done randomly, but faster convergence may be obtained with more carefully picked values, i.e.
ones that result in a MAP state. Sampling then proceeds in iterations, where in each iteration a new value for one of the unknown variables is sampled, given the current assignments to the remaining variables. Under general conditions, Gibbs sampling converges to the target distribution [Tie94]. One of these conditions, which may often be broken in practice, is ergodicity: the requirement that each state (i.e., a particular configuration of assignments to the variables) can be aperiodically reached from each other state. Ergodicity is violated when the domain contains deterministic or near-deterministic dependencies, in which case sampling becomes stuck in one region and converges to an incorrect result. One way of avoiding this problem is to jointly sample new values for blocks, or groups, of variables with closely coordinated assignments. Another way of overcoming the challenge of deterministic or near-deterministic dependencies is to perform slice sampling [DWW99]. Intuitively, in slice sampling, auxiliary variables are used to identify "slices" that "cut" across the modes of the distribution. By sampling uniformly from a slice, this technique allows sampling to jump across modes, thus preventing it from getting stuck in a single region. Slice sampling was used to derive the MC-SAT algorithm [PD06], which performs slice sampling in MLNs. MC-SAT identifies the slice from which to sample as the set of all possible variable assignments that satisfy an appropriately selected set of grounded clauses (which in MLNs define the factors), where the probability of selecting a grounded clause is larger for clauses with larger weights. MC-SAT samples (near-)uniformly from this slice using the SampleSAT algorithm [WES04]. An orthogonal concern is the efficiency of sampling. One approach to speeding up sampling is to use memoization [Pfe07], in which values of past samples are stored and reused instead of generating a new sample.
If care is taken to keep reuses independent of one another, the accuracy of sampling can be improved by allowing the sampler to draw a larger number of samples in the allotted time.

Weighted Satisfiability. MAP inference with MLNs is equivalent to finding a joint assignment to the par-RV instantiations such that the weight of satisfied formula instantiations is maximized. In other words, performing MAP inference in MLNs is equivalent to solving a weighted satisfiability problem, as discussed by Richardson and Domingos [RD06], who used the MaxWalkSat algorithm [KSJ97].

(Integer) Linear Programming. MAP inference can also be performed by solving an integer linear program constructed from the given factor graph. In the most general construction, e.g., [KF09, Tas04], a Boolean-valued variable v_f^j is introduced into the program for each factor f and each possible assignment of values j to the RVs X_f involved in the computation of f. Constraints are added to enforce the conditions that (1) for each factor f, only one of the v_f^j is set to 1 at any given time, and (2) the values of v_{f1}^j and v_{f2}^{j'}, where f1 and f2 share variables, are consistent. The MAP state is found by maximizing the objective

log ∏_f f(x_f) = ∑_{f,j} v_f^j log f(x_f = j)

under these constraints. Specializations of this general procedure for generating constraints exist for the factor graphs obtained by instantiating MLNs and PSL. A specialized procedure for casting MAP inference in MLNs as an integer linear program is provided in [Rie08]. The resulting linear program contains a variable y_X for each par-RV instantiation X whose value is unknown, as well as a variable λ_f for each instantiation of each formula (corresponding to each factor in the instantiated factor graph).
The formula instantiations are simplified by replacing par-RVs whose values are known with their observed values (e.g., true or false) and replacing all other par-RV instantiations with their corresponding variables y. The correspondence between a formula instantiation f and its variable λ_f is established by requiring logical equivalence between f and λ_f, effectively rewriting each f as λ_f ⇔ f. These rewritten formulas are then converted to conjunctive normal form, and each disjunction is transformed into a linear constraint. As an example of how this latter step is carried out, consider the disjunction ¬X ∨ Y. A positive literal contributes its variable and a negated literal contributes one minus its variable, and the disjunction holds exactly when these contributions sum to at least 1; the corresponding linear constraint is therefore (1 − X) + Y ≥ 1, i.e., −X + Y ≥ 0.

In PSL, MAP inference is cast as a second-order cone program, as follows [BMG09]. As with MLNs, the program contains a variable v_X for each unknown par-RV instantiation X and a rule variable v_R for each PSL rule instantiation R. All par-RV instantiations are replaced with their corresponding variables, and the correspondence between rule variables and their corresponding instantiated rules is established by including the constraint v_R ≥ d(R(a)), where d(R(a)) represents the distance to satisfaction of rule instantiation R, as defined in the description of PSL above. Additional (hard) constraints can also be included. Because par-RV instantiations in PSL have continuous, rather than Boolean, values, the variables in the optimization are not constrained to be integers, as is the case for MLNs. As a result, under appropriate choices of t-(co)norms and similarity functions, the second-order cone program can be solved in polynomial time, as opposed to the integer linear programming case, which is known to be NP-hard [Sch98].
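The clause-to-constraint step can be sketched in a few lines of Python; the representation of clauses as (variable, sign) pairs is a hypothetical simplification of the procedure in [Rie08]:

```python
# Convert a CNF disjunction into a linear constraint: a positive literal
# contributes y, a negated literal contributes 1 - y, and the clause is
# satisfied iff the contributions sum to at least 1. Constants from
# negated literals are moved to the right-hand side.

def clause_to_constraint(clause):
    """clause: list of (variable_name, is_positive) pairs.
    Returns (coeffs, bound) encoding  sum_v coeffs[v] * y_v >= bound."""
    coeffs = {}
    bound = 1
    for var, positive in clause:
        if positive:
            coeffs[var] = coeffs.get(var, 0) + 1
        else:
            coeffs[var] = coeffs.get(var, 0) - 1
            bound -= 1          # move the constant 1 to the right-hand side
    return coeffs, bound

# The example from the text: NOT X OR Y  becomes  -X + Y >= 0
print(clause_to_constraint([("X", False), ("Y", True)]))
# → ({'X': -1, 'Y': 1}, 0)
```

Applying this to every clause of every rewritten formula yields the constraint set of the integer linear program.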
4.2 Improving Inference Efficiency

Cutting Plane Inference. As described above, the procedures for casting MAP inference in MLNs and PSL as an optimization problem are naive in the sense that they first fully instantiate the given formulas or rules and then convert them to constraints. In reality, the efficiency of both procedures is significantly improved by making use of a default value (false in the case of MLNs, 0.0 in the case of PSL) and observing that most rule instantiations, and thus their corresponding constraints, will be satisfied by setting par-RV instantiations to their default value. In other words, the only constraints that need to be included in the optimization problem are those corresponding to rule instantiations that are not yet fully satisfied by the current assignment of values to the unknown par-RV instantiations. This realization leads to an iterative procedure [Rie08, BMG09], whereby a series of optimization problems is solved, each subsequent problem including only those additional constraints that are not yet satisfied by the current assignment of values to the par-RV instantiations. Riedel [Rie08] relates this procedure to cutting plane algorithms developed in the operations research community. In cutting plane optimization, rather than solving the original problem, one first optimizes a smaller problem that contains only a subset of the original constraints. The algorithm then proceeds in iterations, each time adding to the active constraints those constraints from the original problem that are not satisfied by the current solution. This process continues until a solution that satisfies all constraints is found. In the worst case, this may require considering all constraints; in many cases, however, a solution can be found by considering only a small subset of them.
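The cutting-plane loop itself is simple; the sketch below uses a hypothetical `objective_solve` routine standing in for the underlying (integer) linear program solver, and a toy instance in place of a real grounded model:

```python
# Generic cutting-plane skeleton: solve with only the active constraints,
# then add whichever original constraints the current solution violates,
# and repeat until no violations remain.

def cutting_plane(objective_solve, all_constraints, is_violated):
    """objective_solve(active) -> solution optimal w.r.t. the active
    constraints; is_violated(c, solution) -> bool."""
    active = []
    while True:
        solution = objective_solve(active)
        new = [c for c in all_constraints
               if c not in active and is_violated(c, solution)]
        if not new:            # all original constraints hold: done
            return solution
        active.extend(new)     # add the violated ("cut") constraints

# Toy instance: maximize x over {0, ..., 10} subject to upper bounds.
bounds = [7, 5]
solve = lambda active: min([10] + active)
violated = lambda c, x: x > c
print(cutting_plane(solve, bounds, violated))  # → 5
```

In the MLN/PSL setting, `all_constraints` corresponds to the (potentially huge) set of rule instantiations, and the hope is that only a small fraction ever become active.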
Lazy Inference. Lazy inference is a related meta-inference technique, based on the fact that relational domains are typically sparse, i.e., that very few of all possible relations actually hold. Lazy inference was originally developed to improve the memory efficiency of MAP inference with MaxWalkSat for MLNs, resulting in the LazySAT algorithm [SD06]. LazySAT maintains sets of active RVs and active formula instantiations. Only the active RVs and formulas are explicitly maintained in memory, thus dramatically decreasing the memory requirements of inference. Initially, all RVs are set to false, and the set of active RVs consists of all RVs participating in formula instantiations that are not satisfied by this initial assignment. A formula instantiation is activated if it can be made unsatisfied by flipping the values of zero or more active RVs. Thus, the initial set of active formula instantiations consists of those activated by the initially active RVs. The algorithm then carries on with the iterations of MaxWalkSat, activating RVs when their values get flipped and then activating the relevant rule instantiations. LazySAT was later generalized to other inference techniques, such as sampling [PDS08].

Faster Instantiation. The efficiency of inference with the techniques described in Section 4.1 is affected by how quickly the par-factor graph is grounded to a factor graph. For the case of MLNs, Shavlik and Natarajan [SN09] introduced FROG, an algorithm that preprocesses a given MLN to improve the efficiency of instantiation.
The basic idea is that if an evidence literal in a grounded clause is satisfied, then the clause is already true regardless of the values of the remaining literals and can be excluded from consideration; thus, when grounding a clause, one only needs to consider variable substitutions that lead to evidence literals not being satisfied, which in many cases results in a significantly smaller number of instantiations. FROG employs a set of heuristics to identify groups of variable substitutions that can be safely ignored.

4.3 Lifted Inference

By performing inference on the instantiated factor graph, one can take advantage of the many available inference techniques that have been extensively studied in the graphical models literature. However, particularly on larger problems, such an approach can be prohibitive in terms of both memory requirements and running time. To address this issue, a variety of "lifted" inference techniques have been developed. Lifted inference exploits the observation that factor graphs obtained by grounding a set of par-factors exhibit a large degree of symmetry, which would lead to repeating the same set of computations multiple times. By organizing these computations in a way that exploits the symmetries and avoids repetition, lifted inference techniques can achieve large efficiency gains. The earliest techniques are based on recognizing identical structure that would result in repeated identical computations, performing the computation only the first time, caching the results, and subsequently reusing them [KP97, PKMT99]. Next, we describe several lifted inference approaches, organizing them according to the underlying inference algorithm they use.

Lifted Variable Elimination. First-order variable elimination (FOVE) was introduced by [Poo03] and later significantly extended in a series of works [dAR05, dAR06, MZK+08].
As in ordinary VE, the goal of FOVE is to obtain the marginal distribution over a set of variables Q by summing out the values of the remaining ones. However, unlike VE, FOVE sums over entire sets of variables, such as the possibly constrained set of all groundings of a par-RV. Thus, the crux of FOVE is to define operations that eliminate, or sum out, entire sets of groundings of par-RVs in such a way that the result is the same as the one that would be obtained by summing out each RV individually. Next, we provide a brief discussion of the several elimination operations that have been defined and summarize the conditions under which they apply. For a more detailed treatment, we refer the reader to the above papers; a more unified treatment is presented in [dAR07], and an excellent basic introduction with examples is presented in [KP09a]. All of the elimination operations assume the following two conditions on the model and use auxiliary operations to achieve them. First, the par-factors in the model need to be shattered [dAR05], which means that for any two par-factors in the model, the sets of groundings of their par-RVs under the constraints are either identical or completely disjoint. Intuitively, this is necessary in order to ensure that the same reasoning steps will apply to all of the factors resulting from grounding a given par-factor. A model is shattered using the splitting operation [Poo03]. Second, there should be just one par-factor in the model containing the par-RV that is being eliminated. This is achieved using the fusion operation [dAR05], which essentially multiplies together all par-factors that depend on the par-RV being eliminated. To facilitate the remainder of this discussion, let q be the par-RV we are trying to eliminate, and φ the par-factor that depends on q.
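For intuition, the propositional sum-out step that these lifted operators generalize can be sketched as follows; the table-based factor representation is a hypothetical simplification:

```python
# Propositional sum-out: to eliminate a variable, add up the factor entries
# that agree on all other variables. Factors are dicts mapping assignment
# tuples (over an ordered list of Boolean variables, the scope) to
# non-negative reals.

def sum_out(var, scope, factor):
    """Eliminate `var` from `factor`, returning the new scope and table."""
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    result = {}
    for assignment, value in factor.items():
        reduced = assignment[:i] + assignment[i + 1:]
        result[reduced] = result.get(reduced, 0.0) + value
    return new_scope, result

# Eliminate Y from a factor over (X, Y).
f = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.3}
print(sum_out("Y", ["X", "Y"], f))  # marginal over X: 0.5 for each value
```

FOVE's contribution is to perform the effect of many such sum-outs at once, over all groundings of a par-RV, without ever materializing the grounded tables.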
Elimination Operators

The first elimination operation is inversion elimination, introduced by [Poo03] (the conditions for its correctness were completely specified in [dAR05]). Inversion elimination applies only when a one-to-one correspondence can be established between the groundings of q and those of φ. This condition is violated when the logical variables that appear in q are different from the logical variables in φ. For example, suppose that q depends on the logical variable X and φ depends on the par-RVs { p(X, Y), q(X) }. Inversion elimination would not work in this case because q does not depend on the logical variable Y. In a nutshell, inversion elimination takes a sum of products, where the sum is over all groundings of q and the products are over all possible substitutions to the variables in φ, and simplifies it to a product of sums, where each sum now depends on the number of possible truth assignments to q (e.g., true or false), which is in general much smaller than the total number of its groundings. Another elimination operation is counting elimination [dAR05], which is based on the insight that frequently the factors resulting from grounding φ form a few large groups with identical members. These groups can be easily identified by considering the possible truth assignments to the groundings of φ's arguments. For each possible truth assignment to the grounded arguments, counting elimination counts the number of groundings that would have that truth assignment. Then only one grounding from each group needs to be evaluated, and the result exponentiated to the total number of groundings in that group. To be able to count the number of groundings resulting in a particular truth assignment efficiently, counting elimination requires that the choice of substitutions when grounding one of the par-RVs of φ does not constrain the choice of substitutions for any of the other ones.
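The sum-of-products to product-of-sums rearrangement behind inversion elimination can be checked numerically. In this toy sketch (the potential values are invented for illustration), a single par-factor touches each grounding of a Boolean par-RV q(X), so the ground sum over all 2^n joint assignments collapses into an n-th power of a single two-term sum:

```python
import math
from itertools import product

f = {True: 2.0, False: 3.0}   # invented potential values for one grounding of q(X)
n = 10                        # groundings q(x1), ..., q(xn)

# Ground: sum over all 2^n joint assignments of the product over groundings.
brute = sum(math.prod(f[v] for v in assignment)
            for assignment in product([True, False], repeat=n))

# Lifted (inversion): swap the sum and the product -> (f(True) + f(False)) ** n.
lifted = (f[True] + f[False]) ** n

print(brute == lifted)  # True: both equal 5**10 = 9765625.0
```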
Finally, although we have described counting elimination in the context of eliminating the groundings of just one par-RV (q), in fact it can be used to eliminate a set of par-RVs. [MZK+08] introduced elimination with counting formulas, which exploits exchangeability between the parameterized random variables on which a given par-factor depends. In particular, this property is observed when the par-factor is a function of the number of arguments with a particular value, rather than the precise identity of those arguments. The extended algorithm is called C-FOVE. As we discussed in Section 3.4, directed models may require aggregation over a set of values. This may happen, for example, when there is a par-factor in which the parent par-RVs contain variables that do not appear in the child par-RV. In order to aggregate over such variables in a lifted fashion, [KP09b] introduced aggregation par-factors and defined a procedure via which an aggregation par-factor is converted to a product of two par-factors, one of which involves a counting formula. In this way, they are able to handle aggregation using C-FOVE.

Starting from an Instantiated Graph

FOVE and its extensions apply when an ungrounded par-factor graph is provided, and their goal during inference is to consider as few grounded cases as possible. An alternative setting is when an instantiated factor graph is given, and the goal is to recognize the symmetries that are present in it in order to avoid repeated computation. Thus, in the first setting the potential symmetries are implicit in the specification of the ungrounded model and the task is to find out how the presence of evidence breaks these symmetries, whereas in the second scenario only the grounded model is provided, and the task is to recover the symmetries. This latter case naturally occurs when querying a probabilistic database, e.g., [SDG09b], and was studied by [SDG08].
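The exchangeability that counting formulas exploit can also be verified numerically. In this toy sketch (the potential g is invented), a par-factor depends only on the number of true groundings of q(X), so the 2^n joint assignments collapse into n + 1 count bins weighted by binomial coefficients:

```python
import math
from itertools import product

n = 12

def g(k):
    # Invented potential that depends only on the COUNT of true groundings,
    # not on which particular groundings are true (exchangeability).
    return 1.5 ** k + 0.5

# Ground: sum over all 2^n assignments, evaluating g on the number of trues.
brute = sum(g(sum(assignment)) for assignment in product([0, 1], repeat=n))

# Lifted (counting formula): histogram over counts, weighted binomially.
lifted = sum(math.comb(n, k) * g(k) for k in range(n + 1))

print(abs(brute - lifted) < 1e-6)  # True
```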
Their method identifies shared factors, which compute the same function, and is based on the observation that if the inputs of two shared factors have the same values, their outputs will also be the same. Shared factors are discovered by constructing an rv-elim graph, which simulates the operation of VE without actually computing factor values. The rv-elim graph provides a convenient way of identifying shared factors whose computations can be carried out once, cached, and reused later. All lifted VE algorithms described so far carry out exact computations. Speed of inference can be further improved by performing approximate inference. [SDG09a] extended their algorithm to this setting by relaxing the conditions for considering two factors to be shared. One way of doing this is to declare two factors shared if their last k computations, as simulated in the rv-elim graph, are the same. Thus, this approach is based on the intuition that the effect of more distant influences is relatively small. Another approximation scheme used by [SDG09a] is to place factors together in bins if the values they compute are closer than some threshold.

Lifted Belief Propagation

Lifted BP algorithms [JMF07, SD08, KAN09, dNB+09] proceed in two stages. In the first stage, the grounded factor graph F is compressed into a "template" graph T, in which super-nodes represent groups of variable or factor nodes that send and receive the same messages during BP. Two super-nodes are connected by a super-edge if any of their respective members in F are connected by an edge, and the weight of the super-edge equals the number of ordinary edges it represents. Once the template graph T is constructed, BP is run over it with trivial modifications.
The message sent from a variable super-node X to a factor super-node f is given by

\mu_{X \to f}(X) = \mu_{f \to X}(X)^{w(X,f)-1} \prod_{g \in n(X) \setminus \{f\}} \mu_{g \to X}(X)^{w(X,g)} \qquad (5)

In the above expression, w(X, f) is the weight of the super-edge between X and f, and n(X) is the set of neighbors of X. The message sent from a factor super-node f to a variable super-node X is given by

\mu_{f \to X}(X) = \sum_{\sim \{X\}} \Big( f(n(f)) \, \mu_{X \to f}(X)^{w(f,X)-1} \prod_{Y \in n(f) \setminus \{X\}} \mu_{Y \to f}(Y)^{w(Y,f)} \Big) \qquad (6)

At this point, we invite the reader to compare Equations 5 and 6 to their counterparts in the standard case, Equations 3 and 4. We note that the lifted case is almost identical to the standard one, except for the super-edge weight exponents. Next, we describe how the template factor graph is constructed. The first algorithm was given by Jaimovich et al. [JMF07]. This algorithm targets the scenario when no evidence is provided and is based on the insight that in this case, factor nodes and variable nodes can be grouped into types such that two factor/variable nodes are of the same type if they are groundings of the same par-factor/parameterized variable. The lack of evidence ensures that the grounded factor graph is completely symmetrical, and any two nodes of the same type have identical local neighborhoods, i.e., they have the same numbers of neighbors of each type. As a result, using induction on the iterations of loopy BP, it can be seen that all nodes of the same type send and receive identical messages. As pointed out by the authors [JMF07], the main limitation of this algorithm is that it requires that no evidence be provided, and so it is mostly useful during learning, when the data likelihood in the absence of evidence is computed. Singla and Domingos [SD08] built upon the algorithm of Jaimovich et al. [JMF07] and introduced lifted BP for the general case when evidence is provided.
In the absence of evidence, their algorithm reduces to that of Jaimovich et al. When evidence is present, the construction of the template graph is a bit more complex and proceeds in stages that simulate BP to determine how the propagation of the evidence affects the types of messages that get sent. Initially, there are three variable super-nodes, containing the true, false, and unknown variables respectively. In subsequent iterations, super-nodes are continually refined as follows. First, factor super-nodes are further separated into types such that the factor nodes of each type are functions of the same set of variable super-nodes. Then the variable super-nodes are refined such that variable nodes have the same type if they participate in the same numbers of factor super-nodes of each type. This process is guaranteed to converge, at which point the minimal (i.e., least granular) template factor graph is obtained. Kersting et al. [KAN09] provide a generalized and simplified description of [SD08]'s algorithm, casting it in terms of general factor graphs, rather than factor graphs defined by probabilistic logical languages, as was done by Singla and Domingos. Finally, lifted BP has been extended to the any-time case [dNB+09], which combines the approach of [SD08] with that of [MK09].

Clustering

An alternative approach is to cluster the RVs in the instantiated factor graph by the similarity of their neighborhoods, and then compute the marginal probability of only one representative of each cluster, assigning the result to all members of the cluster. This approach was taken with the BAM algorithm [MR09], where the neighborhood of each RV X was restricted by setting the values of RVs at a given distance from X to their MAP values, thus cutting off the influence of neighbors that are further away. RVs whose neighborhoods to the given depth are identical are then clustered.
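The super-node construction can be sketched as a signature-refinement ("color-passing") loop over a bipartite factor graph. This minimal version is our own simplification (it ignores factor-argument order and does not compute super-edge weights); it groups variable nodes that would send identical BP messages:

```python
def partition(colors):
    """Group nodes by color; returns a canonical sorted list of groups."""
    groups = {}
    for node, c in colors.items():
        groups.setdefault(c, []).append(node)
    return sorted(sorted(g) for g in groups.values())

def compress(variables, factors, evidence):
    """Refine variable colors until the induced partition stops changing."""
    var_color = {v: str(evidence.get(v)) for v in variables}  # seed with evidence
    prev = None
    while prev != partition(var_color):
        prev = partition(var_color)
        # Factor signature = multiset of its variables' current colors.
        fac_color = {f: "|".join(sorted(var_color[v] for v in args))
                     for f, args in factors.items()}
        # Variable signature = own color + multiset of adjacent factor signatures.
        var_color = {v: var_color[v] + "#" + "|".join(
                         sorted(fac_color[f] for f, args in factors.items()
                                if v in args))
                     for v in variables}
    return prev

# Toy pairwise chain a - b - c with identical factors and no evidence:
groups = compress(["a", "b", "c"],
                  {"f_ab": ["a", "b"], "f_bc": ["b", "c"]},
                  evidence={})
print(groups)  # [['a', 'c'], ['b']] : the two symmetric end nodes share a super-node
```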
5 Learning

Analogous to learning of graphical models, learning of par-factor graphs can be decomposed into structure learning and parameter learning. Structure learning entails discovering the dependency structure of the model, i.e., what par-RVs should participate together in par-factors. On the other hand, parameter learning involves finding an appropriate parameterization of these par-factors. Parameter learning typically has to be performed multiple times during structure learning in order to score candidate structures. In some applications, the designer hard-codes the structure of the par-factors as background knowledge, and training consists only of parameter learning.

5.1 Parameter Learning

Algorithms for parameter learning of graphical models can be extended in a straightforward way for parameter learning of "lifted" graphical models. This extension is based on the realization that an instantiated par-factor graph is simply a factor graph in which subsets of the factors, namely the ones that are instantiations of the same par-factor, have identical parameters. The standard terminology for such par-factors that share their parameters is to say that their parameters are tied. Thus, in its most basic form, parameter learning in par-factor graphs can be reduced to parameter learning in factor graphs by forcing factors that are instantiations of the same par-factor to have their parameters tied. While a complete treatment of parameter learning in graphical models is beyond the scope of this survey, we next provide a brief overview of basic approaches and discuss how they can be easily extended to allow for learning with tied parameters. For a detailed discussion, we refer the reader to [KF09]. We consider the case of fully observed training data.
Maximum likelihood parameter estimation (MLE) aims at finding values for the parameters such that the probability of observing the training data D is maximized, i.e., we are interested in finding values \theta^* such that \theta^* = \arg\max_{\theta} P(D \mid F_{\theta}), where F_{\theta} is a factor graph parameterized by \theta. We now find it helpful to consider directed and undirected models separately. In the directed case, e.g., Bayesian networks, parameter learning consists of learning a conditional probability distribution (CPD) function for each node given its parents. Thus, in the simplest scenario, \theta consists of a set of conditional probability tables, one for each node. The maximum likelihood estimate for the entry of a node A taking on a value a, given that its parents B have values b, is found simply by calculating the proportion of times that configuration of values is observed in D:

P^{\mathrm{MLE}}_{D}(A = a \mid B = b) = \frac{\mathrm{count}_D(A = a, B = b)}{\sum_{a'} \mathrm{count}_D(A = a', B = b)} \qquad (7)

In undirected models, the situation is slightly more complicated because the MLE parameters cannot be calculated in closed form, and one needs to use gradient descent or some other optimization procedure. Supposing that, as introduced in Section 2.2.2, our representation is a log-linear model with one parameter per factor, the gradient of the log-likelihood \ell with respect to a parameter \theta_i of a potential function \phi_i is given by:

\frac{\partial \ell}{\partial \theta_i} = \phi_i(D) - E_{\theta_i}[\phi_i] \qquad (8)

Above, E_{\theta_i}[\phi_i] is the expected value of \phi_i according to the current estimate for \theta_i. We next describe how Equations 7 and 8 are extended to work with tied parameters. In a nutshell, this is done by computing statistics over all factors that share a set of parameters. In directed models, factors with tied parameters share their CPDs. Thus, in this case, the counts in Equation 7 are computed not just for a single node but for all nodes that share the CPD. Let \mathcal{A} be that set of nodes, and let B_A be the set of parents of node A.
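Equation 7 is just a normalized count. A minimal sketch, using an invented two-node dataset for a network B → A:

```python
from collections import Counter

# Tiny invented dataset of (A, B) observations for a network B -> A.
data = [("a1", "b1"), ("a1", "b1"), ("a2", "b1"),
        ("a1", "b2"), ("a2", "b2"), ("a2", "b2")]

counts = Counter(data)

def cpd(a, b):
    # Equation 7: count(A=a, B=b) / sum over a' of count(A=a', B=b)
    denom = sum(c for (_, b2), c in counts.items() if b2 == b)
    return counts[(a, b)] / denom

print(cpd("a1", "b1"))  # 2/3: "a1" appears in 2 of the 3 rows with B=b1
print(cpd("a2", "b2"))  # 2/3
```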
Then, for all A \in \mathcal{A}, Equation 7 becomes:

P^{\mathrm{MLE}}_{D}(A = a \mid B_A = b) = \frac{\sum_{A \in \mathcal{A}} \mathrm{count}_D(A = a, B_A = b)}{\sum_{A \in \mathcal{A}} \sum_{a'} \mathrm{count}_D(A = a', B_A = b)} \qquad (9)

Analogously, in the undirected case, Equation 8 is modified to sum over all potentials in the set \Phi_{\theta_i} that share a parameter \theta_i:

\frac{\partial \ell}{\partial \theta_i} = \sum_{\phi \in \Phi_{\theta_i}} \phi(D) - E_{\theta_i}\Big[\sum_{\phi \in \Phi_{\theta_i}} \phi\Big] \qquad (10)

One issue that arises when learning the parameters of an SRL model as described above is computing the sufficient statistics, e.g., the counts in Equation 9 and the sums in Equation 10. Models that are based on a database representation can take advantage of database operations to compute sufficient statistics efficiently. For example, in PRMs, the computation of sufficient statistics is cast as the construction of an appropriate view of the data and then running simple database queries on it [Get02]. Caching is used to achieve further speed-ups. While the above discussion focused on one particular learning criterion, that of maximum likelihood estimation, in practice other criteria exist. For example, rather than optimizing the data likelihood, one can dramatically improve efficiency by instead optimizing the pseudo-likelihood [Bes75]. The pseudo-likelihood is computed by multiplying the probability of each variable, conditioned on the values of its Markov blanket observed in the data.

Algorithm 1: Generic Structure Learning Algorithm
Input:  H: hypothesis space, possibly encoding language bias
        A: search algorithm
        ρ: refinement operator
        S: scoring function
Output: set of par-factors
Procedure:
 1: S_0 ← generateInitialCandidates()
 2: while moreTime && observeImprovements do
 3:   S_i ← generateCandidateRefinements(S_{i-1}, ρ, A, H)
 4:   for each s ∈ S_i do
 5:     add S(s) to Scores(S_i)
 6:   end for
 7:   S_i ← Prune(S_i, Scores(S_i), A)
 8: end while
 9: return bestModel(S_last, A, S)
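The pooled counts of Equation 9 extend the single-node estimate to a set of nodes with a shared CPD. A minimal sketch, with an invented dataset for two tied nodes:

```python
from collections import Counter

# Two nodes A1, A2 share one CPD (tied parameters); each row is (value, parent_value).
obs_A1 = [("a1", "b1"), ("a2", "b1"), ("a1", "b2")]
obs_A2 = [("a1", "b1"), ("a1", "b1"), ("a2", "b2")]

# Equation 9: sum the counts over the whole tied set before normalizing.
pooled = Counter(obs_A1) + Counter(obs_A2)

def tied_cpd(a, b):
    denom = sum(c for (_, b2), c in pooled.items() if b2 == b)
    return pooled[(a, b)] / denom

print(tied_cpd("a1", "b1"))  # 3/4: 3 of the 4 pooled rows with parent b1 have value a1
```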
As another example, to reduce overfitting, one may use Bayesian learning and impose a prior over the learned parameters [Hec99, KF09]. Parameter learning in PRMs, both with respect to a maximum likelihood criterion and a Bayesian criterion, is discussed in [Get02]. Lowd and Domingos present a comparison of several parameter learning methods for MLNs [LD07]. A max-margin parameter learning approach is presented in [HM09] and later extended to train parameters in an online fashion [HM11].

5.2 Structure Learning

The goal of structure learning is to find the skeleton of dependencies and regularities that make up the set of par-factors. Structure learning in SRL builds heavily on corresponding work in graphical models and inductive logic programming (ILP). The basic structure learning procedure can be viewed as a greedy heuristic search through the space of possible structures. A generic structure learning procedure is shown in Algorithm 1. This procedure is parameterized by a hypothesis space H, which could potentially encode a language bias; a search algorithm A, e.g., beam search; a refinement operator ρ, which specifies how new structures are derived from a given one; and a scoring function S, which assigns a score to a given structure. The algorithm starts by generating an initial set of candidates. This typically consists of trivial par-factors, e.g., ones consisting of single par-RVs. It then proceeds in iterations, in each iteration refining the existing candidates, scoring them, and pruning from the current set of candidates ones that do not appear promising. The details of how these steps are carried out depend on the particular choices for A, ρ, and S. The refinement operator ρ is language-specific, but it typically allows for several kinds of simple incremental changes, such as the addition or removal of parents of a par-RV.
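The generic procedure of Algorithm 1 can be sketched as a greedy refine-score-prune loop. Everything in this sketch is our own illustration, not a specific system: structures are sets of "par-RV" names, the scoring function is invented, and pruning is a fixed-size beam:

```python
def structure_search(initial, refine, score, beam=2, max_iters=5):
    """Generic greedy structure search in the shape of Algorithm 1."""
    candidates = list(initial)
    best = max(candidates, key=score)
    for _ in range(max_iters):
        # Generate candidate refinements of every surviving structure.
        refinements = [r for s in candidates for r in refine(s)]
        if not refinements:
            break
        # Score and prune: keep only the `beam` highest-scoring structures.
        candidates = sorted(refinements, key=score, reverse=True)[:beam]
        if score(candidates[0]) <= score(best):  # no improvement: stop
            break
        best = candidates[0]
    return best

# Toy example: the (made-up) score rewards including exactly {"p", "q"}.
target = {"p", "q"}
vocab = ["p", "q", "r"]
refine = lambda s: [s | {v} for v in vocab if v not in s]
score = lambda s: len(s & target) - len(s - target)

best = structure_search([frozenset()], refine, score)
print(sorted(best))  # ['p', 'q']
```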
Common choices for S are (pseudo) log-likelihood or related measures that can be directly ported from the graphical models literature. Algorithm 1 is directly analogous to approaches for learning in graphical models, such as [DPDPL97, Hec99], as well as to approaches developed in ILP, such as the FOIL algorithm [Qui90]. Variants of this algorithm, adapted to the particular SRL representation, have been used by several authors. For example, for the directed case, an instantiation of this algorithm to learning PRMs is described in [Get02]. In this case, the generateCandidateRefinements method checks for acyclicity in the resulting structure and employs classic revision operators, such as adding, deleting, or reversing an edge. In addition to the greedy hill-climbing algorithm that always prefers high-scoring structures over lower-scoring ones, Getoor [Get02] presents a randomized technique with a simulated annealing flavor, where at the beginning of learning the structure search procedure takes random steps with some probability p and greedy steps with probability 1 − p. As learning progresses, p is decreased, gradually steering learning away from random choices. For the case of undirected models, Kok and Domingos [KD05] introduced a version of this algorithm for learning MLN structure. Their algorithm proceeds in iterations, each time searching for the best clause to add to the model. Searching can be performed using one of two possible strategies: beam search or shortest-first search. If beam search is used, then the best k clause candidates are kept at each step of the search. On the other hand, with shortest-first search, the algorithm tries to find the best clauses of length i before it moves on to length i + 1.
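The annealing-flavored variant can be sketched as follows; the one-dimensional "structure" space, the decay schedule, and the scoring function are all invented for illustration:

```python
import random

def annealed_search(score, neighbors, start, p0=0.9, decay=0.9, iters=120, seed=0):
    """Structure search with a simulated-annealing flavor: with probability p
    take a random refinement, otherwise a greedy one; p decays each iteration,
    gradually steering the search away from random choices."""
    rng = random.Random(seed)
    current, p, best = start, p0, start
    for _ in range(iters):
        options = neighbors(current)
        current = rng.choice(options) if rng.random() < p else max(options, key=score)
        if score(current) > score(best):
            best = current          # remember the best structure seen so far
        p *= decay                  # cool down
    return best

# Toy landscape: integer "structures", score peaks at 17.
result = annealed_search(score=lambda x: -abs(x - 17),
                         neighbors=lambda x: [x - 1, x, x + 1],
                         start=0)
print(result)
```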
Candidate clauses in this algorithm are scored using the weighted pseudo log-likelihood measure, an adaptation of the pseudo log-likelihood that weighs the pseudo-likelihood of each grounded atom by one over the number of groundings of its predicate, to prevent predicates with larger arity from dominating the expression. Another technique developed in the graphical models community that has been extended to par-factor graphs is that of structure selection through appropriate regularization. In this approach [LGK07], a large number of factors of a Markov network are evaluated at once by training parameters over them and using the L1 norm as a regularizer (as opposed to the typically used L2 norm). Since the L1 norm imposes a strong penalty on smaller parameters, its effect is to force more parameters to 0, and these are then pruned from the model. Huynh and Mooney [HM08] extended this technique for structure learning of MLNs by first using Aleph [Sri01], an off-the-shelf ILP learner, to generate a large set of potential par-factors (in this case, first-order clauses), and then performing L1-regularized parameter learning over this set. One of the difficulties of structure learning via greedy search, as performed in Algorithm 1, is that the space of possible structures is very large and contains many local maxima and plateaus. Thus, much work has focused on developing approaches that address this challenge. Next, we discuss two groups of approaches. The first group exploits more sophisticated search techniques, whereas the second is based on constraining the search space by performing a carefully designed pre-processing step.

Using More Sophisticated Search

One way of addressing the potential shortcomings of greedy structure selection is by using a more sophisticated search algorithm A. For example, Biba et al. [BFE08] used iterated local search as a way of avoiding local maxima when performing discriminative structure learning for MLNs.
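The pruning effect of L1 regularization can be illustrated with proximal gradient descent (soft-thresholding) on a toy separable least-squares objective; the "signal strengths" and the penalty λ below are invented:

```python
# Soft-thresholding is the proximal step for the L1 penalty: weights whose
# signal is weaker than the penalty are driven exactly to 0 and can be pruned.

def soft_threshold(w, lam):
    return max(w - lam, 0.0) if w > 0 else min(w + lam, 0.0)

# Toy separable objective: sum_i (w_i - t_i)^2 + lam * |w_i|, one weight per factor.
targets = [2.0, 0.05, -1.5, 0.01]   # invented "useful" vs "noise" factors
lam, step = 0.2, 0.4
w = [0.0] * len(targets)
for _ in range(200):
    grads = [2 * (wi - ti) for wi, ti in zip(w, targets)]
    w = [soft_threshold(wi - step * g, step * lam) for wi, g in zip(w, grads)]

# The weak-signal weights collapse exactly to 0; the strong ones shrink slightly.
print([round(wi, 2) for wi in w])  # [1.9, 0.0, -1.4, 0.0]
```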
Iterated local search techniques [LMS03] alternate between two types of search steps, either moving towards a locally optimal solution or perturbing the current solution in order to escape from local optima. An alternative approach is to search for structures of increasing complexity, at each stage using the structures found at the previous stage to constrain the search space. Such a strategy was employed by Khosravi et al. [KSM+10] for learning MLN structure in domains that contain many descriptive attributes. Their approach, which is similar to the technique employed to constrain the search space in PRMs [FGKP99], described below, distinguishes between two types of tables: attribute tables, which describe a single entity type, and relationship tables, which describe relationships between entities. The algorithm, called MBN, then proceeds in three stages. In the first stage, dependencies local to attribute tables are learned. In the second stage, dependencies over a join of an attribute table and a relationship table are learned, but the search space is constrained by requiring that all dependencies local to the attribute table found in the first stage remain the same. Finally, in the third stage, dependencies over a join of two relationship tables, joined with relevant attribute tables, are learned, and the search space is similarly constrained. An orthogonal characteristic of MBN is that, although the goal is to learn an undirected SRL model, dependencies are learned using a Bayesian network learner. The directed structures are then converted to undirected ones by "moralizing" the graphs. The advantage of this approach is that structure learning in directed models is significantly faster than structure learning in undirected models, due to the decomposability of the score, which allows it to be updated locally, only in parts of the structure that have been modified, and thus scoring of candidate structures is more efficient.
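Moralization itself is simple: connect ("marry") the parents of every node and drop edge directions. A minimal sketch over a toy DAG of our own:

```python
from itertools import combinations

def moralize(parents):
    """parents: dict node -> list of parent nodes. Returns the undirected edge set."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:                          # keep each parent-child edge,
            edges.add(frozenset((p, child)))  # now undirected
        for p1, p2 in combinations(ps, 2):    # "marry" co-parents of the same child
            edges.add(frozenset((p1, p2)))
    return edges

# Toy DAG: A -> C <- B (the classic v-structure)
moral = moralize({"C": ["A", "B"], "A": [], "B": []})
print(sorted(sorted(e) for e in moral))  # [['A', 'B'], ['A', 'C'], ['B', 'C']]
```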
Constraining the Search Space

A second group of solutions is based on constraining the search space over structures, typically by performing a pre-processing step that, roughly speaking, finds more promising regions of the space. One approach, used for PRM learning, is to constrain the set of potential parents of each par-RV X [FGKP99]. This algorithm proceeds in stages, in each stage k forming the set of potential parents of X as those par-RVs that can be reached from X through a chain of relations of length at most k. Structure learning at stage k is then constrained to search only over those potential parent sets. Thus, so far, this algorithm is similar to techniques such as MBN described above. However, the algorithm further constrains potential parent candidates by requiring that they "add value" beyond what is already captured in the currently learned set of parents. More specifically, the set of potential parents of par-RV X at stage k consists of the parents in the learned structure from stage k − 1, plus any par-RVs reachable through relation chains of length at most k that lead to a higher value in a specially designed score measure. This algorithm directly ports scoring functions that were developed for an analogous learning technique for Bayesian networks [FNP99]. A series of algorithms in this group have been developed for learning MLN structure. The first in the series was BUSL [MM07]. BUSL is based on the observation that, once an MLN is instantiated into a Markov network, the instantiations of each clause of the MLN define a set of identically structured cliques in the Markov network. BUSL inverts this process of instantiation and constrains the search space by first inducing lifted templates for such cliques by learning a "Markov network template," an undirected graph of dependencies whose nodes are not ordinary variables but par-RVs. Then clause search is constrained to the cliques of this Markov network template.
Markov network templates are learned by constructing, from the perspective of each predicate, a table in which there is a row for each possible instantiation of the predicate and a column for each possible par-RV, with the value of cell (i, j) set to 1 if the data contains a true instantiation of the j'th par-RV whose variable substitutions are consistent with the i'th predicate instantiation. The Markov network template is learned from this table by any Markov network learner. A further MLN learner that is based on constraining the search space is the LHL algorithm [KD09]. LHL limits the set of clause candidates that are considered by using relational pathfinding [RM92] to focus on more promising ones. Developed in the ILP community, relational pathfinding [RM92] searches for clauses by tracing paths across the true instantiations of relations in the data. Figure 2 gives an example in which the clause Credits(A, B) ∧ Credits(C, B) ⇒ WorkedFor(A, C) is learned by tracing the thick-lined path between brando and coppola and variablizing appropriately. However, because in real-world relational domains the search space over relational paths may be very large, a crucial aspect of LHL is that it does not perform relational pathfinding over the original relational graph of the data but over a "lifted hypergraph," which is formed by clustering the entities in the domain via an agglomerative clustering procedure, itself implemented as an MLN. Intuitively, entities are clustered together if they tend to participate in the same kinds of relations with entities from other clusters. Structure search is then limited only to clauses that can be derived as relational paths in the lifted hypergraph.

[Figure 2: Example of relational pathfinding over a relational graph with entities brando, godFather, coppola, and rainMaker, connected by Credits and WorkedFor edges.]
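Relational pathfinding can be sketched as a breadth-first search over ground facts followed by variablization. The tiny knowledge base below mirrors the entities of Figure 2; the path-search code itself is our own illustration:

```python
from collections import deque

# Ground facts as (relation, arg1, arg2); entities mirror Figure 2.
facts = [("Credits", "brando", "godFather"),
         ("Credits", "coppola", "godFather"),
         ("Credits", "coppola", "rainMaker")]

def find_path(start, goal):
    """BFS across ground facts; returns the list of facts linking start to goal."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        node, path = frontier.popleft()
        if node == goal:
            return path
        for fact in facts:
            rel, a, b = fact
            if node in (a, b):
                nxt = b if node == a else a
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [fact]))
    return None

def variablize(path):
    """Replace each distinct constant with a fresh logical variable."""
    names = {}
    def var(c):
        return names.setdefault(c, chr(ord("A") + len(names)))
    return [(rel, var(a), var(b)) for rel, a, b in path]

path = find_path("brando", "coppola")
print(variablize(path))  # [('Credits', 'A', 'B'), ('Credits', 'C', 'B')]
```

The variablized path forms the clause body; adding the head WorkedFor(A, C) yields a candidate clause of the kind LHL searches over.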
Kok and Domingos [KD10] have proposed constraining the search space by identifying "structural motifs," which capture commonly occurring patterns among densely connected entities in the domain. The resulting algorithm, called LSM, proceeds by first identifying motifs and then searching for clauses by performing relational pathfinding within them. To discover motifs, LSM starts from an entity i in the relational graph and performs a series of random walks. Entities that are reachable within a thresholded hitting time, and the hyperedges among them, are included in the motif, and the paths via which they are reachable from i are recorded. Next, the entities included in the motif are clustered by their hitting times into groups of potentially symmetrical nodes. The nodes within each group are then further clustered in an agglomerative manner by the similarity of the distributions over paths via which they are reachable from i. This process results in a lifted hypergraph, analogous to the one produced by LHL; however, whereas in LHL nodes were clustered based on their close neighborhoods in the relational graph, here they are clustered based on their longer-range connections to other nodes. Motifs are extracted from the lifted hypergraphs via depth-first search.

5.2.1 Structure Revision and Transfer Learning

Our discussion so far has focused on learning structure from scratch. While approaches based on greedy search, such as Algorithm 1, can be easily adapted to perform revision by starting learning from a given structure, some work in the area has also focused on approaches specifically designed for structure revision and transfer learning. For example, Paes et al. [PRZC05] introduced an approach for revision of BLPs based on work on theory revision in the ILP community, where the goal is, given an initial theory, to minimally modify it such that it becomes consistent with a set of examples.
The BLP revision algorithm follows the methodology of the FORTE theory revision system [RM95], first generating revision points in places where the given set of rules fails, and then focusing the search for revisions on ones that could address the discovered revision points. The FORTE methodology was also followed in TAMAR, an MLN transfer learning system [MHM07], which generates revision points on MLN clauses by performing inference and observing the ways in which the given clauses fail. TAMAR was designed for transfer learning, e.g., [BLY06], where the goal is to first map, or translate, the given structure from the representation of a source domain to that of a target and then to revise it. Thus, in addition to the revision module, it also contains a mapping module, which discovers the best mapping of the source predicates to the target ones. The problem of mapping a source structure to a target domain was also considered in the constrained setting where data in the target domain is extremely scarce [MM09]. Rather than taking a structure learned specifically for a source domain and trying to adapt it to a target domain of interest, an alternative approach to transfer learning is to extract general knowledge from the source domain that can then be applied to a variety of target domains. This is the approach taken in DTM [DD09], which uses the source data to learn general clique templates expressed as second-order Markov logic clauses, i.e., with quantification both over the predicates and the variables. During this step, care is taken to ensure that the learned clique templates capture general regularities and are not likely to be specific to the source domain. Then, in the target domain, DTM allows for several possible mechanisms for using the clique templates to provide declarative bias.

5.2.2 Learning Causal Models

An important structure learning problem is inducing causal models from relational data.
This problem has recently been addressed by Maier et al. [MTOJ10], whose RPC algorithm works with directed SRL models, which allow causal effects to be encoded in the directionality of the links. RPC extends its propositional analog, the PC algorithm [SGS01], which was developed for learning causality in graphical models. It proceeds in two stages. In the first stage, skeleton identification is performed to uncover the structure of conditional independencies in the data. In the second stage, edge orientation takes place, and orientations consistent with the skeleton are considered. RPC ports the edge orientation rules of PC to the relational setting and also develops new edge orientation rules that are specific to relational domains, in particular to model the uncertainty over whether a link exists between entities.

6 Conclusion

This article has presented a survey of work on lifted graphical models. We have reviewed a general form for a lifted graphical model, a par-factor graph, and shown how a number of existing statistical relational representations map to this formalism. We have discussed inference algorithms, including lifted inference algorithms, that efficiently compute the answers to probabilistic queries. We have also reviewed work on learning lifted graphical models from data. It is our belief that the need for statistical relational models (whether it goes by that name or another) will grow in the coming decades, as we are inundated with data that is a mix of structured and unstructured, with entities and relations extracted in a noisy manner from text, and with the need to reason effectively with this data. We hope that this synthesis of ideas from many different research groups will provide an accessible starting point for new researchers in this expanding field.

Acknowledgements

We would like to thank Galileo Namata and Theodoros Rekatsinas for their comments on earlier versions of this paper. L.
Mihalkova is supported by a CI fellowship under NSF Grant #0937060 to the Computing Research Association. L. Getoor is supported by NSF Grants #IIS0746930 and #CCF0937094. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or the CRA.

References

[Bes75] Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.

[BFE08] Marenglen Biba, Stefano Ferilli, and Floriana Esposito. Discriminative structure learning of Markov logic networks. In Proceedings of the 18th International Conference on Inductive Logic Programming (ILP-08), 2008.

[BLY06] Bikramjit Banerjee, Yaxin Liu, and G. Michael Youngblood, editors. ICML Workshop on "Structural Knowledge Transfer for Machine Learning", Pittsburgh, PA, 2006.

[BMG09] Matthias Broecheler, Lilyana Mihalkova, and Lise Getoor. Probabilistic similarity logic. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI-10), 2010.

[CSN98] Mark Craven, Sean Slattery, and Kamal Nigam. First-order learning for web mining. In Proceedings of the 10th European Conference on Machine Learning (ECML-98), 1998.

[dAR05] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005.

[dAR06] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. MPE and partial inversion in lifted probabilistic variable elimination. In Proceedings of the 21st Conference on Artificial Intelligence (AAAI-06), 2006.

[dAR07] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning, chapter 15, pages 433–451. MIT Press, 2007.

[DD09] Jesse Davis and Pedro Domingos. Deep transfer via second-order Markov logic.
In Proceedings of the 26th International Conference on Machine Learning (ICML-09), 2009.

[DGM04] Tom Dietterich, Lise Getoor, and Kevin Murphy, editors. SRL2004: Statistical Relational Learning and its Connections to Other Fields, Banff, Alberta, Canada, 2004.

[DK03] Luc De Raedt and Kristian Kersting. Probabilistic logic learning. ACM-SIGKDD Explorations: Special Issue on Multi-relational Data Mining, 5(5):31–48, 2003.

[DK09] Pedro Domingos and Kristian Kersting, editors. International Workshop on Statistical Relational Learning (SRL-2009), Leuven, Belgium, 2009.

[DKT07] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), 2007.

[DL09] Pedro Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool, 2009.

[dNB+09] Rodrigo de Salvo Braz, Sriraam Natarajan, Hung Bui, Jude Shavlik, and Stuart Russell. Anytime lifted belief propagation. In Proceedings of the International Workshop on Statistical Relational Learning (SRL-09), 2009.

[DPDPL97] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

[DWW99] Paul Damien, Jon Wakefield, and Stephen Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, 61(2):331–344, 1999.

[FGKP99] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), 1999.

[FGM06] Alan Fern, Lise Getoor, and Brian Milch, editors. SRL2006: Open Problems in Statistical Relational Learning at ICML 2006, Pittsburgh, PA, 2006.
[FNP99] Nir Friedman, Iftach Nachman, and Dana Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), 1999.

[Get02] Lise Getoor. Learning Statistical Models from Relational Data. PhD thesis, Stanford University, 2002.

[GFK+07] Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, and Ben Taskar. Probabilistic relational models. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning, pages 129–174. MIT Press, 2007.

[GFKT02] Lise Getoor, Nir Friedman, Daphne Koller, and Benjamin Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679–707, 2002.

[GT07] Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA, 2007.

[HB94] David Heckerman and John S. Breese. A new look at causal independence. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (UAI-94), 1994.

[HCM+00] David Heckerman, David Maxwell Chickering, Christopher Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for inference, collaborative filtering and data visualization. Journal of Machine Learning Research, 1:49–75, 2000.

[Hec99] David Heckerman. A tutorial on learning with Bayesian networks. In Michael Jordan, editor, Learning in Graphical Models. MIT Press, 1999.

[HM08] Tuyen N. Huynh and Raymond J. Mooney. Discriminative structure and parameter learning for Markov logic networks. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), 2008.

[HM09] Tuyen N. Huynh and Raymond J. Mooney. Max-margin weight learning for Markov logic networks. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD-09), 2009.

[HM11] Tuyen N. Huynh and Raymond J. Mooney.
Online max-margin weight learning for Markov logic networks. In Proceedings of the Eleventh SIAM International Conference on Data Mining (SDM-11), 2011.

[Jae02] Manfred Jaeger. Relational Bayesian networks: A survey. Linköping Electronic Articles in Computer and Information Science, 7(015), 2002.

[JMF07] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template based inference in symmetric relational Markov random fields. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-07), 2007.

[KAN09] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.

[KD01] Kristian Kersting and Luc De Raedt. Towards combining inductive logic programming with Bayesian networks. In Proceedings of the 11th International Conference on Inductive Logic Programming (ILP-01), 2001.

[KD05] Stanley Kok and Pedro Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning (ICML-05), 2005.

[KD09] Stanley Kok and Pedro Domingos. Learning Markov logic network structure via hypergraph lifting. In Proceedings of the 26th International Conference on Machine Learning (ICML-09), 2009.

[KD10] Stanley Kok and Pedro Domingos. Learning Markov logic networks using structural motifs. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[KFL01] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[KP97] Daphne Koller and Avi Pfeffer. Object-oriented Bayesian networks. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI-97), 1997.
[KP98] Daphne Koller and Avi Pfeffer. Probabilistic frame-based systems. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

[KP09a] Jacek Kisyński and David Poole. Constraint processing in lifted probabilistic inference. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.

[KP09b] Jacek Kisyński and David Poole. Lifted aggregation in directed first-order probabilistic models. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09), 2009.

[KRK+10] Kristian Kersting, Stuart Russell, Leslie Pack Kaelbling, Alon Halevy, Sriraam Natarajan, and Lilyana Mihalkova, editors. Statistical Relational AI Workshop at AAAI-10, Atlanta, GA, 2010.

[KSJ97] Henry Kautz, Bart Selman, and Yueyen Jiang. A general stochastic approach to solving problems with hard and soft constraints. In Dingzhu Gu, Jun Du, and Panos Pardalos, editors, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 35, pages 573–586. American Mathematical Society, 1997.

[KSM+10] Hassan Khosravi, Oliver Schulte, Tong Man, Xiaoyuan Xu, and Bahareh Bina. Structure learning for Markov logic networks with many descriptive attributes. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI-10), 2010.

[LD07] Daniel Lowd and Pedro Domingos. Efficient weight learning for Markov logic networks. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-07), 2007.

[LGK07] Su-In Lee, Varun Ganapathi, and Daphne Koller. Efficient structure learning of Markov networks using L1-regularization. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS-06), 2007.

[LMS03] Helena R. Lourenço, Olivier C. Martin, and Thomas Stützle. Iterated local search. In Fred W. Glover and Gary A.
Kochenberger, editors, Handbook of Metaheuristics. Springer, 2003.

[MHM07] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), 2007.

[MK09] Joris Mooij and Bert Kappen. Bounds on marginal probability distributions. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS-08), 2009.

[MM07] Lilyana Mihalkova and Raymond J. Mooney. Bottom-up learning of Markov logic network structure. In Proceedings of the 24th International Conference on Machine Learning (ICML-07), 2007.

[MM09] Lilyana Mihalkova and Raymond J. Mooney. Transfer learning from minimal target data by mapping across relational domains. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09), 2009.

[MMR+05] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic models with unknown objects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), 2005.

[MR09] Lilyana Mihalkova and Matthew Richardson. Speeding up inference in statistical relational learning by clustering similar query literals. In Proceedings of the 19th International Conference on Inductive Logic Programming (ILP-09), 2009.

[MSS09] Andrew McCallum, Karl Schultz, and Sameer Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS-09), 2009.

[MTOJ10] Marc Maier, Brian Taylor, Hüseyin Oktay, and David Jensen. Learning causal models of relational domains. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI-10), 2010.

[Mug96] Stephen Muggleton. Stochastic logic programs.
In Proceedings of the 6th International Workshop on Inductive Logic Programming (ILP-96), 1996.

[MWJ99] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), 1999.

[MZK+08] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI-08), 2008.

[NJ07] Jennifer Neville and David Jensen. Relational dependency networks. Journal of Machine Learning Research, 8:653–692, 2007.

[NKL+10] Sriraam Natarajan, Tushar Khot, Daniel Lowd, Kristian Kersting, Prasad Tadepalli, and Jude Shavlik. Exploiting causal independence in Markov logic networks: Combining undirected and directed models. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD-10), 2010.

[PD06] Hoifung Poon and Pedro Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the 21st Conference on Artificial Intelligence (AAAI-06), 2006.

[PDS08] Hoifung Poon, Pedro Domingos, and Marc Sumner. A general method for reducing the complexity of relational inference and its application to MCMC. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI-08), 2008.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[Pfe07] Avi Pfeffer. Sampling with memoization. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), 2007.

[PKMT99] Avi Pfeffer, Daphne Koller, Brian Milch, and Ken T. Takusagawa. SPOOK: A system for probabilistic object-oriented knowledge representation.
In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), 1999.

[Poo03] David Poole. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), 2003.

[PRZC05] Aline Paes, Kate Revoredo, Gerson Zaverucha, and Vitor Santos Costa. Probabilistic first-order theory revision from examples. In Proceedings of the 15th International Conference on Inductive Logic Programming (ILP-05), 2005.

[PZ03] David Poole and Nevin Lianwen Zhang. Exploiting contextual independence in probabilistic inference. Journal of Artificial Intelligence Research, 18:263–313, 2003.

[Qui90] J. Ross Quinlan. Learning logical definitions from relations. Machine Learning, 5(3):239–266, 1990.

[Rab89] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 1989.

[RD06] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006.

[Rie08] Sebastian Riedel. Improving the accuracy and efficiency of MAP inference for Markov logic. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-08), 2008.

[RM92] Bradley L. Richards and Raymond J. Mooney. Learning relations by pathfinding. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), 1992.

[RM95] Bradley L. Richards and Raymond J. Mooney. Automated refinement of first-order Horn-clause domain theories. Machine Learning, 19(2):95–131, 1995.

[RN03] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 2003.

[Sch98] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley, 1998.

[SD06] Parag Singla and Pedro Domingos. Memory-efficient inference in relational domains. In Proceedings of the 21st Conference on Artificial Intelligence (AAAI-06), 2006.
[SD08] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI-08), 2008.

[SDG08] Prithviraj Sen, Amol Deshpande, and Lise Getoor. Exploiting shared correlations in probabilistic databases. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB-08), 2008.

[SDG09a] Prithviraj Sen, Amol Deshpande, and Lise Getoor. Bisimulation-based approximate lifted inference. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.

[SDG09b] Prithviraj Sen, Amol Deshpande, and Lise Getoor. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB Journal, special issue on uncertain and probabilistic databases, 2009.

[SGS01] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2001.

[SN09] Jude Shavlik and Sriraam Natarajan. Speeding up inference in Markov logic networks by preprocessing to reduce the size of the resulting grounded network. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09), 2009.

[SNB+08] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.

[Sri01] A. Srinivasan. The Aleph Manual, 2001. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

[TAK02] Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02), 2002.

[Tas04] Ben Taskar. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2004.

[TCK04] Ben Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative Markov networks.
In Proceedings of the 21st International Conference on Machine Learning (ICML-04), 2004.

[Tie94] Luke Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22(4):1701–1728, 1994.

[WBG92] Michael P. Wellman, John S. Breese, and Robert P. Goldman. From knowledge bases to decision models. Knowledge Engineering Review, 7(1):35–53, 1992.

[WD08] Jue Wang and Pedro Domingos. Hybrid Markov logic networks. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI-08), 2008.

[WES04] Wei Wei, Jordan Erenrich, and Bart Selman. Towards efficient sampling: Exploiting random walk strategies. In Proceedings of the 19th Conference on Artificial Intelligence (AAAI-04), 2004.

[YFW01] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), 2001. Distinguished papers track.
