Association Rules in the Relational Calculus

One of the most utilized data mining tasks is the search for association rules. Association rules represent significant relationships between items in transactions. We extend the concept of association rule to represent a much broader class of associ…

Authors: Oliver Schulte, Flavia Moser, Martin Ester

Association Rules in the Relational Calculus

Oliver Schulte, Flavia Moser, Martin Ester and Zhiyong Lu
School of Computing Science, Simon Fraser University, Burnaby, B.C., Canada
{oschulte,fmoser,ester,zhiyongl}@cs.sfu.ca
August 20, 2021

Abstract

One of the most utilized data mining tasks is the search for association rules. Association rules represent significant relationships between items in transactions. We extend the concept of association rule to represent a much broader class of associations, which we refer to as entity-relationship rules. Semantically, entity-relationship rules express associations between properties of related objects. Syntactically, these rules are based on a broad subclass of safe domain relational calculus queries. We propose a new definition of support and confidence for entity-relationship rules and for the frequency of entity-relationship queries. We prove that the definition of frequency satisfies standard probability axioms and the Apriori property.

1 Introduction

One of the goals of data mining is to discover interesting relationships from data. Association rules express relationships that hold with sufficient frequency but not always. For example, it may be the case that not all managers earn over \$60,000 a year, but that 90% of managers do. The logical form of an association rule is that of an implication $p \to q$ where $p$ and $q$ hold together sufficiently often (the “support” of the rule) and $q$ holds sufficiently often given that $p$ holds (the “confidence” of the rule). The traditional concept of association rules severely limits the complexity of the expressions $p$ and $q$ and thereby limits the class of relationships a data miner can capture. Essentially, $p$ and $q$ may be only simple conjunctions, like an itemset. Thus we cannot have rules based on Boolean combinations, such as negations or nested combinations.
An example of a relationship involving a negation would be a negative factor, such as “students who have not taken an introductory database course do poorly in data mining courses”. An example of a nested Boolean combination would be “students who are math majors or computer science majors, and who have done well in a discrete mathematics course or in an algorithms course, do well in complexity theory”. Another class of relationships that association rules cannot express involves quantification and relating objects to each other. An example would be the rule “residents who have a neighbour with a high income tend to have a high income themselves”.

The goal of this paper is to extend the concept of an association rule to a large class of expressions that we refer to as entity-relationship queries (ER queries). Intuitively, entity-relationship queries express dependencies among entities and their properties. Entity-relationship queries are a large subclass of the safe queries. Safe queries correspond to an expressive subset of first-order logic that allows for nested Boolean expressions and quantification. We provide a definition of the frequency of an entity-relationship query. This extends the notion of an association rule to implications of the form $p \to q$ where $p \wedge q$ is an ER query; we refer to rules of this form as entity-relationship rules. From our definition of the frequency of an ER query we immediately obtain a definition of the support of an ER rule, namely the frequency $\mathit{fr}(p \wedge q)$.

TV-Program(Prog-Name: string)
TV-Station(Station-Name: string, Area: integer)
WeekdayTV(TV-Program: string, TV-Station: string, Viewers: integer, Sponsor: string)
WeekendTV(TV-Program: string, TV-Station: string, Viewers: integer, Sponsor: string)

Table 1: A relational schema for a TV survey model. Key fields are underlined.
The schema lists TV programs and stations, and records for each combination of weekday program and station how many viewers view the program on that station and who sponsors the program. The same information is recorded for weekend programs.

Our definition of frequency for ER queries generalizes previous work on defining association rules in a multi-relational setting. [1] discusses extending itemset rules with negations and motivates the usefulness of this extension. The query extension approach of the Warmr system [4] presents a special class of entity-relationship rules that allows conjunctions of nonnegated statements and existential quantification. Our concept of ER rules features in addition negations, universal quantification, nested quantifiers, and nested Boolean combinations. Thus one contribution of this paper is an extended rule format.

A characteristic that distinguishes our approach from previous work is that previous approaches assume a given target table that defines a base set of tuples for evaluating the support of a query. In contrast, we start with a query and define a natural base set of tuples for evaluating the support of the query. We can think of this approach as dynamically generating entity sets for a given query rather than evaluating queries with respect to a fixed entity set. Thus the second main contribution of this paper is a new definition of support for rules in our extended format.

The paper is organized as follows. First we review basic relational database concepts such as the relational schema and the domain relational calculus. Then we introduce the concept of an entity query and define the frequency of a query in this class of queries. This definition provides the basis for the notion of an entity-relationship rule and for defining the support of an entity-relationship rule.
We compare entity-relationship queries to frequent itemsets and to the rule language of the Warmr system. The final section establishes several important formal properties of query frequencies as we define them and shows that they satisfy the Apriori property, that is, the frequency of a conjunction is no greater than the frequency of its conjuncts.

2 Entities in the Domain Relational Calculus

This section presents standard background material from database theory. The first subsection reviews relational schemas and introduces the new concept of an entity field. Semantically, entity fields are those that store values (constants) that refer to entities. The second subsection defines the standard notion of a safe query in the domain relational calculus, and the third introduces a subclass of safe queries that we term entity-relationship queries.

2.1 Entities in Relational Schemas

We begin with a standard relational schema containing a set of tables, each with key fields, descriptive attributes, and possibly foreign key pointers. We use the notation $T$ to refer to a generic table that may represent either an entity set or a relationship set, and for an index we use $T_i$. A field named name in table $T$ is denoted by $T$.name. Table 1 shows a relational schema for a TV survey database; this example is adapted from [6, Sec.2]. Tables 2–4 display relation instances for the TV survey schema.

We assume that the tables in the relational schema can be divided into entity tables and relationship tables. This is the case whenever a relational schema is derived from an entity-relationship model (ER model) [8, Ch.2.2]. Intuitively, an entity table corresponds to a type of entity, and a relationship table represents a relation between entity types.
In our TV survey example, there are two types of entities: TV programs, represented in the TV-Program table, and TV stations, represented in the TV-Station table. We now introduce two assumptions concerning the relational schema that facilitate the definition of entity-relationship queries and their frequencies.

| TV-Program | TV-Station | Viewers (thousands) | Sponsor |
|---|---|---|---|
| Gilmore | Global | 10 | Avon |
| Gilmore | CBS | 12 | La Senza |
| Hockey Night | CBC | 20 | RBC |

Table 2: Television Survey: Weekday TV.

| TV-Program | TV-Station | Viewers (thousands) | Sponsor |
|---|---|---|---|
| Gilmore | Global | 8 | Avon |
| Hockey Night | CBC | 14 | Schwab |
| Simpsons | CBS | 10 | RBC |
| Daily Show | CBC | 6 | La Senza |

Table 3: Television Survey: Weekend TV.

Unary Key Assumption. We assume that every entity table has a single key field.

The advantage of the unary key assumption is that, given this assumption, a single key field in the relational schema refers to a single entity. The assumption holds in our TV survey schema because the two entity tables have key fields TV-Program.Prog-Name and TV-Station.Station-Name respectively. Although it is not always natural to define entities with a single key field, there is no loss of generality because we can always form a single composite key field from a list of key fields. For example, if in a Professor table there are two key fields FirstName, LastName, we can form a composite key field $\langle$FirstName, LastName$\rangle$. Our second assumption is the following.

Global Name Assumption. We assume that for every entity $e$, there is a unique constant $c$ such that in every table, the constant $c$ denotes entity $e$.

The global name assumption is important because it allows us to recognize when the same entity occurs in different tables. In the AI literature, a similar assumption is often referred to as the “unique name assumption” [7, Ch.14].
The assumption does not amount to a loss of generality because if the same constant $c$ is used in different tables to refer to different entities, we can simply index $c$ to distinguish these occurrences. For example, if we have two different transaction tables Transaction1 and Transaction2, and there is a transaction 1 in both, we could change the entry in the first table to refer to 1-1 and in the second table to refer to 1-2. A natural alternative to indexing constants would be to adopt a convention to the effect that a key field $T$.key in table $T$ refers to different entities than key field $T'$.key in table $T'$ if and only if the names of the key fields in the two tables are different. For example, if we have a table for Employees and another for Managers, labelling the key field in each table as “ssn” indicates that a given social security number refers to the same person no matter where it appears. In contrast, labelling the key field in the Transactions1 table “T1-number” and the key field in the Transactions2 table “T2-number” indicates that the transaction numbers in different tables refer to different transactions.

| Station-Name | Area |
|---|---|
| Global | 1 |
| CBS | 2 |
| CBC | 3 |

Table 4: Television Survey: Stations and Areas.

| Symbol Type | Notation | Comment |
|---|---|---|
| Constants | $c_1, c_2, \ldots$ | At most countably many constants |
| Predicate Symbols | $P_1, P_2, \ldots, P_k$ | Exactly one predicate for each table $T_i$ |
| Logical Symbols | $\exists, \forall, \wedge, \vee, \neg$ | |
| Comparison Operators | $=, <, >, \leq, \geq, \neq$ | |

Table 5: The basic vocabulary of our DRC language for a given database schema $D$ with tables $T_1, \ldots, T_k$.

In many applications, the global name assumption is enforced through foreign key constraints.
To illustrate, in the TV example, we may suppose that the field WeekdayTV.TV-Station is a foreign key pointer to the field TV-Station.Station-Name, and that the field WeekendTV.TV-Station is a foreign key pointer to the same field. So the string constant “CBS” refers to the CBS network represented in the TV-Station table, whether “CBS” appears in an instance of the WeekdayTV relation or in an instance of the WeekendTV relation.

Given the unary key and global name assumptions, the following is a valid definition of how tables, key fields and constants are associated with entities.

Definition 1 Let $D$ be a database instance.

1. An entity table is a table $T$ with a single key field.
2. A field is an entity field if (1) the field is the key of an entity table, or (2) the field is a foreign key pointer to the key of an entity table.
3. A constant $c$ is an entity constant if $c$ appears in an entity field.

Examples. Let $D$ be the TV survey database instance from Tables 2–4. The entity fields are TV-Program.Prog-Name, TV-Station.Station-Name, WeekdayTV.TV-Program, WeekdayTV.TV-Station, WeekendTV.TV-Program and WeekendTV.TV-Station. Entity constants include “CBS” and “Simpsons”.

Next we review the domain relational calculus, which is a logical query language based on a given relational schema.

2.2 Safe Queries in the Domain Relational Calculus

We first define the formal language of the domain relational calculus, including the well-formed formulas of the calculus. Then we define an important subclass of formulas known as safe queries. Our presentation follows the standard approach; see for example [8, Ch.3].

2.2.1 The Formal Language of the Domain Relational Calculus

In the domain relational calculus (DRC), for every table $T_i$ in the database schema there is exactly one predicate $P_i$ in the logical language. The number of fields in the table $T_i$ is the arity of the predicate $P_i$.
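The correspondence between tables, predicates and entity fields can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: it stores each table of the TV survey instance as a set of tuples, the names `PREDICATES`, `ENTITY_FIELDS`, `arity`, `holds` and `entity_constants` are hypothetical, and the contents of the TV-Program table are assumed (the paper does not print that instance) to be the four programs occurring in Tables 2 and 3.

```python
# Toy encoding of the TV survey instance; one predicate per table.
TV_PROGRAM = {("Gilmore",), ("Hockey Night",), ("Simpsons",), ("Daily Show",)}  # assumed contents
TV_STATION = {("Global", 1), ("CBS", 2), ("CBC", 3)}
WEEKDAY_TV = {
    ("Gilmore", "Global", 10, "Avon"),
    ("Gilmore", "CBS", 12, "La Senza"),
    ("Hockey Night", "CBC", 20, "RBC"),
}
WEEKEND_TV = {
    ("Gilmore", "Global", 8, "Avon"),
    ("Hockey Night", "CBC", 14, "Schwab"),
    ("Simpsons", "CBS", 10, "RBC"),
    ("Daily Show", "CBC", 6, "La Senza"),
}

PREDICATES = {
    "TVProgram": TV_PROGRAM,
    "TVStation": TV_STATION,
    "WeekdayTV": WEEKDAY_TV,
    "WeekendTV": WEEKEND_TV,
}

# Entity fields in the sense of Definition 1, as (predicate, argument position):
# keys of entity tables plus foreign key pointers to them.
ENTITY_FIELDS = {
    ("TVProgram", 0), ("TVStation", 0),
    ("WeekdayTV", 0), ("WeekdayTV", 1),
    ("WeekendTV", 0), ("WeekendTV", 1),
}


def arity(name):
    """Arity of a predicate: the number of fields of its (non-empty) table."""
    return len(next(iter(PREDICATES[name])))


def holds(name, *constants):
    """A ground atom holds iff the tuple of constants is a row of the table."""
    return tuple(constants) in PREDICATES[name]


def entity_constants():
    """All constants that appear in some entity field."""
    return {row[pos] for name, pos in ENTITY_FIELDS for row in PREDICATES[name]}
```

With this encoding, `arity("WeekdayTV")` is 4, and “CBS” and “Simpsons” are entity constants whereas the sponsor “Avon” is not, since it only occurs in a descriptive attribute.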
If $T_i$ is an entity table, then $P_i$ is an entity predicate. By the unary key assumption, an entity table $T_i$ has a single key field; we adopt the convention that the key field is the first argument in the entity predicate $P_i$. The complete logical vocabulary of the DRC is listed in Table 5.

Example. In the TV survey model, we have the predicates shown in Table 6. Thus we may write $\mathit{WeekdayTV}(\text{“Gilmore Girls”}, \text{“CBS”}, 12, \text{“La Senza”})$ to assert that “Gilmore Girls” is shown on “CBS” on weekdays, with 12,000 viewers, and sponsored by La Senza.

| Predicates | Arity |
|---|---|
| TV-Program(PN) | 1 |
| TV-Station(SN, A) | 2 |
| WeekdayTV(PN, SN, V, S) | 4 |
| WeekendTV(PN, SN, V, S) | 4 |

Table 6: Predicates of our logical query language for the TV survey model.

The notion of a well-formed formula is the usual one for this vocabulary.

Definition 2 Well-Formed Formulas of the Domain Relational Calculus

1. A constant $c$ or variable $X$ is a term.
2. If $P$ is a predicate symbol of arity $k$ and $t_1, \ldots, t_k$ are $k$ terms, then $P(t_1, \ldots, t_k)$ is an atomic formula.
3. If $t, t'$ are two terms and $\theta$ is a comparison operator, then the comparison $t \,\theta\, t'$ is an atomic formula.
4. If $F$ is a formula and $X$ is a variable, then $\neg F$, $\exists X.F$, $\forall X.F$ are formulas.
5. If $F_1$ and $F_2$ are formulas, then so are $F_1 \wedge F_2$ and $F_1 \vee F_2$.
6. All formulas are formed by the repeated application of the previous rules.

Examples. Table 7 gives examples of valid expressions and their types pertaining to the TV survey.

| Expression | Type |
|---|---|
| $V \geq 10$ | atomic formula with $V$ free |
| $\exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S)$ | quantified formula with $P$ free |
| $[\exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10] \wedge [\exists S.\, \exists SN.\, \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10]$ | conjunction of quantified formulas |

Table 7: Examples of valid expressions for the database schema for the TV survey.

We next define the result or output of a DRC query.
The first step is to define which ground formulas are satisfied in a database instance $D$; a formula is ground if it contains no variables. The second step is to define which closed queries $F$, with no free variables, are satisfied in a database instance $D$; as usual in logic, we write $D \models F$. Let $F[X_1/t_1, \ldots, X_k/t_k]$ be the formula that results from replacing all free occurrences of each $X_i$ in $F$ with the term $t_i$.

1. If $t, t'$ are two constants, then $D \models t \,\theta\, t'$ iff $t \,\theta\, t'$ holds.
2. $D \models P_i(c_1, \ldots, c_k)$ iff $\langle c_1, \ldots, c_k \rangle$ is a tuple in table $T_i$.
3. $D \models F_1 \vee F_2$ iff $D \models F_1$ or $D \models F_2$; similarly $D \models F_1 \wedge F_2$ iff $D \models F_1$ and $D \models F_2$; and $D \models \neg F_1$ iff $D \not\models F_1$.
4. $D \models \exists X.F$ iff there is a constant $c$ in the DRC language such that $D \models F[X/c]$; similarly $D \models \forall X.F$ iff for all constants $c$ we have $D \models F[X/c]$.

Let $F(X_1, \ldots, X_m)$ be a query with free variables $X_1, \ldots, X_m$. Then on database instance $D$ the query $F$ returns the set of all tuples that make $F$ true when substituted in $F$. Formally, we write

$$\mathrm{tuples}_D(F) \equiv \{\langle c_1, \ldots, c_m \rangle : D \models F[X_1/c_1, \ldots, X_m/c_m]\}.$$

This definition assumes that the constants in the language include all constants that appear in the database tables, which involves no loss of generality.

Examples. Let $D$ be the database instance from Tables 2–4. Table 8 shows the results of our example queries for this database instance.

| Query Formula $F$ | Result $\mathrm{tuples}_D(F)$ |
|---|---|
| $F_1 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}\}$ |
| $F_2 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$ | $\{\text{“Hockey Night”}, \text{“Simpsons”}\}$ |
| $F_1 \wedge F_2$ | $\{\text{“Hockey Night”}\}$ |

Table 8: Results of query formulas on the database instance $D$ from Tables 2–4.

2.2.2 Safe Queries

It is customary to restrict the formulas in which free variables (“query variables” for short) may occur, to ensure that the result set of tuples satisfying a query formula is bounded and “domain-independent” [8, Ch.3.8]. To this end we adopt the notion of a safe query. The intuition behind this concept is that the results of safe queries should be restricted to selection conditions applied to (combinations of) tables in the database. For example, the query $\neg \mathit{TVProgram}(X)$ with free variable $X$ is not safe because the range of constants satisfying this query is not bound by any table in the database. The key idea in the definition of safe query is to conjoin a query formula $F$ with a restriction, giving the form $P \wedge F$, where $P$ is a basic predicate in the language and hence refers to a table in the database. As is well known, the expressive power of safe queries is exactly equivalent to that of relational algebra [2]. Safe queries are formally defined as follows [8, Ch.3.8].

1. Replace the $\forall X$ quantifier by $\neg \exists X \neg$.
2. Whenever $\vee$ is used to connect $F_1 \vee F_2$, the two formulas have the same set of free variables.
3. Consider any maximal subformula consisting of the conjunction of one or more formulas $F_1 \wedge \ldots \wedge F_m$. Then all variables $X$ appearing free in any of the $F_i$ must be limited as follows. The variable $X$ must be free in some non-negated $F_i$ satisfying one of the following conditions.
   (a) $F_i$ is not a comparison.
   (b) $F_i$ is $X = c$ where $c$ is a constant.
   (c) $F_i$ is $X = Y$, and $Y$ is limited.
4. A $\neg$ operator may apply only to a formula in a conjunction of the type discussed in the previous rule.

| Query Formula $F$ | Safe? |
|---|---|
| $F_1 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ | yes |
| $F_2 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$ | yes |
| $F_1 \wedge F_2$ | yes |
| $F_1 \vee F_2$ | yes |
| $\neg F_1$ | no |
| $\neg F_1 \vee F_2$ | no |
| $F_1 \wedge \neg F_2$ | yes |

Table 9: Examples of safe and unsafe queries for the TV survey database schema of Table 1.

Examples.
Table 9 gives examples of safe and unsafe queries for the TV database schema from Table 1. This completes our review of basic concepts from relational database theory. We now come to the restriction of safe queries to entity-relationship queries.

2.3 Definition of Entity-Relationship Queries

The basic idea behind our definition of an ER query is that free variables should be limited in such a way as to guarantee that they must refer to entities. Intuitively, an ER query is one whose free variables refer to entities. The precise definition is as follows.

Definition 3 Let $D$ be a database instance.

1. A variable $X$ is an entity variable candidate for a DRC formula $F$ if
   (a) $X$ is not quantified over in any part of $F$;
   (b) if an expression $X \,\theta\, t$ appears in $F$, then $\theta$ is $=$ or $\neq$, and if $t$ is a constant $c$, then $c$ is an entity constant in $D$;
   (c) if an expression $P(\ldots, X, \ldots)$ appears in $F$, then the argument position of $X$ in $P(\ldots, X, \ldots)$ is an entity field.
2. A variable $X$ is an entity variable for $F$ if
   (a) $X$ is an entity variable candidate for $F$, and
   (b) if an expression $X = Y$ or $X \neq Y$ appears in $F$, then $Y$ is an entity variable candidate.
3. An entity-relationship (ER) query $F$ for database instance $D$ is a safe DRC query such that all the free variables in $F$ are entity variables for $F$ given $D$.

Examples. Let $D$ be the TV survey database instance from Tables 2–4. In the formula $\exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ the variable $P$ is an entity variable, and so the formula is an ER query. The formula $\exists S.\, \exists SN.\, \mathit{WeekdayTV}(\text{“Gilmore Girls”}, SN, V, S)$ is safe but not an ER query because the free variable $V$ is not an entity variable.

3 The Frequency of Entity-Relationship Queries

Our basic idea is that the limiting conditions in safe queries specify the domain from which values for a free variable $X$ are to be drawn.
Once the domain for the free variables is defined for a given formula $F$, we can take the frequency of the formula $F$ to be the number of assignments to the free variables that satisfy the formula divided by the size of the domain for the formula. Safe queries are a natural class of queries for this approach because these queries specify the range from which result tuples may be drawn by restricting these results to subsets of tables in the database (cf. Section 2.2.2).

The main issue in our definition concerns the correct domain for conjunctions or intersections. For a simple example, consider a database schema with two entity tables Professor and Customer. The query $\mathit{Professor}(X) \wedge \mathit{Customer}(X)$ returns entities that are both professors and customers. What should be the base domain for this query? If there are many more customers than professors, we may get quite different frequency counts if we take the base domain to be Professor than if we take it to be Customer. So neither of these seems the right choice. Intuitively the base domain should be a symmetric function of the two classes mentioned in the query. The two natural symmetric set-theoretic operations are intersection and union. If we take the intersection as the base domain, the frequency of conjunctions without further selection conditions is always 100%, which does not seem right. In particular for our ultimate goal of defining the support of association rules, this is unsatisfactory. Our proposal is therefore to use the union of the two entity sets involved in the conjunction. Another way to look at the union is that it represents a kind of closed world assumption: if Professors and Customers are the only entity types mentioned in the selection conditions of the query, then the members of these entity types are exactly the potential answers to the query.
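The effect of the four candidate base domains can be seen on a tiny instance. The following sketch uses hypothetical toy data (the names and the two entity sets are invented for illustration, not taken from the paper) and simply divides the size of the query result by the size of each candidate domain.

```python
from fractions import Fraction

# Hypothetical entity keys for the two entity tables (illustrative only).
professors = {"ann", "bob"}
customers = {"bob", "cat", "dan", "eve"}

# Result of the query Professor(X) AND Customer(X).
result = professors & customers


def frequency(result_set, base_domain):
    """Number of satisfying assignments divided by the size of the base domain."""
    return Fraction(len(result_set), len(base_domain))


freq_professor_base = frequency(result, professors)                 # asymmetric choice
freq_customer_base = frequency(result, customers)                   # asymmetric choice
freq_intersection_base = frequency(result, professors & customers)  # always 1 for a bare conjunction
freq_union_base = frequency(result, professors | customers)         # the choice proposed in the text
```

On this toy instance the two asymmetric choices disagree (1/2 versus 1/4), the intersection gives 1 regardless of the data, and the union gives 1/5.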
The closed world assumption is also the basis for our frequency definition for queries with negation. For example, consider a safe query such as $\mathit{Professor}(X) \wedge \neg \mathit{Customer}(X)$. Since Professors and Customers are the only entity types mentioned in this query, we take the base domain again to be the union of these two sets. The fact that Professors are mentioned positively and Customers negatively does not make a difference to the base domain, but it does make a difference to the result of the query and hence to its frequency.

On the basis of this proposal, we can now recursively assign a domain to an entity variable in a formula $F$ given a database instance $D$. We begin with just one free query variable and then tackle the more complicated case of queries with more than one free variable.

3.1 Definition of Frequency for Queries With One Free Query Variable

We denote the base domain of an entity variable $X$ in a query $F$ relative to a database instance $D$ as $\mathrm{dom}_D(F, X)$. As we think of variable $X$ as referring to the domain $\mathrm{dom}_D(F, X)$, we term $\mathrm{dom}_D(F, X)$ the reference domain of $X$ in the context of query $F$.

Definition 4 Let $D$ be a database instance with ER formula $F$.

1. If $F$ is $P_i(t_1, \ldots, t_k)$, and $X$ occurs in $F$, then $\mathrm{dom}_D(F, X) = \pi_X[\mathrm{tuples}_D(F)]$. If $X$ is not a free variable in $F$, then $\mathrm{dom}_D(F, X) = \emptyset$. Here we think of $\mathrm{tuples}_D(F)$ as a relation whose columns correspond to the free variables of $F$. For example, the query $p(X, Y, Z)$ returns a relation with triples, and we can think of the first column as named $X$ and the second as named $Y$. The expression $\pi$ refers to the projection operator of relational algebra (with elimination of duplicates).
2. Let $F$ be a single atomic comparison of the form $Y \,\theta\, t$ where $t$ is either a variable or a constant. If $F$ is $X = c$, then $\mathrm{dom}_D(F, X) = \{c\}$. Otherwise $\mathrm{dom}_D(F, X) = \emptyset$.
3. If $F$ is $\neg G$ for some formula $G$, then $\mathrm{dom}_D(F, X) = \mathrm{dom}_D(G, X)$.
4. If $F$ is $F_1 \vee F_2$ or $F_1 \wedge F_2$, then $\mathrm{dom}_D(F, X) = \mathrm{dom}_D(F_1, X) \cup \mathrm{dom}_D(F_2, X)$.
5. If $F$ is $\exists Y.G$, where $Y \neq X$, then $\mathrm{dom}_D(F, X) = \mathrm{dom}_D(G, X)$. If $F$ is $\exists X.G$, then $\mathrm{dom}_D(F, X) = \emptyset$.

Examples. Let $D$ be the TV survey database instance from Tables 2–4. Table 10 gives examples of reference domains for various ER queries.

| Query Formula $F$ | Reference Domain $\mathrm{dom}_D(F, P)$ |
|---|---|
| $F_1 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}\}$ |
| $F_2 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}, \text{“Simpsons”}, \text{“Daily Show”}\}$ |
| $F_3 = F_1 \wedge F_2$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}, \text{“Simpsons”}, \text{“Daily Show”}\}$ |
| $F_4 = F_1 \vee F_2$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}, \text{“Simpsons”}, \text{“Daily Show”}\}$ |
| $F_5 = F_1 \wedge \neg F_2$ | $\{\text{“Gilmore”}, \text{“Hockey Night”}, \text{“Simpsons”}, \text{“Daily Show”}\}$ |

Table 10: Reference domains for various formulas in the TV survey database instance $D$ from Tables 2–4. The free query variable is $P$.

As this definition shows, we think of basic predicates as specifying the range from which entities are drawn. Conditions of the form $p(t_1, \ldots, X, \ldots, t_k)$ or $X = c$ we view as “direct bounds” that determine the reference domain of $X$. Variable equations of the form $X = Y$ we view as “selection conditions” that are applied after an entity has been specified. These do not affect the reference domain of $X$ but only the result of the query. Another type of selection are restrictions on descriptive attributes, such as $V \geq 10$ in the queries in Table 10. Now the frequency of an ER query is defined as follows.

Definition 5 Let $F$ be an ER query with free variable $X$ such that $\mathrm{dom}_D(F, X) \neq \emptyset$. Then

$$\mathit{fr}_D(F) \equiv \frac{|\mathrm{tuples}_D(F)|}{|\mathrm{dom}_D(F, X)|}.$$

In Section 5 we establish several formal properties of the frequency of a query according to this definition, for example that the frequency is a number between 0 and 1.

Examples. Let $D$ be the TV survey database instance from Tables 2–4. Table 11 illustrates the frequencies of various queries.

| Query Formula $F$ | Frequency $\mathit{fr}_D(F)$ |
|---|---|
| $F_1 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ | 1 |
| $F_2 = \exists S.\, \exists SN.\, \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$ | 1/2 |
| $F_1 \wedge F_2$ | 1/4 |
| $F_1 \vee F_2$ | 3/4 |
| $F_1 \wedge \neg F_2$ | 1/4 |

Table 11: Frequencies for various formulas in the TV survey database instance $D$ from Tables 2–4.

3.2 Definition of Frequency of Queries With More Than One Free Variable

We assign a domain to every tuple of entity variables in a formula $F$ given a database instance $D$, which we denote as $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$. Our basic idea is to consider a result tuple $\langle c_1, \ldots, c_m \rangle$ as denoting a composite entity formed by combining $m$ single entities. For example, consider the rule

$$\mathit{Neighbour}(X, Y) \wedge (\exists I.\, \mathit{Income}(X, I) \wedge I > \$100{,}000) \to (\exists I.\, \mathit{Income}(Y, I) \wedge I > \$100{,}000).$$

(The symbol $\to$ does not denote logical implication but defines an association rule; see Section 4.) This says that if $X$ has an income over \$100,000, then it is likely that a neighbour $Y$ of $X$ also has an income over \$100,000. The support of this rule is the frequency of the query

$$\mathit{Neighbour}(X, Y) \wedge (\exists I.\, \mathit{Income}(X, I) \wedge I > \$100{,}000) \wedge (\exists I.\, \mathit{Income}(Y, I) \wedge I > \$100{,}000).$$

This query has two free variables $X$ and $Y$. The reference domain comprises the entries in the Neighbour table, that is, the pairs $\langle X, Y \rangle$ in the table. Other examples of natural composite entities include relations like reservations or purchases.
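As a rough numerical illustration of the neighbour rule, the sketch below computes its support on a hypothetical toy instance. The tables `NEIGHBOUR` and `INCOME`, the person names and the helper `high_income` are all invented for this example; only the shape of the computation (result pairs divided by the pairs in the Neighbour table) follows the text.

```python
from fractions import Fraction

# Hypothetical toy instance (not from the paper).
NEIGHBOUR = {("ann", "bob"), ("bob", "ann"), ("bob", "cat"), ("cat", "bob")}
INCOME = {("ann", 120_000), ("bob", 150_000), ("cat", 40_000)}
THRESHOLD = 100_000


def high_income(person):
    """Mirrors the subformula: exists I. Income(person, I) and I > THRESHOLD."""
    return any(p == person and i > THRESHOLD for (p, i) in INCOME)


# Reference domain for the variable pair (X, Y): the pairs in the Neighbour table.
reference_domain = NEIGHBOUR

# Result tuples of the conjunctive query with free variables X and Y.
result = {(x, y) for (x, y) in NEIGHBOUR if high_income(x) and high_income(y)}

# Support of the rule = frequency of the conjunctive query.
support = Fraction(len(result), len(reference_domain))
```

Here two of the four neighbour pairs consist of two high earners, so the support on this toy instance is 1/2.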
The idea of treating tuples in a relational table as composite “individuals” is familiar in the propositionalization literature [5, 3] (for example, chemical molecules may be treated as single entities although molecules are composed of different elements that are also represented in the relational schema). Applying this idea requires a further constraint on ER queries: the free variables $\{X_1, \ldots, X_m\}$ must be “bound together” in a limiting condition rather than separately. For example, the query $P(X) \wedge X = Y$ is a safe ER query but the answer pairs $\langle x, y \rangle$ are not bound to the key fields of any tuple; an example of the same character is the query $P(X) \wedge Q(Y)$. To rule out such cases, we impose the following condition.

Definition 6 A literal is an atomic formula or its negation. An ER query $F$ is valid for variables $X_1, \ldots, X_m$ if for every maximal conjunction $L = L_1 \wedge \ldots \wedge L_k$ consisting only of literals, $L$ contains a conjunction of the form $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, or $L$ contains a conjunct $P(t_1, \ldots, t_k)$ where all variables $\{X_1, \ldots, X_m\}$ occur in $P(t_1, \ldots, t_k)$. An ER query $F$ is valid if $F$ is valid for the set of its free variables.

Examples follow below in this section. In the case with only one free query variable $X$, the definition of safe query implies that every entity query is valid.

Now let us consider the definition of a reference domain for valid ER queries with one or more free variables. As in the case with just one query variable, we term $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$ the reference domain of $\{X_1, \ldots, X_m\}$ in the context of query $F$. Consider the basic case of an atomic formula $F = P(t_1, \ldots, t_m)$ first. In keeping with the idea behind safe queries, we can think of such formulas as specifying a basic range for the result tuples in a query.
So suppose that the free variables in the atomic formula are $Y_1, Y_2, \ldots, Y_k$. If our query variables $X_1, \ldots, X_m$ are not all contained in the set $\{Y_1, Y_2, \ldots, Y_k\}$, we consider that the “composite key” $X_1, \ldots, X_m$ does not appear in the query, and $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$. Otherwise we consider the query result $\mathrm{tuples}_D(F)$ as a relation with $k$ columns, of which $m$ are named $X_1, \ldots, X_m$. For example, the query $p(X, Y, Z)$ returns a relation with triples, and we can think of the first column as named $X$ and the second as named $Y$. Thus we can take $\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F)$ to be the reference domain of the entity variables $X_1, X_2, \ldots, X_m$ in the query $F$. This leads to the following inductive definition. The main difference with the definition for a single query variable is that we need to treat conjunctions like $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$ as a single compound statement.

Definition 7 Let $D$ be a database instance with ER formula $F$ and let $X_1, \ldots, X_m$ be a list of variables.

1. If $F$ is $P(t_1, \ldots, t_k)$, and all variables $X_1, \ldots, X_m$ occur in $P(t_1, \ldots, t_k)$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F)$, where $\pi$ is the projection operation of relational algebra. Otherwise $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.

2. Let $F$ be a single atomic comparison of the form $Y \,\theta\, t$ where $t$ is either a variable or a constant.
   (a) Suppose that $m = 1$, $Y = X_1$ and the comparison is $X_1 = c$ (i.e., we just have a single free variable $X_1$ and the atomic formula requires $X_1$ to be equal to a constant $c$). In that case $\mathrm{dom}_D(F, \{X_1\}) = \{c\}$.
   (b) Otherwise $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.

3. Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.
   (a) If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(C, \{X_1, \ldots, X_m\}) \cup \{\langle c_1, \ldots, c_m \rangle\}$.
   (b) Otherwise $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathrm{dom}_D(C_i, \{X_1, \ldots, X_m\})$.

4. If $F$ is $F_1 \vee F_2$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(F_1, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F_2, \{X_1, \ldots, X_m\})$.

5. If $F$ is $\neg G$ for some formula $G$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(G, \{X_1, \ldots, X_m\})$.

6. If $F$ is $\exists Y.G$, where $Y \notin \{X_1, \ldots, X_m\}$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(G, \{X_1, \ldots, X_m\})$. If $F$ is $\exists X_i.G$ for some $X_i \in \{X_1, \ldots, X_m\}$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.

It is easy to check that this definition agrees with Definition 4 for queries with just one free variable.

| Query formula $F$ | Reference domain $\mathrm{dom}_D(F, X)$ |
| $F_1 = \exists S.\exists V.\,\mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10$ | $\{\langle \text{“Gilm.”}, \text{“Glo.”} \rangle, \langle \text{“Gilm.”}, \text{“CBS”} \rangle, \langle \text{“Hock. N.”}, \text{“CBC”} \rangle\}$ |
| $F_2 = \exists S.\exists V.\,\mathit{WeekendTV}(P, SN, V, S) \wedge V > 10$ | $\{\langle \text{“Gilm.”}, \text{“Glo.”} \rangle, \langle \text{“Hock. N.”}, \text{“CBC”} \rangle, \langle \text{“Simps.”}, \text{“CBS”} \rangle, \langle \text{“Daily Sh.”}, \text{“CBC”} \rangle\}$ |
| $F_3 = F_1 \wedge F_2$ | $\{\langle \text{“Gilm.”}, \text{“Glo.”} \rangle, \langle \text{“Gilm.”}, \text{“CBS”} \rangle, \langle \text{“Hock. N.”}, \text{“CBC”} \rangle, \langle \text{“Simps.”}, \text{“CBS”} \rangle, \langle \text{“Daily Sh.”}, \text{“CBC”} \rangle\}$ |

Table 12: Reference domains for various formulas in the TV survey database instance $D$ from Tables 2–4. The free query variables are $P$ and $SN$, corresponding to program-station pairs.

Examples. Consider the query “find all program-station pairs that achieve a viewership of over 10,000 on both weekdays and weekends”. In the domain relational calculus, this query may be formulated as $[\exists S.\exists V.\,\mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10] \wedge [\exists S.\exists V.\,\mathit{WeekendTV}(P, SN, V, S) \wedge V > 10]$.
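The inductive clauses of Definition 7 can be sketched as a short recursive function. The following Python is only an illustration under stated assumptions: the formula classes (`Var`, `Atom`, `Cmp`, `And`, `Or`, `Not`, `Exists`), the helper names and the toy rows are all invented here and are not the contents of Tables 2–4. A database is assumed to be a dictionary from predicate names to sets of tuples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Atom:            # P(t_1, ..., t_k); each term is a Var or a constant
    pred: str
    terms: tuple

@dataclass(frozen=True)
class Cmp:             # comparison Y θ t, with Y a variable
    left: Var
    op: str
    right: object

@dataclass(frozen=True)
class And:             # maximal conjunction C_1 ∧ ... ∧ C_k
    parts: tuple

@dataclass(frozen=True)
class Or:
    left: object
    right: object

@dataclass(frozen=True)
class Not:
    sub: object

@dataclass(frozen=True)
class Exists:          # ∃ var . sub
    var: str
    sub: object

def is_key_eq(p, xs):
    """True if p has the form X_i = c for a query variable X_i and a constant c."""
    return (isinstance(p, Cmp) and p.op == "=" and p.left.name in xs
            and not isinstance(p.right, Var))

def dom(f, xs, db):
    """Reference domain of the query variables xs in formula f (Definition 7)."""
    if isinstance(f, Atom):                                  # clause 1
        names = {t.name for t in f.terms if isinstance(t, Var)}
        if not set(xs).issubset(names):
            return set()
        out = set()
        for row in db[f.pred]:
            env, ok = {}, True
            for t, v in zip(f.terms, row):
                if isinstance(t, Var):                       # repeated variables must agree
                    ok = ok and env.setdefault(t.name, v) == v
                else:                                        # constants must match the row
                    ok = ok and t == v
            if ok:
                out.add(tuple(env[x] for x in xs))           # projection onto xs
        return out
    if isinstance(f, Cmp):                                   # clause 2
        return {(f.right,)} if len(xs) == 1 and is_key_eq(f, xs) else set()
    if isinstance(f, And):                                   # clause 3
        out = set()
        for p in f.parts:
            if not is_key_eq(p, xs):
                out |= dom(p, xs, db)
        eqs = {p.left.name: p.right for p in f.parts if is_key_eq(p, xs)}
        if all(x in eqs for x in xs):                        # 3(a): add <c_1, ..., c_m>
            out.add(tuple(eqs[x] for x in xs))
        return out
    if isinstance(f, Or):                                    # clause 4
        return dom(f.left, xs, db) | dom(f.right, xs, db)
    if isinstance(f, Not):                                   # clause 5
        return dom(f.sub, xs, db)
    if isinstance(f, Exists):                                # clause 6
        return set() if f.var in xs else dom(f.sub, xs, db)
    raise TypeError(f"unsupported formula: {f!r}")

# Toy instance; rows are invented, columns follow the order (P, SN, V, S).
db = {
    "WeekdayTV": {("news", "A", 12, 1), ("news", "B", 8, 1), ("quiz", "A", 15, 2)},
    "WeekendTV": {("news", "A", 20, 1), ("film", "B", 5, 2)},
}
P, SN, V, S = Var("P"), Var("SN"), Var("V"), Var("S")

def over_ten(table):
    """Builds ∃S.∃V. table(P, SN, V, S) ∧ V > 10."""
    body = And((Atom(table, (P, SN, V, S)), Cmp(V, ">", 10)))
    return Exists("S", Exists("V", body))

F1, F2 = over_ten("WeekdayTV"), over_ten("WeekendTV")
F3 = And((F1, F2))
print(sorted(dom(F3, ["P", "SN"], db)))
```

In this sketch the comparison $V > 10$ contributes nothing to the reference domain, so the domain of each quantified subformula is the full set of program-station pairs in its table, and the domain of the conjunction is the union of the two, as in clause 3(b).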
Table 12 shows the calculation of the reference domain for this formula on the database instance of Tables 2–4. Now the frequency of an ER query is defined as follows.

Definition 8 Let $F$ be an ER query whose free variables are $X_1, \ldots, X_m$ where $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$. Then

$$\mathrm{fr}_D(F) \equiv \frac{|\mathrm{tuples}_D(F)|}{|\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$

Table 13 illustrates the frequencies of various queries.

| Query formula $F$ | Frequency $\mathrm{fr}_D(F)$ |
| $F_1 = \exists S.\exists V.\,\mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10$ | 2/3 |
| $F_2 = \exists S.\exists V.\,\mathit{WeekendTV}(P, SN, V, S) \wedge V > 10$ | 1/4 |
| $F_1 \wedge F_2$ | 1/5 |
| $F_1 \vee F_2$ | 2/5 |
| $F_1 \wedge \neg F_2$ | 1/5 |

Table 13: Frequencies for various formulas in the TV survey database instance $D$ from Tables 2–4. The free query variables are $P$ and $SN$, corresponding to program-station pairs.

4 Entity-Relationship Rules

We finally obtain the notion of an ER association rule, or ER rule for short.

4.1 Definition of Confidence and Support for ER rules

Given the concepts we have developed so far, the definitions of confidence and support for an entity-relationship rule are straightforward.

Definition 9 Let $D$ be a database instance.

1. An ER association rule is an implication of the form $F \rightarrow G$, where the free variables of $G$ are the same as or contained in the free variables of $F$, and $F \wedge G$ is a valid ER query.

2. The confidence of an ER association rule $F \rightarrow G$ is given by
$$\mathrm{con}_D(F \rightarrow G) \equiv \frac{|\mathrm{tuples}_D(F \wedge G)|}{|\mathrm{tuples}_D(F)|}.$$

3. The support of an ER association rule $F \rightarrow G$ is given by $\mathrm{support}_D(F \rightarrow G) \equiv \mathrm{fr}_D(F \wedge G)$.

As usual with association rules, the implication $F \rightarrow G$ does not indicate logical implication (whenever $F$ is true, so is $G$) but instead denotes a probabilistic relationship.

Example. Let $D$ be the TV survey database instance from Tables 2–4. Let $F_1$ be the formula $\exists S.\exists SN.\exists V.\,\mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ and let $F_2$ be the formula $\exists S.\exists SN.\exists V.\,\mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$. Consider the rule $F_1 \rightarrow F_2$. The support of this rule is $\mathrm{fr}_D(F_1 \wedge F_2) = 1/4$ (see Table 11). The confidence is

$$\frac{|\{\text{“Gilmore”}, \text{“Hockey Night”}\} \cap \{\text{“Hockey Night”}, \text{“Simpsons”}\}|}{|\{\text{“Gilmore”}, \text{“Hockey Night”}\}|} = 1/2.$$

Definition 9 completes our goal of providing a definition of confidence and support for general entity-relationship queries.

4.2 Comparison With Other Rule Languages

This section gives a brief comparison of our rule language and frequency definition to related rule languages. It is easy to see that the classic association rule approach based on frequent itemsets is a special case. For example, suppose we have two entity tables: Transactions(number) that stores transactions, and Item(name) for items, and a relational table TransItems(TransNumber, ItemName) that indicates which items appear in which transactions. Then for a given item, say “cola”, the query $\mathit{Transactions}(X) \wedge \mathit{TransItems}(X, \text{“cola”})$ returns the set of transactions involving “cola”, and the frequency of this query is the frequency of these transactions among all transactions.

Antonie and Zaïane [1] extend itemset rules with negations, and survey a number of search algorithms for finding frequent itemsets with negative conditions. Their search procedure is based on correlation analysis.

The Warmr system [4] considers queries that are conjunctions of literals (e.g., $P(X, Y)$). The user specifies a target table $T$; the free query variables in a Warmr query are then bound to the key fields of $T$. If WeekdayTV is our target table, we would have two free query variables $P$ for program and $SN$ for station. All other variables are implicitly existentially quantified.
For example, if Customer is the target table, the Warmr formula $\mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{“cola”})$ translates into the domain relational calculus as $\exists C.\,\mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{“cola”})$. If we assume that one of the conjuncts in a Warmr clause corresponds to the target table (e.g., $\mathit{Customer}(A)$), and all other appearances of the query variables are related to the target table by foreign key constraints (e.g., the first field in the Child table is a foreign key to the Customer table), then the reference domain as we have defined it is exactly the target table, and the frequency that Warmr assigns to a conjunction agrees with our definition. In this sense our definition of support for ER rules generalizes that for Warmr rules.

5 The Probability Axioms and A Priori Property

In order to ensure that Definition 8 yields well-defined probabilities, we verify three facts: (1) The frequency as defined never involves division by 0, so the frequency is well-defined. (2) The definition entails that frequencies are between 0 and 1 (inclusive). (3) The frequency of the disjunction of two mutually exclusive queries is the sum of their respective frequencies. This third property holds only with certain qualifications due to the restrictions on safe queries. The usual probability axioms include the requirement that (4) the probability of the whole space, or the “certain event”, is 1. We discuss the extent to which this property holds for our definition of frequency. Finally, we show the Apriori property: frequencies of conjunctions decrease monotonically, which is important for lattice search methods.

For the first fact, we have the following result. The notion of a valid ER query was specified in Definition 6.

Proposition 10 Let $F$ be a valid ER query whose free variables are $X_1, \ldots, X_m$.
Let $D$ be any database instance (without empty tables). Then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.

Proof. If $F$ is valid, then for every maximal conjunction $L$ of literals that occurs in $F$, we have $\mathrm{dom}_D(L, \{X_1, \ldots, X_m\}) \neq \emptyset$. Since the reference domains of more complex formulas are the union of the domains of their subformulas, it follows that $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.

The next proposition guarantees that the ratios assigned by Definition 8 are properly bounded between 0 and 1.

Proposition 11 Let $F$ be an ER query in which the variables $X_1, \ldots, X_m$ are free such that $F$ is valid for these variables. Let $D$ be a database instance. Then $\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) \subseteq \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$, where $\pi$ is the projection operation of relational algebra.

In the case in which $X_1, \ldots, X_m$ are exactly the free variables of $F$, we have $\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) = \mathrm{tuples}_D(F)$, so the proposition implies that the ratio $\frac{|\mathrm{tuples}_D(F)|}{|\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}$ is between 0 and 1.

Proof. The proof is by induction on the structure of the ER formula $F$. We begin by noting two basic facts about valid formulas, which follow easily from Definitions 2, 6, and 7.

1. If $C = C_1 \wedge \cdots \wedge C_k$ is a maximal conjunction in $F$, then $C$ contains a conjunction $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$ or a conjunct $C_i$ that is a valid ER query.

2. If $F_1 \vee F_2$ is a disjunction in $F$, then both of the disjuncts are valid ER queries.

• If $F$ is an atomic formula of the form $P(t_1, \ldots, t_k)$, then since $F$ is valid for $X_1, \ldots, X_m$, we have $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F)$.

• Let $F$ be a single atomic comparison of the form $Y \,\theta\, t$ where $t$ is either a variable or a constant.
Since $F$ is valid, it must be of the form $X_1 = c$ where $m = 1$ (i.e., we just have a single free variable $X_1$ and the atomic formula requires $X_1$ to be equal to a constant $c$). So $\mathrm{dom}_D(F, \{X_1\}) = \{c\}$, and clearly $\pi_{X_1}\mathrm{tuples}_D(F) \subseteq \{c\}$.

• Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.

1. If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(C, \{X_1, \ldots, X_m\}) \cup \{\langle c_1, \ldots, c_m \rangle\}$. Clearly $\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) \subseteq \{\langle c_1, \ldots, c_m \rangle\}$, which is a subset of $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$.

2. Otherwise $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathrm{dom}_D(C_i, \{X_1, \ldots, X_m\})$. Since $F$ is valid, by Observation 1 at least one of the conjuncts $C_i$ is valid. So by inductive hypothesis,
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(C_i) \subseteq \mathrm{dom}_D(C_i, \{X_1, \ldots, X_m\}).$$
Now since $F$ is a conjunction involving $C_i$, it follows that
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) \subseteq \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(C_i)$$
and that
$$\mathrm{dom}_D(C_i, \{X_1, \ldots, X_m\}) \subseteq \mathrm{dom}_D(F, \{X_1, \ldots, X_m\}),$$
which establishes the inductive step for this case.

• If $F$ is $F_1 \vee F_2$, then by Clause 2 of the definition of a safe query, both $F_1$ and $F_2$ are valid and contain all the variables $\{X_1, \ldots, X_m\}$ as free variables. So
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) = \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F_1) \cup \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F_2).$$
Also, by inductive hypothesis,
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F_1) \subseteq \mathrm{dom}_D(F_1, \{X_1, \ldots, X_m\})$$
and
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F_2) \subseteq \mathrm{dom}_D(F_2, \{X_1, \ldots, X_m\}),$$
and by definition
$$\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(F_1, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F_2, \{X_1, \ldots, X_m\}).$$
So $\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) \subseteq \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$ as required.
• If $F$ is $\neg G$ for some formula $G$, then $F$ is not a safe query, hence not an ER query, and the claim holds vacuously.

• If $F$ is $\exists Y.G$, then $Y \notin \{X_1, \ldots, X_m\}$, since the variables $X_1, \ldots, X_m$ are free in $F$. So
$$\mathrm{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(G, \{X_1, \ldots, X_m\}),$$
and
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(F) = \pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(G)$$
by the semantics of the existential quantifier. Clearly if $F$ is valid, then so is $G$, so by inductive hypothesis
$$\pi_{\langle X_1, \ldots, X_m \rangle}\mathrm{tuples}_D(G) \subseteq \mathrm{dom}_D(G, \{X_1, \ldots, X_m\}),$$
which completes the inductive proof.

The third fundamental property of probabilities is finite additivity, that the frequency of two mutually exclusive events is the sum of the individual frequencies. The difficulty with this property is not that it fails for our frequency definition, but that it is not straightforwardly expressed in our language of safe queries. For example, a natural formulation of finite additivity would be to require that $\mathrm{fr}_D(F) + \mathrm{fr}_D(\neg F) = \mathrm{fr}_D(F \vee \neg F)$. But if $F$ is a safe query, then $\neg F$ is not safe, so the frequency $\mathrm{fr}_D(\neg F)$ is not defined. Another way to see the difficulty is to note that in standard probability theory (with a Boolean algebra of events), finite additivity is equivalent to the requirement that $\Pr(A) = 1 - \Pr(\bar{A})$, where $\bar{A}$ is the complement of event $A$. But this cannot be expressed as a requirement on safe queries since the negation of a safe query is not itself safe.

However, we can show a qualified version of finite additivity. If $S$ and $F$ are valid safe queries with the same free variables, then the formulas $S \wedge F$ and $S \wedge \neg F$ are also valid safe queries. For these formulas we can show the following result.

Proposition 12 Let $S$ and $F$ be valid safe queries with the same free variables $\{X_1, \ldots, X_m\}$.
Then for any database instance $D$ we have
$$\mathrm{fr}_D([S \wedge F] \vee [S \wedge \neg F]) = \mathrm{fr}_D(S \wedge F) + \mathrm{fr}_D(S \wedge \neg F) = \frac{|\mathrm{tuples}_D(S)|}{|\mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$

Proof. This follows from the definitions. We have $\mathrm{dom}_D([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_D([S \wedge F], \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D([S \wedge \neg F], \{X_1, \ldots, X_m\})$, and since $\mathrm{dom}_D([S \wedge F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_D([S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$, it follows that
$$\mathrm{dom}_D([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\}).$$
Clearly $\mathrm{tuples}_D([S \wedge F] \vee [S \wedge \neg F]) = \mathrm{tuples}_D(S)$, so
$$\mathrm{fr}_D([S \wedge F] \vee [S \wedge \neg F]) = \frac{|\mathrm{tuples}_D(S)|}{|\mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$
Also,
$$\mathrm{fr}_D(S \wedge F) = \frac{|\mathrm{tuples}_D(S \wedge F)|}{|\mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}$$
and
$$\mathrm{fr}_D(S \wedge \neg F) = \frac{|\mathrm{tuples}_D(S \wedge \neg F)|}{|\mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$
Since the result sets of $S \wedge F$ and $S \wedge \neg F$ are disjoint and their union is $\mathrm{tuples}_D(S)$, we obtain
$$\mathrm{fr}_D(S \wedge F) + \mathrm{fr}_D(S \wedge \neg F) = \frac{|\mathrm{tuples}_D(S)|}{|\mathrm{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathrm{dom}_D(F, \{X_1, \ldots, X_m\})|},$$
which was to be shown.

This result illustrates that two logically equivalent queries can have different frequencies in a given database instance, although their result tuples are always the same. In particular, although the queries $[S \wedge F] \vee [S \wedge \neg F]$ and $S$ are logically equivalent, they have different reference domains: the domain of $[S \wedge F] \vee [S \wedge \neg F]$ also includes the domain of the query $F$. This is due to our closed-world assumption: since the entities in the query $F$ are among those mentioned in the query $[S \wedge F] \vee [S \wedge \neg F]$, they are included among the potential answers to the query, although in fact no entity satisfying $F$ will be an actual answer to the query unless it is also an entity satisfying $S$.
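Proposition 12 and the remark about logically equivalent queries can be checked on a small example. The sketch below is a set-level shortcut, not a general evaluator: it assumes two atomic unary queries $S(X)$ and $F(X)$, so that each result set and each reference domain is just a Python set, and the table contents (`student`, `professor`) are invented for illustration.

```python
from fractions import Fraction

# Hypothetical unary tables; every name here is invented for illustration.
student = {"ann", "bob", "cy"}      # result tuples (and domain) of S(X)
professor = {"cy", "dee"}           # result tuples (and domain) of F(X)

def fr(tuples, domain):
    """Definition 8: |tuples_D(Q)| / |dom_D(Q, {X})| for a non-empty domain."""
    return Fraction(len(tuples), len(domain))

# For atomic unary queries the reference domain is the table itself; by
# Definition 7, S∧F, S∧¬F and their disjunction all share the domain S ∪ F.
dom_s = student
dom_s_f = student | professor

fr_s = fr(student, dom_s)
fr_s_and_f = fr(student & professor, dom_s_f)
fr_s_and_not_f = fr(student - professor, dom_s_f)
fr_disjunction = fr((student & professor) | (student - professor), dom_s_f)

print(fr_s_and_f, fr_s_and_not_f, fr_disjunction, fr_s)  # prints: 1/4 1/2 3/4 1
```

Here the two conjunctions add up to the frequency of their disjunction (1/4 + 1/2 = 3/4), while the logically equivalent query $S$ has frequency 1, because its reference domain does not include the extra entity contributed by $F$.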
The final standard property of probability measures on a Boolean algebra is that $P(X) = 1$, where $X$ is the “certain event” that contains all possible outcomes. One difficulty with this property from the point of view of our frequency definition is again not so much that the property fails to hold but that it is not straightforward to express. A natural way to translate the axiom into a logical framework is to require that all tautologies or logically necessary queries receive probability 1. For example, the query $\mathit{Student}(A) \vee \neg \mathit{Student}(A)$ is a tautology when viewed as a logical formula, but it is not a safe query. Another conceptually illuminating difficulty is that in our frequency definition, there is no single fixed space of possible outcomes or events that is independent of the query being asked. Rather, we define a space of possible outcomes dynamically for every query (i.e., $\mathrm{dom}_D(F, \{X_1, \ldots, X_m\})$ for query $F$). For a given reference domain, the probability 1 property holds to the extent that we can express it. For example, if the only two possible genders are male and female, then the query $[\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{male})] \vee [\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{female})]$ receives frequency 1 in every database instance.

Finally we show that frequency as defined decreases monotonically with respect to conjunctions. This is important because many algorithms that search for frequent query formulas use this property to avoid exhaustive search. The following result guarantees that the frequency of a conjunction is no greater than the frequency of its conjuncts, which we refer to as the Apriori property.

Proposition 13 (The Apriori Property) Let $D$ be a database instance with valid ER query $F_1$ whose free variables are $X_1, \ldots, X_m$ and suppose that $F_1 \wedge F_2$ is also a valid ER query whose free variables are $X_1, \ldots, X_m$. Then $\mathrm{fr}_D(F_1 \wedge F_2) \leq \mathrm{fr}_D(F_1)$.
Proof. Clearly $\mathrm{tuples}_D(F_1 \wedge F_2) \subseteq \mathrm{tuples}_D(F_1)$, and $\mathrm{dom}_D(F_1, \{X_1, \ldots, X_m\}) \subseteq \mathrm{dom}_D(F_1 \wedge F_2, \{X_1, \ldots, X_m\})$. So
$$\frac{|\mathrm{tuples}_D(F_1 \wedge F_2)|}{|\mathrm{dom}_D(F_1 \wedge F_2, \{X_1, \ldots, X_m\})|} \leq \frac{|\mathrm{tuples}_D(F_1)|}{|\mathrm{dom}_D(F_1, \{X_1, \ldots, X_m\})|}.$$

Discussion. Previous approaches to mining multi-relational rules such as Warmr mine rules for just one target table. Our approach in contrast can potentially search the entire space of queries for a given language bias, since by the proposition just established, the Apriori property holds for the entire query space, not just for a fixed target table or key atom, given our definition of frequency and support. So compared to an iterative approach where we repeatedly apply a single-table rule miner to different tables in the database, our approach offers computational advantages. Intuitively, our approach combines the results of rule mining for separate tables when it considers rules that involve the separate tables at the same time. For example, suppose that for the Student table, we find that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30)$ is infrequent. Then from Proposition 13 we can conclude that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30) \wedge \mathit{Professor}(X)$ is infrequent as well. A traditional single-table rule mining system applied to both target tables would have to evaluate this conjunction twice, once with the target table Student and the second time with the target table Professor.

The price for the computational advantage of the Apriori property holding throughout the query space is that our approach restricts the set of interesting queries compared to an iterative application of single-table rule mining.
For example, it may be the case that the rule $\mathit{Professor}(X) \wedge \mathit{Student}(X) \rightarrow \mathit{Age}(X, 30)$ receives enough support if evaluated with respect to Professors (because it may be the case that most professors who are also taking courses as students are younger), but does not receive enough support if evaluated with respect to Students (perhaps because very few students are also professors to begin with). Our definition of support based on taking the union of the database tables can be seen as a cautious approach because if a query is frequent with respect to the union of two tables, it is frequent with respect to either table. So a query that is frequent with respect to the union of the Professor and Student tables is frequent with respect to both.

6 Conclusion

The goal of this report was to extend the concept of confidence and support to a new class of association rules which we call entity-relationship rules. Entity-relationship rules are based on the domain relational calculus; they are much more flexible and expressive than standard itemset rules. ER rules allow for negation, nested Boolean combinations, and quantification. The main conceptual contribution of this report is a definition of frequency for entity-relationship queries. Instead of beginning with a specified target table or “key atom”, we dynamically define a reference or base domain of individuals for each ER query. The key idea of our definition is to take the base set of entities of a conjunctive query to be the union of the conjuncts’ base sets. For example, the frequency of the query $\mathit{Professor}(X) \wedge \mathit{Customer}(X)$ is computed with respect to the union of Professors and Customers.
We proved that our frequency definition satisfies standard axioms for probabilities and validates the Apriori property: the frequency of a conjunction is no greater than the frequency of any conjunct.

As usual in data mining, there is a tradeoff between the expressiveness of the rule or pattern language, and the difficulty of searching for significant patterns. Our rule language is very general and in practice a computational search for interesting entity-relationship rules will require a language restriction (bias). A central topic for future research is to explore language restrictions that make feasible a computational search for interesting entity-relationship rules.

Acknowledgements

This research was supported by Discovery Grants to the first and third author from the Natural Sciences and Engineering Research Council of Canada.

References

[1] “Mining Positive and Negative Association Rules: An Approach for Confined Rules”, Maria-Luiza Antonie and Osmar R. Zaïane (2004). 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 04), Springer Verlag LNCS 3202, pp. 27–38, Pisa, Italy, September 20–24.

[2] “Relational Completeness of Data Base Sub-languages”, E. Codd (1972). In R. Rustin, editor, Data Base Systems, Prentice Hall.

[3] “Attribute-value learning versus inductive logic programming: The missing links (extended abstract)”. In Proceedings of the Eighth International Conference on Inductive Logic Programming, pages 1–8. Springer, Berlin 1998.

[4] “Discovery of Relational Association Rules”, Luc Dehaspe and Hannu Toivonen (2001), Ch. 8, in Relational Data Mining, eds. Saso Dzeroski and Nada Lavrac, Springer Berlin.

[5] “Propositionalization Approaches to Relational Data Mining”, Stefan Kramer, Nada Lavrač and Peter Flach (2001), Ch. 8, in Relational Data Mining, eds. Saso Dzeroski and Nada Lavrac, Springer Berlin.

[6] “Extending Relational Algebra and Relational Calculus with Set-Valued Attributes and Aggregate Functions”, G. Özsoyoğlu, Z.M. Özsoyoğlu, and V. Matos (1987), ACM Transactions on Database Systems, Vol. 12:4, pp. 566–592.

[7] Artificial Intelligence: A Modern Approach, S. Russell and P. Norvig (1988). Prentice Hall.

[8] Principles of Database and Knowledge-Base Systems, Jeffrey D. Ullman (1988), Computer Science Press, Rockville, Maryland.
