Association Rules in the Relational Calculus

Oliver Schulte, Flavia Moser, Martin Ester and Zhiyong Lu
School of Computing Science
Simon Fraser University
Burnaby, B.C., Canada
{oschulte,fmoser,ester,zhiyongl}@cs.sfu.ca
August 20, 2021

Abstract

One of the most utilized data mining tasks is the search for association rules. Association rules represent significant relationships between items in transactions. We extend the concept of association rule to represent a much broader class of associations, which we refer to as entity-relationship rules. Semantically, entity-relationship rules express associations between properties of related objects. Syntactically, these rules are based on a broad subclass of safe domain relational calculus queries. We propose a new definition of support and confidence for entity-relationship rules and for the frequency of entity-relationship queries. We prove that the definition of frequency satisfies standard probability axioms and the Apriori property.

1 Introduction

One of the goals of data mining is to discover interesting relationships from data. Association rules express relationships that hold with sufficient frequency but not always. For example, it may be the case that not all managers earn over $60,000 a year, but that 90% of managers do. The logical form of an association rule is that of an implication p → q where p and q hold together sufficiently often (the "support" of the rule) and q holds sufficiently often given that p holds (the "confidence" of the rule). The traditional concept of association rules severely limits the complexity of the expressions p and q and thereby limits the class of relationships a data miner can capture. Essentially, p and q may be only simple conjunctions, like an itemset. Thus we cannot have rules based on Boolean combinations, such as negations or nested combinations.
An example of a relationship involving a negation would be a negative factor, such as "students who have not taken an introductory database course do poorly in data mining courses". An example of a nested Boolean combination would be "students who are math majors or computer science majors, and who have done well in a discrete mathematics course or in an algorithms course, do well in complexity theory". Another class of relationships that association rules cannot express involves quantification and relating objects to each other. An example would be the rule "residents who have a neighbour with a high income tend to have a high income themselves".

The goal of this paper is to extend the concept of an association rule to a large class of expressions that we refer to as entity-relationship queries (ER queries). Intuitively, entity-relationship queries express dependencies among entities and their properties. Entity-relationship queries are a large subclass of the safe queries. Safe queries correspond to an expressive subset of first-order logic that allows for nested Boolean expressions and quantification. We provide a definition of the frequency of an entity-relationship query. This extends the notion of an association rule to implications of the form p → q where p ∧ q is an ER query; we refer to rules of this form as entity-relationship rules. From our definition of the frequency of an ER query we immediately obtain a definition of the support of an ER rule, namely the frequency fr(p ∧ q).

TV-Program(Prog-Name: string)
TV-Station(Station-Name: string, Area: integer)
WeekdayTV(TV-Program: string, TV-Station: string, Viewers: integer, Sponsor: string)
WeekendTV(TV-Program: string, TV-Station: string, Viewers: integer, Sponsor: string)

Table 1: A relational schema for a TV survey model. Key fields are underlined.
The schema lists TV programs and stations, and records for each combination of weekday program and station how many viewers view the program on that station, and who sponsors the program. The same information is recorded for weekend programs.

Our definition of frequency for ER queries generalizes previous work on defining association rules in a multi-relational setting. [1] discusses extending itemset rules with negations and motivates the usefulness of this extension. The query extension approach of the Warmr system [4] presents a special class of entity-relationship rules that allows conjunctions of nonnegated statements and existential quantification. Our concept of ER rules features in addition negations, universal quantification, nested quantifiers, and nested Boolean combinations. Thus one contribution of this paper is an extended rule format. A characteristic that distinguishes our approach from previous work is that previous approaches assume a given target table that defines a base set of tuples for evaluating the support of a query. In contrast, we start with a query and define a natural base set of tuples for evaluating the support of the query. We can think of this approach as dynamically generating entity sets for a given query rather than evaluating queries with respect to a fixed entity set. Thus the second main contribution of this paper is a new definition of support for rules in our extended format.

The paper is organized as follows. First we review basic relational database concepts such as the relational schema and the domain relational calculus. Then we introduce the concept of an entity-relationship query and define the frequency of a query in this class of queries. This definition provides the basis for the notion of an entity-relationship rule and for defining the support of an entity-relationship rule.
We compare entity-relationship queries to frequent itemsets and to the rule language of the Warmr system. The final section establishes several important formal properties of query frequencies as we define them and shows that they satisfy the Apriori property, that is, the frequency of a conjunction is no greater than the frequency of its conjuncts.

2 Entities in the Domain Relational Calculus

This section presents standard background material from database theory. The first subsection reviews relational schemas, and introduces the new concept of an entity field. Semantically, entity fields are those that store values (constants) that refer to entities. The second subsection defines the standard notion of a safe query in the domain relational calculus, and the third introduces a subclass of safe queries that we term entity-relationship queries.

2.1 Entities in Relational Schemas

We begin with a standard relational schema containing a set of tables, each with key fields, descriptive attributes, and possibly foreign key pointers. We use the notation T to refer to a generic table that may represent either an entity set or a relationship set, and we write T_i when an index is needed. A field named name in table T is denoted by T.name. Table 1 shows a relational schema for a TV survey database; this example is adapted from [6, Sec.2]. Tables 2–4 display relation instances for the TV survey schema. We assume that the tables in the relational schema can be divided into entity tables and relationship tables. This is the case whenever a relational schema is derived from an entity-relationship model (ER model) [8, Ch.2.2]. Intuitively, an entity table corresponds to a type of entity, and a relationship table represents a relation between entity types.
In our TV survey example, there are two types of entities: TV programs represented in the TV-Program table, and TV stations represented in the TV-Station table. We now introduce two assumptions concerning the relational schema that facilitate the definition of entity-relationship queries and their frequencies.

TV-Program    TV-Station  Viewers  Sponsor
Gilmore       Global      10       Avon
Gilmore       CBS         12       La Senza
Hockey Night  CBC         20       RBC

Table 2: Television Survey: Weekday TV.

TV-Program    TV-Station  Viewers  Sponsor
Gilmore       Global      8        Avon
Hockey Night  CBC         14       Schwab
Simpsons      CBS         10       RBC
Daily Show    CBC         6        La Senza

Table 3: Television Survey: Weekend TV.

Unary Key Assumption. We assume that every entity table has a single key field.

The advantage of the unary key assumption is that, given this assumption, a single key field in the relational schema refers to a single entity. The assumption holds in our TV survey schema because the two entity tables have key fields TV-Program.Prog-Name and TV-Station.Station-Name respectively. Although it is not always natural to define entities with a single key field, there is no loss of generality because we can always form a single composite key field from a list of key fields. For example, if in a Professor table there are two key fields FirstName, LastName, we can form a composite key field ⟨FirstName, LastName⟩. Our second assumption is the following.

Global Name Assumption. We assume that for every entity e, there is a unique constant c such that in every table, the constant c denotes entity e.

The global name assumption is important because it allows us to recognize when the same entity occurs in different tables. In the AI literature, a similar assumption is often referred to as the "unique name assumption" [7, Ch.14].
The assumption does not amount to a loss of generality because if the same constant c is used in different tables to refer to different entities, we can simply index c to distinguish these occurrences. For example, if we have two different transaction tables Transaction1 and Transaction2, and there is a transaction 1 in both, we could change the entry in the first table to refer to 1-1 and in the second table to refer to 1-2. A natural alternative to indexing constants would be to adopt a convention to the effect that a key field T.key in table T refers to different entities than key field T′.key in table T′ if and only if the names of the key fields in the two tables are different. For example, if we have a table for Employees and another for Managers, labelling the key field in each table as "ssn" indicates that a given social security number refers to the same person no matter where it appears. In contrast, labelling the key field in the Transaction1 table "T1-number" and the key field in the Transaction2 table "T2-number" indicates that the transaction numbers in different tables refer to different transactions.

Station-Name  Area
Global        1
CBS           2
CBC           3

Table 4: Television Survey: Stations and Areas.

Symbol Type           Notation              Comment
Constants             c_1, c_2, ...         At most countably many constants
Predicate Symbols     P_1, P_2, ..., P_k    Exactly one predicate for each table T_i
Logical Symbols       ∃, ∀, ∧, ∨, ¬
Comparison Operators  =, <, >, ≤, ≥, ≠

Table 5: The Basic Vocabulary of our DRC language for a given database schema D with tables T_1, ..., T_k.

In many applications, the global name assumption is enforced through foreign key constraints. To illustrate, in the TV example, we may suppose that the field WeekdayTV.TV-Station is a foreign key pointer to the field TV-Station.Station-Name, and that the field WeekendTV.TV-Station is a foreign key pointer to the same field.
So the string constant "CBS" refers to the CBS network represented in the TV-Station table, whether "CBS" appears in an instance of the WeekdayTV relation or in an instance of the WeekendTV relation.

Given the unary key and global name assumptions, the following is a valid definition of how tables, key fields and constants are associated with entities.

Definition 1 Let D be a database instance.
1. An entity table is a table T with a single key field.
2. A field is an entity field if (1) the field is the key of an entity table, or (2) the field is a foreign key pointer to the key of an entity table.
3. A constant c is an entity constant if c appears in an entity field.

Examples. Let D be the TV survey database instance from Tables 2–4. The entity fields are TV-Program.Prog-Name, TV-Station.Station-Name, WeekdayTV.TV-Program, WeekdayTV.TV-Station, WeekendTV.TV-Program, WeekendTV.TV-Station. Entity constants include "CBS" and "Simpsons".

Next we review the domain relational calculus, which is a logical query language based on a given relational schema.

2.2 Safe Queries in the Domain Relational Calculus

We first define the formal language of the domain relational calculus, including the well-formed formulas of the calculus. Then we define an important subclass of formulas known as safe queries. Our presentation follows the standard approach, see for example [8, Ch.3].

2.2.1 The Formal Language of the Domain Relational Calculus

In the domain relational calculus (DRC), for every table T_i in the database schema there is exactly one predicate P_i in the logical language. The number of fields in the table T_i is the arity of the predicate P_i. If T_i is an entity table, then P_i is an entity predicate. By the unary key assumption, an entity table T_i has a single key field; we adopt the convention that the key field is the first argument in the entity predicate P_i.
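Definition 1 and the arity convention above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the schema and the helper names `entity_tables`, `entity_fields` and `arity` are hypothetical, the composite key of the two relationship tables is assumed to be (TV-Program, TV-Station), and the foreign keys are the ones the text supposes for the TV example.

```python
# Hypothetical encoding of the Table 1 schema (layout chosen for illustration).
SCHEMA = {
    "TV-Program": {"fields": ["Prog-Name"], "key": ["Prog-Name"], "foreign_keys": {}},
    "TV-Station": {"fields": ["Station-Name", "Area"], "key": ["Station-Name"],
                   "foreign_keys": {}},
}
for _name in ("WeekdayTV", "WeekendTV"):
    SCHEMA[_name] = {
        "fields": ["TV-Program", "TV-Station", "Viewers", "Sponsor"],
        # Assumption: the relationship tables are keyed by program and station.
        "key": ["TV-Program", "TV-Station"],
        # field -> (referenced table, referenced field)
        "foreign_keys": {
            "TV-Program": ("TV-Program", "Prog-Name"),
            "TV-Station": ("TV-Station", "Station-Name"),
        },
    }


def entity_tables(schema):
    """Definition 1.1: tables with a single key field."""
    return {name for name, table in schema.items() if len(table["key"]) == 1}


def entity_fields(schema):
    """Definition 1.2: keys of entity tables, plus foreign keys pointing to them."""
    entities = entity_tables(schema)
    fields = {f"{name}.{schema[name]['key'][0]}" for name in entities}
    for name, table in schema.items():
        for field, (target, target_field) in table["foreign_keys"].items():
            if target in entities and schema[target]["key"][0] == target_field:
                fields.add(f"{name}.{field}")
    return fields


def arity(schema, table_name):
    """Arity of the predicate P_i: the number of fields of table T_i."""
    return len(schema[table_name]["fields"])


print(sorted(entity_fields(SCHEMA)))
print(arity(SCHEMA, "WeekdayTV"))
```

Under these assumptions the computed entity fields coincide with the six fields listed in the Examples following Definition 1.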
The complete logical vocabulary of the DRC is listed in Table 5.

Example. In the TV survey model, we have the predicates shown in Table 6. Thus we may write WeekdayTV("Gilmore Girls", "CBS", 12, "La Senza") to assert that "Gilmore Girls" is shown on "CBS" on weekdays, with 12,000 viewers, and sponsored by La Senza. The notion of a well-formed formula is the usual one for this vocabulary.

Table 6: Predicates of our Logical Query Language for the TV survey model.

Predicates             Arity
TV-Program(PN)         1
TV-Station(SN,A)       2
WeekdayTV(PN,SN,V,S)   4
WeekendTV(PN,SN,V,S)   4

Table 7: Examples of Valid Expressions for the database schema for the TV survey.

Expression                                        Type
V ≥ 10                                            atomic formula with V free
∃S.∃SN.∃V. WeekdayTV(P, SN, V, S)                 quantified formula with P free
∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10 ∧      conjunction of
  ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V ≥ 10      quantified formulas

Definition 2 Well-Formed Formulas of the Domain Relational Calculus
1. A constant c or variable X is a term.
2. If P is a predicate symbol of arity k and t_1, ..., t_k are k terms, then P(t_1, ..., t_k) is an atomic formula.
3. If t, t′ are two terms and θ is a comparison operator, then a comparison t θ t′ is an atomic formula.
4. If F is a formula and X is a variable, then ¬F, ∃X.F, ∀X.F are formulas.
5. If F_1 and F_2 are formulas, then so are F_1 ∧ F_2 and F_1 ∨ F_2.
6. All formulas are formed by the repeated application of the previous rules.

Examples. Table 7 gives examples of valid expressions and their types pertaining to the TV survey.

We next define the result or output of a DRC query. The first step is to define what ground formulas are satisfied in a database instance D; a formula is ground if it contains no variables.
The second step is to define which closed queries F with no free variables are satisfied in a database instance D; as usual in logic, we write D ⊨ F. Let F[X_1/t_1, ..., X_k/t_k] be the formula that results from replacing all free occurrences of each X_i in F with the term t_i.

1. If t, t′ are two constants, then D ⊨ t θ t′ iff t θ t′ holds.
2. D ⊨ P_i(c_1, ..., c_k) iff ⟨c_1, ..., c_k⟩ is a tuple in table T_i.
3. D ⊨ F_1 ∨ F_2 iff D ⊨ F_1 or D ⊨ F_2; similarly D ⊨ F_1 ∧ F_2 iff D ⊨ F_1 and D ⊨ F_2; and D ⊨ ¬F_1 iff D ⊭ F_1.
4. D ⊨ ∃X.F iff there is a constant c in the DRC language such that D ⊨ F[X/c]; similarly D ⊨ ∀X.F iff for all constants c we have D ⊨ F[X/c].

Let F(X_1, ..., X_m) be a query with free variables X_1, ..., X_m. Then on database instance D the query F returns the set of all tuples that make F true when substituted in F. Formally, we write

tuples_D(F) ≡ {⟨c_1, ..., c_m⟩ : D ⊨ F[X_1/c_1, ..., X_m/c_m]}.

This definition assumes that the constants in the language include all constants that appear in the database tables, which involves no loss of generality.

Examples. Let D be the database instance from Tables 2–4. Table 8 shows the results of our example queries for this database instance.

Query Formula F                                        Result tuples_D(F)
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10      {"Gilmore", "Hockey Night"}
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V ≥ 10      {"Hockey Night", "Simpsons"}
F_1 ∧ F_2                                              {"Hockey Night"}

Table 8: Results of Query Formulas on the database instance D from Tables 2–4.

Query Formula F                                        Safe?
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10      yes
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V ≥ 10      yes
F_1 ∧ F_2                                              yes
F_1 ∨ F_2                                              yes
¬F_1                                                   no
¬F_1 ∨ F_2                                             no
F_1 ∧ ¬F_2                                             yes

Table 9: Examples of safe and unsafe queries for the TV survey database schema of Table 1.

2.2.2 Safe Queries

It is customary to restrict the set of formulas that may serve as queries, and in particular the way their free variables ("query variables" for short) are used, to ensure that the result set of tuples satisfying a query formula is bounded and "domain-independent" [8, Ch.3.8]. To this end we adopt the notion of a safe query. The intuition behind this concept is that the results of safe queries should be restricted to selection conditions applied to (combinations of) tables in the database. For example, the query ¬TV-Program(X) with free variable X is not safe because the range of constants satisfying this query is not bounded by any table in the database. The key idea in the definition of safe query is to conjoin a query formula F to a restriction of the form P ∧ F where P is a basic predicate in the language and hence refers to a table in the database. As is well known, the expressive power of safe queries is exactly equivalent to that of relational algebra [2]. Safe queries are formally defined as follows [8, Ch.3.8].

1. Replace the ∀X quantifier by ¬∃X¬.
2. Whenever ∨ is used to connect F_1 ∨ F_2, the two formulas have the same set of free variables.
3. Consider any maximal subformula consisting of the conjunction of one or more formulas F_1 ∧ ... ∧ F_m. Then all variables X appearing free in any of the F_i must be limited as follows. The variable X must be free in some non-negated F_i satisfying one of the following conditions.
   (a) F_i is not a comparison.
   (b) F_i is X = c where c is a constant.
   (c) F_i is X = Y, and Y is limited.
4. A ¬ operator may apply only to a formula in a conjunction of the type discussed in the previous rule.

Examples. Table 9 gives examples of safe and unsafe queries for the TV database schema from Table 1. This completes our review of basic concepts from relational database theory. We now come to the restriction of safe queries to entity-relationship queries.

2.3 Definition of Entity-Relationship Queries

The basic idea behind our definition of an ER query is that free variables should be limited in such a way as to guarantee that they must refer to entities. Intuitively, an ER query is one whose free variables refer to entities. The precise definition is as follows.

Definition 3 Let D be a database instance.
1. A variable X is an entity variable candidate for a DRC formula F if
   (a) X is not quantified over in any part of F;
   (b) if an expression X θ t appears in F, then θ is = or ≠, and if t is a constant c, then c is an entity constant in D;
   (c) if an expression P(..., X, ...) appears in F, then the argument position of X in P(..., X, ...) is an entity field.
2. A variable X is an entity variable for F if
   (a) X is an entity variable candidate for F, and
   (b) if an expression X = Y or X ≠ Y appears in F, then Y is an entity variable candidate.
3. An entity-relationship (ER) query F for database instance D is a safe DRC query such that all the free variables in F are entity variables for F given D.

Examples. Let D be the TV survey database instance from Tables 2–4. In the formula ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10 the variable P is an entity variable, and so the formula is an ER query. The formula ∃S.∃SN. WeekdayTV("Gilmore Girls", SN, V, S) is safe but not an ER query because the free variable V is not an entity variable.

3 The Frequency of Entity-Relationship Queries

Our basic idea is that the limiting conditions in safe queries specify the domain from which values for a free variable X are to be drawn.
Once the domain for the free variables is defined for a given formula F, we can take the frequency of the formula F to be the number of assignments to the free variables that satisfy the formula divided by the size of the domain for the formula. Safe queries are a natural class of queries for this approach because these queries specify the range from which result tuples may be drawn by restricting these results to subsets of tables in the database (cf. Section 2.2.2).

The main issue in our definition concerns the correct domain for conjunctions or intersections. For a simple example, consider a database schema with two entity tables Professor and Customer. The query Professor(X) ∧ Customer(X) returns entities that are both professors and customers. What should be the base domain for this query? If there are many more customers than professors, we may get quite different frequency counts if we take the base domain to be Professor than if we take it to be Customer. So neither of these seems the right choice. Intuitively the base domain should be a symmetric function of the two classes mentioned in the query. The two natural symmetric set-theoretic operations are intersection and union. If we take the intersection as the base domain, the frequency of conjunctions without further selection conditions is always 100%, which does not seem right. In particular for our ultimate goal of defining the support of association rules, this is unsatisfactory. Our proposal is therefore to use the union of the two entity sets involved in the conjunction. Another way to look at the union is that it represents a kind of closed world assumption: if Professors and Customers are the only entity types mentioned in the selection conditions of the query, then the members of these entity types are exactly the potential answers to the query.
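To make the comparison of candidate base domains concrete, the following minimal sketch computes the frequency of Professor(X) ∧ Customer(X) under each choice discussed above. The two entity sets, their sizes, and the helper name `frequency` are invented for illustration and do not come from the paper.

```python
from fractions import Fraction

# Hypothetical entity sets, invented for illustration only.
professors = {"ann", "bob"}
customers = {"bob"} | {f"cust{i}" for i in range(1, 8)}  # 8 customers in total

# Answers to Professor(X) ∧ Customer(X): entities present in both tables.
answers = professors & customers


def frequency(result, base_domain):
    """Number of satisfying entities divided by the size of the base domain."""
    return Fraction(len(result), len(base_domain))


freq_by_base = {
    "Professor": frequency(answers, professors),
    "Customer": frequency(answers, customers),
    "intersection": frequency(answers, professors & customers),
    "union": frequency(answers, professors | customers),
}

for base, freq in freq_by_base.items():
    print(f"{base:12s} {freq}")
```

With these sets the asymmetric choices disagree (1/2 versus 1/8), the intersection always yields 1, and the union gives the single symmetric value 1/9.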
The closed world assumption is also the basis for our frequency definition for queries with negation. For example, consider a safe query such as Professor(X) ∧ ¬Customer(X). Since Professors and Customers are the only entity types mentioned in this query, we take the base domain again to be the union of these two sets. The fact that Professors are mentioned positively and Customers negatively does not make a difference to the base domain, but it does make a difference to the result of the query and hence to its frequency. On the basis of this proposal, we can now recursively assign a domain to an entity variable in a formula F given a database instance D. We begin with just one free query variable and then tackle the more complicated case of queries with more than one free variable.

Query Formula F, with its Reference Domain dom_D(F, P)

F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10
    dom_D(F_1, P) = {"Gilmore", "Hockey Night"}
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V ≥ 10
    dom_D(F_2, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_3 = F_1 ∧ F_2
    dom_D(F_3, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_4 = F_1 ∨ F_2
    dom_D(F_4, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}
F_5 = F_1 ∧ ¬F_2
    dom_D(F_5, P) = {"Gilmore", "Hockey Night", "Simpsons", "Daily Show"}

Table 10: Reference Domains for various formulas in the TV survey database instance D from Tables 2–4.

3.1 Definition of Frequency for Queries With One Free Query Variable

We denote the base domain of an entity variable X in a query F relative to a database instance D as dom_D(F, X). As we think of variable X as referring to the domain dom_D(F, X), we term dom_D(F, X) the reference domain of X in the context of query F.

Definition 4 Let D be a database instance with ER formula F.
1. If F is P_i(t_1, ..., t_k), and X occurs in F, then dom_D(F, X) = π_X[tuples_D(F)]. If X is not a free variable in F, then dom_D(F, X) = ∅. Here we think of tuples_D(F) as a relation whose columns correspond to the free variables of F. For example, the query p(X, Y, Z) returns a relation with triples, and we can think of the first column as named X and the second as named Y. The expression π refers to the projection operator of relational algebra (with elimination of duplicates).
2. Let F be a single atomic comparison of the form Y θ t where t is either a variable or a constant. If F is X = c, then dom_D(F, X) = {c}. Otherwise dom_D(F, X) = ∅.
3. If F is ¬G for some formula G, then dom_D(F, X) = dom_D(G, X).
4. If F is F_1 ∨ F_2 or F_1 ∧ F_2, then dom_D(F, X) = dom_D(F_1, X) ∪ dom_D(F_2, X).
5. If F is ∃Y.G, where Y ≠ X, then dom_D(F, X) = dom_D(G, X). If F is ∃X.G, then dom_D(F, X) = ∅.

Examples. Let D be the TV survey database instance from Tables 2–4. Table 10 gives examples of reference domains for various ER queries.

As this definition shows, we think of basic predicates as specifying the range from which entities are drawn. Conditions of the form p(t_1, ..., X, ..., t_k) or X = c we view as "direct bounds" that determine the reference domain of X. Variable equations of the form X = Y we view as "selection conditions" that are applied after an entity has been specified. These do not affect the reference domain of X but only the result of the query. Another type of selection condition is a restriction on descriptive attributes, such as V ≥ 10 in the queries in Table 10. Now the frequency of an ER query is defined as follows.

Definition 5 Let F be an ER query with free variable X such that dom_D(F, X) ≠ ∅. Then

fr_D(F) ≡ |tuples_D(F)| / |dom_D(F, X)|.

In Section 5 we establish several formal properties of the frequency of a query according to this definition, for example that the frequency is a number between 0 and 1.

Examples. Let D be the TV survey database instance from Tables 2–4. Table 11 illustrates the frequencies of various queries.

Query Formula F                                        Frequency fr_D(F)
F_1 = ∃S.∃SN.∃V. WeekdayTV(P, SN, V, S) ∧ V ≥ 10      1
F_2 = ∃S.∃SN.∃V. WeekendTV(P, SN, V, S) ∧ V ≥ 10      1/2
F_1 ∧ F_2                                              1/4
F_1 ∨ F_2                                              3/4
F_1 ∧ ¬F_2                                             1/4

Table 11: Frequencies for various formulas in the TV survey database instance D from Tables 2–4.

3.2 Definition of Frequency of Queries With More Than One Free Variable

We assign a domain to every tuple of entity variables in a formula F given a database instance D, which we denote as dom_D(F, {X_1, ..., X_m}). Our basic idea is to consider a result tuple ⟨c_1, ..., c_m⟩ as denoting a composite entity formed by combining m single entities. For example, consider the rule

Neighbour(X, Y) ∧ (∃I. Income(X, I) ∧ I > $100,000) → (∃I. Income(Y, I) ∧ I > $100,000).

(The symbol → does not denote logical implication but defines an association rule; see Section 4.) This says that if X has an income over $100,000, then it is likely that a neighbour Y of X also has an income over $100,000. The support of this rule is the frequency of the query

Neighbour(X, Y) ∧ (∃I. Income(X, I) ∧ I > $100,000) ∧ (∃I. Income(Y, I) ∧ I > $100,000).

This query has two free variables X and Y. The reference domain comprises the entries in the Neighbour table, that is, the pairs ⟨X, Y⟩ in the table. Other examples of natural composite entities include relations like reservations or purchases.
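The support computation for the neighbour rule can be sketched as follows. The Neighbour and Income instances below are hypothetical (the paper gives no data for this example), and the helper name `has_high_income` is likewise an assumption. The reference domain is taken to be the set of pairs in the Neighbour table, as described above.

```python
from fractions import Fraction

# Hypothetical instances, invented for illustration only.
neighbour = {("ann", "bob"), ("bob", "ann"), ("bob", "cat"), ("cat", "bob")}
income = {("ann", 120_000), ("bob", 150_000), ("cat", 40_000)}
THRESHOLD = 100_000


def has_high_income(person):
    """True iff ∃I. Income(person, I) ∧ I > THRESHOLD holds."""
    return any(p == person and amount > THRESHOLD for p, amount in income)


# Reference domain for {X, Y}: the pairs stored in the Neighbour table.
reference_domain = neighbour

# Pairs satisfying the conjunction of the rule's body and head.
answers = {
    (x, y)
    for (x, y) in neighbour
    if has_high_income(x) and has_high_income(y)
}

support = Fraction(len(answers), len(reference_domain))
print(support)
```

On this toy instance two of the four neighbour pairs consist of two high earners, so the support is 1/2.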
The idea of treating tuples in a relational table as composite "individuals" is familiar in the propositionalization literature [5, 3] (for example, chemical molecules may be treated as single entities although molecules are composed of different elements that are also represented in the relational schema). Applying this idea requires a further constraint on ER queries: the free variables {X_1, ..., X_m} must be "bound together" in a limiting condition rather than separately. For example, the query P(X) ∧ X = Y is a safe ER query but the answer pairs ⟨x, y⟩ are not bound to the key fields of any tuple; an example of the same character is the query P(X) ∧ Q(Y). To rule out such cases, we impose the following condition.

Definition 6 A literal is an atomic formula or its negation. An ER query F is valid for variables X_1, ..., X_m if for every maximal conjunction L = L_1 ∧ ... ∧ L_k consisting only of literals, L contains a conjunction of the form X_1 = c_1 ∧ ··· ∧ X_m = c_m, or L contains a conjunct P(t_1, ..., t_k) where all variables {X_1, ..., X_m} occur in P(t_1, ..., t_k). An ER query F is valid if F is valid for the set of its free variables.

Examples follow below in this section. In the case with only one free query variable X, the definition of safe query implies that every ER query is valid.

Now let us consider the definition of a reference domain for valid ER queries with one or more free variables. As in the case with just one query variable, we term dom_D(F, {X_1, ..., X_m}) the reference domain of {X_1, ..., X_m} in the context of query F. Consider the basic case of an atomic formula F = P(t_1, ..., t_m) first. In keeping with the idea behind safe queries, we can think of such formulas as specifying a basic range for the result tuples in a query.
So suppose that the free variables in the atomic formula are $Y_1, Y_2, \ldots, Y_k$. If our query variables $X_1, \ldots, X_m$ are not all contained in the set $\{Y_1, Y_2, \ldots, Y_k\}$, we consider that the "composite key" $X_1, \ldots, X_m$ does not appear in the query, and $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$. Otherwise we consider the query result $\mathit{tuples}_D(F)$ as a relation with $k$ columns, of which $m$ are named $X_1, \ldots, X_m$. For example, the query $p(X, Y, Z)$ returns a relation with triples, and we can think of the first column as named $X$ and the second as named $Y$. Thus we can take $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F)$ to be the reference domain of the entity variables $X_1, X_2, \ldots, X_m$ in the query $F$. This leads to the following inductive definition. The main difference from the definition for a single query variable is that we need to treat conjunctions like $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$ as a single compound statement.

Definition 7 Let $D$ be a database instance with ER formula $F$ and let $X_1, \ldots, X_m$ be a list of variables.

1. If $F$ is $P(t_1, \ldots, t_k)$, and all variables $X_1, \ldots, X_m$ occur in $P(t_1, \ldots, t_k)$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F)$, where $\pi$ is the projection operation of relational algebra. Otherwise $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.
2. Let $F$ be a single atomic comparison of the form $Y \,\theta\, t$ where $t$ is either a variable or a constant.
   (a) Suppose that $m = 1$, $Y = X_1$ and the comparison is $X_1 = c$ (i.e., we just have a single free variable $X_1$ and the atomic formula requires $X_1$ to be equal to a constant $c$). In that case $\mathit{dom}_D(F, \{X_1\}) = \{c\}$.
   (b) Otherwise $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.
3. Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.
   (a) If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(C, \{X_1, \ldots, X_m\}) \cup \{\langle c_1, \ldots, c_m \rangle\}$.
   (b) Otherwise $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathit{dom}_D(C_i, \{X_1, \ldots, X_m\})$.
4. If $F$ is $F_1 \vee F_2$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(F_1, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F_2, \{X_1, \ldots, X_m\})$.
5. If $F$ is $\neg G$ for some formula $G$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(G, \{X_1, \ldots, X_m\})$.
6. If $F$ is $\exists Y. G$, where $Y \notin \{X_1, \ldots, X_m\}$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(G, \{X_1, \ldots, X_m\})$. If $F$ is $\exists X_i. G$ for some $X_i \in \{X_1, \ldots, X_m\}$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \emptyset$.

It is easy to check that this definition agrees with Definition 4 for queries with just one free variable.

Examples. Consider the query "find all program-station pairs that achieve a viewership of over 10,000 on both weekdays and weekends". In the domain relational calculus, this query may be formulated as $[\exists S. \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10] \wedge [\exists S. \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10]$.

Table 12: Reference domains for various formulas in the TV survey database instance $D$ from Tables 2–4. The free query variables are $P$ and $SN$, corresponding to program-station pairs.

Query formula $F$ | Reference domain $\mathit{dom}_D(F, X)$
$F_1 = \exists S. \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10$ | $\{\langle \text{``Gilm.''}, \text{``Glo.''} \rangle, \langle \text{``Gilm.''}, \text{``CBS''} \rangle, \langle \text{``Hock. N.''}, \text{``CBC''} \rangle\}$
$F_2 = \exists S. \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10$ | $\{\langle \text{``Gilm.''}, \text{``Glo.''} \rangle, \langle \text{``Hock. N.''}, \text{``CBC''} \rangle, \langle \text{``Simps.''}, \text{``CBS''} \rangle, \langle \text{``Daily Sh.''}, \text{``CBC''} \rangle\}$
$F_3 = F_1 \wedge F_2$ | $\{\langle \text{``Gilm.''}, \text{``Glo.''} \rangle, \langle \text{``Gilm.''}, \text{``CBS''} \rangle, \langle \text{``Hock. N.''}, \text{``CBC''} \rangle, \langle \text{``Simps.''}, \text{``CBS''} \rangle, \langle \text{``Daily Sh.''}, \text{``CBC''} \rangle\}$
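As a hedged illustration of clauses 3(b) and 4 of Definition 7, the following Python sketch treats reference domains as sets of tuples and combines them by union. The domain values are copied from Table 12; the function name `combine_domains` is an assumption made for this example and does not come from the paper.

```python
def combine_domains(*domains):
    """Union of sub-formula reference domains (Definition 7, clauses 3(b) and 4).

    `combine_domains` is a hypothetical helper name used only in this sketch.
    """
    result = set()
    for d in domains:
        result |= set(d)
    return result


# Reference domains of F1 (weekday) and F2 (weekend) as listed in Table 12.
dom_f1 = {("Gilm.", "Glo."), ("Gilm.", "CBS"), ("Hock. N.", "CBC")}
dom_f2 = {
    ("Gilm.", "Glo."),
    ("Hock. N.", "CBC"),
    ("Simps.", "CBS"),
    ("Daily Sh.", "CBC"),
}

# F3 = F1 AND F2 is a conjunction without key-equality conjuncts, so its
# reference domain is the union of the conjuncts' domains.
dom_f3 = combine_domains(dom_f1, dom_f2)
print(len(dom_f3))  # 5
```

The union has five pairs because two pairs occur in both domains, which matches the third row of Table 12.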
Table 12 shows the calculation of the reference domain for this conjunctive query on the database instance of Tables 2–4. Now the frequency of an ER query is defined as follows.

Definition 8 Let $F$ be an ER query whose free variables are $X_1, \ldots, X_m$, where $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$. Then
$$\mathit{fr}_D(F) \equiv \frac{|\mathit{tuples}_D(F)|}{|\mathit{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$

Table 13 illustrates the frequencies of various queries.

Table 13: Frequencies for various formulas in the TV survey database instance $D$ from Tables 2–4. The free query variables are $P$ and $SN$, corresponding to program-station pairs.

Query formula $F$ | Frequency $\mathit{fr}_D(F)$
$F_1 = \exists S. \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V > 10$ | 2/3
$F_2 = \exists S. \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V > 10$ | 1/4
$F_1 \wedge F_2$ | 1/5
$F_1 \vee F_2$ | 2/5
$F_1 \wedge \neg F_2$ | 1/5

4 Entity-Relationship Rules

We finally obtain the notion of an ER association rule, or ER rule for short.

4.1 Definition of Confidence and Support for ER rules

Given the concepts we have developed so far, the definitions of confidence and support for an entity-relationship rule are straightforward.

Definition 9 Let $D$ be a database instance.

1. An ER association rule is an implication of the form $F \rightarrow G$, where the free variables of $G$ are the same as or contained in the free variables of $F$, and $F \wedge G$ is a valid ER query.
2. The confidence of an ER association rule $F \rightarrow G$ is given by
$$\mathit{con}_D(F \rightarrow G) \equiv \frac{|\mathit{tuples}_D(F \wedge G)|}{|\mathit{tuples}_D(F)|}.$$
3. The support of an ER association rule $F \rightarrow G$ is given by $\mathit{support}_D(F \rightarrow G) \equiv \mathit{fr}_D(F \wedge G)$.

As usual with association rules, the implication $F \rightarrow G$ does not indicate logical implication (whenever $F$ is true, so is $G$) but instead denotes a probabilistic relationship.

Example. Let $D$ be the TV survey database instance from Tables 2–4. Let $F_1$ be the formula $\exists S. \exists SN. \exists V.\, \mathit{WeekdayTV}(P, SN, V, S) \wedge V \geq 10$ and let $F_2$ be the formula $\exists S. \exists SN. \exists V.\, \mathit{WeekendTV}(P, SN, V, S) \wedge V \geq 10$. Consider the rule $F_1 \rightarrow F_2$. The support of this rule is $\mathit{fr}_D(F_1 \wedge F_2) = 1/4$ (see Table 11). The confidence is
$$\frac{|\{\text{``Gilmore''}, \text{``Hockey Night''}\} \cap \{\text{``Hockey Night''}, \text{``Simpsons''}\}|}{|\{\text{``Gilmore''}, \text{``Hockey Night''}\}|} = 1/2.$$

Definition 9 completes our goal of providing a definition of confidence and support for general entity-relationship queries.

4.2 Comparison With Other Rule Languages

This section gives a brief comparison of our rule language and frequency definition to related rule languages. It is easy to see that the classic association rule approach based on frequent itemsets is a special case. For example, suppose we have two entity tables, Transactions(number) that stores transactions and Item(name) for items, and a relational table TransItems(TransNumber, ItemName) that indicates which items appear in which transactions. Then for a given item, say "cola", the query $\mathit{Transactions}(X) \wedge \mathit{TransItems}(X, \text{``cola''})$ returns the set of transactions involving "cola", and the frequency of this query is the frequency of these transactions among all transactions.

Antonie and Zaïane [1] extend itemset rules with negations, and survey a number of search algorithms for finding frequent itemsets with negative conditions. Their search procedure is based on correlation analysis.

The Warmr system [4] considers queries that are conjunctions of literals (e.g., $P(X, Y)$). The user specifies a target table $T$; the free query variables in a Warmr query are then bound to the key fields of $T$. If WeekdayTV is our target table, we would have two free query variables, $P$ for program and $SN$ for station. All other variables are implicitly existentially quantified.
For example, if Customer is the target table, the Warmr formula $\mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{``cola''})$ translates into the domain relational calculus as $\exists C.\, \mathit{Customer}(A) \wedge \mathit{Child}(A, C) \wedge \mathit{Buys}(C, \text{``cola''})$. If we assume that one of the conjuncts in a Warmr clause corresponds to the target table (e.g., $\mathit{Customer}(A)$), and all other appearances of the query variables are related to the target table by foreign key constraints (e.g., the first field in the Child table is a foreign key to the Customer table), then the reference domain as we have defined it is exactly the target table, and the frequency that Warmr assigns to a conjunction agrees with our definition. In this sense our definition of support for ER rules generalizes that for Warmr rules.

5 The Probability Axioms and A Priori Property

In order to ensure that Definition 8 yields well-defined probabilities, we verify three facts: (1) The frequency as defined never involves division by 0, so the frequency is well-defined. (2) The definition entails that frequencies are between 0 and 1 (inclusive). (3) The frequency of two mutually exclusive queries is the sum of their respective frequencies. This third property holds only with certain qualifications due to the restrictions on safe queries. The usual probability axioms include the requirement that (4) the probability of the whole space, or the "certain event", is 1. We discuss the extent to which this property holds for our definition of frequency. Finally, we show the Apriori property: frequencies of conjunctions decrease monotonically, which is important for lattice search methods.

For the first fact, we have the following result. The notion of a valid ER query was specified in Definition 6.

Proposition 10 Let $F$ be a valid ER query whose free variables are $X_1, \ldots, X_m$. Let $D$ be any database instance (without empty tables).
Then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.

Proof. If $F$ is valid, then for every maximal conjunction $L$ of literals that occurs in $F$, we have $\mathit{dom}_D(L, \{X_1, \ldots, X_m\}) \neq \emptyset$. Since the reference domains of more complex formulas are the union of the domains of their subformulas, it follows that $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) \neq \emptyset$.

The next proposition guarantees that the ratios assigned by Definition 8 are properly bounded between 0 and 1.

Proposition 11 Let $F$ be an ER query in which the variables $X_1, \ldots, X_m$ are free such that $F$ is valid for these variables. Let $D$ be a database instance. Then $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) \subseteq \mathit{dom}_D(F, \{X_1, \ldots, X_m\})$, where $\pi$ is the projection operation of relational algebra.

In the case in which $X_1, \ldots, X_m$ are exactly the free variables of $F$, we have $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) = \mathit{tuples}_D(F)$, so the proposition implies that the ratio $\frac{|\mathit{tuples}_D(F)|}{|\mathit{dom}_D(F, \{X_1, \ldots, X_m\})|}$ is between 0 and 1.

Proof. The proof is by induction on the structure of the ER formula $F$. We begin by noting two basic facts about valid formulas, which follow easily from Definitions 2, 6, and 7.

1. If $C = C_1 \wedge \cdots \wedge C_k$ is a maximal conjunction in $F$, then $C$ contains a conjunction $X_1 = c_1 \wedge \cdots \wedge X_m = c_m$ or a conjunct $C_i$ that is a valid ER query.
2. If $F_1 \vee F_2$ is a disjunction in $F$, then both of the disjuncts are valid ER queries.

• If $F$ is an atomic formula of the form $P(t_1, \ldots, t_k)$, then since $F$ is valid for $X_1, \ldots, X_m$, we have $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F)$.
• Let $F$ be a single atomic comparison of the form $Y \,\theta\, t$ where $t$ is either a variable or a constant. Since $F$ is valid, it must be of the form $X_1 = c$ where $m = 1$ (i.e., we just have a single free variable $X_1$ and the atomic formula requires $X_1$ to be equal to a constant $c$).
So $\mathit{dom}_D(F, \{X_1\}) = \{c\}$, and clearly $\pi_{X_1} \mathit{tuples}_D(F) \subseteq \{c\}$.
• Let $F$ be a maximal conjunction of $k > 1$ formulas, such that $F = C_1 \wedge \cdots \wedge C_k$.
  1. If $F$ is a conjunction of the form $C \wedge X_1 = c_1 \wedge \cdots \wedge X_m = c_m$, then $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(C, \{X_1, \ldots, X_m\}) \cup \{\langle c_1, \ldots, c_m \rangle\}$. Clearly $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) \subseteq \{\langle c_1, \ldots, c_m \rangle\}$, which is a subset of $\mathit{dom}_D(F, \{X_1, \ldots, X_m\})$.
  2. Otherwise $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \bigcup_{i=1}^{k} \mathit{dom}_D(C_i, \{X_1, \ldots, X_m\})$. Since $F$ is valid, by Observation 1 at least one of the conjuncts $C_i$ is valid. So by the inductive hypothesis, $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(C_i) \subseteq \mathit{dom}_D(C_i, \{X_1, \ldots, X_m\})$. Now since $F$ is a conjunction involving $C_i$, it follows that $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) \subseteq \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(C_i)$ and that $\mathit{dom}_D(C_i, \{X_1, \ldots, X_m\}) \subseteq \mathit{dom}_D(F, \{X_1, \ldots, X_m\})$, which establishes the claim for this case.
• If $F$ is $F_1 \vee F_2$, then by Clause 2 of the definition of a safe query, both $F_1$ and $F_2$ are valid and contain all the variables $\{X_1, \ldots, X_m\}$ as free variables. So $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) = \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F_1) \cup \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F_2)$. Also, by the inductive hypothesis, $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F_1) \subseteq \mathit{dom}_D(F_1, \{X_1, \ldots, X_m\})$ and $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F_2) \subseteq \mathit{dom}_D(F_2, \{X_1, \ldots, X_m\})$, and by definition $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(F_1, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F_2, \{X_1, \ldots, X_m\})$. So $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) \subseteq \mathit{dom}_D(F, \{X_1, \ldots, X_m\})$ as required.
• If $F$ is $\neg G$ for some formula $G$, then $F$ is not a safe query, hence not an ER query, and the claim holds vacuously.
• If $F$ is $\exists Y. G$, then $Y \notin \{X_1, \ldots, X_m\}$, since the variables $X_1, \ldots, X_m$ are free in $F$.
So $\mathit{dom}_D(F, \{X_1, \ldots, X_m\}) = \mathit{dom}_D(G, \{X_1, \ldots, X_m\})$, and $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(F) = \pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(G)$ by the semantics of the existential quantifier. Clearly if $F$ is valid, then so is $G$, so by the inductive hypothesis $\pi_{\langle X_1, \ldots, X_m \rangle} \mathit{tuples}_D(G) \subseteq \mathit{dom}_D(G, \{X_1, \ldots, X_m\})$, which completes the inductive proof.

The third fundamental property of probabilities is finite additivity: the frequency of two mutually exclusive events is the sum of the individual frequencies. The difficulty with this property is not that it fails for our frequency definition, but that it is not straightforwardly expressed in our language of safe queries. For example, a natural formulation of finite additivity would be to require that $\mathit{fr}_D(F) + \mathit{fr}_D(\neg F) = \mathit{fr}_D(F \vee \neg F)$. But if $F$ is a safe query, then $\neg F$ is not safe, so the frequency $\mathit{fr}_D(\neg F)$ is not defined. Another way to see the difficulty is to note that in standard probability theory (with a Boolean algebra of events), finite additivity is equivalent to the requirement that $\Pr(A) = 1 - \Pr(\bar{A})$, where $\bar{A}$ is the complement of event $A$. But this cannot be expressed as a requirement on safe queries, since the negation of a safe query is not itself safe.

However, we can show a qualified version of finite additivity. If $S$ and $F$ are valid safe queries with the same free variables, then the formulas $S \wedge F$ and $S \wedge \neg F$ are also valid safe queries. For these formulas we can show the following result.

Proposition 12 Let $S$ and $F$ be valid safe queries with the same free variables $\{X_1, \ldots, X_m\}$. Then for any database instance $D$ we have
$$\mathit{fr}_D([S \wedge F] \vee [S \wedge \neg F]) = \mathit{fr}_D(S \wedge F) + \mathit{fr}_D(S \wedge \neg F) = \frac{|\mathit{tuples}_D(S)|}{|\mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$

Proof.
This follows from the definitions. We have $\mathit{dom}_D([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathit{dom}_D([S \wedge F], \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D([S \wedge \neg F], \{X_1, \ldots, X_m\})$, and since $\mathit{dom}_D([S \wedge F], \{X_1, \ldots, X_m\}) = \mathit{dom}_D([S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})$, it follows that
$$\mathit{dom}_D([S \wedge F] \vee [S \wedge \neg F], \{X_1, \ldots, X_m\}) = \mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\}).$$
Clearly $\mathit{tuples}_D([S \wedge F] \vee [S \wedge \neg F]) = \mathit{tuples}_D(S)$, so
$$\mathit{fr}_D([S \wedge F] \vee [S \wedge \neg F]) = \frac{|\mathit{tuples}_D(S)|}{|\mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})|}.$$
Also,
$$\mathit{fr}_D(S \wedge F) = \frac{|\mathit{tuples}_D(S \wedge F)|}{|\mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})|} \quad \text{and} \quad \mathit{fr}_D(S \wedge \neg F) = \frac{|\mathit{tuples}_D(S \wedge \neg F)|}{|\mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})|},$$
so
$$\mathit{fr}_D(S \wedge F) + \mathit{fr}_D(S \wedge \neg F) = \frac{|\mathit{tuples}_D(S)|}{|\mathit{dom}_D(S, \{X_1, \ldots, X_m\}) \cup \mathit{dom}_D(F, \{X_1, \ldots, X_m\})|},$$
which was to be shown.

This result illustrates that two logically equivalent queries can have different frequencies in a given database instance, although their result tuples are always the same. In particular, although the queries $[S \wedge F] \vee [S \wedge \neg F]$ and $S$ are logically equivalent, they have different reference domains: the domain of $[S \wedge F] \vee [S \wedge \neg F]$ also includes the domain of the query $F$. This is due to our closed-world assumption: since the entities in the query $F$ are among those mentioned in the query $[S \wedge F] \vee [S \wedge \neg F]$, they are included among the potential answers to the query, although in fact no entity satisfying $F$ will be an actual answer to the query unless it is also an entity satisfying $S$.

The final standard property of probability measures on a Boolean algebra is that $P(X) = 1$, where $X$ is the "certain event" that contains all possible outcomes.
One difficulty with this property from the point of view of our frequency definition is again not so much that the property fails to hold but that it is not straightforward to express. A natural way to translate the axiom into a logical framework is to require that all tautologies or logically necessary queries receive probability 1. For example, the query $\mathit{Student}(A) \vee \neg \mathit{Student}(A)$ is a tautology when viewed as a logical formula, but it is not a safe query. Another conceptually illuminating difficulty is that in our frequency definition, there is no single fixed space of possible outcomes or events that is independent of the query being asked. Rather, we define a space of possible outcomes dynamically for every query (i.e., $\mathit{dom}_D(F, \{X_1, \ldots, X_m\})$ for query $F$). For a given reference domain, the probability 1 property holds to the extent that we can express it. For example, if the only two possible genders are male and female, then the query $[\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{male})] \vee [\mathit{Student}(A) \wedge \mathit{Gender}(A, \mathit{female})]$ receives frequency 1 in every database instance.

Finally we show that frequency as defined decreases monotonically with respect to conjunctions. This is important because many algorithms that search for frequent query formulas use this property to avoid exhaustive search. The following result guarantees that the frequency of a conjunction is no greater than the frequency of its conjuncts, which we refer to as the Apriori property.

Proposition 13 (The Apriori Property) Let $D$ be a database instance with valid ER query $F_1$ whose free variables are $X_1, \ldots, X_m$ and suppose that $F_1 \wedge F_2$ is also a valid ER query whose free variables are $X_1, \ldots, X_m$. Then $\mathit{fr}_D(F_1 \wedge F_2) \leq \mathit{fr}_D(F_1)$.

Proof. Clearly $\mathit{tuples}_D(F_1 \wedge F_2) \subseteq \mathit{tuples}_D(F_1)$, and $\mathit{dom}_D(F_1, \{X_1, \ldots, X_m\}) \subseteq \mathit{dom}_D(F_1 \wedge F_2, \{X_1, \ldots, X_m\})$.
So
$$\frac{|\mathit{tuples}_D(F_1 \wedge F_2)|}{|\mathit{dom}_D(F_1 \wedge F_2, \{X_1, \ldots, X_m\})|} \leq \frac{|\mathit{tuples}_D(F_1)|}{|\mathit{dom}_D(F_1, \{X_1, \ldots, X_m\})|}.$$

Discussion. Previous approaches to mining multi-relational rules, such as Warmr, mine rules for just one target table. Our approach, in contrast, can potentially search the entire space of queries for a given language bias, since by the proposition just established, the Apriori property holds for the entire query space, not just for a fixed target table or key atom, given our definition of frequency and support. So compared to an iterative approach where we repeatedly apply a single-table rule miner to different tables in the database, our approach offers computational advantages. Intuitively, our approach combines the results of rule mining for separate tables when it considers rules that involve the separate tables at the same time. For example, suppose that for the Student table, we find that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30)$ is infrequent. Then from Proposition 13 we can conclude that the query $\mathit{Student}(X) \wedge \mathit{Age}(X, 30) \wedge \mathit{Professor}(X)$ is infrequent as well. A traditional single-table rule mining system applied to both target tables would have to evaluate this conjunction twice, once with the target table Student and the second time with the target table Professor.

The price for the computational advantage of the Apriori property holding throughout the query space is that our approach restricts the set of interesting queries compared to an iterative application of single-table rule mining.
For example, it may be the case that the rule $\mathit{Professor}(X) \wedge \mathit{Student}(X) \rightarrow \mathit{Age}(X, 30)$ receives enough support if evaluated with respect to Professors (because it may be the case that most professors who are also taking courses as students are younger), but does not receive enough support if evaluated with respect to Students (perhaps because very few students are also professors to begin with). Our definition of support based on taking the union of the database tables can be seen as a cautious approach because if a query is frequent with respect to the union of two tables, it is frequent with respect to either table. So a query that is frequent with respect to the union of the Professor and Student tables is frequent with respect to both.

6 Conclusion

The goal of this report was to extend the concept of confidence and support to a new class of association rules which we call entity-relationship rules. Entity-relationship rules are based on the domain relational calculus; they are much more flexible and expressive than standard itemset rules. ER rules allow for negation, nested Boolean combinations, and quantification. The main conceptual contribution of this report is a definition of frequency for entity-relationship queries. Instead of beginning with a specified target table or "key atom", we dynamically define a reference or base domain of individuals for each ER query. The key idea of our definition is to take the base set of entities of a conjunctive query to be the union of the conjuncts' base sets. For example, the frequency of the query $\mathit{Professor}(X) \wedge \mathit{Customer}(X)$ is computed with respect to the union of Professors and Customers. We proved that our frequency definition satisfies standard axioms for probabilities and validates the Apriori property: the frequency of a conjunction is no greater than the frequency of any conjunct.
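The union-based frequency and the Apriori property can be made concrete with a minimal Python sketch. The entity names below are invented for illustration and do not come from the paper's tables, and the helper `frequency` is a hypothetical name; it simply computes the ratio of result tuples to reference domain.

```python
from fractions import Fraction


def frequency(result_tuples, reference_domain):
    """Ratio |tuples_D(F)| / |dom_D(F, X)|; `frequency` is a hypothetical helper."""
    if not reference_domain:
        raise ValueError("reference domain must be non-empty")
    return Fraction(len(result_tuples), len(reference_domain))


# Hypothetical entity sets, invented for this illustration only.
professors = {"ann", "bob", "cho"}
customers = {"bob", "cho", "dee", "eli"}

# Atomic query Professor(X): result tuples and reference domain coincide.
fr_professor = frequency(professors, professors)

# Conjunction Professor(X) AND Customer(X): the results are the intersection,
# while the reference domain is the union of the conjuncts' domains.
fr_both = frequency(professors & customers, professors | customers)

print(fr_professor, fr_both)  # 1 2/5

# Apriori property: the conjunction is no more frequent than its conjunct.
assert fr_both <= fr_professor
```

Using exact fractions rather than floats keeps the comparison free of rounding effects; any other exact rational type would serve equally well.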
As usual in data mining, there is a tradeoff between the expressiveness of the rule or pattern language and the difficulty of searching for significant patterns. Our rule language is very general, and in practice a computational search for interesting entity-relationship rules will require a language restriction (bias). A central topic for future research is to explore language restrictions that make feasible a computational search for interesting entity-relationship rules.

Acknowledgements

This research was supported by Discovery Grants to the first and third author from the Natural Sciences and Engineering Research Council of Canada.

References

[1] "Mining Positive and Negative Association Rules: An Approach for Confined Rules", Maria-Luiza Antonie and Osmar R. Zaïane (2004). 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 04), Springer Verlag LNCS 3202, pp. 27–38, Pisa, Italy, September 20–24.
[2] "Relational Completeness of data base sub-languages", E. Codd (1972). In R. Rustin, editor, Data Base Systems, Prentice Hall.
[3] "Attribute-value learning versus inductive logic programming: The missing links (extended abstract)". In Proceedings of the Eighth International Conference on Inductive Logic Programming, pages 1–8. Springer, Berlin 1998.
[4] "Discovery of Relational Association Rules", Luc Dehaspe and Hannu Toivonen (2001), Ch. 8, in Relational Data Mining, eds. Saso Dzeroski and Nada Lavrac, Springer Berlin.
[5] "Propositionalization Approaches to Relational Data Mining", Stefan Kramer, Nada Lavrač and Peter Flach (2001), Ch. 8, in Relational Data Mining, eds. Saso Dzeroski and Nada Lavrac, Springer Berlin.
[6] "Extending Relational Algebra and Relational Calculus with Set-Valued Attributes and Aggregate Functions", G. Özsoyoğlu, Z.M. Özsoyoğlu, and V. Matos (1987), ACM Transactions on Database Systems, Vol. 12:4, pp. 566–592.
[7] Artificial Intelligence: A Modern Approach, S. Russell and P. Norvig (1988). Prentice Hall.
[8] Principles of Database and Knowledge-Base Systems, Jeffrey D. Ullman (1988), Computer Science Press, Rockville, Maryland.
