Concept Stability for Constructing Taxonomies of Web-site Users

Owners of a web-site are often interested in analyzing groups of users of their site. Information on these groups can help to optimize the structure and contents of the site. In this paper we use an approach based on formal concepts for constructing…

Authors: Sergei O. Kuznetsov, Dmitry I. Ignatov

Sergei O. Kuznetsov and Dmitrii I. Ignatov
State University Higher School of Economics (HSE) and All-Russia Institute for Scientific and Technical Information (VINITI), Moscow, Russia

Abstract. Owners of a web-site are often interested in analyzing groups of users of their site. Information on these groups can help to optimize the structure and contents of the site. In this paper we use an approach based on formal concepts for constructing taxonomies of user groups. To reduce the huge number of concepts that arise in applications, we employ the stability index of a concept, which describes how a group given by a concept extent differs from other such groups. We analyze the resulting taxonomies of user groups for three target websites.

Problem Statement and Domain Models

Owners of a web-site are often interested in analyzing groups of users of their site. Information on these groups can help to optimize the structure and contents of the site. For example, interaction with members of each group may be organized in a special manner. In this paper we use an approach based on formal concepts [1] for constructing taxonomies of user groups.

For our experiments we have chosen four target websites: the site of the State University Higher School of Economics (www.hse.ru), an e-shop of household equipment, the site of a large bank, and the site of a car e-shop (the names of the last three sites cannot be disclosed due to legal agreements). Users of these sites are described by attributes that correspond to other sites, either external (from three groups of sites: finance, media, education) or internal (web-pages of the site).
More precisely, the initial “external” data consist of user records, each containing the user id, the time when the user first entered the site, the time of his/her last visit, and the total number of sessions during the period under consideration. An “internal” user record, on the other hand, is simply a list of pages within the target website visited by a particular user.

By “external” and “internal” taxonomies we mean (parts of) concept lattices for contexts with either “external” or “internal” attributes. For example, the external context has the form $K_e = (U, S_e, I_e)$, where $U$ is the set of all users of the target site, $S_e$ is the set of all sites from a sample (not including the target one), and the incidence relation $I_e$ is given by all pairs $(u, s)$, $u \in U$, $s \in S_e$, such that user $u$ visited site $s$. Analogously, the internal context is of the form $K_i = (U, S_i, I_i)$, where $S_i$ is the set of all own pages of the target site. A concept of this context is a pair $(A, B)$ such that $A$ is a group of users that visited together all other sites from $B$.

Formal framework

Before proceeding, we briefly recall the FCA terminology [1]. Given a (formal) context $K = (G, M, I)$, where $G$ is called a set of objects, $M$ is called a set of attributes, and the binary relation $I \subseteq G \times M$ specifies which objects have which attributes, the derivation operators $(\cdot)^I$ are defined for $A \subseteq G$ and $B \subseteq M$ as follows:

$$A^I = \{ m \in M \mid \forall g \in A : g \, I \, m \}; \qquad B^I = \{ g \in G \mid \forall m \in B : g \, I \, m \}.$$

Put differently, $A^I$ is the set of attributes common to all objects of $A$ and $B^I$ is the set of objects sharing all attributes of $B$.

If this does not result in ambiguity, $(\cdot)'$ is used instead of $(\cdot)^I$. The double application of $(\cdot)'$ is a closure operator, i.e., $(\cdot)''$ is extensive, idempotent, and monotone. Therefore, sets $A''$ and $B''$ are said to be closed.
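As a small illustration of the derivation operators and of the closure $(\cdot)''$, the following Python sketch evaluates them on a toy context. The users, sites, and function names are hypothetical and are not taken from our data; the sketch only mirrors the definitions above.

```python
# Toy context (hypothetical data): each user is mapped to the set of sites visited.
CONTEXT = {
    "u1": {"news", "bank"},
    "u2": {"news", "bank", "edu"},
    "u3": {"news", "edu"},
    "u4": {"edu"},
}
# M: the set of all attributes occurring in the context.
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def common_attributes(objects):
    """A': attributes shared by every object of A (all of M when A is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def common_objects(attributes):
    """B': objects that have every attribute of B."""
    wanted = set(attributes)
    return frozenset(g for g, attrs in CONTEXT.items() if wanted <= attrs)


def closure(objects):
    """A'': the smallest closed set of objects containing A."""
    return common_objects(common_attributes(objects))


if __name__ == "__main__":
    a = {"u1", "u3"}
    print(sorted(common_attributes(a)))  # attributes common to u1 and u3
    print(sorted(closure(a)))            # closed set generated by {u1, u3}
```

Here the set {u1, u3} is not closed: its only common attribute is "news", which u2 also has, so the closure adds u2.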
A (formal) concept of the context $(G, M, I)$ is a pair $(A, B)$, where $A \subseteq G$, $B \subseteq M$, $A = B'$, and $B = A'$. In this case, we also have $A = A''$ and $B = B''$. The set $A$ is called the extent and $B$ is called the intent of the concept $(A, B)$. In categorical terms, $(A, B)$ is equivalently defined by its objects $A$ or its attributes $B$. A concept $(A, B)$ is a subconcept of $(C, D)$ if $A \subseteq C$ (equivalently, $D \subseteq B$). In this case, $(C, D)$ is called a superconcept of $(A, B)$. We write $(A, B) \le (C, D)$ and define the relations $\ge$, $<$, and $>$ as usual. If $(A, B) < (C, D)$ and there is no $(E, F)$ such that $(A, B) < (E, F) < (C, D)$, then $(A, B)$ is a lower neighbor of $(C, D)$ and $(C, D)$ is an upper neighbor of $(A, B)$; notation: $(A, B) \prec (C, D)$ and $(C, D) \succ (A, B)$. The set of all concepts ordered by $\le$ forms a lattice, which is denoted by $\mathfrak{B}(K)$ and called the concept lattice of the context $K$. The relation $\prec$ defines edges in the covering graph of $\mathfrak{B}(K)$.

Data and Their Preprocessing

We received “external” data with the following information for each user-site pair: (user id, time of the first visit, time of the last visit, total number of sessions during the period). “Internal” data have almost the same format, with an additional field url page, which corresponds to a particular visited page of the target site. Information was gathered from about 10 000 sites of the Russian internet (domain .ru).

In describing users in terms of the sites they visited, we had to tackle the problem of dimensionality, since concept lattices can be very large (exponential in the worst case) in terms of attributes. To reduce the size of input data we used the following techniques.

For each user we selected only those sites that were visited more than a certain number of times during the observation period. This gave us information about permanent interests of particular users.
Each target site was considered in terms of sites of three groups: newspaper sites, financial sites, and educational sites.

Some pages can be merged (as attributes) according to an (implicit) domain ontology. For example, if users of a bank site have personal pages, it is reasonable to fuse all these pages by calling the resulting attribute “a personal web-page”.

A certain observation period can be chosen; usually we took a one-month period.

However, even after a large reduction of input size, concept lattices can be very large. For example, a context of size 4125 × 225 gave rise to a lattice with 57 329 concepts.

Using Stability for Selecting Interesting Subsets of Concepts

To choose interesting groups of users we employed the stability index of a concept defined in [2,3] and considered in [4] (in a slightly different form) as a tool for constructing taxonomies. On the one hand, the stability index shows the independence of an intent from particular objects of the extent (which may or may not appear in the context depending on random factors). On the other hand, the stability index of a concept shows how much the extent of a concept differs from similar smaller extents (if this difference is very small, then it is doubtful that the extent refers to a “stable category”). For a detailed motivation of stability indices see [2,3,4].

Definition. Let $K = (G, M, I)$ be a formal context and $(A, B)$ be a formal concept of $K$. The stability index $\sigma$ of $(A, B)$ is defined as follows:

$$\sigma(A, B) = \frac{|\{ C \subseteq A \mid C' = B \}|}{2^{|A|}}.$$

Obviously, $0 \le \sigma(A, B) \le 1$.

The stability index of a concept indicates how much the concept intent depends on particular objects of the extent. A stable intent (with stability index close to 1) is probably “real” even if the description of some objects is “noisy”.
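To make the definition concrete, the following Python sketch computes the stability index by direct enumeration of all subsets of an extent. It is only an illustration under stated assumptions: the toy context and all names are hypothetical, and the brute-force count takes time exponential in $|A|$, so it is suitable for very small extents only and is not the procedure used in our experiments.

```python
from itertools import combinations

# Toy context (hypothetical data): each user is mapped to the set of sites visited.
CONTEXT = {
    "u1": {"news", "bank"},
    "u2": {"news", "bank", "edu"},
    "u3": {"news", "edu"},
    "u4": {"edu"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())


def intent_of(objects):
    """C': attributes shared by every object of C (all of M when C is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def stability(extent):
    """Share of subsets C of the extent A whose derivation C' equals the intent B = A'."""
    members = tuple(extent)
    intent = intent_of(members)
    hits = sum(
        1
        for size in range(len(members) + 1)
        for subset in combinations(members, size)
        if intent_of(subset) == intent
    )
    return hits / 2 ** len(members)


if __name__ == "__main__":
    # Extent of the concept with intent {"news"}.
    print(stability({"u1", "u2", "u3"}))
    # Extent of the concept with intent {"edu"}.
    print(stability({"u2", "u3", "u4"}))
```

In this toy context the group of "edu" visitors is more stable than the group of "news" visitors: every subset containing u4 already yields the intent {"edu"}, whereas the intent {"news"} is obtained only from subsets that contain both u1 and u3.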
In application to our data, the stability index shows how likely we are to still observe a common group of interests if we ignore several users. Apart from being noise-resistant, a stable group does not collapse (e.g., merge with a different group or split into several independent subgroups) when a few members of the group stop attending the target sites.

In our experiments we used ConceptExplorer [5] for computing and visualizing lattices and their parts. We compared the results of taking the most stable concepts (with stability index exceeding a threshold) with taking an “iceberg” of a concept lattice (the order filter of a lattice containing all concepts with extents larger than a threshold). The results look correlated, but nevertheless substantially different. The set of stable extents contained very important, but not large, groups of users.

In Figs. 1, 2 we present parts of a concept lattice for the site www.hse.ru described by “external” attributes, which were taken to be Russian internet newspapers visited by users of www.hse.ru more than 20 times during one month. Fig. 1 presents an iceberg with 25 concepts having the largest extents. Many of the concepts correspond to newspapers that are in the middle of the political spectrum, read “by everybody” and thus not very interesting in characterizing social groups.

Fig. 1. Iceberg with 25 concepts

Fig. 2 presents an ordered set of 25 concepts having the largest stability index. As compared to the iceberg, this part of the concept lattice contains several sociologically important groups such as readers of AIF (“yellow press”), Cosmopolitan, Expert (high professional analytical surveys), etc.

Fig. 2. Ordered set of 25 concepts with largest stability

References

1. B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, Berlin (1999).
2. S.O. Kuznetsov, Stability as an estimate of the degree of substantiation of hypotheses derived on the basis of operational similarity. Nauchn. Tekh. Inf., Ser. 2 (Automat. Document. Math. Linguist.) No. 12 (1990) pp. 21-29.
3. S.O. Kuznetsov, On stability of a formal concept. In SanJuan, E., ed., Proc. JIM’03, Metz, France (2003).
4. C. Roth, S. Obiedkov, D. G. Kourie, Towards Concise Representation for Taxonomies of Epistemic Communities, Proc. CLA 4th International Conference on Concept Lattices and their Applications (2006).
5. S.A. Yevtushenko, System of data analysis “Concept Explorer,” in Proc. 7th Russian Conference on Artificial Intelligence (KII-2000), Moscow (2000), 127-134 (in Russian).
