A Hybrid Trajectory Clustering for Predicting User Navigation

International Journal of Recent Trends in Engineering

Hazarath Munaga*1, J. V. R. Murthy1, and N. B. Venkateswarlu2
1 Dept. of CSE, JNTU Kakinada, India. Email: {hazarath.munaga, mjonnalagedda}@gmail.com
2 Dept. of CSE, AITAM, Tekkali, India. Email: venkat_ritch@yahoo.com
* Hazarath Munaga alias MHM Krishna Prasad

Abstract - In this paper, we present a novel technique for predicting and visualizing users' future navigations. Here, user navigation is considered as the sequence of URLs visited by the user. We use distance-based specific trajectory clustering to partition users and integrate it with a Markov model to predict users' future navigation. To test the proposed technique, we developed a tool called P-NAS (Predicting and visualizing user NAvigationS) for predicting and visualizing users' future navigations. We demonstrate the effectiveness of our solution by testing the tool on user navigation data obtained from msnbc.com, and validate the results with cross-validation and bootstrapping techniques.

Index Terms - Markov model, specific trajectory clustering, trajectory visualization, website navigation

I. INTRODUCTION

One of the greatest challenges for computer science analysts is understanding human behavior in the context of "digital environments" (e.g., a web site). In recent years, this knowledge has been used in various ways in web site development to increase the ROI (return on investment) of a business organization. Predicting user navigation can be useful in many applications. For example, in the case of web page access:
• prediction can change the web advertisement (hereinafter, advt) area, where a considerable amount of money is paid for placing advts on web sites,
• prediction helps site designers reorganize web sites to enhance site topology, user personalization, and semantic modeling, and
• prediction helps in caching the predicted page for faster access, thereby improving browsing performance.

Various techniques have been discussed in the literature to cluster user sessions [1][2][3][4][5][6]. Reference [3] proposed an approach that combines the time spent on a page with Longest Common Subsequences (LCS) and uses a graph partitioning algorithm (Metis) to cluster user sessions. The LCS algorithm is first applied to all pairs of user sessions; each LCS path is then compacted using a concept-category page hierarchy, and similarities between LCS paths are computed as a function of the time spent on the corresponding pages in the paths, weighted by a certain factor. An abstract similarity graph is then built for the set of sessions to be clustered. Finally, Metis is used to segment the graph into clusters. Reference [1] demonstrated the use of a hierarchical clustering algorithm (BIRCH) for clustering generalized sessions.

Even though traditional clustering techniques like k-means have been used for predicting user navigation patterns, they are typically not very successful in attaining good results. Moreover, k-means (i) is not suitable for generating non-globular clusters, and (ii) makes it difficult to compare the quality of the computed clusters (e.g., different initial partitions and the value of k affect the outcome). Finally, the k-means approach requires a lot of computational time for convergence.
Hence, some researchers have attempted to improve user navigation prediction accuracy by combining different techniques; for example, [2], [4] combined clustering with association rules, and [5], [6] combined clustering with a Markov model. Reference [5] partitioned site users using a model-based clustering approach in which a first-order Markov model is fitted using the Expectation-Maximization (EM) algorithm. As per [7], the EM algorithm requires a large number of iterations to reach its final result; hence it is very slow to converge. Reference [6] generated Significant Usage Patterns (SUPs) from clusters of abstracted Web sessions; the SUPs are derived from a first-order Markov model built for each group of user sessions. In many applications, first-order or second-order Markov models are not very accurate in predicting the user's browsing behavior, since these models do not look far enough into the past to correctly discriminate the different observed patterns. As a result, good prediction often requires higher-order models (e.g., third- or fourth-order). Unfortunately, these higher-order models have a number of limitations [8] associated with high state-space complexity, reduced coverage, and sometimes even worse prediction accuracy. References [9] and [10] demonstrated the use of trajectory clustering, respectively, for visualizing, analyzing and obtaining hidden patterns from user navigations in virtual environments, and for selecting cluster heads, which implicitly helps to extend the lifetime of wireless sensor networks.
To overcome the above limitations, or at least to minimize the use of the Markov model, in this paper we propose a specific trajectory clustering algorithm to cluster the user navigation patterns of a web site, integrated with a kth-order Markov model (hereinafter, KMM) to predict user navigations. Moreover, we have developed a Java-based tool for Predicting and visualizing user NAvigationS (hereinafter, P-NAS) to predict the behavior of an individual navigating a particular web site.

II. SPECIFIC TRAJECTORY CLUSTERING

The proposed algorithm can be used to predict the user's future navigations, which the analyst needs in order to answer queries like "what are the future visits of the user?". We consider the page sequence of a site visitor as a trajectory. Any clustering algorithm requires a dissimilarity method for calculating the dissimilarity between entities such as trajectories. The following section explains the trajectory dissimilarity employed in our algorithm.

A. Trajectory Dissimilarity

In the literature, the time warping distance is used for matching speech signals in speech recognition [11]. Similar methods are used for DNA matching in bioinformatics. A similar technique finds the longest common subsequence of two sequences using fast probabilistic algorithms and then calculates the distance [12], [13]. Here, we used the Levenshtein distance [14] to calculate the dissimilarity between trajectories. To create trajectories, we used a simple mapping of page categories, such as 1 for "frontpage", 2 for "news", 3 for "tech", etc. We used the Levenshtein distance to overcome the influence of large-numbered pages over lower-numbered ones. Table I(A) shows the sample trajectories and Table I(B) shows the dissimilarity matrix obtained for the sample trajectories with respect to testTraj.
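As an illustration of how the entries of Table I(B) arise, the following minimal Python sketch compares the first m symbols of testTraj with each sample trajectory and counts positional mismatches, following the procedure of Fig. 1. The function name `dissimilarity` and the list encoding of trajectories are assumptions made for this example; they are not taken from P-NAS.

```python
def dissimilarity(test_traj, traj):
    """Positional mismatch count over the first m symbols (after Fig. 1).

    Assumption: trajectories are lists of integer page-category ids.
    """
    m, n = len(test_traj), len(traj)
    if n < m:
        # Fig. 1 only defines the measure for n >= m.
        raise ValueError("trajectory must be at least as long as test_traj")
    # zip stops after m symbols, so only the prefix of traj is compared.
    mismatches = sum(1 for a, b in zip(test_traj, traj) if a != b)
    # With n >= m the term (m - n) is never positive, so NM decides.
    return max(mismatches, m - n)


# Sample trajectories of Table I(A).
test_traj = [1, 2, 3]
samples = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 3, 5, 6],
    "C": [2, 3, 4],
    "D": [6, 7, 8, 9, 10],
    "E": [12, 13, 14, 15, 16],
}

row = {label: dissimilarity(test_traj, t) for label, t in samples.items()}
print(row)
```

For the sample data this reproduces the row of Table I(B). Note that the sketch counts mismatches position by position, as Fig. 1 describes; an edit distance that also allows insertions and deletions would, for instance, assign trajectory C a value of 2 rather than 3.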
The dissimilarity measure used in our study is given in Fig. 1; it satisfies the metric-space requirements.

TABLE I(A)
SAMPLE TRAJECTORIES
Label      Trajectory        Portion considered for calculating dissimilarity
testTraj   1 2 3             1 2 3
A          1 2 3 4 5         1 2 3
B          1 2 3 5 6         1 2 3
C          2 3 4             2 3 4
D          6 7 8 9 10        6 7 8
E          12 13 14 15 16    12 13 14

TABLE I(B)
DISSIMILARITY MATRIX FOR THE SAMPLE TRAJECTORIES W.R.T. testTraj
Trajectory   A   B   C   D   E
testTraj     0   0   3   3   3

Algorithm (Compute Dissimilarity)
   testTraj is a trajectory having m symbols
   A is a trajectory having n symbols; n >= m
   Compare the first m symbols of testTraj with those of A and find the number of mismatches (NM).
   The dissimilarity of testTraj and A is given as: dis(testTraj, A) = max(NM, (m - n))
Fig. 1 Algorithm for computing dissimilarity

B. Probabilistic Models

The probabilistic models we have used are Markov models. These are very commonly used for predicting user navigations based on previous navigations, in particular for identifying the next page to be accessed by a Web site user based on the sequence of previously accessed pages [8]. For example, let $P = \{p_1, p_2, \ldots, p_m\}$ be the set of pages in a web site, and let $W$ be a user session, i.e., the sequence of pages visited by the user in one visit. Assuming that the user has visited $l$ pages, $prob(p_i \mid W)$ is the probability that the user visits page $p_i$ next. The user's next visit $p_{l+1}$ is estimated by:

$$p_{l+1} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid W) \} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid p_l, p_{l-1}, p_{l-2}, \ldots, p_1) \}$$

This probability, $prob(p_i \mid W)$, is estimated using all sequences of all users in the training data, say $D$. Naturally, the longer $l$ and the larger $D$, the more accurate $prob(p_i \mid W)$. However, a very long $l$ and a very large $D$ are infeasible and lead to unnecessary complexity. To overcome this problem, a more feasible probability is estimated by assuming that the sequence of web pages visited by users follows a Markov process. The fundamental assumption of predictions based on Markov models is that the next state depends only on the previous $k$ states. The above equation then becomes

$$p_{l+1} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid p_l, p_{l-1}, p_{l-2}, \ldots, p_{l-(k-1)}) \}$$

where $k$ denotes the number of preceding pages. The resulting model is called the kth-order Markov model, or simply KMM. For a given sample, the associated probability (hereinafter, AP) of a symbol $p_i$ is estimated as the number of times $p_i$ follows the given k letters, relative to the number of times any of the symbols $p_1$ to $p_n$ follows those k letters, i.e.

$$AP(p_i) = \frac{Frequency(p_i)}{\sum_{j=1}^{n} Frequency(p_j)}$$

where $n$ is the number of discrete symbols. For example, suppose that a sequence of k letters is followed by the 10 symbols 2 1 4 3 5 6 1 2 2 7. The symbols and their associated probabilities are shown in Table II.

TABLE II
SAMPLE SYMBOLS AND THEIR ASSOCIATED PROBABILITY
Symbol   1     2     3     4     5     6     7
AP       0.2   0.3   0.1   0.1   0.1   0.1   0.1

C. Clustering Routine

Only trajectories that are at least as long as testTraj (which contains the current user navigations) are considered for the clustering task. The clustering routine provides the following data for the analyst to predict the navigational behavior of users:
• A group of trajectories exhibiting similar navigation behavior, and
• The associated probability matrix for future visits.

The clustering routine consists of the following stages:
1. The dissimilarity matrix of the trajectories w.r.t. testTraj is computed using the algorithm shown in Fig. 1.
2. The probable cluster is computed using the following specific trajectory clustering algorithm.
3. Take each trajectory in turn (sequentially); if its dissimilarity with testTraj is zero, add it to the cluster C.
4. Using the following algorithm (Compute APMatrix), compute the associated probability of pages for future visits.
5. Only if there is no future probability, or the obtained probability is less than a considerable value, build the KMM to obtain the future visits.

Algorithm (Compute APMatrix)
1. Input: C is a cluster having n trajectories;
   testTraj is a trajectory containing the current user visits;
2. Output: AP is the associated probability, a one-dimensional matrix of size m for m symbols;
   next is a one-dimensional matrix of size m used for counting the next/immediate probable page weights;
   temp = 0, totalCount = 0 are temporary variables;
   // for getting the immediate page id
3. for i = 1 to n of C
      temp = C[i][testTraj.length + 1];
      next[temp]++;
      totalCount++;
   end for
   // for computing the AP of pages
4. for i = 1 to m symbols
      AP[i] = next[i] / totalCount;
   end for
5. return AP;

III. EXPERIMENTAL WORK

We used the Web navigation data obtained from msnbc.com, available from http://kdd.ics.uci.edu. The data comes from Internet Information Server (IIS) logs and news-related portions of msnbc.com for the entire day of September 28, 1999 (Pacific Standard Time). Each sequence in the dataset corresponds to the page views of a user during that twenty-four hour period. Each event in the sequence corresponds to a user's request for a page. Requests are recorded at the level of page category. The categories are "frontpage", "news", "tech", "local", "opinion", "on-air", "misc", "weather", "health", "living", "business", "sports", "summary", "bbs" (bulletin board service), "travel", "msn-news", and "msn-sports". The full data set consists of 989818 users, with an average of 5.7 events per sequence. A detailed view of the data set (number of pages visited per user) is shown in Fig. 2.
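To make the routine of Section II-C concrete before it is applied to this data, stages 3 to 5 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the P-NAS implementation: the function names, the dictionary representation of the AP matrix, the choice to skip trajectories that end exactly at testTraj (they have no next page), and the toy `sessions` list are all hypothetical.

```python
from collections import Counter


def relative_frequencies(symbols):
    """AP formula: Frequency(p_i) divided by the sum of all frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def build_cluster(test_traj, trajectories):
    """Stage 3: keep trajectories whose first m pages equal test_traj."""
    m = len(test_traj)
    return [t for t in trajectories if len(t) >= m and t[:m] == test_traj]


def compute_ap_matrix(cluster, test_traj):
    """Stage 4: AP of the page visited immediately after test_traj.

    Assumption: trajectories with no page after test_traj are skipped.
    """
    m = len(test_traj)
    return relative_frequencies(t[m] for t in cluster if len(t) > m)


def kmm_next_page_probs(trajectories, test_traj, k):
    """Stage 5 fallback: count pages that follow the last k pages anywhere."""
    if k < 1:
        raise ValueError("k must be at least 1")
    context = test_traj[-k:]
    followers = [
        t[i + k]
        for t in trajectories
        for i in range(len(t) - k)
        if t[i:i + k] == context
    ]
    return relative_frequencies(followers)


# Symbols of the Table II example.
table_ii = relative_frequencies([2, 1, 4, 3, 5, 6, 1, 2, 2, 7])

# Hypothetical toy sessions (lists of page-category ids), not msnbc data.
sessions = [
    [1, 3, 4, 2, 12], [1, 3, 4, 2], [1, 3, 4, 7], [1, 3, 4, 2, 6],
    [1, 3, 4], [2, 3, 4, 12], [5, 1, 3, 4, 7],
]
test_traj = [1, 3, 4]
cluster = build_cluster(test_traj, sessions)
ap = compute_ap_matrix(cluster, test_traj)
kmm = kmm_next_page_probs(sessions, test_traj, k=2)
print(table_ii, ap, kmm)
```

In this sketch the cluster alone answers the query whenever `ap` is non-empty; a caller would fall back to `kmm_next_page_probs` only when `ap` is empty or its largest value lies below a threshold of the analyst's choosing, which mirrors the intent of stage 5 of keeping KMM use to a minimum.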
In our study, we considered the potential users who visited at least 3 and at most 13 pages, which amounts to 173879 users.

No. of Visits   Users    % of total
1               601384   60.76
2               214392   21.66
3               94711    9.57
4               43321    4.38
5               19692    1.99
6               8902     0.90
7               4008     0.40
8               1688     0.17
9               737      0.07
10              297      0.03
11              142      0.01
12              238      0.02
13              143      0.01
14              74       0.01
15              67       0.01
16              9        0.00
17              13       0.00
Fig. 2 Detailed view of the user visits

To test our proposal, as a test case we consider that the already visited and currently visited pages are 1, 3 and 4; the tool produces the following output:
• Fig. 3a shows the clusters extracted from the dataset,
• As shown in Fig. 3b and 3c, after visiting pages 1 3 4, the user is expected to visit the 2nd page with 57%, the 7th page with 26% and the 12th page with 17% probability,
• If the user visits the 2nd page, then, as shown in Fig. 3d, he can further be expected to visit the 12th page with 50%, the 6th page with 29% and the 14th page with 21% probability.

Even though our tool is designed to predict future user navigations, it supports the administrator/website designer in the following ways:
1. Based on the status of the user, the next probable pages can be brought into the cache in advance, which reduces cache latency and access time. If the required page is available in cache/main memory, latency is very small; otherwise it must be brought from secondary memory, leading to page faults. If we predict properly, these problems can be avoided or at least reduced.
2. The advts, or other items that require maximum user attention, can be dynamically positioned in the next probable pages.
3. If a company wants to place an advt on more than one page, the tool suggests the most probable pages:
• for example, as shown in Fig. 3b, after visiting pages 1 3 4, users are going to navigate through the 2nd, 7th and 12th pages; hence, the company can present its advt not only on the 4th page, but can also repeat it or present it in more detail on the 2nd, 7th and 12th pages to get more user attention,
• on the other hand, if the company cannot invest a huge amount (in general, an advt on the "main" or "business" page is expensive compared with other pages), it can still present its advt. For example, we observe that users who visited the "news" and "business" pages go on to visit the "tech" page with 40% probability. The company can present its advt as a small snapshot on the "business" page and then present the detailed advt on the next probable page, i.e., the "tech" page.

A. Validation

Since the 1980s, cross-validation (CV) [16] and bootstrapping (BTS) [17] have been popular techniques for estimating the unknown performance of a classifier designed for discrimination. Here we used 5-fold CV and BTS to validate our model. The results obtained from the 5-fold validation are shown in Fig. 4, which shows the achieved percentage of success and the average number of clusters formed for the corresponding user sessions. The proposed model was tested in a variety of contexts, and we observe that:
1. As shown in Fig. 4a, for a mixed dataset of users who navigated between 3 and 8 pages, we achieve a maximum of 95% success in predicting the future visit with KMM support. Our particular goal, however, is to reduce the use of the KMM, due to its large memory and processing-time requirements. For predicting the users' 4th visit, we achieve almost the same percentage of success without using the KMM.
2. As shown in Fig. 4b, when predicting a user's exact visit between the 4th and the 10th, we obtained 94% success for the 4th visit without KMM support; as expected, the success rate drops as the length of the visit increases, and in this context it becomes mandatory to use the KMM to achieve a good success rate.
3. To verify the validation result obtained using the 5-fold CV technique, we compared it with the BTS technique for the same dataset, without KMM support; from Fig. 4c, we can observe that BTS always gives a higher success rate and almost the same clusters for the dataset.

Fig. 3 Obtained visualization from P-NAS
Fig. 4 (a) Validation results for mixed dataset (b) Validation results for particular visit (c) Comparison between BTS and 5-fold CV techniques

IV. CONCLUSION

In this paper, we have presented a novel specific trajectory clustering algorithm for clustering, visualizing and analyzing users' future navigation patterns. Users are grouped into clusters such that users with similar navigation patterns are placed in the same cluster. Experiments were carried out using the msnbc.com users' data and validated with cross-validation (CV) and bootstrapping (BTS) techniques. This tool can be used for predicting users' future visits, and this knowledge can be used to increase business. The model can be modified to further improve prediction accuracy. An important observation we made while going through the literature on various web prediction models is that not many researchers have attempted to optimize the model using genetic algorithms or simulated annealing; this can be considered future work. The model has a drawback: it does not track the probability of a next item that has never been seen.
Using a variant of prediction by partial matching will help take care of this situation and should be considered in future work.

REFERENCES

[1] Y. Fu, K. Sandhu, and M.-Y. Shih, "Clustering of web users based on access patterns," in Proceedings of the 1999 KDD Workshop on Web Mining. Springer-Verlag, 1999.
[2] H. Lai and T.-C. Yang, "A group-based inference approach to customized marketing on the web integrating clustering and association rules techniques," in HICSS '00: Proceedings of the 33rd Hawaii International Conference on System Sciences - Volume 6. Washington, DC, USA: IEEE Computer Society, 2000, p. 6054.
[3] A. Banerjee and J. Ghosh, "Clickstream clustering using weighted longest common subsequences," in Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001, pp. 33–40.
[4] S. L. F Liu, Z Lu, "Mining association rules using clustering," Intelligent Data Analysis, vol. 5, no. 4, pp. 309–326, 2000.
[5] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, "Model-based clustering and visualization of navigation patterns on a web site," Data Min. Knowl. Discov., vol. 7, no. 4, pp. 399–424, 2003.
[6] L. Lu, M. Dunham, and Y. Meng, "Discovery of significant usage patterns from clusters of clickstream data," in WebKDD2005, 2005.
[7] L. Xu and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures," Neural Comput., vol. 8, no. 1, pp. 129–151, 1996.
[8] M. Deshpande and G. Karypis, "Selective Markov models for predicting web page accesses," ACM Trans. Internet Technol., vol. 4, no. 2, pp. 163–184, 2004.
[9] H. Munaga, L. Ieronutti, and L. Chittaro, "CAST - a novel trajectory clustering and visualization tool for spatio-temporal data," in IHCI-2009: Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, India, January 2009, pp. 169–175.
[10] H. Munaga, J. V. R. Murthy, and N. B. Venkateswarlu, "A novel trajectory clustering technique for selecting cluster heads in wireless sensor networks," International Journal on Recent Trends in Engineering, vol. 1, pp. 357–361, May 2009.
[11] H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990.
[12] G. Das, D. Gunopulos, and H. Mannila, "Finding similar time series," in PKDD '97: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. London, UK: Springer-Verlag, 1997, pp. 88–100.
[13] B. Bollobás, G. Das, D. Gunopulos, and H. Mannila, "Time-series similarity problems and well-separated geometric sets," Nordic J. of Computing, vol. 8, no. 4, pp. 409–423, 2001.
[14] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics Doklady, vol. 10, pp. 707–710, February 1966.
[15] F. Khalil, J. Li, and H. Wang, "Integrating recommendation models for improved web page prediction accuracy," in ACSC '08: Proceedings of the Thirty-First Australasian Conference on Computer Science. Darlinghurst, Australia: Australian Computer Society, Inc., 2008, pp. 91–100.
[16] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. Roy. Statist. Soc. Ser. B, vol. 36, pp. 111–148, 1974.
[17] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Chapman and Hall, London, 1993.
