An Efficient Preprocessing Methodology for Discovering Patterns and Clustering of Web Users using a Dynamic ART1 Neural Network

📝 Original Info

  • Title: An Efficient Preprocessing Methodology for Discovering Patterns and Clustering of Web Users using a Dynamic ART1 Neural Network
  • ArXiv ID: 1109.1211
  • Date: 2011-09-07
  • Authors: C. Ramya and G. Kavitha

📝 Abstract

This paper proposes a complete preprocessing methodology for discovering patterns in the Web usage mining process, one that improves the quality of the data while reducing its quantity. It also proposes a dynamic ART1 neural network clustering algorithm, together with its architecture, that groups users according to their Web access patterns. Several experiments show that the proposed methodology reduces the size of Web log files down to 73-82% of the initial size and that the proposed ART1 algorithm is dynamic and learns relatively stable, good-quality clusters.

📄 Full Content

Web log data is usually diverse and voluminous. It must be assembled into a consistent, integrated, and comprehensive view before it can be used for pattern discovery. Without properly cleaning, transforming, and structuring the data prior to analysis, one cannot expect to find meaningful patterns. Rushing to analyze usage data without a proper preprocessing method leads to poor results or even to failure. We therefore propose a preprocessing methodology. The results show that it reduces the size of Web access log files down to 73-82% of the initial size and offers richer logs that are structured for the later stages of Web Usage Mining (WUM).

We also present an ART1-based clustering algorithm to group users according to their Web access patterns. In this approach, each cluster of users is represented by a prototype vector that is a generalized representation of the URLs frequently accessed by all members of that cluster. The degree of similarity between the members of each cluster can be controlled by changing the value of the vigilance parameter. In our work, we analyze the clusters formed by the ART1 technique while varying the vigilance parameter ρ between 0.3 and 0.5.

The main objectives of preprocessing are to reduce the quantity of data being analyzed and, at the same time, to enhance its quality. As shown in Fig. 1, preprocessing comprises the following steps: merging of log files from different Web servers; data cleaning; identification of users, sessions, and visits; and data formatting and summarization.

At the beginning of data preprocessing, the requests from all log files are put together into a joint log file. Each request is tagged with its Web server name to distinguish between requests made to different Web servers, and the merge takes into account the synchronization of Web server clocks, including time zone differences.
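The merging step can be illustrated with a minimal sketch. It assumes Common Log Format lines with a bracketed timestamp such as `[01/Jul/1995:00:00:09 -0400]`; the function names `parse_timestamp` and `merge_logs` and the server names are hypothetical, not taken from the paper. Only time zone differences are handled here; correcting for clock drift between servers would need additional information.

```python
from datetime import datetime, timezone

def parse_timestamp(line):
    """Extract the bracketed CLF timestamp and convert it to UTC."""
    raw = line[line.index("[") + 1 : line.index("]")]
    local = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc)

def merge_logs(logs_by_server):
    """Merge {server_name: [log lines]} into one list sorted by UTC time.

    Each merged entry is (utc_time, server_name, original_line), so the
    originating server stays attached to every request.
    """
    merged = []
    for server, lines in logs_by_server.items():
        for line in lines:
            merged.append((parse_timestamp(line), server, line))
    merged.sort(key=lambda entry: entry[0])
    return merged

# Hypothetical example: two servers logging in different time zones.
logs = {
    "www1": ['h1 - - [01/Jul/1995:00:00:09 -0400] "GET /a.html HTTP/1.0" 200 100'],
    "www2": ['h2 - - [01/Jul/1995:03:00:05 +0000] "GET /b.html HTTP/1.0" 200 100'],
}
joint_log = merge_logs(logs)
```

Converting every timestamp to UTC before sorting is one simple way to make entries from different time zones comparable.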

The second step of data preprocessing consists of removing useless requests from the log files. Since not all log entries are valid, the irrelevant entries need to be eliminated. Usually, this process removes requests for non-analyzed resources such as images, multimedia files, and page style files.
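As a rough illustration of this cleaning step, the sketch below drops malformed entries and requests for non-analyzed resources. The extension list `IGNORED_EXTENSIONS` and the helper names are assumptions chosen for the example; the paper does not specify which file types it filters.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of non-analyzed resource types; adjust per site.
IGNORED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js",
                      ".ico", ".mpg", ".wav"}

def requested_path(line):
    """Return the URL path from the quoted request of a CLF line, or None."""
    try:
        request = line.split('"')[1]          # e.g. 'GET /index.html HTTP/1.0'
        return urlsplit(request.split()[1]).path
    except IndexError:
        return None                           # no quoted request or no URL

def is_relevant(line):
    """Keep well-formed entries whose resource type is analyzed."""
    path = requested_path(line)
    if path is None:
        return False
    extension = posixpath.splitext(path)[1].lower()
    return extension not in IGNORED_EXTENSIONS

def clean(lines):
    """Return only the log lines that survive cleaning, in original order."""
    return [line for line in lines if is_relevant(line)]
```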

In most cases, the log file provides only the computer address (name or IP) and, for ECLF log files, the user agent. For Web sites requiring user registration, the log file also contains the user login (the third field in a log entry), which can be used for user identification.
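A minimal sketch of this identification rule follows. It assumes the widely used combined log layout (host, ident, login, timestamp, request, status, size, then optional referrer and user agent); the regular expression and the name `user_key` are illustrative, not the authors' implementation. The login is preferred when present, and the host plus user agent pair is the fallback.

```python
import re

# Assumed layout: host ident login [time] "request" status size,
# optionally followed by "referrer" "user-agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<login>\S+) \[[^\]]+\] "[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)

def user_key(line):
    """Return a hashable key identifying the user behind a log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None                            # unparseable entry
    if match.group("login") != "-":
        return ("login", match.group("login"))
    # No login recorded: fall back to host plus user agent (may be empty).
    return ("host+agent", match.group("host"), match.group("agent") or "")
```

Two requests from the same host with different user agents are treated as different users here, which is a common heuristic rather than a guarantee.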

A user session is a directed list of page accesses performed by an individual user during a visit to a Web site. A user may have one or more sessions during a period of time. The session identification problem is formulated as “Given the Web log file, capture the Web users’ navigation trends, typically expressed in the form of Web users’ sessions”.
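One common way to split a user's requests into sessions is an inactivity timeout, sketched below. The 30-minute threshold and the name `split_sessions` are assumptions for illustration; the text above does not state which heuristic or threshold the authors used.

```python
from datetime import datetime, timedelta

def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (user, time, url) requests into per-user sessions.

    A new session starts when the gap since the user's previous request
    exceeds `timeout` (an assumed threshold, not taken from the paper).
    Returns {user: [[url, ...], ...]} with URLs in access order.
    """
    sessions = {}
    last_seen = {}
    for user, time, url in sorted(requests, key=lambda r: (r[0], r[1])):
        if user not in sessions or time - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])   # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = time
    return sessions

# Hypothetical requests: user "u1" returns after a 45-minute pause.
start = datetime(1995, 7, 1, 0, 0, 0)
requests = [
    ("u1", start, "/a"),
    ("u1", start + timedelta(minutes=5), "/b"),
    ("u1", start + timedelta(minutes=50), "/c"),
    ("u2", start + timedelta(minutes=1), "/a"),
]
sessions = split_sessions(requests)
```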

This is the last step of data preprocessing. Here, the structured file containing sessions and visits is transformed into a relational database model.
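To make the idea of a relational model concrete, here is a small sketch using Python's built-in `sqlite3` module. The two-table schema (`sessions`, `requests`) and the function `build_database` are hypothetical; the paper's actual schema is not given in this text.

```python
import sqlite3

def build_database(sessions):
    """Load {user: [[url, ...], ...]} into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sessions (
            id      INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL
        );
        CREATE TABLE requests (
            session_id INTEGER NOT NULL REFERENCES sessions(id),
            position   INTEGER NOT NULL,   -- order of the page in the session
            url        TEXT NOT NULL
        );
    """)
    for user, user_sessions in sessions.items():
        for urls in user_sessions:
            cursor = db.execute(
                "INSERT INTO sessions (user_id) VALUES (?)", (user,)
            )
            db.executemany(
                "INSERT INTO requests VALUES (?, ?, ?)",
                [(cursor.lastrowid, i, url) for i, url in enumerate(urls)],
            )
    db.commit()
    return db

# Summarization then becomes an ordinary query, e.g. hits per URL.
db = build_database({"u1": [["/a", "/b"], ["/a"]], "u2": [["/a"]]})
hits = db.execute(
    "SELECT url, COUNT(*) FROM requests GROUP BY url ORDER BY url"
).fetchall()
```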

A clustering algorithm takes a set of input vectors as input and produces a set of clusters as output, thereby mapping each input vector to a cluster. This paper presents a novel approach for dynamically grouping Web users based on their Web access patterns using the ART1 NN clustering algorithm. The proposed ART1 NN clustering methodology and its architecture are discussed.

The proposed clustering model involves two stages: the feature extraction stage and the clustering stage. First, features are extracted from the preprocessed log data and a binary pattern vector P is generated. Then, the ART1 NN clustering algorithm creates the clusters in the form of prototype vectors.

The feature extractor forms the binary pattern vector P from the base vector D. The procedure is given in Fig. 2. The resulting pattern vector is the input vector for the ART1 NN based clustering algorithm, whose architecture is given in Fig. 3.

Each input vector activates a winner node in the F2 layer, namely the node with the highest value of the product of the input vector and the bottom-up weight vector. The F2 layer then reads out the top-down expectation of the winning node to F1, where the expectation is normalized over the input pattern vector and compared with the vigilance parameter ρ. If the winner and the input vector match within the tolerance allowed by ρ, the ART1 algorithm sets the control gain G2 to 0 and updates the top-down weights corresponding to the winner. If a mismatch occurs, the gain controls G1 and G2 are set to 1 to disable the current node and process the input on another uncommitted node. Once the network has stabilized, the top-down weights corresponding to each node in the F2 layer represent the prototype vector for that node. The steps involved in the ART1 clustering algorithm are summarized in Table 1.
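The two stages can be sketched with a simplified textbook version of fast-learning ART1. This is not the authors' exact procedure: Fig. 2 and Table 1 are not reproduced here, so the binarization rule in `to_pattern`, the constant `L`, and the epoch limit are assumptions. The gain controls G1 and G2 are not modeled explicitly; trying candidate nodes in order of activation plays the role of disabling a mismatched winner.

```python
def to_pattern(base_vector, threshold=1):
    """Binarize per-URL access counts (assumed form of the base vector D)."""
    return [1 if count >= threshold else 0 for count in base_vector]

def art1_cluster(patterns, rho=0.4, L=2.0, epochs=10):
    """Cluster binary vectors with a simplified fast-learning ART1.

    Returns (prototypes, labels): the top-down weight vector of every
    committed F2 node and the node index assigned to each pattern.
    """
    prototypes = []                        # top-down weights, one per F2 node
    labels = [None] * len(patterns)
    for _ in range(epochs):
        changed = False
        for idx, p in enumerate(patterns):
            norm_p = sum(p)
            if norm_p == 0:
                continue                   # an all-zero pattern cannot be matched
            # Bottom-up activation: overlap scaled by L / (L - 1 + |T_j|).
            scores = [
                L * sum(a & b for a, b in zip(p, t)) / (L - 1 + sum(t))
                for t in prototypes
            ]
            order = sorted(range(len(prototypes)),
                           key=scores.__getitem__, reverse=True)
            winner = None
            for j in order:
                overlap = [a & b for a, b in zip(p, prototypes[j])]
                if sum(overlap) / norm_p >= rho:       # vigilance test
                    if overlap != prototypes[j]:
                        changed = True
                    prototypes[j] = overlap            # T_j <- P AND T_j
                    winner = j
                    break
            if winner is None:                         # commit a new node
                prototypes.append(list(p))
                winner = len(prototypes) - 1
            if labels[idx] != winner:
                changed = True
            labels[idx] = winner
        if not changed:                                # network has stabilized
            break
    return prototypes, labels

# Hypothetical access counts for four users over four URLs.
counts = [[3, 1, 0, 0], [2, 5, 1, 0], [0, 0, 0, 4], [0, 0, 2, 1]]
patterns = [to_pattern(row) for row in counts]
prototypes, labels = art1_cluster(patterns, rho=0.5)
```

Raising `rho` makes the vigilance test stricter, so fewer patterns share a node and more clusters are created, which matches the role of the vigilance parameter described above.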

We have conducted several experiments on log files collected from the NASA Web site during July 1995.
