Efficient Web Log Mining using Doubly Linked Tree
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The World Wide Web is a huge data repository, growing at an explosive rate of roughly one million pages a day. As the information available on the Web grows, so does the usage of web sites. A web log records each access to a web page, and the number of entries in these logs is increasing rapidly. When mined properly, web logs can provide useful information for decision-making. Web site designers, analysts, and management executives are all interested in extracting this hidden information. Web access patterns, i.e., frequently used sequences of accesses, are among the most important information that can be mined from web logs. This information can be used to gather business intelligence for improving sales and advertisement, to personalize the site for a user, to analyze system performance, and to improve the organization of the web site. Many techniques exist for mining access patterns from web logs. This paper describes a powerful algorithm that mines web logs efficiently. The proposed algorithm first converts the web access data into a special doubly linked tree, where each access is called an event. This tree keeps the critical mining-related information in a highly compressed form based on frequent event counts. A recursive algorithm then uses this tree to efficiently find all access patterns that satisfy user-specified criteria. To show that our algorithm outperforms the GSP (Generalized Sequential Pattern) algorithm, we report experimental studies on sample data.


💡 Research Summary

The paper addresses the problem of extracting frequent web access patterns from massive web server logs, which have grown dramatically as the World Wide Web expands. Traditional sequential pattern mining techniques such as the Generalized Sequential Pattern (GSP) algorithm rely on the Apriori principle and require multiple scans of the entire database to generate candidate sequences. While effective for modest datasets, GSP becomes inefficient when dealing with long user sessions and a large number of distinct events typical of web logs.

To overcome these limitations, the authors propose a two‑phase approach that first filters out infrequent events and then compresses the remaining data into a novel data structure called a Doubly Linked Tree (DLT). In the first scan of the Web Access Sequence (WAS) database, each event’s occurrence count is tallied. Events whose support falls below a user‑specified threshold ξ are discarded, thereby reducing the alphabet size and eliminating unnecessary branches before any tree construction takes place.
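The first-scan filtering step can be sketched as follows. This is a minimal illustration with made-up sessions and an absolute-count threshold standing in for ξ; the paper does not prescribe these exact names or data.

```python
from collections import Counter

# Hypothetical sessions: each Web Access Sequence (WAS) is a list of page events.
sessions = [
    ["a", "b", "d", "a", "c"],
    ["a", "b", "c", "b"],
    ["b", "a", "c"],
    ["d", "e"],
]

min_support = 3  # stands in for the user-specified threshold xi (absolute count here)

# First scan: tally each event's occurrences across all sessions.
counts = Counter(event for session in sessions for event in session)

# Keep only frequent events; infrequent ones are dropped before any tree is built.
frequent = {e for e, c in counts.items() if c >= min_support}

# Reduce each session to its frequent subsequence (original order preserved).
filtered = [[e for e in session if e in frequent] for session in sessions]
print(sorted(frequent))   # ['a', 'b', 'c']
print(filtered[0])        # ['a', 'b', 'a', 'c']
```

Dropping infrequent events here shrinks the alphabet before the second scan, so the tree never materializes branches that could not contribute to a frequent pattern.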

During the second scan, the algorithm builds the DLT. Each node stores an event label, a count representing how many times the prefix ending with that event appears, a pointer to its parent (for backward traversal), and a link to the next node with the same label (forming an event‑node queue). A header table maintains the front of each queue, enabling rapid access to all nodes sharing a label. By inserting only the filtered frequent subsequence of each session, common prefixes are shared in the tree, dramatically reducing both height (bounded by the longest frequent subsequence plus one) and width (limited by the number of distinct frequent subsequences).
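A node layout and insertion routine matching that description might look like the sketch below. The class and function names are illustrative, not taken from the paper, and the header table is modeled as a plain dictionary from label to the front of the label's event-node queue.

```python
class Node:
    """One DLT node: event label, prefix count, parent link, same-label queue link."""
    def __init__(self, label, parent):
        self.label = label
        self.count = 0
        self.parent = parent       # backward pointer for prefix-path traversal
        self.children = {}
        self.next_same = None      # next node carrying the same label (event-node queue)

def insert_sequence(root, header, sequence):
    """Insert one filtered frequent subsequence, sharing common prefixes."""
    node = root
    for event in sequence:
        child = node.children.get(event)
        if child is None:
            child = Node(event, node)
            node.children[event] = child
            # Hook the new node onto the front of its label's event-node queue.
            child.next_same = header.get(event)
            header[event] = child
        child.count += 1           # how many sequences share the prefix ending here
        node = child

root, header = Node(None, None), {}
for seq in [["a", "b", "c"], ["a", "b"], ["b", "c"]]:
    insert_sequence(root, header, seq)

# The shared prefix "a b" is stored once, with count 2.
print(root.children["a"].count)                  # 2
print(root.children["a"].children["b"].count)    # 2
```

Because the two sessions starting with "a b" collapse into one path, the tree's size depends on the number of *distinct* frequent subsequences rather than on the raw number of sessions.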

Once the DLT is built, the original log data are no longer needed. Mining proceeds via a conditional search strategy. For each event i present in the tree, the algorithm follows its event‑node queue to collect all prefix paths that end with i, forming a conditional sequence base PS(i). From PS(i) it extracts conditional frequent events, constructs a conditional DLT, and recursively mines this sub‑tree. Each pattern discovered in the conditional tree is concatenated with i as a suffix and added to the final pattern set. If a tree consists of a single branch, the algorithm simply returns all unique combinations of nodes on that branch, avoiding further recursion.
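The collection of PS(i) via the event-node queue and parent pointers can be sketched as below. The code is self-contained (it rebuilds a small tree first); all names are illustrative, under the assumption that the header table points at the front of each label's queue.

```python
class Node:
    def __init__(self, label, parent):
        self.label, self.count = label, 0
        self.parent, self.children, self.next_same = parent, {}, None

def build(sequences):
    root, header = Node(None, None), {}
    for seq in sequences:
        node = root
        for e in seq:
            child = node.children.get(e)
            if child is None:
                child = Node(e, node)
                node.children[e] = child
                child.next_same, header[e] = header.get(e), child
            child.count += 1
            node = child
    return root, header

def prefix_paths(header, event):
    """Conditional sequence base PS(event): each prefix path that ends
    immediately before a node labelled `event`, with that node's count."""
    paths, node = [], header.get(event)
    while node is not None:                   # walk the event-node queue
        path, cur = [], node.parent
        while cur is not None and cur.label is not None:
            path.append(cur.label)            # climb the parent pointers
            cur = cur.parent
        path.reverse()
        if path:
            paths.append((path, node.count))
        node = node.next_same
    return paths

root, header = build([["a", "b", "c"], ["a", "b"], ["b", "c"]])
print(prefix_paths(header, "c"))   # [(['b'], 1), (['a', 'b'], 1)]
```

Each collected path becomes one sequence of the conditional base; the recursion then builds a conditional DLT from these paths and mines it the same way, appending "c" as the suffix of every pattern it finds.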

A key technical nuance is the handling of “unsubsumed counts.” Because multiple nodes may share the same label along different paths, higher‑level prefixes can inadvertently inflate the support of lower‑level prefixes. The algorithm corrects this by subtracting the counts of all super‑prefixes from a given prefix’s count, ensuring that each pattern’s support is computed accurately without double‑counting.
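One reading of that correction can be illustrated with a tiny worked example. The numbers and the prefix representation here are hypothetical, and this is a simplified sketch of the subtraction rule rather than the paper's exact procedure.

```python
# Suppose event "b" occurs at two nodes on the same branch, with prefix paths
#   P1 = ("a",)           raw count 5   (shallower occurrence of b)
#   P2 = ("a", "b", "c")  raw count 2   (deeper occurrence of b)
# P1 is a prefix of P2, so 2 of P1's 5 occurrences are already represented
# by P2 and must not be counted again for P1.
raw = {("a",): 5, ("a", "b", "c"): 2}

def unsubsumed(raw_counts):
    """Subtract each super-prefix's count from every prefix it extends."""
    adjusted = dict(raw_counts)
    for longer, c in raw_counts.items():
        for shorter in raw_counts:
            if shorter != longer and longer[:len(shorter)] == shorter:
                adjusted[shorter] -= c
    return adjusted

print(unsubsumed(raw))   # {('a',): 3, ('a', 'b', 'c'): 2}
```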

The authors implemented both the DLT‑based algorithm and the classic GSP in C++ and evaluated them on a real‑world web log dataset comprising 36,878 entries (≈61 KB) collected over a week in February 2004. Experiments varied the support threshold from 5 % to 30 % and also increased the number of access sequences to test scalability. Results show that at low support thresholds (e.g., 5 %), the DLT method required roughly 100 seconds, whereas GSP needed about 450 seconds—a four‑fold speedup. As the threshold increased, both algorithms ran faster, but the DLT still maintained a consistent advantage. When the dataset size grew, GSP’s runtime escalated sharply, while the DLT’s runtime grew almost linearly, confirming its superior scalability for large‑scale web usage mining.

In conclusion, the paper demonstrates that compressing web access logs into a doubly linked tree and applying recursive conditional mining yields substantial performance gains over traditional Apriori‑based GSP, especially for low‑support, large‑dataset scenarios. The authors suggest future work on automating the preprocessing stage, integrating content‑based analysis, and extending the approach to distributed environments to further enhance practical applicability.
