Scaling associative classification for very large datasets
Supervised learning algorithms nowadays successfully scale to very large datasets by leveraging in-memory cluster-computing Big Data frameworks. Still, massive datasets with many large-domain categorical features remain a difficult challenge for any classifier, and most off-the-shelf solutions cannot cope with them. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and to improve the final quality of the model. It also adopts several novel techniques to reach high scalability without sacrificing quality, including preventive pruning of classification rules during the extraction phase based on Gini impurity. We ran experiments on Apache Spark on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results show that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records but also expose the logic behind each prediction and the properties of the model, making it a useful aid for decision makers.
💡 Research Summary
The paper introduces DAC (Distributed Associative Classifier), a scalable associative classification framework designed for extremely large, high‑cardinality categorical datasets. Traditional associative classifiers struggle with such data because the mining of frequent itemsets and association rules can generate a number of patterns that exceeds the size of the dataset, leading to prohibitive memory consumption and long runtimes. DAC addresses these challenges through three main innovations: (1) a Gini‑impurity‑based pre‑pruning strategy applied during rule extraction, (2) a novel CAP‑growth algorithm that builds a compact “CAP‑tree” by ordering items according to their Information Gain (IG) derived from Gini impurity, and (3) an ensemble learning scheme that trains multiple independent models on sampled partitions of the data and combines them with a weighted voting mechanism that takes rule confidence and IG into account.
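The Gini impurity that the pre-pruning strategy relies on can be computed directly from a node's class-label counts. A minimal sketch (function name and toy label counts are illustrative, not from the paper):

```python
def gini(class_counts):
    """Gini impurity of a label distribution given as {label: count}.

    Impurity is 1 minus the sum of squared class proportions: 0 for a
    pure distribution, approaching 1 as labels spread evenly.
    """
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())

# A pure node has impurity 0; a balanced binary split has impurity 0.5.
print(gini({"pos": 10, "neg": 0}))  # 0.0
print(gini({"pos": 5, "neg": 5}))   # 0.5
```

Low impurity among the transactions covered by an item means the item is strongly predictive of a class, which is what makes it worth keeping during rule extraction.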
The CAP‑growth algorithm is a modification of the classic FP‑growth method. In a first scan of the dataset, frequent items are identified and each item’s IG is computed as IG = w_i · (Gini_D − Gini_i), where w_i is the proportion of transactions containing the item. Items with non‑positive IG are discarded, and the remaining items are sorted in descending IG order. In a second scan, each transaction is filtered to keep only these items, reordered according to the IG ranking, and inserted into the CAP‑tree. Each node stores a frequency vector for the class labels, enabling immediate calculation of support, confidence, χ², and IG for any potential rule without additional passes. The greedy extraction phase then traverses the tree, generating Class Association Rules (CARs) from high‑IG items first, thereby avoiding the massive redundancy typical of standard FP‑growth outputs.
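The first-scan scoring and the second-scan filtering described above can be sketched as follows. This is a single-machine toy illustrating the per-item statistics, not the distributed CAP-growth implementation; function names and the toy dataset are assumptions:

```python
from collections import Counter, defaultdict

def item_information_gain(transactions, labels):
    """First scan: score each item i by IG_i = w_i * (Gini_D - Gini_i),
    where w_i is the fraction of transactions containing i and Gini_i
    is the impurity of the labels of those transactions."""
    n = len(transactions)

    def gini(counts):
        tot = sum(counts.values())
        return 1.0 - sum((c / tot) ** 2 for c in counts.values()) if tot else 0.0

    gini_d = gini(Counter(labels))          # impurity of the whole dataset
    per_item = defaultdict(Counter)         # item -> label frequency vector
    for items, label in zip(transactions, labels):
        for it in set(items):
            per_item[it][label] += 1
    return {it: (sum(cnt.values()) / n) * (gini_d - gini(cnt))
            for it, cnt in per_item.items()}

def filter_and_order(transaction, ig):
    """Second scan: drop non-positive-IG items and sort the rest in
    descending IG order before inserting the transaction into the tree."""
    kept = [it for it in set(transaction) if ig.get(it, 0.0) > 0.0]
    return sorted(kept, key=lambda it: (-ig[it], it))

# Toy data: item "a" only occurs with class "x", so it earns the highest IG.
ig = item_information_gain(
    [["a", "b"], ["a", "c"], ["b", "c"], ["b"]],
    ["x", "x", "y", "y"])
print(filter_and_order(["a", "b", "c"], ig))  # ['a', 'b'] — "c" has zero IG
```

Because every CAP-tree node carries a label frequency vector like `per_item` above, support, confidence, χ², and IG for a candidate rule fall out of a single tree traversal, with no further passes over the data.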
Training proceeds by partitioning the original dataset into N subsets (with optional sampling ratio r). Each partition undergoes independent CAP‑growth, producing a local set of CARs. These local models are collected into an ensemble. To keep the final model lightweight, a consolidation phase merges duplicate rules, discards those with low confidence or IG, and retains only the strongest rule for each antecedent. Prediction uses a weighted majority vote where each rule contributes proportionally to its confidence and IG, rather than the naïve “first‑match” approach.
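The weighted vote at prediction time can be sketched as below. The exact weighting function is a simplifying assumption here (each matching rule votes with weight confidence × IG); the paper only states that both quantities are taken into account:

```python
def predict(record, rules):
    """Weighted majority vote over Class Association Rules.

    Each rule is (antecedent: frozenset, cls, confidence, ig). A rule
    fires when all its antecedent items appear in the record; its vote
    weight is confidence * ig (an illustrative choice). Returns the
    class with the highest total weight, or None if no rule fires.
    """
    items = set(record)
    votes = {}
    for antecedent, cls, confidence, ig in rules:
        if antecedent <= items:
            votes[cls] = votes.get(cls, 0.0) + confidence * ig
    return max(votes, key=votes.get) if votes else None

# Hypothetical consolidated ensemble: one strongest rule per antecedent.
rules = [
    (frozenset({"a"}), "x", 0.9, 0.25),
    (frozenset({"b"}), "y", 0.6, 0.04),
    (frozenset({"a", "b"}), "y", 0.8, 0.10),
]
print(predict(["a", "b"], rules))  # 'x' — the high-IG rule outvotes both "y" rules
```

Unlike a first-match scheme, every applicable rule contributes, so a single high-confidence but low-IG rule cannot single-handedly override several strong ones.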
The authors evaluated DAC on Apache Spark using a real‑world dataset exceeding 4 billion records, more than 800 million distinct categorical values, and a total size of over 1 TB. Compared against a state‑of‑the‑art Spark‑based associative classifier and a leading decision‑tree implementation, DAC achieved higher predictive performance (average improvements of 3–5 percentage points in precision, recall, and F1‑score) while reducing overall execution time by 30–45 %. Memory consumption dropped by more than 40 % thanks to the IG‑based pruning. Importantly, the final model consisted of only a few hundred thousand rules, making it human‑readable and allowing domain experts to inspect, validate, and manually adjust the logic—a key advantage for decision‑making contexts.
In summary, DAC demonstrates that associative classification can be made both scalable and interpretable for massive high‑cardinality data by (i) pruning irrelevant items before rule generation using Gini‑based IG, (ii) employing a compact CAP‑tree structure that stores class frequencies for fast rule evaluation, and (iii) leveraging ensemble learning with weighted voting to improve accuracy without inflating model size. The authors release the implementation as open source, encouraging further research on extensions such as dynamic partitioning, handling severe class imbalance, and online learning for streaming data.