Multi-modal Mining and Modeling of Big Mobile Networks Based on Users Behavior and Interest
Usage of mobile wireless Internet has grown very fast in recent years. This radical change in the availability of the Internet has led to the communication of large amounts of data over mobile networks and, consequently, to new challenges and opportunities for modeling mobile Internet characteristics. While the traditional approach to network modeling suggests finding a generic traffic model for the whole network, in this paper we show that this approach does not capture all the dynamics of big mobile networks and does not provide enough accuracy. Our case study, based on a big dataset including billions of netflow records collected from a campus-wide wireless mobile network, shows that user interests, acquired from accessed domains and visited locations, as well as user behavioral groups have a significant impact on the traffic characteristics of big mobile networks. For this purpose, we utilize a novel graph-based approach based on the KS-test as well as a novel co-clustering technique. Our study shows that interest-based modeling of big mobile networks can significantly improve accuracy and reduce the KS distance by a factor of 5 compared to the generic approach.
💡 Research Summary
The paper addresses the growing need for accurate traffic modeling in large‑scale mobile wireless networks, where traditional approaches that fit a single generic model to the entire network fail to capture the heterogeneous dynamics introduced by user behavior and content interest. Using an unprecedented dataset collected from a university campus, the authors combine NetFlow records, DHCP logs, and wireless AP session logs to obtain a comprehensive view of traffic at the flow level, together with the associated user device (via MAC address) and the physical location (building) of each flow. The dataset comprises roughly 100 million flow records per day and covers more than 3,2000 users, making it one of the largest publicly described mobile network traces.
Data preprocessing is performed with the DataPath big‑data engine. After filtering IP prefixes (24‑bit) to focus on popular web services, the top 100 active domains (e.g., google, facebook, netflix) and 68 campus buildings are identified. Each flow is then annotated with its domain and location, enabling a three‑dimensional matrix of domain‑location‑user interactions.
The authors first conduct a distribution-fitting analysis. For each domain and each building, they extract per-second traffic volume series and test nine candidate continuous distributions (Weibull, Lognormal, Generalized Extreme Value, etc.) using the Kolmogorov-Smirnov (KS) test at a 5 % significance level. Results show that no single family dominates across domains: about 25 % are best fit by Weibull, 23 % by Lognormal, 21 % by Generalized Extreme Value, and the remainder by other types. Buildings exhibit a similar pattern, with Weibull (35 %), Lognormal (25 %) and Generalized Extreme Value (18 %) dominating. Hence, the globally best-fit model (Generalized Extreme Value) is not universally optimal for individual domains or locations.
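The fitting step described above can be sketched in Python with SciPy. This is a minimal illustration, not the authors' pipeline: the data is synthetic, only three of the nine candidate families are included, and the names CANDIDATES and fit_and_rank are hypothetical. Each family is fitted by maximum likelihood and ranked by its one-sample KS statistic.

```python
import numpy as np
from scipy import stats

# Illustrative subset of candidate families (three of the nine in the study).
# floc=0 pins the location parameter for the positive-support families.
CANDIDATES = {
    "weibull": (stats.weibull_min, {"floc": 0}),
    "lognormal": (stats.lognorm, {"floc": 0}),
    "gev": (stats.genextreme, {}),
}

def fit_and_rank(volumes, alpha=0.05):
    """Fit each candidate by maximum likelihood and rank by KS statistic."""
    results = {}
    for name, (dist, fit_kwargs) in CANDIDATES.items():
        try:
            params = dist.fit(volumes, **fit_kwargs)
        except Exception:
            continue  # skip a family whose fit fails on this sample
        ks_stat, p_value = stats.kstest(volumes, dist.cdf, args=params)
        results[name] = {
            "ks": ks_stat,
            "p": p_value,
            "accepted": bool(p_value > alpha),  # not rejected at level alpha
        }
    best = min(results, key=lambda n: results[n]["ks"])
    return best, results

# Synthetic stand-in for one domain's per-second traffic volumes (assumption).
rng = np.random.default_rng(0)
volumes = rng.lognormal(mean=3.0, sigma=0.5, size=2000)

best, results = fit_and_rank(volumes)
print(best, {name: round(r["ks"], 4) for name, r in results.items()})
```

Because the parameters are estimated from the same sample that is then tested, the p-values returned by the KS test tend to be optimistic; the KS statistic itself is still usable for ranking candidates against each other.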
To quantify similarity between traffic patterns, the authors apply the two-sample KS test to every pair of domains and every pair of buildings, building two similarity matrices (100 × 100 for domains, 68 × 68 for buildings). These matrices are interpreted as graphs where nodes represent domains or buildings and edges indicate statistically indistinguishable traffic distributions, that is, pairs for which the test does not reject the hypothesis of a common distribution. Modularity-based community detection on these graphs, visualized with the Fruchterman-Reingold layout, reveals distinct modules: popular domains such as google, facebook, and apple form isolated high-degree nodes with unique traffic signatures, while the remaining domains group into 12 clusters of varying sizes. Buildings also separate into four modules, reflecting functional categories (lecture halls, labs, dormitories, etc.) that influence traffic characteristics.
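A rough sketch of the graph construction described above, assuming SciPy and NetworkX are available. The series are synthetic, the node names and the helper ks_similarity_graph are hypothetical, and greedy modularity maximization is used as one possible modularity method; the summary does not say which algorithm the authors used.

```python
import itertools
import numpy as np
import networkx as nx
from scipy import stats
from networkx.algorithms.community import greedy_modularity_communities

def ks_similarity_graph(series, alpha=0.05):
    """Link two nodes when the two-sample KS test cannot tell them apart."""
    graph = nx.Graph()
    graph.add_nodes_from(series)
    for a, b in itertools.combinations(series, 2):
        ks_stat, p_value = stats.ks_2samp(series[a], series[b])
        if p_value > alpha:  # fail to reject "same distribution"
            graph.add_edge(a, b, ks=ks_stat)
    return graph

# Synthetic traffic series for six hypothetical domains in two regimes.
rng = np.random.default_rng(1)
series = {f"light_{i}": rng.exponential(1.0, 400) for i in range(3)}
series.update({f"heavy_{i}": rng.normal(50.0, 5.0, 400) for i in range(3)})

graph = ks_similarity_graph(series)
modules = greedy_modularity_communities(graph)  # list of node sets
positions = nx.spring_layout(graph, seed=0)     # Fruchterman-Reingold layout

for module in modules:
    print(sorted(module))
```

The layout step only positions nodes for plotting; the grouping itself comes from the modularity step.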
The most innovative contribution is the joint analysis of domain and location using an information‑theoretic co‑clustering algorithm (Bregman‑divergence minimization). This technique simultaneously clusters users based on the domains they access and the buildings they visit, yielding five behavioral groups. Each group exhibits a characteristic mix of content types (e.g., video streaming, social media) and spatial patterns (e.g., frequenting lecture halls versus cafeterias). The co‑clusters demonstrate that multi‑modal user profiles capture nuances that single‑dimension clustering cannot.
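To give a feel for co-clustering, the sketch below groups rows (users) and columns (domains) of a small count matrix at the same time. It deliberately swaps in scikit-learn's SpectralCoclustering, which is a different algorithm from the information-theoretic method the paper uses. The matrix, the domain names, and the choice of two clusters are all made up for illustration, and buildings are left out for brevity.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical user-by-domain visit counts: users 0-2 favour the first three
# domains, users 3-5 favour the last three.
domains = ["video_a", "video_b", "video_c", "social_a", "social_b", "social_c"]
counts = np.array([
    [30, 25, 28, 1, 2, 1],
    [27, 31, 24, 2, 1, 1],
    [29, 26, 30, 1, 1, 2],
    [1, 2, 1, 22, 27, 25],
    [2, 1, 1, 26, 24, 28],
    [1, 1, 2, 25, 29, 23],
], dtype=float)

# Spectral co-clustering stands in for the information-theoretic method.
model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(counts)

user_groups = model.row_labels_       # one cluster label per user
domain_groups = model.column_labels_  # one cluster label per domain

for label in sorted(set(user_groups)):
    users = np.where(user_groups == label)[0].tolist()
    names = [d for d, g in zip(domains, domain_groups) if g == label]
    print(f"group {label}: users {users}, domains {names}")
```

Each label pairs a set of users with the set of domains they mostly access, which is the kind of joint grouping the behavioral groups above rely on.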
Finally, the paper constructs “interest‑based” traffic models by assigning the appropriate fitted distribution to each domain, building, and behavioral group, and compares these models against a baseline generic model that uses a single distribution for the whole network. Using the KS distance as the error metric, the interest‑based approach reduces the average distance from 0.12 (generic) to 0.024, a five‑fold improvement. Weighted traffic intensity further confirms the robustness of the method. The authors argue that such refined models can improve network resource allocation, content caching strategies, and the design of profile‑aware services such as profile‑cast and iCast, which currently rely on coarse mobility or location information alone.
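The comparison between a generic and an interest-based model can be illustrated with a small synthetic experiment. This is only a sketch under stated assumptions: the two groups, their names, and the use of a lognormal family throughout are invented for the example, and the resulting numbers are unrelated to the 0.12 and 0.024 reported above.

```python
import numpy as np
from scipy import stats

def ks_distance(sample, params):
    """KS distance between a sample and a fitted lognormal model."""
    return stats.kstest(sample, stats.lognorm.cdf, args=params).statistic

# Synthetic traffic for two hypothetical interest groups at different scales.
rng = np.random.default_rng(2)
groups = {
    "streaming": rng.lognormal(mean=6.0, sigma=0.4, size=1500),
    "messaging": rng.lognormal(mean=2.0, sigma=0.8, size=1500),
}

# Generic model: a single distribution fitted to the pooled traffic.
pooled = np.concatenate(list(groups.values()))
generic_params = stats.lognorm.fit(pooled, floc=0)
generic = np.mean([ks_distance(x, generic_params) for x in groups.values()])

# Interest-based model: one distribution fitted per group.
per_group = np.mean([
    ks_distance(x, stats.lognorm.fit(x, floc=0)) for x in groups.values()
])

print(f"generic: {generic:.3f}  interest-based: {per_group:.3f}")
```

When the groups differ strongly, the pooled fit matches neither of them well, so its average KS distance is much larger than that of the per-group fits.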
In summary, the study demonstrates that incorporating user interests (accessed domains) and spatial context (visited locations) into traffic modeling substantially improves accuracy for big mobile networks, reducing the KS distance roughly five-fold relative to a generic model. The methodology, which combines large-scale data collection, rigorous statistical fitting, graph-based similarity analysis, and information-theoretic co-clustering, provides a reproducible framework for future research and practical network engineering.