Experiments on Data Preprocessing of Persian Blog Networks
Analyzing and exploring social networks is important to researchers, sociologists, academics, and businesses because of the information these networks hold. The volume, diversity, and growth rate of Web 2.0 data make such analysis challenging. Weblogs are, by definition, a form of social networking. So far, most studies on analyzing weblog networks and exploring their stored data have relied on international data sets. This paper presents a framework for preprocessing and analyzing data in weblog networks and reports the results of applying it to a Persian weblog network as a case study.
💡 Research Summary
The paper presents a comprehensive preprocessing framework tailored for Persian‑language blog networks and demonstrates its application on a real‑world dataset from the Persian blog host “ParsiBlog”. Recognizing that social network analysis heavily depends on the quality of input data, the authors argue that existing studies largely rely on international datasets and often overlook the linguistic and technical challenges inherent to Persian blogs, such as mixed encoding, the use of “Finglish” (Persian words written in Latin script), and the scarcity of large, well‑structured corpora.
The framework is divided into three major modules: content data preprocessing, structure‑based preprocessing, and profile data preprocessing.
- Content Data Preprocessing: The pipeline first strips HTML tags, then normalizes textual content across Persian, English, and Finglish variants, removes stop‑words, extracts keywords, and creates word vectors. A TF‑IDF weighted document‑by‑blog matrix is built, and cosine similarity is computed to quantify thematic similarity between blogs. Applying this to 133,472 posts (April–September 2010) and selecting active bloggers (≥6 posts, at least one post per month) reduces the set to 1,727 blogs and 12,300 posts. After processing, roughly 15,000 distinct keywords remain, providing a compact yet expressive representation of blog content.
- Structure‑Based Preprocessing: Four interaction types are identified in the blogosphere: blog‑to‑blog (blog roll), post‑to‑post (citations), comment links, and trackbacks (the latter absent in ParsiBlog). All links are extracted, outgoing links to external sites are discarded, self‑loops are removed, and the three graphs (blog roll, post, comment) are merged. To mitigate sparsity, isolated nodes are eliminated and only strongly connected components (SCCs) with at least ten nodes are retained. The original network of 21,305 nodes and 257,316 edges collapses to a dense subgraph of 9,065 nodes and 22,216 edges, with the largest SCC containing 8,933 nodes and 220,706 edges. Two standard link‑analysis measures are then computed: PageRank (damping factor 0.85) and HITS (hub and authority scores). The authors also illustrate a simple popularity metric based solely on the number of incoming links, showing a strong correlation with the more sophisticated rankings.
- Profile Data Preprocessing: User profile information is categorized into explicit and implicit attributes: demographic data (age, gender, education), product/brand/person/place mentions, psychological traits (values, attitudes, lifestyle), behavioral history (linking and commenting patterns), non‑verbal cues (ratings, interests), positional data (geolocation), and future tendency indicators (desired products, planned activities). The paper notes that most bloggers are young (average age 21), predominantly male, and fall within the 15‑30 age bracket. These demographic insights, combined with textual and multimedia cues, lay the groundwork for downstream tasks such as user behavior prediction and targeted recommendation.
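To make the content module concrete, the following is a minimal sketch of tag stripping, stop‑word removal, TF‑IDF weighting, and cosine similarity between blogs. It is not the authors' implementation: the function names, the tiny stop‑word list, the regular expressions, and the particular TF‑IDF variant (relative term frequency times log inverse document frequency) are all assumptions chosen for illustration, and Persian/Finglish normalization is not attempted.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list (English, Finglish, Persian).
# A real pipeline would load full lists for each language variant.
STOP_WORDS = {"the", "a", "of", "and", "in", "va", "dar", "و", "در", "به"}

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher, enough for a sketch
TOKEN_RE = re.compile(r"\w+")     # \w is Unicode-aware for str patterns


def tokenize(html):
    """Strip HTML tags, lowercase, split into word tokens, drop stop-words."""
    text = TAG_RE.sub(" ", html).lower()
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


def tfidf_vectors(docs):
    """Map each blog id to a sparse TF-IDF vector (dict: term -> weight).

    `docs` maps a blog id to the concatenated text of its posts.
    """
    token_lists = {blog: tokenize(text) for blog, text in docs.items()}
    n_blogs = len(token_lists)

    # Document frequency: in how many blogs does each term occur?
    doc_freq = Counter()
    for tokens in token_lists.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for blog, tokens in token_lists.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty blogs
        vectors[blog] = {
            term: (count / total) * math.log(n_blogs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (hypothetical blogs, not taken from the ParsiBlog dataset).
sample_docs = {
    "blog_a": "<p>football match football</p>",
    "blog_b": "<div>football league</div>",
    "blog_c": "<p>poetry hafez</p>",
}
sample_vectors = tfidf_vectors(sample_docs)
print(cosine(sample_vectors["blog_a"], sample_vectors["blog_b"]))
print(cosine(sample_vectors["blog_a"], sample_vectors["blog_c"]))
```

In this toy example the two football blogs share a term and get a positive similarity, while the poetry blog shares none and scores zero. A term that occurs in every blog receives weight zero under this variant, which is one reason practical systems often use a smoothed inverse document frequency.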
The experimental results validate the framework’s ability to transform raw, heterogeneous blog data into a structured, analyzable form. Content preprocessing yields a manageable keyword space and enables similarity‑based clustering. Structure preprocessing produces a compact, well‑connected network suitable for graph‑theoretic analyses. Profile preprocessing uncovers valuable sociological statistics about the Persian blogging community.
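The structure module can be sketched in the same spirit. The code below is an illustrative assumption, not the paper's code: it drops external links and self‑loops, merges duplicate edges, keeps only strongly connected components above a size threshold (found here with Kosaraju's algorithm), and then computes PageRank by power iteration with damping factor 0.85 alongside the simple in‑link count. All function names and the fixed iteration count are hypothetical, and HITS is omitted for brevity.

```python
from collections import Counter, defaultdict


def clean_edges(edges, internal_nodes):
    """Drop self-loops and links that leave the blog host; merge duplicates."""
    return {(u, v) for u, v in edges
            if u != v and u in internal_nodes and v in internal_nodes}


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm with iterative DFS; returns a list of node sets."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time on the forward graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all neighbours explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        components.append(component)
    return components


def keep_large_sccs(nodes, edges, min_size=10):
    """Keep nodes in SCCs with at least `min_size` nodes, and edges among them."""
    kept = set()
    for component in strongly_connected_components(nodes, edges):
        if len(component) >= min_size:
            kept |= component
    return kept, {(u, v) for u, v in edges if u in kept and v in kept}


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling-node mass is spread uniformly."""
    nodes = list(nodes)
    n = len(nodes)
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


def in_degree(edges):
    """Simple popularity score: number of incoming links per node."""
    return Counter(v for _, v in edges)
```

A typical call sequence would be `clean_edges`, then `keep_large_sccs` with `min_size=10` as in the summary, then `pagerank` and `in_degree` on the surviving nodes and edges so the two rankings can be compared.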
In the discussion, the authors acknowledge several limitations: the absence of trackback data restricts the full spectrum of interaction types; the language normalization step, especially handling Finglish, lacks a rigorous evaluation of accuracy; and the study treats the network as static, ignoring temporal dynamics that could affect influence measures. They propose future work to incorporate real‑time data streams, improve multilingual normalization techniques, and explore machine‑learning models for automated profile extraction and dynamic influence tracking.
Overall, the paper contributes a domain‑specific preprocessing pipeline that bridges the gap between raw Persian blog data and advanced social network analysis, offering a reproducible methodology for researchers interested in non‑English online communities.