Circadian patterns of Wikipedia editorial activity: A demographic analysis
Wikipedia (WP) as a collaborative, dynamical system of humans is an appropriate subject of social studies. Each single action of the members of this society, i.e. editors, is well recorded and accessible. Using the cumulative data of 34 Wikipedias in…
Authors: Taha Yasseri, Robert Sumi, Janos Kertesz
Taha Yasseri 1,*, Robert Sumi 1, János Kertész 1,2
1 Department of Theoretical Physics, Budapest University of Technology and Economics, Budafoki út 8., H-1111 Budapest.
2 BECS, School of Science, Aalto University, P.O. Box 12200, FI-00076 Espoo.
* E-mail: yasseri@phy.bme.hu

Abstract

Wikipedia (WP), as a collaborative, dynamical system of humans, is an appropriate subject of social studies. Each single action of the members of this society, i.e. editors, is well recorded and accessible. Using the cumulative data of 34 Wikipedias in different languages, we try to characterize and find the universalities and differences in temporal activity patterns of editors. Based on this data, we estimate the geographical distribution of editors for each WP in the globe. Furthermore, we also clarify the differences among different groups of WPs, which originate in the variance of cultural and social features of the communities of editors.

Introduction

Relying on the data gathered by recently developed information and communication technologies (ICT), studies on social systems have entered a new era, in which one is able to track and analyze the behavior of a large number of individuals and the interactions between them in detail. Among all examples, recent investigations based on cell phone records (calls [1] and text messages [2]) and web-based societies and media (web pages [3]; movie, news and status sharing sites, e.g., YouTube.com [4], digg.com [5] and twitter.com [6]) have opened very interesting insights into features of collective and cooperative dynamics of human systems.
Wikipedia (WP), as a free, web-based encyclopedia which is entirely written and edited by volunteers from all around the world, has also attracted the attention of many researchers recently [7–10] (for a recent review, see [11]). To study WP, and to understand and model its evolution [7], coverage [12], conflicts or editorial wars [13, 14], user reputation [15] and many other issues, we should obtain basic information about the community of its editors, i.e., their age, education level, nationality, individual editorial patterns, fields of interest and many other aspects. Yet, there have been few systematic and unbiased studies in this direction. The main barrier here is privacy issues, which prohibit any attempt to obtain personal data of committed editors.

There are two ways of contributing to Wikipedia. The first way is editing as an unregistered user; in this case all the edits are recognized by the IP address of the editor, and therefore it becomes easy to locate the editor and collect some geographical information about him/her. But most of the editors take a second way, which is editing under a registered user name. This hides the real-world identity and IP address of the editor and is therefore a much more secure way of contributing. Moreover, contributions of such serious editors are identified and unified under one single nickname, irrespective of which IP address they use to connect to the network, and can be counted as a measure of maturity in the promotion processes. Cohen has extracted geographical data from IP addresses of unregistered editors of English WP, integrated them over time and concluded that about 80% of edits on English WP originate from a few English-speaking countries with high Internet penetration rates, i.e., 60% from the USA, 12% from the UK, 7% from Canada and 5% from Australia [16].
However, contributions from unregistered editors are limited to less than 10 percent for many WPs (see Table 1). Moreover, the rather small sample of unregistered users does not represent the features of average users, as will be discussed later. Therefore, indirect methods to locate editors or to obtain any kind of information about the community are highly desirable. One of our aims is to show that, using the temporal patterns of WP users, conclusions about the geographical distribution of (registered) editors can be drawn.

Recently much effort has been devoted to describing and understanding the extreme temporal inhomogeneity of human activities, represented by the burstiness of activities and the fat-tailed distribution of time intervals between events [17]. While the circadian and other periodic characteristics of temporal patterns of human activities cannot account for the whole richness of bursty behavior [18], they remain important for understanding the entire dynamics of the systems. These regularities are induced by circadian and seasonal cycles of nature [19] on the one hand and by cultural aspects on the other. Consequently, studies on diurnal patterns of Internet traffic have brought interesting information about individual habits of Internet usage in different societies [20, 21]. In this paper our focus is on such cyclic behavior, while investigations of other aspects of temporal inhomogeneities, like short-time bursty behavior and inter-event interval distributions, are reported elsewhere [22].

West et al. have tried to make use of diurnal characteristics of edits to detect vandalism and destructive edits [23].
Their study was again restricted to tracking positive and negative edits from unregistered editors, for which they found that most of the "offending edits" are committed during working hours and working days rather than after dark and on weekends. In the adminship of Wikipedia it is also becoming fashionable to use the personal temporal fingerprint of editors as a side tool to detect and prevent sock puppetry [footnote 1], although this could only be done with high respect for the privacy policies of the Wikimedia Foundation [footnote 2].

In this work, we first try to characterize the circadian pattern of edits on Wikipedia by analyzing massive data of 34 WPs; then we introduce a novel method to locate the editors of large international WPs, e.g., English, simple English, Spanish, etc., and to find their geographical distribution. Furthermore, we analyze the temporal behavior of editors on longer time scales, i.e. weekly patterns, and report on significant differences between various societies.

Methods

This work is carried out on 34 WPs selected from the largest ones with respect to the number of articles, i.e., those which have more than 100,000 articles [footnote 3]. Among the sample, the total numbers of edits and editors vary from 3 M to 455 M and from 46 k to 14 M, respectively. In Table 1 some statistics about the WPs under investigation are reported. We considered every single edit performed on each WP and, using the timestamps assigned to the edits, calculated the overall activity of users by time of day and day of the week.

To see the universality of circadian activity patterns among editors of all different languages, we assumed a local time offset for each language. Clearly there are some languages which are not spoken in only one country or one time zone, e.g., Spanish, Arabic, etc., whereas some others are very localized in a specific time zone, e.g., Italian, Hungarian, etc.
For the first sort of languages, we initially considered the time offset of the best-known origin of the language. For the special cases of the English and simple English Wikipedias, we initially considered an offset corresponding to USA Central Time. In the tenth column of Table 1, the time offset assigned to each language is reported. Note that, due to the lack of information such as the IP addresses of users, these initial assumptions for the origin of edits and the corresponding time offset cannot be improved at this step. It is one of our goals to implement a method, based on the average behavior of WP editors, which is able to determine the percentage of the contributions coming from different geographic units. This method will be described in the next section, in sequence with the empirical observations.

Footnote 1: WP editors are generally expected to edit using only one account. Sock puppetry is the use of multiple accounts to deceive other editors, disrupt discussions, distort consensus, avoid sanctions, etc., which is forbidden according to WP rules.
Footnote 2: http://wikimediafoundation.org/wiki/Privacy_policy
Footnote 3: Two Wikipedias, Volapük and Waray-Waray, are excluded from the list due to their small number of speakers and Wikipedians and considering that many of their articles are robotically generated. The simple English Wikipedia is also included in the list, although it contains only around 70,000 articles.

Results

Circadian patterns

We calculated the normalized number of edits for each of the 34 WPs in consecutive time windows of one hour for the 24 hours of the day. The relative activity level of each time window is calculated by dividing the number of edits within the time window by the total number of edits. In this way the circadian activity patterns are created, as depicted in Figure 1 (a).
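The hourly normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `hourly_activity`, the representation of edit times as POSIX timestamps (seconds, UTC), and the use of a fixed UTC offset with no daylight saving correction are all hypothetical choices.

```python
from collections import Counter


def hourly_activity(timestamps_utc, utc_offset_hours=0.0):
    """Return 24 fractions: the share of edits falling in each local hour.

    timestamps_utc   -- iterable of POSIX timestamps in seconds (assumed input format)
    utc_offset_hours -- fixed offset assigned to the language, e.g. 1, -6 or 3.5
    """
    counts = Counter()
    for ts in timestamps_utc:
        # Convert to hours and move into the assumed local time of the language.
        local_hours = ts / 3600.0 + utc_offset_hours
        # Floor to the start of the one-hour window and wrap onto a 24 h clock.
        counts[int(local_hours // 1) % 24] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24
    # Divide the edits in each window by the total number of edits.
    return [counts[hour] / total for hour in range(24)]
```

Multiplying the returned fractions by 100 gives percentages. A call such as `hourly_activity(edits, 3.5)` would apply the half-hour offset listed for Persian in Table 1.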
Most WPs show a universal pattern: a minimum of activity at around 6 A.M., followed by a rapid increase up to noon. The activity then shows a slight increase until around 9 P.M., after which it starts to decrease during the night. Qualitatively similar shapes are observed for other kinds of human activities, e.g. cell phone calls and text messages [18], and Internet instant messaging [24].

Deviations

Among all 34 investigated WPs, there are four which significantly deviate from all the others with respect to the circadian patterns. In Figure 1 (c) and (d), the diurnal activity for these four outliers, the Spanish, Portuguese, English and simple English WPs, is shown. In the case of Spanish and Portuguese, the main difference from the rest of the WPs is a slight shift to the right (later times). Having in mind that Spain and Portugal both use local times which have a larger offset compared to the countries with the same longitude, this comes as no surprise. Besides that, the rather large number of speakers from Latin America not only favors this shift, but also flattens the overall amplitude of the diurnal pattern (this will be discussed later in more detail). And finally, the cultural features of those two countries might contribute to this observation.

In the case of the English and simple English WPs, for simplicity, we assumed the reference to be UTC-6 (which corresponds to the Central Time Zone of the US). Naturally the deviations from the universal pattern are very strong, indicating the complex origin of the English WP. We will come back to this point later.

To better illustrate the deviation from an average circadian pattern, we calculated the weighted average of the curves in Figure 1 (a). Each WP's pattern is weighted by its total number of edits. The average curve is depicted in Figure 1 (b).
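The edit-weighted average curve, and the deviation of a single WP from it, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation; the function names and the dictionary-based inputs (WP code mapped to a 24-value pattern and to a total edit count) are assumptions made for the example.

```python
def weighted_average_pattern(patterns, total_edits):
    """Average the hourly patterns, weighting each WP by its total number of edits.

    patterns    -- dict: WP code -> list of 24 activity values (assumed layout)
    total_edits -- dict: WP code -> total number of edits, used as the weight
    """
    weight_sum = sum(total_edits[name] for name in patterns)
    return [
        sum(total_edits[name] * patterns[name][hour] for name in patterns) / weight_sum
        for hour in range(24)
    ]


def deviation(pattern, average):
    """Hour-by-hour difference between one WP's pattern and the average curve."""
    return [p - a for p, a in zip(pattern, average)]
```

Because the weights are total edit counts, a large WP pulls the average toward its own shape, which is why the four outlier WPs are left out of the set of curves being averaged.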
Now we can calculate the difference from this average pattern for each WP, $D(t)$, at different times of the day $t$. According to the shape of $D(t)$ and by maximizing the cross-correlation coefficient, almost all WPs could be categorized into 4 categories, as in Figure 2. Two of these categories, Figure 2 (a) and (c), consist of WPs which have less activity during the night compared to the average pattern. These WPs are all in European languages which are spoken in single, localized regions, and therefore the minimum of activity of their editors is deeper than for the others. In Figure 2 (b), a category consisting of Asian languages is shown. These WPs are more active during the night and less active during working hours compared to the average. In the last category, shown in Figure 2 (d), a higher activity during the night and a lower activity during working hours is a clear sign of an extended distribution of contributors over different time zones. Arabic, Persian and Chinese are in this category, in addition to Spanish, Portuguese, English and simple English (not shown).

The other way to look at the locality of the languages is to quantify the sleep depth. Sleep depth is defined as the difference between the maximum and the minimum of the activity of the users of each language, and might be taken as a measure of the locality of the global distribution of the editors of the corresponding language. In the last column of Table 1, the calculated depth values are reported. These values vary from 2.3 for simple English to 5.6 for Italian. Among the WPs with small sleep depth are Arabic, Indonesian, Persian and English. The average sleep depth for the category of Fig. 2(d) is 2.8 with a standard deviation of 0.4. Among the languages with large sleep depth are Italian, Hungarian, Polish, Catalan, and Dutch.
These are all languages which are mostly spoken in a narrow area of the world and are therefore very localized in time zones. The average sleep depth for the category of Fig. 2(c) is 4.9 with a standard deviation of 0.4. It should also be mentioned that, although Spanish and Portuguese are both widely spoken in different areas and different time zones, the sleep depths of both lie in the middle range (4.4 and 4.2, respectively). For a more precise interpretation, we try to estimate the share of editors from different areas for each WP in the next section.

Geographical distribution of editors

As mentioned above, due to privacy policy issues, there is no access to the locating information of registered editors, such as IP addresses. However, there are studies considering only contributions by unregistered users, which give a very rough image of the real distribution of editors over the globe [16]. We aim at a better method by decomposing the overall activity pattern of each WP into basic elements, which are assumed to be representative of contributions originating purely from a certain time zone. For this purpose, we averaged over the activity patterns of the 10 WPs with the deepest sleep to obtain a smooth curve, which has the features of the collective activity of users in synchrony (hereafter called the standard curve $S(t)$). In the next step, we assume that the activity pattern $A(t)$ of a WP with a wide spatial distribution of editors can be simulated by a superposition of $N$ standard curves with different time shifts $\tau_i$ and different weights $w_i$ for $i = 1$ to $N$,

$$A(t) = \sum_{i=1}^{N} w_i \, S(t - \Delta\tau_i) \qquad (1)$$

where $\Delta\tau_i$ is the difference between $\tau_i$ and the assumed time offset of the language (see Table 1). In general, one could minimize the error of the simulated activity pattern for each WP for $N = 24$ different offsets and find the optimal weighting.
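The decomposition in Eq. (1) can be sketched as a small non-negative least-squares problem. The block below is a minimal illustration and not the authors' procedure: the source does not state which optimizer or error function was used, so the squared-error objective and the cyclic coordinate-descent solver are assumptions, as are all function names. Shifts are restricted to whole hours, and the sign convention of the shift simply follows Eq. (1) as written.

```python
def sleep_depth(pattern):
    """Difference between the maximum and minimum of an hourly activity pattern."""
    return max(pattern) - min(pattern)


def standard_curve(patterns, n_deepest=10):
    """Average the n patterns with the largest sleep depth to obtain S(t)."""
    deepest = sorted(patterns, key=sleep_depth, reverse=True)[:n_deepest]
    return [sum(p[hour] for p in deepest) / len(deepest) for hour in range(24)]


def shift(curve, delta_hours):
    """Return S(t - delta) on a 24 h cycle, for an integer number of hours."""
    return [curve[(hour - delta_hours) % 24] for hour in range(24)]


def fit_weights(activity, standard, shifts, n_iter=200):
    """Estimate non-negative weights w_i for the shifted standard curves.

    Minimizes sum_t (A(t) - sum_i w_i S(t - shift_i))^2 subject to w_i >= 0,
    using cyclic coordinate descent (an assumed, simple solver).
    """
    basis = [shift(standard, d) for d in shifts]
    weights = [0.0] * len(basis)
    residual = list(activity)  # A(t) minus the current model
    for _ in range(n_iter):
        for i, curve in enumerate(basis):
            norm = sum(x * x for x in curve)
            if norm == 0.0:
                continue
            # Best unconstrained update of w_i with the others fixed, clipped at zero.
            step = sum(r * x for r, x in zip(residual, curve)) / norm
            new_weight = max(0.0, weights[i] + step)
            change = new_weight - weights[i]
            if change != 0.0:
                residual = [r - change * x for r, x in zip(residual, curve)]
                weights[i] = new_weight
    return weights
```

Passing only the shifts of regions with a considerable number of speakers, rather than all 24 offsets, corresponds to the supervised restriction of $N$; dividing each fitted weight by the sum of the weights gives an estimated share per time zone.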
Clearly, the weights are proportional to the volume of contributions from each time zone. Following this outline we did the optimization, but in a more supervised manner. We restricted $N$ to the number of different time zones which are relevant candidates for being an origin of contributions, e.g., we excluded the time zones of uninhabited areas of the earth. Furthermore, to reduce the complexity of the calculations and also to avoid multiple solutions, we reduced $N$ to the number of areas which have a considerable number of speakers of the language. In many cases, by a superposition of between 3 and 6 standard curves, we could fit the empirical data with a high value of the correlation coefficient between the simulated and empirical data sets (see Figure 3), whereas taking larger $N$ does not decrease the error and only leads to more zero $w_i$'s. Finally, by a proper combination of demographic information and optimization techniques, we estimated the share of different regions for 9 different WPs. These estimations are summarized in Figure 4. Though in some cases the error function is rather flat around its minimum, leading to a relatively large tolerance in the calculated weights, the existence of separated multiple minima is ruled out by applying the demographic restrictions.

Weekly patterns

We also considered the activity of editors during the week and its dependence on the day of the week. These results are shown in Figure 5. According to the weekly pattern of activity, we could categorize 28 out of 34 WPs into 4 different categories, which belong to two main categories of "working days" and "weekend" activity. In the upper-left panel of Figure 5, those WPs are shown which have the highest activity of editors during the working days of the week. Among them are English, simple English, German, Spanish, Portuguese and Italian.
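The weekly profile described above can be computed in the same spirit as the hourly one. The sketch below is an assumption-laden illustration, not the authors' code: the function names, the POSIX-timestamp input and the fixed UTC offset are hypothetical, and the `weekend_days` parameter is introduced here only so that a Friday-inclusive weekend can be examined.

```python
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekly_activity(timestamps_utc, utc_offset_hours=0.0):
    """Return 7 fractions (Monday first): the share of edits on each local weekday."""
    counts = [0] * 7
    for ts in timestamps_utc:
        local_seconds = ts + utc_offset_hours * 3600.0
        # Whole local days elapsed since 1970-01-01, which was a Thursday (index 3).
        day_index = int(local_seconds // 86400)
        counts[(day_index + 3) % 7] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 7
    return [c / total for c in counts]


def weekend_share(weekly, weekend_days=(5, 6)):
    """Sum the activity on the given weekday indices (default: Saturday, Sunday)."""
    return sum(weekly[day] for day in weekend_days)
```

Comparing `weekend_share(profile)` with `weekend_share(profile, weekend_days=(4, 5, 6))` is one simple way to see whether Fridays carry weekend-like activity in a given WP.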
In the rest of the WPs, a large part of the edits are done during weekends. In the class of Polish, Dutch, Korean and Japanese WPs (upper-right panel of Figure 5), equal activities are shown on Saturdays and Sundays, whereas in the class of Danish, Swedish, Norwegian and Finnish WPs, editors have very low activity on Saturdays. The last class of the "weekend" WPs consists of the Arabic and Persian WPs, in which Fridays are also active days in addition to Saturdays and Sundays. The latter is no surprise, considering that Friday is a public holiday in all of the countries of origin, along with Saturday in most of them.

Discussion

The novel approach to the collective characteristics of the community of editors of WPs described above enables us, for the first time, to shed light on less studied aspects of Wikipedia. Based on the reported results, many basic questions and concerns about the whole of the Wikimedia projects can be investigated. Knowing the spatial distribution of the editors of a certain WP would be a reliable basis for explaining, to a good extent, specific biases in WP articles, heterogeneous topical coverage, and the origins of conflicts and editorial wars.

In addition, these results raise new questions and puzzles as well. Considering the large population of English speakers in North America compared to Europe, and the fact that the Internet is most developed in North America, the estimated share of only around half for North America in the English WP is a puzzle, which definitely needs further multidisciplinary studies. In the case of the simple English WP, the European share is even larger, which is not surprising, together with the fact that the share of the Far East is increased, since this WP is meant to be of use to non-native speakers (though not necessarily written by them). Note that the previous results of [16] and [23] are partially supported by the results reported here.
For instance, a share of less than 10% for Australian editors in the English WP is reported in both articles. Unfortunately, there is no explicit focus on the contributions from European countries in the mentioned works, and it seems the large amount of effort by European editors was overlooked.

However, we have repeated the measurements on IP addresses of unregistered users more generally for different WPs, by following every single edit of this type to locate the editor. Firstly, we constructed the "precise" activity pattern of unregistered users, as shown in Fig. 1(b). The activity pattern of unregistered users clearly has a deeper minimum at night and a higher maximum during working hours compared to most of the other curves. Unregistered users contribute to WP occasionally and mostly with only a few edits from the same IP address. To be actively editing even at night, one must be extremely committed to WP; therefore the deep sleep of the activity curve of unregistered users comes as no surprise. We believe the sample of unregistered users is more representative of the activity of WP readers, who edit rarely as they notice the need for tiny modifications here and there while reading the articles, than of committed users, who basically write the main body of the articles.

The percentage of contributions by unregistered users is measured and reported in Table 1 for all 34 WPs. This value varies between 4 for the Slovenian and 37 for the Japanese WP. We compared the results for the geographical distribution of editors obtained by locating them through their IP addresses with the previous results described above and observed that both methods give mainly similar results for the WPs with a rather large share of unregistered users, whereas they deviate for WPs with a small share of unregistered users.
Finally, one should consider the fact that committed users sometimes edit without using their registered user name, to vandalize or to edit specific controversial articles without leaving any trace which may cause trouble for the original user name. In such cases, most of the time an "open proxy" (with an arbitrary IP address) is used to hide the real IP address of the editor. This makes the analysis based on IP addresses even looser.

Another interesting part of the results concerns the Persian WP. Although more than 70% of native Persian speakers live in Iran and the rest in closely neighboring countries, the corresponding WP appears in the top list of WPs with small sleep depth. In addition, the estimated share of edits from Iran is only about 45%. This could be due to the following facts. 1) Strong restrictions on Internet usage have been applied by the Iranian government over the years as a consequence of socio-political issues [footnote 4], which makes it difficult to contribute to WP using Iran-based ISPs. 2) Iran has a high rate of emigration of students and scholars. This has led to the formation of large intellectual communities outside Iran, which might be responsible for a considerable amount of the edits in the Persian WP.

The low level of contribution to the French WP by North American editors and to the Arabic WP by Egyptian editors could have roots in the differences between the spoken dialects and the standard languages. Though both languages (French and Arabic) are among the official languages in the mentioned regions, it seems that the divergence between dialects plays an important role in suppressing contributions to WP.

Footnote 4: Wikipedia is one of the few remaining unbanned Web 2.0 web sites currently in Iran.
It should be mentioned that there is a separate WP in the local dialect of Egypt (Egyptian Arabic Wikipedia), and there has been an unsuccessful effort to launch a Canadian French Wikipedia recently. Therefore we think that the estimations of contributions could be of interest to the WP community too, and could support the process of decision making for a new WP in a local dialect.

Clearly, the presented method also has its limitations. For instance, accessing information about the distribution of editors in different latitudes is impossible by considering only the timestamps, since the time of day distinguishes time zones and not positions within them. Moreover, the resolution of the regional estimations is not very high. Because of many factors, e.g. the application of summer time in many countries, the method cannot claim a resolution finer than a one-hour stripe. For example, in the case of the English WP, the supervised optimization results in a ratio of 3 for the weight of GMT+1 over GMT+0, corresponding to Central European and Western European times. But because of the mentioned reasons, a distinction between the shares of very close time zones is not justifiable. Moreover, in some cases the error of the simulated activity pattern is not very sensitive to changes in the weights of spatially close offsets. However, all the results presented above are precise up to the last significant digit.

Putting the deviations from the average of daily activities side by side with the weekly activity for all WPs, one is able to draw very clear conclusions. For example, the daily patterns of Asian languages (e.g., Japanese, Chinese and Korean) show higher activity during evenings and nights along with a high level of activity at weekends. This can be related partly to the lengths of working hours in the corresponding countries.
This general image, which holds partially for Turkey, Russia and Israel too, could be closely related to the high average working hours per week (more than 40 hours in all the mentioned cases [footnote 5]) in those countries. Furthermore, among European countries we also see the same tendency; in the countries with rather longer working times, edits are mostly done at later times in the evenings.

It should be mentioned that the same analysis has been done for the seasonal patterns, to extract the effects of changes in daylight timing, but the large fluctuations in the average behavior make it very difficult to draw relevant conclusions. The only significant large-scale seasonal pattern is the reduction of activity on approaching the new year holidays for many WPs.

Footnote 5: According to the dataset of The Organization for Economic Co-operation and Development: http://stats.oecd.org

In conclusion, based on a dataset of time-stamped edits on different Wikipedias, we studied the diurnal and weekly patterns of activity of editors. We could see a universal circadian pattern for all WPs, which has its minimum at dawn and its maximum in the late afternoon and early evening. According to this investigation, we also argued that, using a weighted mixture of contributions from different time zones and an optimization procedure, we can estimate the different contributions to a WP. In particular, we observe that a considerably large part of the edits on the English and simple English WPs originate from Europe and that the share of North America is below expectations. The same type of analysis was also performed for other WPs in different languages. In contrast to the diurnal pattern, which is universal to a great extent, the weekly activity patterns of WPs show remarkable differences. We could, however, identify two main categories, namely "weekends" and "working days" active WPs.
Further studies are needed to explain these observations in detail and relate them to cultural and social differences.

Acknowledgments

The project ICTeCollective acknowledges the financial support of the Future and Emerging Technologies (FET) program within the Seventh Framework Program for Research of the European Commission, under FET-Open grant number: 238597. JK and TY thank the FiDiPro program of TEKES for partial support. We also thank Wikimedia Deutschland e.V. for providing us with the data through the Wikimedia toolserver platform.

References

1. Karsai M, Kivelä M, Pan RK, Kaski K, Kertész J, et al. (2011) Small but slow world: How network topology and burstiness slow down spreading. Phys Rev E 83: 025102.
2. Wu Y, Zhou C, Xiao J, Kurths J, Schellnhuber HJ (2010) Evidence for a bimodal distribution in human communication. Proceedings of the National Academy of Sciences.
3. Huberman BA, Adamic LA (1999) Internet: Growth dynamics of the world-wide web. Nature 401: 131.
4. Szabo G, Huberman BA (2010) Predicting the popularity of online content. Communications of the ACM 53.
5. Wu F, Huberman BA (2007) Novelty and collective attention. Proceedings of the National Academy of Sciences 104: 17599-17601.
6. Huberman BA, Romero DM, Wu F (2009) Social networks that matter: Twitter under the microscope. First Monday 14.
7. Voss J (2005) Measuring Wikipedia. Proc 10th Intl Conf Intl Soc for Scientometrics and Informetrics.
8. Wilkinson DM, Huberman BA (2007) Assessing the value of cooperation in Wikipedia. First Monday 12.
9. Ortega F, Gonzalez-Barahona JM, Robles G (2008) On the inequality of contributions to Wikipedia. Hawaii International Conference on System Sciences 0: 304.
10. Ratkiewicz J, Fortunato S, Flammini A, Menczer F, Vespignani A (2010) Characterizing and modeling the dynamics of online popularity. Phys Rev Lett 105: 158701.
11. Park TK (2011) The visibility of Wikipedia in scholarly publications. First Monday 16.
12. Holloway T, Bozicevic M, Börner K (2007) Analyzing and visualizing the semantic coverage of Wikipedia and its authors. Complexity 12: 30–40.
13. Sumi R, Yasseri T, Rung A, Kornai A, Kertész J (2011) Characterization and prediction of Wikipedia edit wars. In: Proceedings of the ACM WebSci'11: 1–3.
14. Sumi R, Yasseri T, Rung A, Kornai A, Kertész J (2011) Edit wars in Wikipedia. In: Proceedings of the Third IEEE International Conference on Social Computing (SocialCom).
15. Javanmardi S, Lopes C, Baldi P (2010) Modeling user reputation in wikis. Statistical Analysis and Data Mining 3: 126–139.
16. Cohen J (2010) Computational methods for historical research on Wikipedia's archives. e-Research 1: 67-72.
17. Barabási AL (2005) The origin of bursts and heavy tails in human dynamics. Nature 435: 207-211.
18. Jo HH, Karsai M, Kertész J, Kaski K (2011) Circadian pattern and burstiness in human communication activity. e-print arXiv:1101.0377.
19. Panda S, Hogenesch JB, Kay SA (2002) Circadian rhythms from flies to human. Nature 417: 329-335.
20. Spennemann DHR (2006) The internet and daily life in Australia: An exploration. The Information Society 22: 101-110.
21. Spennemann DH, Atkinson J, Cornforth D (2007) Sessional, weekly and diurnal patterns of computer lab usage by students attending a regional university in Australia. Computers and Education 49: 726-739.
22. Yasseri T, Sumi R, Rung A, Kornai A, Kertész J (in preparation) Dynamics of conflicts in Wikipedia.
23. West AG, Kannan S, Lee I (2010) Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. In: Proceedings of the Third European Workshop on System Security. New York, NY, USA: ACM, EUROSEC '10, pp. 22–28.
24. Pozdnoukhov A, Walsh F (2010) Exploratory novelty identification in human activity data streams. In: Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming at 18th ACM SIGSPATIAL GIS.

Figure Legends

Figure 1. Normalized activity of editors for (a) all WPs listed in Table 1 excluding English, simple English, Spanish and Portuguese; (b) the average curve extracted from the curves in (a) and the standard curve extracted from the 10 most localized WPs, along with the activity curve of unregistered users, whose IP addresses are known and who can therefore be located and assigned a precise local time zone; (c) activity pattern of the Spanish (red) and Portuguese (green) WPs; and (d) activity pattern of the English (red) and simple English (green) WPs.

Figure 2. Deviation of activity patterns from the average curve, leading to 4 different categories of WPs. The gray dotted line is the average deviation of each category. The sleep depths (for the definition, see the text) for categories a-d are 4.5 ± 0.2, 4.3 ± 0.2, 4.9 ± 0.1 and 2.8 ± 0.2, respectively.

Figure 3. Decomposition of the activity pattern of the English WP into 5 shifted standard curves with different weights. The blue line is the empirical data, the yellow curve is the standard curve (see the text for the definition), the thin dotted lines are shifted and weighted standard curves, and the red dotted line is their linear superposition, which models the empirical data properly.

Figure 4. Estimation of users' contribution from different regions. By combining the outputs of the optimization process described in the text with the demographic data of each language, the share of each region in each WP is estimated. For the sake of accuracy in reporting the results, in some cases the contributions of closely located regions are unified.

Figure 5.
Activit y o f editors on diff eren t da ys of the week, categorized in 4 sub categories. There ar e tw o ma jor catego r ies of week end, (b)-(d) and weekdays (a) activit y . Please note that WPs in (d) are languages sp ok en mostly in Muslim countries, which have either T hursdayF r ida y w eekend (Saudi Ara bia,Oman and Y emen), or F ridaySaturda y week end (Algeria, Ba hrain, Egy pt , Iraq, Jordan, Kuw a it, Liby a, Qa tar, Sudan,Syria, United Arab Emirates), or , SaturdaySunda y week end (Moro cco, T unisia). In Iran and Afghanis t an, Persian sp ok en c o un tr ies, only F riday is co nsidered a s week end. 18 T able 1. Wikip edias Statisti cs WP Language M. Country Spea k ers Articles Edits Use rs Activ e IP% Offset S. D. ar Arabic ∗ 255 146 8 368 2433 12 +2 ‡ 2.5 bg Bulgaria n Bulgaria 9 115 4 89 848 14 2 4 .4 ca Catalan Andorra 7 337 7 85 1749 6 0 5.2 cs Czech CzechRep. 11 192 6 146 2329 9 1 4.3 da Danish Denmark 5 147 5 128 1364 10 1 4.3 de German Germany 128 1214 91 1205 245 19 16 1 5 en English - ∗ 600 3609 455 14340 151 549 10 -6 + 2.7 eo Espe r an to - † 2 143 3 49 466 6 +1 N 3.5 es Spanish - ∗ 460 748 48 1789 1564 7 26 +1 △ 4.4 fa Persian Iran 110 124 6 217 1890 5 3.5 2.7 fi Finnish Finland 5 266 10 175 20 60 16 2 4.8 fr F renc h F rance 172 1088 68 1037 165 46 14 1 4.5 he Hebrew Israel 5 116 11 140 2020 3 1 2 4.5 hu Hungarian Hungary 13 187 1 0 167 2055 6 1 5.3 id Indonesia n Indonesia 160 159 5 247 1881 9 8 2 .6 it Italian Italy 62 790 4 3 620 8279 18 1 5.6 ja J apanese Japa n 126 743 3 7 510 10571 37 9 4.8 ko Kor ean Korea 67 159 7 147 1916 14 9 3.4 lt Lithuanian Lith uania 3 13 1 3 4 6 497 7 2 4.6 nl Dutc h Netherla nds 20 6 8 1 25 38 1 5125 10 1 5.1 no No r w egian No rw a y 4 297 9 194 2413 8 1 4 .7 pl Polish Poland 44 793 2 7 425 5403 14 1 5.2 pt P ortuguese - ∗ 230 680 2 5 852 5770 20 0 § 4.2 ro Romanian Romania 27 158 5 181 1255 7 2 3 .6 ru Russian Russia 277 699 3 5 652 12841 16 3 4.3 simple sim.E ng lish - ∗ - 69 2 176 746 1 3 -6 + 2.3 sk 
Slov ak Slov akia 5 122 3 58 612 7 1 3.7 sl Slov enian Slo v enia 1 109 2 78 579 4 1 3.8 sr Serbian Serbia 11 141 4 81 672 5 1 3.5 sv Sw edish Sweden 9 392 14 221 3467 1 4 1 4.5 tr T urkish T ur k ey 75 1 58 9 337 2499 24 2 4.5 uk Ukr ainian Ukraine 37 274 6 9 9 1929 5 2 4 vi Vietnamese Vietnam 86 20 1 4 223 115 6 8 7 2.7 zh Chinese China 1300 351 1 6 984 5696 13 8 3.7 ∗ F or th e languages which are widely sp oke n in the w orld, the origin country is n ot wel l-defined. † Esperanto has never b een an official language of any country . ‡ Egypt (th e most p opulated A rab coun try) t i me zone. + USA Cen tral standard time zone. N Cen tral Europ ea n time zone. △ Spain time zone. § P ortugal time zone. Statistics a bout WPs under inv estigation. Name of the WP , language, the most p opulated country , in which the language is s p oken, and total num b er of spea k er s in the w orld (millions) a re r eported in columns 1 to 4, followed by num ber of articles (thousands) in the WP , num b er of edits (millions), nu m ber of user s (thousands), n um ber of active user s (users whic h hav e edited in the last month) , and the p ercent age of edits by unreg istered users (kno wn by their IP-addresses ) to the a ll edits. Two last columns consis t of the assigned UTC offset to ea c h WP and the Sleep Depth respectively . The demographic data is taken from Wikip edia and supp osed to give an impr ession to the reader . In the pap er, there is not an y analysis ba sed on this data.
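
The decomposition described in the legend of Figure 3 can be illustrated with a small sketch. The Python example below is not the authors' optimization procedure; it is a minimal illustration, under stated assumptions, of how a daily activity curve could be modeled as a linear superposition of circularly shifted copies of a standard curve, with the weights obtained by ordinary (unconstrained) least squares. The 24-hour profile, the three shifts and the weights are made-up values, and all function names are hypothetical.

```python
def shift_curve(curve, hours):
    """Circularly shift a daily curve by a whole number of hours (positive = later)."""
    n = len(curve)
    return [curve[(i - hours) % n] for i in range(n)]


def superpose(standard, shifts, weights):
    """Weighted sum of shifted copies of the standard curve."""
    total = [0.0] * len(standard)
    for shift, weight in zip(shifts, weights):
        for i, value in enumerate(shift_curve(standard, shift)):
            total[i] += weight * value
    return total


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_weights(observed, standard, shifts):
    """Least-squares weights of the shifted standard curves (normal equations)."""
    basis = [shift_curve(standard, s) for s in shifts]
    gram = [[sum(p * q for p, q in zip(bi, bj)) for bj in basis] for bi in basis]
    rhs = [sum(p * o for p, o in zip(bi, observed)) for bi in basis]
    return solve_linear(gram, rhs)


# Made-up hourly profile with a night-time minimum (NOT data from the paper).
raw = [3, 2, 1, 1, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 8, 9, 10, 9, 7, 5]
standard = [v / sum(raw) for v in raw]  # normalized so that the curve sums to 1

# Hypothetical shifts (hours) and contribution shares, chosen for illustration.
shifts = [0, 3, 8]
true_weights = [0.5, 0.3, 0.2]

observed = superpose(standard, shifts, true_weights)  # synthetic "empirical" curve
fitted = fit_weights(observed, standard, shifts)      # recover the shares
```

Because the synthetic curve is noise-free, the fitted weights reproduce the assumed shares up to rounding error. With real data one would probably want to constrain the weights to be non-negative and use one shift per candidate time zone, which this sketch does not attempt.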