Do Cascades Recur?

Justin Cheng¹, Lada A Adamic², Jon Kleinberg³, Jure Leskovec⁴
¹,⁴ Stanford University, ² Facebook, ³ Cornell University
¹,⁴ {jcccf, jure}@cs.stanford.edu, ² ladamic@fb.com, ³ kleinber@cs.cornell.edu

ABSTRACT
Cascades of information-sharing are a primary mechanism by which content reaches its audience on social media, and an active line of research has studied how such cascades, which form as content is reshared from person to person, develop and subside. In this paper, we perform a large-scale analysis of cascades on Facebook over significantly longer time scales, and find that a more complex picture emerges, in which many large cascades recur, exhibiting multiple bursts of popularity with periods of quiescence in between. We characterize recurrence by measuring the time elapsed between bursts, their overlap and proximity in the social network, and the diversity in the demographics of individuals participating in each peak. We discover that content virality, as revealed by its initial popularity, is a main driver of recurrence, with the availability of multiple copies of that content helping to spark new bursts. Still, beyond a certain popularity of content, the rate of recurrence drops as cascades start exhausting the population of interested individuals. We reproduce these observed patterns in a simple model of content recurrence simulated on a real social network. Using only characteristics of a cascade's initial burst, we demonstrate strong performance in predicting whether it will recur in the future.

Keywords: Cascade prediction; content recurrence; information diffusion; memes; virality.

1. INTRODUCTION
In many online social networks, people share content in the form of photos, videos, and links with one another. As others reshare this content with their friends or followers in turn, cascades of resharing can develop [14].
Substantial previous work has studied the formation of such information cascades with the aim of characterizing and predicting their growth [7, 23, 47]. Cascades tend to be bursty, with a spike of activity occurring within a few days of the content's introduction into the network [34, 37]. This property forms the backdrop to a line of temporal analyses that focus on the basic rising-and-falling pattern that characterizes the initial onset of a cascade [2, 10, 36, 48].

However, the temporal patterns exhibited by cascades over significantly longer time scales are largely unexplored. Do successful cascades display a long monotonic decline after their initial peak, or do they exhibit more complex behavior in which they can recur, experiencing renewed bursts of popularity long after their initial introduction? Anecdotally, many of us have experienced déjà vu when a friend shared content we had seen weeks or months ago, but it is not clear whether these are isolated occurrences or glimpses into a robust phenomenon. Resolving these basic distinctions in the long-time-scale behavior of cascades is crucial to understanding the longevity of content beyond its initial popularity, and points toward a more holistic view of how content spreads in a network.

Figure 1 (plot of # reshares per day, February to December, for an image meme whose text reads "How to be skinny: 1. Notice that your body is covered in skin. 2. Say 'Wow I'm skinny'. Congratulations you are now skinny"): An example of an image meme that has recurred, or resurfaced in popularity multiple times, sometimes as a continuation of the same copy, and sometimes as a new copy of the same meme (example copies are shown as thumbnails). This recurrence appears as multiple peaks in the plot of reshares as a function of time.

The present work: Cascade recurrence.
We perform a year-long large-scale analysis of cascades of public content on Facebook, measuring them over significantly longer time scales than previously investigated. Our first main finding is that recurrence is widespread in the temporal dynamics of large cascades. Among large cascades appearing in 2014, over half come back in one or more subsequent bursts. While reshare activity does peak and then drop to very low or even zero levels relatively soon after introduction, the same content can recur after a short or extended lull.

The prevalence of recurrence prompts several questions about how and why content recurs. Is more broadly or narrowly appealing content more likely to recur? Does a larger initial burst indicate a greater likelihood of recurrence, or does it inhibit subsequent bursts by exposing and thus satiating many people in the initial wave? Do different bursts of the same content spread in different parts of the network? Is the second burst a continuation of the initial cascade, or a fresh re-introduction of the content into the network? Does the media type of the reshared content matter (for example, whether it is a photo or a video)? Finally, how well can one combine such features to predict whether a piece of widely reshared content is likely to experience additional bursts in popularity later on?

Figure 2: (a) The diffusion cascade of the example meme from Figure 1 as it spreads over time, colored from red (early) to blue (late). Only reshares that prompted subsequent reshares are shown. (b) The cascade is made up of separately introduced copies of the same content; in this drawing of the cascade from (a), each copy is represented in a different color. (c) Sometimes, individual copies experience a resurgence in popularity; again we draw the cascade from (a), but now highlight a single resurgent copy in red with the spread of all other copies depicted in black. (d) A different network on the same set of users who took part in the cascade, showing friendship edges rather than reshare edges. These edges span reshares across copies and time, showing that multiple copies of the meme are not well-separated in the friendship network.

We motivate our discussion with an example of content recurrence. Figure 1 shows an image meme that first became popular on Facebook at the end of February 2014, and it depicts how the number of reshares of that meme changed over time. Here, while an initial burst in resharing activity is followed by a gradual decrease, this meme recurred, experiencing multiple resurgences in popularity, first in mid-March, then several times over the next few months. Perhaps surprisingly, there is little to no resharing between consecutive bursts. Additionally, multiple near-identical copies of this image meme, represented in different colors, are shared in the network. This distinction between different copies of the same content will prove important in our later analyses: when a user reshares content through the reshare mechanism provided by the site, the content continues onward as the same copy; in contrast, when a user reposts or re-uploads the same content and thus shares it afresh, this is a new copy.

Figure 2 sketches the diffusion cascade of this meme, or its propagation over edges in the social network. As shown in (a), bursts in activity are connected through the same large long-lived cascade and can be traced through the network, from the initial bursts in March (shown in red), to the smaller bursts nearer the end of 2014 (shown in blue).
In (b), where the same network is now colored according to the copy of the image being reshared, different copies of the same content appear at different times, sometimes corresponding to when bursts occur, suggesting that recurrence sometimes occurs from the introduction of new copies. However, recurrence may also occur as a continuation of a previous copy: the copy highlighted in red in (c) experiences an initial burst in March, but then resurfaces in popularity later in the year. Further, we see in (d) that friendship ties exist between even the earliest and latest reshares: the meme appears to be diffusing rapidly, but also revisits parts of the network through which it had earlier diffused.

While the meme in our example recurred several times, are such memes the exception or the norm? And if such memes are in fact typical, what are the bases for such robust patterns of recurrence? To answer these questions, we use a dataset of reshare activity of publicly viewable photos and videos on Facebook in 2014.

Characterizing recurrence. First, we develop a simple definition of a burst, corresponding informally to a spike in the number of reshares over time, that we can use to quantify when recurrence occurs (via multiple observed bursts), and when it does not (a single burst). We show that a significant volume of popularly reshared content recurs (59% of image memes and 33% of videos), and that recurring bursts tend to take place over a month apart from each other. Recurrence is itself relatively bursty: rarely do we observe long sustained periods of resharing.
Studying the temporal patterns of recurrence, user characteristics of the resharing population, and the network structure of cascades, we find that the recurrence of a piece of content is moderated to a large extent by its virality, or broadness of appeal: cascades with initial bursts of activity that are larger, last longer, and have a more diverse population of resharers are more likely to recur. Nonetheless, it is not the cascades that start out the largest or most viral that recur, but those that are moderately appealing. Specifically, a moderate number of initial reshares, as well as a moderate amount of homophily (or diversity) in the initial resharing population, is correlated with higher rates of recurrence. This lies in contrast to more appealing (or popular) content, where one is likely to see a single large outbreak which results in a large single burst, as well as less appealing content, where one is likely to only see a single small outbreak and thus a smaller single burst. In the former case, we show evidence that a large initial burst inhibits subsequent recurrence by effectively "immunizing" a large proportion of the susceptible population.

While individual copies of content already recur in the network (18% for image memes and 30% for videos), the presence of multiple copies catalyzes recurrence, allowing that content to spread rapidly to different parts of the network, significantly boosting the rate of recurrence. To a smaller extent, the principle of homophily, suggesting that people are more likely to share content received from users similar to themselves, also plays a role in recurrence, with user similarity positively correlated with the rate of spreading.

Modeling recurrence.
Motivated by the above picture of recurrence, and inspired by classic epidemiological models of diffusion [39] and disease recurrence [3, 25, 40], we present a simple model of cascading behavior that is primarily driven by content virality and the availability of multiple copies, and is able to reproduce the observed recurrence features. In a simulation of this model, introducing multiple copies of the same content into the network can cause independent cascades that peak at different times and, in aggregate, are observed as recurring. As the virality of the content increases, the shape of a plot of overall reshares in the network over time transforms from a shorter independent single burst, to multiple bursts of differing sizes, to a single large burst of a longer duration. Replicating our previous findings, increasing virality increases recurrence, up to a point: once a meme has exposed a large part of the network, further recurrence is inhibited.

Predicting recurrence. Finally, we show how temporal, network, demographic, and multiple-copy features may be used to predict whether a cascade will recur, whether the recurrence will be smaller or larger than the original burst, and when the recurrence occurs. We demonstrate strong performance in predicting whether the same content will recur after observing its initial burst of popularity (ROC AUC=0.89 for image memes), as well as in predicting the relative size of the resulting burst (0.78). The time of recurrence, on the other hand, appears to be more unpredictable (0.58). Features relating to content virality and multiple copies perform best. Though multiple-copy features account for significant performance in predicting the recurrence of content, we obtain similarly strong performance (0.88) when predicting the recurrence of an individual copy of a piece of content.
Together, these results not only provide the first large-scale study of content recurrence in social media, but also begin to suggest some of the factors that underpin the process of recurrence.

2. TECHNICAL PRELIMINARIES
Studying cascade recurrence requires both sufficiently rich data that accurately measures activity throughout a network over long periods of time, as well as a robust definition of what recurrence is.

2.1 Dataset Description
In this paper, we use over a year of sharing data from Facebook. All data was de-identified and analyzed in aggregate. Facebook presents a particularly rich ecosystem of users and pages (entities that can represent organizations or brands) sharing a large amount of content over long periods of time.

Reliably measuring the spread of content in a network over time is challenging because multiple copies of the same content may exist at any time. As we will later show, the presence of multiple copies in a cascade is an important catalyst for recurrence. On Facebook, users and pages may introduce a new copy of the same content by re-posting or re-uploading it; resharing an existing copy instead creates an attribution back to that same copy. Content may be reintroduced, instead of reshared, for various reasons: multiple users may have independently discovered the same content, or downloaded and then re-uploaded an image.

To construct a dataset of popularly shared content, we initially selected a seed set of reshared content uploaded to Facebook in March 2014. We selected the top 200,000 most reshared images, which were publicly viewable, counting only reshares within the 180 days since the image was uploaded, then used a neural network classifier [28] to identify images with overlaid text (i.e., image memes).
One advantage of studying image memes in particular is that the information that these memes transmit is unlikely to change, as opposed to unembellished images which may be used differently (e.g., if the same photo is used to support separate causes).

Next, we tried to identify other copies of content that exist in this seed set. Beyond exact copies of the same image, many near-identical images, which have slightly different dimensions or introduce compression artifacts or borders, also exist (as seen in Figure 1). As such, a binary k-means algorithm [21] was used to identify clusters of near-identical images to which each of these candidates belonged, including images beyond the original set. For each cluster, we then obtained all reshares of images in that cluster that were made in 2014. To verify the quality of the clustering, we manually examined the top 100 most-reshared copies in each of 100 randomly sampled clusters. In 94 clusters, all 100 copies were near-identical. The remaining clusters mainly comprised the same image overlaid with different text.

Figure 3 (schematic of # reshares over time t, annotated with h, w, the mean r̄, peaks p_0, p_1 and bursts b_0, b_1): Recurrence occurs when we observe multiple peaks (p_0, p_1, red crosses) in the number of reshares over time. Bursts (b_0, b_1) capture the activity around each peak.

Figure 4 (panels: Recurring Cascades, Non-Recurring): Examples of time series of recurring and non-recurring cascades over a year, colored by copy. Identified peaks are marked with red crosses; the number of reshares is normalized per cascade.

This sample of resharing activity in 2014 that we use consists of 395,240,736 users and pages that made 5,167,835,292 reshares of 105,198,380 images. These images were aggregated into 76,301 clusters. Repeating the process above for videos shared on Facebook, we obtain a sample comprising 323,361,625 users and pages that made 2,187,047,135 reshares of 6,748,622 videos, aggregated into 156,145 clusters.
Images, videos, users, and pages that were deleted were excluded from analysis. On average, each image cluster is made up of 1379 copies of the same content. Video clusters were smaller, with 43 copies in each cluster on average.

As we only measured reshares for a year, we may only be observing part of a cascade's spread if it began prior to 2014. Thus, we also considered subsets of each dataset containing only clusters that began in 2014. We identified these subsets by additionally measuring reshares of content in the three months prior to 2014 (October to December 2013) and excluding clusters where activity was observed during this period.

Though we mainly analyze recurrence at the cluster level, we also investigate the recurrence of individual copies by studying the top 100,000 individually most reshared copies in each dataset.

2.2 Defining Recurrence
In this work, we define recurrence relative to peaks and bursts in popularity over time. In practice, almost all popular content on Facebook experiences at least one peak in popularity. If content peaks in popularity more than once, we say that it recurs.

Table 1: Recurrence occurs in a large proportion of popular image memes and videos shared on Facebook. We note in parentheses statistics computed on all cascades, as opposed to the cascades that began in 2014 whose initial spread we can observe.

              # Clusters          Copies/Cluster   Prop. Recurrence   # Peaks     Days Observed   Days betw. 1st/2nd Burst
Image Memes   51,415 (76,793)     523 (1378)       0.40 (0.59)        2.3 (4.6)   202 (280)       31 (32)
Videos        149,253 (156,145)   13 (43)          0.30 (0.33)        1.6 (2.0)   170 (182)       47 (44)

To identify these peaks, and thus whether a cascade recurs, we measure the number of reshares of content over time. Figure 4 shows several examples of recurring memes. Empirically, reshare activity is varied across different content but is generally bursty, with long periods of inactivity between peaks.
As recurrence occurs over a long amount of time, we discretize time into days.

Intuitively, a recurrence occurs when a peak is observed in the time series. Not only should these peaks be relative outliers on a timeline, but they should also last for a significant amount of time. Further, we should be able to tell these peaks apart from each other.

Motivated by this intuition, suppose we observe a meme for $t$ days. Let $r_i$, $i \in \{1, 2, \ldots, t\}$, be the number of reshares observed on day $i$. We parameterize recurrence using four variables: $h_0$, $m$, and $w$ place constraints on identified peaks, and $v$ places a constraint on the "valley" between peaks. Specifically, the height $h$ of each peak must be at least $h_0$ and at least $m$ times the mean reshares per day $\bar{r}$ (Figure 3). Additionally, a peak day must be a local maximum within $\pm w$ days. Finally, between any two adjacent peaks $p_i$ and $p_{i+1}$, the number of reshares must drop below $v \cdot \min\{r_{p_i}, r_{p_{i+1}}\}$. We call the area around the peak a burst ($b_0$, $b_1$ respectively for $p_0$, $p_1$ in Figure 3), whose duration or width $w$ is defined as the sum of the number of days the number of reshares is increasing before $p_i$ and falling after $p_i$, while remaining above $\bar{r}$. There is a one-to-one correspondence between peaks and bursts.

In practice, we set $h_0 = 10$, $m = 2$, $w = 7$, and $v = 0.5$ so that each burst is relatively well-defined. The red crosses in Figure 4 show the identified peaks under this regime. While this definition does not strictly minimize activity between bursts, empirically, activity does drop significantly (and in many cases, falls to zero) in between bursts. Stricter definitions that reduce the number of identified peaks (e.g., requiring a well-defined "valley floor" between two peaks, or increasing $h_0$ or $m$) also resulted in qualitatively similar findings.
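The peak definition above can be sketched in a few lines of Python. This is only an illustrative reading of the four constraints, not the code used for the analysis: the function names `find_peaks` and `recurs` are hypothetical, and the text does not say what happens when two candidate peaks fail the valley test, so the sketch assumes the taller candidate is kept.

```python
from statistics import mean


def find_peaks(reshares, h0=10, m=2, w=7, v=0.5):
    """Return the indices of peak days in a list of daily reshare counts."""
    if not reshares:
        return []
    r_bar = mean(reshares)
    min_height = max(h0, m * r_bar)  # height must be >= h0 and >= m * mean

    # Candidate peaks: tall enough, and a local maximum within +/- w days.
    candidates = []
    for i, r in enumerate(reshares):
        lo, hi = max(0, i - w), min(len(reshares), i + w + 1)
        if r >= min_height and r == max(reshares[lo:hi]):
            candidates.append(i)

    # Valley constraint: between adjacent peaks, reshares must drop below
    # v * min(height of the two peaks).
    peaks = []
    for i in candidates:
        if not peaks:
            peaks.append(i)
            continue
        prev = peaks[-1]
        valley = min(reshares[prev:i + 1])
        if valley < v * min(reshares[prev], reshares[i]):
            peaks.append(i)
        elif reshares[i] > reshares[prev]:
            # Assumption (not stated in the text): keep the taller candidate.
            peaks[-1] = i
    return peaks


def recurs(reshares, **params):
    """A cascade recurs if more than one peak is identified."""
    return len(find_peaks(reshares, **params)) > 1


# Two bursts separated by a month of inactivity.
series = [0] * 10 + [5, 20, 60, 30, 10] + [0] * 30 + [4, 15, 40, 12] + [0] * 10
print(find_peaks(series))  # [12, 47]
print(recurs(series))      # True
```

The defaults mirror the settings reported above ($h_0 = 10$, $m = 2$, $w = 7$, $v = 0.5$); burst widths are not computed here.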
The approach we take is fairly rudimentary; future work may involve developing more specific definitions of recurrence which take into account the shape of resulting bursts.

3. CHARACTERIZING RECURRENCE
We first introduce recurrence at a high level, showing that it is both common and bursty, with the same content sometimes resurfacing multiple times. We then discuss four important classes of observations that we later draw on to model and predict recurrence:
• Temporal patterns: cascades with longer initial bursts, but a moderate number of reshares, are more likely to recur.
• Sharer characteristics: recurring and non-recurring cascades differ in demographic makeup, and moderate diversity in the initial sharing population encourages recurrence. Further, changes in homophily in the network affect the speed at which content spreads, and hence burstiness.
• Network structure: bursts in a cascade occur in different, but nonetheless connected parts of the network. Also, large initial bursts tend to exhaust the supply of susceptible users, potentially accounting for why moderate, but not high cascade volume or diversity results in greater recurrence.
• Catalysts of recurrence: the availability of multiple copies in the network may catalyze recurrence. Still, neither does the presence of multiple copies suggest that recurrence is entirely an externally-driven phenomenon, nor is it a necessary condition for recurrence.

In the remainder of this paper, we report results primarily on image memes, and note any salient differences with videos. All differences reported are significant at $p < 10^{-10}$ using a $t$-test unless otherwise noted.

3.1 Recurrence is common
Once introduced on Facebook, popular content continues spreading for a long time. On average, the maximum time between reshares of the same content is 280 days.
But rather than being shared at a constant rate (among popularly reshared content, less than 1% of memes have no discernible peak), resharing tends to be bursty, with bursts typically separated by substantial periods of relative inactivity. A mean of 32 days separates the initial and subsequent bursts for image memes (Figure 5b).

Previously, we defined recurrence as observing multiple peaks in the number of reshares observed over time, and non-recurrence as observing only a single peak. Over these long periods of time, 59% of popular image memes recur. In fact, a significant proportion of these cascades experience resurgences in popularity (Figure 5a), and may even have experienced bursts prior to our observation window. If we limit the sample to the set of image memes which began spreading in 2014, 40% of these memes recur (Table 1).

3.2 Temporal patterns
Cascades with larger initial bursts of activity that last longer are more likely to recur, suggesting that more viral, or appealing, cascades are more likely to recur. However, it is not the most popular cascades that recur the most, but those that are only moderately popular: while recurrence initially increases with the size of the initial peak, it subsequently decreases.

Recurring cascades have larger, longer-lived initial bursts. The initial burst of a cascade is already indicative of recurrence. Recurring cascades start out larger (15,547 reshares) and initially last longer (9.3 days) than non-recurring cascades (6128 reshares, 6.9 days), spending more time "building up" (Figure 5c) and "winding down". The greater initial popularity of recurring cascades suggests that more viral cascades are more likely to recur, but is this the case?

Recurring content is moderately popular.
When we plot the number of bursts observed against the total number of reshares in the initial burst, the rate of recurrence neither monotonically increases nor monotonically decreases with the number of reshares; instead, we observe a striking interior maximum at approximately $10^5$ reshares for both image memes and videos (Figure 7a). Neither the initially best-performing (or most viral), nor poorest-performing (or least viral) cascades tend to resurface. In the former case, a single large burst tends to dominate with smaller bursts after; in the latter case, a small number of small bursts is typically observed.

Figure 5 (empirical CCDFs; panels: (a) Number of Peaks, by # bursts; (b) Days Between Bursts, by days between 1st/2nd burst; (c) Cascade Duration, by duration of initial burst in days, non-recurring vs. recurring): (a) 40% of cascades that began in 2014 came back, and (b) over 30% of recurring cascades only resurfaced after a month or more. (c) Further, the initial burst of a recurring cascade tends to last longer than that of a non-recurring cascade.

Figure 6 (panels: (a) Probability of Subsequent Bursts, by # bursts; (b) # Reshares in Each Burst, by burst index, one series per total # bursts from 1 to 10): (a) The probability of subsequent recurrences increases after the initial recurrence. (b) Cascades that recur less tend to have bursts that diminish in size over time, while those that recur more tend to have a stable burst size.

They keep coming back! While most of our analyses focus on the initial burst and subsequent recurrence, several general trends arise as more recurrence is observed:
• Once a cascade has recurred, it is more likely to resurface again. The probability of recurrence jumps from 0.40 initially, to 0.60 for subsequent recurrences before gradually decreasing (Figure 6a). This observation parallels prior work showing that the prior popularity of YouTube videos predicts their future popularity [11]. In fact, for 26% of all image meme cascades, we observe resharing activity on the first and last day of our observation period. These image memes may be "evergreen", tending to continuously recur.
• For cascades that recur less, subsequent bursts tend to be smaller; for cascades that recur more, subsequent bursts are more similar in size (Figure 6b), suggesting that they depend less on external factors (e.g., breaking news) to spread.
• Subsequent recurrences are briefer than their predecessors. Burst duration monotonically decreases from a mean of 7.6 days for the first burst to 6.3 for the tenth.
• On average, the lull between recurrences is substantial, with bursts happening 28 to 32 days apart for image memes, and 30 to 44 days apart for videos. Again, these long periods between bursts suggest that recurrence can only be observed over substantial periods of time.

3.3 Sharer characteristics
People who participate in recurring cascades differ significantly from those who participate in non-recurring cascades. While a diverse user population encourages recurrence, moderately diverse cascades recur the most. Homophily, the concept that similar people are likely to share the same content, also affects how quickly content spreads, suggesting that it modulates recurrence.

Demographics vary with recurrence. For recurring cascades, the average age of people participating in the initial burst is lower (40 vs. 42), but the proportion of women is higher (65% vs. 58%).
The latter observation corroborates previous work that showed a correlation with eventual cascade size [14].

Demographics also change across bursts. In the case of image meme cascades, the mean age changes by 2.7 years, and the proportion of women by 6.1 percentage points (in absolute terms). The same content may become popular in different parts of the world at different times, resulting in recurrence: 13% of the time, the majority of people in the initial two bursts come from different countries.

Diversity encourages recurrence. We now turn our attention to the diversity (or homophily) of people who take part in a cascade. We quantify homophily in the network by measuring the entropy of the distribution of demographic characteristics. A low entropy in the distribution of countries users are from (or country-entropy) corresponds to high homophily, suggesting that a majority of sharers belong to a small number of countries. On the other hand, a high country-entropy suggests that the countries sharers belong to are more diverse and distributed more evenly.

It is not a priori clear whether homophily encourages or inhibits recurrence. Homophily within a community, meaning that connected users are receptive to sharing the same content, may help a cascade gain the initial traction it needs to spread, but may also result in the content getting "trapped" in a local part of the network. In contrast, diversity in the users sharing that content suggests it has wider appeal and might come back, but may also result in only a single burst if the initial spread overwhelms the network.

We find that diversity in the country distribution is predictive of recurrence. Controlling for the duration ($w$), peak height ($h$), and the number of reshares in the initial bursts of recurring and non-recurring cascades [42], a Wilcoxon signed-rank test shows that a higher country-entropy is indicative of recurrence ($W > 10^8$, $p < 10^{-10}$, effect size $r = 0.19$).
Thus, if the initial burst of a cascade occurs in more countries, it is more likely to recur. Higher gender-entropy (i.e., greater gender-balance) also predicts recurrence, but its effect is weaker ($W > 10^8$, $p < 10^{-2}$, $r = 0.02$). The effect of age is inconsistent across image memes and videos.

Recurring content is moderately diverse. Again, it is not the most diverse populations that bring about recurrence: a moderate country-entropy of approximately 3.0 in the initial burst of a cascade results in the most recurrence (Figure 7b). An interior maximum can also be observed with respect to the gender-entropy of the initial burst (Figure 7c). These results, combined with the previous observation of a similar interior maximum with respect to the initial number of reshares, suggest that the virality of content plays a significant role in recurrence.

Figure 7 (panels: (a) # Bursts vs. # Reshares in Initial Burst; (b) # Bursts vs. Country-Entropy in Initial Burst; (c) # Bursts vs. Gender-Entropy in Initial Burst): (a) A moderate number of reshares results in more recurrence. (b), (c) Similarly, recurrence is more likely when the entropy of the distribution of users across countries, as well as gender, is moderate.

Cascades spread quickly in pockets of homophily. The virality of a cascade and homophily in the network are closely related, and perhaps represent two perspectives on the spread of content. Greater virality enables content to appeal to a larger population; more homophily suggests that receptive users are closer in the network. In fact, homophily in the network modulates the speed of resharing (and thus bursts in a cascade).
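As a concrete illustration of the country-entropy and gender-entropy measures used above, the following is a minimal sketch of the Shannon entropy of a list of categorical labels. The function name `entropy` is hypothetical, and the base-2 logarithm is an assumption: the text does not state the base, although a gender-entropy axis that tops out at 1.0 in Figure 7c would be consistent with bits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits, an assumed unit) of a list of labels.

    Low values mean sharers are concentrated in few categories (high
    homophily); high values mean they are spread evenly over many.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# All sharers from one country: no diversity.
print(entropy(["US", "US", "US", "US"]))
# Sharers spread evenly over eight countries: 3 bits.
print(entropy(["US", "BR", "IN", "MX", "PH", "ID", "GB", "TR"]))
```

Under this reading, a country-entropy of about 3.0 corresponds to a sharing population as diverse as one spread evenly over eight countries.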
If we measure the average country-entropy of a sliding window of 100 reshares ordered in time and the time elapsed, and then compute the average correlation, we find a slight positive correlation between the two (0.08), suggesting that homophily among those sharing results in faster resharing, and hence burstiness. Gender-entropy and age-entropy are also positively, but more weakly, correlated with burstiness (0.06 and 0.04 respectively). One potential cause of this is that pages, with their substantial and homophilous followings, are driving resharing. The more pages that share content, the more homophilous the set of potentially reachable people, and thus the quicker content is reshared. However, as the proportion of reshares attributable to pages increases, the entropy in these demographic characteristics instead increases, an effect opposite to what we observe.

3.4 Network structure
The initial bursts of recurring cascades tend to be better connected. Further, successful recurrence tends to occur in different, but not disconnected parts of the network. Considering the people potentially exposed in each burst beyond the resharers, large initial cascades may exhaust the population of susceptible people in the network, a fact that will subsequently become important in explaining the mechanism of recurrence.

Bursts of recurring cascades are internally more connected. More people and pages share in the initial burst of recurring cascades than non-recurring cascades (15,050 users and 59 pages, and 5855 users and 24 pages respectively). To measure connectivity within a burst, we used the induced subgraph $G_0$ of the Facebook network made up of the people and pages resharing in the initial burst. The subgraph includes two kinds of edges: friend edges between people, and follow edges between people and the pages they like.
On this subgraph, people in the initial burst of recurring cascades have an average of 3.4 connections to other people and pages in the same burst, relative to 3.1 connections in non-recurring cascades, suggesting that the initial bursts of recurring cascades are slightly more connected.

Subsequent bursts happen in different parts of the network. Bursts of a cascade are separate in time, but may overlap in terms of sharers, or be connected via friend and follow edges. To start, the sharer overlap between bursts is small. For recurring cascades, an average of 15,050 people and 59 pages make up the initial burst, and 8,892 people and 28 pages the subsequent burst. Comparing people and pages across these bursts, Jaccard similarities of 0.02 (i.e., 2% of people share the same content in both bursts) and 0.03 mean that bursts have very little direct overlap.

We also find evidence of community structure within bursts by considering whether the second burst is proximate in the network to the first. Even if individuals in the second burst do not repost content a second time, they may not be far removed in the social network from someone in the first burst. Here, we instead consider the induced subgraph $G_{0+1}$ of the individuals and pages who reshared in the initial two bursts of each cascade. If these bursts correspond to communities within the network, we would expect more edges within bursts than between them. An average of 17,457 friend and 17,242 follower edges exist in the first burst; 10,094 friend and 6,310 follower edges exist in the second. Note that these communities are very sparse. A person within a burst has a connection to an average of 3.2 friends or pages within the same burst, indicating that memes tend to diffuse out through the network rather than stay within a narrow community.
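The within-burst connectivity figures above are averages over an induced subgraph. A rough sketch of how such quantities could be computed follows; it assumes an undirected edge list without duplicate edges and two bursts with disjoint sharers, and the function names, node names, and toy graph are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def mean_connections_within(edges, burst_nodes):
    """Average number of neighbors each burst member has inside the burst.

    Only edges with both endpoints in `burst_nodes` are counted, which is
    what restricting to the induced subgraph means.
    """
    burst = set(burst_nodes)
    if not burst:
        return 0.0
    degree = defaultdict(int)
    for u, v in edges:
        if u != v and u in burst and v in burst:
            degree[u] += 1
            degree[v] += 1
    return sum(degree[n] for n in burst) / len(burst)


def count_edges(edges, first_burst, second_burst):
    """Return (edges inside burst 1, edges inside burst 2, edges across).

    Assumes the two bursts share no members (a simplification).
    """
    first, second = set(first_burst), set(second_burst)
    within_first = within_second = across = 0
    for u, v in edges:
        if u in first and v in first:
            within_first += 1
        elif u in second and v in second:
            within_second += 1
        elif (u in first and v in second) or (u in second and v in first):
            across += 1
    return within_first, within_second, across


# Hypothetical toy network: friend edges plus one follow edge to a page.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "page1")]
toy_first = {"a", "b", "c", "page1"}
toy_second = {"d", "e"}

print(mean_connections_within(toy_edges, toy_first))
print(count_edges(toy_edges, toy_first, toy_second))
```

Comparing the within-burst counts to the across-burst count is the kind of check that the community-structure argument relies on: more edges inside bursts than between them would point to bursts occupying distinct regions of the network.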
Still, across bursts, we observe an average of 8,273 friend and 4,755 follower edges, resulting in an average of 1.4 connections to friends and pages in a different burst, indicating that these bursts are somewhat separated.

Large initial bursts exhaust the supply of susceptible people. As noted above, the second burst in a cascade has fewer sharers. Intuitively, people may tire of content they have seen before, but is this the case? Studying overlap a third way, we look at the susceptible populations of the initial and subsequent bursts of a cascade. We approximate these populations by considering people who could have been exposed through their connections (friend and follow edges) to those who shared content in these two bursts.

On average, 6.4 million unique individuals are potentially exposed in the initial burst; recurring cascades, which have a greater number of initial resharers, have greater reach (8.2 million vs. 4.0 million for non-recurring cascades). For recurring cascades, the potential reach of the second burst is smaller, but still sizable at 6.8 million. Comparing the sets of individuals exposed in the first two bursts of recurring cascades, we obtain a Jaccard similarity of 0.15, indicating that this second burst mostly reaches a different set of people, although some people will have seen the same content twice, creating a sense of déjà vu for many. In particular, 28% of people who are potentially exposed in the second peak would have also been exposed in the first. This high overlap could be a contributing factor to the second peak being smaller. Further, the proportion of individuals exposed in the second peak who were also exposed in the first is positively correlated ($\rho = 0.39$) with the size of the first peak, meaning that the larger the initial peak, the more likely it is that those exposed in the second peak have previously seen the content.
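The two overlap measures used above (Jaccard similarity between exposed populations, and the share of the second burst's audience that was already exposed in the first) can be sketched as follows. This is an illustrative sketch only: the function names and the toy populations are assumptions, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fraction_previously_exposed(first, second):
    """Share of the second burst's exposed population also exposed in the first."""
    first, second = set(first), set(second)
    return len(first & second) / len(second) if second else 0.0


# Hypothetical exposed populations, identified by integer user ids.
exposed_first = range(0, 100)     # 100 people reached by the initial burst
exposed_second = range(80, 130)   # 50 people reached by the second burst

print(jaccard(exposed_first, exposed_second))
print(fraction_previously_exposed(exposed_first, exposed_second))
```

As a rough consistency check, treating the rounded averages reported above (8.2 million and 6.8 million) as if they described a single cascade, a Jaccard similarity of 0.15 would imply an overlap $I$ satisfying $I / (8.2 + 6.8 - I) = 0.15$, i.e. $I \approx 1.96$ million, or roughly 29% of the second burst's reach. That is in line with the reported 28%, though averages of ratios need not equal ratios of averages.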
In other words, in the case of large initial bursts, subsequent bursts are likely reaching a similar part of the network.

3.5 Catalyzing recurrence

As shown in Figures 1 and 2, cascades are made up of reshares of multiple copies of the same content, and the presence of these copies can help catalyze recurrence. Still, copies are not the only cause of recurrence (recurrence is substantial even with a single copy), nor must they be independently or externally introduced (many later copies are attributable to previously seen copies).

Cascades whose reshares are divided across multiple copies tend to recur. Recurring cascades are made up of more copies than non-recurring cascades (2,277 vs. 93). Reshares are also more spread out across multiple copies in the former case (841 and 3,445 reshares per copy for recurring and non-recurring cascades respectively), suggesting recurrence may be characterized by multiple smaller outbreaks. The most reshared copy accounts for 72% of reshares in the initial burst for recurring cascades, and 93% for non-recurring cascades. Altogether, the substantial differences here suggest the strong predictive power of these characteristics.

The appearance of new copies correlates with recurrence. Further, the introduction of new copies and the number of reshares over time are significantly correlated (Pearson's $r = 0.66$), suggesting that the appearance of new copies causes bursts, and thus recurrence. On a related note, prior work showed that reposting content helps make it popular [44].

Copies are not the only cause of recurrence. Nonetheless, not all copies burst (only 6% are reshared at least 10 times on any single day), and not all bursts are caused by new copies, as we will later show.
And while correlations between the number of copies and other characteristics such as duration and country-entropy also exist, when we control for the number of copies in the initial bursts of recurring and non-recurring cascades [42], all previously observed differences in the temporal, sharer, and network characteristics of these cascades still hold ($W > 10^8$, $p < 10^{-10}$, mean effect size $r = 0.08$). Comparing recurring and non-recurring cascades with similar numbers of copies in their initial bursts, the initial bursts of recurring cascades are still larger, longer-lived, and more diverse. In all, this suggests that recurrence is not simply caused by distinct copies of the same content spreading through the network, but is a result of a more complex phenomenon which we explain in Section 4.

A majority of copies are internal to the network. Still, where do these copies come from, and are they internal or external to the network? By using the network to identify friends and pages who may have previously shared a different copy of some content, we can attribute 75% of newly uploaded copies to previously seen copies in the network (this approach roughly estimates content-copying that occurs within Facebook, as users who share a new copy may not have seen a friend's shared copy). This suggests a nuanced approach to studying recurrence: external sources may drive some of the introduction of new copies to a social system, but a large proportion of activity, which we can study, occurs within the network.

Pages may also catalyze recurrence. Pages are responsible for a large proportion of highly-reshared copies (over 70% of reshares are attributable to page-created copies in the second burst of recurring cascades). In recurring cascades, pages tend to re-upload, rather than reshare content, doing so 50% of the time, as opposed to 2% for users. Further, the most popular copy in the second burst is likely to have been created by a page (70%).
Given the relatively higher degree of pages, which tend to have tens of thousands of followers, as opposed to users who typically only have hundreds of friends, pages may spark recurrence by posting a new copy of the same content, rapidly exposing a number of followers to it.

Individual copies recur too! Recurrence of the individually most popular copies in our datasets, while lower than when copies are studied in clusters, is still substantial (18%). These individual copies last a significant amount of time (261 days), with bursts further apart (41 days). Like cascades of multiple copies, the initial bursts of recurring individual-copy cascades are larger and longer-lived than those of non-recurring cascades, with later bursts occurring in different parts of the network. Recurrence of the same copy can also be observed within clusters: 22% of the time, the most reshared copy in a burst was also most reshared in a previous burst.

4. MODELING RECURRENCE

Tying our observations together, we present an overall picture of the mechanisms of recurrence, then suggest a model of recurrence which we evaluate through simulations on a real social network.

4.1 Why do cascades recur?

Our findings as a whole suggest a model of recurrence where virality is a primary factor, and where the availability of multiple copies can help spark recurrence.

Virality plays a primary role in recurrence. Virality, or broadness of appeal, affects recurrence: cascades with initial bursts that are larger, last longer, and are more demographically diverse are more likely to recur. Specifically, moderately popular and diverse cascades are most likely to recur. While recurrence typically occurs in different parts of the network, the larger the initial burst of a cascade, the larger the proportion of the potentially exposed population in the subsequent burst that was already previously exposed. This observation, coupled with the fact that users tend not to reshare the same content multiple times, suggests that large initial bursts inactivate a significant portion of the network, inhibiting a cascade's future spread. Our subsequent simulations show more clearly that this may indeed happen as the initial burst grows large.

Multiple copies in the network help spark recurrence. Bursts in a cascade are separated by relatively long periods of inactivity. By studying the availability of multiple copies of the same content, we find that these copies can act as catalysts for recurrence in different parts of the network. Indeed, multiple introductions of the same content correlate with recurrence. However, while more copies initially increase the chance of recurrence, they are not the only cause of it; recurring and non-recurring cascades with similar numbers of copies differ significantly in virality. Moreover, multiple copies do not explain the substantial recurrence of individual copies. To a lesser extent, we also discover that homophily in the network affects the speed of the spread of a cascade in a network.

Together, moderate content virality and the presence of multiple copies result in recurrence. While the likelihood of recurrence does increase with the number of copies (or potential "sparks"), we can still observe an interior maximum in how recurrence varies with the number of reshares after fixing the number of copies, where a moderate number of reshares results in the most recurrence.

4.2 A simple model of recurrence

Motivated by these findings, we suggest a simple model of cascading behavior where recurrence depends on content virality:
• If the virality of a cascade is low, it may only appeal to a small group of people, and is thus unable to spread far in the network. Thus, a single, small peak results, with many attempts to propagate in the network failing (Figure 8a).
• As virality increases, the cascade is able to spread substantially further in the network, and may occasionally even jump to other local communities in the network, spreading faster within them. As several bursts occur in the network, they may be observed as recurring in aggregate (Figure 8b).
• However, as virality increases beyond some threshold, any individual burst is likely to spread through a large portion of the susceptible population, inhibiting the transmission of subsequent copies (Figure 8c).

This last point lies in contrast to the trivial hypothesis that more independent copies lead to more independent bursts that aggregate to form a single large burst, which does not appear to be the case, as most reshares in initial bursts can be attributed to a single copy.

[Figure 8 panels: (a) Low Virality; (b) Moderate Virality; (c) High Virality. Each panel plots # Reshares against time t, showing the overall series and individual bursts.]
Figure 8: When virality is low, only a small number of attempts at infection succeed. When virality is moderate, more attempts succeed, which aggregate into observable recurrence. When virality is high, rather than a large number of bursts aggregating to form a single large peak, the first successful burst infects a large portion of the network, making it difficult for other copies to spread.

4.3 Simulating recurrence

To see if such a model of recurrence can reproduce characteristics of recurrence observed in the data, we now simulate recurrence on a real social network. Our observations and model suggest the use of an SIR model, where nodes in a network are initially susceptible (S) to a contagion, and then may become infected (I) when exposed. Infected nodes subsequently recover (R) and become resistant to the contagion. These models have been used to study the spread of disease [6] and information [9, 13, 39] in a network.

Setup.
Our simulation thus consists of an SIR model with multiple outbreaks introduced at different times, and with resistant nodes re-infectable at a lower rate. We parameterize our model as follows. For a given contagion $c$, its virality, or equivalently, the susceptibility of every node in the network, is $p_{c0}$. In other words, if exposed to the contagion, the probability that the node will be infected is $p_{c0}$. Infected nodes attempt to infect all neighbors in the subsequent time step, and then become resistant. As users sometimes share the same content multiple times, resistant nodes have a constant lower probability $p_{c1} < p_{c0}$ of being re-infected. The introduction time of each copy of a contagion is normally distributed ($\mathcal{N}(\mu, \sigma)$). Here, we make a simplifying assumption that independent copies of the same content are introduced into the network at different points in time. Following the intuition that more connected entities (e.g., pages) are likely to start outbreaks, the target nodes to infect are sampled, with replacement, proportional to the node's degree. $m$ copies are introduced in total.

We simulate this model for 1000 discrete time steps with $\mu = 500$, $\sigma = 250$, and $m = 50$, varying $p_{c0}$ between $5 \times 10^{-4}$ and $10^{-3}$, and where $p_{c1} = 0.5 \cdot p_{c0}$. We run our simulation on the network of a country with approximately 1.4 million nodes and 160 million friendship edges, repeating the simulation 5000 times. We measure the total number of infections (or reshares) in each time step, and identify bursts as defined in Section 2.2.

Results. Within certain ranges of virality ($6 \times 10^{-4} \le p_{c0} < 8 \times 10^{-4}$), we can consistently reproduce recurring cascades. Figure 9a shows several examples of the time series of these simulations. In aggregate, we can obtain a distribution of number of peaks similar in shape to Figure 5.
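The simulation setup described above can be sketched in a few lines. The following is a toy sketch under stated assumptions, not the authors' simulator: the function names, the ring graph standing in for the real network, the clipping of introduction times to the simulated horizon, and the choice to infect seeded nodes outright are all assumptions made for illustration, and the probabilities in the usage lines are far larger than those in the paper because the toy graph is tiny.

```python
import random


def ring_graph(n):
    """Hypothetical stand-in network: node i is linked to i-1 and i+1."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_recurrence(neighbors, p0, p1, m, steps, mu, sigma, seed=0):
    """SIR-style spread with m copies introduced at normally distributed times.

    neighbors: dict mapping each node to a list of its neighbors.
    p0: infection probability for susceptible nodes (virality).
    p1: lower re-infection probability for resistant nodes.
    Returns the number of new infections ("reshares") at each time step.
    """
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    degrees = [len(neighbors[n]) for n in nodes]

    # Schedule the m introductions; targets are drawn with replacement,
    # proportional to degree. Times are clipped to [0, steps - 1].
    introductions = {}
    for _ in range(m):
        t = min(steps - 1, max(0, int(round(rng.gauss(mu, sigma)))))
        target = rng.choices(nodes, weights=degrees, k=1)[0]
        introductions.setdefault(t, set()).add(target)

    infected, resistant, series = set(), set(), []
    for t in range(steps):
        newly = set(introductions.get(t, ()))  # seeded copies start infected
        for u in sorted(infected):             # nodes infected in the last step
            for v in neighbors[u]:
                if v in infected or v in newly:
                    continue
                p = p1 if v in resistant else p0
                if rng.random() < p:
                    newly.add(v)
        resistant |= infected                  # last step's nodes now resist
        infected = newly
        series.append(len(newly))
    return series


# Toy run with hypothetical parameters; mirrors p1 = 0.5 * p0 from the setup.
demo = simulate_recurrence(ring_graph(200), p0=0.6, p1=0.3,
                           m=5, steps=100, mu=50, sigma=25)
print(sum(demo), max(demo))
```

Bursts could then be identified in the returned series using the definition from Section 2.2. On a realistic network, one would expect the qualitative behavior to depend heavily on degree distribution and community structure, which a ring graph does not capture.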
Plotting the number of peaks against the number of reshares in the initial burst (or alternatively, $p_{c0}$), we observe an interior maximum: a moderate amount of virality results in the most recurrence (Figure 9b), replicating our previous findings.

When the virality of the contagion is high ($p_{c0} \ge 8 \times 10^{-4}$), a large fraction of the highly connected portion of the graph becomes infected by a single copy in the initial burst, suppressing subsequent bursts as many nodes are now resistant. To show this happening, we consider for each simulation, in addition to our original model, an alternate-universe setting where the resistances of nodes are reset following the initial burst. We can then measure how much the initial burst inhibited the second by observing the likelihood of a second burst in the alternate case, as well as the overlap of nodes infected in the second burst with nodes in the initial burst. A significant difference in the total number of peaks when virality is high (1.0 vs. 2.0, $t = 92$, $p < 10^{-10}$), but not when virality is low ($p_{c0} \le 7 \times 10^{-4}$, n.s.), suggests that the supply of susceptible nodes is indeed being used up in the former case, but not the latter. A significant positive correlation of initial peak size with the size of the overlap of the second peak in the alternate setting (0.76) further supports this hypothesis and our prior observations.

Likewise, the connectivity of the graph deteriorates significantly after a large initial burst ($p_{c0} \ge 8 \times 10^{-4}$). Here, we measure the algebraic connectivity [18] of the graph if all the nodes involved in the initial burst are removed, and compare this to a baseline that removes the same number of nodes at random. Connectivity is significantly lower in the former case (579 vs. 1065, $t > 17$, $p < 10^{-10}$), especially in comparison to the graph's initial connectivity (1105).
These results together suggest that under such a model of recurrence, a large initial burst does indeed inhibit subsequent bursts, as we previously hypothesized (Figure 8c). Also in support of our prior observations, increasing the number of introduced copies $m$ monotonically increases recurrence.

Limitations and alternatives. Importantly, our model assumes that recurrence is sparked primarily by independent copies introduced to the network. However, the reality of recurrence is subtler: individual copies recur significantly in the network, and homophily may also moderate recurrence. Allowing virality to vary with time [24] or having nodes wait according to a power-law distribution [17, 35] may also reproduce recurrence with only a single copy. Decision-based queuing processes [8] may also help model the long periods of inactivity between bursts.

[Figure 9 panels: (a) Simulated Time Series, for recurring and non-recurring cascades; (b) # Peaks vs. Reshares, Simulated (# Bursts against # Reshares in Initial Burst).]
Figure 9: (a) By varying content virality, a model of recurrence that assumes independent introductions of copies of the same content can simulate recurrence. (b) It also replicates the observation that a moderate number of reshares results in more recurrence.

5. PREDICTING RECURRENCE

Is it possible to predict if a cascade will resurface in the future? Observing just the initial burst of a cascade, we use features related to the temporality, network structure, user demographics, and presence of multiple copies to determine (a) whether recurrence occurs, (b) if the recurrence will be relatively smaller or larger, and (c) when the recurrence occurs. Overall, we find that cascades with longer initial bursts that consist of multiple small outbreaks tend to recur, supporting the hypothesis that content virality and multiple copies play a significant role in recurrence. Nonetheless, we obtain similarly strong performance predicting recurrence for individual copies of content. Predicting recurrence may enable us to better forecast content longevity in a network.

AUC on Feature Sets   Sec. 5.2      Sec. 5.3      Sec. 5.4
Temporal              0.74          0.76          0.55
+ Demographic         0.78 (0.63)   0.76 (0.58)   0.56 (0.52)
+ Network             0.81 (0.72)   0.77 (0.66)   0.57 (0.53)
+ Multiple-Copy       0.89 (0.82)   0.78 (0.70)   0.58 (0.54)

Table 2: We obtain strong performance in predicting whether recurrence occurs and if the subsequent burst will be smaller or larger, but not in predicting when recurrence occurs. Individual feature set performance is in parentheses. The column headers refer to Sections 5.2, 5.3, and 5.4 respectively.

5.1 Factors driving recurrence

Based on our observations, we develop several features that help predict recurrence, and group them into four categories:

Temporal features (7). Initially longer-lived bursts are suggestive of recurrence, motivating the importance of the number of days before and after the peak is reached, as well as the number of reshares before and after, and the height of the initial peak. The average gradient of the initial burst before and after the peak further characterizes the shape of the initial burst.

Demographic features (5). The differences in user characteristics and diversity we previously observed suggest the importance of age, gender, as well as the entropy in the distribution of age, gender and country of the initial burst.

Network features (6). Recurring cascades appear to be more connected in their initial bursts, having more friendship and follower edges, in addition to having a larger potentially exposed population. The number of users, pages, and proportion of pages in the initial burst also vary.

Multiple-copy features (8).
The availability of multiple copies plays a significant role in recurrence, motivating the use of the number of copies observed in the initial peak, the entropy in the distribution of reshares of each copy, the mean reshares per copy, and the proportion of reshares attributable to the most popular copy. Pages also play a role in recurrence, suggesting that the proportion of copies created by pages, the proportion of all reshares made by pages or attributable to page-created copies, and whether the most popular copy was created by a page are useful features.

5.2 Does it recur?

Prediction task. We formulate our prediction task as a binary classification problem: given only the initial burst of a cascade, we aim to predict if a second burst will be observed (i.e., if the cascade will recur). We use a balanced dataset of recurring and non-recurring cascades ($N = 40{,}912$ for image memes, 89,368 for videos) so that guessing results in a baseline accuracy of 0.5. Given the non-linear relation of several features to recurrence (e.g., that a moderate number of reshares results in the most recurrence), we use a random forest classifier. In all cases, we perform 10-fold cross-validation and report the classification accuracy, F1 score, and area under the ROC curve (AUC).

Results. Overall, we find strong performance in predicting recurrence (Accuracy=0.82, F1=0.81, AUC=0.89). A logistic regression classifier results in slightly worse performance (AUC=0.78). Table 2 shows how performance improves as features are added to the model, as well as individual feature set performance. While multiple-copy features perform best, temporal and network features, and to a lesser extent demographic features, also individually exhibit robust performance, suggesting that each significantly contributes to recurrence.
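The AUC values reported here can be interpreted as the probability that a randomly chosen recurring cascade receives a higher classifier score than a randomly chosen non-recurring one. The sketch below illustrates that pairwise definition; it is not the authors' evaluation code, and the function name, labels, and scores are hypothetical. The quadratic loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for recurring (positive), 0 for non-recurring (negative).
    scores: classifier scores, higher meaning "more likely to recur".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical balanced toy set: three recurring, three non-recurring cascades.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auc(toy_labels, toy_scores))
```

On a balanced dataset such as the one described above, a classifier that scores cascades at random would be expected to land near 0.5 under this measure, which is why 0.5 serves as the natural reference point for the reported values.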
In the absence of strong multiple-copy features (fewer copies of any one video exist), we obtain worse performance in predicting the recurrence of videos (Acc=0.69, F1=0.66, AUC=0.76), with temporal features instead performing best.

For image meme cascades, the most predictive features of recurrence relate to cascades having multiple small outbreaks (fewer reshares per copy (0.78) and a higher entropy in the distribution of reshares across copies (0.72)), and longer initial bursts (more days before (0.63) and after (0.63) the peak). These features remain important for video cascades. Mirroring the dual importance of multiple-copy and temporal features, just the number of reshares per copy and the average gradient of the initial burst after its peak alone achieve strong performance (0.81). Though the initial burst of a recurring cascade is on average significantly larger, size-related features are weaker signals of recurrence ($\le 0.59$).

5.3 Will the recurrence be smaller/larger?

Prediction task. Assuming that we know that a cascade will recur, how much smaller or larger will the second burst be? Knowing the relative size of the next recurrence can differentiate bursty cascades that are rising or falling in popularity. Given the initial burst of a cascade, we aim to predict if the relative size of the second burst, or the ratio of the size of the second burst to that of the first, is above or below the median (0.28). As the median evenly divides the dataset, we again have a balanced binary classification task with a random guessing baseline accuracy of 0.5.

Results. We also find strong performance in predicting the relative size of the subsequent burst (Acc=0.72, F1=0.69, AUC=0.78 for image memes, AUC=0.85 for videos). Temporal features here outperform all other feature sets, with the most predictive features relating to the cascade having a long initial burst.

5.4 When does it recur?

Prediction task.
If a cascade will recur, when will we observe the next burst? With a cascade's initial burst, can we predict if the duration between bursts will be greater than the median (14 days)?

Results. We find that the timing of recurrence is far less predictable (Acc=0.56, F1=0.51, AUC=0.58 for image memes, AUC=0.60 for videos). Nevertheless, longer initial bursts are most indicative of recurrence happening earlier.

5.5 Predicting recurrence for individual copies

Given the correlation of the appearance of multiple copies with bursts, multiple-copy features perform strongest in predicting recurrence. But what if we want to predict recurrence of a single instance of some content, where multiple copies do not exist by definition?

Surprisingly, we obtain similarly strong performance in predicting the recurrence of individual copies ($N = 28{,}454$, Acc=0.80, F1=0.79, AUC=0.88 for image memes, AUC=0.82 for videos). Network features are strongest (AUC=0.84), with fewer edges between users and pages (0.68) in the initial peak the most predictive of recurrence. As individual copies have a single point of origin, fewer edges between pages and users and more edges between users (0.61) suggest that the burst may have resulted more from users sharing content from other users than from high-degree pages sharing that content with their followers. This observation, together with the fact that longer initial bursts continue to be strongly predictive of recurrence (>0.65), suggests the continued significance of virality with respect to individual copies.

The relative size of the subsequent burst is similarly predictable for individual copies (0.83 for image memes, 0.84 for videos), but interestingly, the time of recurrence is more predictable (0.68 and 0.63 respectively), which may be because any recurrence must be a continuation of the initial copy, as opposed to possibly being sparked by a new, less related copy.

6. RELATED WORK

Significant prior work has studied information diffusion in online social media [7, 14, 37]. With respect to memes, work has demonstrated the effect of meme similarity [16] and competition for limited attention on subsequent popularity [46]. Most relevant is previous work that looked at the temporal dynamics of diffusion and developed epidemiological models of recurrence.

Among work that aims to predict the future popularity of online content [12, 32, 44], one relevant line of research has involved modeling the temporal patterns of the diffusion of information in social media [2, 36, 49] or using these patterns to predict future popularity or forecast trends [4, 9, 10, 14, 15, 24]. Perhaps driven by the general shape of the initial burst of a cascade, many of these models implicitly assume that the temporal shape of a cascade consists primarily of a rising and falling period, and focus on modeling the initial activity around a peak [2, 36, 48] or the overall popularity discounting subsequent spikes [10]. Beyond the initial burst of activity, we studied the long-term temporal dynamics of content on Facebook over a year.

In prior work, when multiple bursts are observed in a time series, they tend to be of a topic or hashtag rather than an individual piece of content, and are commonly attributed to external stimuli [23, 29, 33, 38] (e.g., news related to that topic). While knowing about external events can help forecast the temporal pattern of the resulting spike [36], there has been little work on predicting whether new spikes will appear in the future without such knowledge. In particular, rumor recurrence is bursty, with or without external stimuli, and sometimes with embellishments and other mutations [1, 19, 31], but there is little understanding of this phenomenon. Patterns of human activity can also explain periodicity in popularity [5, 22, 34], but the vast majority of recurrence we observe in this paper is aperiodic.
While external stimuli explain some instances of recurrence, we discover other factors that influence recurrence. In contrast to most work that has observed multiple bursts in topics, we observed recurrence even at the level of an individual copy.

Finally, substantial work has studied how bursts in streams or time series can be detected [26, 27, 41]. In this paper, we adopted a simple definition of burstiness, parameterizing peaks and bursts relative to the mean activity observed.

Recurrence has also been studied in the context of epidemiology, though primarily from a modeling perspective. Many base their analysis on SIR models [39], simulating recurrence through introducing dormant periods [25], seasonality effects [3], or changes in contagion fitness [20], which may be periodic [40]. More recently, some work studied content popularity using these models, while accounting for user login dynamics and content aging [13]. The structure of the network can also cause periodicity in epidemics [30, 45]. Many focus on modeling specific types of recurrence (e.g., historical disease epidemics [3]). In contrast, many recurrences we observe are aperiodic, and findings on synthetic networks may not easily generalize. Inspired by this line of work, we adapted an SIR model assuming multiple points of infection on a real social network, and showed that key characteristics of recurrence we observed can be reproduced.

7. DISCUSSION AND CONCLUSION

Our results start to shed light on the mechanism of content recurrence: studying a large dataset of popularly reshared content, we find that recurrence is common, and that content can come back not just once, but several times. Strikingly, content may nearly cease to circulate for days, weeks or even months, prior to experiencing another surge in popularity. Such a phenomenon may seem highly unpredictable, but we find trends in how recurring cascades behave, and can predict whether content will come back.
The virality, or appeal, of a cascade plays a role in recurrence: cascades whose initial bursts are long-lasting, moderately popular, and moderately diverse are most likely to recur. The presence of multiple copies of the same content sparks recurrence, though homophily in the network may also influence recurrence.

One limitation of our work is that we only analyze content within a single network. Though most copies of the same content were made within the network, a minority appeared without a prior path. Analyzing the transfer of content between different social networks may reveal different mechanisms of recurrence. Separately, while the appearance of multiple copies correlates with recurrence, this does not hold in the case of individual-copy recurrence. Understanding recurrence in the absence of multiple copies (e.g., through studying homophily in more detail) remains future work.

Based on our observations, we presented a simple model that exhibits some features of recurrence (e.g., pronounced bursts with little activity in between, and an interior maximum in the number of bursts as a function of the number of reshares). Future work could extend such models to account for homophily and community structure in the network.

While the temporal shape, network structure, and user attributes are already highly predictive of resharing behavior, other factors may improve prediction accuracy further: sentimentality or humor may make content evergreen, while content tied to current events may have an expiration date. Seasonality effects may also cause periodic recurrence: we did observe an instance of a daylight-savings image meme which appeared, as expected, exactly at the two points during the year when people needed to adjust their clocks.
Also, other types of content may exhibit different properties of recurrence (e.g., link sharing may be more externally driven). The interactions of users with shared content (e.g., comments) may reveal the reasons why some content came back, and the societal context of memes, as well as their interactions (or competition) with other content, may yield more insight into their popularity [43]. Perhaps most suggestive that much remains to be studied is that while we can predict if recurrence will happen, it remains a significant challenge to predict when recurrence will happen.

Acknowledgments. This work was supported in part by a Microsoft Research PhD Fellowship, a Simons Investigator Award, NSF Grants CNS-1010921 and IIS-1149837, and the Stanford Data Science Initiative.

8. REFERENCES

[1] L. A. Adamic, T. M. Lento, E. Adar, and P. C. Ng. Information evolution in social networks. WSDM, 2016.
[2] M. Ahmed, S. Spagna, F. Huici, and S. Niccolini. A peek into the future: Predicting the evolution of popularity in user generated content. WSDM, 2013.
[3] S. Altizer, A. Dobson, P. Hosseini, P. Hudson, M. Pascual, and P. Rohani. Seasonality and the dynamics of infectious diseases. Ecol. Lett., 2006.
[4] S. Asur, B. Huberman, et al. Predicting the future with social media. WI-IAT, 2010.
[5] S. Asur, B. Huberman, G. Szabo, and C. Wang. Trends in social media: Persistence and decay. ICWSM, 2011.
[6] N. T. Bailey et al. The mathematical theory of infectious diseases and its applications. 1975.
[7] E. Bakshy, I. Rosenn, C. Marlow, and L. A. Adamic. The role of social networks in information diffusion. WWW, 2012.
[8] A.-L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature, 2005.
[9] C. Bauckhage, F. Hadiji, and K. Kersting. How viral are viral videos? ICWSM, 2015.
[10] C. Bauckhage, K. Kersting, and F. Hadiji. Mathematical models of fads explain the temporal dynamics of internet memes. ICWSM, 2013.
[11] Y. Borghol, S. Ardon, N. Carlsson, D. Eager, and A. Mahanti. The untold story of the clones: content-agnostic factors that impact Youtube video popularity. KDD, 2012.
[12] Y. Borghol, S. Mitra, S. Ardon, N. Carlsson, D. Eager, and A. Mahanti. Characterizing and modelling popularity of user-generated videos. Perform. Eval., 2011.
[13] M. Cha, F. Benevenuto, Y.-Y. Ahn, and K. P. Gummadi. Delayed information cascades in Flickr: Measurement, analysis, and modeling. Computer Networks, 2012.
[14] J. Cheng, L. A. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? WWW, 2014.
[15] H. Choi and H. Varian. Predicting the present with Google Trends. Econ. Rec., 2012.
[16] M. Coscia. Average is boring: How similarity kills a meme's success. Sci. Rep., 2014.
[17] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. PNAS, 2008.
[18] M. Fiedler. Algebraic connectivity of graphs. Czech. Math. J., 1973.
[19] A. Friggeri, L. A. Adamic, D. Eckles, and J. Cheng. Rumor cascades. ICWSM, 2014.
[20] M. Girvan, D. S. Callaway, M. E. Newman, and S. H. Strogatz. Simple model of epidemics with pathogen mutation. Phys. Rev. E, 2002.
[21] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011.
[22] N. Grinberg, M. Naaman, B. Shaw, and G. Lotan. Extracting diurnal patterns of real world activity from social media. ICWSM, 2013.
[23] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. WWW, 2004.
[24] A. Guille and H. Hacid. A predictive model for the temporal dynamics of information diffusion in online social networks. WWW Companion, 2012.
[25] A. Johansen. A simple model of recurrent epidemics. J. Theor. Biol., 1996.
[26] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. VLDB, 2004.
[27] J. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 2003.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012.
[29] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. WWW, 2005.
[30] M. Kuperman and G. Abramson. Small world effect in an epidemiological model. Phys. Rev. Lett., 2001.
[31] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang. Prominent features of rumor propagation in online social media. ICDM, 2013.
[32] H. Lakkaraju, J. J. McAuley, and J. Leskovec. What's in a name? understanding the interplay between titles, content, and communities in social media. ICWSM, 2013.
[33] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. KDD, 2009.
[34] J. Leskovec, M. McGlohon, C. Faloutsos, N. S. Glance, and M. Hurst. Patterns of cascading behavior in large blog graphs. SDM, 2007.
[35] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using internet chain-letter data. PNAS, 2008.
[36] Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos. Rise and fall patterns of information diffusion: model and implications. KDD, 2012.
[37] S. A. Myers and J. Leskovec. The bursty dynamics of the twitter information network. WWW, 2014.
[38] S. A. Myers, C. Zhu, and J. Leskovec. Information diffusion and external influence in networks. KDD, 2012.
[39] M. E. Newman. Spread of epidemic disease on networks. Phys. Rev. E, 2002.
[40] L. F. Olsen, G. L. Truty, and W. M. Schaffer. Oscillations and chaos in epidemics: a nonlinear dynamic study of six childhood diseases in Copenhagen, Denmark. Theor. Popul. Biol., 1988.
[41] G. Palshikar et al. Simple algorithms for peak detection in time-series. ICADABAI, 2009.
[42] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 1983.
[43] B. H. Spitzberg. Toward a model of meme diffusion (M3D). Communication Theory, 2014.
[44] G. Stoddard. Popularity dynamics and intrinsic quality in reddit and hacker news. ICWSM, 2015.
[45] J. Verdasca, M. T. Da Gama, A. Nunes, N. Bernardino, J. Pacheco, and M. Gomes. Recurrent epidemics in small world networks. J. Theor. Biol., 2005.
[46] L. Weng, A. Flammini, A. Vespignani, and F. Menczer. Competition among memes in a world with limited attention. Sci. Rep., 2012.
[47] J. Yang and S. Counts. Predicting the speed, scale, and range of information diffusion in twitter. ICWSM, 2010.
[48] J. Yang and J. Leskovec. Modeling information diffusion in implicit networks. ICDM, 2010.
[49] J. Yang and J. Leskovec. Patterns of temporal variation in online media. WSDM, 2011.
