Modeling the Dynamics of Online Learning Activity


Authors: Charalampos Mavroforakis, Isabel Valera, Manuel Gomez Rodriguez

Charalampos Mavroforakis* (1), Isabel Valera (2), and Manuel Gomez-Rodriguez (2)

(1) Boston University, cmav@bu.edu
(2) Max Planck Institute for Software Systems, ivalera@mpi-sws.org, manuelgr@mpi-sws.org

Abstract

People are increasingly relying on the Web and social media to find solutions to their problems in a wide range of domains. In this online setting, closely related problems often lead to the same characteristic learning pattern, in which people sharing these problems visit related pieces of information, perform almost identical queries or, more generally, take a series of similar actions. In this paper, we introduce a novel modeling framework for clustering continuous-time grouped streaming data, the hierarchical Dirichlet Hawkes process (HDHP), which allows us to automatically uncover a wide variety of learning patterns from detailed traces of learning activity. Our model allows for efficient inference, scaling to millions of actions taken by thousands of users. Experiments on real data gathered from Stack Overflow reveal that our framework can recover meaningful learning patterns in terms of both content and temporal dynamics, as well as accurately track users' interests and goals over time.

1 Introduction

Learning has become an online activity – people routinely use a wide variety of online learning platforms, ranging from wikis and question answering (Q&A) sites to online communities and blogs, to learn about a large range of topics. In this context, people find solutions to their problems by looking for closely related pieces of information, executing a sequence of queries or, more generally, performing a series of online actions.
For example, a high school student may study several closely related wiki pages to prepare an essay about a historical event; a software developer may read several answers within a Q&A site to solve a specific programming problem; and a researcher may check a specialized blog written by one of her peers to learn about a new concept or technique. All of the above are examples of learning patterns, in which people perform a series of online actions – reading a wiki page, an answer, or a blog – to achieve a predefined goal – writing an essay, solving a programming problem, or learning about a new concept or technique. In this context, one may expect that people with similar goals undertake similar sequences of online actions and thus adopt similar learning patterns. Therefore, one could leverage the vast availability of online traces of users' learning activity to disambiguate among the interleaved learning patterns adopted by individuals over time, as well as to automatically identify and track those people's interests and goals over time.

In this work, we introduce a novel probabilistic model, the Hierarchical Dirichlet Hawkes Process (HDHP), for clustering continuous-time grouped streaming data, which we use to uncover the dynamics of learning activity on the web. The HDHP leverages the properties of the Hierarchical Dirichlet Process (HDP) [18], a popular Bayesian nonparametric model for clustering problems involving multiple groups of data, combined with the Hawkes process [13], a temporal point process particularly well fitted to model social activity [11, 19, 20]. In particular, the former is used to account for an infinite number of learning patterns, which are shared across users (groups) of an online learning platform. The latter is used to capture the temporal dynamics of the learning activity, which alternate between bursts of rapidly occurring events and relatively long periods of inactivity [4]. In more detail, in the case of the HDHP, the learning pattern distribution that determines the content and temporal parameters of each learning pattern is drawn from a Dirichlet process (DP) [12]. Each user's learning activity is modeled as a multivariate Hawkes process, with as many dimensions as the number of learning patterns (i.e., infinite), whose parameters are given by the DP and thus shared across all the users. Every time a user decides to perform a new action, she may opt for starting a new task or following up on one of her previous ones. Here, tasks refer to sequences of related actions performed closely in time, which in turn can be viewed as realizations of learning patterns. Our model allows for an efficient inference algorithm, based on sequential Monte Carlo [15], which scales to millions of actions and thousands of users. We experiment on real-world data from Stack Overflow, using ~1.6 million questions posted by 16,000 users over a four-year period. Our results show that our model, by taking advantage of the temporal information in the data, not only accurately tracks users' interests over time, but also provides more meaningful learning patterns in terms of content (a 20% gain in perplexity) compared to the HDP. A Python implementation of the proposed HDHP is available on GitHub.¹

Related work. The Hierarchical Dirichlet Hawkes process (HDHP) can be viewed as a model for clustering grouped continuous-time streaming data. In our application domain, each group of data corresponds to a user's online actions and the clusters correspond to learning patterns, shared across all the users.

* This work was done during the author's internship at Max Planck Institute for Software Systems.
Therefore, our work relates, on the one hand, to models for clustering groups of data [6, 17] and, on the other hand, to models for clustering (single) streaming data [3, 5, 2, 10]. To the best of our knowledge, models for clustering grouped streaming data are nonexistent to date. The most popular models for clustering groups of data, the Latent Dirichlet Allocation (LDA) [6] and its nonparametric counterpart, the HDP [18], originate from the topic modeling literature. There, each document is transformed into a bag-of-words and is modeled as a mixture of topics that are shared across all documents. More generally, models for clustering groups of data typically consider that each observation (word) within a group (document) is a sample from a mixture model, and the mixture components are shared among groups. One could use such models to cluster users' activity on the Web; however, observations are assumed to be exchangeable, and thus these models cannot account for the temporal dynamics of learning activity. As a consequence, they are unable to track users' interests and goals over time. Models for clustering streaming data can incorporate temporal dynamics [2, 3]; however, they can only handle a single stream of data and thus cannot be used to jointly model several users' learning activity. Additionally, most of these models discretize the time dimension into bins [2, 3], introducing additional tuning parameters, and ignoring the self-excitation across events [5], a phenomenon regularly observed in social activity data [11]. Perhaps the most closely related work to ours is the recently proposed Dirichlet Hawkes Process (DHP) [10], a continuous-time model for streaming data that allows for self-excitation.
However, the DHP suffers from a significant limitation: the lack of an underlying DP (or, in fact, any other Bayesian nonparametric) prior on the cluster distribution compromises the identifiability and reproducibility of the model. Additionally, from the perspective of our application, the DHP only allows for a single data stream (user) and forces clusters to be forgotten after some time. The latter is an overly restrictive assumption, since a user may perform similar actions, i.e., actions belonging to the same learning pattern, over widely spaced intervals of time.

¹ https://github.com/Networks-Learning/hdhp.py

2 Preliminaries

In this section, we briefly review the major building blocks of the Hierarchical Dirichlet Hawkes process (HDHP): the Hierarchical Dirichlet process (HDP) [18] and the Hawkes process [1].

2.1 Hierarchical Dirichlet Process

The HDP is a Bayesian nonparametric prior, useful for clustering grouped data [18], which allows for an unbounded number of clusters whose parameters are shared across all the groups. It has been broadly applied to topic modeling as the nonparametric counterpart of the Latent Dirichlet Allocation (LDA), where the number of topics is finite and predefined. More specifically, this process defines a hierarchy of Dirichlet processes (DPs), in which a set of random probability measures G_j ∼ DP(β_1, G_0) (one for each group of data) are distributed as DPs with concentration parameter β_1 and base distribution G_0. The latter is also distributed as a DP, i.e., G_0 ∼ DP(β_0, H). In the HDP, the distributions G_j share the same support as G_0, and are conditionally independent given G_0.

Chinese Restaurant Franchise. An alternative representation of the HDP is the Chinese Restaurant Franchise Process (CRFP), which allows us not only to obtain samples from the HDP but also to derive efficient inference algorithms.
The CRFP assumes a franchise with as many restaurants as there are groups of data (e.g., number of documents), where all of the restaurants share the same menu with an infinite number of dishes (or clusters). In particular, one can obtain samples from the HDP as follows:

1. Initialize the total number of dishes L = 0, and the total number of tables in each restaurant K_r = 0, for r = 1, ..., R, with R being the total number of restaurants in the franchise.

2. For each restaurant r = 1, ..., R, and for each customer i = 1, ..., N_r (N_r is the total number of customers entering restaurant r):

   – Sample the table for the i-th customer in restaurant r from a multinomial distribution with probabilities

      Pr(b_ri = k) = n_rk / (β_1 + i − 1), for k = 1, ..., K_r
      Pr(b_ri = K_r + 1) = β_1 / (β_1 + i − 1)                                    (1)

      where n_rk = Σ_{j=1}^{i−1} I(b_rj = k) is the number of customers seated at the k-th table of the r-th restaurant.

   – If b_ri = K_r + 1, i.e., the i-th customer sits at a new table, sample its dish from a multinomial distribution with probabilities

      Pr(φ_{r(K_r+1)} = ϕ_ℓ) = m_ℓ / (K + β_0), for ℓ = 1, ..., L
      Pr(φ_{r(K_r+1)} = ϕ_{L+1}) = β_0 / (K + β_0)                                (2)

      where K = Σ_{j=1}^{R} K_j is the total number of tables in the franchise, m_ℓ = Σ_{j=1}^{R} Σ_{k=1}^{K_j} I(φ_jk = ϕ_ℓ) is the total number of tables serving dish ϕ_ℓ in the franchise, and ϕ_{L+1} ∼ H(ϕ) is the new dish, i.e., the parameters of the new cluster.

   – Increase the number of tables in the r-th restaurant, K_r = K_r + 1, and, if φ_{rK_r} = ϕ_{L+1} (i.e., the new table is assigned to a new dish/cluster), also increase the total number of dishes in the franchise, L = L + 1.
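The sampling scheme above can be sketched in a few lines of Python. This is a toy illustration only; the function and variable names are ours and are not taken from the paper's released implementation:

```python
import numpy as np

def sample_crfp(num_restaurants, customers_per_restaurant, beta1, beta0, rng):
    """Sample table and dish assignments from the Chinese Restaurant Franchise."""
    L = 0                   # total number of dishes (clusters) in the franchise
    tables_per_dish = []    # m_ell: number of tables serving each dish
    dish_of_table = []      # dish_of_table[r][k] = dish served at table k of restaurant r
    assignments = []        # one (restaurant, table, dish) triple per customer
    for r in range(num_restaurants):
        dish_of_table.append([])
        counts = []         # n_rk: customers seated at each table of restaurant r
        for i in range(customers_per_restaurant):
            # Eq. (1): existing table with prob. prop. to n_rk, new table prop. to beta1
            probs = np.array(counts + [beta1], dtype=float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(counts):                    # customer opens a new table
                counts.append(0)
                # Eq. (2): existing dish prop. to m_ell, new dish prop. to beta0
                dprobs = np.array(tables_per_dish + [beta0], dtype=float)
                d = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
                if d == L:                          # the new table serves a new dish
                    tables_per_dish.append(0)
                    L += 1
                tables_per_dish[d] += 1
                dish_of_table[r].append(d)
            counts[k] += 1
            assignments.append((r, k, dish_of_table[r][k]))
    return assignments, L
```

Normalizing `probs` reproduces the denominators β_1 + i − 1 and K + β_0 of Eqs. (1) and (2), since the counts at the i-th step sum to i − 1 (respectively, to K).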
Note that, although in the process above we have generated the data (customers) for each group (restaurant) sequentially, due to the exchangeability properties of the HDP, the resulting distribution of the data is invariant to the order in which customers are assumed to enter any of the restaurants [18].

2.2 Hawkes Process

A Hawkes process is a stochastic process in the family of temporal point processes [1], whose realizations consist of lists of discrete events localized in time, {t_1, t_2, ..., t_n} with t_i ∈ R_+. A temporal point process can be equivalently represented as a counting process, N(t), which records the number of events before time t. The probability of an event happening in a small time window [t, t + dt) is given by P(dN(t) = 1 | H(t)) = λ*(t) dt, where dN(t) ∈ {0, 1} denotes the increment of the process, H(t) denotes the history of events up to but not including time t, λ*(t) is the conditional intensity function (intensity, in short), and the sign * indicates that the intensity may depend on the history H(t). In the case of Hawkes processes, the intensity function adopts the following form:

   λ*(t) = μ + Σ_{t_i ∈ H(t)} κ_α(t, t_i),                                        (3)

where μ is the base intensity and κ_α(t, t_i) is the triggering kernel, which is parametrized by α. Note that this intensity captures the self-excitation phenomenon across events and thus allows modeling bursts of activity. As a consequence, the Hawkes process has been increasingly used to model social activity [11, 19, 20], which is characterized by bursts of rapidly occurring events separated by long periods of inactivity [4]. Finally, given the history of events in an observation window [0, T), denoted by H(T), we can express the log-likelihood of the observed data as

   L_T = Σ_{i : t_i ∈ H(T)} log λ*(t_i) − ∫_0^T λ*(τ) dτ.                          (4)
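To make Eqs. (3) and (4) concrete, here is a minimal Python sketch for the exponential kernel κ_α(t, t_i) = α e^{−ν(t − t_i)}: it simulates events with Ogata's thinning algorithm and evaluates the log-likelihood in closed form. The code and its names are ours, an illustration rather than the paper's implementation:

```python
import numpy as np

def intensity(t, events, mu, alpha, nu):
    """Eq. (3) with exponential kernel: lambda*(t) = mu + sum_i alpha*exp(-nu*(t - t_i))."""
    past = np.asarray([s for s in events if s < t])
    return mu + alpha * np.sum(np.exp(-nu * (t - past)))

def simulate(mu, alpha, nu, T, rng):
    """Ogata's thinning: between events the intensity only decays, so the intensity
    at the current time plus one extra jump of size alpha (covering an event that
    may sit exactly at the current time) is a valid upper bound."""
    t, events = 0.0, []
    while True:
        lam_bar = intensity(t, events, mu, alpha, nu) + alpha
        t += rng.exponential(1.0 / lam_bar)          # candidate arrival
        if t >= T:
            return events
        if rng.uniform() * lam_bar <= intensity(t, events, mu, alpha, nu):
            events.append(t)                          # accept with prob lambda*/lam_bar

def log_likelihood(events, mu, alpha, nu, T):
    """Eq. (4): sum_i log lambda*(t_i) minus the compensator integral, which has
    a closed form for the exponential kernel."""
    ll = sum(np.log(intensity(t, events, mu, alpha, nu)) for t in events)
    ll -= mu * T + (alpha / nu) * np.sum(1.0 - np.exp(-nu * (T - np.asarray(events))))
    return ll
```

With α = 0 the sampler reduces to a homogeneous Poisson process with rate μ, which gives a quick sanity check of both functions.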
3 Learning activity model

In an online learning platform, users find solutions to their problems by sequentially looking for closely related pieces of information within the site, executing a sequence of queries or, more generally, performing a series of online actions. In this context, one may expect people with similar goals to undertake similar sequences of actions, which in turn can be viewed as realizations of an unbounded number of learning patterns. Here, we assume that each action is linked to some particular content, and we propose a modeling framework that characterizes sequences of actions by means of the timestamps as well as the associated content of these actions. Next, we formulate our model for online learning activity, starting by describing the data it is designed for.

Learning activity data and generative process. Given an online learning platform with a set of users U, we represent each learning action of a user as a triplet

   e := (t, ω, p),                                                                (5)

meaning that at time t the user took an action linked to content ω, and this action is associated with the learning pattern p, which is hidden. Then, we denote the history of learning actions taken by each user u up to, but not including, time t as H_u(t). We represent the times of each user u's learning actions within the platform as a set of counting processes, N_u(t), in which the ℓ-th entry counts the number of times up to time t that user u took an action associated with the learning pattern ℓ.
Then, we characterize these counting processes using their corresponding intensities as E[dN_u(t) | H_u(t)] = λ*_u(t) dt, where dN_u(t) = [dN_{u,ℓ}(t)]_{ℓ ∈ [L]} denotes the number of learning actions in the time window [t, t + dt) for each learning pattern, λ*_u(t) = [λ*_{u,ℓ}(t)]_{ℓ ∈ [L]} denotes the corresponding pattern intensities, L is the number of learning patterns, and the sign * indicates that the intensities may depend on the user's history, H_u(t). Additionally, for each learning action e = (t, ω, p), the content ω is sampled from a distribution f(ω | p), which depends on the corresponding learning pattern p. Here, in order to account for an unbounded number of learning patterns, i.e., L → ∞, we assume that the learning pattern distribution follows a Dirichlet process (DP). Next, we specify the functional form of the user intensity associated with each learning pattern and the content distribution, and we elaborate further on the learning pattern distribution.

Intensity of the user learning activity. Every time user u performs a learning action, she may opt to either start a new task, defined as a sequence of learning actions similar in content and performed closely in time (i.e., a realization of a learning pattern), or to follow up on an already ongoing task. The multivariate Hawkes process [1], described in Section 2.2, presents itself as a natural choice to model this behavior. This way, each dimension ℓ corresponds to a learning pattern ℓ and its associated intensity is given by

   λ*_{u,ℓ}(t) = μ_u π_ℓ + Σ_{j : t_j ∈ H_u(t), p_j = ℓ} κ_ℓ(t, t_j),              (6)

where the first term corresponds to starting a new task and the second to following up on an existing one.
Here, the parameter μ_u ≥ 0 accounts for the rate at which user u starts new tasks, π_ℓ ∈ [0, 1] is the probability that a user adopts learning pattern ℓ (referred to as learning pattern popularity from now on), and κ_ℓ(t, t′) is a nonnegative kernel function that models the decaying influence of past events on the pattern's intensity. For convenience, in terms of both theoretical properties and model inference, we opt for an exponential kernel function of the form κ_ℓ(t, t′) = α_ℓ exp(−ν(t − t′)), where α_ℓ controls the self-excitation (or burstiness) of the Hawkes process and ν controls the decay. Finally, note that we can compute user u's average intensity at time t analytically as [11]:

   E_{H_u(t)}[λ*_u(t)] = [ e^{(A − νI)t} + ν(A − νI)^{−1} ( e^{(A − νI)t} − I ) ] μ_u,     (7)

where A = diag([α_1, ..., α_L]), μ_u = [μ_u π_1, ..., μ_u π_L]^T, I is the identity matrix, and the expectation is taken over all possible histories H_u(t). We can also compute the expected number of actions performed by user u until time T as

   E_{H_u(T)}[N_u(T)] = ∫_0^T E_{H_u(τ)}[λ*_u(τ)] dτ.                              (8)

Content distribution. We gather the content associated with each learning action e = (t, ω, p) as a vector ω, in which each element is a word sampled from a vocabulary W as

   ω_j ∼ Multinomial(θ_p),                                                        (9)

where θ_p is a |W|-length vector indicating the probability of each word appearing in content from pattern p.

Learning pattern parameters. The distribution of the learning patterns is sampled from a DP, G_0 ∼ DP(β, H), which can alternatively be written as

   G_0 = Σ_{ℓ=1}^{∞} π_ℓ δ_{ϕ_ℓ},                                                 (10)

where π = (π_ℓ)_{ℓ=1}^{∞} ∼ GEM(β) is sampled from a stick-breaking process [14] and ϕ_ℓ = {α_ℓ, θ_ℓ} ∼ H(ϕ).
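Since A is diagonal, the matrix exponential and inverse in Eq. (7) act elementwise on the eigenvalues α_ℓ − ν, which gives a simple way to evaluate the average intensity and, by numerical integration, the expected number of actions in Eq. (8). A sketch, assuming α_ℓ ≠ ν (function names are ours):

```python
import numpy as np

def avg_intensity(t, mu_u, pi, alpha, nu):
    """Eq. (7) with A = diag(alpha): all matrix functions act elementwise on the
    eigenvalues d_ell = alpha_ell - nu of (A - nu*I), assumed nonzero."""
    base = mu_u * np.asarray(pi, dtype=float)        # entries of the vector mu_u
    d = np.asarray(alpha, dtype=float) - nu
    return (np.exp(d * t) + (nu / d) * (np.exp(d * t) - 1.0)) * base

def expected_actions(T, mu_u, pi, alpha, nu, grid=20001):
    """Eq. (8) by trapezoidal integration of the average intensity over [0, T]."""
    ts = np.linspace(0.0, T, grid)
    lam = np.stack([avg_intensity(t, mu_u, pi, alpha, nu) for t in ts])
    dt = ts[1] - ts[0]
    return dt * (0.5 * lam[0] + lam[1:-1].sum(axis=0) + 0.5 * lam[-1])
```

As sanity checks: at t = 0 the average intensity equals the base rates μ_u π_ℓ; for α_ℓ < ν it converges to the stationary value μ_u π_ℓ ν / (ν − α_ℓ); and for α_ℓ = 0 the expected number of actions is that of a Poisson process, μ_u π_ℓ T.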
Remarks. Overall, the proposed learning activity model, which we refer to as the Hierarchical Dirichlet Hawkes process (HDHP), is based on a two-layer hierarchical design. The top layer is a Dirichlet process that determines the learning pattern distribution, and the bottom layer corresponds to a collection of independent multivariate Hawkes processes, one per user, with as many dimensions as the number of learning patterns, i.e., infinite. In the HDHP, the popularity of each learning pattern, or equivalently the probability of assigning a new task to it, is constant over time and given by the distribution G_0. However, the probability distribution over the learning patterns for each specific user evolves continuously over time and directly depends on her instantaneous intensity. Finally, we remark that, due to the infinite dimensionality of the Hawkes process that captures the learning activity of each user, sampling or performing inference directly on this model is intractable. Fortunately, we can benefit from the properties of both the Hawkes process and the DP, and propose an alternative generative process that we can then utilize to efficiently obtain samples from the HDHP.

3.1 Tractable model representation

Similarly to the HDP, we can generate samples from the proposed HDHP by following a generative process similar to the CRFP. To this end, we leverage the properties of the Hawkes process and represent the learning actions of all the users in the learning platform as a multivariate Hawkes process, with as many dimensions as there are users, from which we sample the user and the timestamp associated with the next learning action. This action is then assigned to either an existing or a new task with a probability that depends on the history of that user up to, but not including, the time of the action.
When initiating a new task, the associated learning pattern is sampled from a distribution that accounts for the overall popularity of all the learning patterns. We finally sample the action content ω as discussed previously. In the process described above, each user can be viewed as a restaurant, each action as a customer, each task as a table, and each pattern as a dish, as in the original CRFP. Hence, if we assume a set of users U and a vocabulary W for the content, we can generate N learning actions as follows:

1. Initialize the total number of tasks K = 0 and the total number of learning patterns L = 0.

2. For n = 1, ..., N:

   (a) Sample the time t_n and user u_n ∈ U for the new action, such that t_n > t_{n−1}, as in [19]:

      (t_n, u_n) ∼ Hawkes( [ μ_1 + Σ_{i=1}^{n−1} κ_{b_i}(t_n, t_i) I(u_i = 1),
                             ...,
                             μ_U + Σ_{i=1}^{n−1} κ_{b_i}(t_n, t_i) I(u_i = U) ] )   (11)

   (b) Sample the task b_n for the new action from a multinomial distribution with probabilities

      Pr(b_n = k) = λ*_{u_n,k}(t_n) / λ*_{u_n}(t_n), for k = 1, ..., K
      Pr(b_n = K + 1) = μ_{u_n} / λ*_{u_n}(t_n)                                   (12)

      where λ*_{u_n,k}(t_n) = Σ_{i=1}^{n−1} κ_{b_i}(t_n, t_i) I(u_i = u_n, b_i = k), and λ*_{u_n}(t_n) = μ_{u_n} + Σ_{i=1}^{n−1} κ_{b_i}(t_n, t_i) I(u_i = u_n) is the total intensity of user u_n at time t_n.

   (c) If b_n = K + 1, assign the new task to a learning pattern with probabilities

      Pr(φ_{K+1} = ϕ_ℓ) = m_ℓ / (K + β), for ℓ = 1, ..., L
      Pr(φ_{K+1} = ϕ_{L+1}) = β / (K + β)                                         (13)

      where m_ℓ = Σ_{k=1}^{K} I(φ_k = ϕ_ℓ) is the number of tasks assigned to learning pattern ℓ across all users, and ϕ_{L+1} = {α_{L+1}, θ_{L+1}} is the set of parameters of the new learning pattern L + 1, which we sample from α_{L+1} ∼ Gamma(τ_1, τ_2) and θ_{L+1} ∼ Dirichlet(η_0). Then, increase the number of tasks, K = K + 1, and, if φ_{K+1} = ϕ_{L+1}, also increase the number of learning patterns, L = L + 1.
   (d) Sample each word in the content ω_n from ω_{n,j} ∼ Multinomial(θ_{b_n}).

Remarks. Note that, in the process above, both users and learning patterns are exchangeable. However, contrary to the CRFP, the generated data consist of a sequence of discrete events localized in time, which therefore do not satisfy the exchangeability property. Moreover, the complexity of this generative process differs from that of the standard CRFP in only two steps. First, it needs to sample the event time and user from a Hawkes process as in Eq. 11, which can be done in linear time with respect to the number of users [11]. Second, while the CRFP only accounts for the number of customers at each table, the above process needs to evaluate the intensity associated with each table (see Eq. 12), which can be updated in O(1) using the properties of the exponential function. We also want to stress that, although the above generative process resembles the one for the Dirichlet Hawkes process (DHP) [10], they differ in two key factors. First, the DHP can only generate a single sequence of events, while the above process can generate an independent sequence for each user. Second, the DHP does not instantiate an explicit prior distribution on the clusters, which results in a lack of identifiability and reproducibility of the model. In other words, new events in the DHP are only allowed to join a new or a currently active cluster – once a cluster “dies” (i.e., its intensity becomes negligible), no new event can be assigned to it anymore. As a result, two bursts of events that are similar in content and dynamics but widely separated in time will be assigned to different clusters, leading to multiple copies of the same cluster.
In contrast, our generative process ensures the identifiability and reproducibility of the model by placing a DP prior on the cluster distribution and using the CRFP to integrate out the learning pattern popularity.

3.2 Inference

Given a collection of N observed learning actions performed by a set of users U during a time period [0, T), our goal is to infer the learning patterns that these actions belong to. To efficiently sample from the posterior distribution, similarly to [10], we leverage the generative process described in Section 3.1. We derive a sequential Monte Carlo (SMC) algorithm that exploits the temporal dependencies in the observed data to sequentially sample the latent variables associated with each learning action. In particular, the posterior distribution p(b_{1:N} | t_{1:N}, u_{1:N}, ω_{1:N}) is sequentially approximated with a set of P particles, which are sampled from a proposal distribution q(b_{1:N} | t_{1:N}, u_{1:N}, ω_{1:N}). To infer the global parameters, μ_u and α_ℓ, we follow the SMC literature devoted to the estimation of static parameters [7, 8], and sequentially update the former by maximum likelihood estimation and the latter by sampling from its posterior distribution. The inference algorithm, which is detailed in Appendix A, has complexity O(P(U + L + K) + P) per observed learning action, where L and K are, respectively, the number of learning patterns and the number of tasks uncovered up to this action.

Figure 1: Evaluation of the inference algorithm at recovering the model parameters and the latent learning pattern associated with each learning event on 150k synthetically generated data. Panels: (a) estimation of μ_u, (b) estimation of α_ℓ, (c) clustering accuracy (NMI).
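The O(1) intensity update for the exponential kernel, which keeps the per-action cost of both the generative process (Eq. 12) and the inference algorithm low, can be sketched as follows (class name ours). The trick is that λ_k(t) = α Σ_i e^{−ν(t − t_i)} factorizes through the most recent event, so a single accumulator suffices:

```python
import math

class TaskIntensity:
    """O(1) update of a task's Hawkes intensity under the exponential kernel
    kappa(t, t') = alpha * exp(-nu * (t - t')), via the recursion
    lambda_k(t) = alpha * A * exp(-nu * (t - t_last))."""

    def __init__(self, alpha, nu):
        self.alpha, self.nu = alpha, nu
        self.A = 0.0          # accumulated excitation, discounted to t_last
        self.t_last = None    # time of the most recent event of this task

    def value(self, t):
        """Intensity contribution of this task at time t >= t_last."""
        if self.t_last is None:
            return 0.0
        return self.alpha * self.A * math.exp(-self.nu * (t - self.t_last))

    def add_event(self, t):
        """Fold a new event at time t into the accumulator in constant time."""
        if self.t_last is None:
            self.A = 1.0
        else:
            self.A = self.A * math.exp(-self.nu * (t - self.t_last)) + 1.0
        self.t_last = t
```

The recursive value agrees with the brute-force sum over the full event history, while storing only two numbers per task.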
4 Experiments

4.1 Experiments on synthetic data

In this section, we experiment with synthetic data and show that our inference algorithm can accurately recover the model parameters, as well as assign each generated learning action to the true learning pattern, given only the times and content of the learning events.

Experimental setup. We assume a set of 200 users, L = 50 learning patterns and a vocabulary of size |W| = 100. We then sample the base intensity of each user, μ_u, from Gamma(10, 0.2), and the learning pattern popularity vector π from a Dirichlet distribution with concentration parameters equal to 1. For each learning pattern, we sample the kernel parameter α_ℓ from Gamma(8, 0.25), randomly pick 30 words that will be used by the pattern, and sample their distribution from a Dirichlet distribution with parameters equal to 3. We assume a kernel decay of ν = 5. Then, for each user, we generate online learning actions from the corresponding multivariate Hawkes process.

Results. Figure 1 summarizes the results by showing the true and the estimated values of the base intensity of each user, μ_u, and the kernel parameter of each pattern, α_ℓ, using a total of 150k events. Moreover, it also shows the normalized mutual information (NMI) between the true and inferred clusters of actions against the number of events seen by our inference algorithm. Here, we report the results for the particle that provided the maximum likelihood, and match the inferred learning patterns to the true ones by maximizing the NMI score. Our inference algorithm accurately recovers the model parameters and, as expected, using more events when inferring the model parameters leads to a more accurate assignment of events to learning patterns.
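The experimental setup above can be reproduced with a few lines of NumPy. This is a sketch; in particular, we read the second Gamma parameter as a scale (which makes both means equal to 2, consistent with the parameter ranges in Figure 1), an assumption about the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, L, vocab_size = 200, 50, 100

mu = rng.gamma(shape=10.0, scale=0.2, size=num_users)   # base intensities mu_u
pi = rng.dirichlet(np.ones(L))                          # learning-pattern popularities
alpha = rng.gamma(shape=8.0, scale=0.25, size=L)        # kernel parameters alpha_ell
nu = 5.0                                                # kernel decay

# Each pattern uses 30 randomly chosen words; their probabilities come from a
# symmetric Dirichlet with concentration 3 over those 30 words.
theta = np.zeros((L, vocab_size))
for ell in range(L):
    words = rng.choice(vocab_size, size=30, replace=False)
    theta[ell, words] = rng.dirichlet(np.full(30, 3.0))
```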
4.2 Experiments on real data

In this section, we experiment on real data gathered from Stack Overflow, a popular question answering (Q&A) site, where users can post questions – with topics ranging from C# programming to Bayesian nonparametrics – which are, in turn, answered by other users of the site. We fit our proposed HDHP on a large set of learning actions, and show that it recovers meaningful learning patterns and allows us to accurately track users' interests over time.

Figure 2: Goodness of fit of the HDHP model in terms of (a) dynamics (percentage of users rejected by the KS- and AD-tests, for the HDHP and the Hawkes process) and (b) content.

Experimental setup. We gather the times and content of all the questions posted by all Stack Overflow users during a four-year period, from September 1, 2010 to September 1, 2014. Here, we consider each user's question as a learning action. The reason for this choice is primarily the availability of public datasets and, secondarily, the fact that a question provides clear evidence of the user's current interest at the time of asking. By looking only at the questions, we are underestimating the number of actions taken on each task; however, this bias is shared across all the tasks and, thus, we can still compare the dynamics of different patterns in a sensible way. For each question, we use the set of (up to) five tags (or keywords) that the user chose to describe her question as the content associated with the learning action. To ensure that the inferred parameters are reliable and accurate, we only consider users who posted at least 50 questions and tags that were used in at least 100 questions. After these preprocessing steps, our dataset consists of ~1.6 million questions posted by ~16,000 users, and a vocabulary of ~31,400 tags.
Finally, we run our inference algorithm on the first 45 months of data and evaluate its performance on the last three months, used as a held-out set. In our experiments, we set the time scale to one month, the kernel decay to ν = 5, and the number of particles to |P| = 200, which worked well in practice. Our implementation of the SMC algorithms for the proposed HDHP and the HDP requires, respectively, 71 ms and 65 ms per question on average, which implies that accounting for the temporal information in the data leads to an increase in runtime of approximately 10%.

Goodness of fit. We evaluate the goodness of fit of our proposed model on learning activity data, in terms of both content and temporal dynamics. To this end, we first evaluate the performance of the HDHP at capturing the temporal dynamics of the learning activity and compare it with the standard Hawkes process, which only accounts for the temporal information in the data and, therefore, cannot cluster learning actions into learning patterns. In the latter, we model the learning activity of each user as an independent univariate Hawkes process, disregarding the content of each learning action. In other words, for each user active in the test set U_test, we learn both a base intensity μ and a self-excitation parameter α, as defined in Eq. 3. In order to compare the performance of the models, we first apply the time-changing theorem [9], which states that the integral of the intensity of a point process between two consecutive events, ∫_{t_i}^{t_{i+1}} λ*_u(t) dt, should conform to the unit-rate exponential distribution. Then, we resort to two goodness-of-fit tests, the Kolmogorov-Smirnov and the Anderson-Darling [16], to measure how well the transformed action times fit the target distribution.
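The transformation behind these tests can be sketched as follows: given a model's compensator Λ(t) = ∫_0^t λ*(s) ds, the rescaled inter-event intervals Λ(t_{i+1}) − Λ(t_i) should be i.i.d. unit-rate exponential, and a Kolmogorov-Smirnov statistic measures the distance between their empirical CDF and Exp(1). Function names are ours, and a library routine such as SciPy's one-sample KS test could be used instead:

```python
import numpy as np

def rescaled_intervals(event_times, compensator):
    """Map event times through Lambda(t); under a well-specified model, the
    differences are iid samples from the unit-rate exponential distribution."""
    lam = np.array([compensator(t) for t in event_times])
    return np.diff(lam)

def ks_statistic_exp1(samples):
    """Kolmogorov-Smirnov distance between the empirical CDF of `samples`
    and the Exp(1) CDF, F(x) = 1 - exp(-x)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    cdf = 1.0 - np.exp(-x)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)    # ECDF above the model CDF
    d_minus = np.max(cdf - np.arange(0, n) / n)       # ECDF below the model CDF
    return max(d_plus, d_minus)
```

For a well-specified model the rescaled intervals have mean 1 and a small KS statistic; a large statistic leads the test to reject the user's fit, as reported in Figure 2a.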
Figure 2a summarizes the results by showing the percentage of the users in the held-out set that each test rejects at a significance level of 5%. While the Hawkes process performs slightly better (by 5% for the KS-test and 11% for the AD-test) than our model, it does so by using almost 2× more parameters – 2|U_test| ≈ 5k for the Hawkes process vs. |U_test| + L* ≈ 2.7k for the HDHP, where L* = 227 is the number of inferred learning patterns.

Figure 3: Three inferred learning patterns in Stack Overflow: (a) Machine Learning, (b) Python, and (c) Version Control. The top row shows the content associated with each pattern, in the form of clouds of words, while the bottom row shows two samples of its characteristic temporal dynamics, by means of the intensities of two users using the pattern.

Second, we focus on evaluating the performance of the HDHP at clustering learning activity, and compare it with the HDP [18], which only makes use of the content information in the data. We resort to the marginal likelihood of the inferred parameters evaluated on the held-out set of questions ω_i ∈ D_test, i.e.,

   p(ω_i | D_train, u_i, t_i) = Σ_{ℓ=1}^{L} p(ω_i | D_train, ℓ) p(ℓ | D_train, u_i, t_i).

Above, the first term is defined in the same way for both models. However, for the HDP, the second term is simply the topic popularity π_ℓ, while for the HDHP it depends on the complete user history up to, but not including, t_i, i.e., p(ℓ | D_train, u_i, t_i) ∝ λ*_{u_i,ℓ}(t_i), where λ*_{u_i,ℓ}(t_i) is given by Eq. 6. Figure 2b shows the log-likelihood values obtained under the proposed HDHP and the HDP on the held-out set.
Here, higher log-likelihood values mean better goodness of fit and, therefore, all the points above the $x = y$ line correspond to questions that are better captured by the HDHP, which are in turn 60% of the held-out questions. Additionally, we also compute the perplexity [6] as
$$\text{perplexity} = \exp\left(-\frac{\sum_{i:\, e_i \in \mathcal{D}^{test}} \log p(\omega_i \mid \mathcal{D}^{train})}{|\mathcal{D}^{test}|}\right).$$
The perplexity values for the HDHP and the HDP are 204 and 243, respectively, where lower perplexity values mean better goodness of fit. These results show that, by modeling temporal information in addition to content information, the HDHP fits the content in the data better than the HDP (20% gain in perplexity) and, therefore, provides more meaningful learning patterns in terms of content.

Learning patterns. In this section, our goal is to understand the characteristic properties of the learning patterns that Stack Overflow users follow for problem solving. To this end, we first pay attention to three particular examples of learning patterns, 'Machine Learning', 'Python' and 'Version Control', among the uncovered $L^* = 227$ learning patterns, and investigate their characteristic properties. Figure 3 compares the above mentioned patterns in terms of content, by means of word clouds, and in terms of temporal dynamics, by means of the learning pattern intensities associated to two different users active on each of the patterns. Here, we observe that: i) the cloud of words associated to each inferred learning pattern corresponds to meaningful topics; and ii) despite the stochastic nature of the temporal dynamics, the user intensities within the same learning pattern tend to exhibit striking similarities in terms of burstiness and periods of inactivity. For example, we observe that the Machine Learning and Python tasks exhibit much larger bursts of events than Version Control.
A plausible explanation is that version control problems tend to be more specific and simple, e.g., resolving a conflict while merging versions, and, thus, can be quickly solved with one or just a few questions. On the contrary, a user interested in machine learning or Python may face more complex problems whose solution requires asking several questions in a relatively short period of time.

Figure 4: Learning patterns. Panels (a) and (b) show the popularity and burstiness of the top-50 most popular learning patterns, and panel (c) shows the popularity against burstiness for all the inferred patterns. We highlight the learning pattern examples in Figure 3, as well as some others (e.g., CSS, Linux, ML) from Table 1.

Next, we investigate whether the more popular learning patterns are also the ones that trigger larger bursts of events, i.e., the learning patterns that engage users to perform long sequences of closely related learning actions in short periods of time. Figure 4 shows the popularity and burstiness of the 50 most popular learning patterns, sorted in decreasing order of popularity, as well as a scatter plot of popularity against burstiness for all the inferred patterns. Here, we compute the burstiness as the expected number of learning events triggered by self-excitation during the first month after the adoption of the pattern, using Eq. 8. Figure 4 reveals that burstiness is not correlated with the popularity of each pattern. On the contrary, even among the top 20 most popular patterns, several learning patterns trigger on average less than 0.5 follow-up questions. It is also worth noticing that there is a small set of learning patterns which are much more popular than the rest.
In particular, the most popular learning pattern, which is related to Web design, captures approximately 12% of the attention of Stack Overflow users, and the 20 most popular learning patterns gather more than 60% of the popularity. Moreover, Figure 4c highlights examples of learning patterns that are very popular and bursty (Web design); examples of bursty learning patterns that are not very popular (machine learning); and learning patterns that are neither popular nor bursty (UI libs). Table 1 shows the top-20 most probable words in the seven learning patterns highlighted in Figure 4c.

User behavior. In this section, we use our model to identify different types of users and derive insights about the learning patterns they use over time, as well as the evolution of their interests. Two natural questions emerge in this context: (i) do users stick to just a few learning patterns for all their tasks, or do they explore a different pattern every time they start a task? And, (ii) how long do they commit to their chosen tasks? First, we visualize the inferred intensities for two specific real users, among the several that we found, in Figures 6a-6b. These are examples of two very distinctive behaviors:

– Explorers: they shift over many different learning patterns and rarely adopt the same pattern more than once. For example, the user in Figure 6a adopts over 10 patterns in less than a year, and rarely adopts the same learning pattern more than once.

– Loyals: they remain loyal to a few learning patterns over the whole observation period. For example, the user in Figure 6b asks questions associated to two learning patterns over a period of 4 years and rarely adopts new learning patterns.

We investigate to what extent we find explorers and loyals throughout Stack Overflow at large.
To this end, we compute the user base intensities, $\mu_u$, which can be viewed as the number of new tasks that a user starts per month, and the distribution of the total number of learning patterns adopted by each user over the observation period. Figures 5a and 5b summarize the results, showing several interesting patterns. First, there is a high variability across users in terms of new-task rate: while most users start one to two new tasks every month, there are users who start more than 8 tasks monthly. Second, while approximately 5% of the users remain loyal to at most 5 learning patterns and another 10% of the users explore more than 25 learning patterns over the 4 years, the average user ($\sim$87%) adopts between 5 and 25 patterns.

Figure 5: User behavior: (a) the inferred user base intensities $\mu_u$ (tasks/month), (b) the number of learning patterns adopted by the users over the 4 years, and (c) the average time (in months) users spent on the completion of their tasks.

Finally, we investigate how long users commit to their chosen tasks. To answer this question, we compute the average time between the initial and the final event for each task of our users. Figure 5c shows the distribution of the average time spent per task. Here we observe that while approximately 10% of the user tasks are concluded in less than a month, most users (over 75%) spend one to four months to complete a task.

5 Conclusions

In this paper, we proposed a novel probabilistic model, the Hierarchical Dirichlet Hawkes Process (HDHP), for clustering grouped streaming data. In our application, each group corresponds to a specific user's learning activity.
The clusters correspond to learning patterns, characterized by both content and temporal information and shared across all users. We then developed an efficient inference algorithm, which scales linearly with the number of users and learning actions, and accurately recovers both the pattern associated with each user's learning action and the model parameters. Our experiments on large-scale data from Stack Overflow show that the HDHP recovers meaningful learning patterns, both in terms of content and temporal dynamics, and offers a characterization of different user behaviors. We remark that the proposed HDHP could be run within the learning platform in an online fashion to track users' interests in real time. With this, one could obtain both a characterization of the different user behaviors and recommendations on questions that might be of interest at any given time. Finally, although here we focused on modeling online learning activity data, the proposed HDHP can be easily used to cluster a wide variety of streaming data, ranging from news articles, in which there is a single stream (group) of data, to web browsing, where one could identify groups of websites that provide similar services or content.

Figure 6: Real-world examples of user behavior. An explorer user (panel (a)) shifts over many different learning patterns over time, while a loyal user (panel (b)) sticks to a small selection of patterns.

References

[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media, 2008.
[2] A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. J. Smola, and E. P. Xing. Online inference for the infinite topic-cluster model: Storylines from streaming text. In AISTATS, 2011.
[3] A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, 2008.
[4] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207, 2005.
[5] D. M. Blei and P. I. Frazier. Distance dependent Chinese restaurant processes. The Journal of Machine Learning Research, 12:2461–2488, Nov. 2011.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[7] O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007.
[8] C. Carvalho, M. S. Johannes, H. F. Lopes, and N. Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010.
[9] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes, Vol. II. Probability and its Applications (New York). Springer, New York, second edition, 2008.
[10] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In ACM SIGKDD, 2015.
[11] M. Farajtabar, N. Du, M. Gomez-Rodriguez, I. Valera, L. Song, and H. Zha. Shaping social activity by incentivizing users. In NIPS, 2014.
[12] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.
[13] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[14] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[15] A. Smith, A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer Science & Business Media, 2013.
[16] M. A. Stephens. Goodness of Fit with Special Reference to Tests for Exponentiality. Stanford University, 1978.
[17] Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Müller, and S. Walker, editors, Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[19] I. Valera and M. Gomez-Rodriguez. Modeling adoption and usage of competing products. In ICDM, 2015.
[20] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes. In ICML, 2013.

Table 1: The 20 most probable words for the seven patterns highlighted in Figure 4c.

'Web design': jquery javascript html php css ajax jquery-ui json forms arrays asp.net html5 jquery-mobile mysql dom regex jquery-plugins internet-explorer jquery-selectors wordpress
'SQL': sql mysql sql-server php sql-server-2008 database tsql oracle postgresql sql-server-2005 database-design join stored-procedures c# select sqlite sql-server-2008-r2 java performance datetime
'iOS': ios objective-c iphone xcode cocoa-touch ipad uitableview cocoa core-data osx ios4 ios5 uiview uitableviewcell uiviewcontroller ios7 ios6 uinavigationcontroller uiscrollview nsstring
'Python': python numpy python-2.7 matplotlib django pandas python-3.x scipy tkinter flask sqlalchemy list arrays wxpython regex dictionary multithreading osx import google-app-engine
'Version control': git svn github version-control mercurial eclipse tortoisesvn merge branch repository ssh bitbucket xcode git-branch commit git-svn osx windows java gitignore
'Machine learning' (ML): matlab python algorithm r machine-learning java matrix plot artificial-intelligence numpy arrays image-processing nlp statistics opencv math octave data-mining scikit-learn neural-network
'UI libraries': knockout.js javascript kendo-ui jquery asp.net-mvc knockout-2.0 kendo-grid durandal asp.net-mvc-4 knockout-mapping-plugin kendo-asp.net-mvc breeze single-page-application typescript mvvm asp.net-mvc-3 data-binding signalr json twitter-bootstrap

Appendix

A Details on the Inference

Given a collection of $n$ observed learning actions performed by the users of an online learning site during a time period $[0, T)$, our goal is to infer the learning patterns that these events belong to. To efficiently sample from the posterior distribution, we derive a sequential Monte Carlo (SMC) algorithm that exploits the temporal dependencies in the observed data to sequentially sample the latent variables associated with each learning action. In particular, the posterior distribution $p(b_{1:n} \mid t_{1:n}, u_{1:n}, \omega_{1:n})$ is sequentially approximated with a set of $|\mathcal{P}|$ particles, which are sampled from a proposal distribution that factorizes as
$$q(b_{1:n} \mid t_{1:n}, u_{1:n}, \omega_{1:n}) = q(b_n \mid b_{1:n-1}, t_{1:n}, u_{1:n}, \omega_{1:n})\, q(b_{1:n-1} \mid t_{1:n-1}, u_{1:n-1}, \omega_{1:n-1}),$$
where
$$q(b_n \mid \omega_{1:n}, t_{1:n}, u_{1:n}) = \frac{p(\omega_n \mid \omega_{1:n-1}, b_{1:n})\, p(b_n \mid t_{1:n}, u_{1:n}, b_{1:n-1})}{\sum_{b_n} p(\omega_n \mid \omega_{1:n-1}, b_{1:n})\, p(b_n \mid t_{1:n}, u_{1:n}, b_{1:n-1})}. \tag{14}$$
In the above expression, we can exploit the conjugacy between the multinomial and the Dirichlet distributions to integrate out the word distributions $\theta_\ell$ and obtain the marginal likelihood
$$p(\omega_n \mid \omega_{1:n-1}, b_{1:n}) = \frac{\Gamma\left(C^{n-1}_\ell + \eta_0 |\mathcal{W}|\right) \prod_{w \in \mathcal{W}} \Gamma\left(C^n_{w,\ell} + \eta_0\right)}{\Gamma\left(C^n_\ell + \eta_0 |\mathcal{W}|\right) \prod_{w \in \mathcal{W}} \Gamma\left(C^{n-1}_{w,\ell} + \eta_0\right)},$$
where $C^{n-1}_\ell$ and $C^n_\ell$ are the number of words in the pieces of content (or queries) $\omega_{1:n-1}$ and $\omega_{1:n}$, respectively, and $C^{n-1}_{w,\ell}$ and $C^n_{w,\ell}$ count the number of times that the word $w$ appears in queries $\omega_{1:n-1}$ and $\omega_{1:n}$, respectively. For each particle $p$, the importance weight can be iteratively computed as
$$w^{(p)}_n = w^{(p)}_{n-1}\, p\left(t_n, u_n \mid t_{1:n-1}, u_{1:n-1}, b^{(p)}_{1:n-1}, \{\alpha^{(p)}_\ell\}_{\ell=1}^{L}\right) Q^{(p)}_n, \tag{15}$$
where $w^{(p)}_1 = 1/|\mathcal{P}|$ and
$$Q^{(p)}_n = \sum_{b_n} p(\omega_n \mid \omega_{1:n-1}, b_{1:n})\, p(b_n \mid t_{1:n}, u_{1:n}, b^{(p)}_{1:n-1}). \tag{16}$$
Since the likelihood term depends on the user base intensities $\mu_u$ and the kernel parameters $\{\alpha^{(p)}_\ell\}_{\ell=1}^{L}$, following the literature in SMC devoted to the estimation of static parameters [7, 8], we infer these parameters in an online manner. In particular, we sample the kernel parameters from their posterior distribution up to, but not including, time $t$, and we update the user base intensities at time $t$ as $\mu^{new}_u = r \mu^{old}_u + (1 - r)\hat{\mu}_u$, where $\hat{\mu}_u$ is the maximum likelihood estimate of this parameter given the user history $\mathcal{H}_u(t)$ and $r \in [0, 1]$ is a factor that controls how much the updated parameter $\mu^{new}_u$ differs from its previous value $\mu^{old}_u$.

Algorithm 1: Inference algorithm for the HDHP

  Initialize $w^{(p)}_1 = 1/|\mathcal{P}|$, $K^{(p)} = 0$ and $L^{(p)} = 0$ for all $p \in \mathcal{P}$.
  for $i = 1, \ldots, n$ do
    for $p \in \mathcal{P}$ do
      Update the kernel parameters $\alpha^{(p)}_\ell$ for $\ell = 1, \ldots, L^{(p)}$ and the user base intensities $\mu^{(p)}_u$ for all $u \in \mathcal{U}$.
      Draw $b^{(p)}_i$ from (14).
      if $b^{(p)}_i = K^{(p)} + 1$ then
        Draw the new task parameters $\phi^{(p)}_{K^{(p)}+1}$ according to (13).
        Increase the number of tasks: $K^{(p)} = K^{(p)} + 1$.
        if $\phi^{(p)}_{K^{(p)}+1} = \varphi^{(p)}_{L^{(p)}+1}$ then
          Draw the triggering kernel for the new pattern $\alpha^{(p)}_{L^{(p)}+1}$ from the prior.
          Increase the total number of learning patterns: $L^{(p)} = L^{(p)} + 1$.
      Update the particle weight $w^{(p)}_i$ according to (15).
    Normalize the particle weights.
    if the effective sample size $\|w_i\|_2^{-2}$ falls below a threshold then resample the particles.
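The weight-normalization and resampling step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the particle representation, threshold value, and multinomial resampling scheme are assumptions.

```python
import numpy as np

def normalize_and_maybe_resample(weights, particles, rng, ess_threshold):
    """Normalize importance weights and resample (multinomially) when the
    effective sample size 1 / sum(w^2) drops below ess_threshold."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    if ess < ess_threshold:
        # Draw particle indices with probability proportional to weight,
        # then reset the weights to uniform.
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = [particles[i] for i in idx]
        w = np.full(len(particles), 1.0 / len(particles))
    return w, particles

rng = np.random.default_rng(1)
# Degenerate weights: one particle dominates, so the ESS (~1.23) is below
# the threshold and resampling is triggered.
w, parts = normalize_and_maybe_resample(
    [0.9, 0.05, 0.03, 0.02], ["a", "b", "c", "d"], rng, ess_threshold=2.0)
```

Multinomial resampling is the simplest choice; lower-variance schemes such as systematic or stratified resampling drop in at the same point.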
Algorithm 1 summarizes the overall inference procedure, which has complexity $O(P(U + L + K) + P)$ per learning action $i$ fed to the algorithm, where $L$ and $K$ are the total numbers of learning patterns and tasks inferred up to the $(i-1)$-th action. Note also that the for-loop across the particles $p \in \mathcal{P}$ can be parallelized, reducing the complexity per learning action to $O(U + L + K + P)$.
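As an illustration, the Dirichlet-multinomial marginal likelihood used in the proposal (the ratio of Gamma functions above) is best computed in log-space via `gammaln`; the vocabulary size and counts below are toy values.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(new_counts, old_counts, eta0):
    """Log Dirichlet-multinomial predictive probability of the new words
    given the previous per-word counts of a learning pattern:
    log Gamma(C_old + eta0*V) - log Gamma(C_new + eta0*V)
      + sum_w [log Gamma(c_new_w + eta0) - log Gamma(c_old_w + eta0)],
    where the new counts are the old counts plus the new document's."""
    old = np.asarray(old_counts, dtype=float)
    new = old + np.asarray(new_counts, dtype=float)
    V = len(old)
    return (gammaln(old.sum() + eta0 * V) - gammaln(new.sum() + eta0 * V)
            + np.sum(gammaln(new + eta0)) - np.sum(gammaln(old + eta0)))

# Sanity check: with an empty history, the marginal likelihood of a single
# word reduces to the uniform prior probability 1/V.
V, eta0 = 5, 1.0
one_word = np.zeros(V)
one_word[2] = 1
ll = log_marginal_likelihood(one_word, np.zeros(V), eta0)
print(np.isclose(np.exp(ll), 1.0 / V))  # -> True
```

Working in log-space avoids overflow in the Gamma functions, which grows quickly even for moderate word counts.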