Cross Device Matching for Online Advertising with Neural Feature Ensembles: First Place Solution at CIKM Cup 2016

Minh C. Phan, Nanyang Technological University, phan0050@e.ntu.edu.sg
Yi Tay, Nanyang Technological University, ytay017@e.ntu.edu.sg
Tuan-Anh Nguyen Pham, Nanyang Technological University, pham0070@e.ntu.edu.sg

ABSTRACT

We describe the 1st place winning approach for the CIKM Cup 2016 Challenge. In this paper, we provide an approach to reasonably identify the same users across multiple devices based on browsing logs. Our approach regards the candidate ranking problem as pairwise classification and utilizes an unsupervised neural feature ensemble approach to learn latent features of users. Combined with traditional hand-crafted features, each user pair's features are fed into a supervised classifier in order to perform pairwise classification. Lastly, we propose supervised and unsupervised inference techniques. The source code for our solution can be found at http://github.com/vanzytay/cikmcup; in case the repository is not accessible, please contact the authors.

Keywords: Entity Linking, Cross Device, User Matching, Clickstream Mining, Online Advertising

1. INTRODUCTION

Online advertising is a crucial and essential component of business strategy. For many advertising companies, the ability to serve relevant ads to users is a desirable and attractive prospect. However, it is commonly accepted that users may own or use multiple devices. In order to leverage the rich and sophisticated user profiles learned, it would be ideal to be able to identify the same person across devices. In this paper, we describe the 1st place winning approach in the CIKM Cup 2016 Challenge (https://competitions.codalab.org/competitions/11171). The problem at hand is intuitive and simple. Given browsing logs of users, generate a list of candidate user pairs that are predicted to be the same person. This can be seen as a ranking problem, which we are able to conveniently cast as pairwise classification.
In this paper, we describe our approach and findings.

CIKM Cup 2016, Indianapolis, USA. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

2. PROBLEM FORMULATION

In this section, we describe the dataset and experimental evaluation metrics.

2.1 Dataset

The dataset used in this competition was provided by DCA (Data Centric Alliance). It comprises user browsing logs with timestamps. Each user click is defined as an event with a timestamp and url. For a certain fraction of urls, meta-data such as title text is provided. All words in the dataset (in urls and title text) are hashed with MD5. This includes url paths (words in url paths have the same hash code if they appear in any meta-data). For supervised learning, the dataset includes 506,136 training pairs. The following table describes the characteristics of the competition dataset.

  # Training Pairs        506,136
  # Testing Pairs         215,307
  # Events             66,808,490
  # Domain Names          282,613
  # Tokens (Urls)      27,398,114
  # Tokens (Titles)     8,485,859
  # Unique Users          339,405
  # Unique Websites    14,148,535
  # Users in Train        240,732

Table 1: Dataset Characteristics

2.2 Evaluation Metric

The goal of the competition is to identify the same users across multiple devices. The evaluation metric used in the leaderboard ranking is the F1 measure. In each stage of the competition, 215,307 ground truths are used in a 50-50 split between the validation stage and the final test phase. In most cases, the optimal number of pairs submitted at each stage is a value determined by the contestants.

3. OUR METHOD

This section describes the framework used in this competition. Firstly, we perform candidate selection (a combination of TF-IDF baselines and neural language models) to select likely candidates.
For example, we take the k nearest neighbors from vector representations of users as selected candidates. Since this often results in a huge excess of candidate user pairs, we have to perform candidate filtering. We do that via supervised pairwise classification.

Figure 1: Overall Flow and Architecture

In general, we regard the problem as a pairwise classification problem. By performing pairwise classification, we use the likelihood scores f(u1, u2) ∈ [0, 1] to rank candidate pairs. Finally, we include supervised and unsupervised inference techniques to refine the final selected candidates. Figure 1 describes the architecture of our approach. From the 240,732 users in the training set, we sample 98,000 users as train1, which will be used to train the pairwise classifier. An additional 98,000 users, labeled train2, are used for training a supervised inference classifier.

3.1 Candidate Selection

We perform candidate selection via a myriad of user representation techniques. The first obvious representation to use is the TF-IDF vector representation of each user. To construct TF-IDF vectors, we construct n tokens for each level in the url hierarchy, i.e., a/b/c becomes [a, ab, abc]. For each user, we generate the k nearest neighbors and add them to the list of prospective candidates. In addition, we generate prospective candidates using models trained by Neural Language Models. To select an appropriate value for k, we studied the recall levels on the development set. Based on Figure 2, which shows the recall levels with varying k, we selected k = 18 as a trade-off between recall and classification performance.

3.2 Learning User Features

This section describes the feature engineering process. We used a combination of hand-crafted features along with unsupervised feature learning methods. Our approach relies heavily on unsupervised features instead of manual features.
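As an illustration of the candidate selection step in Section 3.1, the sketch below builds cumulative hierarchy tokens from url paths and retrieves each user's nearest neighbors. It is a minimal toy version under our own simplifications: it uses raw token counts rather than TF-IDF weights and a brute-force cosine search rather than an indexed KNN, and the helper names are ours, not the competition pipeline's.

```python
from collections import Counter
from math import sqrt

def hierarchy_tokens(url_path):
    # Expand a hashed url path a/b/c into cumulative level tokens [a, ab, abc],
    # mirroring the one-token-per-hierarchy-level construction in Section 3.1.
    parts = url_path.strip("/").split("/")
    tokens, prefix = [], ""
    for part in parts:
        prefix += part
        tokens.append(prefix)
    return tokens

def cosine(c1, c2):
    # Cosine similarity between two sparse token-count vectors (Counters).
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def top_k_neighbors(users, k):
    # users: {user_id: list of url paths}. Returns the k most similar other
    # users per user -- the prospective candidate pairs before filtering.
    bags = {u: Counter(t for url in clicks for t in hierarchy_tokens(url))
            for u, clicks in users.items()}
    return {u: [v for _, v in sorted(((cosine(bags[u], bags[v]), v)
                                      for v in bags if v != u),
                                     reverse=True)[:k]]
            for u in bags}

users = {"u1": ["news/sports/football"],
         "u2": ["news/sports"],
         "u3": ["shop/cart"]}
print(top_k_neighbors(users, 1)["u1"])  # ['u2']
```

In the paper's setting, k = 18 neighbors per user are kept and merged with the neighbors retrieved from the neural representations.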
It is good to note that our unsupervised features were dominant in our feature importance analysis, which can be derived from gradient boosting classifiers.

3.2.1 Hierarchically Aware Neural Ensemble

Our approach heavily exploits Neural Language Models to learn semantic user representations. (We tried LDA but found it lacking compared to modern language modeling approaches.) We use Doc2Vec, an extension of Word2Vec [2], to learn semantic representations of users. Intuitively, we simply treat the sequential click history of a user as words in a sentence. Therefore, a user and his url clicks are analogous to words in a document.

Figure 2: Recall on KNN results

The key idea of these models is that they learn based on global co-occurrence information. However, unlike words in a document, there is a rich amount of hierarchical information present in urls that can be exploited. Therefore, for each url hierarchical level h = {0, 1, 2, 3}, we learn a separate Doc2Vec model. In this case, h = 0 simply means we use only the domain name. It is good to note that there will be more duplicate tokens in a string of url sequences at lower hierarchical levels. To deal with this, we remove all consecutive duplicate items when training our Doc2Vec models. Furthermore, we also train a word-level Doc2Vec model by considering only word tokens from urls that include titles. Overall, the output of each Doc2Vec model is a unique semantic vector representation of each user. For our implementation, we used the Gensim package (https://radimrehurek.com/gensim/) for training our models. We use the default settings for all our models but steadily decrease the learning rate, and we trained additional models with varying window sizes (W = {5, 10}). We also experimented with the concat model that concatenates the vectors at the hidden layer instead of averaging. The dimensionality of each model is d = 300.
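The per-level corpus construction for the hierarchically aware ensemble can be sketched as follows. This is a minimal illustration with function names of our own choosing: it builds one pseudo-document per user at each url hierarchy level and removes consecutive duplicates, as described above.

```python
from itertools import groupby

def dedup_consecutive(tokens):
    # Remove consecutive duplicates, e.g. [a, a, b, a] -> [a, b, a], as done
    # before training each Doc2Vec model (Section 3.2.1).
    return [t for t, _ in groupby(tokens)]

def level_corpus(users, level):
    # Build one pseudo-document per user at a given url-hierarchy level;
    # level 0 keeps only the domain name. One Doc2Vec model is trained
    # per level h in {0, 1, 2, 3}.
    return {user: dedup_consecutive(["/".join(url.split("/")[:level + 1])
                                     for url in urls])
            for user, urls in users.items()}

clicks = {"u1": ["d.com/a/1", "d.com/b/2", "d.com/b/3"]}
print(level_corpus(clicks, 0))  # {'u1': ['d.com']}
print(level_corpus(clicks, 1))  # {'u1': ['d.com/a', 'd.com/b']}
```

Each resulting pseudo-document would then be fed to a Doc2Vec trainer (in the paper's case, Gensim with d = 300 and window sizes W ∈ {5, 10}), yielding one 300-dimensional vector per user per level.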
We also prune infrequent tokens/urls that appear fewer than 5 times in the entire dataset.

3.2.2 From Neural Models to Features

Given that each of our models produces user vectors of 300 dimensions, it would be impractical to use these vectors directly as features. Empirically, we found that using these vectors directly seemed to worsen performance. Thus, to generate features from these neural models, we use distance measures such as the manhattan, euclidean and cosine distance/similarity between each user pair as features. In addition, we add the order information between the two users as well. This is done by taking the k-nearest neighbours of a user A and computing the rank of user B's appearance in that list, and vice versa.

3.2.3 Time Features

To supplement our latent semantic features of users, we included time features. For each user, we build the user's usage pattern by counting the number of urls visited for each hour of the day (hourly pattern) and for each day of the week (weekly pattern). We calculate the absolute difference for the raw counts and the Kullback-Leibler divergence for the normalized counts (distributions) as the time features for each user pair. We also extract the number of overlaps between users' logs for each time interval of {5, 10, 60} minutes.

Figure 3: Feature Importance Analysis from XGB

3.2.4 Feature Importance Test

We generated a feature importance test from XGB (note that these features were generated mid-way for analysis purposes and are not the features from our final model). Figure 3 shows the visualisation of feature importance. Not surprisingly, our neural features are the most important features. f963, f964 and f960 are features from the Doc2Vec model (cosine similarity) while f962 and f961 are TF-IDF features.

3.3 Pairwise Classification

Next, we train a supervised pairwise classifier (XGB1) to predict the likelihood of each user pair being the same user.
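The distance features of Section 3.2.2 and the divergence-based time features of Section 3.2.3 can be sketched as below. This is a minimal self-contained illustration (function names are ours); the rank-based order features and interval-overlap counts are omitted for brevity.

```python
from math import sqrt, log

def distance_features(v1, v2):
    # Manhattan distance, Euclidean distance and cosine similarity between
    # two user embeddings -- the per-pair features of Section 3.2.2.
    manhattan = sum(abs(a - b) for a, b in zip(v1, v2))
    euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    cosine = dot / (n1 * n2) if n1 and n2 else 0.0
    return [manhattan, euclidean, cosine]

def kl_divergence(p, q, eps=1e-9):
    # Kullback-Leibler divergence between two normalized time patterns
    # (e.g. hourly click distributions), as in Section 3.2.3; eps guards
    # against empty bins.
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

print(distance_features([1.0, 0.0], [0.0, 1.0]))  # [2.0, 1.4142135623730951, 0.0]
```

In the full pipeline, one such feature group is produced per Doc2Vec model in the ensemble, and the resulting vectors are concatenated with the hand-crafted features before classification.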
Given our sampled train users (98,000), we sample positive and negative pairs from the knn candidate filtering (with k = 18) to train the model. Note that we deliberately sample negative examples from the pool of near neighbors instead of by random corruption. We found this significantly more effective, since it provides our classifier with more challenging negative examples and hence more discriminative ability.

3.3.1 Supervised Classifier

Our approach mainly used XGBoost [1], a gradient boosting technique that is prevalent in the winning solutions of many machine learning competitions. We tried several other classifiers such as Random Forests and the standard Multi-Layer Perceptron (MLP). In general, we found that the MLP did not come close to the performance of tree-based ensemble classifiers. Our experimental evaluation section describes and reports the results for all the algorithms that we tried.

3.4 Inference Techniques

In this section, we introduce the inference techniques used to increase the performance results. We view the original user linking problem as a clustering problem in which each group of same users is associated with a cluster. If a link is created for two users between clusters A and B, we also make new links for the other users across clusters A and B.
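The hard-negative sampling scheme of Section 3.3 can be sketched as follows: positives are ground-truth matches found among a user's near neighbors, and negatives are the remaining neighbors rather than randomly corrupted pairs. This is a toy illustration (names are ours); in the actual pipeline the resulting labeled pairs, together with their features, would train the XGB1 classifier.

```python
def build_training_pairs(knn, truth):
    # knn: {user: list of k nearest-neighbor candidates};
    # truth: set of canonical (sorted) ground-truth matched pairs.
    # Negatives come from the near-neighbor pool, so they are "hard":
    # similar users that are nonetheless not the same person.
    pos, neg = set(), set()
    for u, candidates in knn.items():
        for v in candidates:
            pair = tuple(sorted((u, v)))
            (pos if pair in truth else neg).add(pair)
    return sorted(pos), sorted(neg)

knn = {"a": ["b", "c"], "b": ["a", "c"]}
truth = {("a", "b")}
print(build_training_pairs(knn, truth))
# ([('a', 'b')], [('a', 'c'), ('b', 'c')])
```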
Specifically, we sort the candidate pairs based on XGB1's confidence score and take the sorted pairs in turn to perform the merging. The details are shown in Algorithm 1.

Figure 4: Clustering user group

Algorithm 1: Inference by cluster merging
  input : sorted pairs S, inference method I
  output: extended pairs E
  E = ∅
  Assign each user in S a unique cluster label
  for each pair (u, v) ∈ S do
      l_u = cluster_label(u)
      l_v = cluster_label(v)
      if l_u ≠ l_v and I.cond_satisfied(l_u, l_v) then
          merge_clusters(l_u, l_v)
          for i ∈ cluster(l_u) do
              for j ∈ cluster(l_v) do
                  E ← (i, j)
  return E

In order to control the expansion of the created clusters, we propose the following two inference methods.

3.4.1 Supervised Inference

For supervised inference, we only merge two clusters if the average vote over all pairs across the two clusters is greater than a threshold α = 0.5. The votes are the confidence scores for two users to be merged. In order to obtain this confidence score, we train another pairwise classifier, referred to as XGB2. Note that XGB2 is different from XGB1. The training data for XGB2 is the extended pairs from Algorithm 1 with a 'blind' inference that always allows the merging of two clusters.

3.4.2 Unsupervised Inference

Our empirical analysis of the training set shows that most users own 3-5 devices on average. Based on the simple intuition of (a, b) ∧ (b, c) → (a, c), we generate new candidate pairs. Naturally, we limit the total size of clusters to be merged to a threshold β = 5, since we regard such an inference process as risky: we are assuming that all pairs used in the inference are correct. Our empirical observations on local testing show that such inferencing may run the risk of decreasing the F1-score.
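Algorithm 1 amounts to a cluster-merge loop over the confidence-sorted pairs. A minimal Python sketch, assuming a `cond_satisfied` callback that stands in for the supervised or unsupervised merge condition:

```python
def infer_by_merging(sorted_pairs, cond_satisfied):
    # Algorithm 1: scan candidate pairs in descending confidence order;
    # when a pair bridges two clusters and the inference condition holds,
    # merge the clusters and emit every new cross-cluster link.
    label = {}    # user -> current cluster label
    members = {}  # cluster label -> users in that cluster
    extended = []
    for u, v in sorted_pairs:
        for x in (u, v):
            if x not in label:      # each user starts in its own cluster
                label[x], members[x] = x, [x]
        lu, lv = label[u], label[v]
        if lu != lv and cond_satisfied(members[lu], members[lv]):
            extended.extend((i, j) for i in members[lu] for j in members[lv])
            for x in members[lv]:   # relabel and absorb the second cluster
                label[x] = lu
            members[lu].extend(members.pop(lv))
    return extended

# 'blind' inference (always merge); transitivity links a-c via b:
pairs = [("a", "b"), ("b", "c")]
print(infer_by_merging(pairs, lambda c1, c2: True))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The unsupervised condition of Section 3.4.2 would be, e.g., `lambda c1, c2: len(c1) + len(c2) <= 5` (the β = 5 size cap), while the supervised one would average XGB2 votes across the two clusters against α = 0.5.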
However, when tuned correctly, i.e., when the right threshold on candidate pair scores is selected for inclusion in the inference process, it can lead to a major increase in performance.

Figure 5: Overview of Final Candidate Selection

3.4.3 Final Candidate Selection

We use the top 45,000 sorted pairs together with the supervised inference as the input for Algorithm 1 and obtain about 59,000 extended pairs. Similarly, using the top 80,000 pairs with the unsupervised inference gives us 97,000 extended pairs. We combine the pairs extended from supervised inference, unsupervised inference and the sorted pairs from the pairwise classifier (XGB1), in that order. The final pairs for submission are the first 120,000 unique pairs in the combined list. Figure 5 illustrates the selection procedure.

4. EXPERIMENTAL EVALUATION

This section outlines the experimental evaluation on our local test set as well as the competition leaderboard.

4.1 Local Development Tests

For our model development, we experimented with several different classifiers. Table 2 shows the results on our local development set at F1@120K which, according to our empirical observations, is an optimal split for the number of pairs to submit. For the sake of completeness, we included both supervised and unsupervised techniques.

  Method           F1@120K
  TF-IDF + KNN       0.065
  Doc2Vec + KNN      0.072
  LSTM-RNN           18.22
  MLP                25.63
  Random Forest      38.32
  XGB                41.60

Table 2: Evaluation on Local Development Set

Aside from the staple gradient boosting methods, we experimented with deep learning techniques, namely the standard Multi-layer Perceptron (MLP) and Recurrent Neural Networks. For MLP, XGB and RF, we trained with the exact same features described in earlier sections. We also experimented with using user clicks as sequential inputs to a long short-term memory recurrent neural network (LSTM-RNN) as a binary classifier with a final sigmoid layer. (This was generally unstable and takes a long training time; without satisfactory results, we quickly abandoned this approach. However, we note that the final output state might be a useful feature.)
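The final candidate selection of Section 3.4.3 reduces to a priority-ordered merge with deduplication: concatenate the three candidate lists in order and keep the first 120,000 unique pairs. A minimal sketch (names and toy data are ours):

```python
def combine_candidates(pair_lists, limit=120000):
    # Concatenate candidate lists in priority order (supervised inference,
    # then unsupervised inference, then XGB1-sorted pairs) and keep the
    # first `limit` unique pairs, treating (u, v) and (v, u) as the same.
    seen, final = set(), []
    for pairs in pair_lists:
        for u, v in pairs:
            key = tuple(sorted((u, v)))
            if key not in seen:
                seen.add(key)
                final.append(key)
                if len(final) == limit:
                    return final
    return final

sup = [("a", "b")]
unsup = [("b", "a"), ("c", "d")]   # ('b', 'a') duplicates ('a', 'b')
ranked = [("e", "f")]
print(combine_candidates([sup, unsup, ranked], limit=3))
# [('a', 'b'), ('c', 'd'), ('e', 'f')]
```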
It is clear that unsupervised techniques such as the TF-IDF + KNN baseline do not perform well. It is good to note that Doc2Vec performs slightly better than TF-IDF, which is expected. Amongst supervised classifiers, our XGB performs best. We quickly abandoned MLP due to time and hardware constraints, since we could not afford the computational resources to intensively tune the hyperparameters. Similarly, we did not perform hyperparameter tuning for RF and XGB except for the number of trees. Our final XGB used about 3500 trees.

4.1.1 On Training Time

The training time for XGB took about 4 hours on our machine. MLP was significantly faster at about 30 minutes using GTX980 GPUs. However, LSTMs, due to needing to process long sequences of user streams, took 2 days even with GPUs. Generating prediction scores for users also took a non-trivial amount of time, often amounting to 1-2 hours.

4.2 Final Competition Evaluation

Finally, we report the final competition results. In total, we submitted 120,000 pairs for evaluation as a trade-off between precision and recall.

  User          Team            F1       Precision   Recall
  ytay017       NTU             0.4204   0.3986      0.4445
  ls            Leavingseason   0.4167   0.3944      0.4416
  dremovd       MlTrainings 1   0.4120   0.4031      0.4213
  bendyna       MlTrainings 2   0.4017   0.3659      0.4452
  namkhantran   -               0.3611   0.3323      0.3954

Table 3: Final Evaluation Results

Our final submission consists of a single classifier, XGB with 3500 trees, using the above-mentioned features. We applied the inference techniques mentioned above to generate the final submission file. It is good to note that unsupervised inference increased our final F1 score from 0.4155 to 0.4204, which was critical to winning the competition.

5.
CONCLUSION

We proposed a framework for cross-device user matching. Our system involves a novel application of neural language models to clickstream information. We increased the F1 of the baseline by 751%. The major contributing features are, in fact, obtained in an unsupervised manner. Outside a competition setting, we believe there is tremendous potential for deep learning to be applied in this domain.

6. REFERENCES

[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 785-794, 2016.
[2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111-3119, 2013.