Privacy-preserving classifiers recognize shared mobility behaviours from WiFi network imperfect data

Orestes Manzanilla-Salazar∗ and Brunilde Sansò†
Email: ∗orestes.manzanilla@polymtl.ca, †brunilde.sanso@polymtl.ca

Abstract: This paper proves the concept that it is feasible to accurately recognize specific shared human mobility patterns based solely on the connection logs between portable devices and WiFi Access Points (APs), while preserving users' privacy. We gathered data from the Eduroam WiFi network of Polytechnique Montreal, omitting device tracking and physical layer data. The behaviours we chose to detect were the movements associated with the end of an academic class, and the patterns related to the short break periods between classes. Stringent conditions were self-imposed in our experiments. The data is known to contain errors and noise, and to be susceptible to information loss. No countermeasures were adopted to mitigate any of these issues. Data pre-processing consists of basic statistics used to aggregate the data in time intervals. We obtained accuracy values of 93.7 % and 83.3 % (via Bagged Trees) when recognizing behaviour patterns of breaks between classes and end-of-classes, respectively.

Index Terms: Wireless networks, movement patterns, indoors behaviour, machine learning, supervised learning.

I. INTRODUCTION

Public indoor facilities such as hospitals, universities, airports and malls often offer WiFi access services to visitors and employees. One example of such networks is the "Eduroam" network service offered in many academic institutions worldwide, providing Internet access to faculty, employees and students [1]. People leave footprints in the data generated by network administrative systems and in the portable devices they carry as they move through the space.
We classify these footprints as:
• Device-free network-gathered: information about individuals is obtained whether or not they carry a device connected to the network, by observing the perturbations caused in the wireless signals. People's bodies absorb or reflect the electromagnetic waves, affecting measured values of the physical layer (e.g., signal strength, phase, etc.) [2] [3] [4].
• Device-gathered: this refers to changes registered in the client devices connected to the network, either by active sensors (GPS, accelerometers, compass, light sensors) or connection data (e.g., time of flight, signal strength, etc.) [5] [6] [7].
• Device-enabled network-gathered: these are changes registered by fixed APs or Base Stations (BSs) to which a device (moving node) is connected. This encompasses both connection characteristics (e.g., time of flight, signal strength, etc.) and the status of the connection between a device and a particular AP or BS.

Fig. 1: Count of devices connected to each AP in one floor of the building.

O. Manzanilla-Salazar is with the Department of Electrical Engineering of Polytechnique Montréal, QC, Canada (e-mail: orestes.manzanilla@polymtl.ca).
B. Sansò is with the Department of Electrical Engineering of Polytechnique Montréal, QC, Canada (e-mail: brunilde.sanso@polymtl.ca).
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

This paper deals with the third type of footprint. The objective of this work is to provide a proof of concept that ML methods can be applied to detect human behaviour under stringent data gathering conditions. In particular, even though we make inferences from footprints of the third class, we exclusively use the Eduroam logs of the devices connected to each AP, in two floors of one building of Polytechnique Montreal in Canada (Fig. 1).
We do not use physical layer measurements of any kind, nor do we track the user. We show that it is possible to push the limits of both the austerity of the data used and the simplicity of the inference techniques, while maintaining the ability to infer activities related to the movement of people. In this paper, the activities chosen are the movement patterns related to the moment in which a class has finished, and those related to periods of time when students have a break between classes.

Fig. 2: Subsystems. (Indoors human movement leaves footprints in wireless network activity, which generates data used by the machine learning model, which in turn infers the human movement.)

We identify three systems (Fig. 2):
• Indoors human movement: this refers to the humans transiting the space under study and their activities, in particular those implying a movement big enough to trigger a hand-over in a device connected to the WiFi network, as is done in [8]. As mentioned above, in our case the behaviours of interest in this system are circumscribed to the way academic activities are scheduled in the university. Another element that gives structure to the trajectories followed by people within the building is its physical structure and layout.
• Wireless network: this encompasses the set of APs in the floors considered in the data gathering process, the set of portable or mobile devices carried by people within the building and connected to the network, and the components of the Cisco Connected Mobile Experiences (CMX) solution, which is used by the wireless network administrator in Polytechnique Montreal.
• Machine learning models: this is constituted by the supervised learning models used to infer activities in the human system from data of the wireless network system.
Learning about human activities has been proved feasible for small-scale activities such as raising an arm [2], larger trajectories [9] [10], activities involving groups of people in large spaces [6] [11], and routines [12].

Why is it important to infer patterns of human behavior from wireless data? Because when people's health [13], integrity and security are put at risk [14], the ability to predict, detect and understand human behavior enables timely reactions and more precise mitigation strategies. Detecting human behavior patterns is also useful for crowdsensing, recommendation systems and social networks [15], as well as for the classification and understanding of the use of physical spaces [8] [16] [17] [18] [19].

We show in this paper a new strategy to study human activity via patterns in WiFi AP logs [5] [20]. Instead of focusing on differentiating spaces by how they are used [8], or on individual mobility, our goal is to detect specific patterns of simultaneous behaviour in groups of people. The approach takes into consideration the following premises:
1) The data required should be easy to obtain in widespread WiFi networks.
2) Pre-processing and machine learning techniques, as well as storage requirements, should be as simple and cheap as possible.
3) Both the data required for the techniques proposed and the information that is inferred from it should respect users' privacy.

In order to comply with our premises, we intentionally refrain from:
• Analysing individual data.
• Applying filters to eliminate redundant data.
• Applying filtering to eliminate data from individuals that do not follow the specific pattern that is being detected (e.g., not following the academic hourly schedule).
• Implementing methods to identify and correct errors in the data.
• Applying techniques to recognize and repair cases of incomplete data.
• Using any context information from the infrastructure or the date and time of a measurement, when using a trained machine learning model to detect a pattern.
• Using wireless Channel State Information (CSI).

We show that, notwithstanding the stringent self-imposed conditions, it is feasible to detect specific simultaneous behaviour patterns, such as the end of a class of an academic course, or that people are in one of the short breaks between classes. To the best of our knowledge, this is the first time that movement patterns have been successfully recognized based on aggregated device counts per AP in a WiFi network, without CSI or physical layer values.

The rest of the article comprises a review of the related literature, a section on the issues and challenges involved, and a detailed description of the three systems mentioned: the human behaviour, the WiFi network, and the machine learning models used to infer human behaviour (Fig. 2). We then present our results and analysis, and end the article by discussing our conclusions and future work.

II. RELATED WORK

In order to put our work into context, we briefly mention the main approaches to the study of spatio-temporal patterns in human behaviour, as well as the ways in which data has been gathered in other empirical studies. Lastly, we mention some of the ways in which this kind of data has been analysed in previous works.

A. Spatio-temporal patterns in human behavior

Spatial information regarding human activity can be classified as pose [21] and micro-activities [2] [13] [22], meaning the change of the shape of the space the human body occupies at a particular moment, and trajectories, or the history of location through time. Human trajectories have enough regularity to make it feasible to look for patterns in them [9] [10]. Also, human routines have enough structure to allow the analysis of patterns [12].
The focus of the analysis of human movement can be placed on individual patterns (generally trajectories or tracking data), or on groups of people (crowds or flocks) who exhibit a similar simultaneous behavior.

1) Trajectory analysis: The study of patterns in trajectories [9] [10] requires analyzing a time series of location coordinates or indicators of some kind. Recently, deep learning techniques have been used to represent trajectory patterns [23] [24] [25]. Location represents a great concern in terms of privacy, which has motivated researchers to create algorithms to anonymize location data [26]. There is, however, public concern about the possibility that an individual's anonymized data can be used to reconstruct the identity of the person (de-anonymization) [27].

2) Crowd analysis: Crowd analysis has gained attention because stampedes in places with a high density of people turn the crowd into an additional risk for itself, besides the threat that can be assumed to originate the stampede. Some work has been done on the prediction of the formation of crowds [28]. A recent and thorough review of the empirical studies of crowds can be found in [11].

3) Data gathering methods: Data from records of people's activities, surveys, interviews and video surveillance, as well as from experiments with human volunteers, evacuation drills, natural disasters, virtual reality, and animals [29] [11], have helped build models to understand and predict the behavior of crowds, in addition to network data. We group the sources of data in the following categories:
• Video-based: crowd behavior has been analyzed by extracting patterns from videos [30]. Machine learning techniques allow the processing of videos from one or multiple cameras to identify both individual and crowd movement [31], as well as to perform pose estimation [21] and recognition of frames where relevant events take place [32].
Deep learning techniques can recognize a posture simultaneously captured from different viewpoints [33], and are also used for people identification based on range sensors such as LiDAR and RGBD cameras, which provide data in the form of 3D point clouds [34]. One of the main issues with video is that lighting, obstruction, and perspective can affect the quality of the analysis. 3D point clouds are not affected by light conditions, but positioning and obstacles can still represent a challenge, besides being a more expensive technology.
• Network-based: GPS traces are one of the most obvious sources of location data. Traces from taxi drivers have been used to study patterns in their mobility. This kind of trace, however, is of limited application in indoor and urban environments [35], as obstacles can produce a shadowing effect. Devices' GPS traces or data from GTFS networks can be used along with additional information, like WiFi traces [36], to overcome these issues. Applications installed on the client side can also provide information in the form of surveys, like DataMobile [37], which collects travel information. Gathering users' data from their devices, however, has the limitation of requiring users to share information regarding their position and trajectories, which users frequently avoid [38]. Monitoring centered in the network (fixed nodes) overcomes this problem, as the information gathering becomes a passive and automatic process regularly carried out by Wi-Fi or GSM networks [39]. This kind of data gathering has low costs, and the pervasive nature of wireless networks opens the possibility of collecting large quantities of data easily available in the administration software of public places that offer wireless connection [40].
One of the problems to be overcome with network data is the low precision of location estimations [41] [42], though it can be improved by considering it along with other types of information [35]. This motivates, in some studies, the use of symbolic spaces instead of geographical coordinates [43]. One example of a symbolic location is the identification of the AP to which a device is connected. Some work is required in order to find the relationship between connections to APs and specific physical spaces [8]. WiFi and Bluetooth connections between users' devices can give information about the social network among the people present in a particular area, which can complement information about their behavior. Developments in multi-hop networks [44] have brought increasing attention to this kind of data. Studying movement within a WiFi network via logs of connections established with APs indirectly provides low-cost, large-scale information about realistic natural trajectories. In [8], the same raw data we use in our research is used to infer the kind of use that is given to the space where each AP is located. Nine categories of places are found via clustering. The focus there is on finding patterns in places, whereas we find patterns in time. Because of this, in [8] the information is aggregated per day, instead of per minute as in our work.
• Sensorless: the effects that the presence of human bodies has on signal strength make it feasible to estimate the number of people in an area [45], without requiring every person in it to carry a device. Also, signal strength combined with phase alterations, time-of-flight and other signal characteristics allows the detection of activities [21] and micro-activities [2] [13] [22].
• Sensor-based: some research has been done to identify human activity using sensor data, like signal strength (the use of the wireless signal as a sensor), device accelerometer, light sensors, compass, etc. [7].
Our research goes in the opposite direction of these methods, for our premise is to take advantage of simple and widespread existing technology, instead of working with additional and more sophisticated data gathering equipment and techniques.

B. Network Analysis

Various data analysis and inference techniques have been proposed and used in order to extract spatio-temporal patterns in the behaviour of people from network data. In the analysis of WiFi data, looking at changes in the number of connections to a network's APs from the point of view of the frequency domain allows the identification of the most important periodicities in the data [46]. Besides the techniques that have been mentioned for the modelling and classification of traces, there are other relevant tasks. One is the prediction of patterns related specifically to individuals at a particular AP, as shown in [7], where the length of stay of a device at a particular AP is predicted based on Wi-Fi data along with information from other sensors in the device.

We classify the focus of the analysis of network traces in three major categories:
1) Focused on spaces: the analysis of mobility patterns can be aimed at studying how spaces are used [8] [16] [17] [18] [19]. The subjects about which information is inferred are physical spaces.
2) Focused on individuals: on the other hand, the subjects about which inference is made can be the individuals moving through space. In [5], a technique is proposed to classify individual behavior based on the hand-over patterns of users' devices among various antennas in a GSM or Wi-Fi network. In [20], wireless traffic data is used to build models of individual users' mobility. A combination of focus on both individuals and spaces is observed in tasks like that proposed in [7] (predicting the length of stay of a particular individual connected to a specific AP).
3) Focused on groups: here the focus is understanding movement related to specific behavior shared by groups, flocks or crowds. The analysis of WiFi logs, for example, permits the identification of groups of individuals moving "together", or flocks, within indoor environments [6]. We position our research in this category, as we take advantage of the fact that the aggregation of individual data, which is beneficial from the point of view of privacy, permits the detection of shared behaviours.

III. ISSUES AND CHALLENGES

We now bring to attention the two main issues inherent to the nature of our research and its premises:

A. Privacy issues

One of the main challenges in the use of machine learning and statistical inference to analyze people's data is the need for a balance between the protection of privacy and innovation [47]. In the case of AP connection logs, users' MAC addresses, the device model identification, and the user's network identification are sensitive information, which makes it necessary to implement some kind of anonymization technique. Not only does this eliminate the collection of socio-demographic information [8], but it also precludes using knowledge about the roles of a person in an area of study. Some researchers reconstruct the device information (distinguishing between mobile devices and laptops) [8], users' roles (in an academic setting the roles could be those of students, faculty, staff, visitors, etc.) [17] or even the identity of the users [48]. For some groups, however, inference from anonymized data should be avoided and even regulated for privacy reasons [47]. More details on de-anonymization can be found in [49]. Refraining from using sensitive data, as well as from reconstructing (de-anonymizing) it using ML, is the most conservative approach.

B. Quality of information challenges

The quality of the AP log data depends on the following:
• The status of the WiFi activation of the mobile or portable device. Users have the choice of turning off the WiFi antenna of their devices, or the whole device can be turned off, which makes the device invisible to the network. We consider this to be a problem of incompleteness of the data.
• Device-specific behaviour can vary depending on the battery level and other settings, affecting the time that the WiFi interface passes without transmitting to the network. During this time the device is also invisible to the network, which is another contribution to the incompleteness of the data.
• Because of the limited range of connection to the APs, devices can get out of range, also becoming invisible to the network. Recognizing a device coming back after a departure from the network as the same device, in the context of hashed MAC addresses, requires keeping in memory a record of all the hashes generated. If the hashes are randomly generated each time a device connects again to the network, adding the new logs to the previous history of the device becomes a challenging task. We consider giving different identification codes to the same device at different moments to be a source of errors in the data.
• The hand-over of the connection of a device to the network, from one AP to another, is triggered in general by problems in the quality of the connection and the availability of a better AP to connect with. Interpreting the hand-over as an event caused by the movement of the device will introduce errors in the data, as it can also be caused by obstructions or by the so-called Ping-Pong effect, which arises when the connection of a device is quickly handed over between two or more APs that are within range, even though the device might be geographically static [8]. We consider this to be a source of noise in the data.
• The AP with which a device has established a connection might not be the one physically closest to it. Therefore, symbolically assigning the location of the AP with which a connection is established as an approximation of the general location of a device might be inaccurate. In fact, a connection might be established with an AP that is on a different floor, or in a different room of the building. This phenomenon can also be considered a source of noise in the data.

We decided to accept the aforementioned issues and challenges as constraints. In the following section we go into detail about the way we approach the human activities, the network, and the use of pre-processing and ML to take advantage of the data in spite of the limitations and premises.

IV. MODELS PROPOSED

We now introduce models of the three systems we have defined in our research (Fig. 2):

A. Modelling of human behavior

The main elements of the human system are the people transiting the two floors considered for the study. These comprise students, faculty, administrative staff and employees, and visitors. Students, faculty, and occasionally members of administrative staff and employees, are considered to have their activities affected at various levels by the weekly schedule of classes. Students enrol in courses whose classes take place within pre-established academic blocks. In the morning of work days, these blocks start at 8:30 am, 9:30 am, 10:30 am and 11:30 am, and have a duration of 50 minutes, leaving an interval of 10 minutes that can be used either as a break between two hours of the same course, or to go to the next class. Afternoon blocks start at 12:45 pm, and then on each hour until 4:45 pm, with the same duration. We consider only data from 8:30:00 am to 11:29:59 am, as most classes take place during the morning hours. Also, we exclude weekends, as no regular classes are scheduled on them.
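As a worked illustration of the schedule arithmetic above, the following minimal Python sketch derives the break intervals that fall inside the morning window considered. The function and constant names are hypothetical and not taken from the authors' scripts; only the block start times, the 50-minute duration, the 10-minute break, the 8:30:00 to 11:29:59 window and the exclusion of weekends come from the text.

```python
from datetime import date, datetime, time, timedelta

# Values stated in the text; all names here are illustrative assumptions.
BLOCK_STARTS = [time(8, 30), time(9, 30), time(10, 30), time(11, 30)]
BLOCK_LENGTH = timedelta(minutes=50)
BREAK_LENGTH = timedelta(minutes=10)
WINDOW_START = time(8, 30, 0)
WINDOW_END = time(11, 29, 59)


def in_analysis_window(ts):
    """True for Monday-to-Friday time-stamps inside the morning window."""
    return ts.weekday() < 5 and WINDOW_START <= ts.time() <= WINDOW_END


def break_intervals(day):
    """Return (break_start, break_end) pairs whose start lies in the window."""
    breaks = []
    for start in BLOCK_STARTS:
        block_end = datetime.combine(day, start) + BLOCK_LENGTH
        if in_analysis_window(block_end):
            breaks.append((block_end, block_end + BREAK_LENGTH))
    return breaks


if __name__ == "__main__":
    for begin, end in break_intervals(date(2017, 4, 7)):
        print(begin.time(), "-", end.time())
```

Under these assumptions, each working day contributes three break intervals (starting at 9:20, 10:20 and 11:20) and three end-of-class instants inside the window; the block starting at 11:30 ends outside it.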
In our analysis, we discard data from Wednesdays, as on these days a great proportion of the faculty has departmental meetings in Polytechnique Montreal.

The human dynamics of interest in this system are those behaviours that, being shared by ideally large numbers of people, are associated with activities implying movements through the indoor spaces that are large enough to trigger a hand-over of the connection of a mobile or portable device. We focus on shared behaviours that can be easily labelled because they are ruled by the weekly academic schedule. The term "rule", however, could be too strong. In reality, faculty and instructors whose classes encompass two or more blocks are advised to leave the 10-minute interval between the end of a block and the start of the following one for students to take a break, but this suggestion can be ignored by some faculty or instructors. In some courses the break can be postponed for the sake of the continuity of the teaching process. When the 10-minute interval occurs between two different courses, students use this time to attend to personal needs and walk to the next class. There is no guarantee, however, that after the start of a break, or the end of a class, students will actually leave a classroom, nor that students will arrive exactly at the beginning of a block.

We defined two behaviours of interest:
• The intervals of break between two blocks of class, and
• The moment in which a class finishes.

The first one will allow us to associate data from a particular moment with a behaviour that is defined as having some duration. The second will allow the detection of a particular event in time (theoretically instantaneous).

TABLE I: Typical raw data sampled from AP logs.

Sample ID | UserName  | AP MAC Address | Sample time-stamp
1         | 5bac0b... | e0ed722608...  | 2017-04-07 15:51:07
1         | d704b5... | e934a10b73...  | 2017-04-07 15:51:07
2         | f059ea... | 5b50722d57...  | 2017-04-07 15:51:52
Alongside the presence of students and faculty following the schedule at a particular moment, we assume there will be human activity not ruled by this academic schedule, since administrative staff, employees, visitors and students who are not attending class may wander about in halls, stairs, bathrooms, food courts, as well as unoccupied classrooms, which will also be within the range of the APs.

B. Modelling of the network

The Eduroam WiFi network in Polytechnique Montreal is monitored via the Cisco CMX solution, which constitutes our main instrument for data gathering. A script was written to refresh the web interface that samples information from the devices connected to the APs of the network. One month of data, totalling 189,259 samples, was gathered, in sampling intervals averaging 12 seconds. Samplings are not regularly spaced in time because each time the refresh is requested, the system needs to receive data from all the APs, which imposes a variable delay. Each sample provided a table with all the connections active between devices and APs in the area under study. A total of 6,830,873 connection logs between devices and APs were retrieved.

A simple random hashing process was used to anonymize the addresses, device models and user identifications for each device. When a particular device receives a hash, it is remembered during the period over which the device maintains its connection. When it disconnects, the hash is "forgotten". When and if the device connects again to the network, a new random hash is generated.

We do not treat the data to fix issues like noise, errors and incompleteness. In particular, we refrain from the following:
• Implementing any method to differentiate between types of devices (de-anonymization is against our premises).
• Implementing any method to recognize when two devices belong to the same user.
• Detecting Ping-Pong sequences of hand-overs of a device.
• Implementing any method to recognize when a device connected at different times with different random hashes is, in fact, the same device.
• Using the location estimates or signal strength values that are given by the Cisco CMX solution in each sample of a connection between an AP and a device.
• Inferring whether or not a disconnected device has left the area of study.
• Inferring whether or not a hand-over has been triggered by movement.
• Estimating the geographical location of a device.
• Including information regarding the status of the APs.

Three examples of the raw data gathered are shown in Table I. The first two rows correspond to two devices connected to different APs, reported during sampling number 1, requested from the administrative software at 15:51:07. The third one is a different device from sampling number 2, which takes place some seconds later. We emphasize that when a new sampling is requested, both its consecutive ID and time-stamp will be associated with every row of data generated in that request. For each connection between a device and an AP that is active at that instant, one row will be generated, sharing this information, as is shown in Table I. The time-stamps will be used in order to aggregate the rows of data.

C. Modelling the learning process

The problem is posed as a supervised learning problem where each input instance corresponds to a vector of statistics related to APs, and the output or label for the interval is manually assigned for each of the experiments.

The input, or feature vector x_i, is formed by pre-processing the data provided by the network administration software to calculate the minimum, maximum, variance and average of the number of devices connected to each one of the 67 APs distributed in two floors of a building of Polytechnique Montréal (Fig. 1).
The raw data from AP logs is pre-processed as follows:
• The raw data is aggregated by each sampling time-stamp, counting the number of devices for each of the APs.
• The resulting table is further aggregated into one-minute basic statistics describing the variability of the device count within each minute. Maximum, minimum, average, standard deviation and variance are calculated over the count values of the samplings performed within the one-minute interval for each AP. Figure 3 shows one example of the behavior of the maximum of connections across an interval of time. Figure 4 shows the behavior of the average of connections in the same interval. Finally, Figure 5 shows the behavior of the standard deviation of the number of connections. In the APs from which these examples were chosen, there is a different behaviour roughly around 9:30 am, when a block of classes starts.
• Statistics for all APs are concatenated as a vector, which is the feature vector x_i that will be used as input for the machine learning models.

Fig. 3: Example of maximum of connections.
Fig. 4: Example of average of connections.
Fig. 5: Example of standard deviation of connections.

Each vector x_i describing the variability of the device counts within a one-minute interval is assigned a label y_i, according to the academic weekly schedule information, depending on the pattern we want to detect in each recognition task. The label has two possible values: P_T, which indicates the presence of the pattern, and N_T, associated with the absence of the pattern.

Various classical machine learning classification techniques are applied, estimating accuracy via 10-fold cross-validation. In a production environment, the fully trained model of choice is then used to infer whether the input shown to each classifier exhibits the pattern corresponding to the label. This implementation would allow the detection of the end of a class, or the detection of a break between classes,
without consulting the academic schedule, solely based on the behaviour of the device counts (Fig. 7).

In all our experiments, a majority of the data was labelled as N_T, producing a situation of class imbalance. We sub-sampled the larger dataset (randomly discarding, in each experiment, data from the set N_T) to deal with this problem.

The classifiers considered in our experiments were the following: Decision Trees, Logistic Regression, Linear Support Vector Machine (SVM), Quadratic SVM, Cubic SVM, Gaussian SVM, K-Nearest Neighbours (kNN), Cosine kNN, Cubic kNN, Weighted kNN, Boosted Trees, Bagged Trees, Subspace Discriminant, Subspace kNN and RUSBoosted Trees.

The data was pre-processed in Python with MySQL, and the implementation used for the ML methods was that of the Classification Learner App of Matlab R2017b.

Fig. 6: Data pre-processing. (Data structure: the WiFi network is sampled for connected devices per AP; the samples are aggregated into one-minute statistics per AP, which form the feature vector; the date and time of the interval determine the interval label, which is the target.)

Fig. 7: Machine Learning process. (Network activity, pre-processing of raw data, feature vectors, model selection and training, choice of model, trained classifier, behavior inference; the behavior label is the target value.)

Labelling process: One advantage of aiming at recognizing patterns related to the academic schedule is that it allows the labelling of data without spending time monitoring the behavior of people in situ or via video-surveillance, as is done in [8]. Instead, we use a script that checks some conditions to determine whether the pattern is present or absent and assigns the corresponding label, based on the time-stamp of the interval from which x_i was calculated (Fig. 6).
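The pre-processing steps listed in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' script: the function names are hypothetical, rows are assumed to be tuples shaped like Table I, an AP that does not appear in a sampling is assumed to have zero connected devices, and population (rather than sample) standard deviation and variance are assumed, since the text does not say which was used.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean, pstdev, pvariance


def count_devices(rows):
    """Step 1: count connected devices per (sampling time-stamp, AP).

    rows: iterable of (sample_id, user_hash, ap_hash, timestamp) tuples.
    """
    counts = Counter()
    for _sample_id, _user, ap, ts in rows:
        counts[(ts, ap)] += 1
    return counts


def minute_features(rows, aps):
    """Steps 2 and 3: one-minute statistics per AP, concatenated per minute.

    Returns {minute_start: [max, min, mean, std, var for each AP in `aps`]}.
    A sampling with no connection at all yields no rows and is therefore
    not seen here (a limitation of this sketch).
    """
    counts = count_devices(rows)
    samplings = sorted({ts for ts, _ap in counts})
    per_minute = defaultdict(lambda: defaultdict(list))
    for ts in samplings:
        minute = ts.replace(second=0, microsecond=0)
        for ap in aps:
            # Assumption: an AP missing from a sampling had zero devices.
            per_minute[minute][ap].append(counts.get((ts, ap), 0))
    features = {}
    for minute, by_ap in per_minute.items():
        vector = []
        for ap in aps:
            values = by_ap[ap]
            vector += [max(values), min(values), mean(values),
                       pstdev(values), pvariance(values)]
        features[minute] = vector
    return features


if __name__ == "__main__":
    t1 = datetime(2017, 4, 7, 15, 51, 7)
    t2 = datetime(2017, 4, 7, 15, 51, 52)
    demo = [(1, "u1", "A", t1), (1, "u2", "B", t1),
            (2, "u3", "A", t2), (2, "u4", "A", t2)]
    print(minute_features(demo, aps=["A", "B"]))
```

With five statistics per AP, the 67 APs of the study would give a 335-dimensional vector per minute under this layout; keeping only a subset of the statistics shortens the vector accordingly.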
Let $T$ be the experiment at hand. The labels are assigned as follows:

$$
y_i = y(x_i) =
\begin{cases}
P_T, & \text{if the pattern is present in } x_i \\
N_T, & \text{otherwise}
\end{cases}
\tag{1}
$$

We defined the presence of the pattern differently depending on whether the pattern is associated with an event (considered instantaneous) or with a specific time interval. In the case of an event, the pattern is considered to be present if the time-stamp of a vector is within a given range of the event. For a pattern that takes place during an interval, the pattern is considered to be present if the one-minute interval from which $x_i$ has been calculated overlaps with the interval in which the pattern takes place.

Fig. 8: Task 1: Labeling data from a break interval.
Fig. 9: Task 2: Labeling data from the end of a class.

Two binary classification tasks were posed:
• Task 1 - Break Interval Recognition: this problem involves the identification of the 10-minute breaks between class blocks. During these periods, students leave one class to attend another, or take a break before continuing the same class. Some instructors freely choose to ignore the scheduled break, either maintaining continuity in the class or postponing the break to another moment. The labelling strategy is the one defined for patterns that take place in intervals. The label $P_B$ was assigned with a tolerance of one minute before and after the theoretical break interval, as stated in the academic blocks of the schedule (Fig. 8).
• Task 2 - End of Class Event Recognition: this problem involves the identification of the movement patterns related to the end of a class block. The labelling strategy is the one defined for events. The label $P_E$ is assigned within a range of 2.5 minutes from the scheduled end of a class (Fig. 9).

V. RESULTS AND ANALYSIS

The top-performing ML method in our binary classification experiments (Bagged Trees) shows an accuracy (percentage of validation data that is correctly classified) of 93.7 % for the task of recognizing that a statistics vector belongs to a break interval between academic blocks, and of 83.3 % for the task of detecting the end of a block of classes (Table II).

TABLE II: Results for 10-fold cross-validation on one week of data.

ML method               % Accuracy        % Accuracy
                        Break Intervals   End of Class
Fine Tree               83.2              64.6
Medium Tree             82.8              64.6
Coarse Tree             59.2              56.2
Logistic Reg            70.6              45.8
Linear SVM              72.7              59.7
Quadratic SVM           79.0              65.3
Cubic SVM               82.4              62.5
Fine Gaussian SVM       76.5              56.9
Medium Gaussian SVM     69.3              64.6
Coarse Gaussian SVM     55.9              50.0
Fine kNN                83.6              70.8
Medium kNN              73.5              52.8
Coarse kNN              52.1              51.4
Cosine kNN              73.9              50.7
Cubic kNN               70.6              49.3
Weighted kNN            80.3              64.6
Boosted Trees           91.6              56.9
Bagged Trees            93.7              83.3
Subspace Discriminant   76.5              54.9
Subspace kNN            90.3              81.9
RUS Boosted Trees       80.7              59.7

The use of Principal Component Analysis was discarded, as its effect on accuracy was not systematically beneficial, ranging from decreases of 18.5 % to increases of 25.2 %.

Increasing the size of the dataset beyond one week of data did not improve accuracy for all the ML techniques. In particular, bagging with decision trees, as implemented in the Classification Learner App of Matlab R2017b (with 30 learners), saturated at its highest values with the data from one week. This indicates that further improvement in the classification tasks with the dataset at hand might require noise filtering, better hyper-parameter tuning, more elaborate classification models or more attention to feature engineering.
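For readers who wish to reproduce the evaluation protocol outside Matlab, the combination of random sub-sampling of the majority class and 10-fold cross-validation of a 30-learner bagged-trees ensemble can be sketched as follows. This is a hedged illustration only: it uses scikit-learn as a stand-in for the Classification Learner App, so results will not match Table II; the 1:1 class balance after sub-sampling, the stratified and shuffled folds, the random seeds, and the synthetic data produced by the hypothetical helper `make_toy_data` are all assumptions that are not taken from the experiments.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def undersample_majority(X, y, rng):
    """Randomly discard rows so that every class keeps as many rows as the
    smallest class (assumption: a 1:1 balance is the target)."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


def bagged_trees_cv_accuracy(X, y, n_learners=30, n_folds=10, seed=0):
    """Per-fold accuracy of a bagged ensemble of decision trees.

    BaggingClassifier uses a decision tree as its default base learner.
    """
    model = BaggingClassifier(n_estimators=n_learners, random_state=seed)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")


def make_toy_data(n_neg=400, n_pos=60, n_features=10, seed=0):
    """Synthetic, imbalanced stand-in for the labelled feature vectors."""
    rng = np.random.default_rng(seed)
    X_neg = rng.normal(0.0, 1.0, size=(n_neg, n_features))
    X_pos = rng.normal(1.5, 1.0, size=(n_pos, n_features))
    X = np.vstack([X_neg, X_pos])
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    X_bal, y_bal = undersample_majority(X, y, np.random.default_rng(0))
    scores = bagged_trees_cv_accuracy(X_bal, y_bal)
    print("mean accuracy over folds:", scores.mean())
```

Sub-sampling before cross-validation mirrors the order described in the methodology; with real data, `X` would hold the concatenated one-minute statistics and `y` the schedule-derived labels.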
These results show the applicability of well-known machine learning classification techniques to the detection of two specific patterns that respond to the structural organization of activities within an academic environment. They should not be generalized to other kinds of buildings or to different types of activities. In Figures 3, 4 and 5, there are observable changes in the data around the break interval between 9:20 and 9:30 am. If the people not engaged in class activities during these intervals had been more numerous, or had shown larger and more active movement patterns, the patterns observed in the data might have been masked by the values generated by those individuals.

VI. CONCLUSION AND ANALYSIS

We successfully applied machine learning binary classification techniques to data aggregated via basic one-minute statistics from noisy, incomplete and imperfect raw data. Particularly stringent conditions were accepted so that the method is easily applicable wherever widespread indoor wireless administrative software is available. The classification models obtained were able to detect, with reasonably high accuracy, common shared mobility patterns in an academic environment.

Contrary to the general belief that there is a hard trade-off between the limitations imposed on machine learning by regulations and the opportunities to innovate [47], the method used opens possibilities for further innovation while preserving privacy, both in the data that is required and in the nature of the patterns inferred. Moreover, its infrastructure costs are low in comparison with other approaches to gathering and analysing data.

Further experiments with strategies like the one proposed here can easily be enriched with information from the context of each environment.
There are clear paths of further improvement:
1) The temporal nature of the data indicates that time-series models might be promising for more elaborate tasks, such as forecasting the connection counts on each AP and possibly detecting anomalies in mobility patterns.
2) The use of infrastructure and architectural information, as well as the nature of the planned use of the spaces covered by the APs, can also boost our ability to extract knowledge and detect patterns in the behaviour of people.

We obtained accuracy levels of 83.3 % and 93.7 % in the detection of shared patterns in human mobility behaviour with a methodology that innovates by pushing down, towards simplicity (not mitigating noise, accepting errors and missing data, and limiting itself to widespread technology), instead of up (towards more expensive, complex and scarce technology). This demonstrates that the conflict between privacy regulations, budget constraints and the need to innovate in the ways we solve our problems is a promising one.

This proof of concept is only a first step in this research stream. In our future work we will analyse more complex behaviours, aiming to recognize threats to the security, integrity and health of people, based mainly on WiFi AP connection logs.

REFERENCES

[1] S. Griffioen, M. Vermeer, B. Dukai, S. van der Spek, and E. Verbree, “Exploring indoor movement patterns through eduroam connected wireless devices,” in Proceedings of the 20th AGILE International Conference on Geographic Information Science. Wageningen University, 2017.
[2] H. Li, K. Ota, M. Dong, and M. Guo, “Learning human activities through wi-fi channel state information with multiple access points,” IEEE Communications Magazine, vol. 56, no. 5, pp. 124–129, 2018.
[3] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding and modeling of wifi signal based human activity recognition,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 2015, pp. 65–76.
[4] K. Ali, A. X. Liu, W. Wang, and M. Shahzad, “Keystroke recognition using wifi signals,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 2015, pp. 90–102.
[5] M. Mun, D. Estrin, J. Burke, and M. Hansen, “Parsimonious mobility classification using gsm and wifi traces,” in Proceedings of the Fifth Workshop on Embedded Networked Sensors (HotEmNets), 2008.
[6] M. B. Kjærgaard, M. Wirz, D. Roggen, and G. Tröster, “Mobile sensing of pedestrian flocks in indoor environments using wifi signals,” in IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2012, pp. 95–102.
[7] J. Manweiler, N. Santhapuri, R. R. Choudhury, and S. Nelakuditi, “Predicting length of stay at wifi hotspots,” in Proceedings IEEE INFOCOM. IEEE, 2013, pp. 3102–3110.
[8] G. Poucin, B. Farooq, and Z. Patterson, “Activity patterns mining in wi-fi access point logs,” Computers, Environment and Urban Systems, vol. 67, pp. 55–67, 2018.
[9] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi, “Understanding individual human mobility patterns,” Nature, vol. 453, no. 7196, pp. 779–782, 2008.
[10] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of predictability in human mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
[11] M. Haghani and M. Sarvi, “Crowd behaviour and motion: Empirical methods,” Transportation research part B: Methodological, 2017.
[12] N. Eagle and A. S. Pentland, “Eigenbehaviors: Identifying structure in routine,” Behavioral Ecology and Sociobiology, vol. 63, no. 7, pp. 1057–1066, 2009.
[13] B. Tan, Q. Chen, K. Chetty, K. Woodbridge, W. Li, and R. Piechocki, “Exploiting wifi channel state information for residential healthcare informatics,” IEEE Communications Magazine, vol. 56, no. 5, pp. 130–137, 2018.
[14] X. Song, R. Shibasaki, N. J. Yuan, X. Xie, T. Li, and R. Adachi, “Deepmob: Learning deep knowledge of human emergency behavior and mobility from big and heterogeneous data,” ACM Transactions on Information Systems (TOIS), vol. 35, no. 4, p. 41, 2017.
[15] B. Guo, H. Chen, Q. Han, Z. Yu, D. Zhang, and Y. Wang, “Worker-contributed data utility measurement for visual crowdsensing systems,” IEEE Transactions on Mobile Computing, vol. 16, no. 8, pp. 2379–2391, 2017.
[16] N. Caceres and F. G. Benitez, “Supervised land use inference from mobility patterns,” Journal of Advanced Transportation, vol. 2018, 2018.
[17] A. J. Ruiz-Ruiz, H. Blunck, T. S. Prentow, A. Stisen, and M. B. Kjaergaard, “Analysis methods for extracting knowledge from large-scale wifi monitoring to inform building facility planning,” in Pervasive Computing and Communications (PerCom), 2014 IEEE International Conference on. IEEE, 2014, pp. 130–138.
[18] J. L. Toole, M. Ulm, M. C. González, and D. Bauer, “Inferring land use from mobile phone activity,” in Proceedings of the ACM SIGKDD international workshop on urban computing. ACM, 2012, pp. 1–8.
[19] F. Calabrese, J. Reades, and C. Ratti, “Eigenplaces: segmenting space through digital signatures,” IEEE Pervasive Computing, vol. 9, no. 1, pp. 78–84, 2010.
[20] M. Kim and D. Kotz, “Modeling users’ mobility among wifi access points,” in Workshop on Wireless traffic measurements and modeling. USENIX Association, 2005, pp. 19–24.
[21] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1653–1660.
[22] Z. Wang, B. Guo, Z. Yu, and X. Zhou, “Wi-fi csi-based behavior recognition: From signals and actions to activities,” IEEE Communications Magazine, vol. 56, no. 5, pp. 109–115, 2018.
[23] R. Shah and R. Romijnders, “Applying deep learning to basketball trajectories,” arXiv preprint, 2016.
[24] H. Wu, Z. Chen, W. Sun, B. Zheng, and W. Wang, “Modeling trajectories with recurrent neural networks,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence IJCAI-17. IJCAI, 2017, pp. 3083–3090.
[25] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[26] S. Wang, Q. Hu, Y. Sun, and J. Huang, “Privacy preservation in location-based services,” IEEE Communications Magazine, vol. 56, no. 3, pp. 134–140, 2018.
[27] M. Al-Rubaie and J. M. Chang, “Privacy preserving machine learning: Threats and solutions,” arXiv preprint arXiv:1804.11238, 2018.
[28] Z. Huang, P. Wang, F. Zhang, J. Gao, and M. Schich, “A mobility network approach to identify and anticipate large crowd gatherings,” Transportation Research Part B: Methodological, vol. 114, pp. 147–170, 2018.
[29] S. Motsch, M. Moussaid, E. G. Guillot, M. Moreau, J. Pettre, G. Theraulaz, C. Appert-Rolland, and P. Degond, “Forecasting crowd dynamics through coarse-grained data analysis,” bioRxiv, p. 175760, 2017.
[30] J. M. Grant and P. J. Flynn, “Crowd scene understanding from video: a survey,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 13, no. 2, p. 19, 2017.
[31] J. Shao, K. Kang, C. Change Loy, and X. Wang, “Deeply learned attributes for crowded scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4657–4666.
[32] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. G. Hauptmann, “Devnet: A deep event network for multimedia event detection and evidence recounting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2568–2577.
[33] H. Rahmani and A. Mian, “Learning a non-linear knowledge transfer model for cross-view action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2458–2466.
[34] A. Haque, A. Alahi, and L. Fei-Fei, “Recurrent attention models for depth-based person identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1229–1238.
[35] N. Aschenbruck, A. Munjal, and T. Camp, “Trace-based mobility modeling for multi-hop wireless networks,” Computer Communications, vol. 34, no. 6, pp. 704–714, 2011.
[36] S. A. H. Zahabi, A. Ajzachi, and Z. Patterson, “Transit trip itinerary inference with gtfs and smartphone data,” Transportation Research Record: Journal of the Transportation Research Board, no. 2652, pp. 59–69, 2017.
[37] Z. Patterson and K. Fitzsimmons, “Datamobile: Smartphone travel survey experiment,” Transportation Research Record: Journal of the Transportation Research Board, no. 2594, pp. 35–43, 2016.
[38] J. Su, A. Chin, A. Popivanova, A. Goel, and E. De Lara, “User mobility for opportunistic ad-hoc networking,” in Sixth IEEE Workshop on Mobile Computing Systems and Applications (WMCSA). IEEE, 2004, pp. 41–50.
[39] Q.-T. Nguyen-Vuong, N. Agoulmine, and Y. Ghamri-Doudane, “Terminal-controlled mobility management in heterogeneous wireless networks,” IEEE Communications Magazine, vol. 45, no. 4, 2007.
[40] F. Calabrese, M. Diao, G. Di Lorenzo, J. Ferreira Jr, and C. Ratti, “Understanding individual mobility patterns from urban sensing data: A mobile phone trace example,” Transportation research part C: emerging technologies, vol. 26, pp. 301–313, 2013.
[41] G. Mao, B. Fidan, and B. D. Anderson, “Wireless sensor network localization techniques,” Computer networks, vol. 51, no. 10, pp. 2529–2553, 2007.
[42] H. Wymeersch, J. Lien, and M. Z. Win, “Cooperative localization in wireless networks,” Proceedings of the IEEE, vol. 97, no. 2, pp. 427–450, 2009.
[43] F. Meneses and A. Moreira, “Large scale movement analysis from wifi based location data,” in International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 2012, pp. 1–9.
[44] M. Conti and S. Giordano, “Multihop ad hoc networking: The theory,” IEEE Communications Magazine, vol. 45, no. 4, 2007.
[45] T. Yoshida and Y. Taniguchi, “Estimating the number of people using existing wifi access point in indoor environment,” in Proceedings of the 6th European Conference of Computer Science (ECCS), 2015, pp. 46–53.
[46] S. Kim, X. J. Zhang, and S. S. Lumetta, “Minimizing protection cost for high-speed recovery of mission critical traffic in WDM mesh networks,” in Proceedings of the AFCEA/IEEE Military Communications Conference, 2007.
[47] E. Horvitz and D. Mulligan, “Data, privacy, and the greater good,” Science, vol. 349, no. 6245, pp. 253–255, 2015.
[48] M. Srivatsa and M. Hicks, “Deanonymizing mobility traces: Using social network as a side-channel,” in Proceedings of the 2012 ACM conference on Computer and communications security. ACM, 2012, pp. 628–637.
[49] X. Ding, L. Zhang, Z. Wan, and M. Gu, “A brief survey on de-anonymization attacks in online social networks,” in International Conference on Computational Aspects of Social Networks (CASoN). IEEE, 2010, pp. 611–615.
