Previous studies have shown that Twitter users are biased toward tweeting from certain locations (locational bias) and during certain hours (temporal bias). We used three years of geolocated Twitter data to quantify these biases and to test our central hypothesis that Twitter users' biases are consistent across US cities. Our results suggest that the temporal and locational biases of Twitter users are inconsistent among three US metropolitan cities. We derive conclusions about how the complexity of the underlying data-producing process affects its consistency, and we argue that testing and quantifying such inconsistencies in organically evolved Big Data is a promising research avenue for Geospatial Data Science.
Recently, we have witnessed an increase in the volume of digital footprints around urban environments. Such information has become increasingly crucial for understanding fast-evolving urban landscapes, and it augments traditional high-latency on-site survey methods. However, a major problem common among organically evolved big data sets is the lack of information about their consistency [1]. Unlike traditional measurements, where data collection protocols guarantee consistency and reproducibility, the data generation processes underlying most big data are usually unknown. This lack of knowledge about the consistency of ambient geospatial big data limits existing studies, which tend to overgeneralize their results beyond the presented case studies. We highlight this problem by testing the hypothesis of consistency for two metrics of Twitter users across major US metropolitan cities: the users' preference to engage ('tweet') around certain landuse types, and the users' circadian rhythms.
Previous research demonstrated the possibility of inferring urban landuse from Twitter data based on the analysis of individual user mobility patterns [2]. In this regard, the underlying process generating the spatiotemporal patterns of Twitter data is a convolution of two processes: first, the process of engagement with the technology (tweeting), and second, the mobility patterns of technology users. Although human mobility is highly predictable and consists of a few key locations (e.g., home, work, etc.) [3], Twitter user biases are not well understood [4]. In this work, we compared Twitter user biases toward tweeting from certain landuse types (urban activities) and during certain times of the day. Our main guiding hypothesis is that there are no significant differences in Twitter user biases across US cities. We tested this hypothesis using detailed landuse maps of different US cities, which we used to quantify the consistency of a three-year collection of geolocated Twitter data.
Here we demonstrate that assumptions made about big geolocated data sets must be verified before the data are incorporated into larger studies. Specifically, we demonstrate that the process of geolocated tweet production is not spatially consistent, even when the data are mined from the same source (e.g., the Twitter API).
Geotagged tweets were obtained using Twitter’s streaming API from January 2013 to January 2016 (~2.42 billion tweets for the contiguous United States). From these we selected tweets within the geographic bounding boxes of Chicago (39 million tweets), Manhattan (18 million tweets), and San Diego (8 million tweets). We filtered the Twitter data to remove duplicate records. In addition, we removed the tweets of Twitter users who made fewer than 10 tweets per year, were active for fewer than 30 days, or exceeded the 99th percentile for speed between consecutive tweets.
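The user-level filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the record layout `(user_id, timestamp, lat, lon)`, the function names, the reading of "tweets per year" as an average over the three-year collection window, the reading of "active days" as distinct calendar days, and the choice to take the percentile over each user's maximum speed are all assumptions made for the example.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def kept_users(tweets, years=3, min_per_year=10, min_active_days=30, speed_pct=99):
    """Return the ids of users that pass all three filters.

    tweets: iterable of (user_id, datetime, lat, lon) tuples (assumed layout).
    """
    by_user = defaultdict(list)
    for user, ts, lat, lon in set(tweets):  # set() drops exact duplicate records
        by_user[user].append((ts, lat, lon))
    if not by_user:
        return set()

    # Assumption: each user is summarised by the fastest hop between consecutive tweets.
    top_speed = {}
    for user, recs in by_user.items():
        recs.sort()  # chronological order
        best = 0.0
        for (t0, la0, lo0), (t1, la1, lo1) in zip(recs, recs[1:]):
            hours = (t1 - t0).total_seconds() / 3600.0
            if hours > 0:
                best = max(best, haversine_km(la0, lo0, la1, lo1) / hours)
        top_speed[user] = best

    # Nearest-rank percentile over users, using integer arithmetic for the ceiling.
    ranked = sorted(top_speed.values())
    rank = -(-speed_pct * len(ranked) // 100)
    cutoff = ranked[rank - 1]

    return {
        user
        for user, recs in by_user.items()
        if len(recs) >= min_per_year * years
        and len({ts.date() for ts, _, _ in recs}) >= min_active_days
        and top_speed[user] <= cutoff
    }
```

Other readings of the three criteria (for example, a per-calendar-year tweet count, or a percentile over individual hops rather than over users) would change only the final set comprehension and the construction of `ranked`.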
We used parcel-level landuse maps recently released by the New York City Department of City Planning, San Diego Land Layers (SANDAG’s), and the landuse inventory for Northeastern Illinois to retrieve the landuse types associated with each tweet collected in New York, San Diego, and Chicago, respectively. We assigned each tweet to the nearest landuse parcel using a scalable point-in-nearest-polygon algorithm. Landuse types were grouped into twelve activity classes using a legend that is popular in social studies [5]. We applied a DBSCAN clustering function with a search window of 0.00225 degrees and a minimum of three points to identify 884,737 key locations from 163,340 users in the city of Chicago. Similarly, we identified 192,934 key locations from 47,356 users in San Diego and 503,223 key locations from 132,546 users on the island of Manhattan. The spatial clustering was done to identify significant key locations, as indicated by multiple tweets at the same location, and to avoid random tweets. We labeled each key location with the dominant landuse associated with its tweets.
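The clustering and labeling step can be illustrated with a small sketch. The text does not state which DBSCAN implementation was used; the version below is a plain-Python DBSCAN written only for illustration (quadratic in the number of points). It assumes that clustering is run separately for each user, that distance is Euclidean in degrees to match the 0.00225-degree search window, that each tweet already carries the landuse label of its nearest parcel, and that a key location is reported as the centroid of its cluster. The function names and the tuple layout are hypothetical.

```python
from collections import Counter

def dbscan(points, eps=0.00225, min_pts=3):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2 <= eps * eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)  # includes point i itself
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reach)
    return labels

def key_locations(user_tweets, eps=0.00225, min_pts=3):
    """Cluster one user's tweets and label each cluster by its dominant landuse.

    user_tweets: list of (lon, lat, landuse) tuples for a single user (assumed layout).
    Returns a list of (centroid_lon, centroid_lat, dominant_landuse).
    """
    labels = dbscan([(x, y) for x, y, _ in user_tweets], eps, min_pts)
    groups = {}
    for tweet, label in zip(user_tweets, labels):
        if label >= 0:  # isolated tweets (noise) are discarded
            groups.setdefault(label, []).append(tweet)
    result = []
    for members in groups.values():
        cx = sum(m[0] for m in members) / len(members)
        cy = sum(m[1] for m in members) / len(members)
        dominant = Counter(m[2] for m in members).most_common(1)[0][0]
        result.append((cx, cy, dominant))
    return result
```

At the data volumes reported here, a library implementation backed by a spatial index would likely be needed; the sketch is meant only to make the roles of the search window and the minimum point count concrete.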
Twitter users are known to exhibit biases, sending tweets around certain locations (e.g., tweeting at home) or during certain hours of the day [4]. In the absence of locational bias, the distribution of landuse types in Twitter data should resemble their abundance in the city. However, if a preferential bias exists, some landuse types will be more common in Twitter data than their relative weight (share) in the city would suggest. Our first metric assesses the locational bias by measuring the ratio of landuse abundance in Twitter data to that in the city landuse map. The first weight is the ratio of the number of Twitter clusters (users’ key locations) labeled with a certain landuse type to the total number of clusters, while the second is the ratio of the surface area occupied by a certain landuse type to the total surface area of the city. We fitted a linear model to the relation between the two weights of the different landuse types, independently for each city. A