Automatic Recognition of Public Transport Trips from Mobile Device Sensor Data and Transport Infrastructure Information
Automatic detection of public transport (PT) usage has important applications for intelligent transport systems. It is crucial for understanding the commuting habits of passengers at large and over longer periods of time. It also enables compilation of door-to-door trip chains, which in turn can assist public transport providers in improved optimisation of their transport networks. In addition, predictions of future trips based on past activities can be used to assist passengers with targeted information. This article documents a dataset compiled from a day of active commuting by a small group of people using different means of PT in the Helsinki region. Mobility data was collected by two means: (a) manually written details of each PT trip during the day, and (b) measurements using sensors of travellers’ mobile devices. The manual log is used to cross-check and verify the results derived from automatic measurements. The mobile client application used for our data collection provides a fully automated measurement service and implements a set of algorithms for decreasing battery consumption. The live locations of some of the public transport vehicles in the region were made available by the local transport provider and sampled with a 30-second interval. The stopping times of local trains at stations during the day were retrieved from the railway operator. The static timetable information of all the PT vehicles operating in the area is made available by the transport provider, and linked to our dataset. The challenge is to correctly detect as many manually logged trips as possible by using the automatically collected data. This paper includes an analysis of challenges due to missing or partially sampled information in the data, and initial results from automatic recognition using a set of algorithms. Improvement of correct recognitions is left as an ongoing challenge.
Research Summary
The paper presents a comprehensive study on automatically recognizing public‑transport (PT) trips by fusing mobile‑device sensor data with transport‑infrastructure information. The authors collected a unique dataset in the Helsinki region on a single day (August 26, 2016) involving eight participants who performed as many PT journeys as possible between 9 am and 4 pm. For each participant two parallel data streams were recorded: (1) a manually written log that details every PT leg (entry/exit stations, timestamps, line names, vehicle departure/arrival times) and (2) continuous measurements from a custom Android client (TrafficSense) that captures fused GPS positions, accuracy estimates, and activity‑recognition results supplied by Google Play Services. The manual log serves as a ground‑truth reference for evaluating the automatic detection pipeline.
The Android client is designed with a strong emphasis on battery conservation. It alternates between an ACTIVE state, where high‑frequency (10 s) location requests are issued, and a SLEEP state, entered after a 40 s timer expires while the activity recognizer reports “STILL”. The client wakes up if the device moves a distance larger than the current accuracy estimate, even if the activity recognizer still reports STILL – a situation common on smooth rail rides. Activity updates are always requested, but during STILL periods the interval may stretch up to 180 s. Each accepted location sample is paired with the most recent activity label, meaning the same activity can appear on multiple points.
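The ACTIVE/SLEEP behaviour described above can be sketched as a small state machine. This is an illustrative reading of the summary, not the TrafficSense implementation: the class name `PowerState` and its methods are hypothetical, and the displacement `moved_m` is assumed to be computed elsewhere from whatever low-power location updates arrive during SLEEP.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE, SLEEP = "ACTIVE", "SLEEP"
STILL_TIMEOUT_S = 40.0  # timer length quoted in the summary


@dataclass
class PowerState:
    """Hypothetical sketch of the client's power-saving state machine."""
    state: str = ACTIVE
    still_since: Optional[float] = None  # start of the current run of STILL reports

    def on_activity(self, t: float, activity: str) -> str:
        """Handle an activity-recognition report received at time t (seconds)."""
        if activity != "STILL":
            # Any non-STILL activity cancels the timer and keeps the client awake.
            self.still_since = None
            self.state = ACTIVE
        else:
            if self.still_since is None:
                self.still_since = t
            if self.state == ACTIVE and t - self.still_since >= STILL_TIMEOUT_S:
                self.state = SLEEP
        return self.state

    def on_location(self, moved_m: float, accuracy_m: float) -> str:
        """Wake up when displacement exceeds the accuracy estimate, even if STILL."""
        if self.state == SLEEP and moved_m > accuracy_m:
            self.state = ACTIVE
            self.still_since = None  # the STILL timer restarts after waking
        return self.state
```

Separating the two event handlers mirrors the text: activity reports drive the transition into SLEEP, while displacement alone can force a wake-up on a smooth rail ride where the recognizer keeps reporting STILL.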
Raw sensor data (≈6 030 rows) undergoes a filtering stage before being used for PT detection. The filter always keeps a point once more than 60 min have passed since the last accepted point (to provide a “ping”), removes samples with estimated accuracy worse than 1 000 m, and applies a rule‑based activity filter: points are kept if the activity differs from the last queued activity and is classified as “good”, or if the activity is unchanged but the spatial distance to the last accepted point exceeds the reported accuracy. After filtering, 5 975 points remain (≈30 % of the theoretical maximum at the 10 s sampling interval), reflecting the aggressive power‑saving strategy.
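A minimal sketch of such a filter follows, under several assumptions that the summary does not settle. The 60-minute rule is read as a keep-alive that accepts a point once that much time has passed since the last accepted one. "Good" activities are taken to be everything except UNKNOWN and TILTING. The order of the checks and the acceptance of the very first point are guesses. The names `Point` and `accept` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

PING_INTERVAL_S = 60 * 60   # keep-alive interval (60 min)
MAX_ACCURACY_M = 1000.0     # larger accuracy estimates are discarded
BAD_ACTIVITIES = {"UNKNOWN", "TILTING"}  # assumption: all other labels are "good"


@dataclass
class Point:
    t: float               # timestamp in seconds
    accuracy_m: float      # estimated accuracy radius in metres
    activity: str          # most recent activity label
    dist_to_last_m: float  # distance to the last accepted point, precomputed


def accept(p: Point, last_t: Optional[float], last_activity: Optional[str]) -> bool:
    """Decide whether p is kept, given the last accepted time and activity."""
    if p.accuracy_m > MAX_ACCURACY_M:
        return False                             # too inaccurate to use
    if last_t is None or p.t - last_t > PING_INTERVAL_S:
        return True                              # first point, or keep-alive "ping"
    if p.activity != last_activity:
        return p.activity not in BAD_ACTIVITIES  # activity change to a good label
    return p.dist_to_last_m > p.accuracy_m       # same activity, but real movement
```

The last rule explains why few points survive while a device is stationary: with an unchanged activity, a new fix is kept only when it lies outside the accuracy radius of the previous one.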
Three complementary sources of transport‑infrastructure data are incorporated: (i) real‑time vehicle positions sampled from the Helsinki Regional Transport (HSL) SIRI API at 30 s intervals, (ii) static timetables provided in GTFS format covering all lines and departures on the test day, and (iii) train‑stop times retrieved as JSON from the Finnish Transport Agency’s Digitraffic service. The real‑time feed covers only a subset of the fleet, leading to missing vehicle positions for many trips.
The detection algorithm proceeds in stages. First, the activity recognizer’s “IN_VEHICLE” label is used to extract candidate movement segments. Within each candidate, the algorithm searches for a contiguous sequence of real‑time vehicle positions that lie within a configurable spatial tolerance (≈50 m) of the user’s trajectory. If such a match is found, the corresponding line name and direction are assigned to the segment. When real‑time data is unavailable, the algorithm falls back to matching the candidate segment against static GTFS schedules: it checks whether the start and end timestamps of the segment align (within a tolerance of a few minutes) with any scheduled departure/arrival pair for a line that passes through the observed locations. The combination of live‑feed matching and schedule cross‑checking yields a set of automatically recognized PT legs.
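The live-feed matching step might look roughly like the sketch below. It is a simplification rather than the authors' algorithm: instead of searching for a contiguous matching sequence, it scores each vehicle by the share of user fixes that fall within the spatial tolerance of the vehicle's time-nearest sample. The function names, the `max_gap_s` and `min_share` parameters, and the vehicle identifiers in the usage note are all hypothetical. Only the ≈50 m tolerance comes from the summary.

```python
import math
from typing import Dict, List, Optional, Tuple

Fix = Tuple[float, float, float]  # (time in seconds, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_in_time(track: List[Fix], t: float, max_gap_s: float) -> Optional[Fix]:
    """Return the vehicle sample closest in time to t, or None if too far apart."""
    best = min(track, key=lambda f: abs(f[0] - t), default=None)
    if best is None or abs(best[0] - t) > max_gap_s:
        return None
    return best


def match_segment(user: List[Fix], vehicles: Dict[str, List[Fix]],
                  tol_m: float = 50.0, max_gap_s: float = 15.0,
                  min_share: float = 0.8) -> Optional[str]:
    """Pick the vehicle whose live track best follows the user's IN_VEHICLE segment."""
    best_id, best_share = None, 0.0
    for vid, track in vehicles.items():
        hits = 0
        for t, lat, lon in user:
            v = nearest_in_time(track, t, max_gap_s)
            if v is not None and haversine_m(lat, lon, v[1], v[2]) <= tol_m:
                hits += 1
        share = hits / len(user) if user else 0.0
        if share >= min_share and share > best_share:
            best_id, best_share = vid, share
    return best_id  # None means: fall back to the static timetable
```

For example, calling `match_segment(user_fixes, {"tram_A": track_a, "bus_B": track_b})` returns the identifier of the best-matching vehicle, or `None` when no vehicle qualifies. With a 30 s feed a vehicle can travel well over 50 m between samples, so a practical version would likely interpolate vehicle positions in time before comparing distances.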
Evaluation against the 103 manually logged trips shows that the system correctly identifies 71 trips (≈69 %). Success rates vary by mode: bus and tram trips are recognized with higher accuracy, while subway and train trips suffer from GPS degradation (underground environments) and sparse real‑time vehicle data. Failure cases are primarily attributed to three factors: (a) missing or delayed real‑time vehicle positions, (b) GPS errors that cause alternating “good” and “bad” location points (especially in tunnels), and (c) activity‑recognition uncertainty where the classifier outputs “UNKNOWN” or “TILTING”. The authors note that the manual timestamps are approximate, which adds further ambiguity to the evaluation.
The paper discusses several avenues for improvement. Probabilistic models such as Hidden Markov Models or Bayesian filters could smooth the sequence of observations and better handle intermittent data loss. Integration of Wi‑Fi and Bluetooth Low Energy (BLE) beacons, which were only available on trams during the study, could provide richer proximity cues. Linking the detection pipeline to ticket‑validation systems (smart‑card taps) would give precise boarding/alighting events, eliminating reliance on noisy sensor data. Finally, scaling the dataset to multiple days and a larger participant pool would enable training of machine‑learning models that generalize across users, devices, and transport conditions.
In summary, the study demonstrates that a modest set of smartphone sensors, when combined with publicly available live vehicle feeds and static timetables, can automatically recover a substantial fraction of real‑world PT trips. The work highlights practical challenges—battery constraints, incomplete infrastructure data, and sensor noise—and provides a solid baseline for future research aimed at achieving near‑real‑time, high‑accuracy PT trip detection for intelligent transport systems, personalized travel assistance, and large‑scale mobility analytics.