Distilling human mobility models with symbolic regression

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Human mobility is a fundamental aspect of social behavior, with broad applications in transportation, urban planning, and epidemic modeling. Represented by the gravity model and the radiation model, established analytical models for mobility phenomena are often discovered by analogy to physical processes. Such discoveries can be challenging and rely on intuition, while the potential of emerging social observation data in model discovery is largely unexploited. Here, we propose a systematic approach that leverages symbolic regression to automatically discover interpretable models from human mobility data. Our approach finds several well-known formulas, such as the distance decay effect and classical gravity models, as well as previously unknown ones, such as an exponential-power-law decay that can be explained by the maximum entropy principle. By relaxing the constraints on the complexity of model expressions, we further show how key variables of human mobility are progressively incorporated into the model, making this framework a powerful tool for revealing the underlying mathematical structures of complex social phenomena directly from observational data.


💡 Research Summary

The paper introduces a systematic framework that uses Symbolic Regression (SR) to automatically discover interpretable mathematical models of human mobility directly from large‑scale observational data. Traditional analytical models such as the gravity model and the radiation model have been derived largely by analogy to physical processes and rely on researcher intuition, which limits their ability to capture the complexity of modern, high‑resolution mobility datasets. To overcome this limitation, the authors formulate the problem in terms of an “allocation weight” (f_{ij}) that represents the probability that an individual at origin (i) chooses destination (j). The total out‑flow from each origin (O_i) is then allocated to destinations using the normalized weights, i.e., (\hat{F}_{ij}=O_i\,f_{ij}/\sum_{k\neq i} f_{ik}). By focusing on (f_{ij}) rather than directly modelling the flow matrix, the search space is dramatically reduced, making SR computationally tractable.
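The allocation step above is straightforward to state in code. The sketch below is illustrative only (not the authors' implementation); the function name `allocate_flows` and the dict-based data layout are assumptions made for the example.

```python
# Sketch of the flow-allocation step: given total out-flow O_i and an
# arbitrary allocation weight f_ij, the predicted flow is
#   F_ij = O_i * f_ij / sum_{k != i} f_ik.
# Names and data structures here are hypothetical, chosen for clarity.

def allocate_flows(out_flow, weights):
    """Distribute each origin's total out-flow over its destinations.

    out_flow: dict origin -> total out-flow O_i
    weights:  dict (origin, dest) -> allocation weight f_ij
    Returns:  dict (origin, dest) -> predicted flow F_ij (self-flows excluded)
    """
    flows = {}
    for i, O_i in out_flow.items():
        # Normalize over all destinations k != i (self-flows are filtered out,
        # as in the paper's preprocessing).
        total = sum(w for (o, k), w in weights.items() if o == i and k != i)
        for (o, j), f_ij in weights.items():
            if o == i and j != i:
                flows[(i, j)] = O_i * f_ij / total
    return flows

# Toy example: origin "a" sends 100 trips, split 3:1 between "b" and "c".
predicted = allocate_flows({"a": 100.0},
                           {("a", "b"): 3.0, ("a", "c"): 1.0})
```

Because the weights are renormalized per origin, any candidate expression for (f_{ij}) automatically conserves the observed out-flow (O_i), which is what makes the reduced search space well posed.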

Four explanatory variables are used: workplace population (w_i, w_j), residential population (r_i, r_j), geographic distance (d_{ij}), and intervening opportunities (s_{ij}) (computed separately for workplace and residential populations). The allowed operators are the five basic binary operators (+, −, ×, ÷, ^), plus the unary functions exp and ln. Model complexity is quantified as the number of nodes in the expression tree, with each variable, constant, and binary operator counting as one node, and each unary operator counting as two nodes (to reflect the implicit constant).
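The node-counting rule above can be made concrete with a small sketch. The tuple-based expression encoding and the function name `complexity` are assumptions for illustration, not the paper's representation.

```python
# Sketch of the complexity measure described above.
# Expressions are encoded as nested tuples ("op", child, ...) for operators,
# with plain strings for variables and numbers for constants (hypothetical
# encoding, chosen for clarity).

BINARY = {"+", "-", "*", "/", "^"}
UNARY = {"exp", "ln"}

def complexity(expr):
    """Count expression-tree nodes: variables, constants, and binary
    operators each cost 1; unary operators cost 2 (to reflect the
    implicit constant they carry)."""
    if not isinstance(expr, tuple):          # variable or constant leaf
        return 1
    op, *children = expr
    cost = 2 if op in UNARY else 1
    return cost + sum(complexity(c) for c in children)

# f_ij = m_j / d_ij^beta  encodes as:
expr = ("/", "m_j", ("^", "d_ij", "beta"))
```

Under this rule the distance-decay expression (m_j/d_{ij}^{\beta}) has complexity 5 (two binary operators, two variables, one constant), consistent with the "complexity 5" model highlighted in the key findings.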

The SR engine is a genetic‑programming based optimizer (implemented with the Julia package SymbolicRegression.jl). It evolves a population of candidate expressions through selection, crossover, and mutation, while simultaneously penalising model complexity via a regularisation term (\lambda C(f)) in the objective function. This yields a Pareto front of models that trade off mean‑squared error (MSE) against complexity, allowing the analyst to pick a model that balances interpretability and predictive performance.
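The Pareto-front idea can be sketched independently of the SR engine. The code below is a minimal illustration of non-dominated selection over (complexity, MSE) pairs; the function name and the candidate values are hypothetical, not results from the paper.

```python
# Sketch of Pareto-front selection over candidate models.
# Each candidate is a (complexity, mse) pair; a model survives if no other
# model is at least as simple AND at least as accurate, with at least one
# of the two strictly better. The candidate values below are made up.

def pareto_front(models):
    front = []
    for c, e in models:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for c2, e2 in models
        )
        if not dominated:
            front.append((c, e))
    return sorted(front)

candidates = [(5, 0.40), (7, 0.25), (7, 0.30), (9, 0.26), (11, 0.10)]
front = pareto_front(candidates)
```

In practice the regularized objective MSE + λC(f) collapses this trade-off to a single score during evolution, while the retained Pareto front lets the analyst inspect the full accuracy-versus-interpretability spectrum afterwards.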

The framework is applied to four massive datasets covering three countries: (1) 5 million cellphone users in Guangdong Province, China (November 2020), aggregated at 500 m grid cells; (2) inter‑city cellphone data for the Beijing‑Tianjin‑Hebei urban agglomeration (November 2019); (3) commuting flows derived from the 2011 UK Census at the merged local authority district level; and (4) US commuting flows from the American Community Survey (2011‑2015) at the county level. For each dataset, the authors compute the four explanatory variables and filter out self‑flows.

Key findings:

  1. The simplest high‑performing expression (complexity 5) that repeatedly appears across all datasets is (f_{ij}=m_j/d_{ij}^{\beta}), where (m_j) denotes the population (either workplace or residential) of the destination and (\beta) is a distance‑decay exponent. This is essentially the classic gravity model’s distance‑decay component, confirming that SR can rediscover well‑known laws without any prior specification.
  2. When the allowed complexity is increased, SR recovers the full gravity model, the radiation model, Schneider’s intervening‑opportunity model, and a previously unreported hybrid exponential‑power‑law distance decay, which the authors explain via the maximum entropy principle.
