We study generative modeling of variable-length trajectories (sequences of visited locations/items with associated timestamps) for downstream simulation and counterfactual analysis. A recurring practical issue is that standard mini-batch training can be unstable when trajectory lengths are highly heterogeneous, which in turn degrades distribution matching for trajectory-derived statistics. We propose length-aware sampling (LAS), a simple batching strategy that groups trajectories by length and samples batches from a single length bucket, reducing within-batch length heterogeneity (and making updates more consistent) without changing the model class. We integrate LAS into a conditional trajectory GAN with auxiliary time-alignment losses and provide (i) a distribution-level guarantee for derived variables under mild boundedness assumptions, and (ii) an IPM/Wasserstein mechanism explaining why LAS improves distribution matching by removing length-only shortcut critics and targeting within-bucket discrepancies. Empirically, LAS consistently improves matching of derived-variable distributions on a multi-mall dataset of shopper trajectories and on diverse public sequence datasets (GPS, education, e-commerce, and movies), outperforming random sampling across dataset-specific metrics.
Learning realistic trajectory and sequence models (and, increasingly, trajectory generators for simulation and counterfactual analysis) is important in domains such as mobility analytics [Gonzalez et al., 2008, Feng et al., 2018, Mohamed et al., 2020], recommender systems [Kang and McAuley, 2018, Sun et al., 2019, Tagliabue and Yu, 2020], and sequential decision logs in education [Piech et al., 2015]. A key difficulty shared across these settings is variable trajectory length: real sequences can range from a few steps to hundreds, and length is often strongly correlated with other characteristics (e.g., dwell time, inter-event timing, or item/category diversity).
In practice, we train deep generative models with stochastic mini-batches. When trajectory lengths are highly heterogeneous, mini-batches mix very short and very long sequences, encouraging the discriminator/critic to exploit length-correlated signals rather than within-length behavioral structure. This is especially damaging when the goal is distribution matching for trajectory-derived variables, that is, statistics computed from an entire sequence (e.g., total duration, average per-step time, transition structure, or entropy-like measures). As a result, the adversarial objective may improve while important derived-variable distributions remain mismatched, limiting fidelity for downstream simulation.
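To make "trajectory-derived variables" concrete, the following is a minimal sketch of how a few such statistics could be computed from one trajectory. The function name `derived_variables`, the input layout (parallel lists of locations and timestamps), and the specific definitions (for example, Shannon entropy of visit frequencies as the "entropy-like" measure) are illustrative assumptions, not the definitions used in the paper.

```python
import math
from collections import Counter


def derived_variables(locations, timestamps):
    """Compute illustrative sequence-level statistics for one trajectory.

    Assumed (hypothetical) input layout: `locations[i]` is the place/item
    visited at step i and `timestamps[i]` is its non-decreasing time.
    """
    n = len(locations)
    # Total duration: time elapsed between the first and last event.
    total_duration = timestamps[-1] - timestamps[0] if n > 1 else 0.0
    # Average per-step time: mean gap between consecutive events.
    avg_step_time = total_duration / (n - 1) if n > 1 else 0.0
    # Entropy-like measure: Shannon entropy (in nats) of visit frequencies.
    counts = Counter(locations)
    visit_entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "length": n,
        "total_duration": total_duration,
        "avg_step_time": avg_step_time,
        "visit_entropy": visit_entropy,
    }


stats = derived_variables(["a", "b", "a", "c"], [0, 10, 25, 40])
```

Each of these statistics depends on the whole sequence, and several scale directly with its length, which is why length heterogeneity within a batch can dominate the signal a critic sees.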
We address this with a length-aware sampling (LAS) scheme that (i) partitions trajectories into length buckets and (ii) draws each mini-batch from a single bucket. LAS is a training-time intervention (no model changes) that controls within-batch length heterogeneity and makes discriminator/generator updates more consistent in practice. We combine LAS with a conditional trajectory GAN and auxiliary time-alignment losses to build digital twins for trajectory data: generators that can be conditioned on scenario variables to support counterfactual simulation.
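The two steps of LAS can be sketched as a small batch sampler. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `assign_bucket` and `las_batches`, the use of fixed upper bucket edges, and the epoch-style scheme (every trajectory appears once, batch order shuffled) are all hypothetical choices; the source specifies only that each mini-batch is drawn from a single length bucket.

```python
import random
from collections import defaultdict


def assign_bucket(length, edges):
    """Return the index of the first bucket whose upper edge is >= length."""
    for i, edge in enumerate(edges):
        if length <= edge:
            return i
    return len(edges)  # overflow bucket for lengths above the last edge


def las_batches(lengths, edges, batch_size, seed=0):
    """Build mini-batches of trajectory indices, one length bucket per batch.

    `edges` (assumed, hypothetical) are increasing upper bounds on length.
    """
    rng = random.Random(seed)
    # Step (i): partition trajectory indices into length buckets.
    buckets = defaultdict(list)
    for idx, n in enumerate(lengths):
        buckets[assign_bucket(n, edges)].append(idx)
    # Step (ii): cut each bucket into batches, so no batch mixes buckets.
    batches = []
    for members in buckets.values():
        rng.shuffle(members)
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])
    # Shuffle batch order so training interleaves buckets within an epoch.
    rng.shuffle(batches)
    return batches


lengths = [2, 3, 50, 60, 4, 200, 55, 3]
batches = las_batches(lengths, edges=[10, 100], batch_size=2, seed=1)
```

Random sampling corresponds to skipping step (i) and shuffling all indices together. Because the change is confined to how indices are grouped, the generator and discriminator stay untouched.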
Mall digital twin as a motivating case study. Shopping malls remain among the most data-rich yet under-optimized physical marketplaces [Eppli and Benjamin, 1994, Brueckner, 1993, Seiler, 2017]. We study a proprietary dataset of anonymized foot-traffic trajectories collected from four large malls, enabling counterfactual questions such as: How would closing an anchor store, changing the tenant mix, or re-routing flows affect dwell time and the distribution of visits? While the mall application motivates the paper, our method and evaluation are domain-agnostic and are validated on additional public sequence datasets.
• We formalize trajectory generation with derived-variable distribution matching as an evaluation target.
• We propose length-aware sampling (LAS), a simple length-bucket batching strategy, and show how to integrate it into GAN training.
• We provide theory: (i) a Wasserstein bound for derived-variable distributions under boundedness and controlled training losses, and (ii) an IPM/Wasserstein mechanism explaining why LAS improves distribution matching by removing length-only shortcut critics and targeting within-bucket discrepancies.
• We demonstrate empirical gains of LAS over random sampling on a multi-mall dataset and multiple public sequence datasets.
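As an illustration of what matching a derived-variable distribution can mean in practice, the sketch below computes the empirical 1-Wasserstein distance between two equal-size samples of a scalar derived variable (for example, total duration of real versus generated trajectories). The function name `wasserstein_1d` and the equal-sample-size restriction are assumptions made for brevity; the dataset-specific metrics used in the paper may differ.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples.

    For equal-size samples on the real line, the optimal coupling pairs
    the sorted values, so W1 is the mean absolute difference of the
    order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


real_durations = [30.0, 45.0, 120.0, 15.0]
generated_durations = [25.0, 50.0, 100.0, 20.0]
gap = wasserstein_1d(real_durations, generated_durations)
```

Computing the same quantity separately within each length bucket would give a within-bucket discrepancy that differences in length alone cannot explain.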
Our work connects to (i) modeling and generating sequential/trajectory data, (ii) digital twins and counterfactual simulation, and (iii) stabilizing adversarial/stochastic training under heterogeneous data.
Trajectory and sequence modeling. Trajectory data are central in mobility analytics [Gonzalez et al., 2008, Feng et al., 2018, Mohamed et al., 2020]. Beyond mobility, generative sequence modeling has been explored in settings such as pedestrian motion [Gupta et al., 2018] and in general-purpose sequence generators, including GAN-style methods for discrete sequences [Yu et al., 2017] and synthetic time-series generation [Yoon et al., 2019]. In recommender systems, sequential models are widely used to represent and generate user-item trajectories (e.g., recurrent or attention-based models) [Hidasi et al., 2015, Kang and McAuley, 2018, Sun et al., 2019, Wu et al., 2018, Tagliabue and Yu, 2020]. Our focus differs: we optimize and evaluate distribution matching of trajectory-derived statistics and study how batching by length shapes this objective.
Digital twins and counterfactual simulation. Digital twins aim to create forward simulators for complex systems [Grieves and Vickers, 2016, Fuller et al., 2020, Kritzinger et al., 2018, Attaran and Celik, 2023]. In many operational settings (including retail), counterfactual analysis is often addressed with observational causal methods that are inherently backward-looking [Athey, 2017]. We contribute a complementary generative angle: a learned simulator calibrated on observed trajectories that can be conditioned on scenario variables to support “what-if” analyses.
Mall retail analytics and shopper trajectories. Marketing and operations research have studied mall design, tenant mix, and shopper flows,