Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory
We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground truth, they lack an effective training paradigm for real-world videos due to noisy pose estimates and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion via tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from corruption by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.
💡 Research Summary
Infinite‑World tackles the long‑standing reality gap in interactive world modeling by introducing three complementary innovations that enable coherent long‑horizon simulation on noisy real‑world video data. First, the Hierarchical Pose‑free Memory Compressor (HPMC) replaces quadratic‑time attention and pose‑dependent retrieval with a recursive compression pipeline. Short sequences (≤ k·T_max) are locally compressed by a lightweight temporal encoder that downsamples latent frames by a factor of four, preserving fine‑grained dynamics. For longer horizons, the model partitions the latent history into overlapping chunks using a sliding window, applies the same encoder to each chunk (local compression), concatenates the resulting tokens, and then performs a second‑stage global compression. This two‑level process recurses until the entire history is represented by a fixed‑size token set z_com that never exceeds the pre‑allocated memory budget T_max, guaranteeing constant computational cost regardless of sequence length. Crucially, the compressor f_ϕ is trained jointly with the diffusion transformer (DiT) backbone, so the compression learns to retain exactly those historical cues that minimize future frame generation loss. As a result, the model can anchor generations to distant past frames without any external pose information, achieving stable spatial consistency over 1000+ frames.
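The recursive local-then-global compression described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: mean-pooling stands in for the learned temporal encoder f_ϕ, and the budget `T_MAX`, chunk length, and stride values are hypothetical.

```python
import numpy as np

T_MAX = 64    # fixed memory budget in tokens (illustrative value)
CHUNK = 256   # sliding-window chunk length (illustrative value)
STRIDE = 192  # chunk stride; CHUNK - STRIDE frames overlap (illustrative)

def local_compress(latents: np.ndarray) -> np.ndarray:
    """Stand-in for the learned temporal encoder: 4x temporal
    downsampling by mean-pooling groups of four latent frames."""
    pad = (-len(latents)) % 4
    if pad:  # repeat the last frame so the length is divisible by 4
        latents = np.concatenate([latents, np.repeat(latents[-1:], pad, axis=0)])
    return latents.reshape(-1, 4, latents.shape[-1]).mean(axis=1)

def compress_history(latents: np.ndarray) -> np.ndarray:
    """Recursively distill an arbitrarily long latent history into at
    most T_MAX tokens (the fixed-budget representation z_com)."""
    if len(latents) <= 4 * T_MAX:   # short history: one local pass suffices
        return local_compress(latents)
    # Partition into overlapping chunks, compress each locally, then
    # concatenate and recurse -- the second-stage "global" compression.
    chunks = [latents[i:i + CHUNK] for i in range(0, len(latents), STRIDE)]
    compressed = np.concatenate([local_compress(c) for c in chunks], axis=0)
    return compress_history(compressed)
```

Because every recursion level shrinks the history by roughly the downsampling factor, the output token count stays bounded by `T_MAX` no matter how long the input sequence grows, which is what yields the constant per-step cost.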
Second, the Uncertainty‑aware Action Labeling module addresses noisy pose estimates and the lack of clean action annotations in real video. After estimating relative 6‑DoF camera motion ΔP between consecutive frames, the method decouples translation magnitude ‖ΔP_trans‖ and rotation magnitude ‖ΔP_rot‖. Two thresholds τ₁ (noise floor) and τ₂ (action trigger) define a tri‑state label: “No‑operation” if the magnitude is below τ₁, “Discrete Action” if it exceeds τ₂, and “Uncertain” for intermediate values. The “Uncertain” state is explicitly retained rather than discarded, allowing the training pipeline to make use of all raw frames while shielding the deterministic action space from jitter‑induced corruption. Discrete actions are mapped to intuitive keyboard commands (W/A/S/D for translation, arrow keys for rotation), providing a clean, robust control signal for the generative model.
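The tri-state labeling rule above can be sketched in a few lines. The threshold values, the camera-axis convention (x = right, z = forward), and the restriction of rotation to yaw are all illustrative assumptions; the paper's actual thresholds and coordinate frames may differ.

```python
import numpy as np

# Illustrative thresholds standing in for the paper's tau_1 (noise floor)
# and tau_2 (action trigger); the concrete values are assumptions.
TAU1_TRANS, TAU2_TRANS = 0.02, 0.10   # translation magnitude per frame
TAU1_ROT, TAU2_ROT = 0.5, 3.0         # rotation magnitude (degrees) per frame

def tri_state(magnitude: float, tau1: float, tau2: float) -> str:
    """Tri-state logic: below the noise floor -> no-op, above the
    action trigger -> discrete action, in between -> uncertain."""
    if magnitude < tau1:
        return "no-op"
    if magnitude > tau2:
        return "action"
    return "uncertain"

def label_transition(delta_trans: np.ndarray, delta_yaw_deg: float) -> dict:
    """Label one frame transition from the estimated relative camera
    motion, decoupling translation and rotation magnitudes.

    delta_trans: 3-vector (assumed convention: x = right, z = forward)
    delta_yaw_deg: signed yaw change in degrees (positive = turn right)
    """
    label = {
        "translation": tri_state(float(np.linalg.norm(delta_trans)),
                                 TAU1_TRANS, TAU2_TRANS),
        "rotation": tri_state(abs(delta_yaw_deg), TAU1_ROT, TAU2_ROT),
    }
    if label["translation"] == "action":
        right, fwd = delta_trans[0], delta_trans[2]
        if abs(fwd) >= abs(right):   # dominant axis picks the W/A/S/D key
            label["key"] = "W" if fwd > 0 else "S"
        else:
            label["key"] = "D" if right > 0 else "A"
    if label["rotation"] == "action":
        label["arrow"] = "right" if delta_yaw_deg > 0 else "left"
    return label
```

A forward step of 0.2 units with no rotation, for example, is labeled a discrete "W" action, while a 0.005-unit jitter falls below the noise floor and becomes a no-op; anything in between is kept as "uncertain" rather than discarded.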
Third, the authors observe that long‑range memory activation depends more on the density of viewpoint revisits than on total data volume. Guided by a pilot study, they curate a 30‑minute Revisit‑Dense Dataset (RDD) containing frequent loop‑closure events and fine‑grained viewpoint changes. After pre‑training on large open‑domain video corpora, Infinite‑World is fine‑tuned on RDD using a “Revisit‑Dense Finetuning” strategy that reinforces the hierarchical compressor’s ability to recall distant scenes. This targeted finetuning dramatically improves loop‑closure performance without requiring massive additional data.
Extensive evaluation combines objective metrics (Fréchet Video Distance, CLIP‑Score, action‑matching accuracy) and human user studies. Infinite‑World achieves a >30 % reduction in FVD and a 0.12 increase in CLIP‑Score over prior state‑of‑the‑art methods such as Genie‑3 and RELIC, while action accuracy improves by 18 %. In user studies, participants rate visual fidelity, spatial consistency, and controllability at an average of 4.6/5, noting that global landmarks (e.g., window and desk layout) remain stable even after 1000 frames. Qualitative examples show the model accurately rendering viewpoint changes in response to keyboard inputs and preserving scene geometry across long loops.
In summary, Infinite‑World delivers a pose‑free, memory‑efficient, and action‑robust framework that bridges the synthetic‑real divide for interactive world models. By jointly learning hierarchical compression, uncertainty‑aware discretization of motion, and a revisit‑dense fine‑tuning regime, the system maintains coherent world states over unprecedented horizons. Future work may explore further compression‑rate optimization, multimodal extensions (audio, text), and deployment on real‑time robotic platforms.