GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that won first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at https://github.com/ron941/GTATrack-STC2025.


💡 Research Summary

GTATrack is a two‑stage hierarchical framework designed to tackle the unique challenges of multi‑object tracking (MOT) in soccer videos captured by static fisheye cameras. The authors identify four major sources of difficulty: (1) highly irregular player motion that violates the linear‑motion assumptions of Kalman‑filter‑based trackers, (2) extreme scale variation caused by the wide‑angle lens, which makes distant players appear as tiny, low‑resolution blobs, (3) near‑identical appearance due to uniform jerseys, and (4) frequent, prolonged occlusions during physical contact. To address these, GTATrack combines a motion‑agnostic online association module (Deep‑EIoU) with a global trajectory‑level refinement module (GTA‑Link), while also enhancing detector recall through a pseudo‑labeling strategy.

Detection Backbone – The authors evaluate two detectors, YOLOv11x (a high‑capacity one‑stage CNN) and SO‑DETR (a transformer‑based small‑object detector). YOLOv11x is selected for its balance of speed, multi‑scale feature extraction, and robustness to the fisheye distortion. To improve recall on small, peripheral players, a semi‑supervised pseudo‑labeling pipeline is employed: the fine‑tuned YOLO model generates high‑confidence predictions on unlabeled frames, which are then added to the training set as additional ground‑truth boxes. This step significantly raises detection rates for tiny targets without sacrificing precision.
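
The pseudo-labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Box` type, the function names, and the 0.8 confidence threshold are all assumptions made for the example, and the summary does not state which threshold was actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with a confidence score (hypothetical container)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float = 1.0  # human-annotated boxes default to full confidence


def select_pseudo_labels(predictions, min_score=0.8):
    """Keep only predictions confident enough to be treated as labels.

    min_score=0.8 is an assumed value, not one taken from the paper.
    """
    return [box for box in predictions if box.score >= min_score]


def build_training_set(labeled, unlabeled_predictions, min_score=0.8):
    """Merge annotated frames with pseudo-labeled frames.

    labeled: dict mapping frame id -> list of annotated boxes
    unlabeled_predictions: dict mapping frame id -> detector outputs
    """
    merged = dict(labeled)
    for frame_id, predictions in unlabeled_predictions.items():
        kept = select_pseudo_labels(predictions, min_score)
        if kept:  # frames with no confident box add nothing to training
            merged[frame_id] = kept
    return merged
```

The detector would then be retrained on the merged set. A stricter threshold trades recall of the pseudo-labels for fewer noisy boxes entering training.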

Appearance Embedding – For each detected player crop, a lightweight OSNet model extracts a 256‑dimensional L2‑normalized feature vector. OSNet’s omni‑scale design captures both fine‑grained texture (e.g., shoe patterns) and global body shape, which is crucial when jersey colors are identical across teammates. The cosine distance between two embeddings defines the appearance cost used later in data association.
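
As a small illustration of the appearance cost, the sketch below computes pairwise cosine distances between two sets of embeddings with NumPy. The function names are hypothetical, and the embeddings are assumed to arrive as plain arrays (for example, one 256-dimensional row per crop); the re-normalization is only a safeguard in case the inputs are not already unit length.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length; eps guards against zero vectors."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def cosine_distance_matrix(track_feats, det_feats):
    """Return a (num_tracks, num_dets) matrix of 1 - cosine similarity."""
    a = l2_normalize(track_feats)
    b = l2_normalize(det_feats)
    return 1.0 - a @ b.T
```

A value near 0 means two crops look alike, and a value near 1 means their embeddings are orthogonal.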

Deep‑EIoU Online Tracker – Traditional online trackers (SORT, DeepSORT, OC‑SORT) rely on Kalman filters to predict the next position, an assumption that fails under abrupt accelerations and direction changes typical of soccer. Deep‑EIoU discards motion prediction entirely. Instead, it computes an “Expansion IoU” (EIoU) by iteratively enlarging the candidate bounding box (e.g., scaling by 1.2× per iteration) and measuring overlap with the detection. This expansion creates a flexible spatial search region that can accommodate sudden jumps. The final matching cost for a detection‑tracklet pair is a weighted sum of (1 − EIoU) and the cosine appearance distance. The Hungarian algorithm solves the resulting linear assignment problem, yielding a set of high‑confidence frame‑to‑frame links that form initial tracklets. Because no motion model is imposed, the tracker remains robust to non‑linear trajectories and temporary occlusions.
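
The association step can be sketched as below. Several choices here are assumptions made for illustration rather than details confirmed by the summary: both boxes are expanded symmetrically by a fraction `e` of their width and height, only a single expansion level is applied (an iterative variant would retry unmatched pairs with a larger `e`), and the weight `w_app` and gate `max_cost` are placeholder values. The assignment itself uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def expand(box, e):
    """Grow an (x1, y1, x2, y2) box by a fraction e of its size on each side."""
    x1, y1, x2, y2 = box
    dw, dh = e * (x2 - x1), e * (y2 - y1)
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)


def iou(a, b):
    """Plain intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def expansion_iou(track_box, det_box, e):
    """IoU measured after expanding both boxes (assumed symmetric expansion)."""
    return iou(expand(track_box, e), expand(det_box, e))


def associate(track_boxes, det_boxes, app_dist, e=0.5, w_app=0.5, max_cost=0.8):
    """Match tracks to detections on a weighted spatial + appearance cost.

    app_dist[i][j] is the cosine distance between track i and detection j.
    Returns a list of (track_index, detection_index) pairs.
    """
    cost = np.empty((len(track_boxes), len(det_boxes)))
    for i, t in enumerate(track_boxes):
        for j, d in enumerate(det_boxes):
            spatial = 1.0 - expansion_iou(t, d, e)
            cost[i, j] = (1.0 - w_app) * spatial + w_app * app_dist[i][j]
    rows, cols = linear_sum_assignment(cost)
    # Drop assignments whose cost exceeds the gate; those stay unmatched.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```

For instance, the boxes `(0, 0, 10, 20)` and `(12, 0, 22, 20)` do not overlap at all, so their plain IoU is 0, yet with `e=0.5` the expanded boxes overlap and the pair remains a candidate match. This is the sense in which expansion stands in for a motion model.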

GTA‑Link Global Refinement – Even with a strong online stage, fragmented tracklets and identity switches still occur, especially after long occlusions or when a player re‑enters from the periphery. GTA‑Link operates offline on the set of initial tracklets. It first computes spatio‑temporal distances (based on predicted positions and timestamps) and appearance distances (average OSNet embeddings) between all pairs of tracklets. Using hierarchical clustering, it groups tracklets that are both close in space and time (for example, separated by a gap within a configurable time window such as 10 frames) and similar in appearance. Within each cluster, a graph‑based optimization selects the most plausible connections, effectively merging fragments and correcting ID swaps. This global reasoning improves long‑term identity consistency, as reflected in the gains in HOTA and IDF1 reported in the experiments.
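
To make the merging idea concrete, here is a simplified sketch of constrained agglomerative merging. It is not the GTA‑Link implementation: the data layout, the thresholds `max_dist` and `max_gap`, and the greedy closest-pair loop are assumptions for illustration, and the graph‑based optimization and position-based distances mentioned above are omitted. Two groups are merged only if they never share a frame (one player cannot be in two places), their frame gap is small, and their mean embeddings are close.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity of two 1-D feature vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def frame_gap(fa, fb):
    """Frames between the end of one group and the start of the other."""
    return max(min(fb) - max(fa), min(fa) - max(fb))


def merge_tracklets(tracklets, max_dist=0.3, max_gap=10):
    """Greedily merge tracklets that plausibly belong to the same player.

    tracklets: dict id -> {"frames": iterable of ints, "feat": mean embedding}
    Returns a list of sorted id groups, one per merged identity.
    """
    clusters = [
        {"ids": [tid], "frames": set(t["frames"]),
         "feat": np.asarray(t["feat"], dtype=float)}
        for tid, t in tracklets.items()
    ]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if not a["frames"].isdisjoint(b["frames"]):
                    continue  # temporal overlap: cannot be the same player
                if frame_gap(a["frames"], b["frames"]) > max_gap:
                    continue  # too far apart in time to link safely
                d = cosine_distance(a["feat"], b["feat"])
                if d <= max_dist and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        a, b = clusters[i], clusters.pop(j)  # j > i, so index i stays valid
        na, nb = len(a["frames"]), len(b["frames"])
        a["feat"] = (na * a["feat"] + nb * b["feat"]) / (na + nb)
        a["frames"] |= b["frames"]
        a["ids"] += b["ids"]
    return [sorted(c["ids"]) for c in clusters]
```

In this sketch the temporal-overlap check does most of the work of preventing two teammates with near-identical jerseys from being fused into one identity.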

Experimental Results – The method was evaluated on the SoccerTrack 2025 benchmark, which consists of fisheye‑recorded soccer matches with dense player interactions. GTATrack achieved a HOTA score of 0.60, surpassing the next best baseline (≈0.48) by a wide margin. False positives dropped to 982 from the baseline’s 1,450, and ID switches were reduced by more than 60 %. An ablation study confirmed the contribution of each component: removing pseudo‑labeling decreased small‑object recall by ~8 %; replacing Deep‑EIoU with a Kalman‑filter tracker increased ID switches by 2.3×; disabling GTA‑Link led to a 0.07 drop in HOTA. The full pipeline runs at >30 FPS on a single RTX 3090, demonstrating real‑time feasibility.

Contributions and Impact – The paper’s primary contributions are: (1) introducing Deep‑EIoU, a motion‑agnostic association method that leverages iterative box expansion and deep appearance cues, (2) proposing GTA‑Link, a global trajectory‑level clustering and merging algorithm that restores long‑term identity consistency, (3) applying pseudo‑labeling to boost detector performance on extremely small, distorted targets typical of fisheye views, and (4) delivering a complete, real‑time system that sets a new state‑of‑the‑art on a challenging sports MOT benchmark. The authors argue that the local‑global synergy of GTATrack can be transferred to other sports (basketball, volleyball) and to any wide‑angle surveillance scenario where motion irregularity and scale disparity are prevalent. Future work may explore transformer‑based global reasoning to further reduce computational overhead and extend the framework to multi‑camera collaborative tracking.

