Improvements of the ALICE GPU TPC tracking and GPU framework for online and offline processing of Run 3 Pb-Pb data
ALICE is the dedicated heavy-ion experiment at the LHC at CERN and records lead-lead collisions at a rate of up to 50 kHz in LHC Run 3. To cope with such collision and data rates, ALICE uses a new GEM TPC with continuous readout and a GPU-based online computing farm for data compression. Operating the first GEM TPC of this size, with large space-charge distortions due to the high collision rate, has many implications for the track reconstruction algorithm, both anticipated and unanticipated. With real Pb-Pb data available, the TPC tracking algorithm needed to be refined, particularly with respect to improved cluster attachment in the inner TPC region. In order to use the online computing farm efficiently when there is no beam in the LHC, ALICE now also runs TPC tracking on GPUs in offline processing. For the future, ALICE aims to run more computing steps on the GPU and to use other GPU-enabled resources besides its online computing farm. These aspects, along with better possibilities for performance optimization, led to several improvements of the GPU framework and GPU tracking code, in particular through Run Time Compilation (RTC). The talk gives an overview of the improvements to the ALICE tracking code, mostly based on experience from reconstructing real Pb-Pb data with high TPC occupancy, as well as of the status of online and offline processing on GPUs and of how RTC improves the ALICE tracking code and GPU support.
💡 Research Summary
ALICE has upgraded its central tracking detector for LHC Run 3 to a GEM‑based Time Projection Chamber (TPC) with continuous read‑out, enabling a record‑breaking Pb‑Pb interaction rate of up to 50 kHz. Because the TPC now records overlapping events within a ∼100 µs drift window, the fundamental reconstruction unit is a “time frame” of about 2.8 ms (32 LHC orbits). This imposes severe demands on the tracking software: it must operate on the full raw data, perform clusterisation, track finding, and compression entirely on GPUs in order to meet the online compression throughput, while also delivering offline‑quality physics performance when the accelerator is not delivering beam.
The original Run 2 High‑Level Trigger (HLT) tracking algorithm, based on a cellular‑automaton seed finder followed by a Kalman‑filter track follower, three‑iteration fitting, and an ambiguity‑resolution step, was ported to GPUs for Run 3. However, real Pb‑Pb data revealed two major shortcomings at the high local TPC occupancies (up to 0.9 a.u.) typical of Run 3: (i) a dramatic loss of clusters in the innermost pad rows (row < 20), leading to a steep drop in ITS‑TPC matching efficiency, and (ii) instability of the three‑iteration fit for low‑pT looping tracks, especially when the fit parameters drift away from the true helix. The loss of inner‑pad clusters was traced to the ambiguity‑resolution stage, where longer tracks “steal” shared clusters from shorter ones, causing many short track segments to be discarded.
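The cluster-stealing behaviour described above can be illustrated with a minimal sketch of a length-priority ambiguity resolution. This is a hypothetical simplification, not the actual O2 implementation: the `Track` struct, the `minKept` cut, and the longest-first ordering are illustrative assumptions, but they reproduce the failure mode in which a long track claims shared clusters first and a short inner segment is then discarded for owning too few clusters of its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Illustrative track: a list of attached cluster ids plus an accept flag.
struct Track {
    std::vector<int> clusterIds;
    bool accepted = false;
};

// Hypothetical length-priority ambiguity resolution: tracks are processed
// longest-first, each accepted track claims its clusters, and a track that
// would retain fewer than minKept unclaimed clusters is dropped. minKept is
// an assumed cut, not the actual O2 threshold.
inline void resolveAmbiguities(std::vector<Track>& tracks, std::size_t minKept) {
    std::sort(tracks.begin(), tracks.end(),
              [](const Track& a, const Track& b) {
                  return a.clusterIds.size() > b.clusterIds.size();
              });
    std::unordered_map<int, bool> claimed;  // clusterId -> already owned
    for (Track& t : tracks) {
        std::size_t own = 0;
        for (int id : t.clusterIds)
            if (!claimed.count(id)) ++own;
        if (own >= minKept) {
            t.accepted = true;
            for (int id : t.clusterIds) claimed[id] = true;
        }
    }
}
```

With this scheme, a short segment that shares all its clusters with a longer track is rejected outright, which is exactly the inner-pad-row loss mode that Phase I addresses.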
To address these issues the authors introduced a three-phase improvement program, each phase implemented in the GPU code base and, where possible, exploiting Run-Time Compilation (RTC) to generate hardware-specific kernels at execution time.
Phase I focuses on stabilising the fit and improving inner‑pad cluster attachment. The key changes are:
- Looping‑track legs are no longer merged into a single helix; each leg is treated as an independent track, preventing long tracks from monopolising inner‑pad clusters.
- The shared‑cluster counting order is reversed (outer‑most to innermost pad rows), allowing more sharing in the inner region where it matters most for low‑pT tracks.
- The interpolation used for cluster‑rejection is moved to the first (inward) fit iteration; the second (outward) iteration stores position and covariance per pad row, and the third (inward) iteration performs the χ²‑based rejection. This eliminates the “domino effect” where early‑iteration rejections destabilise later iterations.
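The χ²-based rejection in the third iteration can be sketched as follows. This is a hedged illustration under assumed names: `RowEstimate` stands for the position and covariance stored per pad row by the second (outward) iteration, and the χ² form and cut value are generic, not the exact O2 formulas.

```cpp
// Per-pad-row state assumed to be stored by the outward fit iteration:
// an interpolated track position and a combined covariance.
struct RowEstimate {
    float y, z;        // interpolated track position at this pad row
    float covY, covZ;  // stored covariance (track + cluster errors)
};

// Generic diagonal chi2 of a cluster against the stored estimate.
inline float clusterChi2(const RowEstimate& e, float clusterY, float clusterZ) {
    const float dy = clusterY - e.y;
    const float dz = clusterZ - e.z;
    return dy * dy / e.covY + dz * dz / e.covZ;
}

// The final (inward) iteration keeps a cluster only if its chi2 passes the
// cut, so rejections no longer feed back into earlier iterations.
inline bool keepCluster(const RowEstimate& e, float y, float z, float chi2Cut) {
    return clusterChi2(e, y, z) <= chi2Cut;
}
```

Because the estimate is frozen before any rejection happens, removing one cluster cannot shift the reference for the next row, which is how the "domino effect" is avoided.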
Phase I was deployed in September 2025. It raises the overall fraction of attached clusters from 60.1 % to 63.3 % (a relative increase of about 5 %) and roughly doubles the ITS‑TPC matching efficiency at the highest occupancies. The price is a ∼50 % increase in the fake‑track rate, but the overall fake‑cluster attachment remains modest (≈1.5 %).
Phase II adds an extrapolation step that attempts to recover missing inner‑pad clusters. After the Phase I fit, each track is extrapolated both inward and outward across sector boundaries; any compatible clusters found are fed into a second ambiguity‑resolution pass, optionally with tighter χ² cuts. The seeding stage is also iterated on the remaining unassigned clusters, providing a second chance to create seeds that were missed initially. Phase II therefore further improves the cluster‑attachment fraction (another ≈2 % gain) while keeping the tracking efficiency essentially unchanged. The downside is a slight rise in fake‑cluster attachment (from 1.28 % to 1.51 %). Ongoing work focuses on heuristic cuts that abort extrapolation when the probability of a fake attachment is high.
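The inward leg of the Phase II recovery can be sketched as below. This is an assumed simplification: clusters are reduced to one coordinate, the compatibility window is a plain distance cut rather than a χ² test, and the consecutive-miss abort is a hypothetical stand-in for the fake-attachment heuristics under development.

```cpp
#include <cmath>
#include <vector>

// Minimal stand-in for a TPC cluster: one transverse coordinate.
struct SimpleCluster { float y; };

// Hypothetical inward extrapolation: starting below the innermost attached
// row, step row by row toward row 0 and attach the first cluster inside the
// compatibility window; abort after maxMisses consecutive empty rows, a
// simple proxy for aborting when a fake attachment becomes likely.
inline int recoverInnerClusters(
    const std::vector<std::vector<SimpleCluster>>& clustersPerRow,
    int firstRow, float trackY, float window, int maxMisses) {
    int attached = 0, misses = 0;
    for (int row = firstRow - 1; row >= 0 && misses < maxMisses; --row) {
        bool found = false;
        for (const SimpleCluster& c : clustersPerRow[row]) {
            if (std::fabs(c.y - trackY) < window) {  // compatibility window
                ++attached;
                found = true;
                break;
            }
        }
        misses = found ? 0 : misses + 1;
    }
    return attached;
}
```

In the real algorithm the recovered clusters are not kept directly but fed into the second ambiguity-resolution pass; the sketch only shows the row-walking and abort logic.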
Phase III (planned) will migrate the remaining reconstruction steps (e.g. calibration, final compression) to GPUs and will fully exploit RTC. By compiling kernels at runtime with knowledge of the exact GPU architecture (core count, memory bandwidth, warp size), the code can be tuned for both the dedicated online farm (EPN) and heterogeneous Grid resources, reducing warp divergence and improving memory coalescing. This will enable a truly unified online/offline GPU pipeline, simplifying operations and further reducing processing latency.
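The RTC idea of specializing kernels for the concrete hardware and configuration can be illustrated with a small sketch. The kernel name, parameters, and constants here are hypothetical; the point is only the mechanism: values known at run time are baked into the kernel source as compile-time constants before the string is handed to a runtime compiler such as NVRTC or hipRTC, letting the compiler fold them and reduce branching and register pressure.

```cpp
#include <string>

// Hypothetical kernel-source generator: runtime-known parameters (an assumed
// per-row cluster limit and chi2 cut) become constexpr constants in the
// generated source, which would then be compiled by NVRTC/hipRTC for the
// exact GPU architecture in use.
inline std::string specializeKernel(int maxClustersPerRow, float chi2Cut) {
    return "constexpr int kMaxClustersPerRow = " +
           std::to_string(maxClustersPerRow) + ";\n" +
           "constexpr float kChi2Cut = " + std::to_string(chi2Cut) + "f;\n" +
           "__global__ void rejectClusters(/* ... */) { /* kernel body */ }\n";
}
```

Because the same generator can emit source for any target, one code base can serve both the EPN farm and heterogeneous Grid GPUs, which is the unification Phase III aims at.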
Performance studies presented in Figures 1 and 2 illustrate the impact of the improvements. Figure 1 shows that the ITS‑TPC matching efficiency, which fell below 0.6 at local occupancies >0.8 with the original algorithm, recovers to ≈0.8 after Phase I. Figure 2 compares tracking efficiency, clone rate, and fake rate for three configurations: the original algorithm (up to August 2025), the Phase I‑enhanced version (deployed September 2025), and a development version including Phase II features. Phase I yields a noticeable efficiency gain at low pT (≤0.2 GeV/c) and a substantial reduction in fake and clone rates. Adding Phase II maintains the efficiency while further lowering the fake rate, albeit with a modest increase in fake‑cluster attachment that is being mitigated.
In summary, the paper demonstrates that high‑occupancy continuous TPC data can be reconstructed with GPU acceleration while meeting the stringent physics performance required for Run 3. The three‑phase algorithmic refinements—adjusted shared‑cluster handling, earlier interpolation, and extrapolation‑based cluster recovery—combined with runtime‑compiled GPU kernels, provide a robust solution that bridges online compression needs and offline analysis quality. Future work will refine the extrapolation heuristics, complete the Phase III GPU‑only pipeline, and integrate real‑time space‑charge distortion corrections, positioning ALICE to fully exploit the unprecedented data rates of Run 3 and beyond.