Hawkeye: Reproducing GPU-Level Non-Determinism
We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU, without any precision loss, the exact matrix multiplication operations underlying a machine learning training or inference workload that ran on an NVIDIA GPU. This stands in stark contrast to prior approaches to verifiable machine learning, which either impose significant computational overhead on the original model owner or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that probe the rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA’s Tensor Cores. We evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BF16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference.
💡 Research Summary
The paper introduces Hawkeye, a system that enables exact, bit‑level reproduction of NVIDIA Tensor Core matrix‑multiplication operations on a CPU. Modern machine‑learning workloads rely heavily on GPUs, and the nondeterministic behavior of Tensor Cores—stemming from undocumented rounding directions, subnormal handling, and the order of floating‑point accumulation—makes it difficult to verify that a service provider has executed a model correctly. Existing solutions either disable nondeterministic hardware features (incurring large performance penalties) or store intermediate rounding decisions (incurring high storage costs). Hawkeye addresses these limitations by first characterizing the internal arithmetic pipeline of Tensor Cores and then emulating it on a CPU with negligible overhead.
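The accumulation-order issue is easy to see in isolation. The following sketch (my illustration, not the paper's test harness) shows that FP16 addition is non-associative: summing the same three values in two different orders yields two different rounded results, which is exactly why the hardware's reduction order must be known to reproduce its output.

```python
import numpy as np

# FP16 has a 10-bit mantissa, so the unit in the last place at 2048 is 2.
vals = np.array([2048.0, 1.0, 1.0], dtype=np.float16)

# Left-to-right: 2048 + 1 = 2049 ties to even and rounds back to 2048,
# so both small terms are lost.
left = (vals[0] + vals[1]) + vals[2]

# Small terms first: 1 + 1 = 2 is large enough to survive the final add.
right = vals[0] + (vals[1] + vals[2])

print(left, right)  # 2048.0 vs 2050.0
```

The same inputs, two orders, two answers: any bit-exact CPU reproduction must therefore replicate the GPU's exact reduction tree, not just its mathematical result.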
The authors design a suite of micro‑benchmarks that directly invoke the hardware MMA (matrix‑multiply‑accumulate) instruction via inline PTX assembly. Using 16 × 16 tiles for inputs A, B, and accumulator C, they isolate five key aspects of the Tensor Core computation:
- Summation Dependency and Order Test – Determines the exact sequence in which partial sums are combined during accumulation, revealing a tree‑reduction pattern that varies across architectures.
- Internal Precision Detection Test – Identifies the bit‑width of the intermediate accumulator (e.g., 32‑bit for FP16 on Ampere/Hopper, 48‑bit for BF16/FP8), showing how the hardware preserves extra precision to avoid overflow.
- Rounding Mode Detection Test – Discovers that while most operations follow IEEE‑754 “nearest‑even” rounding, specific stages use “toward‑zero” or “away‑from‑zero,” affecting final results.
- Normalization Stage Detection Test – Pinpoints when intermediate results are normalized back to the target floating‑point format, especially after subnormal values appear.
- Subnormal Behavior Detection Test – Characterizes how each architecture treats subnormal numbers; Hopper and Lovelace truncate them to zero, whereas Ampere retains a minimal representation.
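A subnormal probe of the kind the last test describes can be sketched as follows (the specific probe values are my assumption, not the paper's exact inputs): multiply two FP16 values whose exact product falls in the subnormal range, then inspect the raw bit pattern. Hardware that preserves subnormals returns a nonzero encoding; hardware that flushes to zero returns the all-zero pattern.

```python
import numpy as np

a = np.float16(2.0 ** -14)   # smallest normal FP16 value
b = np.float16(0.25)
prod = a * b                 # exact value 2**-16, subnormal in FP16

# On an IEEE-conforming CPU the subnormal is kept; a flush-to-zero
# Tensor Core would instead produce the bit pattern 0x0000.
print(prod)                  # 1.526e-05 when subnormals are preserved
print(hex(prod.view(np.uint16)))  # 0x100 if kept, 0x0 if flushed
```

A single multiplication like this is enough to distinguish the Hopper/Lovelace flush-to-zero behavior from Ampere's, which is the kind of targeted discrimination the test suite relies on.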
From these experiments the authors construct a detailed computational model for each GPU architecture (Ampere, Hopper, Lovelace) and precision type (FP16, BF16, FP8). The model captures the exact order of partial‑sum reductions, the internal precision used at each step, the rounding mode applied, and the handling of subnormals. They then implement a CPU‑based simulator (open‑sourced as gpu‑simulator) that reproduces the identified pipeline using SIMD instructions (AVX‑512/AVX2). The simulator explicitly performs the same rounding and normalization steps, ensuring that every output element matches the GPU result bit‑for‑bit.
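The shape of such a simulator can be sketched in a few lines. The following is a minimal scalar model (my simplification, not Hawkeye's actual SIMD implementation, and the pairwise tree order is an assumed stand-in for whichever reduction order the tests identify on a given architecture): it computes D = A·B + C for FP16 tiles with an FP32 accumulator and a fixed binary-tree reduction over the K dimension.

```python
import numpy as np

def mma_tile_emulate(A, B, C):
    """Emulate D = A @ B + C for FP16 tiles A, B and FP32 accumulator C,
    using exact FP32 partial products and a fixed pairwise tree reduction.
    Assumes K is a power of two (16 for a Tensor Core tile)."""
    A32 = A.astype(np.float32)  # an FP16 x FP16 product is exact in FP32
    B32 = B.astype(np.float32)
    M, K = A.shape
    N = B.shape[1]
    D = np.empty((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            partials = list(A32[i, :] * B32[:, j])  # K exact partial products
            # Fixed tree reduction: combine adjacent pairs until one remains,
            # rounding to FP32 after every addition, as hardware would.
            while len(partials) > 1:
                partials = [np.float32(partials[k] + partials[k + 1])
                            for k in range(0, len(partials), 2)]
            D[i, j] = np.float32(partials[0] + C[i, j])
    return D
```

Because every intermediate rounding step is performed explicitly and in a fixed order, two runs of this function are bit-identical, and it can be matched against hardware output element by element. The real simulator additionally models per-architecture accumulator widths, rounding modes, and subnormal handling uncovered by the tests.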
Extensive evaluation demonstrates 100 % bit‑exact replication for large 4096 × 4096 matrix multiplications across all tested architectures and data types. The performance overhead is modest—approximately 1.2× the native GPU execution time—substantially lower than deterministic CUDA settings, which can be 5× slower. The system also correctly handles edge cases involving extreme values and subnormal numbers, confirming its robustness.
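It is worth being precise about what "bit-exact" means here: raw bit patterns must match, not merely values within a tolerance. The sketch below (my illustration) shows why a numerical comparison is too weak, using the signed-zero case as an example.

```python
import numpy as np

def bit_exact(x, y):
    """Compare FP32 arrays by their raw bit patterns."""
    return np.array_equal(x.view(np.uint32), y.view(np.uint32))

gpu_out = np.array([1.0, -0.0], dtype=np.float32)
cpu_out = np.array([1.0, 0.0], dtype=np.float32)

print(np.array_equal(gpu_out, cpu_out))  # True: -0.0 == 0.0 numerically
print(bit_exact(gpu_out, cpu_out))       # False: the sign bit differs
```

A tolerance-based check would silently absorb exactly the rounding and sign discrepancies that reveal a mismatch between the emulated and actual pipelines, so bit-pattern equality is the right acceptance criterion for this kind of audit.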
Hawkeye’s contribution is twofold. First, it provides a rigorous methodology for dissecting and modeling GPU nondeterminism, filling a gap in the verifiable‑ML literature where prior work assumed deterministic execution. Second, it enables a practical verification workflow: an auditor only needs the GPU model identifier and execution logs to reproduce the exact computation on a CPU, creating an offline “oracle” against which any GPU run can be checked with negligible cost. This capability is especially valuable for cloud‑based ML‑as‑a‑Service platforms, where users must trust that providers are not tampering with models or cutting corners.
Future directions include extending the methodology to other NVIDIA architectures (e.g., RTX series), supporting newer low‑precision formats such as FP4 or INT8, and optimizing the simulator for real‑time streaming workloads. Overall, Hawkeye demonstrates that GPU‑level nondeterminism can be fully understood and faithfully reproduced, paving the way for trustworthy, auditable machine‑learning deployments.