AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl’s law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.


💡 Research Summary

AutoKernel is an open‑source framework that automates the optimization of GPU kernels for arbitrary PyTorch models by mimicking the iterative workflow of a human kernel engineer. The system begins by profiling a given model with torch.profiler, collecting per‑kernel GPU execution times across configurable warm‑up and profiling runs. Using pattern matching on kernel names, each kernel is classified into one of nine operation types (e.g., matmul, softmax, layernorm). The profiler also detects the underlying hardware (various NVIDIA and AMD GPUs) and estimates peak FP16/FP32 throughput for unknown devices.
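The name-based classification step can be sketched in plain Python. The pattern table below is hypothetical — the summary does not reproduce AutoKernel's actual regexes — but it illustrates the mechanism of mapping profiler-reported kernel names to operation types:

```python
import re

# Hypothetical pattern table: the concrete regexes AutoKernel uses are not
# given in the summary; these three illustrate the name-matching mechanism.
KERNEL_PATTERNS = {
    "matmul": re.compile(r"gemm|matmul", re.IGNORECASE),
    "softmax": re.compile(r"softmax", re.IGNORECASE),
    "layernorm": re.compile(r"layer_?norm", re.IGNORECASE),
}

def classify_kernel(name: str) -> str:
    """Map a profiler-reported kernel name to one of the operation types."""
    for op_type, pattern in KERNEL_PATTERNS.items():
        if pattern.search(name):
            return op_type
    return "other"  # the real classifier covers nine operation types
```

For example, a cuBLAS-style name such as `ampere_sgemm_128x64_nn` would match the `matmul` pattern via its `gemm` substring.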

After profiling, AutoKernel ranks kernels by their contribution to total runtime using Amdahl’s law, which quantifies the expected end‑to‑end speedup from accelerating a particular kernel. For each high‑impact kernel, an extractor generates a standalone kernel file containing a starter implementation, model‑specific shape variants (base, half‑scale, double‑scale), FLOPs and memory‑traffic formulas, and datatype‑specific tolerances.
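Amdahl's law makes the ranking concrete: if a kernel accounts for fraction f of total runtime and is accelerated by factor s, the end-to-end speedup is 1 / ((1 − f) + f / s). A minimal sketch of the prioritization step, with the assumed per-kernel speedup as an illustrative parameter:

```python
def amdahl_speedup(time_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when a kernel taking `time_fraction` of total
    runtime is made `kernel_speedup` times faster (Amdahl's law)."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / kernel_speedup)

def rank_by_impact(kernel_times_ms: dict, assumed_speedup: float = 2.0) -> list:
    """Order kernels by the end-to-end gain an assumed per-kernel speedup
    would yield; `assumed_speedup` is an illustrative placeholder."""
    total = sum(kernel_times_ms.values())
    return sorted(
        kernel_times_ms,
        key=lambda k: amdahl_speedup(kernel_times_ms[k] / total, assumed_speedup),
        reverse=True,
    )
```

A kernel consuming 50% of runtime, sped up 2x, yields only a 1.33x end-to-end gain — which is exactly why the heaviest kernels are attacked first.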

The core optimization loop follows Algorithm 1. An LLM‑based agent reads a 909‑line instruction document (program.md) that defines a six‑tier optimization playbook: (1) block‑size tuning, (2) memory‑access improvements, (3) compute‑level refinements, (4) advanced techniques such as split‑K and persistent kernels, (5) architecture‑specific features (e.g., TMA on Hopper, cp.async on Ampere), and (6) kernel‑specific tricks (online softmax for attention, Welford's algorithm for RMSNorm). The agent modifies only a single kernel file per iteration, keeping diffs minimal and enabling clean git‑based revert operations.
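Tier 1 of the playbook can be pictured as a sweep over launch configurations. The values below are hypothetical (program.md's actual search space is not reproduced in this summary), but they show the shape of a block‑size tuning pass:

```python
from itertools import product

# Hypothetical tier-1 search space; the concrete values the agent tries
# come from program.md and are not listed in this summary.
BLOCK_SIZES = (64, 128, 256)
NUM_WARPS = (2, 4, 8)
NUM_STAGES = (2, 3, 4)

def tier1_configs() -> list:
    """Enumerate (block size, warps, pipeline stages) candidates, the kind
    of launch parameters a tier-1 tuning pass would try one at a time."""
    return [
        {"BLOCK_SIZE": b, "num_warps": w, "num_stages": s}
        for b, w, s in product(BLOCK_SIZES, NUM_WARPS, NUM_STAGES)
    ]
```

Unlike a brute-force autotuner, the agent applies one such change per iteration, so each configuration lands as its own small, revertible git commit.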

Each edited kernel is passed through a five‑stage correctness harness before any performance measurement:

  1. Smoke test – a quick forward pass on a tiny input to catch compilation or shape errors.
  2. Shape sweep – execution across 8‑10 input configurations and three data types (FP16, BF16, FP32) to expose size‑dependent bugs.
  3. Numerical stability – adversarial inputs that trigger overflow/underflow or extreme dynamic range, ensuring the kernel’s arithmetic remains accurate.
  4. Determinism – three repeated runs must produce bit‑wise identical outputs, detecting race conditions or nondeterministic atomics.
  5. Edge cases – non‑power‑of‑two dimensions (e.g., 1023, 4097) to verify correct handling of tile remainders and masking logic.
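Two of these stages, determinism (4) and edge cases (5), are easy to sketch in backend-agnostic Python. Here `kernel_fn` stands in for either backend's entry point, the helper names are illustrative, and exact equality substitutes for the dtype-specific tolerances the real harness applies:

```python
def check_determinism(kernel_fn, inputs, runs: int = 3) -> bool:
    """Stage 4: repeated runs must produce bit-wise identical outputs."""
    outputs = [kernel_fn(*inputs) for _ in range(runs)]
    return all(out == outputs[0] for out in outputs[1:])

# Stage 5: non-power-of-two sizes that exercise tile remainders and masking.
EDGE_CASE_DIMS = (1023, 4097)

def check_edge_cases(kernel_fn, reference_fn, make_input) -> bool:
    """Compare the kernel against a reference on awkward shapes.
    Illustrative: the real harness compares within dtype tolerances."""
    for dim in EDGE_CASE_DIMS:
        x = make_input(dim)
        if kernel_fn(x) != reference_fn(x):
            return False
    return True
```

The determinism gate is what catches race conditions and nondeterministic atomics: a kernel that is merely "close" across runs fails, since the requirement is bit-wise identity.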

Only after all five stages pass does the harness measure throughput (TFLOPS for compute‑bound kernels or GB/s for memory‑bound kernels). If the new throughput exceeds the previous best by at least 1 %, the change is kept; otherwise the system reverts the git commit. The orchestrator monitors several move‑on criteria: five consecutive reverts, reaching 90 % of the GPU's theoretical peak, a two‑hour time budget, or achieving a 2× speedup. When any criterion triggers, the system proceeds to the next bottleneck kernel.
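The keep/revert rule and the move-on criteria can be combined into a small state machine. This is a sketch under the thresholds stated above (1 % minimum gain, five reverts, 90 % of peak, 2× speedup); the class and attribute names are illustrative, and the time budget is omitted for brevity:

```python
class OptimizationMonitor:
    """Sketch of the keep/revert rule and move-on criteria; thresholds
    mirror the summary, names are illustrative."""

    def __init__(self, baseline: float, peak: float, min_gain: float = 0.01,
                 max_reverts: int = 5, peak_fraction: float = 0.9,
                 target_speedup: float = 2.0):
        self.baseline = baseline          # starting throughput
        self.best = baseline              # best validated throughput so far
        self.peak = peak                  # GPU theoretical peak throughput
        self.min_gain = min_gain
        self.max_reverts = max_reverts
        self.peak_fraction = peak_fraction
        self.target_speedup = target_speedup
        self.consecutive_reverts = 0

    def record(self, throughput: float) -> bool:
        """Return True to keep the change, False to revert the commit."""
        if throughput > self.best * (1.0 + self.min_gain):
            self.best = throughput
            self.consecutive_reverts = 0
            return True
        self.consecutive_reverts += 1
        return False

    def should_move_on(self) -> bool:
        """True once any move-on criterion fires (time budget omitted)."""
        return (self.consecutive_reverts >= self.max_reverts
                or self.best >= self.peak_fraction * self.peak
                or self.best >= self.target_speedup * self.baseline)
```

Note that a change must beat the *best* result so far, not the baseline: a 0.5 % improvement over the current best is reverted even though it beats the original kernel.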

AutoKernel supports two backends: Triton and CUDA C++. Triton provides a Python‑like DSL with rapid JIT compilation (1–5 seconds), enabling quick exploration of block sizes, warps, stages, accumulator precision, and loop structures. It typically attains 80–95 % of cuBLAS performance for matmul. The CUDA C++ backend offers direct access to low‑level primitives such as WMMA fragments, warp‑shuffle intrinsics, vectorized loads (float4, half2), bank‑conflict‑free shared memory layouts, double buffering, and explicit register control via launch_bounds. Both backends expose an identical kernel_fn() interface, allowing the benchmark harness to remain unchanged regardless of implementation.
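Because both backends expose the same kernel_fn() entry point, the timing harness can be written once. The sketch below uses wall-clock time for illustration; a real GPU harness would synchronize the device (e.g., torch.cuda.synchronize) around the timer, and the helper names are assumptions:

```python
import time

def benchmark(kernel_fn, inputs, warmup: int = 10, iters: int = 100) -> float:
    """Return mean seconds per call of any backend's kernel_fn.
    Simplified: a real GPU harness synchronizes around the timer."""
    for _ in range(warmup):
        kernel_fn(*inputs)          # warm caches and JIT before timing
    start = time.perf_counter()
    for _ in range(iters):
        kernel_fn(*inputs)
    return (time.perf_counter() - start) / iters

def achieved_gbps(bytes_moved: int, seconds_per_call: float) -> float:
    """Memory-bound metric: bytes moved per call over time per call."""
    return bytes_moved / seconds_per_call / 1e9
```

Keeping the harness backend-agnostic is what lets the agent swap a Triton kernel for a CUDA C++ one (or vice versa) without touching the measurement code.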

Experimental evaluation on an NVIDIA H100 GPU demonstrates substantial gains. AutoKernel’s Triton kernels achieve 5.29× speedup over PyTorch eager for RMSNorm, 2.82× for softmax, and 2.21× for cross‑entropy. Compared to torch.compile with max‑autotune, the same kernels are respectively 2.83×, 3.44×, and 2.94× faster. In community benchmarks, an AutoKernel‑optimized vectorsum_v2 kernel secured first place on the B200 leaderboard, and a Triton FP4 matmul kernel outperformed CUTLASS by 1.63–2.15×.

The paper highlights three distinguishing contributions relative to prior LLM‑driven kernel optimization work: (1) model‑level profiling and Amdahl‑guided prioritization of kernels, (2) unified support for both Triton and CUDA C++ within a single framework, and (3) a minimalist loop (edit‑benchmark‑keep/revert) coupled with a rigorous five‑stage correctness pipeline, trading architectural complexity for transparency, reproducibility, and reliability. All source code, starter kernels, and the agent instruction document are publicly released at https://github.com/RightNow-AI/autokernel, facilitating community adoption and further research.

