FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Reading time: 5 minutes
...

📝 Original Info

  • Title: FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
  • ArXiv ID: 2601.00227
  • Date: 2026-01-01
  • Authors: Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye, Charlie Ruan, Yingyi Huang, Yineng Zhang, Liangsheng Yin, Aksara Bayyapu, Luis Ceze, Tianqi Chen

📝 Abstract

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.

💡 Deep Analysis

📄 Full Content

The rapid advancement of Large Language Models (LLMs) has catalyzed a new era of computing, but their widespread deployment is increasingly constrained by the performance and cost of their underlying inference systems (Zheng et al., 2025; Kwon et al., 2023; NVIDIA, 2025; MLC team, 2023-2025). At the heart of these systems are GPU kernels that execute core operations such as attention, matrix multiplication, and sampling. Optimizing the GPU kernels used in LLM inference systems requires deep, expert-level engineering effort.

This paper asks a practical question: how can AI-generated kernels be effectively incorporated into production LLM systems? Recent works (Ouyang et al., 2025; Li et al., 2025; Baronio et al., 2025; Fisches et al., 2025) show early promise that LLMs can produce complex low-level GPU code. However, three fundamental challenges remain in bridging AI generation to real-world deployment. First, kernels in LLM systems depend on many workload characteristics, such as ragged input distributions and data precision, that affect their performance; this information must be communicated to AI agents effectively. Second, real-world LLM inference traffic may differ from the typical uniform or random setup used to benchmark a single kernel, so we need an effective way to track kernel performance on real-world LLM inference workloads. Finally, an integration gap remains even after an AI agent generates promising kernel candidates, since bringing them into end-to-end LLM systems can take extra engineering effort.

To address these challenges, we introduce FlashInfer-Bench, a benchmark and standard operational flow for AI-driven LLM systems (Figure 1). To standardize workloads, we introduce FlashInfer Trace, a self-contained standard JSON schema that describes the kernel task, the workloads, the solution, and the final evaluation result. Building on this schema, we curate the FlashInfer-Bench Dataset from real-world LLM workloads. We also design a robust kernel benchmarking framework on top of FlashInfer Trace that features runtime isolation to prevent performance-related reward hacking and includes specialized support for evaluating low-bit and non-deterministic sampling kernels. Finally, we build a dynamic kernel substitution mechanism that updates the FlashInfer kernel library to redirect operators to the optimal kernel recorded in a FlashInfer-Bench trace at runtime. This approach lets us integrate LLM-generated kernels into open-source LLM engines such as SGLang and vLLM with no code changes.
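To make the schema concrete, here is a minimal sketch of what a Trace-style record could look like, written as a Python dict and serialized to JSON. The field names, shapes, and values below are illustrative assumptions, not the actual FlashInfer Trace schema.

```python
import json

# Hypothetical Trace-style record. Field names and values are illustrative
# only; they are not the actual FlashInfer Trace schema.
trace_record = {
    "definition": {                     # what the kernel must compute
        "name": "gqa_decode_attention",
        "inputs": {"q": "f16[batch, heads, head_dim]",
                   "kv_cache": "f16[pages, 2, kv_heads, page_size, head_dim]"},
        "outputs": {"o": "f16[batch, heads, head_dim]"},
    },
    "workload": {                       # shapes/statistics drawn from real serving traces
        "batch_size": 64,
        "sequence_lengths": [512, 1024, 4096],
        "dtype": "float16",
    },
    "solution": {                       # a candidate implementation produced by an agent
        "language": "triton",
        "source": "# ... kernel source code ...",
    },
    "evaluation": {                     # result of the benchmarking framework
        "correct": True,
        "speedup_over_baseline": 1.18,
    },
}

print(json.dumps(trace_record, indent=2))
```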

We build a live leaderboard to track the GPU programming capabilities of frontier models across real-world LLM workloads, and we conduct a comprehensive study of the current state of LLM agents on real-world LLM inference systems. Our evaluation and analysis show that:

(1) most correctness errors come from compilation failures; (2) models struggle to exploit hardware-specific details such as architectural specifications or intrinsics; and (3) a language trade-off exists: high-level languages like Triton yield better performance on most tasks, while low-level CUDA provides more potential for specialized optimization.
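To illustrate the kind of correctness- and performance-aware check that such an evaluation relies on, here is a minimal benchmarking sketch. The candidate kernel, tolerances, and timing method are assumptions for illustration; the actual framework additionally runs candidates with runtime isolation and has specialized handling for low-bit and non-deterministic sampling kernels.

```python
import time
import numpy as np

def reference_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Ground-truth implementation the candidate kernel is checked against."""
    return a @ b

def candidate_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stand-in for an agent-generated kernel; here it just calls NumPy."""
    return np.matmul(a, b)

def benchmark(candidate, reference, shapes=(1024, 1024, 1024), trials=10, rtol=1e-3):
    m, k, n = shapes
    rng = np.random.default_rng(0)
    a = rng.standard_normal((m, k), dtype=np.float32)
    b = rng.standard_normal((k, n), dtype=np.float32)

    # Correctness gate: a fast-but-wrong kernel scores zero.
    if not np.allclose(candidate(a, b), reference(a, b), rtol=rtol):
        return {"correct": False, "speedup": 0.0}

    def best_time(fn):
        times = []
        for _ in range(trials):
            start = time.perf_counter()
            fn(a, b)
            times.append(time.perf_counter() - start)
        return min(times)

    # Speedup of the candidate over the reference on the same workload.
    return {"correct": True,
            "speedup": best_time(reference) / best_time(candidate)}

print(benchmark(candidate_gemm, reference_gemm))
```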

The main contributions are as follows:

• We proposed FlashInfer Trace for standardizing the description of the task, workload, and solution for AI-generated kernels.

• We curated the FlashInfer-Bench Dataset, which serves as a rich ground for evaluating AI-generated kernels on real-world workloads.

• We proposed a pragmatic operational workflow to continuously generate AI-generated kernels and directly apply them to real-world production systems.

• We provided a comprehensive analysis of how LLM-generated kernels perform in LLM systems.

The rest of the paper is organized as follows: Section 2 reviews background on LLM inference, GPU kernels, and LLM for GPU kernel generation. Section 3 presents the design of FlashInfer-Bench, including the FlashInfer Trace schema, dataset curation, a robust performance-aware benchmarking framework, and dynamic kernel substitution for production engines. Section 4 details the dataset and comprehensive evaluation of agent-generated kernels, with case studies on GEMM and GQA decode, as well as end-to-end substitution results. Section 5 surveys related work. Section 6 concludes the paper.

Modern LLM inference is powered by LLM serving engines, which handle batching, scheduling, and parallelism, and consist of GPU kernel invocations and CPU logic. GPU kernels dominate execution time, so optimizing them translates directly into reduced latency for the LLM engine. Despite model diversity, most models share a small set of GPU kernels, including:

  1. GEMM: Inputs and outputs
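As a concrete illustration of the inputs and outputs such a GEMM task pins down, the sketch below shows a typical linear-projection GEMM during decode. The batch size, hidden sizes, and dtypes are hypothetical, chosen only to keep the example runnable.

```python
import numpy as np

# Illustrative shapes only: a linear-projection GEMM as it appears during decode.
batch_tokens = 64          # one token per request in a decode batch
hidden_size = 4096
intermediate_size = 11008

activations = np.random.rand(batch_tokens, hidden_size).astype(np.float16)   # input
weight = np.random.rand(hidden_size, intermediate_size).astype(np.float16)   # input (model weight)

# Output: [batch_tokens, intermediate_size]; accumulate in fp32 for stability.
output = activations.astype(np.float32) @ weight.astype(np.float32)

print(output.shape)  # (64, 11008)
```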
