Enabling RISC-V Vector Code Generation in MLIR through Custom xDSL Lowerings


The growing adoption of RISC-V in high-performance and scientific computing has increased the need for performance-portable code targeting the RISC-V Vector (RVV) extension. However, current compiler infrastructures provide limited end-to-end support for generating optimized RVV code from high-level representations to low-level implementations. In particular, existing MLIR distributions lack practical lowering paths that map high-level abstractions to RVV intrinsics, limiting their applicability for production-ready RISC-V kernels. This paper presents a compilation approach that combines MLIR with xDSL to bridge the missing lowering stages required for RVV code generation. Using custom intermediate representations and transformation passes implemented in xDSL, we systematically translate high-level operations into specialized, hardware-aware C code invoking RVV intrinsics. The resulting kernels are emitted as portable C functions that can be directly integrated into existing applications, enabling incremental adoption without modifying surrounding software stacks. We demonstrate the approach on the General Matrix Multiplication (GEMM) kernel and evaluate the generated micro-kernels on two real RISC-V platforms, the K230 and the BananaPi F3, comparing against OpenBLAS for both square-matrix benchmarks and transformer-based workloads derived from the BERT-Large model. When integrated into a matrix multiplication kernel, the proposed approach consistently outperforms OpenBLAS, reaching up to 12.2 GFLOPS compared to the baseline’s 5.1 GFLOPS and providing performance improvements between 10–35% across the evaluated workloads. These results demonstrate that combining MLIR with xDSL provides a practical pathway to portable, optimized code generation for RISC-V platforms.


💡 Research Summary

The paper addresses a critical gap in the current compiler ecosystem for RISC‑V Vector (RVV) extensions: the lack of end‑to‑end support in MLIR for lowering high‑level abstractions to RVV intrinsics and portable C code. To fill this gap, the authors propose a hybrid compilation pipeline that combines the multi‑level IR capabilities of MLIR with the lightweight, Python‑native compiler toolkit xDSL.

The workflow consists of six stages:

1. Users specify kernel parameters such as micro‑tile dimensions (mr × nr), data type, and vector register length (vlen).
2. xDSL's API dynamically constructs a high‑level MLIR representation that mixes standard dialects (arith, memref, scf) with a custom "RVV‑IR" dialect that explicitly models vector‑length‑agnostic operations, loads, stores, and fused‑multiply‑add (FMA) primitives.
3. Two custom lowering passes, MemRefToEmitCPass and RVVToEmitCPass, translate memref allocations into plain C pointers and map RVV‑IR ops to emitc intrinsics (e.g., lb.vfmacc).
4. The official mlir‑translate tool converts the emitc dialect into pure C source.
5. Python scripts automatically generate the necessary headers, a test harness, and a Makefile; the code is then deployed to a RISC‑V board, compiled with the native toolchain, and benchmarked.
6. Performance results are collected.
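The third stage, lowering RVV‑IR ops to C intrinsic calls, can be illustrated with a minimal, self‑contained sketch that is independent of xDSL. The op names (`rvv.load`, `rvv.fmacc`, `rvv.store`) are illustrative placeholders, not the paper's actual dialect spellings; the target calls follow the standard RISC‑V Vector intrinsics naming, which may differ from what the paper's RVVToEmitCPass emits.

```python
# Toy sketch of an RVV-IR -> C-intrinsic lowering table, mirroring the role
# of RVVToEmitCPass. Op names and the choice of LMUL=1 FP32 intrinsics are
# assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class RVVOp:
    name: str        # e.g. "rvv.load", "rvv.fmacc", "rvv.store"
    operands: tuple  # C-level operand expressions as strings

# One lowering rule per RVV-IR op: each produces a C call string using the
# standard RISC-V Vector intrinsics (FP32, LMUL=1), with `vl` held in a
# C variable set earlier by a vsetvl step.
LOWERINGS = {
    "rvv.load":  lambda ops: f"__riscv_vle32_v_f32m1({ops[0]}, vl)",
    "rvv.fmacc": lambda ops: f"__riscv_vfmacc_vf_f32m1({ops[0]}, {ops[1]}, {ops[2]}, vl)",
    "rvv.store": lambda ops: f"__riscv_vse32_v_f32m1({ops[0]}, {ops[1]}, vl)",
}

def lower(op: RVVOp) -> str:
    """Lower one RVV-IR op to its C intrinsic call string."""
    return LOWERINGS[op.name](op.operands)
```

For example, `lower(RVVOp("rvv.fmacc", ("acc", "alpha", "b_vec")))` yields the accumulate call a GEMM inner loop would emit once per micro-tile row.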

A key technical contribution is the design of the RVV‑IR dialect, which stores the vector length as metadata, allowing the same IR to be reused across different hardware configurations (e.g., 256‑bit vs. 512‑bit vectors). The pipeline also automatically enumerates all possible micro‑tile shapes from 1 × 1 up to the user‑specified mr × nr, generating a complete library of micro‑kernels that cover edge cases without manual hand‑tuning. Loop L6, the innermost FMA loop of the classic Goto‑BLIS GEMM algorithm, is generated programmatically in xDSL (see Figure 3), demonstrating how vector loads, scalar multiplies, and accumulation are expressed as emitc calls.
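The two ideas above, vector length as metadata and exhaustive micro-tile enumeration, can be sketched in a few lines of plain Python. Function names and signatures are illustrative, not the paper's API:

```python
# Sketch of the enumeration described above: every micro-tile shape from
# 1x1 up to the requested mr x nr, plus the number of FP32 lanes a single
# vector register holds for a given VLEN. Names are illustrative only.

def elements_per_register(vlen_bits: int, dtype_bits: int = 32) -> int:
    """Lanes per vector register, e.g. VLEN=256 with FP32 -> 8 lanes."""
    return vlen_bits // dtype_bits

def enumerate_tiles(mr: int, nr: int) -> list[tuple[int, int]]:
    """All micro-tile shapes 1x1 .. mr x nr, so fringe (edge) cases of the
    blocked GEMM are covered without hand-written fallback kernels."""
    return [(m, n) for m in range(1, mr + 1) for n in range(1, nr + 1)]
```

Because VLEN enters only as a parameter, the same IR construction can be retargeted from a 256‑bit to a 512‑bit machine by changing one number, which is exactly the reuse the metadata design enables.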

The authors evaluate the generated kernels on two real RISC‑V platforms: the K230 (256‑bit vector registers) and the BananaPi F3 (512‑bit vector registers). Benchmarks include a square‑matrix GEMM and the matrix‑multiply‑intensive layers of a BERT‑Large transformer model. Compared against the highly optimized OpenBLAS implementation, the auto‑generated kernels achieve up to 12.2 GFLOPS versus 5.1 GFLOPS for OpenBLAS, representing a 10 %–35 % speed‑up depending on the workload and platform. The performance gains are especially pronounced on the wider‑vector BananaPi F3, where more data fit into a single register, reducing memory traffic.
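For context on the reported figures: a GEMM on M×K and K×N operands performs roughly 2·M·N·K floating-point operations (one multiply and one add per inner-product term), so GFLOPS follows directly from the measured runtime. A minimal helper, with purely illustrative numbers rather than measurements from the paper:

```python
# GFLOPS from GEMM problem size and wall-clock time. The 2*m*n*k flop
# count is the standard convention for dense matrix multiplication.

def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved GFLOPS for a single m x n x k GEMM taking `seconds`."""
    return 2.0 * m * n * k / seconds / 1e9
```

Under this convention, a hypothetical 1024³ FP32 GEMM would need to finish in about 0.18 s to reach the 12.2 GFLOPS reported for the generated kernels.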

Beyond the immediate performance results, the work demonstrates a practical methodology for extending MLIR with custom dialects and lowering passes without deep modifications to the upstream codebase. By leveraging xDSL’s Python‑centric development model, the authors avoid the steep learning curve and heavy C++ engineering effort traditionally associated with MLIR extensions. The generated C functions are portable, requiring only a standard C compiler, which simplifies integration into existing software stacks and enables incremental adoption of RVV acceleration.

The paper also discusses limitations and future directions. Currently the implementation targets FP32; extending to FP64, BF16, or integer types would broaden applicability. Automatic tuning of tile sizes and memory prefetch strategies could further close the gap with hand‑tuned assembly kernels. Moreover, the RVV‑IR and its lowering passes are conceptually transferable to other vector‑length‑agnostic ISAs such as ARM SVE or Intel AVX‑512, suggesting a path toward a unified, multi‑ISA vector code generator.

In summary, the authors present a novel, low‑overhead solution that bridges the missing RVV lowering stages in MLIR by integrating xDSL. The resulting pipeline automatically produces high‑performance, portable C micro‑kernels for GEMM, outperforms a state‑of‑the‑art BLAS library on real hardware, and offers a flexible foundation for future extensions to other data types, workloads, and vector architectures.

