AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization
AscendC (Ascend C) operator optimization on Huawei Ascend neural processing units (NPUs) faces a two-fold knowledge bottleneck: unlike the CUDA ecosystem, there are few public reference implementations to learn from, and performance hinges on a coupled two-part artifact, namely a host-side tiling program that orchestrates data movement and a kernel program that schedules and pipelines instructions. We present AscendOptimizer, an episodic agent that bootstraps this missing expertise by turning execution into experience. On the host side, AscendOptimizer performs profiling-in-the-loop evolutionary search to discover valid and high-performing tiling and data-movement configurations directly from hardware feedback. On the kernel side, it mines transferable optimization motifs by rewinding optimized kernels (systematically de-optimizing them to synthesize instructive "bad-to-good" trajectories) and distills these motifs into a retrievable experience bank for guided rewriting. By alternating host tuning and kernel rewriting in a closed loop, AscendOptimizer steadily expands feasibility and pushes latency down. On a benchmark of 127 real AscendC operators, AscendOptimizer achieves a 1.19x geometric-mean speedup over the open-source baseline, with 49.61% of operators beating their references, and it outperforms strong agent and search baselines.
💡 Research Summary
The paper tackles the severe "knowledge scarcity" problem that hampers operator optimization on Huawei's Ascend NPU. Unlike the CUDA ecosystem, AscendC lacks public reference implementations, and performance depends on a tightly coupled pair of artifacts: a host-side tiling program that orchestrates data movement and a device-side kernel that schedules and pipelines instructions. To address both the lack of external data and the intrinsic coupling of the two artifacts, the authors propose AscendOptimizer, an episodic agent that alternates between two complementary stages, each tailored to the characteristics of its search space.
Stage I focuses on tiling. The tiling space is highly discrete; a small change can turn a configuration from “fast” to “fails‑to‑compile.” AscendOptimizer therefore employs an evolutionary‑guided program search. An initial population of tiling templates is generated, then iteratively refined through selection, crossover, and mutation. Each candidate is compiled and executed on real hardware; the measured latency and compilation success serve as the fitness signal. This hardware‑in‑the‑loop feedback acts as a hard boundary detector, allowing the search to quickly converge to valid, high‑performance tiling configurations without any analytical cost model.
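The hardware-in-the-loop evolutionary search can be sketched as follows. This is a minimal illustration, not the paper's implementation: the parameter names, the feasibility bound, and the cost model in `mock_profile` are made-up stand-ins for compiling and profiling a candidate on the real NPU.

```python
import random

# Hypothetical discrete tiling search space; parameter names and value
# ranges are illustrative assumptions, not the paper's actual schema.
SPACE = {
    "block_m": [16, 32, 64, 128],
    "block_n": [16, 32, 64, 128],
    "buffer_depth": [1, 2, 4],
}

def mock_profile(cfg):
    """Stand-in for compiling and running a candidate on hardware.
    Returns latency (arbitrary units), or None when the candidate
    'fails to compile' -- the hard boundary the search must detect."""
    if cfg["block_m"] * cfg["block_n"] * cfg["buffer_depth"] > 16384:
        return None  # toy stand-in for exceeding on-chip buffer capacity
    return 1e6 / (cfg["block_m"] * cfg["block_n"]) + 5.0 / cfg["buffer_depth"]

def evolve(pop_size=8, generations=10, seed=0):
    rng = random.Random(seed)
    sample = lambda: {k: rng.choice(v) for k, v in SPACE.items()}
    pop = [sample() for _ in range(pop_size)]
    best_cfg, best_lat = None, float("inf")
    for _ in range(generations):
        scored = [(mock_profile(c), c) for c in pop]
        valid = sorted(((lat, c) for lat, c in scored if lat is not None),
                       key=lambda s: s[0])
        if valid and valid[0][0] < best_lat:
            best_lat, best_cfg = valid[0]
        # Selection: keep the fastest valid candidates as parents.
        parents = [c for _, c in valid[:4]] or [sample(), sample()]
        children = []
        while len(children) < pop_size:
            a, b = (rng.sample(parents, 2) if len(parents) >= 2
                    else (parents[0], parents[0]))
            child = {k: rng.choice([a[k], b[k]]) for k in SPACE}  # crossover
            if rng.random() < 0.3:                                # mutation
                key = rng.choice(list(SPACE))
                child[key] = rng.choice(SPACE[key])
            children.append(child)
        pop = children
    return best_cfg, best_lat

best_cfg, best_lat = evolve()
print(best_cfg, best_lat)
```

Note that fitness is purely measured: invalid candidates simply drop out of the parent pool, so no analytical model of the compile-failure boundary is needed.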
Stage II addresses kernel optimization. The authors observe that while good kernels are hard to obtain, it is easy to create “bad‑to‑good” trajectories by deliberately de‑optimizing strong seed kernels (removing pipelining, vectorization, loop unrolling, etc.). This “optimization rewind” process generates a series of intermediate kernels, each annotated with the bottleneck it suffers (e.g., memory bandwidth saturation, register pressure, pipeline stalls). These trajectories are distilled into a reusable pattern library: each pattern is a rewrite rule that can be applied to a kernel to recover a specific optimization. During online optimization, AscendOptimizer diagnoses the current kernel’s bottleneck, retrieves the most relevant patterns from the library, and applies them as structured rewrites. The rewritten kernel is then re‑evaluated under the current tiling configuration, closing the loop between “what we have learned” and “what actually runs fast.”
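The rewind-then-retrieve mechanism can be illustrated with a toy pattern bank. Everything below is an assumption for illustration: the two rewrite rules, the bottleneck labels, and the code fragments (loosely styled after AscendC idioms such as `DataCopy`) are invented, and real patterns would be richer than string replacement.

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    bottleneck: str  # diagnosed symptom this rewrite addresses
    before: str      # de-optimized fragment
    after: str       # optimized fragment it rewrites to

# Rewind rules: each maps an optimized idiom back to a naive form,
# annotated with the bottleneck the naive form exhibits. Both rules
# are hypothetical examples, not the paper's pattern library.
REWIND_RULES = [
    ("DataCopy(dst, src, tile_len);", "for (i...) dst[i] = src[i];",
     "scalar_copy"),
    ("pipe_barrier(PIPE_ALL); // double-buffered", "// single-buffered wait",
     "pipeline_stall"),
]

def rewind(optimized_kernel):
    """De-optimize a strong seed kernel step by step, distilling each
    removed optimization into a retrievable Pattern."""
    bank, current = [], optimized_kernel
    for good, bad, symptom in REWIND_RULES:
        if good in current:
            current = current.replace(good, bad)
            bank.append(Pattern(symptom, bad, good))
    return bank, current

def guided_rewrite(kernel, bank, diagnosed_bottleneck):
    """Retrieve patterns matching the diagnosed bottleneck and apply
    them as structured rewrites."""
    for p in bank:
        if p.bottleneck == diagnosed_bottleneck:
            kernel = kernel.replace(p.before, p.after)
    return kernel

seed_kernel = ("DataCopy(dst, src, tile_len);\n"
               "pipe_barrier(PIPE_ALL); // double-buffered\n")
bank, degraded = rewind(seed_kernel)
# Recover only the optimization matching the diagnosed bottleneck.
restored = guided_rewrite(degraded, bank, "scalar_copy")
```

The key property the sketch captures is asymmetry: de-optimizing a good kernel is mechanical, while each rewind step yields a labeled before/after pair that can later be replayed in the forward direction.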
The two stages are executed in an alternating loop. Improvements in tiling expand the feasible region for kernel rewrites, while a better kernel can expose new high‑performance tiling opportunities. This co‑evolution gradually enlarges the search space while maintaining rapid convergence.
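Structurally, the alternation looks like the loop below. The two stage functions are deliberately hollow stubs with a toy latency model (the real Stage I and Stage II are the evolutionary search and pattern-guided rewriting described above); the sketch only shows how the best (kernel, tiling) pair is tracked across episodes.

```python
def tune_tiling(kernel, tiling):
    """Stage I stub: stands in for evolutionary tiling search."""
    return {**tiling, "round": tiling.get("round", 0) + 1}

def rewrite_kernel(kernel, tiling):
    """Stage II stub: stands in for one pattern-guided rewrite."""
    return kernel + "+opt"

def profile(kernel, tiling):
    """Toy latency model: more rewrites and tuning rounds -> faster."""
    return 100.0 / (1 + kernel.count("+opt")) / (1 + tiling.get("round", 0))

def episodic_optimize(kernel, tiling, episodes=3):
    best = (profile(kernel, tiling), kernel, tiling)
    for _ in range(episodes):
        tiling = tune_tiling(kernel, tiling)     # host-side stage
        kernel = rewrite_kernel(kernel, tiling)  # kernel-side stage
        lat = profile(kernel, tiling)
        if lat < best[0]:
            best = (lat, kernel, tiling)
    return best

best = episodic_optimize("kernel_v0", {})
```

Because each episode re-profiles the pair jointly, an improvement in either artifact immediately reshapes the search landscape the other sees.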
The authors evaluate AscendOptimizer on a curated benchmark of 127 real-world AscendC operators. Compared with an open-source baseline (hand-tuned vendor kernels), AscendOptimizer achieves a geometric-mean speedup of 1.19×, and 49.61% of the operators outperform their reference implementations. Against strong baselines, including evolutionary search tools (Ansor, TVM) and recent LLM-driven agents (Astra, PRAGMA), AscendOptimizer shows higher success rates and larger speedups, especially in the low-data regime where Pass@1 for AscendC code generation drops below 2%. Table 2 highlights that AscendOptimizer uniquely satisfies three desiderata simultaneously: it optimizes existing implementations, operates fully automatically without hand-crafted rules, and requires no additional model training.
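For concreteness, the two headline metrics aggregate per-operator latency ratios as sketched below. The list of speedups is invented for illustration; it is not the paper's data.

```python
import math

# Per-operator speedup = baseline latency / optimized latency.
# These five ratios are hypothetical, chosen only to show the arithmetic.
speedups = [1.50, 0.95, 1.10, 1.30, 1.02]

# Geometric mean: mean of logs, then exponentiate. This is the standard
# way to aggregate ratios, since it is symmetric under inversion.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Fraction of operators that beat their reference implementation.
win_rate = sum(s > 1.0 for s in speedups) / len(speedups) * 100

print(f"geomean {geomean:.2f}x, {win_rate:.0f}% outperform reference")
```

The geometric mean is the right aggregate here because a 2× win on one operator and a 2× loss on another should cancel to 1.0×, which an arithmetic mean would not give.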
Key contributions are: (1) formalizing AscendC operator optimization as a dual search problem over tiling and kernel spaces under hard hardware constraints; (2) introducing “optimization rewind” to bootstrap kernel‑level experience without any external dataset, and building a pattern‑based experience bank for guided rewrites; (3) delivering a comprehensive benchmark and demonstrating consistent gains across a wide variety of operators.
The work opens several future directions: integrating the pattern bank with LLM‑based code generation to provide expert guidance at synthesis time, extending the episodic framework to other domain‑specific accelerators (e.g., Qualcomm Hexagon, Intel Habana), and exploring multi‑agent collaborations to further accelerate search. AscendOptimizer shows that, even in environments where expert knowledge and training data are scarce, a carefully designed self‑supervised agent can achieve expert‑level performance on heterogeneous hardware.