EffiSkill: Agent Skill Based Automated Code Efficiency Optimization

Code efficiency is a fundamental aspect of software quality, yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, retrieved exemplars, or prompt-based search, b…

Authors: Zimu Wang, Yuling Shi, Mengfan Li

EffiSkill: Agent Skill Based Automated Code Efficiency Optimization
EiSkill: Agent Skill Based A utomated Code Eiciency Optimization Zimu W ang ∗ zm.wang@berkeley .edu Shanghai Jiao T ong University University of California, Berkeley USA Y uling Shi ∗ yuling.shi@sjtu.edu.cn Shanghai Jiao T ong University China Mengfan Li lmf2951510526@sjtu.edu.cn Shanghai Jiao T ong University China Zijun Liu zl3031@columbia.edu Columbia University USA Jie M. Zhang jie.zhang@kcl.ac.uk King’s College London United Kingdom Chengcheng W an ccwan@sei.ecnu.edu.cn East China Normal University Shanghai Innovation Institute China Xiaodong Gu † xiaodong.gu@sjtu.edu.cn Shanghai Jiao T ong University China Abstract Code eciency is a fundamental aspect of softwar e quality , yet how to harness large language models (LLMs) to optimize programs remains challenging. Prior approaches have sought for one-shot rewriting, r etrieved exemplars, or pr ompt-based search, but they do not explicitly distill reusable optimization knowledge, which limits generalization beyond individual instances. In this paper , we present ESkill, a framework for code-eciency optimization that builds a portable optimization toolbox for LLM- based agents. The key idea is to model recurring slow-to-fast trans- formations as reusable agent skills that captur e both concrete trans- formation mechanisms and higher-level optimization strategies. ESkill adopts a two-stage design: Stage I mines Operator and Meta Skills from large-scale slow/fast program pairs to build a skill library; Stage II applies this librar y to unseen programs through execution-free diagnosis, skill retrieval, plan composition, and can- didate generation, without runtime feedback. Results on EBench-X show that ESkill achieves higher opti- mization success rates, improving over the strongest baseline by 3 . 69 to 12 . 52 percentage points across model and language settings. These ndings suggest that mechanism-level skill reuse provides a useful foundation for execution-free code optimization, and that the resulting skill library can serve as a reusable resource for broader agent workows. 1 Introduction The computational eciency of software is a fundamental aspect of software quality [ 11 , 18 ]. Inecient implementations can in- crease latency , waste computational resources, and degrade user experience, and such ineciencies have long been studied as perfor- mance bugs in software systems [ 22 , 30 ]. A utomatically improving the eciency of existing code, which we refer to in this paper as ∗ Both authors contributed equally to this research. † Corresponding author . Retrieve Agent Skills LLM Agent Generate Debug Optimize e.g. compress dynamic programming state Plan Stage Action Stage Figure 1: O verview of agent skills as reusable knowledge. code optimization, has therefore become an important problem in automated software engineering. Early work on code optimization mainly relied on manually de- signed rules that target spe cic ineciency patterns, such as loop- related ineciencies or performance miscongurations [ 24 , 30 ]. Although eective in narrow settings, these approaches require sub- stantial expert eort and often provide limited coverage [ 11 ]. More recently , large language models (LLMs) have enabled data-driven approaches that generate optimized code directly from inecient programs using pr ompting, retrie val augmentation, or search-based renement [ 10 , 19 , 44 ]. While these methods show encouraging progress, their current formulations still leave an important gap be- tween instance-lev el optimization and r eusable optimization knowl- edge. A central limitation of existing LLM-based approaches is that they primarily operate at the level of individual examples. Pr ompting- based methods typically attempt to rewrite a program in one shot, which makes it dicult to handle optimizations that require struc- tured, multi-step reasoning. Search-based methods such as SBLLM [ 10 ] partially address this issue by iteratively exploring candidate r ewrites, but they still r ely on retrieved examples and generic sear ch prompts Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu to guide optimization. As a result, the optimization knowledge re- mains tied to specic instances rather than being distilled into reusable principles. This instance-centric design is restrictive because performance- improving edits often exhibit recurring transformation patterns across tasks. For example, dierent programs may b enet from the same underlying optimization mechanism, such as replacing brute-force enumeration with a more ecient aggregation strategy , reformulating a computation algebraically , compressing dynamic- programming state, or reducing constant-factor ov erhead through more suitable data structures. Tr eating every optimization task inde- pendently misses this recurring structure and prevents the system from building a transferable repertoire of optimization knowledge. In this paper , we propose ESkill, a framework that addresses this gap by explicitly modeling code optimization through reusable agent skills . Figure 1 presents the general notion of agent skills as reusable knowledge, which motivates our approach. In our set- ting, an agent skill is a structured unit of optimization knowl- edge distilled from recurring slow-to-fast program transforma- tions. Rather than representing optimization knowledge only as raw example pairs or implicit model behavior , ESkill captures it as reusable transformation mechanisms that can be retriev ed and applied to new tasks. This skill-based view shifts the optimization process from surface-lev el instance matching toward mechanism- level reuse . ESkill follows a two-stage design. Stage I (Skill Mining) op- erates oine on paired slow and fast programs to extract recur- ring optimization patterns and distill them into a reusable skill library . The resulting library contains two complementary forms of knowledge: Operator Skills , which encode concrete optimization mechanisms, and Meta Skills , which provide higher-level proce- dural guidance for selecting and composing those me chanisms. Stage II (Skill-Guided Optimization) applies this library to un- seen programs through structured diagnosis, skill retrieval, plan composition, and candidate generation, enabling the e xploration of multiple optimization routes beyond one-shot re writing. A practical property of our setting is that inference is execution- free : when optimizing a new program, the system does not rely on repeated candidate execution or runtime feedback inside the optimization loop. This simplies deployment, as many realistic software-engineering settings do not provide convenient access to execution environments, repr esentative workloads, or the budget required to evaluate multiple candidate rewrites online . Although we evaluate ESkill in the setting of competitive- programming optimization, this should be understood as an ex- perimental scenario for validating the framework rather than a limitation of the contribution itself. Competitive-pr ogramming cor- pora provide abundant paired slow/fast solutions with measurable eciency dierences, making them a practical and well-controlled testbed for studying how optimization skills can be mined and reused. This choice is also shaped by the current availability of benchmarks that expose ne-grained eciency dierences with reliable execution signals. The broader contribution is a skill-based optimization framew ork that can, in principle, be extended to other software-engineering domains when suitable optimization corp ora become available. Operator Skill Card Metadata: skill_id , type , language name , description family , tags , triggers Body: • When to use • Transformation steps • Expected comple xity ef- fect • Common pitfalls • When not to use • Minimal example Meta Skill Card Metadata: skill_id , type , language name , description Body: • Overview • Core loop • Routing heuristics • Budgeting / control logic • Decision checklist Figure 2: Structure of the two skill-card artifacts in ESkill. W e evaluate ESkill on EBench-X [ 36 ], a benchmark for code- eciency optimization, and compare it against strong baselines including prompting-based, retrieval-based, and search-based meth- ods such as SBLLM [ 10 ]. Experimental results show that ESkill consistently achieves higher optimization performance across all settings. Measured by optimization success rate, it impr oves ov er the strongest baseline by 12 . 03 percentage points on C++ and 4 . 98 points on Python with GPT -5-mini, and by 12 . 36 and 8 . 67 percent- age points on C++ and Python, respectively , with Q wen3-Coder- 30B- A3B-Instruct. Beyond eectiveness, the learned skill library also provides a structured view of the optimization knowledge used by the system. While we do not claim full interpretability in a strict sense, this e xplicit intermediate abstraction makes the optimization process more analyzable than purely black-box rewrite prompting. More broadly , ESkill is designe d not only as a standalone method, but also as a p ortable optimization toolbox for LLM-based agents. The skill librar y produced by ESkill is organized as a plug-and-play resource that can, in principle, be integrated into broader agent workows, allowing optimization knowledge mined oine to be reused across dierent downstream coding settings. This perspective is particularly relevant as coding agents become increasingly modular and tool-oriented [5, 26]. W e summarize our contributions as follows: (1) W e introduce performance-improving agent skills as a reusable abstraction for code-eciency optimization and show how recurring optimization patterns can be distilled from large- scale slow/fast program pairs. (2) W e propose ESkill, a two-stage framework that integrates oine skill mining with skill-guided optimization for LLM- based code eciency improvement. (3) W e conduct an extensive evaluation on EBench-X, demon- strating that ESkill outperforms or remains competitive with strong baselines while enabling structur ed analysis of learned optimization skills. (4) W e organize the mined skills as a p ortable optimization library that can be r eused as a plug-and-play toolb ox in broader agent workows. EiSkill: Agent Skill Based Automated Code Eiciency Optimization Conference’17, July 2017, W ashington, DC, USA Optimized Code S T A G E I : O f f l i n e S k i l l M i n i n g S T A G E I I : O p t i m i z a t i o n I n f e r e n c e INPUT: F ast / Slow Code Pairs I N P U T : Unoptimized statement + code O U T P U T : Skill library Analyze Traces Traces Clustering Induce Skills Diagnose &Retrieve skills Compose plans Generate Candidat es ProblemBrief · SlowAudit DeltaSummary Hybrid Clustering Operator Skill Meta Skill Example Operator Skills: BFS -> formula full DP -> rolling state factorial loop -> small-k nCk O U T P U T : Diagnose -> OptimizationBrief Multi skill set Plan-guided rewrite Figure 3: O verview of ESkill: Skill mining and execution-free optimization. 2 Background 2.1 Problem Denition Following prior work [ 10 , 44 ], we study code-eciency optimiza- tion for a functionally correct but inecient program. Let 𝑆 denote the input program to be optimized, and let 𝐶 denote the task-level context available to the optimizer , such as a problem statement, specication, or other natural-language description of the intended behavior . The obje ctive is to generate a new program 𝑂 that pre- serves the intended functionality of 𝑆 while improving eciency under the target evaluation protocol. Let T denote the evaluation test suite, and let 𝐶𝑜 𝑠 𝑡 ( ·) denote the benchmark-specic eciency metric. The desired output should satisfy two conditions: (1) se- mantic correctness, namely that 𝑂 preserves the behavior of 𝑆 on T ; and (2) eciency impr ovement, namely 𝐶 𝑜𝑠 𝑡 ( 𝑂 ) < 𝐶 𝑜 𝑠 𝑡 ( 𝑆 ) . A key constraint in our setting is that the optimizer receives no execution signal during inference. That is, when optimizing a new program, the system does not execute candidate rewrites, observe runtime traces, or iteratively rene the output using execution feedback. Instead, it must infer likely bottlenecks and optimization opportunities using only the input program, the available task context, and the mined skill library . This setting diers from search- based or reinforcement-style optimization frameworks that rely on repeated candidate execution during the optimization loop. This formulation is further motivated by practical software en- gineering scenarios. In many realistic deployment settings, an opti- mization assistant may have access to sour ce code and high-level intent, but not to a production-like execution environment, rep- resentative workloads, full external dependencies, or the budget required to run many candidate rewrites online. This setting arises, for example, in code-re view assistants, repository-scale optimiza- tion to ols, and enterprise development environments with strict latency , security , or sandbox constraints. Although competitive- programming benchmarks do not capture all aspects of industrial software optimization, they pro vide a controlled setting for study- ing a practically relevant problem: impro ving code eciency when inference-time execution feedback is unavailable. 2.2 Agent Skills Recent agent frameworks conceptualize skills as modular units that encapsulate task-specic instructions, tool-usage patterns, and auxiliary resources for recurring workows [ 1 , 33 ]. This notion aligns with a broader trend in LLM-based systems toward mod- ularity and compositionality , in which reusable abstractions can improve generalization and scalability [ 27 , 38 , 50 ]. In particular , tool-augmented language models learn to invoke external func- tions and reuse structured behaviors across tasks [ 38 ], while agent frameworks interleave reasoning and action to construct multi-step solutions [50]. In ESkill, we adopt a denition tailored specically to code- eciency optimization. W e dene an agent skill as a reusable , struc- tured unit of optimization knowledge that captures either a concrete transformation mechanism or higher-level control knowledge gov- erning its application. This denition distinguishes skills from raw prompts, retriev ed exemplars, and fully optimized programs. Our key observation is that, although individual slow/fast pr o- gram pairs are instance-specic, the underlying optimization mech- anisms often recur across supercially dierent tasks. ESkill there- fore abstracts optimization knowledge at the mechanism level : each skill species applicability conditions, transformation procedures, and potential constraints or failure modes. For e xample, a dynamic- programming state-reduction skill may eliminate r edundant dimen- sions, prune unreachable states, or replace scan-heavy transitions with mor e ecient formulations. Although the resulting implemen- tations may dier substantially in syntax or domain, the underlying optimization principle remains transferable. W e categorize skills into two types. Figure 2 illustrates the struc- ture of the corresponding skill cards used in ESkill. Operator Skills enco de concrete transformation mechanisms distilled from recurring slow-to-fast edits, including applicability signals, trans- formation procedures, expected complexity eects, and common pitfalls. Meta Skills , in contrast, capture higher-level control logic for orchestrating operator skills, including diagnosis, retrieval, compo- sition, and execution-free candidate assessment, drawing on recent Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu advances in agentic reasoning, planning, and to ol-use orchestra- tion [ 3 , 50 ]. This distinction is central to ESkill: Operator Skills externalize reusable transformation knowledge, wher eas Meta Skills govern how such knowledge is selected and composed under task- specic constraints. In our implementation, both skill types are instantiated as skill cards rather than free-form prompts, making optimization knowledge explicit, inspectable, and reusable. The resulting skill library therefore serves as an interme diate represen- tation between raw optimization examples and downstream code generation. Section 3 describ es how these skills are constructe d and operationalized in our two-stage pipeline. 3 Proposed Approach: ESkill 3.1 Overview The key idea of ESkill is to treat code optimization not as one-shot rewriting, but as a pr ocess of composing reusable transformation mechanisms learned from prior examples. ESkill follows a two-stage paradigm that decouples the learn- ing of optimization knowledge from its application to new programs. Stage I: Skill Mining op erates oine on paired slow/fast solutions to extract recurring optimization patterns and distill them into a reusable skill library . Stage II: Skill-Guided Optimization applies this library to unse en programs through structured diagnosis, skill retrieval, and plan-driv en generation, enabling the exploration of multiple optimization routes rather than committing to a single rewrite. The learne d skill library consists of two complementary com- ponents. Operator Skills enco de reusable transformation mecha- nisms (e.g., algorithm replacement, state compression, and constant- factor reduction), together with their applicability conditions and expected eects. Meta Skills act as proce dural controllers that govern how Operator Skills are selected, combined, and execute d during optimization. From a workow perspective, Figur e 3 illustrates the following process. Stage I transforms slow/fast solution pairs into a compact set of transferable optimization operators. Stage II then applies these operators to new tasks by rst diagnosing bottlenecks, re- trieving relevant skills, composing multiple optimization plans, and generating candidate implementations accordingly . This design allows ESkill to explore diverse optimization strategies while remaining fully execution-free during inference. 3.2 Stage I: Skill Mining Stage I constructs the skill library from paired slow/fast solutions through a fully automated mining pipeline. At a high level, this stage comprises four steps: (1) extract structured optimization traces from paired solutions; (2) abstract each trace into a compact sig- nature; (3) cluster similar signatures and distill each cluster into an Operator Skill; and (4) construct Meta Skills from the mine d operator library . The goal is not to pr eserve individual code edits as isolated examples, but to recover the transformation mechanisms that repeatedly underlie performance improvements. Trace extraction. Given a paired slow/fast solution for the same task, we use gpt-5.1 [ 32 ] to generate a structured optimization trace. Each trace contains four elds: ProblemBrief , which summarizes the task context and inferred constraints; Slo w A udit , which identi- es dominant operations and likely bottlenecks in the slow solution; FastA udit , which captures the core optimization idea in the fast solution; and DeltaSummary , which describ es the transformation bridge between the two solutions. T ogether , these traces provide a normalized view of how performance improvements arise across tasks. Signature abstraction. W e then abstract each trace into a com- pact signature that summarizes the transformation me chanism expressed by the pair , including signals such as the optimization type, complexity shift, trigger conditions, bottleneck categor y , and problem characteristics. This step shifts the representation from instance-specic code e dits to reusable optimization knowledge that can be grouped across tasks and implementations. Clustering and operator-skill distillation. Next, we cluster the re- sulting signatures in a hybrid lexical-semantic space so that traces expressing the same underlying transformation mechanism are grouped together even when their surface wording diers. Con- cretely , each signature is represented by the concatenation of two normalized views: (1) a TF-IDF r epresentation with unigram and bigram features, and (2) a dense sentence embedding produced by the Sentence- Transformers model all-MiniLM-L6-v2 [ 39 ]. T o keep cluster granularity data-driven, we estimate the cluster count 𝑘 automatically by maximizing the cosine-silhouette score over can- didate e ven values in a bounded range around √ 𝑛 traces, wher e 𝑛 is the number of traces. After ward, we t KMeans on the full hybrid representation to produce the nal clustering. Operator-Skill construction. Each cluster is summarized as a reusable prole of its shared transformation behavior , and this prole is dis- tilled into an Operator Skill. The resulting skill card records the transformation intent, applicability signals, expected eect, and common risks or pitfalls of the corresponding optimization mech- anism. T o ke ep the nal skill librar y compact, we merge highly similar summaries using TF-IDF cosine similarity with a merge threshold of 0.8, followed by a conservative LLM prompt to merge them into a single skill card. Meta-skill construction. After constructing the operator library , we build a small set of Meta Skills from the mined skill library . These procedural controllers specify how to diagnose a new optimization problem, retriev e relevant Operator Skills, and compose them into executable optimization plans. T ogether , the Operator Skills and Meta Skills form the skill registry used in Stage II. 3.3 Stage II: Skill-Guided Optimization Stage II applies the mined skill library to unseen optimization tasks. Rather than attempting to identify a single correct rewrite upfront, ESkill explores multiple plausible optimization routes thr ough a structured pipeline consisting of diagnosis, retrieval, planning, and candidate generation. Diagnosis. Given a problem statement and a baseline solution, ESkill rst produces a structured optimization brief that summa- rizes likely bottlenecks, dominant operations, inferred constraints, EiSkill: Agent Skill Based Automated Code Eiciency Optimization Conference’17, July 2017, W ashington, DC, USA Algorithm 1 ESkill T wo-Stage W orkow Input: Paired corpus D 𝑝 𝑎𝑖 𝑟 , test tasks Q Output: Skill registry R and generated candidates O 1 // Stage I: Skill Mining 2 T ← ExtractTraces ( D 𝑝 𝑎𝑖 𝑟 ) 3 Σ ← BuildSignatures ( T ) 4 C ← HybridCluster ( Σ ) 5 S 𝑜 𝑝 ← DistillOperatorSkills ( C ) 6 S 𝑚𝑒 𝑡 𝑎 ← InduceMetaSkills ( S 𝑜 𝑝 ) 7 R ← BuildRegistr y ( S 𝑜 𝑝 , S 𝑚𝑒 𝑡 𝑎 ) 8 return R 9 // Stage II: Skill-Guided Optimization 10 O ← ∅ 11 foreach 𝑞 ∈ Q do 12 𝑏 ← Diagnose ( 𝑞, R ) 13 K ← RetrieveSkills ( 𝑏, R ) 14 Π ← ComposePlans ( 𝑏, K ) 15 Y ← GenerateCandidates ( 𝑞, Π ) 16 O ← O ∪ Y 17 end 18 return O and the anticipated optimization scope. This step converts the in- put task into a form that can be matched against the learned skill library . Skill retrieval. Conditioned on the diagnosis, ESkill retriev es 3 candidate sets of Operator Skills, where each set represents a plausible optimization route. This retrie val process is deliberately designed to return multiple alternatives, as inecient programs often admit several valid optimization directions, and committing to a single skill choice too early can unduly narrow the search space. Plan Composition. For each retrieved skill bundle, the Meta Skills act as procedural controllers that compose the selected Operator Skills into 2 to 3 optimization plans. Each plan species a coherent transformation strategy , the anticipated eciency improvement, and the main risks that must be manage d during rewriting. This design intr oduces diversity both in the skills considered and in how those skills are applied. Candidate generation. Finally , ESkill generates optimized code candidates by following the composed plans while preserving the original program interface and intended behavior . Because candi- date generation does not depend on iterative execution feedback inside the generation loop, the method can apply the mined skill library through a retrieve-and-compose workow while still ex- ploring multiple plausible optimization routes. 4 Experimental Setup 4.1 Research Questions W e study the following research questions: RQ1: Overall eectiveness. Does ESkill improve program eciency on EBench-X r elative to strong baselines, including prompting-based, retrieval-based, supervised, and optimization-oriented methods? RQ2: Ablation analysis. How do the key components of E- iSkill (diagnosis, skill r etrieval, and multi-plan composi- tion) contribute to its performance? RQ3: Consistency across languages. Does ESkill remain eective when transferred from Python to C++ under the same evaluation protocol? RQ4: Analysis of learned skills. What kinds of optimization skills are learned, and what insights do these patterns provide into how ESkill impr oves code eciency? 4.2 Benchmark and Evaluation Protocol Skill-Mining Corpus. The Stage I skill library is mine d fr om paired slow/fast solutions constructed from external competitive- programming corpora. For C++, we use the PIE dataset [ 44 ] with gem5-simulated runtime, following the original PIE measurement setting. For Python, we aggregate solutions from PIE, Mercury [ 9 ], DeepMind Code Contests, and CodeParrot APPS, and measure ef- ciency using CP U instruction counts following COFFE [ 35 ]. For each problem with multiple valid solutions, we construct a slo w/- fast pair by selecting one slower and one faster implementation, and dene the sp eedup ratio as 𝑟 = 𝑇 slow / 𝑇 fast , where 𝑇 denotes the language-specic eciency measure. W e exclude pairs with 𝑟 < 2 to avoid treating marginal dierences as meaningful opti- mizations. The nal mining corpus contains 900 Python pairs and 900 C++ pairs. W e further verify that there is no overlap between the constructed mining corpus and the tasks in EBench-X [ 36 ], maintaining a strict separation between mining and evaluation data. Benchmark. W e evaluate ESkill on EBench-X [36], a bench- mark for code-eciency optimization that provides functional test suites and execution-based runtime measurement. EBench-X con- tains 623 optimization tasks with expert-written canonical solu- tions across six programming languages. W e focus on the Python and C++ subsets. Python serves as the primary evaluation setting, whereas C++ is used to assess cross-language transfer . Input programs. For every task, the input to be optimized is the canonical solution provided by EBench-X. Using the same expert-written starting point for all methods supports a fair com- parison among optimization strategies. Public/private split and ranking protocol. For each task, we parti- tion the provided test cases into public and private subsets using a xed random see d (42), with 20% assigned to public tests and the remaining 80% to private tests. Public tests are used only for candi- date selection. Among the generated candidates, we retain those that pass all public tests and rank them by runtime on the public tests. Private tests are reserved exclusively for nal evaluation. All reported correctness and eciency results are compute d on the private tests. Models. W e instantiate all compared methods using two LLM backbones: GPT -5-mini [ 31 ] and Qwen3-Coder-30B- A3B-Instruct [ 37 ]. Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu Using the same backbones across methods allows us to b etter isolate the eect of the optimization strategy from that of the underlying model. Candidate budget. Each method is allocated the same generation budget of 𝑘 = 8 candidates per task. This places the evaluation in a small multi-candidate regime consistent with T op- 𝑘 coding evaluation [ 4 ] and recent software-engineering benchmarks that report 𝑘 = 8 by sampling eight candidates [ 51 ]. Unless otherwise stated, all reported T op- 𝑘 results use this xed budget. Execution environment. All candidates ar e evaluated oine us- ing the ocial EBench-X execution harness. Experiments are conducted on a machine equipped with an AMD EPYC 7302 CPU (16 cores, 32 threads, 3.0 GHz) and 112 GB RAM, running Ubuntu 20.04. Programs are executed using Python 3.11.11 (bookworm) and gcc:14.2.0-bookworm. T o reduce runtime variability and measure- ment noise, we execute each candidate thr ee times and report the mean runtime as the nal result. 4.3 Baselines W e compare EffiSkill with baselines from four categories. Prompting. Instruction directly prompts the LLM to optimize code without interme diate reasoning or retrieval. Co T [ 47 ] adds zero-shot chain-of-thought guidance to rst identify bottlenecks before generating optimized code. Retrieval-augmented. RA G [ 25 ] retrieves top- 𝑘 similar (slow , fast) pairs from PIE via CodeBERT embeddings as few-shot opti- mization examples. FasterPy [ 48 ] augments generation with opti- mization summaries retrieved via UniX coder from historical code transformations. Evolutionary search. SBLLM [ 10 ] initializes candidates via Co T prompting and renes them with transformation patterns; we adopt a single-iteration variant without execution-based selection for fair comparison. Fine-tuning. ECoder [ 17 ] ne-tunes on execution-validated slow/fast pairs to directly generate optimized code; as it applies only to open-weight backb ones, we report results for Qwen3-Coder- 30B- A3B-Instruct in T able 1. 4.4 Metrics W e follow prior work on code-eciency optimization [10], where the objective is to improv e runtime over the input program while preserving functional correctness. For each task, let 𝑜 denote the input canonical program, and let { 𝑐 𝑗 } 𝑘 𝑗 = 1 denote the ranked candidate list produced by a method. W e report OPT@ 𝒌 , dened as the percentage of tasks for which at least one of the top- 𝑘 candidates passes all test cases and attains a runtime reduction of at least 10% relative to the input program on the private tests, following prior work [10]. T op- 𝑘 protocol. For each task, every method produces 𝑘 = 8 candidates. Candidates are ranke d using the public-test protocol described ab ov e, whereas all reported T op-1 and T op-8 results are computed on the private tests. T able 1: Optimization p erformance on EBench-X. " ∗ " and " † " denote one-sided paired b ootstrap tests against the b est non- ESkill baseline in the same setting ( 𝑝 < 0 . 05 and 𝑝 < 0 . 10 , respectively). Model Method Python C++ OPT@1 (%) OPT@8 (%) OPT@1 (%) OPT@8 (%) GPT -5-mini Instruction 18.62 31.62 31.62 48.15 RAG 19.90 29.37 34.19 45.43 Co T 18.62 28.73 40.93 54.74 SBLLM 17.34 18.94 38.84 53.45 FasterPy 21.19 32.42 38.84 51.69 ESkill 26.48 ∗ 37.40 ∗ 44.62 † 66.77 ∗ Qwen3-Coder Instruction 12.52 24.40 26.00 32.58 RAG 7.22 19.74 16.37 25.04 Co T 16.05 26.00 23.11 30.02 SBLLM 6.90 8.67 28.41 43.98 FasterPy 13.96 28.25 19.90 34.19 ECoder 14.29 16.69 14.77 25.20 ESkill 21.35 ∗ 36.60 ∗ 34.19 ∗ 56.50 ∗ 5 Experimental Results In this section, we report the experimental results that answer the research questions introduced in Section 4.1. Unless other wise specied, all experiments follow the evaluation protocol describe d in Section 4. 5.1 RQ1: O verall Eectiveness W e evaluate whether ESkill improves code eciency over strong baselines on EBench-X. T able 1 reports results for two LLM back- bones and two programming languages using OPT @ 𝑘 with a xed candidate budget of 𝑘 = 8 , where signicance markers indicate a one-sided paired bootstrap test against the best non-ESkill base- line in the same model–language setting. For each comparison, we compute paired task-level performance dier ences, resample the paired tasks with replacement, and estimate the probability that the mean improvement is no greater than zero. Here, ∗ denotes 𝑝 < 0 . 05 and † denotes 𝑝 < 0 . 10 . Overall, ESkill achieves the strongest optimization success rates. Across all four model–language settings, it attains the best OPT @1 and OPT @8 , indicating that its main advantage is a higher likelihood of producing at least one candidate that is both function- ally correct and measurably faster than the input program within a small e xecution-free budget. This result is particularly important in our setting, where the central challenge is not merely to generate candidate rewrites, but to convert a limited candidate budget into successful eciency improvements while pr eserving correctness. This advantage is consistent across b oth backbones. Under GPT - 5-mini, ESkill achieves the str ongest results on Python and also attains the best optimization success rates on C++. Under Qwen3- Coder-30B- A3B-Instruct, the same pattern holds: ESkill again achieves the b est OPT @1 and OPT @8 on both Python and C++. T aken together , these results suggest that ESkill primarily im- proves the coverage of successful optimizations, thereby increasing the likelihood that a xed candidate set contains at least one useful optimization. T o further characterize this advantage, we analyze candidate- budget scaling, task-level comparisons, and ne-grained outcome EiSkill: Agent Skill Based Automated Code Eiciency Optimization Conference’17, July 2017, W ashington, DC, USA 0 20 40 60 OPT@k (%) (a) Python / GPT -5-mini (b) Python / Qwen3-Coder-30B 1 2 3 4 5 6 7 8 0 20 40 60 OPT@k (%) (c) C++ / GPT -5-mini 1 2 3 4 5 6 7 8 (d) C++ / Qwen3-Coder-30B Candidate Budget k Instruction RAG CoT SBLLM FasterPy EffiSkill Figure 4: Optimization success growth from 𝑘 = 1 to 𝑘 = 8 . 0 25 50 75 100 455 26 142 (a) Python / GPT -5-mini vs SBLLM 0 25 50 75 100 401 41 181 (b) Python / Qwen3-Coder -30B vs EffiCoder 0 25 50 75 100 348 34 241 (c) C++ / GPT -5-mini vs CoT 0 25 50 75 100 313 103 207 (d) C++ / Qwen3-Coder -30B vs Instruction T asks (%) EffiSkill W ins Ties Losses Figure 5: T ask-level win/loss comparison against the strongest non-ESkill baselines. distributions. Figure 4 shows that ESkill benets more consis- tently from additional candidates as 𝑘 increases, with especially pronounced gains on C++. This nding suggests that the diagnosis– retrieval–planning pipeline improves not only the top-ranked can- didate, but also the overall quality of the candidate set produce d under a xed generation budget. Figure 5 provides a complemen- tary task-level comparison with the strongest non-ESkill baseline in each setting. In all four settings, ESkill wins on more tasks than it loses, indicating that its advantage is broadly distributed rather than driven by a small number of favorable cases. Figur e 6 further shows that ESkill consistently reduces the proportion of tasks with no improvement and shifts more tasks into meaningful improvement ranges, particularly in the C++ settings. Collectively , these analyses reinforce the interpretation of Table 1: ESkill’s main strength is reliable optimization discovery . Relative to direct prompting baselines such as Instruction and Co T , ESkill more Instruction RAG CoT SBLLM FasterPy EffiSkill (a) Python / GPT -5-mini Instruction RAG CoT SBLLM FasterPy EffiCoder EffiSkill (b) Python / Qwen3-Coder-30B 0 25 50 75 100 Instruction RAG CoT SBLLM FasterPy EffiSkill (c) C++ / GPT -5-mini 0 25 50 75 100 Instruction RAG CoT SBLLM FasterPy EffiCoder EffiSkill (d) C++ / Qwen3-Coder-30B T asks (%) No improvement Mild (1.00-1.10) Moderate (1.10-1.25) Strong (1.25-1.50) V ery strong (>1.50) Figure 6: Fine-graine d distribution of top-8 optimization. eectively converts the candidate budget into successful optimiza- tions, highlighting the value of explicit diagnosis, skill retrieval, and multi-plan composition over one-shot rewriting. Relative to retrieval-based and optimization-oriente d baselines, its consistently stronger OPT @ 𝑘 results suggest that mechanism-level skill reuse is a promising inductive bias for execution-free optimization. T able 1 reports OPT @ 𝑘 results with 𝑘 = 8 candidates per task, with Python and C++ shown side by side. Best results are high- lighted in light gray . Where signicance markers are present, the improvements of ESkill are statistically signicant under this test. The overall evidence from T able 1 and Figures 4–6 supports the same conclusion: among the compared metho ds, ESkill is the most reliable for identifying useful optimizations across a broad set of tasks under a xed execution-free candidate budget. Answer to RQ1: ESkill is the most consistent method for improving optimization success on EBench-X. It achieves the best OPT @1 and OPT @8 in all four model–language settings and uses the candidate budget more eectively than competing methods. Additional analyses further show that ESkill wins on a majority of tasks against the strongest baseline in each setting and shifts more tasks fr om no improvement into moderate and strong improv ement ranges. Its main advantage therefore lies in reliably nding optimized candidates under a small execution- free budget. 5.2 RQ2: Ablation Study T o assess the contribution of each inference-stage component, we evaluate three ablations. w/o Retrieval removes skill retrieval and generates candidates directly from the diagnosis brief. w/ Random Skills replaces retrieved skills with uniformly sampled ones from the library . w/o Multi P lans retains diagnosis and retrieval but restricts generation to a single plan per retrieved bundle. All settings use the same backb one, candidate budget ( 𝑘 = 8 ), and ranking protocol. Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu T able 2: Ablation results on EBench-X. Model Method OPT@1 (%) OPT@8 (%) GPT -5-mini ESkill 26.48 37.40 w/o Retrieval 12.20 27.45 w/ Random Skills 13.80 27.82 w/o Multi Plans 18.78 25.84 Qwen3-Coder ESkill 21.35 36.60 w/o Retrieval 7.22 9.15 w/ Random Skills 12.36 13.80 w/o Multi Plans 11.88 14.29 T able 2 shows that each component contributes materially , with dierent degradation patterns across backbones. Skill relevance is critical. Replacing retrieved skills with ran- dom ones consistently reduces performance on both models, indi- cating that the benet does not arise from adding arbitrary skills alone; it depends on whether the selected skills align with the di- agnosed bottleneck. For example, OPT@8 declines from 37 . 40% to 27 . 82% on GPT -5-mini and fr om 36 . 60% to 13 . 80% on Qwen3-Coder- 30B- A3B-Instruct. Multi-plan composition drives candidate quality . Restrict- ing generation to a single plan substantially reduces OPT@8 , es- pecially on Qwen3-Coder-30B-A3B-Instruct, where OPT@8 drops from 36 . 60% to 14 . 29% . This indicates that diversied plan explo- ration plays an important role in discovering high-quality optimiza- tion candidates. Retrieval improves robustness. Removing retrieval leads to substantial degradation in b oth OPT@1 and OPT@8 on Q wen3- Coder-30B- A3B-Instruct, suggesting that diagnosis alone is insuf- cient. On GPT -5-mini, the degradation is more pronounced in OPT@1 than in OPT@8 , suggesting that direct generation can oc- casionally produce strong candidates, but retrieval helps surface them more reliably among the top-ranked outputs. Answer to RQ2: Each comp onent of ESkill contributes to its performance. Relevant skill retrieval improves alignment with optimization bottlenecks, while multi-plan composition appears to b e a major factor in impro ving OPT@8 . Retrieval also improves robustness by helping promote high-quality candidates to top- ranked outputs. T ogether , these components account for the gains achieved by ESkill. 5.3 RQ3: Consistency Across Languages W e next assess whether ESkill maintains its eectiveness across programming languages under the same evaluation protocol. As shown in T able 1, the framework consistently impro ves optimiza- tion success on both Python and C++. On Python, ESkill achieves the best OPT @1 and OPT @8 for both backbones, reaching 26 . 48% / 37 . 40% with GPT -5-mini and 21 . 35% / 36 . 60% with Qwen3-Coder- 30B- A3B-Instruct. On C++, the improv ements are larger: ESkill reaches 44 . 62% and 66 . 77% with GPT -5-mini, and 34 . 19% and 56 . 50% with Qwen3-Coder-30B- A3B-Instruct. Relative to the strongest non- ESkill baseline in each setting, this corresponds to gains of + 12 . 03 points on C++ and + 4 . 98 points on Python for GPT -5-mini, and T able 3: Skill-family summary across both backb ones. Family Representative transformations Usage (%) Implementation & constant-factor constant-factor cleanup; lighter data structures; cheaper arithmetic 54.2 Algebraic / closed-form reformulation loop elimination; parity / bitwise simplication; range counting 19.7 DP / state compression state reformulation; rolling-state compression 13.1 Combinatorics & number theory modular combinatorics; coprimality reasoning; binomial simplication 9.5 Graph / data structure / set operations graph restructuring; set-intersection reformulation; incremental search 3.5 + 12 . 52 points on C++ and + 8 . 35 points on Python for Qwen3- Coder-30B- A3B-Instruct. T ogether , these r esults suggest that the diagnosis–retrieval–composition pipeline is eective in both lan- guage settings rather than being tie d to a single programming language. Overall, the results suggest that ESkill remains eectiv e across languages with dierent syntax, idioms, and performance charac- teristics. The improv ements are particularly pronounced on C++, where the margin on OPT @8 is substantially larger , suggesting that the framework is especially eective in the more optimization- sensitive C++ setting. Answer to RQ3: ESkill generalizes across Python and C++ at the framework level. It achieves the best OPT @1 and OPT @8 in both languages, with particularly strong gains on C++ of up to + 12 . 36 points over the strongest non-ESkill baseline. Its primary advantage across languages is more reliable optimization discovery under an execution-free candidate budget. 5.4 RQ4: Analysis of Learned Skills Finally , we analyze the learned skill librar y to b etter understand how ESkill improv es program eciency . A key advantage of ESkill over black-box prompting methods is that its optimization behavior is mediated by e xplicit, reusable skills. Each skill describes a recognizable optimization idea, such as constant-factor cleanup, replacing arithmetic loops with closed forms, or simplifying small-parameter combinatorial computation. This makes it possible to inspect both what kinds of optimization knowledge are learned and how that knowledge is combined during inference. For clarity , we use two analysis units. A candidate-skill pair denotes one skill attached to one generated candidate. A problem- bundle observation denotes one distinct combination of problem and retrieved skill set after merging candidates that share the same bundle on the same problem. Across the two backbones studied here, this yields 7,260 valid candidate-runtime measurements, 16,412 candidate-skill pairs, and 3,134 problem-bundle observations. EiSkill: Agent Skill Based Automated Code Eiciency Optimization Conference’17, July 2017, W ashington, DC, USA T o make the distribution easier to interpret, we group the 29 operator skills into ve semantic families accor ding to their opti- mization intent. T able 3 summarizes the resulting library structure. The dominant family is implementation and constant-factor opti- mization, which accounts for 54.2% of all candidate-skill pairs and appears in 623 problems. Algebraic reformulation and dynamic- programming compression also appear frequently , indicating that the learned library is not limite d to local cleanups. At the same time, combinatorics and number-theoretic skills account for only 9.5% of usage but are associated with larger average gains, suggesting a stable backbone of broadly useful skills together with a smaller set of high-impact specialized op erators. The usage distribution is concentrated but not collapsed. Al- though the library contains 29 skills, the entropy of the empirical usage distribution corresponds to only about 10–13 eectively ac- tive skills. Moreov er , the top-5 skills account for 68.3% of all skill usages under GPT -5-mini and 70.0% under Qwen3-Coder-30B-A3B- Instruct. W e therefore interpret the library as a compact core plus a specialized tail, rather than as a uniformly distributed or obviously redundant inventory . The bundle-level results further suggest that these skills are composed coherently at inference time. For example, the recur- ring bundle that combines constant-factor cleanup, right-sized data structures, and cheaper arithmetic appears in 133 problem–bundle observations across the two backb ones (101 under GPT -5-mini and 32 under Qwen3-Coder-30B-A3B-Instruct) and achieves a mean improvement of 0.341. Similar recurring bundles also arise for dynamic-programming compression and combinatorial simpli- cation, suggesting that ESkill do es not rely on isolated heuristics, but on reusable optimization recipes. Overall, these observations suggest that the learned skill library is both inspectable and practically useful. It captures reusable op- timization mechanisms at a human-readable lev el, while still sup- porting exible composition across pr oblems and model backbones. Answer to RQ4: The learne d skill library is compact, diverse, and operationally meaningful. It contains 29 operator skills, with the top-5 accounting for 68.0–70.0% of usage, and organizes them into coherent optimization families and recurring skill bundles. These results indicate that ESkill impr oves eciency through explicit, reusable optimization mechanisms rather than opaque black-box rewriting. 6 Discussion 6.1 Why Do es Skill-Base d Optimization W ork? ESkill improves inference-time optimization by shifting the level at which reuse occurs. Conventional pr ompting and exemplar re- trieval primarily reuse surface-level patterns in programs, whereas ESkill externalizes recurring slow-to-fast transformations as e x- plicit Operator Skills . As a result, the reusable unit is a transforma- tion mechanism rather than an instance. This supports generaliza- tion across programs that exhibit similar performance bottlenecks but dier syntactically , allowing the model to reason about how to restructure computation rather than what to imitate. Empirically , this is reected in the compact skill library (29 Operator Skills), where the top-5 skills account for only 68.0–69.7% of usage, sug- gesting a stable and compositional set of reusable transformations. A second key factor is compositionality . Many performance issues require coordinated changes across multiple aspects of a program, such as state representation, transition structure, and constant-factor optimization. ESkill explicitly supp orts this through its diagnosis–retrieval–planning pipeline, which identies bottle- neck classes, retrieves multiple relevant skills, and composes them into candidate optimization plans. This structured exploration con- strains the search space while preserving diversity . For example , the best-performing candidate in Section 6.2 combines sev eral transfor- mations, including state reformulation, rolling-state compr ession, and compact data representations, replacing a tuple-based dynamic program with a bitset formulation and improving eciency . Third, the skill library acts as a structured prior over candidate generation. By conditioning generation on a small set of retrieved transformation mechanisms, the model is guide d toward mor e tar- geted and empirically grounded rewrites. This is particularly im- portant in our setting, where optimization is p erformed without execution-time fe edback. This mechanism-level prior eectively regularizes generation, which may help explain why ESkill con- sistently outperforms direct prompting under the same backbone model. 6.2 Case Study W e present a representativ e success case on chmax-rush! . The task applies 𝑄 prex/sux assignments to an initially zero array and asks for the number of valid choices of left or right op erations. An operation is invalid if it overwrites a larger existing value. This case is informative because the bottleneck in the input program is ex- plicit and the dier ence in optimization behavior is straightforward to inspect. Input program. The original solution determines validity by scan- ning subsequent op erations and enforcing pairwise constraints, which yields an O ( 𝑄 2 ) b ottleneck. ops = [ tuple ( map ( int , input ().split())) for _ in range (Q)] req = [0] * Q for i, (p1, v1) in enumerate (ops): for j in range (i + 1, Q): p2, v2 = ops[j] if v1 <= v2: continue if p1 == p2: print (0); exit() if p1 < p2: if req[i] == 2 or req[j] == 1: print (0); exit() req[i], req[j] = 1, 2 else : if req[i] == 1 or req[j] == 2: print (0); exit() req[i], req[j] = 2, 1 print ( pow (2, req.count(0), 998244353)) Prompting-based candidate. A standard prompting baseline mainly renes the implementation, but preserves the same pairwise rea- soning pattern. P, V = [...], [...] req = [0] * Q for i in range (Q): for j in range (i + 1, Q): if V[i] <= V[j]: continue if P[i] == P[j]: print (0); return if P[i] < P[j]: if req[i] == 2 or req[j] == 1: print (0); return req[i], req[j] = 1, 2 else : if req[i] == 1 or req[j] == 2: print (0); return Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu req[i], req[j] = 2, 1 print ( pow (2, req.count(0), 998244353)) ESkill candidate. In contrast, ESkill changes the underlying computational formulation. It replaces repeated pairwise checks with aggregated masks over value ranks and positions, and then resolves constraints through mask intersections. # build aggregated masks ranks = compress(V) mask_rank_lt = build_rank_masks(ranks) mask_pos_lt, mask_pos_gt, mask_pos_eq = build_pos_masks(P, N) # resolve constraints full = (1 << Q) - 1 left_req = right_req = 0 for i in range (Q): later = full ^ ((1 << (i + 1)) - 1) cand = later & mask_rank_lt[ranks[i]] if not cand: continue if cand & mask_pos_eq[P[i]]: print (0); return greater = cand & mask_pos_gt[P[i]] less = cand & mask_pos_lt[P[i]] if greater and less: print (0); return left_req |= greater | (1 << i if less else 0) right_req |= less | (1 << i if greater else 0) # ... (omitted due to space limit) The comparison indicates that the pr ompting-based candidate preserves the original pair wise-check structure and therefore re- mains O ( 𝑄 2 ) , whereas ESkill replaces repeated pairwise com- parisons with aggregated mask operations, reducing the overall complexity to O ( 𝑄 + 𝑁 ) , where 𝑁 is the array length. The improv e- ment therefore comes from algorithmic r eformulation rather than local implementation cleanup. Runtime comparison. For the exact examples shown above , the prompting-based candidate runs in 3838.11 ms , whereas the E- iSkill candidate runs in 1434.23 ms , yielding a 2.68 × improvement on eciency . Although this e xample is not intended to demonstrate the largest absolute gain, it highlights a qualitative dierence in optimization behavior: ESkill produces a non-trivial algorithmic rewrite that is not obtained by the pr ompting-based baseline in this example. 6.3 Threats to V alidity W e identify the following threats to the validity of our study: External V alidity . Our evaluation is conducted on competitive- programming benchmarks. These benchmarks provide clean slow/- fast program pairs and controlled eciency measurements, but they do not fully capture the complexity of real-world software systems. In practical settings, optimization can depend on factors such as external libraries, compiler behavior , hardware environments, and cross-le context. Therefor e, our ndings should be interpreted as evidence for the eectiveness of skill-based optimization in a con- trolled code-eciency setting, rather than as a direct claim that the same performance will transfer unchanged to all software domains. Internal V alidity . The reported results may be inuenced by design choices in both skill construction and evaluation. The qual- ity of the mined skill library depends on LLM-generated traces, clustering decisions, and ltering thresholds, all of which can aect downstream optimization performance. Rep orted performance may also vary with the candidate budget, ranking setup, and execution environment. T o mitigate this threat, we apply the same genera- tion budget, evaluation protocol, and execution harness across all compared methods. Construct V alidity . W e evaluate eectiveness using the opti- mization success rate under a xed execution-free budget. This metric aligns with the goal of generating correct and more ecient code, but it does not capture other practical dimensions such as read- ability , maintainability , or broader resource trade-os. Moreov er , because both the skill-mining corpus and the evaluation b enchmark come from the competitive-pr ogramming domain, some distribu- tional similarity may remain, ev en though we explicitly che cked for direct task overlap and found none. Our conclusions are also base d on tw o backbones and two programming languages, so broader val- idation across additional models, languages, and software domains is still needed. 7 Related W ork Large language models have demonstrated strong capabilities across diverse software engineering tasks such as code generation [ 4 , 12 , 41 , 42 ], program repair [ 20 , 46 ], testing [ 2 , 7 ], and many other tasks [ 13 , 14 , 34 , 40 , 45 ]. Among these, code eciency optimization— transforming functionally correct but slow co de into faster e quivalents— has emerged as a distinct challenge, as recent benchmarks consis- tently rev eal a substantial gap between LLM-generated code and human-expert solutions [9, 28, 36, 43]. In contrast to classical rule-based or search-based techniques [ 21 , 29 ], recent work leverages LLMs for source-level eciency im- provement. These methods dier in how they represent and de- liver optimization knowledge —reusable patterns, transformations, or high-level strategies that impro ve runtime eciency while pre- serving program semantics. Prompting-based approaches ask models to rewrite code without supplying additional knowledge. PIE’s zero-shot GPT -4 baseline achiev es roughly 1 . 3 × speedup, far below the 6 . 86 × of PIE-SFT under identical conditions [ 44 ]. ECO [ 23 ] further nds that chain-of-thought instructions yield no signicant gain—suggesting the bottleneck may lie in the lack of explicit optimization knowledge rather than reasoning capac- ity alone. Retrieval-augmented methods supply knowledge at inference time: PIE-RA G [ 44 ] retrieves similar slo w–fast pairs as few-shot examples, and SBLLM [ 10 ] combines evolutionary search with adaptive retrieval. How ever , such knowledge is typically tied to concrete co de instances rather than explicitly abstracted into transferable principles, so eectiveness is bounded by corpus cov- erage and instance similarity . Fine-tuning internalizes knowledge into model weights. PIE-SFT [ 44 ], Supersonic [ 6 ], and ECo der [ 17 ] consistently outperform retrieval-only methods, yet the y encode knowledge in language-specic parameters, which may limit cross- language transferability . Perf Coder [ 49 ] comes closest to explicit knowledge representation by training a dedicated Planner to gen- erate natural-language strategies; however , strategies are produced online without explicit oine curation, the approach relies on a ded- icated Planner model, and it is primarily evaluated on C++. RL and self-evolution methods such as Afterburner [ 8 ], ELearner [ 16 ], and CSE [ 15 ] leverage execution feedback to iterativ ely rene code, but every step r equires a live execution environment, making them unsuitable for latency-sensitive or compute-restricted settings. EiSkill: Agent Skill Based Automated Code Eiciency Optimization Conference’17, July 2017, W ashington, DC, USA Across these paradigms, optimization knowledge remains either strictly implicit—buried in weights or narrow retrieval instances—or fundamentally runtime-dependent. Our approach instead addresses this limitation by explicitly distilling optimization strategies into a reusable skill library designed to be language-agnostic: each skill encodes a high-level algorithmic principle validated oine via execution feedback, enabling single-pass inference that requires neither live execution environments nor deployment-time model training. 8 Conclusion In this paper , we presented ESkill, a two-stage framework for code-eciency optimization that explicitly models reusable opti- mization knowledge as agent skills. By separating oine skill min- ing from execution-free skill-guided optimization, ESkill enables LLM-based agents to reuse mechanism-level optimization knowl- edge instead of relying on one-shot rewriting or instance-level exemplar matching. Experimental results on EBench-X show that ESkill consistently improves optimization success across GPT - 5-mini and Q wen3-Coder-30B- A3B-Instruct, and in b oth Python and C++, achieving the best OPT @1 and OPT @8 in all settings. These results indicate that ESkill is more likely to produce at least one correct and eciency-improving candidate within a xed execution-free budget. Our analyses further show that these gains arise from the in- teraction of structured diagnosis, relevant skill retrie val, and plan diversity , rather than from any single component. The learned skill library captures a compact yet diverse set of reusable optimization operators, including both broadly applicable constant-factor im- provements and higher-impact spe cialized transformations, and remains eective b eyond the primary Python setting, extending to C++, where the margins in optimization success are often even larger . Taken together , these ndings suggest that explicit skill reuse is a practical basis for execution-free code optimization. A natural direction for future work is to extend ESkill to additional lan- guages and broader software-engineering optimization scenarios, while improving the quality , scope, and novelty of the skill library . More broadly , our results indicate that the mined skills can serve as a reusable optimization toolbox for coding agents, which may support the integration of optimization knowledge into real-world development worko ws. Data A vailability Statement All data and artifacts are publicly available at https://doi.org/10. 5281/zenodo.19249527. References [1] Anthropic. 2026. The Complete Guide to Building Skills for Claude. https://resources.anthropic.com/hubfs/The- Complete- Guide- to- Building- Skill- for- Claude.pdf . Accessed: 2026-03-18. [2] Pengyu Chang, Yixiong Fang, Silin Chen, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. T est vs Mutant: Adversarial LLM Agents for Robust Unit T est Genera- tion. arXiv preprint arXiv:2602.08146 (2026). [3] Guoxin Chen, Zhong Zhang, Xin Cong, Fangda Guo, Y esai Wu, Y ankai Lin, W enzheng Feng, and Y asheng W ang. 2025. Learning Evolving T ools for Large Language Models. In The Thirteenth International Conference on Learning Repre- sentations . https://openreview .net/forum?id=wtrDLMF U9v [4] Mark Chen, Jerry T worek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021). [5] Silin Chen, Shaoxin Lin, Yuling Shi, Heng Lian, Xiaodong Gu, Longfei Y un, Dong Chen, Lin Cao, Jiyang Liu, Nu Xia, et al . 2025. Swe-exp: Experience-driven software issue resolution. arXiv preprint arXiv:2507.23361 (2025). [6] Zimin Chen, Sen Fang, and Martin Monperrus. 2023. Supersonic: Learning to Generate Source Code Optimizations in C/C++. CoRR abs/2309.14846 (2023). [7] Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, and Lingxiao Jiang. 2026. Rethinking the V alue of Agent-Generated T ests for LLM- Based Software Engineering Agents. arXiv preprint arXiv:2602.07900 (2026). [8] Mingzhe Du, Anh Tuan Luu, Yue Liu, Y uhao QING, Dong HU ANG, Xinyi He, Qian Liu, Zejun MA, and See-Kiong Ng. 2025. Afterburner: Reinforcement Learning Facilitates Self-Improving Code Eciency Optimization. In ICML 2025 W orkshop on Programmatic Representations for Agent Learning . https://openreview .net/ forum?id=RTI8LsGjSi [9] Mingzhe Du, Luu A Tuan, Bin Ji, Qian Liu, and See-Kiong Ng. 2024. Mercury: A code eciency benchmark for code large language models. Advances in Neural Information Processing Systems 37 (2024), 16601–16622. [10] Shuzheng Gao, Cuiyun Gao, W enchao Gu, and Michael Lyu. 2024. Search-based llms for code optimization. arXiv preprint arXiv:2408.12159 (2024). [11] Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sun- daresan, and Chen W u. 2022. De epDev-PERF: a deep learning-based approach for improving software performance. In ESEC/FSE . A CM, 948–958. [12] Chao Hu, W enhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. In Line with Context: Repository-Level Code Generation via Context Inlining. In The ACM International Conference on the Foundations of Software Engineering (FSE) . [13] Minghao Hu, Junzhe W ang, W eisen Zhao, Qiang Zeng, and Lannan Luo. 2025. FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-A dapter Architecture. (2025). [14] Minghao Hu, Qiang Zeng, and Lannan Luo. 2026. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training. arXiv:2603.21058 [cs.CR] [15] T u Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, W enqi Jiang, Yuqi Fang, Sen Hu, et al . 2026. Controlled self-evolution for algorithmic code optimization. arXiv preprint (2026). [16] Dong Huang, Jianb o Dai, Han W eng, Puzhen Wu, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie Zhang. 2024. Elearner: Enhancing eciency of generated code via self-optimization. Advances in Neural Information Processing Systems 37 (2024), 84482–84522. [17] Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han W eng, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie M Zhang. 2024. Ecoder: Enhancing code generation in large language models through eciency-aware ne-tuning. arXiv preprint arXiv:2410.10209 (2024). [18] ISO/IEC 25010. 2011. Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE). [19] Gautier Izacard, Patrick S. H. Lewis, et al . 2023. Atlas: Few-shot Learning with Retrieval Augmented Language Models. J. Mach. Learn. Res. 24 (2023), 251:1– 251:43. [20] Carlos E Jimenez, John Y ang, Alexander W ettig, Shunyu Y ao, K exin Pei, Or Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? . In ICLR . [21] Guoliang Jin, Linhai Song, Xiaoming Shi, Joel Scherpelz, and Shan Lu. 2012. Understanding and dete cting real-world performance bugs. ACM SIGPLAN Notices 47, 6 (2012), 77–88. [22] Milan Jovic, Andrea A damoli, and Matthias Hauswirth. 2011. Catch Me If Y ou Can: Performance Bug Detection in the Wild. In OOPSLA . ACM, 155–170. [23] Su-Hyeon Kim, Joonghyuk Hahn, Sooyoung Cha, and Y o-Sub Han. 2025. ECO: Enhanced Code Optimization via Performance- A ware Prompting for Code-LLMs. arXiv preprint arXiv:2510.10517 (2025). [24] Rahul Krishna, Muhammad Shaque Iqbal, Mohammad Amin Javidian, Baishakhi Ray , and Pooyan Jamshidi. 2020. CADET: Debugging and Fixing Miscongura- tions using Counterfactual Reasoning. arXiv preprint arXiv:2010.06061 (2020). [25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler , Mike Lewis, W en-tau Yih, Tim Rocktäschel, et al . 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33 (2020), 9459–9474. [26] Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin W ang, Yantao Jia, T ao Huang, and Qianxiang W ang. 2025. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv preprint arXiv:2507.23348 (2025). [27] Xiangyi Li, W enbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Y ubo Li, Bingran Y ou, Haotian Shen, Jiankai Sun, Shuyi W ang, Qunhong Zeng, Di W ang, Xuandong Zhao, Yuanli W ang, Roey Ben Chaim, Zonglin Di, Yip eng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Y ueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei W ang, Runhui W ang, Tianyu W ang, W engao Y e, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, and Han chung Lee. 2026. SkillsBench: Benchmarking How W ell Agent Skills W ork Across Diverse Tasks. arXiv:2602.12670 [cs.AI] https://arxiv .org/abs/2602.12670 Conference’17, July 2017, W ashington, DC, USA Zimu W ang, Y uling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng W an, and Xiaodong Gu [28] Jiawei Liu, Songrun Xie, Junhao W ang, Y uxiang W ei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating Language Models for Ecient Code Generation. arXiv preprint arXiv:2408.06450 (2024). [29] Adrian Nistor , Po-Chun Chang, Cosmin Radoi, and Shan Lu. 2015. Caramel: Detecting and xing performance problems that have non-intrusive xes. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering , V ol. 1. IEEE, 902–912. [30] Adrian Nistor , Tian Jiang, and Lin Tan. 2013. Discovering, Reporting, and Fixing Performance Bugs. In MSR . IEEE Computer Society , 237–246. [31] OpenAI. 2025. GPT -5 mini. https://developers.openai.com/api/docs/mo dels/gpt- 5- mini. [32] OpenAI. 2025. GPT -5.1: A smarter , more conversational ChatGPT. https://op enai. com/index/gpt- 5- 1/. Accessed: 2026-03-17. [33] OpenAI. 2026. From Model to Agent: Equipping the Resp onses API with a Com- puter Environment. https://openai.com/index/equip- responses- api- computer- environment/. Accessed: 2026-03-18. [34] W eihan Peng, Yuling Shi, Yuhang W ang, Xinyun Zhang, Beijun Shen, and Xi- aodong Gu. 2025. Swe-qa: Can language models answer repository-level code questions? arXiv preprint arXiv:2509.14635 (2025). [35] Y un Peng, Jun W an, Yichen Li, and Xiaoxue Ren. 2025. COFFE: A Code Eciency Benchmark for Code Generation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 242–265. [36] Y uhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, et al . 2025. EBench-X: A Multi-Language Benchmark for Measuring Eciency of LLM- Generated Code. arXiv preprint arXiv:2505.13004 (2025). [37] Qwen. 2025. Qwen/Qwen3-Coder-30B-A3B-Instruct. https://huggingface.co/ Qwen/Qwen3- Coder- 30B- A3B- Instruct. [38] Timo Schick, Jane Dwivedi- Y u, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer , Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can T each Themselves to Use T ools. arXiv:2302.04761 [cs.CL] https://arxiv .org/abs/2302.04761 [39] Sentence- Transformers. 2020. all-MiniLM-L6-v2: Sentence Embeddings using MiniLM. https://huggingface.co/sentence- transformers/all- MiniLM- L6- v2. Hug- ging Face model, accessed: 2026-03-17. [40] Y uling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2025. LongCodeZip: Compress Long Context for Code Language Models. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 141–153. [41] Y uling Shi, Songsong W ang, Chengcheng Wan, Min W ang, and Xiaodong Gu. 2024. From co de to correctness: Closing the last mile of co de generation with hierarchical debugging. arXiv preprint arXiv:2410.01215 (2024). [42] Y uling Shi, Chaoxiang Xie, Zhensu Sun, Y eheng Chen, Chenxu Zhang, Longfei Y un, Chengcheng W an, Hongyu Zhang, David Lo, and Xiao dong Gu. 2026. CodeOCR: On the Eectiveness of Vision Language Models in Code Under- standing. arXiv:2602.01785 [cs.CL] https://ar xiv .org/abs/2602.01785 [43] Y uling Shi, Hongyu Zhang, Chengcheng W an, and Xiaodong Gu. 2025. Between lines of code: Unraveling the distinct patterns of machine and human program- mers. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE) . IEEE, 1628–1639. [44] Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gar dner , Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Y azdanbakhsh. 2023. Learning Performance-Improving Co de Edits. arXiv preprint arXiv:2302.07867 (2023). [45] Chaofan Wang, Tingrui Y u, Chen Xie, Jie W ang, Dong Chen, W enrui Zhang, Y uling Shi, Xiao dong Gu, and Beijun Shen. 2026. EV OC2RUST: A Skeleton- guided Framework for Project-Lev el C-to-Rust Translation. In 2026 IEEE/ACM 48th International Conference on Software Engineering, SEIP Track (ICSE-SEIP) . [46] Y uhang W ang, Y uling Shi, Mo Y ang, Rongrui Zhang, Shilin He , Heng Lian, Y uting Chen, Siyu Ye , Kai Cai, and Xiaodong Gu. 2026. SWE-Pruner: Self-Adaptiv e Context Pruning for Coding Agents. arXiv preprint arXiv:2601.16746 (2026). [47] Jason W ei, Xuezhi W ang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al . 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837. [48] Y ue Wu, Minghao Han, Ruiyin Li, Peng Liang, Amjed Tahir , Zengyang Li, Qiong Feng, and Mojtaba Shahin. 2025. FasterPy: An LLM-based Code Execution Eciency Optimization Framework. arXiv preprint arXiv:2512.22827 (2025). [49] Jiuding Y ang, Shengyao Lu, Hongxuan Liu, Shayan Shirahmad Gale Bagi, Zahra Fazel, T omasz Czajkowski, and Di Niu. 2025. Perf Coder: Large Language Models for Interpretable Co de Performance Optimization. arXiv preprint (2025). [50] Shunyu Y ao, Jerey Zhao , Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReA ct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://ar xiv .org/abs/2210.03629 [51] Yilin Zhang, Xinran Zhao, Zora Zhiruo W ang, Chenyang Y ang, Jiayi W ei, and T ongshuang Wu. 2025. cast: Enhancing co de retrieval-augmented generation with structural chunking via abstract syntax tree. arXiv preprint (2025).

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment